
Transcript of arXiv:2104.08028v1 [cs.LG] 16 Apr 2021


Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction

Asahi Ushio and Federico Liberatore and Jose Camacho-Collados
School of Computer Science and Informatics

Cardiff University, United Kingdom
{UshioA,LiberatoreF,CamachoColladosJ}@cardiff.ac.uk

Abstract

Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as a default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings, we discuss and devise some suggestions for practitioners. We release our code at https://github.com/asahi417/kex.

1 Introduction

Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora. Despite the simplicity of the task, it still poses practical problems, and researchers often resort to simple but reliable techniques such as tf-idf (Jones, 1972). In turn, term weighting schemes such as tf-idf paved the way for developing large-scale Information Retrieval (IR) systems (Ramos et al., 2003; Wu et al., 2008). Its simple formulation is still widely used nowadays, not only for keyword extraction but also as an important component in IR (Jabri et al., 2018; Marcos-Pablos and García-Peñalvo, 2020) and Natural Language Processing (NLP) tasks (Riedel et al., 2017; Arroyo-Fernández et al., 2019).

While there exist supervised and neural techniques (Lahiri et al., 2017; Xiong et al., 2019; Sun et al., 2020) as well as ensembles of unsupervised methods (Campos et al., 2020; Tang et al., 2020) that can provide competitive performance, in this paper we go back to the basics and analyze in detail the single components of unsupervised methods for keyword extraction. In fact, it is still common to rely on unsupervised methods for keyword extraction given their versatility and the lack of training sets in specialized domains.

Unsupervised keyword extraction methods can be split into statistical and graph-based (see Section 2). However, despite their wide adoption in NLP and IR (even outside keyword extraction), there has not been a large-scale comparison of existing techniques to date. The advantages and disadvantages of statistical methods in contrast to graph-based methods, for example, have remained largely unexplored.

In order to fill this gap, in this paper we perform an extensive analysis of single unsupervised keyword extraction techniques in a wide range of settings and datasets. To the best of our knowledge, this is the first large-scale empirical evaluation performed across base statistical and graph-based keyword extraction methods. Our analysis sheds light on some largely unknown properties of statistical methods. For instance, our experiments show that a statistical weighting scheme based on the hypergeometric distribution, such as lexical specificity (Lafon, 1980), can perform at least as well as or better than tf-idf (Jones, 1972), while having additional advantages with respect to flexibility and efficiency. As for the graph-based methods, they can be more reliable than statistical methods without being considerably slower in practice. In fact, a graph-based method initialized with tf-idf or lexical specificity performs best overall.

The remainder of the paper is organized as follows. First, we provide an overview of standard statistical and graph-based keyword extraction techniques (Section 2). Then, we present our experimental setting and overall results (Section 3).


To better understand the results, we provide an analysis of some relevant features (Section 4) and a focused discussion about the advantages and disadvantages of each kind of model (Section 5). Finally, in Section 6 we present our concluding remarks and possible avenues for future work.

2 Keyword Extraction

Given a document with m words [w1 · · · wm], keyword extraction is the task of finding n noun phrases that comprehensively represent the document. As each such phrase consists of contiguous words in the document, the task can be seen as an ordinary ranking problem over all candidate phrases appearing in the document. A typical keyword extraction pipeline is thus implemented by first constructing a set of candidate phrases Pd for a target document d, and then computing importance scores for all individual words in d.1

Finally, the top-n phrases {yj | j = 1 . . . n} ⊂ Pd in terms of the aggregated score are selected as the prediction (Mihalcea and Tarau, 2004). Figure 1 shows an overview of the overarching methodology for unsupervised keyword extraction.
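To make the pipeline concrete, the following is a minimal sketch of this final ranking step (the function and argument names are hypothetical; `word_scores` stands for the output of any of the word-level weighting schemes discussed below, and multi-token phrase scores are averaged over their tokens, as in footnote 1):

```python
from typing import Dict, List

def extract_keyphrases(candidates: List[str],
                       word_scores: Dict[str, float],
                       n: int = 10) -> List[str]:
    """Rank candidate phrases by the average score of their words
    and return the top-n as the predicted keyphrases."""
    def phrase_score(phrase: str) -> float:
        words = phrase.split()
        # Multi-token phrases are scored by the average of their word scores.
        return sum(word_scores.get(w, 0.0) for w in words) / len(words)

    return sorted(candidates, key=phrase_score, reverse=True)[:n]
```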

To compute word-level scores, we can identify two main lines of research: statistical and graph-based models. Although there are a few contributions that focus on training supervised models for keyword extraction (Witten et al., 2005), due to the absence of large labeled data and to domain-specificity, most efforts are still unsupervised, which is the focus of this paper.

2.1 Statistical Models

A statistical model attains an importance score based on word-level statistics or surface features, such as word frequency or word length.2 A simple keyword extraction method could be to simply use term frequency (tf) as a scoring function for each word, which tends to work reasonably well. However, this simple measure may miss important information such as the relative importance of a given word in a corpus. For instance, prepositions such as in or articles such as the tend to be highly frequent in a text corpus. However, they barely ever represent a keyword in a given text document.

1 In the case of multi-token candidate phrases, this score is averaged among its tokens.

2 We are aware that statistical may not be strictly accurate to refer to tf-idf or purely frequency-based models, but in this case we follow previous conventions by grouping all these methods based on word-level frequency statistics as statistical (Aizawa, 2003).

Figure 1: An overview of the keyword extraction pipeline.

To this end, different variants have been proposed, which we summarize in two main alternatives: tf-idf (Section 2.1.1) and lexical specificity (Section 2.1.2).

2.1.1 TF-IDF

As an extension of tf, term frequency–inverse document frequency (tf-idf) (Jones, 1972) is one of the most popular and effective methods used for statistical keyword extraction (El-Beltagy and Rafea, 2009), as well as still being an important component in modern information retrieval applications (Marcos-Pablos and García-Peñalvo, 2020; Guu et al., 2020). Given a set of documents D and a word w from a document d ∈ D, tf-idf is defined as the product of its term frequency and its inverse document frequency3, as

$$s_{\mathrm{tfidf}}(w|d) = \mathrm{tf}(w|d) \cdot \log_2 \frac{|D|}{\mathrm{df}(w|D)} \quad (1)$$

where we define |·| as the number of elements in a set, tf(w|d) as the frequency of w in d, and df(w|D) as the document frequency of w over the dataset D. In practice, tf(w|d) is often computed by counting the number of times that w occurs in d, while df(w|D) is computed as the number of documents in D that contain w.
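As a reference, a minimal sketch of this computation in plain Python (assuming pre-tokenized documents and that d belongs to D, so every word of d has a non-zero document frequency):

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_scores(document: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score every word of `document` with Eq. (1):
    tf(w|d) * log2(|D| / df(w|D))."""
    tf = Counter(document)            # tf(w|d): occurrences of w in d
    df: Counter = Counter()           # df(w|D): number of documents containing w
    for doc in corpus:
        df.update(set(doc))
    n_docs = len(corpus)              # |D|
    return {w: tf[w] * math.log2(n_docs / df[w]) for w in tf}
```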

3 While there are other formulations and normalization techniques for tf-idf (Paik, 2013), in this paper we focus on the traditional inverse document frequency formulation.


To give a few examples of statistical models based on tf-idf and its derivatives in a keyword extraction context, KP-Miner (El-Beltagy and Rafea, 2009) utilizes tf-idf, word length, and the absolute position of a word in a document to determine the importance score, while RAKE (Rose et al., 2010) uses the term degree, i.e., the number of different words a word co-occurs with, divided by tf. Recently, YAKE (Campos et al., 2020) established strong baselines on public datasets by combining various statistical features including casing, sentence position, term/sentence frequency, and term dispersion. In this paper, however, we focus on the vanilla implementations of term frequency and tf-idf.

2.1.2 Lexical specificity

In this section we describe lexical specificity (Lafon, 1980), a statistical metric to extract relevant words from a subcorpus using a larger corpus as reference. In short, lexical specificity extracts a set of the most representative words for a given text based on the hypergeometric distribution. The hypergeometric distribution represents the discrete probability of k successes in n draws, without replacement. In the case of lexical specificity, k represents word frequency and n the size of a corpus. While not as widely adopted as tf-idf, lexical specificity has been used in similar term extraction tasks (Drouin, 2003), but also in textual data analysis (Lebart et al., 1998), domain-based disambiguation (Billami et al., 2014), and as a weighting scheme for building vector representations of concepts and entities (Camacho-Collados et al., 2016) or sense embeddings (Scarlini et al., 2020) in NLP.

Formally, the lexical specificity of a word w in a document d is defined as

$$s_{\mathrm{spec}}(w|d) = -\log_{10} \sum_{l=f}^{F} P_{hg}(x = l;\, m_d, M, f, F) \quad (2)$$

where m_d is the total number of words in d and P_hg(x = l; m_d, M, f, F) represents the probability of a given word appearing exactly l times in d according to the hypergeometric distribution parameterised with m_d, M, f, and F, which are defined as follows:

$$M = \sum_{d' \in D} m_{d'}, \qquad f = \mathrm{tf}(w|d), \qquad F = \sum_{d' \in D} \mathrm{tf}(w|d')$$

Note also that, unlike for tf-idf, lexical specificity does not require a perfect partition of the documents of D (the reference corpus). This also opens up other possibilities, such as using larger corpora as reference. The implementation of (2) can also be computed efficiently as described in (Camacho-Collados et al., 2016), and is relatively faster than tf-idf according to our empirical speed test in Section 3.5.
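Note that the sum over l = f . . . F in Eq. (2) is the upper tail P(X ≥ f) of a hypergeometric variable, so it can be computed with a survival function. The following is a minimal sketch using SciPy (assuming pre-tokenized input and a reference corpus that contains the document; this is not the optimized implementation of Camacho-Collados et al. (2016)):

```python
import math
from collections import Counter
from typing import Dict, List
from scipy.stats import hypergeom

def lexical_specificity(document: List[str], reference: List[str]) -> Dict[str, float]:
    """Score every word of `document` with Eq. (2):
    -log10 P(X >= f), with X ~ Hypergeometric(M, F, m_d)."""
    tf_doc = Counter(document)       # f = tf(w|d)
    tf_ref = Counter(reference)      # F = total frequency of w in the reference
    m_d, M = len(document), len(reference)
    scores = {}
    for w, f in tf_doc.items():
        F = tf_ref[w]
        p = hypergeom.sf(f - 1, M, F, m_d)        # sf(f - 1) = P(X >= f)
        scores[w] = -math.log10(max(p, 1e-300))   # guard against log(0)
    return scores
```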

2.2 Graph-based Methods

The basic idea behind graph-based keyword extraction is to identify the most relevant words from a graph constructed from a text document, where words are nodes and their connections are measured in different ways. For this, PageRank (Page et al., 1999) and its derivatives have proven to be the most successful graph-based algorithms for extracting relevant keywords (Mihalcea and Tarau, 2004; Wan and Xiao, 2008a; Florescu and Caragea, 2017; Sterckx et al., 2015; Bougouin et al., 2013).

Formally, let G = (V, E) be a graph, where V and E are its associated sets of vertices and edges. In a typical word graph construction on a document d (Mihalcea and Tarau, 2004), V is defined as the set of all unique words in d and each edge e_{w_i,w_j} ∈ E represents the strength of the connection between two words w_i, w_j ∈ V. Then, a Markov chain from w_j to w_i on a word graph can be defined as

$$p(w_i|w_j) = (1-\lambda)\frac{e_{w_i,w_j}}{\sum_{w_k \in V_i} e_{w_i,w_k}} + \lambda\, p_b(w_i) \quad (3)$$

where V_i ⊂ V is the set of incoming nodes to w_i, p_b(·) is a prior probability distribution over V, and 0 ≤ λ ≤ 1 is a parameter controlling the effect of p_b(·). The prior term p_b(·), which is originally a uniform distribution, is introduced to enable transitions between any two words even if there is no direct connection between them, and this probabilistic model (3) is called the random surfer model (Page et al., 1999). Once we have a word graph, PageRank is applied on it to estimate a probability p(w) for every word w ∈ V, which is used as the importance score for the word.
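To illustrate, here is a minimal NumPy sketch of the random surfer model: a power iteration over a transition matrix that mixes normalized edge weights with the prior p_b, in the spirit of Eq. (3) (the edge normalization follows the standard PageRank convention of turning each column into a probability distribution):

```python
import numpy as np

def random_surfer_scores(E: np.ndarray, p_b: np.ndarray,
                         lam: float = 0.15, n_iter: int = 100) -> np.ndarray:
    """E[i, j] is the edge weight e_{w_i, w_j}; p_b is the prior over words.
    Returns an importance score p(w) for every word."""
    col_sums = E.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero on isolated nodes
    # P[i, j] = p(w_i | w_j): normalized edge weight mixed with the prior.
    P = (1.0 - lam) * (E / col_sums) + lam * p_b[:, None]
    p = np.full(len(p_b), 1.0 / len(p_b))  # start from a uniform distribution
    for _ in range(n_iter):
        p = P @ p                          # one step of the random surfer
    return p
```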

TextRank (Mihalcea and Tarau, 2004) uses an undirected graph and defines the edge weight as e_{w_i,w_j} = 1 if w_i and w_j co-occur within a window of l contiguous words in d, and e_{w_i,w_j} = 0 otherwise. SingleRank (Wan and Xiao, 2008a) extends TextRank by redefining the edge weight as the number of co-occurrences of w_i and w_j within the l-length sliding window, and ExpandRank (Wan and Xiao, 2008b) multiplies the weight by the cosine similarity of the tf-idf vectors of neighbouring documents.


To incorporate statistical prior knowledge into the estimation, recent works have proposed non-uniform distributions for p_b(·). In (Florescu and Caragea, 2017), the authors observe that keywords are likely to occur in the first few sentences of academic papers and propose PositionRank, in which p_b(·) is defined as the inverse of the absolute position of each word in a document. TopicalPageRank (TPR) (Jardine and Teufel, 2014; Sterckx et al., 2015) introduces a topic distribution inferred by Latent Dirichlet Allocation (LDA) as p_b(·), so that the estimation contains more semantic diversity across topics. TopicRank (Bougouin et al., 2013) clusters the candidates before running PageRank to group similar words together, and MultipartiteRank (Boudin, 2018) extends it by employing a multipartite graph for a better candidate selection within a cluster.

Finally, there are a few other works that directly run graph clustering (Liu et al., 2009; Grineva et al., 2009), but they use edges to connect clusters instead of words, with semantic relatedness as a weight. Although these techniques do a good job at capturing high-level semantics, the relatedness-based weights rely on external resources such as Wikipedia (Grineva et al., 2009), and thus add another layer of complexity in terms of generalization. For these reasons, they are excluded from this study.

3 Experimental setting and Results

In this section, we explain our experimental setting, including the datasets considered (Section 3.1), preprocessing operations (Section 3.2), comparison systems (Section 3.3), and evaluation metrics (Section 3.4), and present the main results (Section 3.5).

As mentioned throughout the paper, our experiments are focused on keyword extraction. In all experiments, we use Python and a 16-core Ubuntu machine equipped with a 3.8GHz i7 CPU and 64GiB of memory. Upon acceptance, we will release the code and data to reproduce all our experiments.

3.1 Datasets

To evaluate keyword extraction methods, we consider 15 different public datasets in English.4 Table 1 provides high-level statistics of each dataset, including length and number of keyphrases5 (both average and standard deviation).

4 All the datasets were fetched from a public data repository for keyword extraction datasets: https://github.com/LIAAD/KeywordExtractor-Datasets

Most datasets are comprised of full-length academic articles, but there are also datasets of abstracts or paragraphs from academic papers, as well as other formats such as news and technical reports. This is also reflected in their diverse sizes. While there are datasets with a relatively small average document size of fewer than 40 tokens, such as Inspec, kdd and www (based on abstracts) and SemEval2017 (paragraph-based), other datasets such as SemEval2010, citeulike180, wiki20 and Krapivin2009 have an average document size of over 800 tokens.

Each entry in a dataset consists of a source document and a set of gold keyphrases, where the source document is processed through the pipeline described in Section 3.2 and the gold keyphrase set is filtered to include only phrases that appear in the document's candidate set.

Domains. Inspec (Hulth, 2003), Krapivin2009 (Krapivin et al., 2009), SemEval2017 (Augenstein et al., 2017), kdd (Gollapalli and Caragea, 2014), www (Gollapalli and Caragea, 2014), and wiki20 (Medelyan and Witten, 2008) consist of documents from the computer science (CS) domain, while PubMed (Schutz et al., 2008) and Schutz2008 (Schutz et al., 2008) are from biomedicine (BM), and citeulike180 (Medelyan et al., 2009) is from bioinformatics (BI). Beyond academic domains, fao30 and fao780 (Medelyan and Witten, 2008) are taken from agricultural reports, and KPCrowd (Marujo et al., 2013), Nguyen2007 (Nguyen and Kan, 2007), and SemEval2010 (Kim et al., 2010) include a mixture of several domains such as finance and social science. Finally, theses1006 contains 100 full master and Ph.D. theses from the University of Waikato, New Zealand.

3.2 Preprocessing

Before running keyword extraction on each dataset, we apply standard text preprocessing operations. The documents are first tokenized into words by segtok7, a Python library for tokenization and sentence splitting. Then, each word is stemmed to its base form, for comparison purposes, with the Porter Stemmer from NLTK (Bird et al., 2009), a widely used Python library for text processing. Part-of-speech annotation is carried out using the NLTK tagger.

5 In this paper we use keyword and keyphrase almost interchangeably, as some datasets contain keyphrases comprising more than a single token.

6 https://github.com/zelandiya/keyword-extraction-datasets

7 https://pypi.org/project/segtok/


Dataset | Size | Domain | Type | Diversity | # NPs (avg/std) | # tokens (avg/std) | Vocab size (avg/std) | # keyphrases total (avg/std) | # keyphrases multi-word (avg/std)
KPCrowd | 500 | - | news | 2.3 | 77.4/62.0 | 447.3/476.7 | 197.1/140.6 | 16.5/12.0 | 3.7/3.8
Inspec | 2000 | CS | abstract | 1.8 | 27.0/12.3 | 137.8/66.6 | 76.2/28.6 | 5.8/3.5 | 4.8/3.2
Krapivin2009 | 2304 | CS | article | 8.4 | 814.5/252.1 | 9130.5/2524.4 | 1081.1/256.2 | 3.8/2.1 | 2.9/1.9
Nguyen2007 | 209 | - | article | 6.5 | 618.0/113.9 | 5931.4/1023.1 | 908.5/142.4 | 7.2/4.3 | 4.8/3.2
PubMed | 500 | BM | article | 5.6 | 566.3/196.5 | 4461.3/1626.4 | 799.8/223.7 | 5.7/2.7 | 1.5/1.3
Schutz2008 | 1231 | BM | article | 3.5 | 630.3/287.7 | 4200.7/2251.1 | 1217.2/468.4 | 28.5/10.3 | 10.1/4.9
SemEval2010 | 243 | CS | article | 8.0 | 897.8/207.7 | 9740.2/2443.4 | 1217.9/209.1 | 11.6/3.3 | 8.8/3.3
SemEval2017 | 493 | - | paragraph | 1.9 | 39.7/12.9 | 198.0/60.3 | 106.1/27.5 | 9.3/4.9 | 6.3/3.4
citeulike180 | 183 | BI | article | 4.7 | 822.3/173.0 | 5521.1/978.8 | 1171.1/202.9 | 7.8/3.4 | 1.1/1.0
fao30 | 30 | AG | article | 4.8 | 774.0/93.2 | 5437.8/927.5 | 1125.1/157.1 | 15.9/5.6 | 5.5/2.6
fao780 | 779 | AG | article | 5.1 | 776.3/147.2 | 5590.5/902.4 | 1087.3/210.3 | 4.2/2.3 | 1.6/1.3
theses100 | 100 | - | article | 4.8 | 728.4/131.3 | 5396.7/958.4 | 1134.0/192.3 | 2.4/1.5 | 0.8/0.8
kdd | 755 | CS | abstract | 1.7 | 16.0/17.0 | 81.9/93.0 | 48.1/45.7 | 0.7/0.9 | 0.6/0.8
wiki20 | 20 | CS | report | 6.6 | 816.7/322.4 | 7145.6/3609.8 | 1088.4/295.4 | 12.8/3.2 | 6.7/2.7
www | 1330 | CS | abstract | 1.7 | 17.9/16.5 | 90.6/89.1 | 52.7/43.3 | 0.9/1.0 | 0.5/0.7

Table 1: Dataset statistics, where size refers to the number of documents; diversity refers to a measure of vocabulary variety, computed as the total number of words divided by the number of unique words; number of noun phrases (NPs) refers to the candidate phrases extracted by our pipeline; number of tokens is the size of the documents; vocab size is the number of unique tokens; and number of keyphrases shows the statistics of the gold keyphrases, for which we report the total number of keyphrases as well as the number of keyphrases composed of more than one token (multi-word). Each avg/std pair reports the average and the standard deviation.


To select the candidate phrase set Pd, following the literature (Wan and Xiao, 2008b), we consider contiguous nouns in the document d that form a noun phrase satisfying the regular expression (ADJECTIVE)*(NOUN)+.8 We then filter the candidates with a stopword list taken from the official YAKE implementation9 (Campos et al., 2020). Finally, for the statistical methods and the graph-based methods based on them (i.e., LexRank and TFIDFRank), we compute prior statistics, including term frequency (tf), tf-idf, and LDA with Gensim (Rehurek and Sojka, 2010), within each dataset.
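As an illustration, a minimal sketch of this candidate selection with NLTK's regexp chunker (assuming the NLTK POS tagger models are available; stopword filtering and stemming are omitted for brevity):

```python
import nltk

# (ADJECTIVE)*(NOUN)+ over Penn Treebank tags (JJ* adjectives, NN* nouns).
CHUNKER = nltk.RegexpParser("NP: {<JJ.*>*<NN.*>+}")

def candidate_phrases(tokens):
    """Return the noun-phrase keyphrase candidates of a tokenized document."""
    tree = CHUNKER.parse(nltk.pos_tag(tokens))
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
```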

3.3 Comparison Models

In the following, we list the keyword extraction models that are considered in our experiments.

3.3.1 Statistical models

As statistical models, we include keyword extraction methods based on tf, tf-idf, and lexical specificity, referred to as TF, TFIDF, and LexSpec10, respectively.11

8 While the vast majority of keywords in the considered datasets follow this structure, there are a few cases of different part-of-speech tags as keywords, or where this simple formulation can miss a correct candidate. Nonetheless, our experimental setting is focused on comparing keyword extraction measures within the same preprocessing framework.

9 https://github.com/LIAAD/yake
10 Lexical specificity is implemented following the implementation in (Camacho-Collados et al., 2016).

Each model uses its statistics as a score for the individual words and then aggregates them to score the candidate phrases (see Section 2.1). We also add a heuristic baseline which takes the first n phrases as its prediction (FirstN).

3.3.2 Graph-based models

As graph-based models, we compare five distinct methods: TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008a), PositionRank (Florescu and Caragea, 2017), SingleTPR (Sterckx et al., 2015), and TopicRank (Bougouin et al., 2013). Additionally, we propose two extensions of SingleRank, which we call TFIDFRank and LexRank, where a word distribution computed by tf-idf or lexical specificity is used for p_b(·). Supposing that we have computed tf-idf (1) for a given dataset D, the prior distribution for TFIDFRank is defined as

$$p_b(w) = \frac{s_{\mathrm{tfidf}}(w|d)}{\sum_{w' \in V} s_{\mathrm{tfidf}}(w'|d)} \quad (4)$$

for w ∈ V. Likewise, LexRank relies on the precomputed lexical specificity prior (2), which defines the prior distribution for d as

$$p_b(w) = \frac{s_{\mathrm{spec}}(w|d)}{\sum_{w' \in V} s_{\mathrm{spec}}(w'|d)} \quad (5)$$

11 As mentioned in Section 2.1, we do not include YAKE (Campos et al., 2020), as our experiments are focused on analyzing single features on their own in a unified setting. YAKE utilizes a combination of various textual features, but the ensemble of different features into a single model is out of the scope of this paper.


All remaining specifications follow SingleRank's graph construction procedure. For implementations of graph operations such as PageRank and word graph construction, we use NetworkX (Hagberg et al., 2008), a graph analysis library in Python.
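For illustration, a minimal NetworkX sketch of this TFIDFRank variant: a SingleRank-style sliding-window co-occurrence graph with the normalized tf-idf scores of Eq. (4) as the personalization vector. It assumes `tfidf` maps each document word to a positive score; note that NetworkX mixes alpha · walk + (1 − alpha) · prior, so its damping factor alpha corresponds to 1 − λ in Eq. (3):

```python
from itertools import combinations
from typing import Dict, List
import networkx as nx

def tfidfrank(tokens: List[str], tfidf: Dict[str, float],
              window: int = 10, lam: float = 0.15) -> Dict[str, float]:
    """SingleRank-style word graph with a tf-idf prior (Eq. 4)."""
    g = nx.Graph()
    for i in range(len(tokens)):
        for u, v in combinations(tokens[i:i + window], 2):
            if u == v:
                continue
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1     # accumulate co-occurrence counts
            else:
                g.add_edge(u, v, weight=1)
    total = sum(tfidf.get(w, 0.0) for w in g.nodes)
    prior = {w: tfidf.get(w, 0.0) / total for w in g.nodes}
    return nx.pagerank(g, alpha=1.0 - lam, personalization=prior)
```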

3.4 Evaluation Metrics

To evaluate the keyword extraction models, we employ three standard metrics from the keyword extraction and information retrieval literature: precision at 5 (P@5), precision at 10 (P@10), and mean reciprocal rank (MRR). In general, precision at k is computed as

$$P@k = \frac{\sum_{d=1}^{|D|} |y_d \cap y_d^k|}{\sum_{d=1}^{|D|} \min\{|y_d|, k\}} \quad (6)$$

where y_d is the set of gold keyphrases provided with a document d in the dataset D and y_d^k is the set of estimated top-k keyphrases from a model for the document.12

MRR measures the ranking quality given by a model as follows:

$$\mathrm{MRR} = \frac{1}{|D|} \sum_{d=1}^{|D|} \frac{1}{\min\{k \,:\, |y_d^k \cap y_d| \geq 1\}} \quad (7)$$

In this case, MRR takes into account the position of the first correct keyword in the ranked list of predictions.
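A minimal sketch of both metrics, assuming `gold` is a list of gold keyphrase sets and `predicted` a parallel list of ranked predictions, one entry per document:

```python
from typing import List, Sequence, Set

def precision_at_k(gold: List[Set[str]],
                   predicted: List[Sequence[str]], k: int) -> float:
    """Eq. (6): micro-averaged precision at k over the dataset."""
    hits = sum(len(set(p[:k]) & g) for g, p in zip(gold, predicted))
    denom = sum(min(len(g), k) for g in gold)
    return hits / denom

def mean_reciprocal_rank(gold: List[Set[str]],
                         predicted: List[Sequence[str]]) -> float:
    """Eq. (7): reciprocal rank of the first correct keyphrase, averaged."""
    total = 0.0
    for g, p in zip(gold, predicted):
        rank = next((i + 1 for i, phrase in enumerate(p) if phrase in g), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(gold)
```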

3.5 Results

In this section, we report our main experimental results comparing unsupervised keyword extraction methods. Table 4 shows the results obtained by all comparison systems. As can be observed, the performance highly depends on the dataset and the metric.

For each metric, the algorithms that most often achieve the best accuracy across datasets are TFIDFRank for P@5, SingleRank and TopicRank for P@10, and LexSpec and TFIDF for MRR. In the metrics averaged over all datasets, the lexical specificity and tf-idf based models (TFIDF, LexSpec, TFIDFRank, and LexRank) are shown to perform quite well on all metrics.

12 The minimum operation between the number of gold labels and k in the denominator of Eq. (6) is included so as to provide a measure between 0 and 1, given the varying number of gold labels. This has also been considered in other retrieval tasks with similar settings (Camacho-Collados et al., 2018).

Prior | Model | Time prior | Time total | Time per doc
tf | TF | 10.2 | 11.5 | 0.0058
tf | LexSpec | 10.2 | 12.1 | 0.0061
tf | LexRank | 10.2 | 25.5 | 0.0128
tf-idf | TFIDF | 10.3 | 22.4 | 0.0112
tf-idf | TFIDFRank | 10.3 | 26.5 | 0.0133
LDA | SingleTPR | 16.2 | 29.4 | 0.0147
- | FirstN | - | 11.5 | 0.0058
- | TextRank | - | 14.9 | 0.0075
- | SingleRank | - | 15.0 | 0.0075
- | PositionRank | - | 15.0 | 0.0075
- | TopicRank | - | 19.0 | 0.0095

Table 2: Average clock time (sec.) to process the entire dataset over 100 independent trials on the Inspec dataset.

In particular, the hybrid models LexRank and TFIDFRank achieve the best accuracy on all the metrics, with LexSpec and TFIDF being competitive in MRR. Overall, despite their simplicity, both lexical specificity and tf-idf appear to be able to exploit effective features for keyword extraction from a variety of datasets, and are robust to domain shifts in document size and format as well as in the source domain. In the following sections we perform a more in-depth analysis of these results and the overall performance of each type of model.

Moreover, TF gives a remarkably low accuracy on every metric, and the huge gap between TF and TFIDF can be interpreted as the improvement given by the inverse document frequency. Document frequency relies on a partition of the corpus, and hence corpora with an unreliable document partition, such as noisy social media texts or web-crawled content, can degrade TFIDF performance, as heuristics to build a meaningful partition would be required. On the other hand, lexical specificity only needs term frequencies, so we can expect it to be more robust than TFIDF in such practical situations.

Execution time. In terms of the complexity of each algorithm, we report the average processing time over 100 independent trials on the Inspec dataset in Table 2, which also includes the time to compute each statistical prior over the dataset. In general, none of the models is particularly slow but, as expected, statistical models are faster than graph-based models due to the overhead introduced by the PageRank algorithm, although they need to compute their statistical priors beforehand.


Agreement | FirstN | TF | LexSpec | TFIDF | TextRank | SingleRank | PositionRank | LexRank | TFIDFRank | SingleTPR | TopicRank
FirstN | 100 | 12 | 18 | 17 | 16 | 17 | 29 | 17 | 17 | 15 | 29
TF | 12 | 100 | 8 | 8 | 17 | 16 | 13 | 11 | 12 | 31 | 15
LexSpec | 18 | 8 | 100 | 77 | 40 | 46 | 37 | 64 | 63 | 34 | 31
TFIDF | 17 | 8 | 77 | 100 | 34 | 38 | 32 | 53 | 56 | 28 | 29
TextRank | 16 | 17 | 40 | 34 | 100 | 76 | 40 | 55 | 56 | 57 | 25
SingleRank | 17 | 16 | 46 | 38 | 76 | 100 | 48 | 69 | 69 | 66 | 26
PositionRank | 29 | 13 | 37 | 32 | 40 | 48 | 100 | 48 | 47 | 41 | 23
LexRank | 17 | 11 | 64 | 53 | 55 | 69 | 48 | 100 | 90 | 51 | 28
TFIDFRank | 17 | 12 | 63 | 56 | 56 | 69 | 47 | 90 | 100 | 50 | 28
SingleTPR | 16 | 31 | 34 | 28 | 57 | 66 | 41 | 51 | 50 | 100 | 23
TopicRank | 29 | 15 | 31 | 29 | 25 | 26 | 23 | 28 | 27 | 23 | 100

Table 3: Overall pairwise agreement scores (%) for the top-5 predictions.

4 Analysis

Following the main results presented in the previous section, we perform an analysis of different aspects of the evaluation. In particular, we focus on the agreement among methods (Section 4.1), overall performance (Section 4.2), and the features of each dataset that explain each method's performance (Section 4.3).

4.1 Agreement Analysis

For visualization purposes, we compute agreement scores over all possible pairs of models as the percentage of predicted keywords that the two models have in common in their top-5 predictions, as displayed in Table 3. Interestingly, the most similar models in terms of the agreement score are TFIDFRank and LexRank. Not surprisingly, TFIDF and LexSpec also exhibit a very high similarity, which implies that those two statistical measures capture quite similar features. However, as we will discuss in Section 5.1, they also have a few marked differences. Moreover, we can see that graph-based models show fairly high mutual agreement scores, except for TopicRank, which can be due to the difference in its word graph construction procedure. In fact, TopicRank unifies similar words before building the word graph, and that results in its distinct behaviour among graph-based models (see Section 2.2). In the discussion section, we investigate the relations among the models in more detail.
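A minimal sketch of how such an agreement score can be computed (assuming two parallel lists of ranked predictions, one entry per document):

```python
from typing import List, Sequence

def agreement_score(preds_a: List[Sequence[str]],
                    preds_b: List[Sequence[str]], k: int = 5) -> float:
    """Percentage of top-k predictions two models share, averaged over documents."""
    overlaps = [len(set(a[:k]) & set(b[:k])) / k
                for a, b in zip(preds_a, preds_b)]
    return 100.0 * sum(overlaps) / len(overlaps)
```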

4.2 Mean Precision Analysis

The objective of the following statistical analysis is to compare the overall performance of the keyword extraction methods in terms of their mean performance (i.e., P@k, for k = 5, 10, and MRR). For this analysis, all instances from all datasets are considered together, without splitting the performance by dataset as in our main experiments. Therefore, this analysis considers observations that correspond to a pair in the Cartesian product (document × method), that is, the application of a keyword extraction method to a document. Overall, 96,280 observations are included.

Table 5 illustrates the mean P@k and MRR for each keyword extraction method. The best P@5, P@10, and MRR are obtained by TFIDFRank, TopicRank, and TFIDFRank, respectively. A method dominates another in terms of performance if it is no worse in all the metrics and strictly better in at least one metric. Following this rule, it is possible to rank the methods according to their dominance order, i.e., the Pareto ranking. According to this ranking, the top methods are those that are non-dominated, followed by those that are dominated only by methods of the first group, et cetera. The resulting ranking is presented in the following:

1. TFIDFRank and TopicRank.
2. LexRank, LexSpec, and PositionRank.
3. SingleRank and TFIDF.
4. FirstN and TextRank.
5. SingleTPR.
6. TF.

As can be observed in this ranking and in the results of Table 5, LexSpec slightly but consistently outperforms TFIDF, which is an interesting result on its own given the predominance of TFIDF in the literature and in practical applications.13

4.3 Regression Analysis on Methods and Datasets

The objective of this analysis is to understand which characteristics of a dataset make one method better than another at extracting keywords. For this purpose, a regression model is built for every performance metric (P@k, for k = 5, 10, and MRR) and pair of keyword extraction methods (m1 and m2).

In the regression models, each observation is a pair in the Cartesian product (dataset × method). The following independent variables are considered: avg_word and sd_word (i.e., average and standard deviation of the number of tokens in the dataset, representing the length of the documents), avg_vocab and sd_vocab (i.e., average and standard deviation of the number of unique tokens in the dataset, representing the lexical richness of the documents), avg_phrase and sd_phrase (i.e., average and standard deviation of the number of noun phrases in the dataset, representing the number of candidate keywords in the documents), and avg_keyword and sd_keyword (i.e., average and standard deviation of the number of gold keyphrases associated with the dataset).

13 We extend the discussion about the comparison of LexSpec and TFIDF in Section 5.1.


P@5
Dataset | FirstN | TF | LexSpec | TFIDF | TextRank | SingleRank | PositionRank | LexRank | TFIDFRank | SingleTPR | TopicRank
KPCrowd | 37.9 | 26.5 | 41.6 | 41.3 | 32.1 | 32.0 | 33.5 | 33.8 | 33.9 | 26.8 | 39.5
Inspec | 34.8 | 19.8 | 34.0 | 34.3 | 36.2 | 36.7 | 35.8 | 36.0 | 36.2 | 33.0 | 35.1
Krapivin2009 | 19.4 | 0.1 | 10.0 | 8.8 | 7.3 | 10.1 | 16.1 | 11.1 | 10.8 | 8.2 | 9.9
Nguyen2007 | 19.7 | 0.2 | 18.7 | 17.7 | 14.1 | 18.6 | 21.8 | 19.7 | 19.8 | 15.2 | 14.7
PubMed | 10.6 | 3.9 | 8.4 | 7.4 | 11.4 | 11.9 | 11.4 | 9.8 | 9.8 | 10.3 | 8.6
Schutz2008 | 17.1 | 1.6 | 39.4 | 39.2 | 34.3 | 36.7 | 18.5 | 39.2 | 39.7 | 14.5 | 47.0
SemEval2010 | 15.1 | 1.5 | 14.7 | 12.9 | 13.4 | 17.4 | 23.2 | 16.8 | 16.6 | 13.5 | 16.5
SemEval2017 | 31.5 | 17.7 | 47.8 | 49.2 | 42.8 | 44.7 | 42.4 | 47.9 | 48.7 | 36.5 | 38.5
citeulike180 | 6.8 | 9.9 | 18.8 | 15.6 | 23.6 | 24.7 | 20.9 | 23.6 | 25.0 | 24.4 | 17.2
fao30 | 17.3 | 16.0 | 24.0 | 20.7 | 26.0 | 30.0 | 24.0 | 29.3 | 29.3 | 32.0 | 24.7
fao780 | 10.5 | 3.7 | 14.4 | 12.8 | 14.8 | 16.8 | 15.6 | 15.9 | 15.9 | 17.2 | 14.5
kdd | 35.2 | 10.1 | 32.4 | 33.8 | 29.4 | 32.2 | 33.8 | 34.9 | 36.8 | 16.5 | 34.0
theses100 | 7.3 | 0.9 | 15.9 | 15.5 | 9.4 | 12.9 | 14.2 | 15.5 | 15.0 | 12.4 | 12.9
wiki20 | 13.0 | 13.0 | 17.0 | 21.0 | 13.0 | 19.0 | 14.0 | 22.0 | 23.0 | 20.0 | 16.0
www | 32.3 | 14.1 | 32.1 | 33.2 | 26.9 | 27.9 | 32.6 | 27.8 | 30.9 | 17.9 | 30.8
AVG | 20.6 | 9.3 | 24.6 | 24.2 | 22.3 | 24.8 | 23.9 | 25.6 | 26.1 | 19.9 | 24.0

P@10
Dataset | FirstN | TF | LexSpec | TFIDF | TextRank | SingleRank | PositionRank | LexRank | TFIDFRank | SingleTPR | TopicRank
KPCrowd | 36.6 | 25.6 | 35.1 | 35.4 | 26.8 | 27.3 | 29.4 | 29.0 | 29.3 | 25.4 | 36.6
Inspec | 33.8 | 22.6 | 35.4 | 35.6 | 38.3 | 38.6 | 36.8 | 37.6 | 37.7 | 34.4 | 33.9
Krapivin2009 | 19.6 | 0.1 | 9.9 | 8.6 | 7.3 | 10.0 | 15.6 | 10.9 | 10.6 | 8.2 | 10.1
Nguyen2007 | 17.4 | 0.7 | 15.6 | 14.6 | 13.6 | 16.6 | 19.2 | 17.2 | 17.4 | 14.4 | 14.7
PubMed | 10.1 | 4.2 | 6.9 | 6.3 | 10.0 | 10.6 | 10.1 | 8.5 | 8.3 | 9.0 | 8.5
Schutz2008 | 14.7 | 4.2 | 28.5 | 28.3 | 25.8 | 27.5 | 17.5 | 28.7 | 29.0 | 13.0 | 41.7
SemEval2010 | 12.3 | 1.0 | 11.1 | 10.9 | 11.2 | 14.1 | 17.8 | 13.5 | 13.7 | 11.1 | 14.7
SemEval2017 | 31.9 | 21.2 | 46.8 | 47.6 | 43.4 | 44.2 | 42.4 | 47.0 | 46.9 | 37.4 | 35.0
citeulike180 | 6.8 | 8.5 | 14.4 | 12.3 | 18.2 | 19.1 | 17.3 | 17.5 | 18.6 | 18.5 | 14.5
fao30 | 15.1 | 13.0 | 15.8 | 14.0 | 20.9 | 22.9 | 19.5 | 20.5 | 21.6 | 24.0 | 18.5
fao780 | 10.2 | 3.6 | 13.3 | 11.3 | 13.6 | 15.4 | 14.7 | 14.9 | 14.5 | 15.9 | 13.9
kdd | 35.2 | 10.1 | 32.3 | 33.7 | 29.3 | 32.1 | 33.9 | 34.8 | 36.8 | 16.5 | 33.9
theses100 | 7.1 | 0.8 | 15.9 | 15.1 | 9.2 | 13.0 | 13.8 | 15.5 | 14.6 | 12.1 | 12.6
wiki20 | 13.5 | 9.3 | 12.4 | 12.4 | 12.4 | 15.5 | 13.0 | 14.5 | 15.5 | 15.0 | 16.6
www | 32.2 | 14.0 | 32.1 | 33.2 | 26.9 | 27.9 | 32.7 | 27.8 | 31.0 | 17.9 | 30.9
AVG | 19.8 | 9.3 | 21.7 | 21.3 | 20.5 | 22.3 | 22.2 | 22.5 | 23.0 | 18.2 | 22.4

MRR
Dataset | FirstN | TF | LexSpec | TFIDF | TextRank | SingleRank | PositionRank | LexRank | TFIDFRank | SingleTPR | TopicRank
KPCrowd | 61.7 | 46.8 | 75.5 | 74.6 | 64.1 | 63.3 | 65.7 | 67.5 | 67.0 | 50.8 | 62.3
Inspec | 58.4 | 33.6 | 53.3 | 53.7 | 52.3 | 53.3 | 58.1 | 54.2 | 54.6 | 51.1 | 58.8
Krapivin2009 | 36.4 | 1.3 | 23.0 | 21.2 | 18.2 | 22.3 | 31.7 | 23.8 | 24.0 | 18.9 | 22.0
Nguyen2007 | 43.6 | 2.8 | 38.8 | 42.3 | 31.3 | 35.1 | 43.8 | 37.0 | 38.4 | 30.4 | 34.2
PubMed | 23.4 | 13.5 | 23.8 | 21.7 | 32.1 | 30.9 | 31.0 | 27.4 | 26.5 | 26.4 | 20.0
Schutz2008 | 24.7 | 8.6 | 76.7 | 76.8 | 68.9 | 70.9 | 38.5 | 75.6 | 76.4 | 33.6 | 67.4
SemEval2010 | 49.7 | 4.5 | 35.8 | 34.6 | 32.9 | 35.5 | 47.8 | 35.3 | 36.4 | 29.8 | 35.9
SemEval2017 | 52.2 | 32.9 | 68.9 | 68.9 | 61.6 | 63.7 | 62.7 | 67.7 | 67.4 | 57.2 | 63.9
citeulike180 | 21.0 | 23.7 | 55.6 | 48.3 | 58.5 | 62.9 | 51.3 | 63.3 | 66.1 | 63.0 | 40.5
fao30 | 31.1 | 38.3 | 61.8 | 49.1 | 60.2 | 70.0 | 48.6 | 66.1 | 67.0 | 75.1 | 50.6
fao780 | 17.5 | 8.8 | 40.2 | 36.7 | 37.1 | 39.7 | 36.9 | 40.6 | 40.0 | 39.0 | 32.5
kdd | 52.5 | 28.2 | 54.4 | 55.5 | 49.1 | 53.1 | 56.4 | 55.6 | 57.1 | 39.3 | 52.6
theses100 | 16.7 | 2.8 | 37.1 | 34.5 | 25.7 | 29.1 | 27.5 | 35.1 | 35.3 | 30.2 | 29.8
wiki20 | 27.5 | 27.7 | 52.7 | 47.7 | 40.1 | 45.7 | 31.1 | 52.2 | 46.5 | 37.7 | 35.5
www | 51.1 | 32.0 | 52.3 | 53.5 | 45.6 | 47.5 | 52.5 | 49.1 | 51.2 | 37.4 | 48.2
AVG | 37.8 | 20.4 | 50.0 | 47.9 | 45.2 | 48.2 | 45.6 | 50.0 | 50.3 | 41.3 | 43.6

Table 4: Mean precision at top 5 (P@5) and top 10 (P@10), and mean reciprocal rank (MRR). The best score in each dataset is highlighted using a bold font.


The regression models estimate the dependent variable as

$$\Delta \mathrm{avg\_score} = \mathrm{avg\_score}_{m_1} - \mathrm{avg\_score}_{m_2}$$

where avg_score_{m1} and avg_score_{m2} are the average performance metrics obtained by the methods m1 and m2 on the dataset's documents, respectively. Feature selection is carried out by forward stepwise selection using BIC penalization to remove non-significant variables. Each model considers 15 observations and, overall, 165 regression models are fitted.
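A minimal sketch of this procedure with statsmodels (the column names are hypothetical; `df` is assumed to hold one row per dataset, with the candidate variables and the dependent variable `delta` = avg_score_m1 − avg_score_m2 for a fixed metric and method pair):

```python
from typing import List
import pandas as pd
import statsmodels.api as sm

def forward_stepwise_bic(df: pd.DataFrame, candidates: List[str]) -> List[str]:
    """Greedy forward selection of regressors for `delta`, minimizing BIC."""
    remaining, selected = list(candidates), []
    best_bic, improved = float("inf"), True
    while improved and remaining:
        improved = False
        # BIC of the model obtained by adding each remaining variable.
        bics = {f: sm.OLS(df["delta"],
                          sm.add_constant(df[selected + [f]])).fit().bic
                for f in remaining}
        best_f, bic = min(bics.items(), key=lambda kv: kv[1])
        if bic < best_bic:
            selected.append(best_f)
            remaining.remove(best_f)
            best_bic, improved = bic, True
    return selected  # fit sm.OLS on these to read adjR2 and p-values
```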


Method | P@5 | P@10 | MRR
FirstN | 24.7 | 24.0 | 42.0
TF | 9.2 | 10.1 | 19.3
LexSpec | 26.2 | 24.5 | 48.1
TFIDF | 26.0 | 24.2 | 47.4
TextRank | 24.0 | 22.8 | 43.8
SingleRank | 25.8 | 24.4 | 46.0
PositionRank | 25.5 | 24.9 | 46.1
LexRank | 26.4 | 24.8 | 47.8
TFIDFRank | 27.1 | 25.3 | 48.4
SingleTPR | 19.0 | 18.6 | 37.0
TopicRank | 26.6 | 25.4 | 45.1

Table 5: Keyword extraction methods' mean precision at k, for k = 5, 10, and MRR. Each column is independently colour-coded according to a gradient that goes from green (best/highest value) to red (worst/lowest value).

Figure 2: Distribution of the regression models' adjusted coefficients of determination (histogram; x-axis: adjusted R², from 0.0 to 1.0; y-axis: frequency).


The adjusted coefficient of determination (adjR2) is used as a measure of goodness of fit; its distribution is illustrated in Figure 2, which shows overall good explanatory capabilities for the regression models. In more detail, the 0%, 25%, 50%, 75%, and 100% quantiles are -0.0481, 0.5779, 0.7092, 0.8263, and 0.9679, respectively (note that adjR2 allows for negative values). Thus, more than 75% of the models have an adjR2 > 0.58, and more than 50% have an adjR2 > 0.71, suggesting that, in general, the considered dataset characteristics satisfactorily explain the differences in the results obtained by the keyword extraction methods. Therefore, the variables can be used to determine which method is more performant for a given dataset.


m1 | m2 | Metric | adjR2
LexSpec | TFIDF | P@5 | 0.78
LexSpec | TFIDF | P@10 | 0.88
LexSpec | TFIDF | MRR | 0.63
SingleRank | TopicRank | P@5 | 0.84
SingleRank | TopicRank | P@10 | 0.73
SingleRank | TopicRank | MRR | 0.74
LexSpec | SingleRank | P@5 | 0.67
LexSpec | SingleRank | P@10 | 0.57
LexSpec | SingleRank | MRR | 0.43
SingleRank | TFIDF | P@5 | 0.65
SingleRank | TFIDF | P@10 | 0.63
SingleRank | TFIDF | MRR | 0.56

Table 6: Goodness of fit of selected regression models. The columns present the keyword extraction methods compared (m1 and m2), the metric chosen (Metric), and the corresponding adjusted coefficient of determination (adjR2).

In addition to this overall interpretation, the coefficients of the regression models can be used to understand under what circumstances each model is preferable. In fact, a positive coefficient identifies a variable that positively correlates with a greater precision for m1. On the other hand, a negative coefficient corresponds to a variable that positively correlates with a greater precision for m2. Only the models having an adjR2 > 0.50, and their statistically significant variables (i.e., p-value < 0.05), should be considered for interpretation.

Table 6 illustrates a summary of the goodness of fit for a selection of models that are analysed in the following section. Most models, except for LexSpec vs. SingleRank using MRR, achieve an adjR2 > 0.50, which indicates that the considered variables explain most of the differences in performance between the models. Finally, Table 7 illustrates the significant variables in the regression models. These are used in the following section to draw insights on the methods' preferences in terms of dataset characteristics.

5 Discussion

In this section, we provide a focused comparison among the different types of models, highlighting their main differences, advantages and disadvantages.14 First, we discuss the two main statistical methods analyzed in this paper, namely LexSpec and TFIDF (Section 5.1). Then, we analyze and discuss graph-based methods, and in particular SingleRank and TopicRank (Section 5.2). Finally, we provide an overview of the main differences between statistical and graph-based methods (Section 5.3).

14 More details about the individual regression analyses are available in the appendix.


m1 | m2 | Metric | Significant variables favouring m1 | Significant variables favouring m2
LexSpec | TFIDF | P@5 | avg_word (*), sd_vocab (*), avg_keyword (*) | sd_word (**)
LexSpec | TFIDF | P@10 | avg_phrase (**), sd_phrase (*), sd_keyword (*) | sd_word (**), avg_vocab (*)
LexSpec | TFIDF | MRR | avg_phrase (**) | avg_word (**), avg_vocab (**)
SingleRank | TopicRank | P@5 | avg_vocab (**), sd_phrase (**), avg_keyword (*) | sd_word (**), sd_vocab (***)
SingleRank | TopicRank | P@10 | sd_phrase (*) | sd_word (*), sd_vocab (**)
SingleRank | TopicRank | MRR | avg_phrase (***), avg_keyword (*) | avg_word (***), avg_vocab (**)
SingleRank | LexSpec | P@5 | avg_phrase (*), sd_phrase (*), avg_keyword (**) | sd_word (*), avg_vocab (*), sd_keyword (**)
SingleRank | LexSpec | P@10 | avg_phrase (*), avg_keyword (*) | sd_word (*), avg_vocab (*), sd_keyword (*)
SingleRank | TFIDF | P@5 | avg_phrase (*), avg_keyword (*) | sd_word (*), avg_vocab (*), sd_keyword (*)
SingleRank | TFIDF | P@10 | avg_phrase (*), avg_keyword (*) | avg_vocab (*), sd_keyword (*)
SingleRank | TFIDF | MRR | avg_phrase (*), avg_keyword (*) | sd_word (*), avg_vocab (*), sd_keyword (*)

Table 7: Significant variables in the regression models comparing keyword extraction methods' performance. Only models having adjR2 > 0.5 are represented. Columns m1 and m2 report the compared methods; column 'Metric' shows the performance metric considered; the central columns illustrate the statistically significant variables that positively affect the performance of each method. The significance of the variables is indicated in parentheses, according to the following scale: 0 '***' 0.001 '**' 0.01 '*' 0.05.


5.1 LexSpec vs. TFIDF

According to Table 5, LexSpec and TFIDF have similar average performance, although LexSpec obtains slightly better scores at all precision levels. Both measures also attain a high agreement score, as explained in Section 4.1. However, in the Pareto ranking, LexSpec ranks second, while TFIDF ranks third. Therefore, in general, the former should be preferred over the latter.

Still, TFIDF might perform well on certain datasets. According to Table 7, the choice of the keyword extraction method strongly depends on the metric used. For P@5, TFIDF performs better on datasets having a higher variability in the number of words (sd_word), while LexSpec prefers datasets with longer documents (avg_word), more variability in terms of lexical richness (sd_vocab), and more gold keyphrases (avg_keyword). For P@10, LexSpec exhibits a very different behaviour. In fact, it now performs significantly better on datasets with a high average and high variability in the number of noun phrases (avg_phrase and sd_phrase), as well as a variable number of gold keywords (sd_keyword). On the other hand, TFIDF prefers datasets with high variability in the documents' length (sd_word) and high lexical richness (avg_vocab). Finally, for MRR, the only characteristic that concerns LexSpec is the number of candidate keyphrases (avg_phrase), while TFIDF prefers datasets with longer and lexically richer documents (avg_word and avg_vocab).

Broadly speaking, LexSpec and TFIDF also have qualitative differences. LexSpec, being based on the hypergeometric distribution, has a statistical nature, and probabilities can be directly inferred from it. In this respect, TFIDF is heuristics-based, although it can also be integrated within a probabilistic framework (Joachims, 1996) or interpreted from an information-theoretic perspective (Aizawa, 2003). In practical terms, LexSpec has the advantage of not requiring a partition into documents, unlike the traditional formulation of TFIDF. On the flip side, TFIDF has generally been found relatively simple to tune for specific settings (Cui et al., 2014).

5.2 SingleRank vs. TopicRank

In this analysis we compare two qualitatively different graph-based methods, namely SingleRank (a representative of vanilla graph-based methods) and TopicRank, which leverages topic models. TopicRank outperforms SingleRank in P@5 and P@10, while SingleRank achieves a better average MRR score (see Table 5). However, in terms of ranking, TopicRank ranks first (due to obtaining the best average P@10), while SingleRank ranks third. Therefore, in general, TopicRank should be preferred, unless MRR is the metric of choice.

The insights drawn from the regression models are summarised as follows. Table 7 shows that the performance of SingleRank depends on the metric used. On the other hand, TopicRank has a more stable set of preferences. However, it is still possible to identify a pattern. In fact, TopicRank is positively influenced by the number of words and the lexical richness of the documents in a dataset (avg_word, sd_word, avg_vocab, and sd_vocab), while SingleRank is affected by the number of noun phrases and keyphrases associated with the documents (avg_phrase, sd_phrase, and avg_keyword).



5.3 Statistical vs. Graph-based

When comparing SingleRank against TFIDF and LexSpec in terms of average performance (see Table 5), it can be seen that SingleRank completely dominates TFIDF. On the other hand, SingleRank and LexSpec both rank second, as the latter achieves a better MRR. Therefore, it is advisable to use SingleRank when the metric of choice is precision at k, and LexSpec when MRR is used instead. This result seems to suggest that, while statistical methods can be reliably used to extract relevant terms when precision is required (recall that MRR rewards systems that rank the first correct candidate at the top), graph-based methods can extract a more coherent set of keywords overall thanks to their graph-connectivity measures. This idea should be investigated in more detail in future research.

In terms of dataset features, Table 7 shows that the behaviour of SingleRank is very stable. In fact, across all metrics, SingleRank performs better on datasets with a high average number of noun phrases and keyphrases (avg_phrase and avg_keyword). On the other hand, the statistical methods (i.e., TFIDF and LexSpec) achieve better performance on datasets with a high standard deviation in the number of words and keyphrases, and a high average number of unique tokens (sd_word, sd_keyword, and avg_vocab). In conclusion, SingleRank performs better on datasets having a high number of candidate and gold keyphrases, while its performance is hindered on datasets having more lexical richness.

As far as efficiency is concerned, statistical methods are faster overall in terms of computation time (see Table 2). However, all methods are overall efficient, and this factor should not be especially relevant unless computations need to be done on the fly or at a very large scale. As an advantage of graph-based models, they do not require a prior computation over the whole dataset. Therefore, graph-based models could potentially reduce the gap in overall execution time in online learning settings, where new documents are added after the initial computations.

6 Conclusion

In this paper, we have presented a large-scale empirical comparison of unsupervised keyword extraction measures. Our study was focused on two different types of keyword extraction methods, namely statistical methods relying on frequency-based features, and graph-based methods exploiting the inter-connectivity of words in a corpus.

Our analysis on fifteen diverse keyword extraction datasets reveals various insights with respect to each type of method and provides practical suggestions.

In addition to well-known term weighting schemes such as tf-idf, our comparison includes statistical methods such as lexical specificity, which shows better performance than tf-idf while being significantly less used in the literature. We have also explored various types of graph-based methods, based on PageRank and on topic models, with varying conclusions with respect to performance and execution time. Finally, all the results and analyses can be used as a reference for future research to understand in detail the advantages and disadvantages of each approach in different settings, both qualitatively and quantitatively.

As future work, we plan to extend this analysis to fathom the extent and characteristics of the interactions of different methods and their complementarity. Moreover, we are planning to extend this empirical comparison to other settings where the methods are used as weighting schemes for NLP and IR applications, and to languages other than English.

ReferencesAkiko Aizawa. 2003. An information-theoretic per-

spective of tf–idf measures. Information Processing& Management, 39(1):45–65.

Ignacio Arroyo-Fernández, Carlos-Francisco Méndez-Cruz, Gerardo Sierra, Juan-Manuel Torres-Moreno,and Grigori Sidorov. 2019. Unsupervised sentencerepresentations as word information series: Revisit-ing tf–idf. Computer Speech & Language, 56:107–129.

Isabelle Augenstein, Mrinal Das, Sebastian Riedel,Lakshmi Vikraman, and Andrew McCallum. 2017.SemEval 2017 task 10: ScienceIE - extractingkeyphrases and relations from scientific publica-tions. In Proceedings of the 11th InternationalWorkshop on Semantic Evaluation (SemEval-2017),pages 546–555, Vancouver, Canada. Association forComputational Linguistics.

Mokhtar-Boumeyden Billami, José Camacho-Collados, Evelyne Jacquey, and Laurence Kister.2014. Annotation sémantique et validation termi-nologique en texte intégral en SHS. In Proceedingsof TALN, pages 363–376.

Page 12: arXiv:2104.08028v1 [cs.LG] 16 Apr 2021

Steven Bird, Ewan Klein, and Edward Loper. 2009.Natural language processing with Python: analyz-ing text with the natural language toolkit. " O’ReillyMedia, Inc.".

Florian Boudin. 2018. Unsupervised keyphrase extrac-tion with multipartite graphs. In Proceedings of the2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, Volume 2 (Short Pa-pers), pages 667–672, New Orleans, Louisiana. As-sociation for Computational Linguistics.

Adrien Bougouin, Florian Boudin, and Béatrice Daille.2013. Topicrank: Graph-based topic ranking forkeyphrase extraction. In International joint con-ference on natural language processing (IJCNLP),pages 543–551.

Jose Camacho-Collados, Claudio Delli Bovi, LuisEspinosa-Anke, Sergio Oramas, Tommaso Pasini,Enrico Santus, Vered Shwartz, Roberto Navigli, andHoracio Saggion. 2018. SemEval-2018 task 9: Hy-pernym discovery. In Proceedings of The 12th Inter-national Workshop on Semantic Evaluation, pages712–724, New Orleans, Louisiana. Association forComputational Linguistics.

José Camacho-Collados, Mohammad Taher Pilehvar,and Roberto Navigli. 2016. Nasari: Integrating ex-plicit knowledge and corpus statistics for a multilin-gual representation of concepts and entities. Artifi-cial Intelligence, 240:36–64.

Ricardo Campos, Vítor Mangaravite, Arian Pasquali,Alipio Jorge, Célia Nunes, and Adam Jatowt. 2020.Yake! keyword extraction from single documentsusing multiple local features. Information Sciences,509:257–289.

Jia Cui, Jonathan Mamou, Brian Kingsbury, and Bhu-vana Ramabhadran. 2014. Automatic keyword se-lection for keyword search development and tuning.In 2014 IEEE International Conference on Acous-tics, Speech and Signal Processing (ICASSP), pages7839–7843. IEEE.

Patrick Drouin. 2003. Term extraction using non-technical corpora as a point of leverage. Terminol-ogy, 9(1):99–115.

Samhaa R El-Beltagy and Ahmed Rafea. 2009. Kp-miner: A keyphrase extraction system for en-glish and arabic documents. Information systems,34(1):132–144.

Corina Florescu and Cornelia Caragea. 2017. Position-rank: An unsupervised approach to keyphrase ex-traction from scholarly documents. In Proceedingsof the 55th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 1105–1115.

Sujatha Das Gollapalli and Cornelia Caragea. 2014.Extracting keyphrases from research papers using ci-tation networks. In Proceedings of the AAAI Confer-ence on Artificial Intelligence, volume 28.

Maria Grineva, Maxim Grinev, and Dmitry Lizorkin.2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of the 18th inter-national conference on World wide web, pages 661–670.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pa-supat, and Mingwei Chang. 2020. Retrieval aug-mented language model pre-training. In Inter-national Conference on Machine Learning, pages3929–3938. PMLR.

Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart.2008. Exploring network structure, dynamics, andfunction using networkx. In Proceedings of the7th Python in Science Conference, pages 11 – 15,Pasadena, CA USA.

Anette Hulth. 2003. Improved automatic keyword ex-traction given more linguistic knowledge. In Pro-ceedings of the 2003 Conference on Empirical Meth-ods in Natural Language Processing, pages 216–223.

Siham Jabri, Azzeddine Dahbi, Taoufiq Gadi, and Ab-delhak Bassir. 2018. Ranking of text documents us-ing tf-idf weighting and association rules mining. In2018 4th International Conference on Optimizationand Applications (ICOA), pages 1–6. IEEE.

James Jardine and Simone Teufel. 2014. Topical pager-ank: A model of scientific expertise for biblio-graphic search. In Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics, pages 501–510.

Thorsten Joachims. 1996. A probabilistic analysis ofthe rocchio algorithm with tfidf for text categoriza-tion. Technical report, Carnegie-mellon univ pitts-burgh pa dept of computer science.

Karen Sparck Jones. 1972. A statistical interpretationof term specificity and its application in retrieval.Journal of documentation.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, andTimothy Baldwin. 2010. SemEval-2010 task 5 : Au-tomatic keyphrase extraction from scientific articles.In Proceedings of the 5th International Workshop onSemantic Evaluation, pages 21–26, Uppsala, Swe-den. Association for Computational Linguistics.

Mikalai Krapivin, Aliaksandr Autaeu, and MaurizioMarchese. 2009. Large dataset for keyphrases ex-traction.

Pierre Lafon. 1980. Sur la variabilité de la fréquencedes formes dans un corpus. Mots. Les langages dupolitique, 1(1):127–165.

Shibamouli Lahiri, Rada Mihalcea, and Po-Hsiang Lai.2017. Keyword extraction from emails. Nat. Lang.Eng., 23(2):295–317.

Ludovic Lebart, A Salem, and Lisette Berry. 1998. Ex-ploring textual data. Kluwer Academic Publishers.

Page 13: arXiv:2104.08028v1 [cs.LG] 16 Apr 2021

Zhiyuan Liu, Peng Li, Yabin Zheng, and MaosongSun. 2009. Clustering to find exemplar terms forkeyphrase extraction. In Proceedings of the 2009conference on empirical methods in natural lan-guage processing, pages 257–266.

Samuel Marcos-Pablos and Francisco J García-Peñalvo. 2020. Information retrieval methodology for aiding scientific database search. Soft Computing, 24(8):5551–5560.

Luis Marujo, Márcio Viveiros, and João Paulo da Silva Neto. 2013. Keyphrase cloud generation of broadcast news. arXiv preprint arXiv:1306.4606.

Olena Medelyan, Eibe Frank, and Ian H. Witten. 2009. Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1318–1327, Singapore. Association for Computational Linguistics.

Olena Medelyan and Ian H Witten. 2008. Domain-independent automatic keyphrase indexing with small training sets. Journal of the American Society for Information Science and Technology, 59(7):1026–1040.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.

Thuy Dung Nguyen and Min-Yen Kan. 2007. Keyphrase extraction in scientific publications. In International Conference on Asian Digital Libraries, pages 317–326. Springer.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Jiaul H Paik. 2013. A novel tf-idf weighting scheme for effective ranking. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 343–352.

Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 29–48. Citeseer.

Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Benjamin Riedel, Isabelle Augenstein, Georgios P Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pages 1–20.

Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020. SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8758–8765.

Alexander Thorsten Schutz et al. 2008. Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods.

Lucas Sterckx, Thomas Demeester, Johannes Deleu, and Chris Develder. 2015. Topical word importance for fast keyphrase extraction. In Proceedings of the 24th International Conference on World Wide Web, pages 121–122.

Si Sun, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jie Bao. 2020. Joint keyphrase chunking and salience ranking with BERT. arXiv preprint arXiv:2004.13639.

Zhong Tang, Wenqiang Li, and Yan Li. 2020. An improved term weighting scheme for text classification. Concurrency and Computation: Practice and Experience, 32(9):e5604.

Xiaojun Wan and Jianguo Xiao. 2008a. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 969–976.

Xiaojun Wan and Jianguo Xiao. 2008b. Single document keyphrase extraction using neighborhood knowledge. In AAAI, volume 8, pages 855–860.

Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-Manning. 2005. KEA: Practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129–152. IGI Global.

Ho Chung Wu, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok. 2008. Interpreting tf-idf term weights as making relevance decisions. ACM Transactions on Information Systems (TOIS), 26(3):1–37.

Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, and Arnold Overwijk. 2019. Open domain web keyphrase extraction beyond language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5175–5184, Hong Kong, China. Association for Computational Linguistics.


A Appendix

A.1 Regression models

In the following tables, the regression models for the comparisons considered in Section 5 are presented. For each variable, the tables show the estimated coefficient value, the standard error, the t-value, and the p-value. The last column identifies the significance of the coefficient, according to the following scale: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. The adjusted coefficient of determination (adjR2) is provided in the caption. Note that only the models having adjR2 > 0.5 are shown.
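For readers who wish to reproduce this style of analysis, the following is a minimal sketch of fitting such a model with Python and statsmodels; it is not the pipeline used for the paper (whose tables follow the format of R's lm() summary). The input file dataset_statistics.csv and the response column p_at_5_diff are hypothetical placeholders, while the predictor names match the tables below.

# A minimal sketch, not the authors' pipeline: fit an OLS regression whose
# summary carries the same columns as Tables 8-18 (Estimate, Std. Error,
# t value, Pr(>|t|)) plus the adjusted R^2 reported in the captions.
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per evaluation dataset, with corpus statistics
# as predictors and a performance difference between two methods as response.
df = pd.read_csv("dataset_statistics.csv")

predictors = ["avg_word", "sd_word", "avg_vocab", "sd_vocab",
              "avg_phrase", "sd_phrase", "avg_keyword", "sd_keyword"]
X = sm.add_constant(df[predictors])  # adds the (Intercept) column
y = df["p_at_5_diff"]                # hypothetical response, e.g. a P@5 gap

model = sm.OLS(y, X).fit()
print(model.summary())               # coefficients, std errors, t and p-values
print("adjusted R2:", model.rsquared_adj)

def significance_code(p):
    # The scale used in the tables: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return " "

for name, p in model.pvalues.items():
    print(name, significance_code(p))

Under the filtering criterion stated above, only fitted models with model.rsquared_adj > 0.5 would then be retained.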

             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -1.5941079  0.4962972   -3.212   0.01239   *
avg_word      0.0008803  0.0003262    2.699   0.02713   *
sd_word      -0.0041636  0.0008272   -5.033   0.00101   **
avg_vocab    -0.0138346  0.0067862   -2.039   0.07583   .
sd_vocab      0.019841   0.0085917    2.309   0.04974   *
avg_phrase    0.0177073  0.0090282    1.961   0.08548   .
avg_keyword   0.1343021  0.0539089    2.491   0.03745   *

Table 8: LexSpec VS TFIDF; metric: P@5; adjR2 = 0.78.

             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -1.1315465  0.2313255   -4.892   0.000858  ***
sd_word      -0.0017211  0.0003994   -4.309   0.001964  **
avg_vocab    -0.004639   0.0017163   -2.703   0.024278  *
avg_phrase    0.0098229  0.0023125    4.248   0.00215   **
sd_phrase     0.0124841  0.0047468    2.63    0.027358  *
sd_keyword    0.1032414  0.0422304    2.445   0.037077  *

Table 9: LexSpec VS TFIDF; metric: P@10; adjR2 = 0.88.

             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -1.1943708  1.2660339   -0.943   0.36771
avg_word     -0.0027185  0.0007066   -3.847   0.00323   **
avg_vocab    -0.0347223  0.0109055   -3.184   0.00975   **
avg_phrase    0.0759399  0.018437     4.119   0.00208   **
avg_keyword   0.2660168  0.1249109    2.13    0.05906   .

Table 10: LexSpec VS TFIDF; metric: MRR; adjR2 = 0.63.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   2.567422  1.26513      2.029   0.076932  .
sd_word      -0.010703  0.002645    -4.046   0.003704  **
avg_vocab     0.007276  0.001986     3.664   0.006366  **
sd_vocab     -0.151724  0.023117    -6.563   0.000176  ***
sd_phrase     0.214204  0.045641     4.693   0.001555  **
avg_keyword   0.690368  0.214972     3.211   0.012398  *
sd_keyword   -0.738872  0.409539    -1.804   0.108859

Table 11: SingleRank VS TopicRank; metric: P@5; adjR2 = 0.84.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   4.539467  1.940424     2.339   0.04746   *
sd_word      -0.01031   0.004057    -2.541   0.03465   *
avg_vocab     0.006845  0.003046     2.247   0.0548    .
sd_vocab     -0.15218   0.035457    -4.292   0.00264   **
sd_phrase     0.197827  0.070003     2.826   0.02229   *
avg_keyword   0.696279  0.329719     2.112   0.06769   .
sd_keyword   -0.912998  0.62814     -1.453   0.18416

Table 12: SingleRank VS TopicRank; metric: P@10; adjR2 = 0.73.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -1.887312  2.109815    -0.895   0.392055
avg_word     -0.00617   0.001178    -5.24    0.000379  ***
avg_vocab    -0.064177  0.018174    -3.531   0.005435  **
avg_phrase    0.151258  0.030725     4.923   0.000602  ***
avg_keyword   0.467208  0.208161     2.244   0.048635  *

Table 13: SingleRank VS TopicRank; metric: MRR; adjR2 = 0.74.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -0.664793  1.540688    -0.431   0.6775
sd_word       0.007625  0.003021     2.524   0.03558   *
avg_vocab     0.041161  0.013866     2.968   0.01791   *
avg_phrase   -0.060791  0.018227    -3.335   0.0103    *
sd_phrase    -0.053885  0.032652    -1.65    0.13749
avg_keyword  -1.052742  0.285889    -3.682   0.0062    **
sd_keyword    1.935404  0.514711     3.76    0.00554   **

Table 14: LexSpec VS SingleRank; metric: P@5; adjR2 = 0.67.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -0.603773  1.654271    -0.365   0.7236
sd_word       0.002687  0.001166     2.305   0.0466    *
avg_vocab     0.026014  0.011146     2.334   0.0445    *
avg_phrase   -0.042142  0.015843    -2.66    0.026     *
avg_keyword  -0.896504  0.282468    -3.174   0.0113    *
sd_keyword    1.594743  0.531522     3       0.0149    *

Table 15: LexSpec VS SingleRank; metric: P@10; adjR2 = 0.57.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -0.904265  1.991195    -0.454   0.6618
sd_word      -0.010164  0.003904    -2.603   0.0315    *
avg_vocab    -0.046465  0.017921    -2.593   0.032     *
avg_phrase    0.074571  0.023556     3.166   0.0133    *
sd_phrase     0.061018  0.0422       1.446   0.1862
avg_keyword   1.079557  0.369484     2.922   0.0192    *
sd_keyword   -1.741422  0.665215    -2.618   0.0308    *

Table 16: SingleRank VS TFIDF; metric: P@5; adjR2 = 0.65.


             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -0.520477  1.784994    -0.292   0.778
sd_word      -0.007615  0.0035      -2.176   0.0613    .
avg_vocab    -0.041068  0.016065    -2.556   0.0338    *
avg_phrase    0.064078  0.021117     3.034   0.0162    *
sd_phrase     0.049712  0.03783      1.314   0.2252
avg_keyword   1.020944  0.331222     3.082   0.0151    *
sd_keyword   -1.647247  0.596328    -2.762   0.0246    *

Table 17: SingleRank VS TFIDF; metric: P@10; adjR2 = 0.63.

             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -1.092856  3.733491    -0.293   0.7772
sd_word      -0.016361  0.005055    -3.236   0.0119    *
avg_vocab    -0.132977  0.047274    -2.813   0.0227    *
sd_vocab      0.099227  0.055651     1.783   0.1124
avg_phrase    0.19399   0.061332     3.163   0.0133    *
avg_keyword   2.138082  0.681792     3.136   0.0139    *
sd_keyword   -3.369139  1.290018    -2.612   0.031     *

Table 18: SingleRank VS TFIDF; metric: MRR; adjR2 = 0.56.