Mayank Singh Abhishek Niranjan Divyansh Gupta Nikhil …

Citation sentence reuse behavior of scientists: A case study onmassive bibliographic text dataset of computer science

Mayank SinghDept. of Computer Science and Engg.

IIT Kharagpur, [email protected]

Abhishek NiranjanDept. of Computer Science and Engg.


Divyansh GuptaDept. of Computer Science and Engg.


Nikhil Angad BakshiDept. of Mechanical Engg.


Animesh MukherjeeDept. of Computer Science and Engg.


Pawan GoyalDept. of Computer Science and Engg.


ABSTRACTOur current knowledge of scholarly plagiarism is largely basedon the similarity between full text research articles. In this paper,we propose an innovative and novel conceptualization of schol-arly plagiarism in the form of reuse of explicit citation sentencesin scienti�c research articles. Note that while full-text plagiarismis an indicator of a gross-level behavior, copying of citation sen-tences is a more nuanced micro-scale phenomenon observed evenfor well-known researchers. �e current work poses several in-teresting questions and a�empts to answer them by empiricallyinvestigating a large bibliographic text dataset from computer sci-ence containing millions of lines of citation sentences. In particular,we report evidences of massive copying behavior. We also presentseveral striking real examples throughout the paper to showcasewidespread adoption of this undesirable practice. In contrast to thepopular perception, we �nd that copying tendency increases as anauthor matures. �e copying behavior is reported to exist in all�elds of computer science; however, the theoretical �elds indicatemore copying than the applied �elds.

CCS CONCEPTS•Information systems →Near-duplicate and plagiarism de-tection;

KEYWORDSCitation context, plagiarism, text reuseACM Reference format:Mayank Singh, Abhishek Niranjan, Divyansh Gupta, Nikhil Angad Bakshi,Animesh Mukherjee, and Pawan Goyal. 2017. Citation sentence reusebehavior of scientists: A case study onmassive bibliographic text dataset of computer science. In Proceedings of�e ACM/IEEE-CS Joint Conference on Digital Libraries, Toronto, Ontario,Canada, June 2017 (JCDL ’17), 5 pages.DOI: 10.475/1234

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor pro�t or commercial advantage and that copies bear this notice and the full citationon the �rst page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permi�ed. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior speci�c permission and/or afee. Request permissions from [email protected] ’17, Toronto, Ontario, Canada© 2016 ACM. 123-4567-24-567/08/06. . .$15.00DOI: 10.475/1234

1 INTRODUCTIONScholarly text “reuse” detection is a well known problem. It has re-ceived even more a�ention in the past decade due to overwhelmingincrease in the literature volume and ever increasing cases of pla-giarism [9]. Traditionally, the focus has always been in the analysisof full text of the research articles. A highly celebrated work oncopy detection by Brin et al. [5] proposed a working system, calledCOPS, that detects copies (complete or partial) of research articles.�ey also proposed algorithms and metrics required for evaluatingdetection mechanisms. Bao et al. [3] applied Semantic SequenceKin for document copy detection. Zu et al. [17] tried to detect pla-giarism if references are not given. �ey develop a taxonomy ofplagiarism delicts along with features for the quanti�cation of styleaspects. Recently, Citron et al. [7] described three classes of textreuse. �ey studied text reuse via a systematic pairwise comparisonof the text content of all articles submi�ed to arXiv.org between1991 – 2012. �ey report that in some countries 15% of submissionsare detected as containing duplicated material. Lesk [10] presentedscope of plagiarism within arXiv. �ey concluded that arXiv is nowidentifying the papers that have substantial overlap and is waitingto see if that a�ects the submission.

In recent past, numerous cases of widespread plagiarism havebeen detected leading to severe consequences. For example, Profes-sor Ma�hew Whitaker from Arizona State University was made toresign a�er a series of plagiarism controversies1 . Professor ShahidAzam from University of Regina has been accused for plagiariz-ing his own student master’s thesis [2]. Leading to an even moredisastrous consequence, a city court in India brie�y sent formervice-chancellor to jail on allegations of plagiarism [2].

Plagiarism detection is considered computationally demandingas majority of proposed techniques rely on similarity between fulltext research articles. In this paper, we propose a more nuancedmicro-scale copying phenomenon by examining crowdsourced datagenerated in the form of explicit citation sentences. In contrast tofull-text plagiarism that exhibits a gross-level behavior, we presenta �rst a�empt to understand the copying of citation sentencesthat corresponds to a more nuanced micro-scale phenomenon. Webelieve that researchers should be capable to describe previous liter-ature without plagiarizing. To improve the quality and innovation,scienti�c community should strongly discourage such activities.

1Wikipedia article: h�ps://en.wikipedia.org/wiki/Ma�hew C. Whitaker#Controversy

arX

iv:1

705.

0249

9v1

[cs

.DL

] 6

May

201

7

JCDL ’17, June 2017, Toronto, Ontario, Canada Singh et al.

2 DATASETIn this paper, we use two computer science datasets crawled fromMicroso� Academic Search (MAS)2. �e �rst dataset [6] consists ofbibliographic information (the title, the abstract, the keywords, itsauthor(s), the year of publication, the publication venue, and thereferences) of more than 2.4 million papers published between 1859– 2012. �e second dataset [14] consists of more than 26 millioncitation sentences present all across the computer science articlespublished in the same time window. �e scripts and processed datais available online3 for download.

3 CITATION SENTENCES AND SIMILARITY�roughout this paper we use the terms ‘citation sentence’ and‘citation context’ interchangeably. If paper P refers to paper C ,then P is termed the citing paper while C is termed the cited paper.Given paper P , we consider those sentences as citation sentences(CS ) that explicitly cite the previous paper C . Note that, P canrefer to C at many places in the text leading to multiple CS for thesame cited-citer pair. We process raw CS by replacing all citationplaceholders (reference indexes, author names plus year etc.) witha single word “CITATION”. We have been successful in replacing16 di�erent citation placeholder formats identi�ed by Singh etal. [13]. A representative citation context from our dataset where [1]cites [8], before and a�er pre-processing is as follows:Before: Recommender systems are a personalized information �lter-ing technology [4], designed to assist users in locating items of interestby providing useful recommendations.A�er: Recommender systems are a personalized information �lteringtechnology CITATION, designed to assist users in locating items ofinterest by providing useful recommendations.

Similarity computation: �is study massively relies on similarityscores between two citation contexts. �erefore, we utilize vectorspace model to compute similarity scores using tf-idf weightingscheme. We construct vocabulary from rich scienti�c text presentin the second dataset. Due to computation complexity associatedwith similarity computations, out of ∼1.5 million tokens, we onlyconsider top 100,000 frequently occurring tokens. For each pair ofCS vectors, similarity scores are generated using standard cosinesimilarity metric (CosSim). We employ python’s machine learninglibrary scikit-learn4 for all the computations.

4 MOTIVATIONAL STATISTICSWe begin this work by examining the most intriguing and trivialquestion – Whether CS are really copied and to what extent? Tomotivate the reader, we present a representative example of exten-sive copying from our dataset. Later in this section, we demonstratethat copying behavior is not a rare phenomenon; in contrast, itseems to have become a widespread convention.A representative example: We found a large number of articlesin our dataset whose incoming CS were partially or completelycopied. In particular, we found cases where a paper receives anexact copy of CS from di�erent citing papers. As a representativeexample, we found �ve copies of the citation context – “We do not

2h�p://academic.research.microso�.com3h�ps://tinyurl.com/kzv4lhg4h�p://scikit-learn.org/stable/

aim at formalizing some speci�c kind of state diagram, which hasalready successfully been done, see CITATION for example”. Here“CITATION” placeholder consists of seven cited papers.Overall copying behavior: Further, we a�empted to understandthe overall CS copying behavior. To start with, we randomly se-lected 24,800 papers. For each paper p, we compute pairwise cosinesimilarity between the incoming CS . Figure 1 presents the distri-bution of similarity scores. As expected, most CS pairs have verylow similarity scores (CosSim ≤ 0.2). However, we also observesigni�cant number of pairs having high similarity (CosSim ≥ 0.8).Most surprisingly, we found a sharp peak at CosSim = 1; highlight-ing a very interesting observation that manyCS are directly copiedwithout any change. We also found papers where CS are copiedfrom multiple articles.Overall, we found ∼ 26 thousand articles that consists of at leastone citation context di�o copied (CosSim = 1) from another paper.We have used a strict metric for this; in speci�c, we concatenate themultiple instances for CS between the same pair of papers so thatevery pair has a unique citation context. Among this set, we �nd148 articles each having at least �ve CS , with ≥ 60% of CS beingexact copy from previously published papers.

0.0 0.2 0.4 0.6 0.8 1.0

Cosine Similarity

10−4

10−3

10−2

10−1

100

101

102

Fre

quen

cy(n

orm

aliz

edlo

gsc

ale)

Figure 1: Cosine similarity score distribution for CS pairs.

�ese initial statistics o�er compelling evidence to study the phe-nomenon of copying CS in-depth. In the next section, we presentextensive empirical analysis to understand the e�ect and propertiesof this interesting albeit undesirable phenomenon.

5 LARGE SCALE EMPIRICAL STUDYIn this section, we pose several interesting questions to be�er un-derstand the characteristics of this micro-scale copying behavior.To answer these questions, we conduct in-depth empirical analysison the computer science dataset described in section 2. Note thatdue to associated computation complexities in processing largetext data, we perform individual experiments on smaller samples5of full dataset. For be�er interpretation, the posed questions aregrouped into three categories: 1) Paperwise, 2) Authorwise, and3) Fieldwise. We consider copying if pairwise cosine similaritybetween incoming CS is higher than 0.8.

5.1 Paperwise analysis5.1.1 Does publication age impact copying behavior? In this sec-

tion, we a�empt to investigate the temporal nature of the copyingbehavior. To start with, we select top 500 cited papers (P ). For eachselected paper p ∈ P , we study its incoming CS (that describe p)within ∆t years a�er publication. For the current study, we use5We describe sample statistics in the beginning of each experiment.

Citation sentence reuse behavior of scientists: A case study onmassive bibliographic text dataset of computer science JCDL ’17, June 2017, Toronto, Ontario, Canada

∆t = 1, 2, 3, 4, 5. Evidence suggests that copying behavior changeswith the passage of time. More speci�cally, it �rst decreases andthen starts increasing over the time-period with the exception for∆t = 1. In each y ∈ ∆t , we compute the fraction of CS copied(denoted by fq ) for each of the paper q, which cites paper p ∈ P .We compute the fraction of copied CS at ∆t year a�er publicationby F (∆t) =

∑nq=1 fqn . For each ∆t , we compare similarity between

CS generated in the year ∆t with all the CS generated between∆(t = 0) to ∆(t − 1). We �nd the value of average fraction of copiedCS = 8.33% in the �rst year a�er publication. In the next four suc-cessive years, the fraction varies as follows, 8.97%, 8.02%, 8.52%,9.12%, indicating that the copying behavior decreases �rst and thengradually increases.

5.1.2 Are there di�erences in the cited versus the non-cited copy-ing? Next, we divide copied CS into two subsets, namely, i) cited(CC), and ii) not cited (NC). CC consists of copied CS where thesource paper (from which context is being copied) is cited, whereasNC consists of copied CS where the source paper is not cited. Fig-ure 2a reports proportion of copied CS into these subsets at �vetime periods a�er publication. To our surprise, we �nd that frac-tion of two subsets remains same over the years. Fraction of NC issigni�cantly lower than fraction of CC.

1 2 3 4 50.0

0.2

0.4

0.6

0.8

1.0

Fra

ctio

nof

copi

edCS

(a)

Cited Not cited

1 2 3 4 50.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Years after publication (∆t) (b)

Self citedNon-self citedNot cited

Figure 2: a) Fraction of copied CS in two subsets, namely,cited and not cited. �e fraction of two subsets remainsnearly the same over the years. b) Fraction of CS in threesubsets, namely, self cited, non-self cited and not cited. �efraction of self cited subset decline over the years leading toincrease in the fraction of non-self cited subset.

�e copy and cited behavior is a combination of two distinctforms of copying, namely, self copying and non-self copying. In selfcopying, an author copies her own older CS , whereas in non-selfcopying, an author copies CS wri�en by other authors. �erefore,we split CC subset into further two subsets, namely, i) self cited (SC),and ii) non-self cited (NSC). Figure 2b reports proportion of copiedCS into three subsets (SC , NSC and NC) at �ve time periods a�erpublication. �e single most striking observation from Figure 2b isthat fraction of SC is much higher thanNSC indicating that majorityof the CS are copied by an author in their own future publications.However, as authors’ interest shi�s from one topic to other, thefraction of SC declines resulting into increase in the fraction ofNSC.5.2 Authorwise analysis

5.2.1 Does an author’s increasing experience influence her copy-ing behavior? In this section, we present empirical results to provethat researchers in their early stage of academic career behave dif-ferently than a�er gaining experience. We begin this analysis by

selecting 7175 random authors that have at least 20 years of citationhistory. For each author, we compute fraction of copied CS frompreviously published papers. Figure 3a presents average fraction ofcopied CS over 20 years of the author life span. In contrast to thepopular perception, we observe that the copying tendency increasesas an author matures.

5.2.2 Does an author’s popularity influence his copying behav-ior? We observe that an author’s popularity plays a critical rolein in�uencing his copying pa�erns. For this study, we select au-thors with varying popularity but with similar academic age6. Weselect all authors that started their career from the year 1995. Wecompute the authors’ h-index (in 2012) to measure their individualpopularity. To be�er visualize the in�uencing behavior, we dividethe authors into three h-index buckets:

• Bucket 1: h-index < 5• Bucket 2: 5 ≤ h-index < 15• Bucket 3: h-index ≥ 15

Here Bucket 1 represents the least popular authors whereasBucket 3 consists of the most popular authors. We compute thefraction of copied CS for each author in each bucket and presentaggregated statistics. We observe that the most popular authors(Bucket 3) show maximum copying tendency. Whereas least pop-ular authors (Bucket 1) show least copying tendency. On average,2.07%CS of Bucket 1 are copied from previous papers. �is fractionincreases to 4.17% and 6.16% for Bucket 2 and Bucket 3 respec-tively. To investigate in more detail, we divide copiedCS into threesubsets as described in section 5.1.1. Figure 3b presents proportionfor three di�erent copying behaviors in three h-index buckets. Frac-tion of SC increases as h-index increases. Popular authors preferto copy their own CS while less popular authors try to copy CSwri�en by other authors. Figure 3b reports signi�cant proportionof NC in Bucket 1. We observe similar results if we consider theauthor’s increasing publication count (in place of h-index) on hercopying behavior.

0 5 10 15 20

Author publication age (years)

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Fra

ctio

nof

copi

edCS

(a)

TotalSelf citedNon-self citedNot cited

h-index< 5 5 ≤h-index< 15 h-index≥ 15

0.0

0.2

0.4

0.6

0.8

1.0

Fra

ctio

nof

copi

edCS

(b)

Self citedNon-self citedNot cited

Figure 3: a) Average fraction of copied CS over 20 years ofauthor life span. Copying tendency increases as author ma-tures. b) Proportion of three copying behaviors in three h-index buckets. Popular authors prefer to copy their ownCS .

5.3 Fieldwise AnalysisAs described in section 2, our dataset consists of papers from 24�elds of computer science. In this section, we a�empt to demon-strate the copying behavior in di�erent �elds of research. Werandomly select 20,000 research papers from 24 distinct �elds ofcomputer science. �e distribution of CS in each �eld is shown inFigure 5a. Two interesting questions that require �eldwise analysisare:6In order to keep the author experience same.

JCDL ’17, June 2017, Toronto, Ontario, Canada Singh et al.

CE HCI SM BCB PLH&

AA&

T SE GPRTSE OS AI SC

NLP CV CN MMD&

PCWWW IR DB ML

S&P DM

0.0

0.2

0.4

0.6

0.8

1.0

Fra

ctio

nof

copi

edC

S Self cited Non-self cited Not cited

Figure 4: Proportions of three copying behaviors for 24 computer science �elds. Majority of the copied contexts are originatingfrom self citations. Applied �elds have more tendency of copying and not citing than theoretical �elds.

5.3.1 Is paper-wise copying behavior di�erent in di�erent com-puter science fields? Similar to the experiment in section 5.1.1, foreach �eld, we divide copied CS into three categories, SC, NSC andNC. Figure 4 presents proportion of three categories for 24 com-puter science �elds. For all �elds, majority of the copied contextsare originating from self citations. Next major proportion goesto non-self citations. Overall, we observe that applied �elds havemore tendency of copying and not citing than theoretical �elds.

5.3.2 Is copying behavior same across all computer science fields?We perform this experiment along similar lines as the motivationalstudy (see section 4), except that now papers are divided into 24computer science �elds7. Figure 5b presents cosine similarity distri-bution for two representative �elds. We were surprised to �nd cleardemarcation between theoretical �elds8 like, Algorithms & �eoryetc., and applied �elds like, Computer Networks etc. Even though,for small values of CosSim , all �elds show similar behavior, forhigher CosSim(≥ 0.8) values (see inset Figure 5), theoretical �eldsshow higher copying behavior as compared to applied �elds. Notethat, a sharp peak at CosSim = 1 is observed across all �elds.

0.0 0.2 0.4 0.6 0.8 1.0

Cosine Similarity

10−3

10−2

10−1

100

101

102

Fre

quen

cy(n

orm

aliz

edlo

gsc

ale)

(b)Algorithms and TheoryComputer Networks

Field distribution

CEA&TDM

AI

SMRTSE

S&CMM

D&PC

PL

CNS&P

GPWWW

NLP

OS

ML

HCI

H&ABSB

CVIR

SEDB(a)

Figure 5: Fieldwise analysis: a) Distribution of CS sampledfrom 24 �elds. b) Comparison between cosine similaritydistribution of two representative computer science �elds.Inset shows signi�cant di�erence between Algorithms &�eory (theoretical) and Computer Networks (applied) forhigher CosSim(≥ 0.8) values.

6 CONCLUSIONS�is paper has investigatedmicro-scale phenomenon of text reuse inscienti�c articles. �roughout the paper, we pose several interestingresearch questions and present an in-depth empirical analysis toanswer them. We have provided further evidence that the theoritical�elds indicate more copying than the applied �elds. Finally, anumber of potential limitations need to be considered. First, thecurrent study employs a computer science dataset only. In the7h�p://tinyurl.com/n2rwkbs8h�ps://en.wikipedia.org/wiki/Computer science

future, we plan to extend this study to other research �elds aswell. Second, cosine similarity metric may be too simplistic forthis complex study. To further our research, we plan to employword embeddings for more meaningful similarity computations.On account of the fact that the current work is only a preliminarya�empt to understand micro-scale copying phenomenon of citationsentences, future extensions could possibly lead to rescaling ofseveral popularity metrics such as h-index, impact factor etc. asnumber of times a paper has been cited may not be a true metricfor impact of a paper or researchers in the research community.Alternatively, if this behavior is kept in check the quality of researchcan be expected to improve and the current popularity metrics willconform to the intuition behind their meaning.

REFERENCES[1] Ioannis Arapakis, Yashar Moshfeghi, Hideo Joho, Reede Ren, David Hannah, and

Joemon M Jose. 2009. Integrating facial expressions into user pro�ling for theimprovement of a multimodal recommender system. In Multimedia and Expo,2009. ICME 2009. IEEE International Conference on. IEEE, 1440–1443.

[2] Jonathan Bailey. 2014. University of Regina prof investigated for allegedly plagia-rizing student’s work. h�p://www.ithenticate.com/plagiarism-detection-blog/top-plagiarism-scandals-2014. (2014). [Online; accessed 24-January-2017].

[3] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang.2004. Semantic sequence kin: A method of document copy detection. InAdvancesin Knowledge Discovery and Data Mining. Springer, 529–538.

[4] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, and Qin-Bao Song. 2003. A survey onnatural language text copy detection. Journal of so�ware 14, 10 (2003), 1753–1760.

[5] Sergey Brin, James Davis, and Hector Garcia-Molina. 1995. Copy detectionmechanisms for digital documents. In ACM SIGMOD Record, Vol. 24. ACM, 398–409.

[6] Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, andAnimesh Mukherjee. 2014. Towards a Strati�ed Learning Approach to PredictFuture Citation Counts. In Proceedings of the 14th ACM/IEEE-CS Joint Conferenceon Digital Libraries (JCDL ’14). IEEE Press, 351–360.

[7] Daniel T Citron and Paul Ginsparg. 2015. Pa�erns of text reuse in a scienti�ccorpus. Proceedings of the National Academy of Sciences 112, 1 (2015), 25–30.

[8] Eui-Hong Sam Han and George Karypis. 2005. Feature-based recommendationsystem. In Proceedings of the 14th ACM international conference on Informationand knowledge management. ACM, 446–452.

[9] Guy Judge and others. 2008. Plagiarism: Bringing Economics and EducationTogether (With a Li�le Help from IT). Computers in Higher Education EconomicsReview 20, 1 (2008), 21–26.

[10] Michael Lesk. 2015. How many scienti�c papers are not original? Proceedings ofthe National Academy of Sciences 112, 1 (2015), 6–7.

[11] Noboru Nakanishi and Izumi Ojima. 1999. Notes on Unfair Papers by Mebarki etal. on“�antum Nonsymmetric Gravity”. arXiv preprint hep-th/9912039 (1999).

[12] Dong-Chul Park and Young-June Woo. 2001. Weighted centroid neural networkfor edge preserving image compression. IEEE transactions on neural networks 12,5 (2001), 1134–1146.

[13] Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Sa-tapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi,Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A Robust FrameworkFor Information Extraction from Scholarly Articles. In Proceedings of COLING2016, the 26th International Conference on Computational Linguistics: TechnicalPapers. �e COLING 2016 Organizing Commi�ee, Osaka, Japan, 3390–3400.h�p://aclweb.org/anthology/C16-1320

[14] Mayank Singh, Vikas Patidar, Suhansanu Kumar, Tanmoy Chakraborty, AnimeshMukherjee, and Pawan Goyal. 2015. �e Role Of Citation Context In Predicting

http://www.ithenticate.com/plagiarism-detection-blog/top-plagiarism-scandals-2014

http://www.ithenticate.com/plagiarism-detection-blog/top-plagiarism-scandals-2014

http://aclweb.org/anthology/C16-1320

Citation sentence reuse behavior of scientists: A case study onmassive bibliographic text dataset of computer science JCDL ’17, June 2017, Toronto, Ontario, Canada

Long-Term Citation Pro�les: An Experimental Study Based On A Massive Biblio-graphic Text Dataset. In Proceedings of the 24th ACM International on Conferenceon Information and Knowledge Management. ACM, 1271–1280.

[15] Ma�hew C Whitaker. 2005. Race work: �e rise of civil rights in the urban West.U of Nebraska Press.

[16] Ma�hew C Whitaker. 2014. Peace be Still: Modern Black America from World WarII to Barack Obama. U of Nebraska Press.

[17] Sven Meyer Zu Eissen and Benno Stein. 2006. Intrinsic plagiarism detection. InAdvances in Information Retrieval. Springer, 565–569.

Mayank Singh Abhishek Niranjan Divyansh Gupta Nikhil …

Documents

Transcript of Mayank Singh Abhishek Niranjan Divyansh Gupta Nikhil …