Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss...

13
Detecting Near Duplicates in Software Documentation D.V. Luciv 1 , D.V. Koznov 1 , G.A. Chernishev 1 , A.N. Terekhov 1 1 Saint Petersburg State University 199034 St. Petersburg, Universitetskaya Emb., 7โ€“9 E-mail: {d.lutsiv, d.koznov, g.chernyshev, a.terekhov}@spbu.ru 2017-11-15 Abstract Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of โ€œnear duplicateโ€ fragments, i.e. chunks of text that were copied from a single source and were later modi๏ฌed in di๏ฌ€erent ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to detect manually due to their fuzzy nature. In this paper we give a formal de๏ฌnition of near duplicates and present an algorithm for their detec- tion in software documents. This algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact dupli- cates in documents. Then, our algorithm uses these exact duplicates to construct near ones. We evaluate the proposed algorithm using the documentation of 19 open source and com- mercial projects. Our evaluation is very comprehensive โ€” it covers various documentation types: design and requirement speci๏ฌcations, programming guides and API documenta- tion, user manuals. Overall, the evaluation shows that all kinds of software documentation contain a signi๏ฌcant number of both exact and near duplicates. Next, we report on the performed manual analysis of the detected near duplicates for the Linux Kernel Documen- tation. We present both quantative and qualitative results of this analysis, demonstrate algorithm strengths and weaknesses, and discuss the bene๏ฌts of duplicate management in software documents. Keywords: software documentation, near duplicates, documentation reuse, software clone detection. 1 Introduction Every year software is becoming increasingly more complex and extensive, and so does soft- ware documentation. During the software life cycle documentation tends to accumulate a lot of duplicates due to the copy and paste pat- tern. At ๏ฌrst, some text fragment is copied several times, then each copy is modi๏ฌed, pos- sibly in its own way. Thus, di๏ฌ€erent copies of This work is partially supported by RFBR grant No 16-01-00304. initially similar fragments become โ€œnear dupli- catesโ€. Depending on the document type [1], duplicates can be either desired or not, but in any case duplicates increase documentation complexity and thus, maintenance and author- ing costs [2]. Textual duplicates in software documenta- tion, both exact and near ones, are extensively studied [2, 3, 4, 5]. However, there are no meth- ods for detection of near duplicates, only for exact ones and mainly using software clone de- tection techniques [2, 3, 6]. At the same time Juergens et al. [2] indicates the importance of near duplicates and recommends to โ€œpay par- arXiv:1711.04705v1 [cs.SE] 13 Nov 2017

Transcript of Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss...

Page 1: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation

D.V. Luciv1, D.V. Koznov1, G.A. Chernishev1, A.N. Terekhov1

1 Saint Petersburg State University199034 St. Petersburg, Universitetskaya Emb., 7โ€“9

E-mail: {d.lutsiv, d.koznov, g.chernyshev, a.terekhov}@spbu.ru

2017-11-15

Abstract

Contemporary software documentation is as complicated as the software itself. During itslifecycle, the documentation accumulates a lot of โ€œnear duplicateโ€ fragments, i.e. chunks oftext that were copied from a single source and were later modified in different ways. Suchnear duplicates decrease documentation quality and thus hamper its further utilization. Atthe same time, they are hard to detect manually due to their fuzzy nature. In this paperwe give a formal definition of near duplicates and present an algorithm for their detec-tion in software documents. This algorithm is based on the exact software clone detectionapproach: the software clone detection tool Clone Miner was adapted to detect exact dupli-cates in documents. Then, our algorithm uses these exact duplicates to construct near ones.We evaluate the proposed algorithm using the documentation of 19 open source and com-mercial projects. Our evaluation is very comprehensive โ€” it covers various documentationtypes: design and requirement specifications, programming guides and API documenta-tion, user manuals. Overall, the evaluation shows that all kinds of software documentationcontain a significant number of both exact and near duplicates. Next, we report on theperformed manual analysis of the detected near duplicates for the Linux Kernel Documen-tation. We present both quantative and qualitative results of this analysis, demonstratealgorithm strengths and weaknesses, and discuss the benefits of duplicate management insoftware documents.

Keywords: software documentation, near duplicates, documentation reuse, software clonedetection.

1 IntroductionEvery year software is becoming increasinglymore complex and extensive, and so does soft-ware documentation. During the software lifecycle documentation tends to accumulate a lotof duplicates due to the copy and paste pat-tern. At first, some text fragment is copiedseveral times, then each copy is modified, pos-sibly in its own way. Thus, different copies of

This work is partially supported by RFBR grantNo 16-01-00304.

initially similar fragments become โ€œnear dupli-catesโ€. Depending on the document type [1],duplicates can be either desired or not, butin any case duplicates increase documentationcomplexity and thus, maintenance and author-ing costs [2].

Textual duplicates in software documenta-tion, both exact and near ones, are extensivelystudied [2, 3, 4, 5]. However, there are no meth-ods for detection of near duplicates, only forexact ones and mainly using software clone de-tection techniques [2, 3, 6]. At the same timeJuergens et al. [2] indicates the importance ofnear duplicates and recommends to โ€œpay par-

arX

iv:1

711.

0470

5v1

[cs

.SE

] 1

3 N

ov 2

017

Page 2: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

2 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

ticular attention to subtle differences in dupli-cated textโ€. However, there are studies whichaddressed near duplicates, for example in [7]the authors developed a duplicate specificationtool for JavaDoc documentation. This tool al-lows user to specify near duplicates and ma-nipulate them. Their approach was based onan informal definition of near duplicates andthe problem of duplicate detection was not ad-dressed. In our previous studies [8, 9] we pre-sented a near duplicate detection approach. Itscore idea is to uncover near duplicates and thento apply the reuse techniques described in ourearlier studies [4, 5]. Clone detection tool CloneMiner [10] was adapted for detection of exactduplicates in documents, then near duplicateswere extracted as combinations of exact du-plicates. However, only near duplicates withone variation point were considered. In otherwords, the approach can detect only near dupli-cates that consist of two exact duplicates witha single chunk of variable text between them:๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก1 ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’1 ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก2.

In this paper we give the formal def-inition of near duplicates with an ar-bitrary number of variation points,exhibiting the following pattern:๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก1 ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’1 ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก2 ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’2 ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก3 . . .๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’๐‘›โˆ’1 ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก๐‘›. Our definition is theformalized version of the definition givenin the reference [11]. We also present ageneralization of the algorithm described in[8, 9]. The algorithm is implemented in theDocumentation Refactoring Toolkit [12], whichis a part of the DocLine project [4]. In thispaper, an evaluation of the proposed algorithmis also presented. The documentation of 19open source and commercial projects is used.The results of the detailed manual analysisof the detected near duplicates for the LinuxKernel Documentation [13] are reported.

The paper is structured as follows. Sec-tion 1 provides a survey of duplicate manage-ment for software documentation and gives abrief overview of near duplicate detection meth-ods in information retrieval, theoretical com-puter science, and software clone detection ar-eas. Section 2 describes the context of thisstudy by examining our previous work to un-

derline the contribution of the current paper.In Section 3 the formal definition of the nearduplicate is given and the near duplicate de-tection algorithm is presented. Also, the algo-rithm correctness theorem is formulated. Fi-nally, Section 4 presents evaluation results.

2 Related work

Let us consider how near duplicates are em-ployed in documentation-oriented software en-gineering research. Horie et al. [14] considerthe problem of text fragment duplicates in JavaAPI documentation. The authors introduce anotion of crosscutting concern, which is essen-tially a textual duplicate appearing in docu-mentation. The authors present a tool namedCommentWeaver, which provides several mech-anisms for modularization of the API documen-tation. It is implemented as an extention ofJavadoc tool, and provides new tags for con-trolling reusable text fragments. However, nearduplicates are not considered, facilities for du-plicate detection are not provided.

Nosal and Poruban [7] extend the ap-proach from [14] by introducing near dupli-cates. In this study the notion of documen-tation phrase is used to denote the near dupli-cate. Parametrization is used to define varia-tive parts of duplicates, similarly to our ap-proach [4, 5]. However, the authors left theproblem of near duplicate detection untouched.

In [3] Nosal and Poruban present the resultsof a case study in which they searched for exactduplicates in internal documentation (sourcecode comments) of an open source project set.They used a modified copy/paste detectiontool, which was originally developed for codeanalysis and found considerable number of textduplicates. However, near duplicates were notconsidered in this paper.

Wingkvist et al. adapted a clone detectiontool to measure the document uniqueness ina collection [6]. The authors used found du-plicates for documentation quality estimation.However, they did not address near duplicatedetection.

The work of Juergens et al. [2] is the closestone to our research and presents a case study

Page 3: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 3

for analyzing redundancy in requirement spec-ifications. The authors analyze 28 industrialdocuments. At the first step, they found dupli-cates using a clone detection tool. Then, theauthors filtered the found duplicates by man-ually removing false positives and performed aclassification of the results. They report thatthe average duplicate coverage of documentsthey analyzed is 13.6%: some documents havea low coverage (0.9%, 0.7% and even 0%), butthere are ones that have a high coverage (35%,51.1%, 71.6%). Next, the authors discuss howto use discovered duplicates and how to de-tect related duplicates in the source code. Theimpact of duplicates on the document readingprocess is also studied. Furthermore, the au-thors propose a classification of meaningful du-plicates and false positive duplicates. However,it should be noted that they consider only re-quirement specifications and ignore other kindsof software documentation. Also, they do notuse near duplicates.

Rago et al. [15] apply natural language pro-cessing and machine learning techniques to theproblem of searching duplicate functionality inrequirement specifications. These documentsare considered as a set of textual use cases; theapproach extracts sequence chain (usage sce-narios) for every use case and compares thepairs of chains to find duplicate subchains. Theauthors evaluate their approach using severalindustrial requirement specifications. It shouldbe noted that this study considers a very spe-cial type of requirement specifications, whichare not widely used in industry. Near dupli-cates are also not considered.

Algorithms for duplicate detection in textualcontent have been developed in several other ar-eas. Firstly, the information retrieval commu-nity considered several near-duplicate detectionproblems: document similarity search [16, 17],(local) text reuse detection [18, 19], templatedetection in web document collections [20, 21].Secondly, the plagiarism detection communityalso extensively studied the detection of simi-larity between documents [22, 23]. Thus, thereare a number of efficient solutions. However,the majority of them are focused on document-level similarity, with the goal of attaining high

performance on the large collections of docu-ments. More importantly, these studies dif-fer from ours by ignoring duplicate meaning-fulness. Nonetheless, adapting these methodsfor the near-duplicate search in software docu-mentation could improve the quality metrics ofthe search. It is a promising avenue for furtherstudies.

The theoretical computer science communityalso addressed the duplicate detection problem.However, the primary focus of this communitywas the development of (an approximate) stringmatching algorithms. For example, in order tomatch two text fragments of an equal length,the Hamming distance was used [24]. The Lev-enshtein distance [25] is employed to accountnot only for symbol modification, but also forsymbol insertion and removal. For this metricit is possible to handle all the three editing op-erations at the cost of performance [26]. Signif-icant performance optimizations to string pat-tern matching using the Levenshtein distancewere applied in the fuzzy Bitap algorithm [27]which was later optimized to handle longer pat-terns efficiently [28]. Similarity preserving sig-natures like MinHash [29] can be used to speedup pattern matching for approximate matchingproblem. Another indexing approach is pro-posed in [30], with an algorithm for the fastdetection of documents that share a commonsliding window with the query pattern but dif-fer by at most ๐œ tokens. While being useful forour goals in general, these studies do not re-late to this paper directly. All of these studiesare focused on the performance improvement ofsimple string matching tasks. Achieving highperformance is undoubtedly an important as-pect, but these algorithms become less neces-sary when employed for documents of 3โ€“4 MBsof plain text โ€” a common size of a large indus-trial documentation file.

Various techniques have been employed todetect near duplicate clones in a source code.SourcererCC [31] detects near duplicates ofcode blocks using a static bag-of-tokens strat-egy that is resilient to minor differences be-tween code blocks. Clone candidates of a codeblock are queried from a partial inverted indexfor better scalability. DECKARD [32] com-

Page 4: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

4 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

putes certain characteristic vectors of code toapproximate the structure of Abstract SyntaxTrees in Euclidean space. Locality sensitivehashing (LSH) [33] is used to group similar vec-tors with the Euclidean distance metric, form-ing clones. NICAD [34] is a text-based near du-plicate detection tool that also uses tree-basedstructural analysis with a lightweight parsing ofthe source code to implement flexible pretty-printing, code normalization, source transfor-mation and code filtering for better results.However, these techniques are not directly ca-pable of detecting duplicates in text documentsas they involve some degree of parsing of theunderlying source code for duplicate detection.Suitable customization for this purpose can beexplored in the future.

3 Background

3.1 Exact duplicate detection andClone Miner

Not only documentation, but also software it-self is often developed with a lot of copy/pastedinformation. To cope with duplicates in thesource code, software clone detection methodsare used. This area is quite mature; a sys-tematic review of clone detection methods andtools can be found in [35]. In this paper, theClone Miner [10] software clone detection toolis used to detect exact duplicates in softwaredocumentation. Clone Miner is a token-basedsource code clone detector. A token in the con-text of text documents is a single word sepa-rated from other words by some separator: โ€˜.โ€™,โ€˜(โ€™, โ€˜)โ€™, etc. For example, the following textfragment consists of 2 tokens: โ€œFM registersโ€.Clone Miner considers input text as an orderedcollection of lexical tokens and applies suffixarray-based string matching algorithms [36] toretrieve the repeated parts (clones). In thisstudy we use the Clone Miner tool. We haveselected it for its simplicity and its ability tobe easily integrated with other tools using acommand line interface.

3.2 Basic near duplicate detec-tion and refactoring

We have already demonstrated in Section 1that near duplicate detection in software doc-umentation is an important research avenue.In our previous studies [8, 9], we presentedan approach that offers a partial solution forthis problem. At first, similarly to Juergenset al. [2], Wingkvist et al. [6], we applied soft-ware clone detection techniques to exact du-plicate detection [37]. Then, in [8, 9] we pro-posed an approach to near duplicate detection.It is essentially as follows: having exact dupli-cates found by Clone Miner, we extract setsof duplicate groups where clones are locatedclose to each other. For example, suppose thatthe following phrase can be found in the text5 times with different variations (various portnumbers): โ€œinet daemon can listen on ... portand then transfer the connection to appropri-ate handlerโ€. In this case we have two duplicategroups with 5 clones in each group: one groupincludes the text โ€œinet daemon can listen onโ€,while the other includes โ€œport and then trans-fer the connection to appropriate handlerโ€. Wecombine these duplicate groups into a groupof near duplicates: every member of this grouphas one variation to capture different port num-bers. In these studies we developed this ap-proach only for one variation, i.e. we โ€œgluedโ€only pairs of exact duplicate groups.

Then it is possible to apply the adaptivereuse technique [38, 39] to near duplicatesand to perform automatic refactoring of doc-umentation. Our toolkit [12] allows to createreusable text fragment definition templates forthe selected near duplicate group. Then it ispossible to substitute all the occurrences of thegroupโ€™s members with parameterized referencesto definition. The overall scheme of the processis shown in Fig. 1.

The experiments with the described ap-proach showed a considerable number of inter-esting near duplicates. Thus, we decided togeneralize the algorithm from [8, 9] in orderto allow an arbitrary number of exact near-duplicates combinations. This will allow tohave several variation parts in the resulting

Page 5: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 5

Inputdocument

Preparation forclone detection

Clone detection(Clone Miner)

FilteringRefactoringModified

document

Document clonedetection

Figure 1: The process of near duplicate search

near duplicates. The generalized algorithm isdescribed below.

4 Near duplicate detectionalgorithm

4.1 Definitions

Let us define the terms necessary for describ-ing the proposed algorithm. We consider docu-ment ๐ท as a sequence of symbols. Any sym-bol of ๐ท has a coordinate corresponding toits offset from the beginning of the document,and this coordinate is a number belonging to[1, ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐ท)] interval, where ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐ท) is thenumber of symbols in ๐ท.

Definition 1. For ๐ท we define a text frag-ment as an occurrence of some text substringin ๐ท. Hence, each text fragment has a corre-sponding integer interval [๐‘, ๐‘’], where ๐‘ is thecoordinate of its first symbol and ๐‘’ is the coor-dinate of its last symbol. For text fragment ๐‘” ofdocument ๐ท, we say that ๐‘” โˆˆ ๐ท.

Let us introduce the following sets: ๐ท* is aset of all text fragments of ๐ท, ๐ผ๐ท is a set of allinteger intervals within interval [1, ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐ท)],๐‘†๐ท is a set of all strings of ๐ท.

Also, let us introduce the following notations:

โ€ข [๐‘”] : ๐ท* โ†’ ๐ผ๐ท is a function that takes textfragment ๐‘” and returns its interval.

โ€ข ๐‘ ๐‘ก๐‘Ÿ(๐‘”) : ๐ท* โ†’ ๐‘†๐ท is a function that takestext fragment ๐‘” and returns its text.

โ€ข ๐ผ : ๐ผ๐ท โ†’ ๐ท* is a function that takes inter-val ๐ผ and returns corresponding text frag-ment.

โ€ข |[๐‘, ๐‘’]| : ๐ผ๐ท โ†’ [0, ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐ท)] is a functionthat takes interval [๐‘”] = [๐‘, ๐‘’] and returns

its length as |[๐‘”]| = ๐‘’โˆ’๐‘+1. For simplicity,we will use |๐‘”| notion instead of |[๐‘”]|.

โ€ข For any ๐‘”1, ๐‘”2 โˆˆ ๐ท we consider their in-tersection ๐‘”1 โˆฉ ๐‘”2 as intersection of corre-sponding intervals [๐‘”1] โˆฉ [๐‘”2], and ๐‘”1 โŠ‚ ๐‘”2

implies [๐‘”1] โŠ‚ [๐‘”2].โ€ข We define the binary predicate ๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’

on ๐ท* ร— ๐ท*, which is true for text frag-ments ๐‘”1, ๐‘”2 โˆˆ ๐ท, iff ๐‘’1 < ๐‘2, when [๐‘”1] =[๐‘1, ๐‘’1], [๐‘”2] = [๐‘2, ๐‘’2].

Definition 2. Let us consider a set ๐บ of textfragments of ๐ท such that โˆ€๐‘”1, ๐‘”2 โˆˆ ๐บ (๐‘ ๐‘ก๐‘Ÿ(๐‘”1) =๐‘ ๐‘ก๐‘Ÿ(๐‘”2)) โˆง (๐‘”1 โˆฉ ๐‘”2 = โˆ…). We name those frag-ments as exact duplicates and ๐บ as exactduplicate group or exact group. We alsodenote number of elements in ๐บ as #๐บ.

Definition 3. For ordered set of exact dupli-cate groups ๐บ1, . . . , ๐บ๐‘ , we say that it formsvariational group โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ when the fol-lowing conditions are satisfied:

1. #๐บ1 = . . . = #๐บ๐‘ .

2. Text fragments having similar positions indifferent groups, occur in the same order indocument text: โˆ€๐‘”๐‘˜๐‘– โˆˆ ๐บ๐‘– โˆ€๐‘”๐‘˜๐‘— โˆˆ ๐บ๐‘— ((๐‘– <๐‘—)โ‡” ๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’(๐‘”๐‘˜๐‘– , ๐‘”

๐‘˜๐‘— )), and

โˆ€ ๐‘˜ โˆˆ {1, . . . , ๐‘ โˆ’ 1}๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’(๐‘”๐‘˜๐‘ , ๐‘”๐‘˜+11 ).

We also say that for any ๐บ๐‘˜ of this set ๐บ๐‘˜ โˆˆ๐‘‰ ๐บ.

Note 1. According to condition 2 of defini-tion 3, โˆ€๐‘”๐‘˜๐‘– โˆˆ ๐บ๐‘–,โˆ€๐‘”๐‘˜๐‘— โˆˆ ๐บ๐‘—(๐‘– = ๐‘— โ‡’ ๐‘”๐‘˜๐‘– โˆฉ ๐‘”๐‘˜๐‘— =โˆ…).

Note 2. When ๐‘‰ ๐บ = โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ and๐‘‰ ๐บโ€ฒ = โŸจ๐บโ€ฒ

1, . . . , ๐บโ€ฒ๐‘€โŸฉ, are variational groups,

โŸจ๐‘‰ ๐บ, ๐‘‰ ๐บโ€ฒโŸฉ = โŸจ๐บ1, . . . , ๐บ๐‘ , ๐บโ€ฒ1, . . . , ๐บ

โ€ฒ๐‘€โŸฉ is also

a variational group in case when is satisfies def-inition 3.

For example, suppose that we have ๐‘‰ ๐บ =โŸจ๐บ1, ๐บ2, ๐บ3โŸฉ and each of ๐บ๐‘– consists of threeclones ๐‘”๐‘˜๐‘– , ๐‘– โˆˆ {1, 2, 3}. Then, theseclones appear in the text in the followingorder: ๐‘”11 . . . ๐‘”

12 . . . ๐‘”

13 . . . . . . ๐‘”

21 . . . ๐‘”

22 . . . ๐‘”

23 . . . . . .

๐‘”31 . . . ๐‘”32 . . . ๐‘”

33. Next, it should be possible to

compute the distance between variations or ex-act duplicate groups. This is required to sup-port group merging inside our algorithm which

Page 6: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

6 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

selects several closest groups to form a new one.Thus, a distance function should be defined.

Definition 4. Distance between text frag-ments for any ๐‘”1, ๐‘”2 โˆˆ ๐ท is defined as follows:

๐‘‘๐‘–๐‘ ๐‘ก(๐‘”1, ๐‘”2) =

โŽงโŽชโŽจโŽชโŽฉ0, ๐‘”1 โˆฉ ๐‘”2 = โˆ…,๐‘2 โˆ’ ๐‘’1 + 1, ๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’(๐‘”1, ๐‘”2),

๐‘1 โˆ’ ๐‘’2 + 1, ๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’(๐‘”2, ๐‘”1),

(1)where [๐‘”1] = [๐‘1, ๐‘’1] and [๐‘”2] = [๐‘2, ๐‘’2].

Definition 5. Distance between exactgroups ๐บ1 and ๐บ2, having #๐บ1 = #๐บ2, isdefined as follows:

๐‘‘๐‘–๐‘ ๐‘ก(๐บ1, ๐บ2) = max๐‘˜โˆˆ{1,...,#๐บ1}

๐‘‘๐‘–๐‘ ๐‘ก(๐‘”๐‘˜1 , ๐‘”๐‘˜2) (2)

Definition 6. Distance between varia-tional groups ๐‘‰ ๐บ1 and ๐‘‰ ๐บ2, when there are๐บ1 โˆˆ ๐‘‰ ๐บ1, ๐บ2 โˆˆ ๐‘‰ ๐บ2 : #๐บ1 = #๐บ2, is definedas follows:

๐‘‘๐‘–๐‘ ๐‘ก(๐‘‰ ๐บ1, ๐‘‰ ๐บ2) = max๐บ1โˆˆ๐‘‰ ๐บ1,๐บ2โˆˆ๐‘‰ ๐บ2

๐‘‘๐‘–๐‘ ๐‘ก(๐บ1,๐บ2)

(3)

Definition 7. Length of exact group ๐บ is

defined as follows: ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐บ) =#๐บโˆ‘๐‘˜=1

(๐‘’๐‘˜โˆ’๐‘๐‘˜+1),

where ๐‘”๐‘˜ โˆˆ ๐บ, [๐‘”๐‘˜] = [๐‘๐‘˜, ๐‘’๐‘˜].

Definition 8. Length of variational group๐‘‰ ๐บ = โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ is defined as follows:

๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐‘‰ ๐บ) =๐‘โˆ‘๐‘–=1

๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž(๐บ๐‘–) (4)

Definition 9. Near duplicate group is sucha variational group โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ that satisfiesfollowing condition for โˆ€๐‘˜ โˆˆ {1, . . . ,#๐บ1}:

๐‘โˆ’1โˆ‘๐‘–=1

๐‘‘๐‘–๐‘ ๐‘ก(๐‘”๐‘˜๐‘– ,๐‘”๐‘˜๐‘–+1) โ‰ค 0.15 *

๐‘โˆ‘๐‘–=1

|๐‘”๐‘˜๐‘– |. (5)

This definition is constructed according tothe near duplicate concept from [38]: varia-tional part of near duplicates with similar infor-mation (delta) should not exceed 15% of theirexact duplicate (archetype) part.

Note 3. An exact group ๐บ can be consideredas a variational one formed by itself: โŸจ๐บโŸฉ.

Definition 10. Consider near duplicate groupโŸจ๐บ1, ๐บ2โŸฉ, where ๐บ1 and ๐บ2 are exact groups.We assume that this group contains a singleextension point, and the text fragments con-tained in positions [๐‘’๐‘˜1 +1, ๐‘๐‘˜2โˆ’1] are called ex-tension point values. In the general case, anear duplicate group โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ has ๐‘ โˆ’ 1extension points.

Definition 11. Consider two near dupli-cate groups ๐บ = โŸจ๐บ1, . . . , ๐บ๐‘›โŸฉ and ๐บโ€ฒ =โŸจ๐บโ€ฒ

1, . . . ๐บโ€ฒ๐‘šโŸฉ. Suppose that they form a

variational group โŸจ๐บ1, . . . , ๐บ๐‘›, ๐บโ€ฒ1, . . . , ๐บ

โ€ฒ๐‘šโŸฉ or

โŸจ๐บ1, . . . , ๐บ๐‘›, ๐บโ€ฒ1, . . . , ๐บ

โ€ฒ๐‘šโŸฉ, which in turn is also

a near duplicate group. In this case, we call ๐บand ๐บโ€ฒ nearby groups.

Definition 12. Nearby duplicates are dupli-cates belonging to nearby groups.

Note 4. Due to remark 3, definition 12 is ap-plicable to both near and exact duplicates.

4.2 Algorithm description

The algorithm that constructs the set of nearduplicate groups (๐‘†๐‘’๐‘ก๐‘‰ ๐บ) is presented below.Its input is the set of exact duplicate groups(๐‘†๐‘’๐‘ก๐บ) belonging to document ๐ท. It employsan interval tree โ€” a data structure whose pur-pose is to quickly locate intervals that inter-sect with a given interval. Initially, the ๐‘†๐‘’๐‘ก๐บset is created using the Clone Miner tool. Thecore idea of our algorithm is to repeatedly findand merge nearby exact groups from ๐‘†๐‘’๐‘ก๐บ. Ateach step, the resulting near duplicate groupsare added to ๐‘†๐‘’๐‘ก๐‘‰ ๐บ. Let us consider this algo-rithm in detail.

The initial interval tree for ๐‘†๐‘’๐‘ก๐บ is con-structed using the ๐ผ๐‘›๐‘–๐‘ก๐‘–๐‘Ž๐‘ก๐‘’() function (line2). The core part of the algorithm is a loopin which new near duplicate groups are con-structed (lines 3โ€“18). This loop repeats untilwe can construct at least one near duplicategroup, i.e. the set of newly constructed nearduplicate groups (๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค) is not empty (line18). Inside of this loop, the algorithm cyclesthrough all groups of ๐‘†๐‘’๐‘ก๐บ โˆช ๐‘†๐‘’๐‘ก๐‘‰ ๐บ. For each

Page 7: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 7

Algorithm 1: Near Duplicate Groups Con-structionInput data: ๐‘†๐‘’๐‘ก๐บResult: ๐‘†๐‘’๐‘ก๐‘‰ ๐บ

1 ๐‘†๐‘’๐‘ก๐‘‰ ๐บโ† โˆ…2 ๐ผ๐‘›๐‘–๐‘ก๐‘–๐‘Ž๐‘ก๐‘’()3 repeat4 ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค โ† โˆ…5 foreach ๐บ โˆˆ ๐‘†๐‘’๐‘ก๐บ โˆช ๐‘†๐‘’๐‘ก๐‘‰ ๐บ do6 ๐‘†๐‘’๐‘ก๐ถ๐‘Ž๐‘›๐‘‘โ† ๐‘๐‘’๐‘Ž๐‘Ÿ๐ต๐‘ฆ(๐บ)7 if ๐‘†๐‘’๐‘ก๐ถ๐‘Ž๐‘›๐‘‘ = โˆ… then8 ๐บโ€ฒ โ† ๐บ๐‘’๐‘ก๐ถ๐‘™๐‘œ๐‘ ๐‘’๐‘ ๐‘ก(๐บ,๐‘†๐‘’๐‘ก๐ถ๐‘Ž๐‘›๐‘‘)9 ๐‘…๐‘’๐‘š๐‘œ๐‘ฃ๐‘’(๐บ,๐บโ€ฒ)

10 if ๐ต๐‘’๐‘“๐‘œ๐‘Ÿ๐‘’(๐บ,๐บโ€ฒ) then11 ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค โ† ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค โˆช {โŸจ๐บ,๐บโ€ฒโŸฉ}12 else13 ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค โ† ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค โˆช {โŸจ๐บโ€ฒ, ๐บโŸฉ}14 end if15 end if16 end foreach17 ๐ฝ๐‘œ๐‘–๐‘›(๐‘†๐‘’๐‘ก๐‘‰ ๐บ, ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค)

18 until ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค = โˆ…19 ๐‘†๐‘’๐‘ก๐‘‰ ๐บโ† ๐‘†๐‘’๐‘ก๐‘‰ ๐บ โˆช ๐‘†๐‘’๐‘ก๐บ

of them, the ๐‘๐‘’๐‘Ž๐‘Ÿ๐ต๐‘ฆ function returns the setof nearby groups ๐‘†๐‘’๐‘ก๐ถ๐‘Ž๐‘›๐‘‘ (lines 5, 6), whichis then used for constructing near duplicategroups. Later, we will discuss this function inmore detail and prove its correctness, i.e. thatit actually returns groups that are close to ๐บ.Next, the closest group to ๐บ, denoted ๐บโ€ฒ, is se-lected from ๐‘†๐‘’๐‘ก๐ถ๐‘Ž๐‘›๐‘‘ (line 8) and a variationalgroup โŸจ๐บ,๐บโ€ฒโŸฉ or โŸจ๐บโ€ฒ, ๐บโŸฉ is created. This groupis added into ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค (lines 10โ€“14). Since ๐บand ๐บโ€ฒ are merged and therefore cease to existas independent entities, they are deleted from๐‘†๐‘’๐‘ก๐บ and ๐‘†๐‘’๐‘ก๐‘‰ ๐บ by the ๐‘…๐‘’๐‘š๐‘œ๐‘ฃ๐‘’ function (line9). Next, the ๐ฝ๐‘œ๐‘–๐‘› function adds ๐‘†๐‘’๐‘ก๐‘๐‘’๐‘ค to๐‘†๐‘’๐‘ก๐‘‰ ๐บ (line 17). It is essential to note thatthe ๐‘…๐‘’๐‘š๐‘œ๐‘ฃ๐‘’ and ๐ฝ๐‘œ๐‘–๐‘› functions perform someauxiliary actions described below.

In the end of the algorithm ๐‘†๐‘’๐‘ก๐บ is addedto ๐‘†๐‘’๐‘ก๐‘‰ ๐บ. The result โ€” ๐‘†๐‘’๐‘ก๐‘‰ ๐บ โ€” is pre-sented as the algorithmโ€™s output. This step isrequired in order for the output to contain notonly near duplicate groups, but also exact du-plicate groups which have not been used forcreation of near duplicate ones (line 19).

Let us describe the functions employed inthis algorithm.

The ๐ผ๐‘›๐‘–๐‘ก๐‘–๐‘Ž๐‘ก๐‘’() function builds the intervaltree. The idea of this data structure is the fol-lowing.

Suppose we have ๐‘› natural number intervals,where ๐‘1 is the minimum and ๐‘’๐‘› is the maxi-mum value of all interval endpoints, and ๐‘š isthe midpoint of [๐‘1, ๐‘’๐‘›]. The intervals are di-vided into three groups: fully located to theleft of ๐‘š, fully located to the right of ๐‘š, andintervals containing ๐‘š. The current node of theinterval tree stores the last interval group andreferences to its left and right child nodes con-taining the intervals to the left and to the rightof ๐‘š respectively. This procedure is repeatedfor each child node. Further details regardingthe construction of an interval tree can be foundin [40, 41].

In this study, we build our interval tree fromthe extended intervals that correspond to theexact duplicates found by CloneMiner. Theseextended intervals are obtained as follows: orig-inal intervals belonging to exact duplicates areenlarged by 15%. For example, if [๐‘, ๐‘’] isthe initial interval, then an extended one is[๐‘ โˆ’ 0.15 * (๐‘’ โˆ’ ๐‘ + 1), ๐‘’ + 0.15 * (๐‘’ โˆ’ ๐‘ + 1)].We will denote the extended interval that cor-responds to the exact duplicate ๐‘” as โ†• ๐‘”. Wealso modify our interval tree as follows: eachstored interval keeps the reference to the corre-sponding exact duplicate group.

The ๐‘…๐‘’๐‘š๐‘œ๐‘ฃ๐‘’ function removes groups fromsets and their intervals from the interval tree.The interval deletion algorithm is described inreferences [40, 41].

The ๐ฝ๐‘œ๐‘–๐‘› function, in addition to the op-erations described above, adds intervals ofthe newly created near duplicate group ๐บ =โŸจ๐บ1, . . . , ๐บ๐‘โŸฉ to the interval tree. The stan-dard insertion algorithm described in refer-ences [40, 41] is used. Extended intervalsadded to the tree of each near duplicate ๐‘”๐‘˜ =(๐‘”๐‘˜1 ,..., ๐‘”

๐‘˜๐‘), where ๐‘˜ โˆˆ {1, . . . ,#๐บ1}, have the

form of [๐‘๐‘˜1 โˆ’ ๐‘ฅ๐‘˜, ๐‘’๐‘˜๐‘ + ๐‘ฅ๐‘˜], where ๐‘ฅ๐‘˜ = 0.15 *๐‘โˆ‘๐‘–=1

|๐‘”๐‘˜๐‘– | โˆ’๐‘โˆ’1โˆ‘๐‘–=1

๐‘‘๐‘–๐‘ ๐‘ก(๐‘”๐‘˜๐‘– , ๐‘”๐‘˜๐‘–+1). We will denote this

extended interval of ๐‘”๐‘˜ (now, a near duplicate)as โ†• ๐‘”๐‘˜ as well.

Page 8: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

8 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

The ๐‘๐‘’๐‘Ž๐‘Ÿ๐ต๐‘ฆ function selects nearby groupsfor some group ๐บ (its parameter). To do this,for each text fragment from ๐บ a collection ofintervals that intersect with its interval is ex-tracted.Text fragments that correspond to theseintervals turn out to be neighboring to the ini-tial fragment, i.e. for them, condition (5) issatisfied. The retrieval is done using the inter-val tree search algorithm [40, 41]. We constructthe ๐บ๐ฟ1 set, which contains groups that are ex-pected to be nearby to ๐บ:

๐บ๐ฟ1(๐บ) = {๐บโ€ฒ|(๐บโ€ฒ โˆˆ ๐‘†๐‘’๐‘ก๐บ โˆช ๐‘†๐‘’๐‘ก๐‘‰ ๐บ)โˆงโˆƒ๐‘” โˆˆ ๐บ, ๐‘”โ€ฒ โˆˆ ๐บโ€ฒ : โ†• ๐‘”โˆฉ โ†• ๐‘”โ€ฒ = โˆ….

(6)

That is, the ๐บ๐ฟ1 set consists of groups thatcontain at least one duplicate that is close toat least one duplicate from ๐บ. Then, only thegroups that can form a variational group with๐บ are selected and placed into the ๐บ๐ฟ2 set:

๐บ๐ฟ2(๐บ) = {๐บโ€ฒ|๐บโ€ฒ โˆˆ ๐บ๐ฟ1 โˆง (โŸจ๐บ,๐บโ€ฒโŸฉorโŸจ๐บโ€ฒ, ๐บโŸฉis variational group)}.

(7)

Finally, the ๐บ๐ฟ3 set (the ๐‘๐‘’๐‘Ž๐‘Ÿ๐ต๐‘ฆ functionโ€™soutput) is created. The only groups placed inthis set are those from ๐บ๐ฟ2 whose all elementsare close to corresponding elements of ๐บ:

๐บ๐ฟ3(๐บ) = {๐บโ€ฒ|๐บโ€ฒ โˆˆ ๐บ๐ฟ2 โˆงโˆ€๐‘˜ โˆˆ {1, . . . ,#๐บ} : โ†• ๐‘”๐‘˜โˆฉ โ†• ๐‘”โ€ฒ๐‘˜ = โˆ…}

(8)

Theorem 1. Suggested algorithm detects nearduplicate groups that conform to definition 9.

It is easy to show by construction of ๐‘๐‘’๐‘Ž๐‘Ÿ๐ต๐‘ฆthat for some group ๐บ it returns the set of itsnearby groups (see definition 11). That is, eachof these groups can be used to form a near du-plicate group with ๐บ. Then for the set the al-gorithm selects the group closest to ๐บ and con-structs a new near duplicate group. The cor-rectness of all intermediate sets and other usedfunctions is immediate from their constructionmethods.

5 EvaluationThe proposed algorithm was implemented inthe Duplicate Finder Toolkit [12]. Our proto-type uses the intervaltree library [42] as an

implementation of the interval tree data struc-ture.

We have evaluated 19 industrial documentsbelonging to various types: requirement speci-fication, programming guides, API documenta-tion, user manuals, etc. (see Table 1). The sizeof the evaluated documents is up to 3 Mb.

Our evaluation produced the following re-sults. The majority of the duplicate groupsdetected are exact duplicates (88.3โ€“96.5%).Groups having one variation point amount to3.3โ€“12.5%, two variation points โ€“ to 0โ€“1.7%,and three variation points โ€“ to less 1%, etc.A few near duplicates with 11, 12 13, and 16variation points also were detected. We per-formed a manual analysis of the automaticallydetected near duplicates for Linux Kernel Doc-umentation (programming guide, document 1in the Table 1) [13]. We found 70 meaningfultext groups (5.4%), 30 meaningful groups forthe code example (2.3%), and 1191 false posi-tive groups (92.3%). We found 21 near dupli-cate groups, i.e. 21% of the meaningful dupli-cate groups. Therefore, the share of near du-plicates significantly increases after discardingfalse positives.

Having analyzed the evaluation results, wecan make the following conclusions:

1. During our experiments we did not manageto find any near duplicates in considereddocuments that were not detected by ouralgorithm. However, we should note thatthe claim of the algorithmโ€™s high recallneeds a more detailed justification.

2. Analyzing the Linux KernelDocumentation, we have concludedthat it does not have any cohesive style:it was created sporadically by differentauthors. Virtually all its duplicates aresituated locally, i.e. close to each other.For example, some author created adescription of some driverโ€™s functionalityusing copy/paste for its similar features.At the same time, another driver wasdescribed by a different author who didnot use the first driverโ€™s description atall. Consequently, there are practically noduplicates that are found throughout the

Page 9: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 9

Table 1: Near-duplicate groups detected

Do

cu

me

nt

Siz

e, K

b

To

tal n

ea

r

dup

gro

up

s

Exa

ct

dup

lica

tes

gro

up

s,

%

1-v

ariatio

n

gro

up

s,

%

2-v

ariatio

n

gro

up

s,

%

3-v

ariatio

n

gro

up

s,

%

1 892 1291 93.9 5.1 0.9 0.2

2 2924 6056 93.3 5.3 0.9 0.2

3 1810 4220 95.9 3.4 0.5 0.2

4 686 1500 96.5 3.3 0.2 0.1

5 1311 4688 95.9 3.3 0.5 0.0

6 3136 6587 93.9 5.1 0.7 0.2

7 1491 4537 92.0 6.0 1.2 0.4

8 3160 7804 95.6 3.5 0.5 0.2

9 1104 152 93.4 5.3 0.7 0.7

10 1800 4685 92.7 5.8 0.9 0.3

11 1056 2436 91.5 6.7 1.1 0.3

12 36 59 83.1 11.9 1.7 0.0

13 166 392 89.5 9.7 0.5 0.3

14 103 208 91.3 7.7 0.5 0.0

15 98 117 94.0 4.3 0.9 0.9

16 241 394 88.8 9.1 0.8 0.3

17 43 16 81.3 12.5 6.3 0.0

18 50 77 88.3 11.7 0.0 0.0

19 167 145 88.3 9.7 0.7 1.4

whole text. Examples, warnings, notes,and other documentation elements thatare preceded by different introductorysentences are not styled cohesively aswell. Thus, our algorithm can be used foranalyzing the degree of documentationuniformity.

3. The algorithm performs well on two-element groups, finding near duplicategroups with a different number ofextension points. It appears that, ingeneral, there are way fewer near duplicategroups with more than two elements.

4. Many detected duplicate groups consist offigure and table captions, page headers,parts of the table of contents and soon โ€” that is, they are not of anyinterest to us. Also, many found duplicatesare scattered across different elementsof document structure, for example, aduplicate can be a part of a headerand a small fragment of text right afterit. These kinds of duplicates are notdesired since they are not very usefulfor document writers. However, theyare detected because currently documentstructure is not taken into account during

the search.

5. The 0.15 value used in detecting nearduplicate groups does not allow to findsome significant groups (mainly smallones, 10โ€“20 tokens in size). It is possiblethat it would be more effective to use somefunction instead of a constant, which coulddepend, for example, on the length of thenear duplicate.

6. Moreover, often the detected duplicatedoes not contain variational informationthat is situated either in its end orin its beginning. Sometimes it couldbe beneficial to include it in order toensure semantic completeness. To solvethis problem, a clarification and a formaldefinition of semantic completeness of atext fragment is required. Our experimentsshow that this can be done in various ways(the simplest one is ensuring sentence-levelgranularity, i.e. including all text until thestart/end of sentence).

Page 10: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

10 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

6 Conclusion

In this paper the formal definition of near du-plicates in software documentation is given andthe algorithm for near duplicate detection ispresented. An evaluation of the algorithm us-ing a large number of both commercial andopen source documents is performed.

The evaluation shows that various typesof software documentation contain a signifi-cant number of exact and near duplicates. Anear duplicate detection tool could improve thequality of documentation, while a duplicatemanagement technique would simplify docu-mentation maintenance.

Although the proposed algorithm providesample evidence on text duplicates in industrialdocuments, it still requires improvements be-fore it can be applied to real-life tasks. Themain issues to be resolved are the quality ofthe near duplicates detected and a large num-ber of false positives. Also, a detailed analysisof near duplicate types in various sorts of soft-ware documents should be performed.

References

[1] Parnas D.L. Precise Documentation:The Key to Better Software. SpringerBerlin Heidelberg, Berlin, Heidelberg,2011, pp. 125โ€“148. DOI: 10.1007/978-3-642-15187-3_8.

[2] Juergens E., Deissenboeck F., Feilkas M.,Hummel B., Schaetz B., Wagner S., Do-mann C., and Streit J. Can clone de-tection support quality assessments of re-quirements specifications? Proceedingsof ACM/IEEE 32nd International Con-ference on Software Engineering, vol. 2.2010, pp. 79โ€“88. DOI: 10.1145/1810295.1810308.

[3] Nosalโ€™ M. and Poruban J. Preliminary re-port on empirical study of repeated frag-ments in internal documentation. Proceed-ings of Federated Conference on ComputerScience and Information Systems. 2016,pp. 1573โ€“1576.

[4] Koznov D.V. and Romanovsky K.Y. Do-cLine: A method for software productlines documentation development. Pro-gramming and Computer Software, vol. 34,no. 4, 2008, pp. 216โ€“224. DOI: 10.1134/S0361768808040051.

[5] Romanovsky K., Koznov D., and MinchinL. Refactoring the Documentation ofSoftware Product Lines. Lecture Notesin Computer Science, CEE-SET 2008,vol. 4980. Springer-Verlag, Berlin, Heidel-berg, 2011, pp. 158โ€“170. DOI: 10.1007/978-3-642-22386-0_12.

[6] Wingkvist A., Lowe W., Ericsson M., andLincke R. Analysis and visualization of in-formation quality of technical documenta-tion. Proceedings of the 4th European Con-ference on Information Management andEvaluation. 2010, pp. 388โ€“396.

[7] Nosalโ€™ M. and Poruban J. Reusable soft-ware documentation with phrase annota-tions. Central European Journal of Com-puter Science, vol. 4, no. 4, 2014, pp. 242โ€“258. DOI: 10.2478/s13537-014-0208-3.

[8] Koznov D., Luciv D., Basit H.A., LiehO.E., and Smirnov M. Clone Detection inReuse of Software Technical Documenta-tion. International Andrei Ershov Memo-rial Conference on Perspectives of Sys-tem Informatics (2015), Lecture Notes inComputer Science, vol. 9609. Springer Na-ture, 2016, pp. 170โ€“185. DOI: 10.1007/978-3-319-41579-6_14.

[9] Luciv D.V., Koznov D.V., Basit H.A.,and Terekhov A.N. On fuzzy repetitionsdetection in documentation reuse. Pro-gramming and Computer Software, vol. 42,no. 4, 2016, pp. 216โ€“224. DOI: 10.1134/s0361768816040046.

[10] Basit H.A., Puglisi S.J., Smyth W.F.,Turpin A., and Jarzabek S. Efficient To-ken Based Clone Detection with Flexi-ble Tokenization. Proceedings of the 6thJoint Meeting on European Software En-gineering Conference and the ACM SIG-SOFT Symposium on the Foundations of

Page 11: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 11

Software Engineering: Companion Papers.ACM, New York, NY, USA, 2007, pp. 513โ€“516. DOI: 10.1145/1295014.1295029.

[11] Bassett P.G. Framing Software Reuse:Lessons from the Real World. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,1997.

[12] Documentation Refactoring Toolkit.URL: http://www.math.spbu.ru/user/kromanovsky/docline/index_en.html.

[13] Torvalds L. Linux Kernel Documenta-tion, Dec 2013 snapshot. URL: https://github.com/torvalds/linux/tree/master/Documentation/DocBook/.

[14] Horie M. and Chiba S. Tool Support forCrosscutting Concerns of API Documenta-tion. Proceedings of the 9th InternationalConference on Aspect-Oriented SoftwareDevelopment. ACM, New York, NY, USA,2010, pp. 97โ€“108. DOI: 10.1145/1739230.1739242.

[15] Rago A., Marcos C., and Diaz-Pace J.A.Identifying duplicate functionality in tex-tual use cases by aligning semantic actions.Software & Systems Modeling, vol. 15,no. 2, 2016, pp. 579โ€“603. DOI: 10.1007/s10270-014-0431-3.

[16] Huang T.K., Rahman M.S., MadhyasthaH.V., Faloutsos M., and Ribeiro B. AnAnalysis of Socware Cascades in On-line Social Networks. Proceedings of the22Nd International Conference on WorldWide Web. ACM, New York, NY, USA,2013, pp. 619โ€“630. DOI: 10.1145/2488388.2488443.

[17] Williams K. and Giles C.L. Near Du-plicate Detection in an Academic DigitalLibrary. Proceedings of the ACM Sym-posium on Document Engineering. ACM,New York, NY, USA, 2013, pp. 91โ€“94.DOI: 10.1145/2494266.2494312.

[18] Zhang Q., Zhang Y., Yu H., and HuangX. Efficient Partial-duplicate Detection

Based on Sequence Matching. Proceed-ings of the 33rd International ACM SI-GIR Conference on Research and Develop-ment in Information Retrieval. ACM, NewYork, NY, USA, 2010, pp. 675โ€“682. DOI:10.1145/1835449.1835562.

[19] Abdel Hamid O., Behzadi B., ChristophS., and Henzinger M. Detecting the Ori-gin of Text Segments Efficiently. Pro-ceedings of the 18th International Con-ference on World Wide Web. ACM, NewYork, NY, USA, 2009, pp. 61โ€“70. DOI:10.1145/1526709.1526719.

[20] Ramaswamy L., Iyengar A., Liu L., andDouglis F. Automatic Detection of Frag-ments in Dynamically Generated WebPages. Proceedings of the 13th Interna-tional Conference on World Wide Web.ACM, New York, NY, USA, 2004, pp. 443โ€“454. DOI: 10.1145/988672.988732.

[21] Gibson D., Punera K., and Tomkins A.The Volume and Evolution of Web PageTemplates. Special Interest Tracks andPosters of the 14th International Confer-ence on World Wide Web. ACM, NewYork, NY, USA, 2005, pp. 830โ€“839. DOI:10.1145/1062745.1062763.

[22] Valles E. and Rosso P. Detection of Near-duplicate User Generated Contents: TheSMS Spam Collection. Proceedings of the3rd International Workshop on Search andMining User-generated Contents. ACM,New York, NY, USA, 2011, pp. 27โ€“34.DOI: 10.1145/2065023.2065031.

[23] Barron-Cedeno A., Vila M., Martฤฑ M., andRosso P. Plagiarism Meets Paraphrasing:Insights for the Next Generation in Auto-matic Plagiarism Detection. Comput. Lin-guist., vol. 39, no. 4, 2013, pp. 917โ€“947.DOI: 10.1162/COLI_a_00153.

[24] Smyth W. Computing Patterns in Strings.Addison-Wesley, 2003.

[25] Levenshtein V. Binary codes capable ofcorrecting spurious insertions and dele-

Page 12: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

12 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov

tions of ones. Problems of InformationTransmission, vol. 1, 1965, pp. 8โ€“17.

[26] Wagner R.A. and Fischer M.J. The String-to-String Correction Problem. Journal ofthe ACM, vol. 21, no. 1, 1974, pp. 168โ€“173.DOI: 10.1145/321796.321811.

[27] Wu S. and Manber U. Fast Text Searching:Allowing Errors. Commun. ACM, vol. 35,no. 10, 1992, pp. 83โ€“91. DOI: 10.1145/135239.135244.

[28] Myers G. A Fast Bit-vector Algorithm forApproximate String Matching Based onDynamic Programming. J. ACM, vol. 46,no. 3, 1999, pp. 395โ€“415. DOI: 10.1145/316542.316550.

[29] Broder A. On the Resemblance and Con-tainment of Documents. Proceedings ofthe Compression and Complexity of Se-quences. IEEE Computer Society, Wash-ington, DC, USA, 1997, pp. 21โ€“29.

[30] Wang P., Xiao C., Qin J., Wang W.,Zhang X., and Ishikawa Y. Local Simi-larity Search for Unstructured Text. Pro-ceedings of International Conference onManagement of Data. ACM, New York,NY, USA, 2016, pp. 1991โ€“2005. DOI:10.1145/2882903.2915211.

[31] Sajnani H., Saini V., Svajlenko J., RoyC.K., and Lopes C.V. SourcererCC: Scal-ing Code Clone Detection to Big-code.Proceedings of the 38th International Con-ference on Software Engineering. ACM,New York, NY, USA, 2016, pp. 1157โ€“1168.DOI: 10.1145/2884781.2884877.

[32] Jiang L., Misherghi G., Su Z., and GlonduS. DECKARD: Scalable and AccurateTree-Based Detection of Code Clones.Proceedings of the 29th International Con-ference on Software Engineering. IEEEComputer Society, Washington, DC, USA,2007, pp. 96โ€“105. DOI: 10.1109/ICSE.2007.30.

[33] Indyk P. and Motwani R. ApproximateNearest Neighbors: Towards Removing

the Curse of Dimensionality. Proceed-ings of the Thirtieth Annual ACM Sym-posium on Theory of Computing, STOCโ€™98. ACM, New York, NY, USA, 1998.ISBN 0-89791-962-9, pp. 604โ€“613. DOI:10.1145/276698.276876. URL: http://doi.acm.org/10.1145/276698.276876.

[34] Cordy J.R. and Roy C.K. The NiCadClone Detector. Proceedings of IEEE19th International Conference on ProgramComprehension. 2011, pp. 219โ€“220. DOI:10.1109/ICPC.2011.26.

[35] Rattan D., Bhatia R., and Singh M. Soft-ware clone detection: A systematic re-view. Information and Software Technol-ogy, vol. 55, no. 7, 2013, pp. 1165โ€“1199.DOI: 10.1016/j.infsof.2013.01.008.

[36] Abouelhoda M.I., Kurtz S., and OhlebuschE. Replacing Suffix Trees with EnhancedSuffix Arrays. J. of Discrete Algorithms,vol. 2, no. 1, 2004, pp. 53โ€“86. DOI: 10.1016/S1570-8667(03)00065-0.

[37] Koznov D.V., Shutak A.V., SmirnovM.N., and A. S.M. Clone detection inrefactoring software documentation [Poiskklonov pri refaktoringe tehnicheskoj doku-mentacii]. Computer Tools in Educa-tion [Kompโ€™juternye instrumenty v obra-zovanii], vol. 4, 2012, pp. 30โ€“40. (In Rus-sian).

[38] Bassett P.G. The Theory and Practice ofAdaptive Reuse. SIGSOFT Softw. Eng.Notes, vol. 22, no. 3, 1997, pp. 2โ€“9. DOI:10.1145/258368.258371.

[39] Jarzabek S., Bassett P., Zhang H., andZhang W. XVCL: XML-based VariantConfiguration Language. Proceedings ofthe 25th International Conference on Soft-ware Engineering. IEEE Computer Soci-ety, Washington, DC, USA, 2003, pp. 810โ€“811.

[40] de Berg M., Cheong O., van Kreveld M.,and Overmars M. chapter Interval Trees.Springer Berlin Heidelberg, 2008, pp. 220โ€“226. DOI: 10.1007/978-3-540-77974-2.

Page 13: Detecting Near Duplicates in Software Documentationalgorithm strengths and weaknesses, and discuss the benefits of duplicate management in softwaredocuments. Keywords: software documentation,

Detecting Near Duplicates in Software Documentation 13

[41] Preparata F.P. and Shamos M.I. chap-ter Intersections of rectangles. Textsand Monographs in Computer Science.Springer, 1985, pp. 359โ€“363.

[42] PyIntervalTree. URL: https://github.com/chaimleib/intervaltree.