Detecting Near Duplicates in Software Documentation
D.V. Luciv1, D.V. Koznov1, G.A. Chernishev1, A.N. Terekhov1
1 Saint Petersburg State University, 199034 St. Petersburg, Universitetskaya Emb., 7–9
E-mail: {d.lutsiv, d.koznov, g.chernyshev, a.terekhov}@spbu.ru
2017-11-15
Abstract
Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of "near duplicate" fragments, i.e. chunks of text that were copied from a single source and were later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to detect manually due to their fuzzy nature. In this paper we give a formal definition of near duplicates and present an algorithm for their detection in software documents. This algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents. Then, our algorithm uses these exact duplicates to construct near ones. We evaluate the proposed algorithm using the documentation of 19 open source and commercial projects. Our evaluation is very comprehensive: it covers various documentation types, including design and requirement specifications, programming guides and API documentation, and user manuals. Overall, the evaluation shows that all kinds of software documentation contain a significant number of both exact and near duplicates. Next, we report on the manual analysis of the detected near duplicates performed for the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the algorithm's strengths and weaknesses, and discuss the benefits of duplicate management in software documents.
Keywords: software documentation, near duplicates, documentation reuse, software clone detection.
1 Introduction

Every year software is becoming increasingly complex and extensive, and so does software documentation. During the software lifecycle, documentation tends to accumulate a lot of duplicates due to the copy and paste pattern. At first, some text fragment is copied several times; then each copy is modified, possibly in its own way. Thus, different copies of
This work is partially supported by RFBR grant No 16-01-00304.
initially similar fragments become "near duplicates". Depending on the document type [1], duplicates can be either desired or not, but in any case duplicates increase documentation complexity and thus maintenance and authoring costs [2].
Textual duplicates in software documentation, both exact and near ones, are extensively studied [2, 3, 4, 5]. However, there are no methods for the detection of near duplicates, only for exact ones, mainly using software clone detection techniques [2, 3, 6]. At the same time, Juergens et al. [2] indicate the importance of near duplicates and recommend to "pay particular attention to subtle differences in duplicated text". Still, there are studies which address near duplicates; for example, in [7] the authors developed a duplicate specification tool for JavaDoc documentation. This tool allows the user to specify near duplicates and manipulate them. Their approach was based on an informal definition of near duplicates, and the problem of duplicate detection was not addressed. In our previous studies [8, 9] we presented a near duplicate detection approach. Its core idea is to uncover near duplicates and then to apply the reuse techniques described in our earlier studies [4, 5]. The clone detection tool Clone Miner [10] was adapted for the detection of exact duplicates in documents; then near duplicates were extracted as combinations of exact duplicates. However, only near duplicates with one variation point were considered. In other words, the approach can detect only near duplicates that consist of two exact duplicates with a single chunk of variable text between them: $exact_1\ variable_1\ exact_2$.

arXiv:1711.04705v1 [cs.SE] 13 Nov 2017
In this paper we give a formal definition of near duplicates with an arbitrary number of variation points, exhibiting the following pattern: $exact_1\ variable_1\ exact_2\ variable_2\ exact_3 \ldots variable_{n-1}\ exact_n$. Our definition is a formalized version of the definition given in [11]. We also present a generalization of the algorithm described in [8, 9]. The algorithm is implemented in the Documentation Refactoring Toolkit [12], which is a part of the DocLine project [4]. In this paper, an evaluation of the proposed algorithm is also presented. The documentation of 19 open source and commercial projects is used. The results of a detailed manual analysis of the detected near duplicates for the Linux Kernel Documentation [13] are reported.
The paper is structured as follows. Section 2 provides a survey of duplicate management for software documentation and gives a brief overview of near duplicate detection methods in the information retrieval, theoretical computer science, and software clone detection areas. Section 3 describes the context of this study by examining our previous work to underline the contribution of the current paper. In Section 4, the formal definition of a near duplicate is given and the near duplicate detection algorithm is presented. Also, the algorithm correctness theorem is formulated. Finally, Section 5 presents the evaluation results.
2 Related work
Let us consider how near duplicates are employed in documentation-oriented software engineering research. Horie et al. [14] consider the problem of text fragment duplication in Java API documentation. The authors introduce the notion of a crosscutting concern, which is essentially a textual duplicate appearing in documentation. They present a tool named CommentWeaver, which provides several mechanisms for modularization of the API documentation. It is implemented as an extension of the Javadoc tool and provides new tags for controlling reusable text fragments. However, near duplicates are not considered, and facilities for duplicate detection are not provided.
Nosal and Poruban [7] extend the approach from [14] by introducing near duplicates. In this study, the notion of a documentation phrase is used to denote a near duplicate. Parametrization is used to define the variable parts of duplicates, similarly to our approach [4, 5]. However, the authors left the problem of near duplicate detection untouched.
In [3], Nosal and Poruban present the results of a case study in which they searched for exact duplicates in the internal documentation (source code comments) of a set of open source projects. They used a modified copy/paste detection tool, originally developed for code analysis, and found a considerable number of text duplicates. However, near duplicates were not considered in this paper.
Wingkvist et al. adapted a clone detection tool to measure document uniqueness in a collection [6]. The authors used the found duplicates for documentation quality estimation. However, they did not address near duplicate detection.
The work of Juergens et al. [2] is the closest one to our research; it presents a case study of analyzing redundancy in requirement specifications. The authors analyze 28 industrial documents. At the first step, they found duplicates using a clone detection tool. Then, the authors filtered the found duplicates by manually removing false positives and performed a classification of the results. They report that the average duplicate coverage of the documents they analyzed is 13.6%: some documents have a low coverage (0.9%, 0.7%, and even 0%), but there are ones that have a high coverage (35%, 51.1%, 71.6%). Next, the authors discuss how to use the discovered duplicates and how to detect related duplicates in the source code. The impact of duplicates on the document reading process is also studied. Furthermore, the authors propose a classification of meaningful duplicates and false positive duplicates. However, it should be noted that they consider only requirement specifications and ignore other kinds of software documentation. Also, they do not use near duplicates.
Rago et al. [15] apply natural language processing and machine learning techniques to the problem of searching for duplicate functionality in requirement specifications. These documents are considered as a set of textual use cases; the approach extracts a sequence chain (usage scenario) for every use case and compares the pairs of chains to find duplicate subchains. The authors evaluate their approach using several industrial requirement specifications. It should be noted that this study considers a very special type of requirement specifications, which is not widely used in industry. Near duplicates are also not considered.
Algorithms for duplicate detection in textual content have been developed in several other areas. Firstly, the information retrieval community has considered several near-duplicate detection problems: document similarity search [16, 17], (local) text reuse detection [18, 19], and template detection in web document collections [20, 21]. Secondly, the plagiarism detection community has also extensively studied the detection of similarity between documents [22, 23]. Thus, there are a number of efficient solutions. However, the majority of them are focused on document-level similarity, with the goal of attaining high performance on large collections of documents. More importantly, these studies differ from ours in that they ignore duplicate meaningfulness. Nonetheless, adapting these methods for the near-duplicate search in software documentation could improve the quality metrics of the search. It is a promising avenue for further studies.
The theoretical computer science community has also addressed the duplicate detection problem. However, the primary focus of this community has been the development of (approximate) string matching algorithms. For example, in order to match two text fragments of equal length, the Hamming distance was used [24]. The Levenshtein distance [25] is employed to account not only for symbol modification, but also for symbol insertion and removal. For this metric, it is possible to handle all three editing operations at the cost of performance [26]. Significant performance optimizations to string pattern matching using the Levenshtein distance were applied in the fuzzy Bitap algorithm [27], which was later optimized to handle longer patterns efficiently [28]. Similarity-preserving signatures like MinHash [29] can be used to speed up pattern matching for the approximate matching problem. Another indexing approach is proposed in [30], with an algorithm for the fast detection of documents that share a common sliding window with the query pattern but differ from it by at most $k$ tokens. While being useful for our goals in general, these studies do not relate to this paper directly. All of them are focused on the performance improvement of simple string matching tasks. Achieving high performance is undoubtedly an important aspect, but these algorithms become less necessary when employed for documents of 3–4 MB of plain text, which is a common size of a large industrial documentation file.
Various techniques have been employed to detect near duplicate clones in source code. SourcererCC [31] detects near duplicates of code blocks using a static bag-of-tokens strategy that is resilient to minor differences between code blocks. Clone candidates of a code block are queried from a partial inverted index for better scalability. DECKARD [32] computes certain characteristic vectors of code to approximate the structure of Abstract Syntax Trees in Euclidean space. Locality sensitive hashing (LSH) [33] is used to group similar vectors with the Euclidean distance metric, forming clones. NICAD [34] is a text-based near duplicate detection tool that also uses tree-based structural analysis with lightweight parsing of the source code to implement flexible pretty-printing, code normalization, source transformation, and code filtering for better results. However, these techniques are not directly capable of detecting duplicates in text documents, as they involve some degree of parsing of the underlying source code for duplicate detection. Suitable customization for this purpose can be explored in the future.
3 Background
3.1 Exact duplicate detection and Clone Miner

Not only documentation, but also software itself is often developed with a lot of copy/pasted information. To cope with duplicates in the source code, software clone detection methods are used. This area is quite mature; a systematic review of clone detection methods and tools can be found in [35]. In this paper, the Clone Miner [10] software clone detection tool is used to detect exact duplicates in software documentation; we have selected it for its simplicity and its ability to be easily integrated with other tools using a command line interface. Clone Miner is a token-based source code clone detector. A token in the context of text documents is a single word separated from other words by some separator: ".", "(", ")", etc. For example, the following text fragment consists of 2 tokens: "FM registers". Clone Miner considers the input text as an ordered collection of lexical tokens and applies suffix array-based string matching algorithms [36] to retrieve the repeated parts (clones).
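To make the token-based view concrete, here is a minimal sketch of separator-based tokenization in the spirit described above. The separator set is our assumption for illustration; Clone Miner's actual separator list may differ.

```python
import re

def tokenize(text):
    # Split on whitespace and common punctuation separators such as ".", "(", ")";
    # empty strings produced by adjacent separators are dropped.
    return [t for t in re.split(r'[\s.,;:()\[\]{}"]+', text) if t]

# "FM registers" consists of 2 tokens, as in the example above.
assert tokenize("FM registers") == ["FM", "registers"]
```

A document processed this way becomes an ordered token sequence, over which suffix array-based matching can find repeated subsequences.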
3.2 Basic near duplicate detection and refactoring
We have already demonstrated in Section 1 that near duplicate detection in software documentation is an important research avenue. In our previous studies [8, 9], we presented an approach that offers a partial solution to this problem. At first, similarly to Juergens et al. [2] and Wingkvist et al. [6], we applied software clone detection techniques to exact duplicate detection [37]. Then, in [8, 9], we proposed an approach to near duplicate detection. In essence, it is as follows: having exact duplicates found by Clone Miner, we extract sets of duplicate groups whose clones are located close to each other. For example, suppose that the following phrase can be found in the text 5 times with different variations (various port numbers): "inet daemon can listen on ... port and then transfer the connection to appropriate handler". In this case we have two duplicate groups with 5 clones in each group: one group includes the text "inet daemon can listen on", while the other includes "port and then transfer the connection to appropriate handler". We combine these duplicate groups into a group of near duplicates: every member of this group has one variation to capture the different port numbers. In these studies we developed this approach only for one variation, i.e. we "glued" only pairs of exact duplicate groups.
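The basic "gluing" of two exact duplicate groups can be sketched as follows. The `glue` helper and the sample text are purely illustrative (they are not part of the actual toolkit): given the left and right exact parts, it extracts the variable chunk between each adjacent pair of occurrences.

```python
# Hypothetical sketch of the one-variation-point case: two exact duplicate
# groups ("left" and "right") are paired up occurrence by occurrence, and the
# variable text between them (here, port numbers) becomes the variation.
text = ("inet daemon can listen on 21 port and then transfer the connection "
        "to appropriate handler ... "
        "inet daemon can listen on 8080 port and then transfer the connection "
        "to appropriate handler")

left = "inet daemon can listen on "
right = " port and then transfer the connection to appropriate handler"

def glue(text, left, right):
    variations = []
    start = 0
    while True:
        i = text.find(left, start)
        if i == -1:
            break
        j = text.find(right, i + len(left))
        variations.append(text[i + len(left):j])   # the variable chunk
        start = j + len(right)
    return variations

assert glue(text, left, right) == ["21", "8080"]
```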
Then it is possible to apply the adaptive reuse technique [38, 39] to near duplicates and to perform automatic refactoring of documentation. Our toolkit [12] allows creating reusable text fragment definition templates for a selected near duplicate group. Then it is possible to substitute all the occurrences of the group's members with parameterized references to the definition. The overall scheme of the process is shown in Fig. 1.
Figure 1: The process of near duplicate search (input document → preparation for clone detection → clone detection with Clone Miner → filtering → refactoring → modified document)

The experiments with the described approach revealed a considerable number of interesting near duplicates. Thus, we decided to generalize the algorithm from [8, 9] in order to allow combinations of an arbitrary number of exact duplicates. This makes it possible to have several variation parts in the resulting near duplicates. The generalized algorithm is described below.
4 Near duplicate detection algorithm
4.1 Definitions
Let us define the terms necessary for describing the proposed algorithm. We consider a document $D$ as a sequence of symbols. Every symbol of $D$ has a coordinate corresponding to its offset from the beginning of the document; this coordinate is a number belonging to the interval $[1, length(D)]$, where $length(D)$ is the number of symbols in $D$.

Definition 1. For $D$, we define a text fragment as an occurrence of some substring in $D$. Hence, each text fragment has a corresponding integer interval $[a, b]$, where $a$ is the coordinate of its first symbol and $b$ is the coordinate of its last symbol. For a text fragment $f$ of document $D$, we write $f \in D$.

Let us introduce the following sets: $D^*$ is the set of all text fragments of $D$, $I_D$ is the set of all integer intervals within the interval $[1, length(D)]$, and $S_D$ is the set of all strings of $D$.
Also, let us introduce the following notation:

• $[f] : D^* \to I_D$ is a function that takes a text fragment $f$ and returns its interval.

• $str(f) : D^* \to S_D$ is a function that takes a text fragment $f$ and returns its text.

• $\alpha : I_D \to D^*$ is a function that takes an interval $I$ and returns the corresponding text fragment.

• $|[a, b]| : I_D \to [0, length(D)]$ is a function that takes an interval $[f] = [a, b]$ and returns its length as $|[f]| = b - a + 1$. For simplicity, we write $|f|$ instead of $|[f]|$.

• For any $f_1, f_2 \in D$ we consider their intersection $f_1 \cap f_2$ as the intersection of the corresponding intervals $[f_1] \cap [f_2]$, and $f_1 \subseteq f_2$ means $[f_1] \subseteq [f_2]$.

• We define the binary predicate $Before$ on $D^* \times D^*$, which is true for text fragments $f_1, f_2 \in D$ iff $b_1 < a_2$, where $[f_1] = [a_1, b_1]$ and $[f_2] = [a_2, b_2]$.
Definition 2. Let us consider a set $G$ of text fragments of $D$ such that $\forall f_1, f_2 \in G\ (str(f_1) = str(f_2)) \wedge (f_1 \cap f_2 = \emptyset)$. We call these fragments exact duplicates and $G$ an exact duplicate group, or exact group. We denote the number of elements in $G$ by $\#G$.
Definition 3. For an ordered set of exact duplicate groups $G_1, \ldots, G_n$, we say that it forms a variational group $\langle G_1, \ldots, G_n \rangle$ when the following conditions are satisfied:

1. $\#G_1 = \ldots = \#G_n$.

2. Text fragments having the same positions in different groups occur in the same order in the document text: $\forall f_i^k \in G_i\ \forall f_j^k \in G_j\ ((i < j) \Rightarrow Before(f_i^k, f_j^k))$, and $\forall k \in \{1, \ldots, \#G_1 - 1\}\ Before(f_n^k, f_1^{k+1})$.

We also say that for any $G_i$ of this set, $G_i \in_g VG$.
Note 1. According to condition 2 of Definition 3, $\forall f_i \in G_i\ \forall f_j \in G_j\ (i \ne j \Rightarrow f_i \cap f_j = \emptyset)$.
Note 2. When $VG = \langle G_1, \ldots, G_n \rangle$ and $VG' = \langle G'_1, \ldots, G'_m \rangle$ are variational groups, $\langle VG, VG' \rangle = \langle G_1, \ldots, G_n, G'_1, \ldots, G'_m \rangle$ is also a variational group in case it satisfies Definition 3.

For example, suppose that we have $VG = \langle G_1, G_2, G_3 \rangle$ and each $G_i$ consists of three clones $f_i^k$, $k \in \{1, 2, 3\}$. Then these clones appear in the text in the following order: $f_1^1 \ldots f_2^1 \ldots f_3^1 \ldots\ldots f_1^2 \ldots f_2^2 \ldots f_3^2 \ldots\ldots f_1^3 \ldots f_2^3 \ldots f_3^3$.

Next, it should be possible to compute the distance between variational or exact duplicate groups. This is required to support group merging inside our algorithm, which selects the several closest groups to form a new one. Thus, a distance function should be defined.
Definition 4. The distance between text fragments $f_1, f_2 \in D$ is defined as follows:

$$dist(f_1, f_2) = \begin{cases} 0, & f_1 \cap f_2 \ne \emptyset, \\ a_2 - b_1 + 1, & Before(f_1, f_2), \\ a_1 - b_2 + 1, & Before(f_2, f_1), \end{cases} \qquad (1)$$

where $[f_1] = [a_1, b_1]$ and $[f_2] = [a_2, b_2]$.
Definition 5. The distance between exact groups $G_1$ and $G_2$ with $\#G_1 = \#G_2$ is defined as follows:

$$dist(G_1, G_2) = \max_{k \in \{1, \ldots, \#G_1\}} dist(f_1^k, f_2^k) \qquad (2)$$
Definition 6. The distance between variational groups $VG_1$ and $VG_2$, where any $G_1 \in_g VG_1$ and $G_2 \in_g VG_2$ satisfy $\#G_1 = \#G_2$, is defined as follows:

$$dist(VG_1, VG_2) = \max_{G_1 \in_g VG_1,\ G_2 \in_g VG_2} dist(G_1, G_2) \qquad (3)$$
Definition 7. The length of an exact group $G$ is defined as follows: $length(G) = \sum_{k=1}^{\#G} (b_k - a_k + 1)$, where $f_k \in G$, $[f_k] = [a_k, b_k]$.
Definition 8. The length of a variational group $VG = \langle G_1, \ldots, G_n \rangle$ is defined as follows:

$$length(VG) = \sum_{i=1}^{n} length(G_i) \qquad (4)$$
Definition 9. A near duplicate group is a variational group $\langle G_1, \ldots, G_n \rangle$ that satisfies the following condition for all $k \in \{1, \ldots, \#G_1\}$:

$$\sum_{i=1}^{n-1} dist(f_i^k, f_{i+1}^k) \le 0.15 \cdot \sum_{i=1}^{n} |f_i^k|. \qquad (5)$$

This definition is constructed according to the near duplicate concept from [38]: the variational part of near duplicates with similar information (the delta) should not exceed 15% of their exact duplicate (archetype) part.
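Definitions 4 and 9 can be sketched directly in code. The fragment representation below (a plain $[a, b]$ interval tuple) and the helper names are our simplification for illustration, not the toolkit's actual data model.

```python
# Definition 4: dist is 0 for intersecting fragments, otherwise the gap size.
def dist(f1, f2):
    (a1, b1), (a2, b2) = f1, f2
    if not (b1 < a2 or b2 < a1):        # the intervals intersect
        return 0
    if b1 < a2:                          # Before(f1, f2)
        return a2 - b1 + 1
    return a1 - b2 + 1                   # Before(f2, f1)

def length(f):
    a, b = f
    return b - a + 1

# Definition 9, for one tuple of fragments (the k-th member of each group):
# the summed gaps must not exceed 15% of the summed fragment lengths.
def is_near_duplicate_tuple(fragments, ratio=0.15):
    gaps = sum(dist(fragments[i], fragments[i + 1])
               for i in range(len(fragments) - 1))
    return gaps <= ratio * sum(length(f) for f in fragments)

# Two exact parts [0, 99] and [110, 199]: gap of 12 vs. total length 190.
assert is_near_duplicate_tuple([(0, 99), (110, 199)])
assert not is_near_duplicate_tuple([(0, 9), (100, 109)])
```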
Note 3. An exact group $G$ can be considered as a variational group formed by itself: $\langle G \rangle$.
Definition 10. Consider a near duplicate group $\langle G_1, G_2 \rangle$, where $G_1$ and $G_2$ are exact groups. We say that this group contains a single extension point, and the text fragments contained in the positions $[b_1^k + 1, a_2^k - 1]$ are called extension point values. In the general case, a near duplicate group $\langle G_1, \ldots, G_n \rangle$ has $n - 1$ extension points.
Definition 11. Consider two near duplicate groups $G = \langle G_1, \ldots, G_n \rangle$ and $G' = \langle G'_1, \ldots, G'_m \rangle$. Suppose that they form a variational group $\langle G_1, \ldots, G_n, G'_1, \ldots, G'_m \rangle$ or $\langle G'_1, \ldots, G'_m, G_1, \ldots, G_n \rangle$, which in turn is also a near duplicate group. In this case, we call $G$ and $G'$ nearby groups.

Definition 12. Nearby duplicates are duplicates belonging to nearby groups.

Note 4. Due to Note 3, Definition 12 is applicable to both near and exact duplicates.
4.2 Algorithm description
Algorithm 1: Near Duplicate Groups Construction
Input data: $SetG$
Result: $SetVG$

 1  $SetVG \leftarrow \emptyset$
 2  $InitTree()$
 3  repeat
 4      $SetNew \leftarrow \emptyset$
 5      foreach $G \in SetG \cup SetVG$ do
 6          $SetCand \leftarrow NearBy(G)$
 7          if $SetCand \ne \emptyset$ then
 8              $G' \leftarrow GetClosest(G, SetCand)$
 9              $Remove(G, G')$
10              if $Before(G, G')$ then
11                  $SetNew \leftarrow SetNew \cup \{\langle G, G' \rangle\}$
12              else
13                  $SetNew \leftarrow SetNew \cup \{\langle G', G \rangle\}$
14              end if
15          end if
16      end foreach
17      $Join(SetVG, SetNew)$
18  until $SetNew = \emptyset$
19  $SetVG \leftarrow SetVG \cup SetG$

The algorithm that constructs the set of near duplicate groups ($SetVG$) is presented in Algorithm 1. Its input is the set of exact duplicate groups ($SetG$) belonging to document $D$. It employs an interval tree, a data structure whose purpose is to quickly locate the intervals that intersect a given interval. Initially, the $SetG$ set is created using the Clone Miner tool. The core idea of our algorithm is to repeatedly find and merge nearby exact groups from $SetG$. At each step, the resulting near duplicate groups are added to $SetVG$. Let us consider this algorithm in detail.

The initial interval tree for $SetG$ is constructed using the $InitTree()$ function (line 2). The core part of the algorithm is a loop in which new near duplicate groups are constructed (lines 3–18). This loop repeats as long as we can construct at least one near duplicate group, i.e. while the set of newly constructed near duplicate groups ($SetNew$) is not empty (line 18). Inside this loop, the algorithm cycles through all groups of $SetG \cup SetVG$. For each
of them, the $NearBy$ function returns the set of nearby groups $SetCand$ (lines 5, 6), which is then used for constructing near duplicate groups. Later, we will discuss this function in more detail and prove its correctness, i.e. that it actually returns groups that are close to $G$. Next, the group closest to $G$, denoted $G'$, is selected from $SetCand$ (line 8), and a variational group $\langle G, G' \rangle$ or $\langle G', G \rangle$ is created. This group is added to $SetNew$ (lines 10–14). Since $G$ and $G'$ are merged and therefore cease to exist as independent entities, they are deleted from $SetG$ and $SetVG$ by the $Remove$ function (line 9). Next, the $Join$ function adds $SetNew$ to $SetVG$ (line 17). It is essential to note that the $Remove$ and $Join$ functions perform some auxiliary actions described below.
At the end of the algorithm, $SetG$ is added to $SetVG$, and the result, $SetVG$, is returned as the algorithm's output. This step is required in order for the output to contain not only near duplicate groups, but also those exact duplicate groups that have not been used for the creation of near duplicate ones (line 19).
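The merge loop of Algorithm 1 can be sketched as follows. The interval-tree machinery is abstracted behind the `are_nearby`, `dist`, and `before` callbacks, and representing a merged group as a pair is our simplification; none of these names come from the actual toolkit.

```python
# A compact sketch of the merge loop of Algorithm 1: repeatedly merge each
# group with its closest nearby group until no new groups can be formed,
# then return the merged groups together with the unused exact groups.
def build_near_duplicate_groups(set_g, are_nearby, dist, before):
    set_vg = set()                                  # SetVG
    while True:
        set_new = set()                             # SetNew
        for g in list(set_g | set_vg):
            if g not in set_g and g not in set_vg:
                continue                            # g was merged this pass
            pool = (set_g | set_vg) - {g}
            cand = [h for h in pool if are_nearby(g, h)]
            if not cand:
                continue
            g2 = min(cand, key=lambda h: dist(g, h))   # GetClosest
            for s in (set_g, set_vg):                  # Remove
                s.discard(g)
                s.discard(g2)
            set_new.add((g, g2) if before(g, g2) else (g2, g))
        set_vg |= set_new                           # Join
        if not set_new:                             # until SetNew is empty
            return set_vg | set_g                   # line 19
```

With groups modeled as labels at known offsets, two close groups are merged in document order while a distant one survives unmerged.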
Let us describe the functions employed in this algorithm.
The $InitTree()$ function builds the interval tree. The idea of this data structure is the following. Suppose we have $n$ natural number intervals, where $x_1$ is the minimum and $x_m$ is the maximum value of all interval endpoints, and $c$ is the midpoint of $[x_1, x_m]$. The intervals are divided into three groups: those fully located to the left of $c$, those fully located to the right of $c$, and those containing $c$. The current node of the interval tree stores the last interval group and references to its left and right child nodes, containing the intervals to the left and to the right of $c$, respectively. This procedure is repeated for each child node. Further details regarding the construction of an interval tree can be found in [40, 41].
In this study, we build our interval tree from extended intervals that correspond to the exact duplicates found by Clone Miner. These extended intervals are obtained as follows: the original intervals belonging to exact duplicates are enlarged by 15%. For example, if $[a, b]$ is the initial interval, then the extended one is $[a - 0.15 \cdot (b - a + 1),\ b + 0.15 \cdot (b - a + 1)]$. We will denote the extended interval that corresponds to the exact duplicate $f$ as $\approx f$. We also modify our interval tree as follows: each stored interval keeps a reference to the corresponding exact duplicate group.
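The extended-interval computation above can be sketched as follows. For brevity, the interval tree itself is replaced here by a naive linear scan over an index list (the prototype uses a real interval tree for fast lookups); the group identifiers and sample coordinates are illustrative.

```python
# Enlarge [a, b] by 15% of its length on both sides, per the text above.
def extended(a, b, ratio=0.15):
    pad = ratio * (b - a + 1)
    return (a - pad, b + pad)

# Naive stand-in for the interval tree query: return the groups whose
# extended intervals intersect [lo, hi]. Each entry keeps a reference
# (here, just an id) to its exact duplicate group.
def overlapping(index, lo, hi):
    return {gid for (x, y, gid) in index if x <= hi and lo <= y}

index = [(*extended(a, b), gid)
         for gid, (a, b) in [("G1", (100, 199)), ("G2", (400, 499))]]

assert overlapping(index, 180, 220) == {"G1"}
assert overlapping(index, 380, 390) == {"G2"}
```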
The $Remove$ function removes groups from the sets and their intervals from the interval tree. The interval deletion algorithm is described in [40, 41].
The $Join$ function, in addition to the operations described above, adds the intervals of the newly created near duplicate group $G = \langle G_1, \ldots, G_n \rangle$ to the interval tree. The standard insertion algorithm described in [40, 41] is used. The extended intervals added to the tree for each near duplicate $f^k = (f_1^k, \ldots, f_n^k)$, where $k \in \{1, \ldots, \#G_1\}$, have the form $[a_1^k - x_k,\ b_n^k + x_k]$, where $x_k = 0.15 \cdot \sum_{i=1}^{n} |f_i^k| - \sum_{i=1}^{n-1} dist(f_i^k, f_{i+1}^k)$. We will denote this extended interval of $f^k$ (now, a near duplicate) as $\approx f^k$ as well.
The $NearBy$ function selects the nearby groups for a given group $G$ (its parameter). To do this, for each text fragment from $G$, a collection of intervals that intersect its interval is extracted. The text fragments that correspond to these intervals turn out to be neighbors of the initial fragment, i.e. condition (5) is satisfied for them. The retrieval is done using the interval tree search algorithm [40, 41]. We construct the $GL_1$ set, which contains the groups that are expected to be nearby to $G$:

$$GL_1(G) = \{G' \mid (G' \in SetG \cup SetVG) \wedge \exists f \in G, f' \in G' : \approx f \cap \approx f' \ne \emptyset\}. \qquad (6)$$

That is, the $GL_1$ set consists of the groups that contain at least one duplicate that is close to at least one duplicate from $G$. Then, only the groups that can form a variational group with $G$ are selected and placed into the $GL_2$ set:

$$GL_2(G) = \{G' \mid G' \in GL_1 \wedge (\langle G, G' \rangle \text{ or } \langle G', G \rangle \text{ is a variational group})\}. \qquad (7)$$

Finally, the $GL_3$ set (the $NearBy$ function's output) is created. The only groups placed in this set are those from $GL_2$ all of whose elements are close to the corresponding elements of $G$:

$$GL_3(G) = \{G' \mid G' \in GL_2 \wedge \forall k \in \{1, \ldots, \#G\} : \approx f_k \cap \approx f'_k \ne \emptyset\} \qquad (8)$$
Theorem 1. The suggested algorithm detects near duplicate groups that conform to Definition 9.

It is easy to show, by the construction of $NearBy$, that for a group $G$ it returns the set of its nearby groups (see Definition 11). That is, each of these groups can be used to form a near duplicate group with $G$. Then, from this set, the algorithm selects the group closest to $G$ and constructs a new near duplicate group. The correctness of all intermediate sets and of the other functions used follows immediately from their construction.
5 Evaluation

The proposed algorithm was implemented in the Duplicate Finder Toolkit [12]. Our prototype uses the intervaltree library [42] as the implementation of the interval tree data structure.
We have evaluated 19 industrial documents belonging to various types: requirement specifications, programming guides, API documentation, user manuals, etc. (see Table 1). The size of the evaluated documents is up to 3 MB.
Our evaluation produced the following results. The majority of the detected duplicate groups are exact duplicates (88.3–96.5%). Groups having one variation point amount to 3.3–12.5%, two variation points to 0–1.7%, and three variation points to less than 1%. A few near duplicates with 11, 12, 13, and 16 variation points were also detected. We performed a manual analysis of the automatically detected near duplicates for the Linux Kernel Documentation (a programming guide, document 1 in Table 1) [13]. We found 70 meaningful text groups (5.4%), 30 meaningful groups for code examples (2.3%), and 1191 false positive groups (92.3%). Among the meaningful groups, we found 21 near duplicate groups, i.e. 21% of the meaningful duplicate groups. Therefore, the share of near duplicates significantly increases after discarding false positives.
Having analyzed the evaluation results, wecan make the following conclusions:
1. During our experiments, we did not manage to find any near duplicates in the considered documents that were not detected by our algorithm. However, we should note that the claim of the algorithm's high recall needs a more detailed justification.
2. Analyzing the Linux Kernel Documentation, we have concluded that it does not have any cohesive style: it was created sporadically by different authors. Virtually all its duplicates are situated locally, i.e. close to each other. For example, some author created a description of some driver's functionality using copy/paste for its similar features. At the same time, another driver was described by a different author who did not use the first driver's description at all. Consequently, there are practically no duplicates that are found throughout the whole text. Examples, warnings, notes, and other documentation elements that are preceded by different introductory sentences are not styled cohesively either. Thus, our algorithm can be used for analyzing the degree of documentation uniformity.

Table 1: Near-duplicate groups detected

Document | Size, KB | Total near-dup groups | Exact duplicate groups, % | 1-variation groups, % | 2-variation groups, % | 3-variation groups, %
1  |  892 | 1291 | 93.9 |  5.1 | 0.9 | 0.2
2  | 2924 | 6056 | 93.3 |  5.3 | 0.9 | 0.2
3  | 1810 | 4220 | 95.9 |  3.4 | 0.5 | 0.2
4  |  686 | 1500 | 96.5 |  3.3 | 0.2 | 0.1
5  | 1311 | 4688 | 95.9 |  3.3 | 0.5 | 0.0
6  | 3136 | 6587 | 93.9 |  5.1 | 0.7 | 0.2
7  | 1491 | 4537 | 92.0 |  6.0 | 1.2 | 0.4
8  | 3160 | 7804 | 95.6 |  3.5 | 0.5 | 0.2
9  | 1104 |  152 | 93.4 |  5.3 | 0.7 | 0.7
10 | 1800 | 4685 | 92.7 |  5.8 | 0.9 | 0.3
11 | 1056 | 2436 | 91.5 |  6.7 | 1.1 | 0.3
12 |   36 |   59 | 83.1 | 11.9 | 1.7 | 0.0
13 |  166 |  392 | 89.5 |  9.7 | 0.5 | 0.3
14 |  103 |  208 | 91.3 |  7.7 | 0.5 | 0.0
15 |   98 |  117 | 94.0 |  4.3 | 0.9 | 0.9
16 |  241 |  394 | 88.8 |  9.1 | 0.8 | 0.3
17 |   43 |   16 | 81.3 | 12.5 | 6.3 | 0.0
18 |   50 |   77 | 88.3 | 11.7 | 0.0 | 0.0
19 |  167 |  145 | 88.3 |  9.7 | 0.7 | 1.4
3. The algorithm performs well on two-element groups, finding near duplicate groups with different numbers of extension points. It appears that, in general, there are far fewer near duplicate groups with more than two elements.
4. Many detected duplicate groups consist of figure and table captions, page headers, parts of the table of contents, and so on; that is, they are of no interest to us. Also, many found duplicates are scattered across different elements of the document structure: for example, a duplicate can span a part of a header and a small fragment of text right after it. Such duplicates are undesirable, since they are not very useful for document writers. However, they are detected because the document structure is currently not taken into account during the search.
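One way to suppress such structure-crossing duplicates is a post-filter over the detected spans. The sketch below is ours, not part of the algorithm described in the paper; it assumes that the structural elements of a document (headers, paragraphs, captions) are available as character intervals, and the helper names are hypothetical.

```python
def crosses_structure(dup_start, dup_end, sections):
    """True if the duplicate [dup_start, dup_end) does not fit entirely
    into a single structural element of the document.

    `sections` is a list of (start, end) character intervals,
    one per structural element (header, paragraph, caption, ...)."""
    return not any(s <= dup_start and dup_end <= e for s, e in sections)

def filter_duplicates(duplicates, sections):
    """Keep only duplicates that lie inside one structural element."""
    return [(s, e) for (s, e) in duplicates
            if not crosses_structure(s, e, sections)]
```

For large documents the same containment queries could be answered with an interval tree instead of a linear scan.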
5. The 0.15 threshold used in detecting near-duplicate groups does not allow finding some significant groups (mainly small ones, 10-20 tokens in size). It may be more effective to use some function instead of a constant, which could depend, for example, on the length of the near duplicate.
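Purely as an illustration, one possible shape of such a function relaxes the relative threshold for short fragments; the minimum of 3 tokens of allowed variation is a hypothetical value, not one we have validated experimentally.

```python
def adaptive_threshold(length_in_tokens, base=0.15, min_variation_tokens=3):
    """Relative variation threshold for a near-duplicate candidate.

    For long fragments this degenerates to the constant `base` (0.15);
    for short fragments (e.g. 10-20 tokens) it guarantees that at least
    `min_variation_tokens` tokens of variation are still allowed."""
    return max(base, min_variation_tokens / length_in_tokens)
```

For a 10-token fragment the threshold becomes 0.3 instead of 0.15, so small but significant groups would no longer be filtered out.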
6. Moreover, the detected duplicate often does not contain variative information that is situated at its beginning or at its end. Sometimes it could be beneficial to include this information in order to ensure semantic completeness. Solving this problem requires a clarification and a formal definition of the semantic completeness of a text fragment. Our experiments show that this can be done in various ways; the simplest one is ensuring sentence-level granularity, i.e. including all text up to the start/end of the sentence.
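The sentence-level option can be sketched as follows. This is a minimal illustration that treats '.' as the only sentence terminator, which is of course an oversimplification for real documents:

```python
def expand_to_sentence(text, start, end):
    """Expand the fragment [start, end) to the nearest sentence
    boundaries, so that the extracted duplicate is semantically complete."""
    # Move the start back to just after the previous full stop.
    s = text.rfind('.', 0, start)
    s = 0 if s == -1 else s + 1
    # Skip the whitespace that follows the previous sentence.
    while s < len(text) and text[s].isspace():
        s += 1
    # Move the end forward to include the next full stop.
    e = text.find('.', end)
    e = len(text) if e == -1 else e + 1
    return s, e
```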
10 D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov
6 Conclusion
In this paper the formal definition of near duplicates in software documentation is given, and an algorithm for near-duplicate detection is presented. An evaluation of the algorithm using a large number of both commercial and open source documents is performed.
The evaluation shows that various types of software documentation contain a significant number of exact and near duplicates. A near-duplicate detection tool could improve the quality of documentation, while a duplicate management technique would simplify documentation maintenance.
Although the proposed algorithm provides ample evidence of text duplicates in industrial documents, it still requires improvements before it can be applied to real-life tasks. The main issues to be resolved are the quality of the detected near duplicates and the large number of false positives. Also, a detailed analysis of near-duplicate types in various sorts of software documents should be performed.