Most Influential Paper - SANER 2017

How Clones are Maintained 2007 - 2017

Luigi Cerulo

Max Di Penta

Lerina Aversano

University of Sannio

Chapter 1 - How everything started

Chapter 2 - The follow-up

Chapter 3 - The impact

Chapter 4 - Take-aways

Chapter Zero Prologue

… SE Prophets envisioned a new future


Clone genealogies (ESEC/FSE 2005)

Figure 1: The relationship among evolution patterns (Venn diagram relating Same, Shift, Add, Subtract, Consistent Change, and Inconsistent Change).

traces code clones in consecutive versions using a metric-based clone detector and classifies clones into four categories: new clones, modified clones, never modified clones, and deleted clones. Their analysis does not address how elements in a group of code clones change with respect to other elements in the group. To the best of our knowledge, our clone genealogy extractor (detailed in Section 4) is the first tool that systematically analyzes clone evolution patterns by monitoring how a clone group evolves.

Techniques for Analyzing Structural Changes

Origin analysis [16, 37] is similar to our genealogy analysis (described in detail in Sections 3 and 4) because it employs a cloning relationship to trace code fragments across versions. The goal of origin analysis is to understand structural changes during evolution, and it has been applied to detect splitting and merging of code fragments. However, it differs from our analysis in that (1) it semi-automatically traces only code fragments specified by a user and (2) it does not monitor operational changes to a group of code clones, such as whether clones change consistently (or inconsistently) with other elements in the same group.

Antoniol et al. proposed an automatic approach, based on vector space information retrieval, to identify several refactoring events, namely class renaming, replacement, merge, and split [4]. A similar approach was used to identify “move method” refactoring events [32]. These analyses do not focus on structural changes of code clones.

3. MODEL OF CLONE GENEALOGY

To study clone evolution structurally and semantically rather than quantitatively, we defined a model of clone genealogy. The genealogy of code clones describes how groups of code clones change over multiple versions of a program. In a clone’s genealogy, the origin of a group to which the clone belongs is traced to the previous version. The model associates related clone groups that have originated from the same ancestor clone group. In addition, the genealogy contains information about how each element in a group of clones has changed with respect to other elements in the same group.

We wrote our model in the Alloy modeling language [3] to check whether several evolution patterns can describe all possible changes to a clone group and to clarify the relationship among evolution patterns. (Our entire model is available on the web [1].)

The basic unit in our model is a Code Snippet, which has two attributes, Text and Location. Text is an internal representation of code that a clone detector uses to compare code snippets. For example, when using CCFinder [20], text is a parametrized token sequence, whereas when using CloneDr [10], text is an isomorphic AST. A Location is used to trace code snippets across multiple versions of a program; thus, every code snippet in a particular version of a program has a unique location. To determine how much the text of a code snippet has changed across versions, we define a TextSimilarity function that measures the text similarity between two texts t1 and t2 (0 ≤ TextSimilarity(t1, t2) ≤ 1). To trace a code snippet across versions, we define a LocationOverlapping function that measures how much two locations l1 and l2 overlap each other (0 ≤ LocationOverlapping(l1, l2) ≤ 1).

A Clone Group is a set of code snippets with identical text. CG.text is syntactic sugar for the text of any code snippet in a clone group CG. A Cloning Relationship is defined between two clone groups CG1 and CG2 if and only if TextSimilarity(CG1.text, CG2.text) ≥ simth, where simth is a constant between 0 and 1. An Evolution Pattern is defined between an old clone group OG in the (k − 1)th version and a new clone group NG in the kth version such that there exists a cloning relationship between NG and OG.

We defined several evolution patterns that describe all possible changes to a clone group. The relationship among evolution patterns is shown in the Venn diagram in Figure 1.

• Same: all code snippets in NG did not change from OG.
  TextSimilarity(NG.text, OG.text) = 1
  all cn:CodeSnippet | some co:CodeSnippet | cn in NG ⇒ co in OG && LocationOverlapping(cn,co) = 1
  all co:CodeSnippet | some cn:CodeSnippet | co in OG ⇒ cn in NG && LocationOverlapping(cn,co) = 1

• Add: at least one code snippet in NG is a newly added one. For example, programmers added a new code snippet to NG by copying an old code snippet in OG.
  TextSimilarity(NG.text, OG.text) ≥ simth
  some cn:CodeSnippet | all co:CodeSnippet | co in OG ⇒ cn in NG && LocationOverlapping(cn,co) = 0

• Subtract: at least one code snippet in OG does not appear in NG. For example, programmers refactored or removed a code clone.
  TextSimilarity(NG.text, OG.text) ≥ simth
  some co:CodeSnippet | all cn:CodeSnippet | cn in NG ⇒ co in OG && LocationOverlapping(cn,co) = 0

• Consistent Change: all code snippets in OG have changed consistently; thus they belong to NG together. For example, programmers applied the same change consistently to all code clones in OG.
  simth ≤ TextSimilarity(NG.text, OG.text) < 1
  all co:CodeSnippet | some cn:CodeSnippet | co in OG ⇒ cn in NG && LocationOverlapping(cn,co) > 0

• Inconsistent Change: at least one code snippet in OG changed inconsistently; thus it does not belong to NG anymore. For example, a programmer forgot to change one code snippet in OG.
  simth ≤ TextSimilarity(NG.text, OG.text) < 1
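The pattern conditions above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' Alloy model: the names `CodeSnippet`, `text_similarity`, `location_overlapping`, and `evolution_patterns` are hypothetical, `SIM_TH = 0.3` is an arbitrary stand-in for simth, text similarity is approximated with `difflib.SequenceMatcher`, and locations are assumed to be line ranges already mapped to comparable coordinates across versions. Inconsistent Change is left out because its location condition is not given in the excerpt.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

SIM_TH = 0.3  # assumption: arbitrary stand-in for the constant simth


@dataclass(frozen=True)
class CodeSnippet:
    text: str        # representation the clone detector compares
    file: str        # location: file plus an inclusive line range
    first_line: int
    last_line: int


def text_similarity(t1: str, t2: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, t1, t2).ratio()


def location_overlapping(a: CodeSnippet, b: CodeSnippet) -> float:
    """Shared lines divided by the lines spanned by both ranges, in [0, 1]."""
    if a.file != b.file:
        return 0.0
    shared = min(a.last_line, b.last_line) - max(a.first_line, b.first_line) + 1
    if shared <= 0:
        return 0.0
    span = max(a.last_line, b.last_line) - min(a.first_line, b.first_line) + 1
    return shared / span


def evolution_patterns(og, ng, sim_th=SIM_TH):
    """Return the set of patterns holding between old group og and new group ng."""
    sim = text_similarity(ng[0].text, og[0].text)  # CG.text: any snippet's text
    patterns = set()
    if sim < sim_th:
        return patterns  # no cloning relationship, hence no evolution pattern

    def best(snippet, group):
        return max(location_overlapping(snippet, other) for other in group)

    if (sim == 1 and all(best(cn, og) == 1 for cn in ng)
            and all(best(co, ng) == 1 for co in og)):
        patterns.add("Same")
    if any(best(cn, og) == 0 for cn in ng):
        patterns.add("Add")
    if any(best(co, ng) == 0 for co in og):
        patterns.add("Subtract")
    if sim < 1 and all(best(co, ng) > 0 for co in og):
        patterns.add("Consistent Change")
    return patterns


og = [CodeSnippet("a = b + c", "f.c", 10, 12), CodeSnippet("a = b + c", "g.c", 5, 7)]
ng_add = og + [CodeSnippet("a = b + c", "h.c", 1, 3)]
ng_changed = [CodeSnippet("a = b + d", s.file, s.first_line, s.last_line) for s in og]
print(evolution_patterns(og, ng_add))      # a copy pasted into h.c
print(evolution_patterns(og, ng_changed))  # the same edit applied to both clones
```

Returning a set rather than a single label reflects the Venn diagram of Figure 1, where patterns may overlap.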




















Change coupling and clones (FASE 2006)

Relation of Code Clones and Change Couplings

Fig. 2. Description of the metrics used in the visualization. [Chart axes: Clone Coverage and Coupling Coverage; circle size: Length of Clone; circle color: Number of Couplings.]

of circles in the chart. The mapping of metric values to graphical attributes is depicted in Figure 2.

The size of a circle is defined in proportion to the length of the clones. The maximum diameter is fixed and corresponds to the length of the longest clone. All other diameters are calculated proportionally to the length of the rest of the clones:

\[ \mathrm{Diameter}(A) = \mathrm{MaxDiameter} \cdot \frac{\mathrm{ClonedLines}(A, B)}{\max(\mathrm{ClonedLines}(X, Y))} \]

where MaxDiameter is a constant describing the maximal diameter of a circle and max(ClonedLines(X, Y)) is the maximum length of cloned fragments to be visualized.

The fill color of a circle is defined in a way that the highest number of couplings is displayed as red. The intermediate colors are determined by variations of the RGB value proportional to the relative number of couplings so that a gradual transition to blue is achieved, which corresponds to zero couplings. The R and B values are calculated by

\[ R = \frac{\mathrm{ChangeCouplings}(C, D, I)}{\max(\mathrm{ChangeCouplings}(X, Y, I))} \cdot 255, \quad B = 255 - R \]

where R is the RGB value for red and B the RGB value for blue of the color of the circle in the chart. C and D are the specific files under consideration. max(ChangeCouplings(X, Y, I)) represents the maximal number of change couplings between any two files X and Y during interval I.

Unlike a numerical approach, this visualization is not dependent on a significant regression. The user is able to see possible problems and to react by closer inspection of the affected files.
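The two mappings described above (clone length to circle diameter, number of change couplings to fill color) can be sketched as follows. The function names, the default `max_diameter` of 40, and the green channel fixed at 0 are assumptions made for illustration; the text only defines the R and B values.

```python
def circle_diameter(cloned_lines, max_cloned_lines, max_diameter=40.0):
    """Diameter proportional to clone length; the longest clone gets max_diameter."""
    return max_diameter * cloned_lines / max_cloned_lines


def circle_fill_rgb(couplings, max_couplings):
    """Blue for zero couplings, red for the maximum, a linear blend in between."""
    r = round(couplings / max_couplings * 255)
    b = 255 - r
    return (r, 0, b)  # assumption: green is held at 0


print(circle_diameter(50, 100))   # half the longest clone, half the diameter
print(circle_fill_rgb(0, 10))     # no couplings: pure blue
print(circle_fill_rgb(10, 10))    # most couplings: pure red
```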


“Cloning considered harmful” considered harmful (WCRE 2006)



Somebody was analyzing source code line trails (ldiff)…


MSR2007

Track the lifetime of software entities

IEEE Software, www.computer.org/software

ldiff’s ability to identify moved line blocks and thus its ability to track a software entity when its position in a file changes. To this end, we randomly generated new releases of 100 source code files selected from two open source projects (PostgreSQL and openSSH) by randomly moving code fragments within the source code file. The fragments varied from 1 line to a maximum of 1/10 of the total number of lines. We assessed the algorithm in terms of precision and recall:

precision = number of correctly detected moves / number of detected moves

recall = number of correctly detected moves / number of generated moves
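As a small worked example of the two measures, the sketch below computes them over sets of moves. Representing a move as a `(from_line, to_line)` pair and the function name `precision_recall` are assumptions for illustration, not part of ldiff.

```python
def precision_recall(detected, generated):
    """detected, generated: sets of moves, here (from_line, to_line) pairs."""
    correct = detected & generated  # correctly detected moves
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(generated) if generated else 0.0
    return precision, recall


generated = {(1, 50), (10, 80), (30, 5), (70, 20)}   # moves injected into the file
detected = {(1, 50), (10, 80), (40, 60)}             # moves the tool reported
print(precision_recall(detected, generated))
```

Here two of the three reported moves are correct and two of the four injected moves are found.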

As Figure 3a shows, the algorithm reveals a median precision of 92 percent and the recall increasing with the number of iterations, from 62 percent with one iteration to 73 percent with four iterations. Whereas the precision remains almost constant across iterations (it increases 0.7 percent from the first to the fourth iteration), the recall increases by 21 percent from the first to the fourth iteration. This difference is marginally significant: p-value 0.05 computed using a one-tailed (because we’re expecting improvements over subsequent steps) Mann-Whitney test.

The second assessment aimed to evaluate the ldiff accuracy in identifying changed, added, deleted, and unchanged source code lines by classifying changes in 11 change sets. We randomly extracted change sets from the ArgoUML Concurrent Versions System (CVS) repository, representing different types of changes, such as bug fixing, refactoring, or enhancement. We assessed the tool’s precision by manually identifying false positives in classifications the algorithm made. The 11 change sets affected from 11 to 72 files (median 19) and from 32 to 401 lines (median 42). Figure 3b shows the median ldiff and Unix diff accuracy and the interquartile range (between the third and first quartile). (For the ldiff syntax, see the “Ldiff: A Support Tool” sidebar.)

Figure 2. Tracking source code entities across subsequent system snapshots. The proposed approach enables locating a source code entity in subsequent code snapshots. It allows for identifying when a developer adds, deletes, or changes a source code line across subsequent snapshots.

[Figure content: Snapshots 1 to n, extracted from a Concurrent Versions System/Subversion archive, are laid out along a time axis. Consecutive snapshots are compared by LDA(1,2), LDA(2,3), LDA(3,4), LDA(4,5), ..., LDA(n – 1, n), and lines are labeled ADD, CHG, or DEL. Entity A and Entity B are tracked across snapshots (Entity A added, Entity A changed, Entity B changed, Entity B deleted). The example entity is shown at three revisions:]

    /*
     * foo (revision 1.3)
     */
    int foo(float a, int b) {
      return a;
    }

    // foo (revision 1.4)
    float foo(int a, int b) {
      if (b!=0) return (float)a/b;
      else return 0;
    }

    // foo (revision 1.5)
    float foo(int a, int b) {
      int c=0;
      if (b!=0) return (float)a/b;
      return c;
    }

Table 1. Similarity metrics

Set-based metrics
Dice(X, Y): the ratio between twice the size of the intersection of X and Y and the sum of the sizes of X and Y
Cosine(X, Y): the cosine of the angle between X and Y represented as vectors of a Euclidean space
Jaccard(X, Y): the fraction of common items (|X ∩ Y|) with respect to overall items (|X ∪ Y|)
Overlap(X, Y): 1 if the set X is a subset of Y or the converse; 0 if there is no overlap; a value between 0 and 1 otherwise

Sequence-based metrics
Levenshtein(X, Y): measures the minimum edit distance that transforms X into Y in terms of add, delete, and substitute operations
Jaro(X, Y): measures typical spelling deviations


IEEE Software 26.1 (2009)

Somebody else used to study clone evolution

Nice surprise! We got a grant on software evolution

Ok… that was not so much money…

Chapter One How Everything Started

What we wanted to study…

Software clones are devils?

To what extent can they be regarded as (bad/good?) software engineering practices?

Measure how clones are maintained

Tracking clone changes

[Figure: clone classes A and B tracked across snapshots Snap 1 to Snap 6, illustrating three patterns: consistent change, late propagation, and independent evolution.]

The Work


Only two projects

One clone detector

Automated clone tracking

Manual classification

Some findings

Class-level clones were mostly changed consistently; this was not the case for method- and block-level clones

13%-32% of independent evolution

Between 13% and 16% of late propagation

Late propagation was often due to different schedules and caused bugs only in a few cases


We got the Paper!

How Clones are Maintained: An Empirical Study

Lerina Aversano, Luigi Cerulo, Massimiliano Di Penta
RCOST - Research Centre on Software Technology
Department of Engineering - University of Sannio
Viale Traiano - 82100 Benevento, Italy

{aversano, lcerulo, dipenta}@unisannio.it

Abstract

Despite the conventional wisdom concerning the risks related to the use of source code cloning as a software development strategy, several studies that have appeared in the literature indicate that this is not true. In most cases clones are properly maintained and, when this does not happen, it is because cloned code evolves independently.

Stemming from previous works, this paper combines clone detection and co-change analysis to investigate how clones are maintained when an evolution activity or a bug fix impacts a source code fragment belonging to a clone class. The two case studies reported confirm that, either for bug fixing or for evolution purposes, most of the cloned code is consistently maintained during the same co-change or during temporally close co-changes.

Keywords: Clone detection, software evolution, mining software repositories

1. Introduction

Source code cloning is a practice commonly adopted in the development of software systems. It has been estimated that industrial code contains up to 20% of cloned code [17] and that, roughly, the same percentage can be found in code from open source projects [2]. Clones are often thought to be bad smells: maintenance interventions performed on a source code fragment, whether for bug fixing or for evolution purposes, may need to be propagated to all clones of such a fragment (if any). This would not happen in code not containing clones, or where the clones have been refactored. Nevertheless, whilst automatic support for clone refactoring has been proposed [3] and, sometimes, clones tend to be refactored during software evolution (as in the case of the Linux Kernel [2]), clone refactoring is a risky activity and a potential source of faults. For this reason, developers are almost always reluctant to perform it [7].

Several recent studies, such as that of Kim et al. [16], contradict the common wisdom that cloning constitutes a risky practice. As shown in a paper by Kapser and Godfrey [15], source code clones are not necessarily to be considered harmful but are, many times, a way to develop software by creating, for example, new features starting from existing, similar ones. Whilst this creates duplication, it also permits the use of stable, already tested and used code.

This paper reports results from an empirical study investigating how clones, detected in a given release of a software system, are affected by maintenance interventions. The analysis is performed by intersecting clone classes with data from Modification Transactions (MTs) mined from source code repositories. An MT identifies groups of source code lines co-changed in the same time window. The work builds upon the idea of clone patterns described by Kapser and Godfrey and of clone evolution patterns described by Kim et al., and investigates whether clones (i) are updated consistently during the same MT or near MTs, confirming the correlation between MTs and clones, as experienced by Geiger et al. [10]; (ii) evolve independently; or (iii) are subject to updates or bug fixes in different time frames. The latter constitutes a potential problem, especially when the maintenance intervention aims to fix a bug. The bug is fixed on the first clone but, either because the maintainer is not aware of the presence of a clone, or because s/he for some reason cannot propagate the fix, a new bug appears later, raising the need for a new corrective maintenance intervention.
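The three cases (i) to (iii) can be illustrated with a deliberately simplified, timing-only sketch. It is not the classification procedure used in the study: the function name `classify_change`, the representation of MTs as integer positions in the repository history, and the `near` window are hypothetical, and the sketch ignores whether the later change actually carries the same modification.

```python
def classify_change(mt, other_clone_mts, near=1):
    """Classify a change made to one clone at MT index `mt`.

    other_clone_mts: MT indices at which another clone of the same class changed.
    near: assumed maximum index distance for two MTs to count as "near".
    """
    if any(abs(mt - other) <= near for other in other_clone_mts):
        return "consistent change"      # same MT or a near MT
    if any(other > mt for other in other_clone_mts):
        return "late propagation"       # the other clone is updated only later
    return "independent evolution"      # the other clone is never updated afterwards


print(classify_change(5, {5}))    # changed together
print(classify_change(5, {12}))   # second clone changed much later
print(classify_change(5, {2}))    # second clone not touched again
```

A realistic classifier would also have to compare the content of the two changes, since a later edit to the other clone may be unrelated.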

The empirical study was carried out on source code extracted from the CVS repositories of two Java software systems, ArgoUML and DNSJava. Both case studies indicated that in very few cases clones were not consistently maintained. In particular, when this happened in connection with a bug fix, developers almost always took care of propagating the change. This was especially true for smaller, single-contributor systems like DNSJava.

The paper is organized as follows. Section 2 describes

Submit where?

WCRE?

Sorry! I’m WCRE PC co-chair

Let’s try with CSMR, it is in Amsterdam!

We got accepted! Amsterdam we’re coming


From: Massimiliano Di Penta <[email protected]> Subject: [Fwd: CSMR 2007 Notification] Date: 30 Nov 2006 15:28:59 CET To: Lerina Aversano <[email protected]>, "Luigi Cerulo" <[email protected]>

great... here are the reviews ... I actually can't tell which of the first and the third is the more negative (maybe the first)

The first reviewer's criticism is, all in all, fair, in the sense that it considers the work good even though many things were already known (as, after all, with Godfrey's paper, which despite an A received some similar comments at WCRE) and this is yet another study.. (perhaps with some more level of detail)... to be explained better in the camera ready copy

…

Look at this: if people had to follow this rule, nothing would ever get published, not even in TSE ... !!

General advice: Please submit your paper to a workshop to discuss the setup of your experiments. A submission for a conference should analyse more (>= 10) throughly selected software systems. As you suggest, your clone detection tool is very conservative, and you should perform the analyses with several different tools. Only then, your claim would be sufficiently supported.

….

Ciao Max

Amsterdam

The Conference

The talk

Chapter Two The follow-up

We need to do much better… the classification is not fully automated yet

Folks, one reviewer was upset! We also need to enlarge the study. More systems, … more…

It would be great to get a student to help us on the project

A young student wrote to us asking to spend a few months in our lab...

Suresh Thummalapenta

at the time a PhD student at NCSU with Tao Xie, now with Microsoft Research

This is great! Let’s ask Suresh to join the force on this project

Fine-level automated tracking approach

1. Identification of clone section pairs evolution
2. Identification of clone fragment pairs evolution
3. Identification of clone class evolution

[Figure: a clone class with clone fragments CF1, CF2, and CF3 and clone sections CS1 and CS2; the pairs (1,2), (2,3), and (1,3) are labeled LP or CO.]

The Study

Four projects, C and Java

Both token-based and AST-based detectors

Relation of clone evolution patterns with
• Clone granularity
• Clone radius
• Defect-proneness

Evolution Patterns

[Bar chart: percentage (0% to 80%) of Consistent, Indep. Evolution, Late Propagation, and Unknown evolution patterns for ArgoUML, JBoss, OpenSSH, and PostgreSQL.]

Late Propagation

Two PostgreSQL functions containing clones

The first underwent a bug fix

The second changed six months later: “...I had previously fixed the identical bug in oper_select_candidate, but didn't realize that the same error was repeated over here...”

Independent Evolution

ArgoUML classes GeneratorJava and GeneratorDisplay containing cloned methods

GeneratorDisplay starts to implement enhanced visualization features

After that, both change independently (no more clones)

Other Findings

Clone radius and granularity do not influence evolution patterns

Late propagation more correlated to defects than other evolution patterns

The EMSE Paper

Empir Software Eng (2010) 15:1–34
DOI 10.1007/s10664-009-9108-x

An empirical study on the maintenance of source code clones

Suresh Thummalapenta · Luigi Cerulo · Lerina Aversano · Massimiliano Di Penta

Published online: 25 March 2009
© Springer Science + Business Media, LLC 2009
Editor: Murray Wood

Abstract Code cloning has been very often indicated as a bad software development practice. However, many studies appearing in the literature indicate that this is not always the case. In fact, either changes occurring in cloned code are consistently propagated, or cloning is used as a sort of templating strategy, where cloned source code fragments evolve independently. This paper (a) proposes an automatic approach to classify the evolution of source code clone fragments, and (b) reports a fine-grained analysis of clone evolution in four different Java and C software systems, aimed at investigating to what extent clones are consistently propagated or they evolve independently. Also, the paper investigates the relationship between the presence of clone evolution patterns and other characteristics such as clone radius, clone size and the kind of change the clones underwent, i.e., corrective maintenance or enhancement.

Keywords Software clones · Software maintenance · Mining software repositories ·Clone evolution

S. Thummalapenta
North Carolina State University, Raleigh, USA
e-mail: [email protected]

L. Cerulo · L. Aversano · M. Di Penta (B)
Department of Engineering, University of Sannio, Benevento, Italy
e-mail: [email protected]

L. Cerulo
e-mail: [email protected]

L. Aversano
e-mail: [email protected]

Chapter Three: The Impact

People

Topics

Late Propagation

Clone changes

Clones and bugs

Tracking Entities

Tracking Design Patterns

An Empirical Study on the Evolution of Design Patterns

Lerina Aversano, Gerardo Canfora, Luigi Cerulo, Concettina Del Grosso, Massimiliano Di Penta

RCOST – Research Centre on Software Technology, University of Sannio
Via Traiano, 82100 Benevento, Italy

aversano@unisannio.it, [email protected], [email protected], [email protected], [email protected]

ABSTRACT

Design patterns are solutions to recurring design problems, conceived to increase benefits in terms of reuse, code quality and, above all, maintainability and resilience to changes.

This paper presents results from an empirical study aimed at understanding the evolution of design patterns in three open source systems, namely JHotDraw, ArgoUML, and Eclipse-JDT. Specifically, the study analyzes how frequently patterns are modified, what changes they undergo and what classes co-change with the patterns. Results show how patterns more suited to support the application purpose tend to change more frequently, and that different kinds of changes have a different impact on co-changed classes and a different capability of making the system resilient to changes.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools And Techniques - Object-oriented design methods

General Terms
Design, Experimentation, Measurement

Keywords
Design patterns, Software Evolution, Mining Software Repositories, Empirical Software Engineering

1. INTRODUCTION

It has been claimed that the use of design patterns (i.e., of recurring design solutions for object-oriented systems) provides several advantages, such as increased reusability, and improved maintainability and comprehensibility of existing systems [11]. A relevant benefit of design patterns is the resilience to changes, avoiding that new requirements, and in general any kind of system evolution, cause major re-design. Gamma et al. [11] state “Each design pattern lets some aspect of system structure vary independently of other aspects, thereby making a system more robust to a particular kind of change”. Advantages of design patterns include decoupling a request from specific operations (Chain of Responsibility and Command), making a system independent from software and hardware platforms (Abstract Factory and Bridge), independent from algorithmic solutions (Iterator, Strategy, Visitor), or avoiding modifying implementations (Adapter, Decorator, Visitor). Further discussion on design pattern advantages, and extensive pattern catalogues can be found in books such as [11] or [9].

ESEC/FSE’07, September 3–7, 2007, Cavtat near Dubrovnik, Croatia. Copyright 2007 ACM 978-1-59593-811-4/07/0009.

While many benefits related to the use of design patterns have been stated, little has been done to empirically investigate pattern change proneness [3] or whether there is a relationship between the presence of defects in the source code and the use of design patterns [24]. In particular, there is a lack of empirical studies aimed at analyzing what kind of changes each type of pattern undergoes during software evolution, and whether such a change can be related to changes contextually made on other classes not belonging to the pattern. The availability of source repositories for many object-oriented open source systems realized making use of design patterns, of techniques for identifying change sets [10] (i.e., sets of artifacts changed together by the same author) from source code repositories, and of design pattern detection techniques and tools [1, 8, 15, 19, 23], triggers opportunities for this kind of study.

This paper reports and discusses results from an empirical study aimed at analyzing how design patterns change during a software system lifetime, and to what extent such changes cause modifications to other classes not part of the design pattern. The study has been performed on three Java software systems, JHotDraw, ArgoUML and Eclipse-JDT. First, we detected design patterns on different subsequent releases of the three systems by using the approach and tool presented by Tsantalis et al. [23]. Then, we mined co-changes from Concurrent Versioning System (CVS) repositories to identify when a pattern changed, what kind of change was performed, which classes co-changed with the pattern, whether these classes had a dependency to or from the pattern, and what was the relationship between the type of change made and the resulting co-change.

The remainder of this paper is organized as follows. After a review of the literature in Section 2, Section 3 details the process to extract the information needed to perform the empirical study. Section 4 describes the empirical study context and research questions. Section 5 reports and discusses the case study results. Section 6 discusses the study

Tracking Design Pattern Evolution

JHotDraw
Patterns: Observer, Composite
Used for: Model View Controller of draws, handling composite figures
Purpose of change: Adding new draw elements

ArgoUML
Patterns: Adapter-Command, Decorator, Factory
Used for: Adapting/decorating UML objects to different views, executing menu actions
Purpose of change: Adding new menu actions and presentations

Eclipse-JDT
Patterns: Visitor
Used for: Visiting Java AST
Purpose of change: Adding new code analyses

Patterns with More Co-Changed Code

[Bar chart for Eclipse-JDT. x-axis: Pattern (Adapter-Command, Composite, Decorator, Factory, Observer, Prototype, Singleton, State-Strategy, Template, Visitor); y-axis: # of lines added/removed in co-changed classes, from 0 to 16000.]

Tracking Vulnerabilities

The life and death of statically detected vulnerabilities: An empirical study

Massimiliano Di Penta a,*, Luigi Cerulo b, Lerina Aversano a

a Dept. of Engineering, University of Sannio, Via Traiano, 82100 Benevento, Italy
b Dept. of Biological and Environmental Studies, University of Sannio, Via Port’Arsa, 11 – 82100 Benevento, Italy

Article info

Available online xxxx

Keywords: Software vulnerabilities, Mining software repositories, Empirical study

Abstract

Vulnerable statements constitute a major problem for developers and maintainers of networking systems. Their presence can ease the success of security attacks, aimed at gaining unauthorized access to data and functionality, or at causing system crashes and data loss. Examples of attacks caused by source code vulnerabilities are buffer overflows, command injections, and cross-site scripting.

This paper reports on an empirical study, conducted across three networking systems, aimed at observing the evolution and decay of vulnerabilities detected by three freely available static analysis tools. In particular, the study compares the decay of different kinds of vulnerabilities, characterizes the decay likelihood through probability density functions, and reports a quantitative and qualitative analysis of the reasons for vulnerability removals. The study is performed by using a framework that traces the evolution of source code fragments across subsequent commits.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Vulnerable instructions are, very often, the cause of serious problems such as security attacks, system failures or crashes. In his Ph.D. thesis [1] Krsul defined a software vulnerability as “an instance of an error in the specification, development, or configuration of software such that its execution can violate the security policy”. For business-critical systems, the presence of vulnerable instructions in the source code is often the cause of security attacks or, in other cases, of system failures or crashes. The problem is particularly relevant for any system that can be accessed over the Internet: e-banking or e-commerce systems, but also networking utilities such as Web proxies or file sharing systems, and of course Web servers. All these systems can be attacked by hackers with the objective of getting unauthorized access to system or data, or simply to cause denial of service or data loss. The number of attacks caused by some kinds of vulnerabilities is alarming: it has been reported by CERT¹ that statements vulnerable to buffer overflows are the cause of 50% of software attacks. Recent studies report an increasing trend in terms of other kinds of vulnerabilities, specifically cross-site scripting and SQL injection.² In other cases, even when no attack is performed, a vulnerability can cause system failures/crashes, which can be a considerable risk for safety-critical systems.

Detecting the presence of such instructions is therefore crucial to ensure high security and reliability. Indeed, security advisories are regularly published; see for example those of Linux distributions,³ Microsoft,⁴ those published by CERT, or by securityfocus.⁵

These advisories, however, are posted when a problem has already occurred in the application, a problem that was very often caused by the introduction of vulnerable statements into the source code. This highlights the need to identify potential problems when they are introduced, and to keep track of them during the software system lifetime, as is done, for example, for source code clones [2].

A number of automatic tools have been developed for the identification of potentially vulnerable source code statements. Most of these tools rely on static source code analysis performed in different ways: some tools merely use pattern matching, e.g., with the aim of identifying programming language functions that are known to be vulnerable, while others perform a more accurate analysis, including data-flow analysis. Although several vulnerability detection tools exist and their effectiveness has been assessed by tool developers, up to now the literature lacks studies aimed at analyzing how the presence of vulnerabilities varies during a software system lifetime, i.e., to what extent new vulnerabilities tend to be introduced when new code is added, and to what extent, over time, developers modify the system to protect it against vulnerability attacks. Nowadays, the availability of code repositories for many open source systems, of techniques for integrating data from versioning systems – Concurrent Versions Systems

0950-5849/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2009.04.013

* Corresponding author. Tel.: +39 0824 305536; fax: +39 0824 50552.

E-mail addresses: [email protected] (M.D. Penta), [email protected] (L. Cerulo), [email protected] (L. Aversano).

¹ www.cert.org.

² http://cwe.mitre.org/documents/vuln-trends/index.html.

³ www.debian.org/security and www.redhat.com/security.

⁴ www.microsoft.com/technet/security/advisory/default.mspx.

⁵ www.securityfocus.com.

Information and Software Technology xxx (2009) xxx–xxx

journal homepage: www.elsevier.com/locate/infsof

ARTICLE IN PRESS

Please cite this article in press as: M.D. Penta et al., The life and death of statically detected vulnerabilities: An empirical study, Inform. Softw. Technol. (2009), doi:10.1016/j.infsof.2009.04.013

Vulnerability Decay

Buffer Overflows

Memory Problems

Code Siblings and Licensing

Code siblings: technical and legal implications of copying code between applications

Daniel M. German†, Massimiliano Di Penta‡, Yann-Gael Gueheneuc⋆, and Giuliano Antoniol⋆

† University of Victoria, Victoria, BC, Canada

‡ RCOST–University of Sannio, Benevento, Italy

⋆ PTIDEJ Team–SOCCER Lab., DGIGL, Ecole Polytechnique de Montreal, QC, Canada

[email protected], [email protected], [email protected], [email protected]

Abstract

Source code cloning does not happen within a single system only. It can also occur between one system and another. We use the term code sibling to refer to a code clone that evolves in a different system than the code from which it originates. Code siblings can only occur when the source code copyright owner allows it and when the conditions imposed by such license are not incompatible with the license of the destination system. In some situations copying of source code fragments is allowed, legally, in one direction, but not in the other.

In this paper, we use clone detection, license mining and classification, and change history techniques to understand how code siblings under different licenses flow in one direction or the other between Linux and two BSD Unixes, FreeBSD and OpenBSD. Our results show that, in most cases, this migration appears to happen according to the terms of the license of the original code being copied, always favoring copying from less restrictive licenses towards more restrictive ones. We also discovered that sometimes code is inserted into the kernels from an outside source.

Keywords: Code licensing, software evolution, clone detection.

1 Introduction

A source code fragment (or a whole source code file) can be copied from one system to another for several reasons, including adding features already available in the other system or fixing a bug using a known and robust implementation. Such copying often happens when a developer works on both systems or migrates from one system to the other. Furthermore, to promote hardware adoption, companies often release and distribute the same code, e.g., a driver, for different operating systems and environments. In all cases, cross-system clones are introduced.

Usually, source code is distributed according to the terms of a software license. Once the developer chooses to distribute her work with a particular license, she explicitly imposes limits on what can be done with the code: if and how it can be used, modified, copied, distributed, and extended.

Software licenses may prevent or favor the migration of code fragments in one or the other direction, or both. Once having migrated, code fragments evolve constrained by the new environment. In the following, we use the term sibling to refer to a fragment of code that has been cloned from one file in one system to another file in a different system. In some cases, a sibling may span an entire file.

Then, we propose an analysis process to identify siblings and to locate potential legal issues that affect them. Investigating such issues is relevant because, from a legal point of view, two licenses can be incompatible. With incompatible licenses, code fragments cannot legally migrate between systems. The compatibility of one license with another (e.g., the new BSD License is compatible with the GNU General Public License) creates a preferential flow of code with the former license into the system with the latter.

The primary contributions of this paper can be summarized as follows: (i) we propose an approach relying on clone detection across systems and license classification to study the impact of software licenses on code siblings; (ii) we provide evidence that a preferential flow exists from FreeBSD/OpenBSD to Linux; (iii) we report unexpected results on the migration of third-party code from outside the kernels into two or more kernels.

This paper is organized as follows. After a discussion of related work in Section 2, Section 3 describes our study and the process followed to extract data from the three kernels. Section 4 presents the empirical study results, while Section 5 provides a qualitative analysis of some examples of

MSR 2009, 978-1-4244-3493-0/09/$25.00 © 2009 IEEE

Code Siblings and Licensing

[Figure: cloned fragments (siblings) shared between FreeBSD and Linux, annotated with the migration direction]

Preferential migration from OSs with permissive licenses (FreeBSD, OpenBSD) towards Linux (mainly GPL)

Migration From Third-Party Code

commit a9474917099e007c0f51d5474394b5890111614f
Author: Sean Hefty <[email protected]>
Date:   Mon Jul 14 23:48:43 2008 -0700

    RDMA: Fix license text

    The license text for several files references a third software license
    that was inadvertently copied in. Update the license to what was
    intended. This update was based on a request from HP.
    [..]

Blame-based tracking

Distinguishing Copies from Originals in Software Clones

Jens Krinke, Nicolas Gold, Yue Jia

King’s College London, Centre for Research on Evolution, Search and Testing (CREST)

{jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk

David Binkley

Loyola University Maryland, Baltimore, MD, [email protected]

ABSTRACT

Cloning is widespread in today’s systems where automated assistance is required to locate cloned code. Although the evolution of clones has been studied for many years, no attempt has been made so far to automatically distinguish the original source code leading to cloned copies. This paper presents an approach to classify the clones of a clone pair based on the version information available in version control systems. This automatic classification attempts to distinguish the original from the copy. It allows for the fact that the clones may be modified and thus consist of lines coming from different versions. An evaluation, based on two case studies, shows that when comments are ignored and a small tolerance is accepted, for the majority of clone pairs the proposed approach can automatically distinguish between the original and the copy.

Categories and Subject Descriptors

D.2.9 [Software Engineering]: Management – Software configuration management; D.2.13 [Software Engineering]: Reusable Software – Reusable libraries

General Terms

Algorithms

Keywords

Clone detection, mining software archives, software evolution

IWSC 2010, May 8, 2010, Cape Town, South Africa. Copyright 2010 ACM 978-1-60558-980-0/10/05 ...$10.00.

1. INTRODUCTION

The duplication of code is a common practice to make software development faster, to enable “experimental” development without impacting the original code, or to enable independent evolution [7]. Since these practices involve both duplication and modification, they are collectively called code cloning and the duplicated code is called a code clone. A clone group consists of code clones that are clones of each other (sometimes this is also called a clone class). During the software development life cycle, code cloning is an easy and inexpensive (in both effort and money) way to reuse existing code. However, such practices can complicate software maintenance so it has been suggested that too much cloned code is a risk, albeit the practice itself is not generally harmful [16]. Because of these problems, many approaches to detecting cloned code have been developed [2, 3, 8, 15, 18–20, 24, 26]. While methods to identify clones automatically and efficiently are to some extent understood, it is still disputable whether the presence of clones is a risk. To better understand why and how code is cloned, recent empirical studies of cloned code have focused mainly on examining the evolution of clones, such as whether cloned code is more stable or changed consistently [1, 10, 12, 17, 21, 22, 27].

A lot of research has been done on finding and identifying software clones, but without additional information it is impossible to distinguish the original from the copy. Most of the above-mentioned previous empirical studies used version control systems to extract limited information about the discovered clones; for example, when a clone appears in some previous version. However, so far there has been no general approach proposed to distinguish originals from copies except for a study done by German et al. [11] who tracked when clones appeared in the version history to identify the clone of a pair that appeared first. This paper presents an approach that uses line-by-line version information available from version control systems to distinguish the original from the copied code clone in a clone pair.

Most version control systems have a ‘blame’ command which shows author and version information for each line in a file. This information, which includes the version when the line was added or last modified, can be used as a line age: if all lines in one clone have older versions than the lines in the other clone of a clone pair, then the clone with the older lines may be the original and the other may be the copy (assuming that the clone with the oldest lines existed first). However, usually, it is not that simple because the original and the copy may have been modified in turn after the copy was created.
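
The line-age idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the per-line revision numbers have already been extracted (for example from a 'blame' command) into plain integer lists, and the function name `classify_pair` and the `tolerance` parameter are hypothetical.

```python
from typing import List


def classify_pair(ages_a: List[int], ages_b: List[int], tolerance: float = 0.0) -> str:
    """Guess which clone of a pair is the original from per-line revisions.

    ages_a and ages_b hold, for each line of a clone, the revision in which
    that line was added or last modified (a smaller number is an older line).
    tolerance is a hypothetical knob: the fraction of lines in the candidate
    original that may be as new as, or newer than, the other clone's oldest line.
    """
    if not ages_a or not ages_b:
        return "unknown"

    def is_older(candidate: List[int], other: List[int]) -> bool:
        # Count candidate lines that are NOT older than every line of the other clone.
        threshold = min(other)
        violations = sum(1 for revision in candidate if revision >= threshold)
        return violations <= tolerance * len(candidate)

    a_older = is_older(ages_a, ages_b)
    b_older = is_older(ages_b, ages_a)
    if a_older and not b_older:
        return "a is original"
    if b_older and not a_older:
        return "b is original"
    # Interleaved ages: the clones were modified in turn, so no decision is made.
    return "unknown"
```

With `tolerance=0.0` the sketch only decides when every line of one clone is strictly older than every line of the other; a small positive tolerance lets a few later-modified lines in the original be ignored.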

This paper makes the following contributions:

• A language-independent approach to identify the clones in one version of a program and distinguish the original from its copy in every clone pair by mapping the version information, retrieved from a version control system, to each line of the clones.

• Two initial case studies evaluating the approach show that when comments are ignored and a small tolerance is accepted, the majority of clone pairs can be automatically separated into the original and the copied clone.

The following section presents background on clones and clone detection and the retrieval of version information. Section 3 then presents the approach to distinguishing copied clones from original

© ACM, 2010. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in the Proceedings 4th International Workshop on Software Clones, 2010 in Cape Town, South Africa.

Cloning and Copying between GNOME Projects

Jens Krinke, Nicolas Gold, Yue Jia

King’s College London, Centre for Research on Evolution, Search and Testing (CREST)

{jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk

David Binkley

Loyola University Maryland, Baltimore, MD, [email protected]

Abstract: This paper presents an approach to automatically distinguish the copied clone from the original in a pair of clones. It matches the line-by-line version information of a clone to the pair’s other clone. A case study on the GNOME Desktop Suite revealed a complex flow of reused code between the different subprojects. In particular, it showed that the majority of larger clones (with a minimal size of 28 lines or higher) exist between the subprojects and more than 60% of the clone pairs can be automatically separated into original and copy.

I. INTRODUCTION

The duplication of code is a common practice to make software development faster, to enable “experimental” development without impacting the original code, or to enable independent evolution [1]. Since these practices involve both duplication and modification, they are collectively called code cloning and the duplicated code is called a code clone. A clone group consists of code clones that are clones of each other (sometimes this is also called a clone class). During the software development life cycle, code cloning is an easy and inexpensive (in both effort and money) way to reuse existing code. However, such practices can complicate software maintenance so it has been suggested that too much cloned code is a risk, albeit the practice itself is not generally harmful [2]. Because of these problems, many approaches to detecting cloned code have been developed [3]–[10]. While methods to identify clones automatically and efficiently are to some extent understood, it is still disputable whether the presence of clones is a risk. To better understand why and how code is cloned, recent empirical studies of cloned code have focused mainly on examining the evolution of clones, such as whether cloned code is more stable or changed consistently [11]–[17].

A lot of research has been done on finding and identifying software clones, but without additional information it is impossible to distinguish the original from the copy. Most of the above empirical studies use version control systems to extract limited information about the originals and their copied clones; for example, when a clone appears in some previous version. However, so far there have been only two approaches [18], [19] to distinguish originals from copies.

Most version control systems have a ‘blame’ command which shows author and version information for each line in a file. This information, which includes the version when the line was added or last modified, can be used as a line age: if all lines in one clone have older versions than the lines in the other clone of a clone pair, then the clone with the older lines is most likely the original and the other the copy. However, usually, it is not that simple because the original and the copy may have been modified in turn after the copy was created.

This paper makes the following contributions:

• It extends previous work [19] to automatically distinguish between copy and original by allowing the clones of a clone pair to be in different systems.

• A case study on the GNOME Desktop Suite subprojects shows that the majority of larger clones (with a minimal size of 28 lines or higher) exist between the subprojects and more than 60% of the clone pairs can be automatically separated into original and copied clone.

The following section presents background on clones and clone detection, the retrieval of version information, and the approach to distinguishing copied clones from original clones. The case study on the GNOME Desktop Suite is then discussed in Section 3. Related work is discussed in Section 4 and the last section concludes.

II. BACKGROUND

This section presents the framework in which code clones, groups of code clones, and changes to code clones are defined. This is followed by a description of how version information is retrieved from version control systems and how it is mapped onto the source code lines.

A. Code Clones

Code clones are usually described as source code ranges (or fragments) that are identical or very similar. They are grouped into clone groups (sometimes called clone classes) which are sets of identical or very similar code clones. A code clone c = (s, l, f) is the source code range starting at line s with the following l lines of code in file f, thus the last line of the code clone is line number s + l − 1. A clone group G = {c1, . . . , cn} is a set of n code clones c1, . . . , cn, where each of the code clones is a clone of the others. A group consisting of two clones is a clone pair. The clone pairs of a group are generated by pairing all clones of a group.
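
As a small illustration of these definitions, the sketch below models a code clone as the triple (s, l, f) and generates the clone pairs of a group by pairing all of its clones. The class and function names (`CodeClone`, `clone_pairs`) are hypothetical and chosen only for this example.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class CodeClone(NamedTuple):
    start: int   # s: first line of the clone
    length: int  # l: number of lines in the clone
    file: str    # f: file that contains the clone

    @property
    def last_line(self) -> int:
        # The last line of the clone is s + l - 1.
        return self.start + self.length - 1


def clone_pairs(group: List[CodeClone]) -> List[Tuple[CodeClone, CodeClone]]:
    """Generate the clone pairs of a group by pairing all of its clones."""
    return list(combinations(group, 2))
```

A group of n clones yields n(n − 1)/2 pairs, so a group of three clones produces three clone pairs.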

For the purpose of this study, the effects of split or fragmented code clones are ignored. Such clones would consist of multiple source code ranges in the same file. An example of such a code clone is a source code range that is copied and additional source code subsequently inserted into the copied code. The code clones do not have to be disjoint: it is possible

© 2010 IEEE. To be published in the Proceedings 7th IEEE Working Conference on Mining Software Repositories, 2010 in Cape Town, South Africa. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Smell Evolution

When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away)

Michele Tufano¹, Fabio Palomba², Gabriele Bavota³, Rocco Oliveto⁴, Massimiliano Di Penta⁵, Andrea De Lucia², Denys Poshyvanyk¹

¹ The College of William and Mary, Williamsburg, VA, USA

² University of Salerno, Fisciano (SA), Italy

³ Universita della Svizzera italiana (USI), Switzerland

⁴ University of Molise, Pesche (IS), Italy

⁵ University of Sannio, Benevento (BN), Italy

[email protected], [email protected], [email protected]@unimol.it, [email protected], [email protected], [email protected]

Abstract: Technical debt is a metaphor introduced by Cunningham to indicate “not quite right code which we postpone making it right”. One noticeable symptom of technical debt is represented by code smells, defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced, what is their survivability, and how they are removed by developers. To empirically corroborate such anecdotal evidence, we conducted a large empirical study over the change history of 200 open source projects. This study required the development of a strategy to identify smell-introducing commits, the mining of over half a million commits, and the manual analysis and classification of over 10K of them. Our findings mostly contradict common wisdom, showing that most of the smell instances are introduced when an artifact is created and not as a result of its evolution. At the same time, 80% of smells survive in the system. Also, among the 20% of removed instances, only 9% are removed as a direct consequence of refactoring operations.

Index Terms—Code Smells, Empirical Study, Mining Software Repositories

1 INTRODUCTION

The technical debt metaphor introduced by Cunningham [23] explains well the trade-offs between delivering the most appropriate but still immature product, in the shortest time possible [14], [23], [43], [48], [71]. Bad code smells (shortly “code smells” or “smells”), i.e., symptoms of poor design and implementation choices [28], represent one important factor contributing to technical debt, and possibly affecting the maintainability of a software system [43]. In the past and, most notably, in recent years, several studies investigated the relevance that code smells have for developers [61], [91], the extent to which code smells tend to remain in a software system for long periods of time [4], [18], [49], [65], as well as the side effects of code smells, such as an increase in change- and fault-proneness [38], [39] or decrease of software understandability [1] and maintainability [73], [90], [89]. While the repercussions of code smells on software quality have been empirically proven, there is still a noticeable lack of empirical evidence related to how, when, and why they occur in software projects, as well as whether, after how long, and how they are removed [14]. This represents an obstacle for an effective and efficient management of technical debt. Also, understanding the typical life-cycle of code smells and the actions undertaken by developers to remove them is of paramount importance in the conception of recommender tools for developers’ support. In other words, only a proper understanding of the phenomenon would allow the creation of recommenders able to highlight the presence of code smells and suggest refactorings only when appropriate, hence avoiding information overload for developers [54].

This paper is an extension of “When and Why Your Code Starts to Smell Bad” that appeared in the Proceedings of the 37th IEEE/ACM International Conference on Software Engineering (ICSE 2015), Florence, Italy, pp. 403-414, 2015 [82].

Common wisdom suggests that urgent maintenance activities and pressure to deliver features while prioritizing time-to-market over code quality are often the causes of such smells. Generally speaking, software evolution has always been considered as one of the reasons behind “software aging” [62] or “increasing complexity” [45], [56], [88]. Also, one of the common beliefs is that developers remove code smells from the system by performing refactoring operations. However, to the best of our knowledge, there is no comprehensive empirical investigation into when and why code smells are introduced in software projects, how long they survive, and how they are removed.

In this paper we fill the void in terms of our understanding of code smells, reporting the results of a large-scale empirical study conducted on the change history

Smell-introducing Commits

[Figure: metric value (y-axis, ticks 100 to 500) plotted over commits c1 to c8]

When Are Smells Introduced

[Figure: commits required for a class to become smelly; axis ticks 0, 25, 50, 75, 100]

Generally, blobs affect a class since its creation

There are several cases in which a blob is introduced during maintenance activities

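Why are smells introduced?

[Figure: distribution (axis ticks 0, 25, 50, 75, 100) across commit goals for each smell type]

Smell types: Blob (BLOB), Class Data Should Be Private (CDSBP), Complex Class (CC), Functional Decomposition (FD), Spaghetti Code (SC)

Commit goals: Bug Fixing (BF), Enhancement (E), New Feature (NF), Refactoring (R)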

Smell Removal

[Figure: bar chart, axis from 0% to 40%]

Code Removal: 40%

Code Replacement: 33%

Code Insertion: 15%

Refactoring: 9%

Major Restructuring: 4%

Clone changes

Clones and bugs

Tracking Entities

Late Propagation

Late Propagation in Software Clones

Liliane Barbour, Foutse Khomh, Ying Zou

Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON

{l.barbour, foutse.khomh, ying.zou}@queensu.ca

Abstract: Two similar code segments, or clones, form a clone pair within a software system. The changes to the clones over time create a clone evolution history. In this work we study late propagation, a specific pattern of clone evolution. In late propagation, one clone in the clone pair is modified, causing the clone pair to become inconsistent. The code segments are then re-synchronized in a later revision. Existing work has established late propagation as a clone evolution pattern, and suggests that the pattern is related to a high number of faults. In this study we examine the characteristics of late propagation in two long-lived software systems using the Simian and CCFinder clone detection tools. We define 8 types of late propagation and compare them to other forms of clone evolution. Our results not only verify that late propagation is more harmful to software systems, but also establish that some specific cases of late propagations are more harmful than others. Specifically, two cases are most risky: (1) when a clone experiences inconsistent changes and then a re-synchronizing change without any modification to the other clone in a clone pair; and (2) when two clones undergo an inconsistent modification followed by a consistent change that modifies both the clones in a clone pair.

Keywords-clone genealogies; late propagation; fault-proneness.

I. INTRODUCTION

A code segment is labeled as a code clone if it is identical or highly similar to another code segment. Similar code segments form a clone pair. Clone pairs can be introduced into systems deliberately (e.g., “copy and paste” actions) or inadvertently by a developer during development and maintenance activities. Like all code segments, code clones are not immune to change. Large software systems undergo thousands of revisions over their lifecycles. Each revision can involve modifications to code clones. As the clones in a clone pair are modified, a change evolution history, known as a clone genealogy [1], is generated.

In a previous study on clone genealogies, Kim et al. [1] define two types of evolutionary changes that can affect a clone pair: a consistent change or an inconsistent change. During a consistent change, both clones in a clone pair are modified in parallel, preserving the clone pair. In an inconsistent change, one or both of the clones evolves independently, destroying the clone pair relationship. Inconsistent changes can occur deliberately, such as when code is copied and pasted and then subsequently modified to fit the new context. For example, if a driver is required for a new printer model, a developer could copy the driver code from an older printer model and then modify it. Inconsistent changes can also occur accidentally. A developer may be unaware of a clone pair, and cause an inconsistency by only changing one half of the clone pair. This inconsistency could cause a software fault. If a fault is found in one clone and fixed, but not propagated to the other clone in the clone pair, the fault remains in the system. For example, a fault might be found in the old printer driver code and fixed, but the fix is not propagated to the new printer driver. For these reasons, previous studies [1] have argued that accidental inconsistent changes make code clones more prone to faults.

Late propagation occurs when a clone pair undergoes one or more inconsistent changes followed by a re-synchronizing change [2]. The re-synchronization of the code clones indicates that the gap in consistency is accidental. Since accidental inconsistencies are considered risky [3], the presence of late propagation in clone genealogies can be an indicator of risky, fault-prone code.
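
The pattern can be made concrete with a minimal sketch. It assumes, purely for illustration, that a clone pair's genealogy has already been reduced to one boolean per revision stating whether the two clones are consistent after that revision; the function name `has_late_propagation` is hypothetical, and the sketch does not distinguish the finer-grained late propagation types studied in the paper.

```python
from typing import List


def has_late_propagation(consistent: List[bool]) -> bool:
    """Detect a consistent -> inconsistent -> consistent sequence.

    consistent[i] is True when the two clones of the pair are consistent
    with each other after revision i (an assumed, simplified encoding).
    """
    seen_consistent = False  # the pair has existed as a consistent clone pair
    diverged = False         # an inconsistent change followed that state
    for state in consistent:
        if state:
            if diverged:
                # A re-synchronizing change after one or more inconsistent ones.
                return True
            seen_consistent = True
        elif seen_consistent:
            diverged = True
    return False
```

For example, the history `[True, False, False, True]` is reported as late propagation, while `[True, False]` is not, because the clones never re-synchronize.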

Many studies have been performed on the evolution of clones. A few (e.g., [2], [3]) have studied late propagation, and indicated that late propagation genealogies are more fault-prone than other clone genealogies. Thummalapenta et al. began the initial work in examining the characteristics of late propagation. The authors measured the delay between an inconsistent change and a re-synchronizing change and related the delay to software faults. In our work, we examine more characteristics of late propagation to determine if only a subset of late propagation genealogies are at risk of faults. In our case study, we found that late propagation genealogies account for between 8% and 21% of all clone genealogies that experience at least one change. If all late propagation genealogies are considered equally prone to faults, this means that as much as a fifth of all genealogies must be monitored for defects, which is resource intensive. Developers are interested in identifying which clones are most at risk of faults. Our goal is to support developers in allocating limited code testing and review resources towards the most risky late propagation genealogies.

In this paper, we study the characteristics of late propagation genealogies and estimate the likelihood of faults. Using clone genealogies from two open source systems,

More Detailed Genealogy

Propagation always occurs

Propagation may not occur

Propagation never occurs

Breakdown

[Figure: percentage of all LP occurrences (0% to 80%) per late propagation type LP1 to LP8, for ArgoUML - Simian, ArgoUML - CCFinder, Ant - Simian, and Ant - CCFinder; a second version of the chart highlights the types where propagation may not occur and where it never occurs]

Faults by LP Type

[Figure: percentage of fault occurrences (0% to 80%) per LP type, for Ant - Simian, ArgoUML - CCFinder, and Ant - CCFinder]

LP In Type-3 Clones

Late propagation of Type-3 Clones

Saman Bazrafshan

Universität Bremen

[email protected]

Abstract

Type-3 clones are duplicated source code fragments that span two or more identical sequences of tokens (whitespace and comments are ignored) that form a contiguous source code fragment interrupted by non-identical token sequences. Several studies on the evolution of code clones have been conducted to detect patterns that can help to manage clones [3, 6]. One of those patterns that is assumed to be of special interest is late propagation [1, 2, 4]. In this paper, ways of detecting late propagation in the evolution of type-3 clones are proposed and discussed.

1 Introduction

During the last years, different studies focused on detecting clone patterns that are considered to have a negative impact on code quality and therefore on maintainability of software. Missing or inconsistent propagation of changes to clones is identified as one pattern that may introduce new defects or prevent the removal of existing ones. To find these clone patterns and enable clone management, a series of tools have been introduced, including clone detectors and clone genealogy extractors. Clones reported by a clone detector are generally distinguished according to their level of similarity. Clones that are identical except for comments and whitespaces are called type-1 clones. Type-2 clones extend type-1 clones by tolerating differences in parameters (e.g., variables, identifiers and literals). Type-3, moreover, allow the insertion and deletion of statements. For a general overview of clone research, please refer to [5]. In this paper, we will name type-1 clones as identical clones and type-2 and type-3 clones as near-miss clones.
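
The difference between type-1 and type-2 clones can be illustrated with a rough token-based sketch. It is not how any particular clone detector works: the regular expressions assume C-like comments and a very simple token grammar, language keywords are abstracted together with identifiers, and all names (`tokens`, `abstract`, `clone_type`) are hypothetical. Type-3 clones would additionally need gap-tolerant matching, which this sketch does not attempt.

```python
import re
from typing import List

# C-style line and block comments (assumption: no comment markers inside strings).
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# Identifiers, numbers, double-quoted strings, or any other single symbol.
_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"\n]*\"|\S")
_IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def tokens(code: str) -> List[str]:
    """Tokenize code, ignoring comments and whitespace."""
    return _TOKEN.findall(_COMMENT.sub(" ", code))


def abstract(toks: List[str]) -> List[str]:
    """Replace identifiers and literals by placeholders (type-2 tolerance)."""
    result = []
    for tok in toks:
        if _IDENTIFIER.fullmatch(tok):
            result.append("ID")   # note: keywords are abstracted too
        elif tok[0].isdigit() or tok[0] == '"':
            result.append("LIT")
        else:
            result.append(tok)
    return result


def clone_type(a: str, b: str) -> str:
    """Classify two fragments as type-1, type-2, or neither."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return "type-1"
    if abstract(ta) == abstract(tb):
        return "type-2"
    return "other"
```

Two fragments that differ only in layout and comments are classified as type-1, and fragments that additionally differ in identifier names or literal values as type-2.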

An important aspect of clone management is that not all detected clones are equally relevant. To filter relevant clones out of the large number of clones reported by a clone detector, it is of advantage to extract and analyze the evolution of clones, also called clone genealogy [3, 6]. One evolution pattern that has been studied in recent studies is the late propagation. The late-propagation pattern denotes a change to one or more fragments of a clone class that is not propagated to all fragments of the clone class at the same time. Considering defect-correcting changes, late propagation pattern is an indicator that code clones were not intentionally changed inconsistently [1, 2, 4].

2 Late Propagation of Near-Miss Clones

The definition of a late propagation regarding identical clones is straightforward: an inconsistent modification of an identical clone causing the fragments to be non-identical until another inconsistent change to the fragments makes them identical again. However, the definition is not suitable for near-miss clones because they are not completely identical: changes between the identical and the non-identical parts have to be differentiated. The challenging question that arises from this fact is:

What are the essential characteristics of a change that makes an inconsistent change to a near-miss clone consistent at a later point of time?
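For identical clones, the definition given above (fragments diverge after one inconsistent change and become identical again after a later one) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `late_propagations` function and its `history` input (one pair of fragment texts per version) are hypothetical, and the sketch does not distinguish a genuine late propagation from a revert of the first change.

```python
def late_propagations(history):
    """Find diverge/re-converge intervals for a pair of identical clones.

    history: list of (fragment_a, fragment_b) texts, one pair per version.
    Returns a list of (diverged_at, reconverged_at) version indices.
    """
    events = []
    diverged_at = None
    for i in range(1, len(history)):
        was_identical = history[i - 1][0] == history[i - 1][1]
        is_identical = history[i][0] == history[i][1]
        if was_identical and not is_identical:
            # An inconsistent change made the fragments non-identical.
            diverged_at = i
        elif not was_identical and is_identical and diverged_at is not None:
            # A later change made the fragments identical again.
            events.append((diverged_at, i))
            diverged_at = None
    return events
```

With `history = [("x=1", "x=1"), ("x=2", "x=1"), ("x=2", "x=1"), ("x=2", "x=2")]` the sketch reports one interval, `(1, 3)`. Near-miss clones would need the delta comparison discussed in the text, which this sketch does not attempt.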

One way to define the late propagation pattern for near-miss clones is to focus exclusively on the identical parts of a clone disregarding the gaps as the gaps are already not common between the cloned fragments. In this case, we would regard a near-miss clone to be changed consistently if the identical parts undergo the same modifications and continue to be identical, analogously to the definition of a late propagation of identical clones. Hence, to recognize an inconsistent change to a near-miss clone that makes a preceding inconsistent change to the same clone consistent at a later time, all deltas of the clone fragments have to be remembered and compared to every new inconsistent change. Considering a clone class with more than two clone fragments, a gap between two fragments might exist at a different position, in a different form (e.g., different in size), or even not present at all compared to the other fragments of the same clone class. For this reason, it has to be taken into account that a change might hit a gap regarding one or more fragments but an identical block with respect to the other fragments of the clone class. In addition, it is possible that an inconsistent change makes more than one preceding inconsistent change to various fragments consistent at once. Thus, different combinations of deltas have to be compared against an inconsistent change for a sufficient analysis of the consistency of all fragments

ECEASST

Late Propagation in Near-Miss Clones: An Empirical Study

Manishankar Mondal1, Chanchal K. Roy2, Kevin A. Schneider3

1 [email protected], https://homepage.usask.ca/~mam815/

2 [email protected], http://www.cs.usask.ca/~croy/

3 [email protected], http://www.cs.usask.ca/~kas/

University of Saskatchewan, Canada

Abstract:

If two or more code fragments in the code-base of a software system are exactly or nearly similar to one another, we call them code clones. It is often important that updates (i.e., changes) in one clone fragment should be propagated to the other similar clone fragments to ensure consistency. However, if there is a delay in this propagation because of unawareness, the system might behave inconsistently. This delay in propagation, also known as late propagation, has been investigated by a number of existing studies. However, the existing studies did not investigate the intensity as well as the effect of late propagation in different types of clones separately. Also, late propagation in Type 3 clones is yet to be investigated. In this research work we investigate late propagation in three types of clones (Type 1, Type 2, and Type 3) separately. According to our experimental results on six subject systems written in three programming languages, late propagation is more intense in Type 3 clones compared to the other two clone-types. Block clones are mostly involved in late propagation instead of method clones. Refactoring of block clones can possibly minimize late propagation. If not refactorable, then the clones that often need to be changed together consistently should be placed in close proximity to one another.

Keywords: Code Clone; Late Propagation; Code Evolution; Software Maintenance; Method Genealogy;

1 Introduction

Software maintenance is one of the most important phases of the software development lifecycle. Studies [GH11, GK11, LW08, LW10, Kri07, Kri08, ACD07, TCAP09, BKZ13, KG08, MRR+12, MRS12c, MRS12b, MRS13] show that code clones have both positive [Kri07, GH11, GK11, KG08] and negative [LW08, LW10, MRR+12, MRS12c, MRS12b, MRS13, ACD07, TCAP09] impacts on software maintenance and evolution. Code clones are exactly or nearly similar code fragments scattered in the code-base of a software system. These are mainly created because of the frequent copy-paste activities of the programmers with an aim to repeat the same or similar functionalities during software development and maintenance. If a code fragment is copied from one place of a code-base and pasted to some other places with or without modifications, then the original code fragment and the pasted code fragments become clones of one another.

1 / 17 Volume 63 (2014)



More late propagations in type-3 clones than in others



Late propagations occur in small (block-size) clones

A Study of Consistent and Inconsistent Changes to Code Clones

Jens Krinke

FernUniversität in Hagen, Germany

[email protected]

Abstract

Code Cloning is regarded as a threat to software maintenance, because it is generally assumed that a change to a code clone usually has to be applied to the other clones of the clone group as well. However, there exists little empirical data that supports this assumption. This paper presents a study on the changes applied to code clones in open source software systems based on the changes between versions of the system. It is analyzed if changes to code clones are consistent to all code clones of a clone group or not. The results show that usually half of the changes to code clone groups are inconsistent changes. Moreover, the study observes that when there are inconsistent changes to a code clone group in a near version, it is rarely the case that there are additional changes in later versions such that the code clone group then has only consistent changes.

1 Introduction

Duplicated code is common in all kinds of software systems. Although cut-copy-paste (-and-adapt) techniques are considered bad practice, every programmer uses them.

Since these practices involve both duplication and modification, they are collectively called code cloning, while the duplicated code is called a code clone. A clone group consists of code clones that are clones of each other (sometimes this is also called a clone class). During the software development cycle, code cloning is both easy and inexpensive (in both cost and money). However, this practice can complicate software maintenance in the following ways:

• Errors may have been duplicated (cloned) in parallel with the cloned code.

• Modifications done to the original code must often be applied to the cloned code as well.

Because of these problems, research has developed many approaches to detect cloned code [5, 6, 9, 12, 16–18, 20]. In addition, some empirical work done has attempted to check whether or not the above mentioned problems are relevant in practice. Kim et al. [15] investigated the evolution of code clones and provided a classification for evolving code clones. Their work already showed that during the evolution of the code clones, consistent changes to the code clones of a group are fewer than anticipated. Aversano et al. [4] did a similar study and they state “that the majority of clone classes is always maintained consistently.” Geiger et al. [10] studied the relation of code clone groups and change couplings (files which are committed at the same time, by the same author, and with the same modification description), but could not find a (strong) relation. Therefore, this work will present an empirical study that verifies the following hypothesis:

During the evolution of a system, code clones of a clone group are changed consistently.

Of course, a system may contain bugs where a change has been applied to some code clones, but has been forgotten for other code clones of the clone group. For stable systems it can be assumed that such bugs will be resolved at a later time. This results in a second hypothesis:

During the evolution of a system, if code clones of a clone group are not changed consistently, the missing changes will appear in a later version.

This work will verify the two hypotheses by studying the changes that are applied to code clones during 200 weeks of evolution of five open source software systems. The contributions of this paper are:

• A large empirical study that examines the changes to code clones in evolving systems. This study involves both a greater number and diversity of systems than previous empirical studies.

• The study will show that both hypotheses are not generally valid for the five studied systems. In summary, clone groups are changed consistently in roughly half of the time, invalidating the first hypothesis. The second hypothesis is only partially valid. This is because

© 2007 IEEE. To be published in the Proceedings of the 14th Working Conference on Reverse Engineering, 2007 in Vancouver, Canada. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

ECEASST

Studying Late Propagations in Code Clone Evolution Using Software Repository Mining

Hsiao Hui Mui1, Andy Zaidman1 and Martin Pinzger1

1 [email protected], [email protected]

Engineering Research Group

Delft University of Technology, the Netherlands

2 [email protected]

Engineering Research Group

University of Klagenfurt, Austria

Abstract: In the code clone evolution community, the Late Propagation (LP) has been identified as one of the clone evolution patterns that can potentially lead to software defects. An LP occurs when instances of a clone pair are changed consistently, but not at the same time. The clone instance, which receives the update at a later time, might exhibit unintended behavior if the modification was a bugfix. In this paper, we present an approach to extract LPs from software repositories. Subsequently, we study LPs in four software systems, which allows us to investigate the propagation time, the clone dispersion and the effects of LPs on the software.

Keywords: code clone evolution, late propagation, software repository mining, bugs
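The propagation time mentioned in the abstract can be read as the delay between the change to the first clone instance and the later, matching change to the other instance. The following is a minimal sketch under that reading, not the authors' tooling: the function name `share_resynced_within`, the input format (pairs of timestamps per LP), and the one-day default are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def share_resynced_within(lps, limit=timedelta(days=1)):
    """Fraction of late propagations whose delay is at most `limit`.

    lps: list of (first_change_time, propagated_time) datetime pairs,
    one pair per late propagation (hypothetical input format).
    """
    if not lps:
        return 0.0
    fast = sum(1 for first, later in lps if later - first <= limit)
    return fast / len(lps)
```

Given three LPs with delays of eight hours, four days, and exactly one day, the sketch returns two thirds, since the boundary case counts as within the limit.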

1 Introduction

Research in the area of code clones has shown that 7% to 23% of the code in large software systems contains duplicated source code fragments [1, 2]. While these so-called code clones are generally considered harmful [3], other studies indicate the contrary [4]. Research has long focused on techniques for (a) finding and (b) subsequently refactoring code clones [5], however, more recently, the code clone evolution research community has taken interest in managing code clones, rather than refactoring them [6, 7]. Code clone management tools, such as CloneTracker [6] or CloneBoard [7], help developers to understand and remember where code clones are in the system; they can also help to propagate changes from one clone instance, to all instances of the clone relation.

Kim et al. [8] investigated the evolution of code clones and they found patterns of clone evolution. Aversano et al. [9] expanded on this research by adding two new patterns. One of these patterns, the Late Propagation is of particular interest to investigate, as it can provide an indication of the usefulness of code clone management tools. Figure 1 shows an example of the late propagation code clone evolution pattern. It shows two duplicated code fragments C1 and C2 in version Vi, both belonging to the same clone relation. In a subsequent version (Vj) C1 is modified while C2 is not, which means that both clones are now inconsistent. In version Vk C2

1 / 11 Volume 63 (2014)

Inconsistent? LP?




Consistent changes occur half of the time


LPs seldom occur, and most re-synchronize within one day


Clones and bugs

Tracking Entities

Late Propagation

Clone changes

Release Level Analysis

Science of Computer Programming

journal homepage: www.elsevier.com/locate/scico

An empirical study on inconsistent changes to code clones at the release level

Nicolas Bettenburg *, Weiyi Shang, Walid M. Ibrahim, Bram Adams, Ying Zou, Ahmed E. Hassan

Queen’s University, Kingston, Ontario, Canada


Article history: Available online xxxx

Keywords: Software engineering; Maintenance management; Reuse models; Clone detection; Maintainability; Software evolution

Abstract

To study the impact of code clones on software quality, researchers typically carry out their studies based on fine-grained analysis of inconsistent changes at the revision level. As a result, they capture much of the chaotic and experimental nature inherent in any ongoing software development process. Analyzing highly fluctuating and short-lived clones is likely to exaggerate the ill effects of inconsistent changes on the quality of the released software product, as perceived by the end user. To gain a broader perspective, we perform an empirical study on the effect of inconsistent changes on software quality at the release level. Based on a case study on three open source software systems, we observe that only 1.02%–4.00% of all clone genealogies introduce software defects at the release level, as opposed to the substantially higher percentages reported by previous studies at the revision level. Our findings suggest that clones do not have a significant impact on the post-release quality of the studied systems, and that the developers are able to effectively manage the evolution of cloned code.
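The difference between revision-level and release-level analysis can be made concrete with a small sketch. It is an illustration only, not the study's method: the function `visible_at_release`, the per-revision consistency flags, and the list of release revisions are hypothetical inputs. The idea is that an inconsistency that appears and is repaired between two releases never reaches the end user.

```python
def visible_at_release(consistent_by_revision, release_revisions):
    """Return the release revisions at which a clone group is inconsistent.

    consistent_by_revision: list of booleans, one per revision, True when
        the clone group is consistent at that revision (assumed input).
    release_revisions: indices of the revisions that were shipped as releases.
    """
    return [r for r in release_revisions if not consistent_by_revision[r]]
```

For `states = [True, False, True, True, False, False, True]`, releases at revisions 3 and 6 show no inconsistency even though the group was inconsistent at three revisions, whereas a release at revision 5 would expose one.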

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Code clones are the source of heated debates among software maintenance researchers. Developers typically clone (copy) existing pieces of code in order to jumpstart the development of a new feature, or to reuse robust parts of the source code for new development. However, unless a clone is reused as is, developers quickly lose track of the link between the clone and the cloned piece of code, especially after some local modifications. Losing the links between clones increases the risk of inconsistent changes. These are code changes that are applied to only one clone, whereas they should propagate to all clones, such as defect fixing changes.

There is no consensus on whether the positive traits of cloning, such as effective reuse, outweigh its drawbacks, such as increased risk of deteriorated software quality. Many researchers consider clones to be harmful [3,6,14,21,22,27,36], due to the belief that inconsistent changes increase both maintenance effort and the likelihood of introducing defects. Yet, other researchers do not find empirical evidence of harm [39,47], or even establish cloning as a valuable software engineering method to overcome language limitations or to specialize common parts of the code [10,24–26]. It is not yet clear which of these two visions prevails, or whether the right vision depends on the software system at hand [15,43,47].

Empirical studies on code clones almost exclusively focus on the impact of cloning on developers, such as the developers’ ability to keep track of all related clones in a clone group and their ability to consistently propagate changes to all clones. Many studies analyze inconsistent changes to clones and the general evolution (genealogy) of clone groups across very small

* Corresponding author. Tel.: +1 613 533 6802.

E-mail addresses: [email protected] (N. Bettenburg), [email protected] (W. Shang), [email protected] (W.M. Ibrahim), [email protected] (B. Adams), [email protected] (Y. Zou), [email protected] (A.E. Hassan).

0167-6423/$ – see front matter © 2010 Elsevier B.V. All rights reserved.

doi:10.1016/j.scico.2010.11.010

Evaluating Code Clone Genealogies at Release Level: An Empirical Study

Ripon K. Saha, Muhammad Asaduzzaman, Minhaz F. Zibran, Chanchal K. Roy, and Kevin A. Schneider

Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada S7N 5C9

{ripon.saha, md.asad, minhaz.zibran, chanchal.roy, kevin.schneider}@usask.ca

Abstract

Code clone genealogies show how clone groups evolve with the evolution of the associated software system, and thus could provide important insights on the maintenance implications of clones. In this paper, we provide an in-depth empirical study for evaluating clone genealogies in evolving open source systems at the release level. We develop a clone genealogy extractor, examine 17 open source C, Java, C++ and C# systems of diverse varieties and study different dimensions of how clone groups evolve with the evolution of the software systems. Our study shows that majority of the clone groups of the clone genealogies either propagate without any syntactic changes or change consistently in the subsequent releases, and that many of the genealogies remain alive during the evolution. These findings seem to be consistent with the findings of a previous study that clones may not be as detrimental in software maintenance as believed to be (at least by many of us), and that instead of aggressively refactoring clones, we should possibly focus on tracking and managing clones during the evolution of software systems.

1. Introduction

Programmers often copy code fragments and then paste them with or without modifications during software development. Such duplicated code fragments are known as software clones or code clones. Previous studies have shown that systems contain duplicate code in amounts ranging from 5-15% of the code-base [23] to as high as 50% [22]. Despite their usefulness [12, 15], the presence of identical or near identical code fragments may add to the difficulties of software maintenance. For example, if a bug is detected in a code fragment, all the fragments similar to it should be investigated to check for the same bug and when enhancing or adapting a piece of code, duplicated fragments can multiply the work to be done [19]. Code clones are also considered as one of the bad smells of a software system [3, 10]. Consequently, identification and management of software clones has now become an essential part of software maintenance. However, due to the intense use of template-based programming [12], a certain amount of clones are likely acceptable.

Previous studies were highly influenced by the idea that clones are harmful and can be removed through refactoring [15]. This notion has been challenged by the work of Kim et al. [15]. They provided a clone genealogy model and analyzed the clone genealogies of two open source software systems. While a clone group consists of a set of code fragments in a particular version of a software that are clones to each other, a genealogy of a clone group describes how the code fragments of that clone group propagate during the evolution of the subject system. Each clone genealogy consists of a set of clone lineages that originate from the same clone group (source). A clone lineage is a directed acyclic graph that describes the evolution history of a clone group from the beginning to the final release of the software system. The empirical study described by Kim et al. on code clone genealogy reveals that clones are not always harmful. Programmers intentionally practice code cloning to achieve certain benefits [12, 13]. During the development of a software system, many clones are short lived. Refactoring them aggressively can overburden the developers. Their study also shows that many long-lived consistently changing clones are not locally refactorable. Such clones cannot be removed from the system through refactoring [15].

We are motivated by the work of Kim et al. [15]. They were the first to analyze clone genealogies. However, they only analyzed two small Java systems. They also speculated that the selected systems might not have captured the characteristics of larger systems and thus, further empirical evaluations need to be carried out for larger systems of different languages. After Kim et al. several other researchers also investigated the maintenance implications of clones. Kapser and Godfrey [12] conducted several studies in the area and showed that clones might not always be harmful and even could be useful in a number of ways. Krinke [16, 17] studied change types and the stability of code clones based on the changes between the revisions of several open source systems. Although he analyzed several systems written in C, C++ and Java,


Most inconsistent changes to clones are not visible at release level

Risks for Clone Changes

Frequency and Risks of Changes to Clones

Nils Göde

University of Bremen

Bremen, Germany

[email protected]

Rainer Koschke


Bremen, Germany

[email protected]

ABSTRACT

Code clones (duplicated source fragments) are said to increase maintenance effort and to facilitate problems caused by inconsistent changes to identical parts. While this is certainly true for some clones and certainly not true for others, it is unclear how many clones are real threats to the system’s quality and need to be taken care of. Our analysis of clone evolution in mature software projects shows that most clones are rarely changed and the number of unintentional inconsistent changes to clones is small. We thus have to carefully select the clones to be managed to avoid unnecessary effort managing clones with no risk potential.

Categories and Subject Descriptors

D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement (restructuring, reverse engineering, and reengineering)

General Terms

Experimentation, Measurement

Keywords

Software maintenance, clone detection, clone evolution

1. INTRODUCTION

Code clones are similar fragments of source code. There are many problems caused by the presence of clones. Among others, the source code becomes larger, change effort increases, and change propagation bears the risk of unwanted inconsistencies, for example, incomplete removal of defects. Consequently, a variety of clone detection techniques and tools has evolved to identify duplicated source code within a system. In addition, various tools have been created that support developers in managing clones. These include refactoring support [4, 12], automated change propagation [6, 33], and change monitoring to prevent unintentional inconsistencies [30].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICSE '11, May 21–28, 2011, Honolulu, Hawaii, USA
Copyright 2011 ACM 978-1-4503-0445-0/11/05 ...$10.00.

There certainly exist clones that are true threats to software maintenance. Nevertheless, recent research [19, 20] doubts the harmfulness of clones in general and lists numerous situations in which clones are a reasonable design decision. From the clone management perspective, it is desirable to detect and manage only the harmful clones, because managing clones that have no negative effects creates only additional effort.

Unfortunately, state-of-the-art clone tools detect and classify clones based only on similar structures in the source code or one of its various representations. When it comes to clone-related problems, however, the most important characteristic of a clone is its change behavior and not its structure. Only if a clone changes, it causes additional change effort. Only if a clone changes, unintentional inconsistencies can arise. If, on the other hand, a clone never changes, there are no additional costs induced by propagating changes and there is no risk of unwanted inconsistencies.

Our hypothesis is that many clones detected by state-of-the-art tools are "structurally interesting" but irrelevant to software maintenance because they never change during their lifetime.

Up-to-date clone detectors can efficiently process and detect clones within huge amounts of source code, consequently delivering huge numbers of clones. In contrast, clone assessment and deciding how to proceed can be very costly even for individual clones as we have experienced with clones in our own code [11]. Hence, having many unproblematic clones in the detection results creates enormous overhead for assessing and managing clones that do not threaten maintenance because they never change.

To gain a better understanding of clones' threat potential, we conducted an extensive study on clone evolution in different systems and performed a detailed tracking to detect when and how clones had been changed. For this study, we concentrated on two prominent clone-related problems: the additional change effort caused by clones and the risk of unintentional inconsistent changes. Our research questions are the following:

Question 1: How often are clones changed throughout their lifetime?

Question 2: How many changes to clones are unintentionally inconsistent?

Contribution. The contribution of our work is a detailed analysis of how individual clones were changed throughout their lifetime. This includes the frequency of changes and


















Inconsistent changes are often intentional



Pointless to plan clone maintenance where it is not needed

Tracking Entities

Late Propagation

Clone changes

Clones and bugs

Empir Software Eng (2012) 17:503–530
DOI 10.1007/s10664-011-9195-3

Clones: what is that smell?

Foyzur Rahman · Christian Bird · Premkumar Devanbu

Published online: 24 December 2011
© Springer Science+Business Media, LLC 2011
Editors: Jim Whitehead and Tom Zimmermann

Abstract Clones are generally considered bad programming practice in software engineering folklore. They are identified as a bad smell (Fowler et al. 1999) and a major contributor to project maintenance difficulties. Clones inherently cause code bloat, thus increasing project size and maintenance costs. In this work, we try to validate the conventional wisdom empirically to see whether cloning makes code more defect prone. This paper analyses the relationship between cloning and defect proneness. For the four medium to large open source projects that we studied, we find that, first, the great majority of bugs are not significantly associated with clones. Second, we find that clones may be less defect prone than non-cloned code. Third, we find little evidence that clones with more copies are actually more error prone. Fourth, we find little evidence to support the claim that clone groups that span more than one file or directory are more defect prone than collocated clones. Finally, we find that developers do not need to put a disproportionately higher effort to fix clone dense bugs. Our findings do not support the claim that clones are really a "bad smell" (Fowler et al. 1999). Perhaps we can clone, and breathe easily, at the same time.

Keywords Empirical software engineering · Software maintenance · Software clone · Software quality · Software evolution

F. Rahman · P. Devanbu
Department of Computer Science, University of California, Davis, Davis, CA, USA

C. Bird
Empirical Software Engineering, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA










Most defect-prone code (>80%) does not contain clones












Large clones have lower defect density













Amount of changes to fix bugs is smaller for clones

Duplicate bugs in clones

Bug Replication in Code Clones: An Empirical Study

Judith F. Islam, Manishankar Mondal, Chanchal K. Roy
Department of Computer Science, University of Saskatchewan, Canada

{judith.islam, mshankar.mondal, chanchal.roy}@usask.ca

Abstract: Code clones are exactly or nearly similar code fragments in the code-base of a software system. Existing studies show that clones are directly related to bugs and inconsistencies in the code-base. Code cloning (making code clones) is suspected to be responsible for replicating bugs in the code fragments. However, there is no study on the possibilities of bug-replication through cloning process. Such a study can help us discover ways of minimizing bug-replication. Focusing on this we conduct an empirical study on the intensities of bug-replication in the code clones of the major clone-types: Type 1, Type 2, and Type 3.

According to our investigation on thousands of revisions of six diverse subject systems written in two different programming languages, C and Java, a considerable proportion (i.e., up to 10%) of the code clones can contain replicated bugs. Both Type 2 and Type 3 clones have higher tendencies of having replicated bugs compared to Type 1 clones. Thus, Type 2 and Type 3 clones are more important from clone management perspectives. The extent of bug-replication in the buggy clone classes is generally very high (i.e., 100% in most of the cases). We also find that overall 55% of all the bugs experienced by the code clones can be replicated bugs. Our study shows that replication of bugs through cloning is a common phenomenon. Clone fragments having method-calls and if-conditions should be considered for refactoring with high priorities, because such clone fragments have high possibilities of containing replicated bugs. We believe that our findings are important for better maintenance of software systems, in particular, systems with code clones.

I. INTRODUCTION

If two or more code fragments in a software system's code-base are exactly or nearly similar to one another we call them code clones [44], [45]. A group of similar code fragments forms a clone class. Code clones are mainly created because of the frequent copy/paste activities of the programmers during software development and maintenance. Whatever may be the reasons behind cloning, code clones are of great importance from the perspectives of software maintenance and evolution [44].

A great many studies [1], [2], [10]–[12], [14], [16], [18], [20]–[23], [25], [26], [36], [37], [51], [53] have been conducted on discovering the impact of cloning on software maintenance. While a number of studies [1], [11], [12], [18], [20]–[22] have revealed some positive sides of code cloning, there is strong empirical evidence [2], [10], [14], [16], [23], [25], [26], [36], [37], [51] of negative impacts of code clones too. These negative impacts include higher instability [36], late propagation [2], and unintentional inconsistencies [10]. Existing studies [2], [39] show that code clones are related to bugs in the code-base. Also, it is suspected that cloning is responsible for replicating bugs [44]. If a particular code fragment contains a bug and a programmer copies that code fragment to several other places in the code-base without the knowledge of the existing bug, the bug in the original fragment gets replicated. Fixing of such replicated bugs may require increased maintenance effort and cost for software systems. However, although cloning is suspected to be responsible for replicating bugs, there is no study on the possibilities of bug-replication through cloning. Such a study can provide us helpful insights for minimizing bug-replication as well as for prioritizing code clones for refactoring or tracking. Focusing on this we conduct an in-depth empirical study regarding bug-replication in the code clones of the major clone-types: Type 1, Type 2, Type 3.

We conduct our empirical study on thousands of revisions of six diverse subject systems written in two different programming languages (Java and C). We detect code clones from each of the revisions of a subject system using the NiCad [6] clone detector, analyze the evolution history of these code clones, and investigate whether and to what extent they contain replicated bugs. We answer four important research questions (Table I) regarding the intensity and cause of bug-replication through our investigation. According to our investigation involving rigorous manual analysis we can state that:

(1) A considerable percentage of the code clones can be related to bug-replication. According to our observation up to 10% of the code clones in a software system can contain replicated bugs.

(2) Both Type 2 and Type 3 clones have higher possibilities of containing replicated bugs compared to Type 1 clones. Thus, Type 2 and Type 3 clones should be given higher priorities for management.

(3) A considerable proportion (around 55%) of the bugs occurred in code clones can be replicated bugs.

(4) Most of the replicated bugs are related to the method-calls and if-conditions residing in the clone fragments. Thus, clone fragments containing method-calls and/or if-conditions should be considered for refactoring or tracking with high priorities.

Our findings imply that bug-replication tendencies of code clones should be taken in proper consideration when making clone management decisions. The findings from our study are important for better management of code clones as well as for better maintenance of software systems.

The rest of the paper is organized as follows. Section II contains the terminology, Section III discusses the experimental steps, Section IV describes the process of identifying the

2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering

978-1-5090-1855-0/16 $31.00 © 2016 IEEE

DOI 10.1109/SANER.2016.78



Over half of bugs occurring in clones are duplicated bugs

Chapter Four: Take-Aways

Late propagations for type-3 clones

Actually, it does not happen so often



Many clone genealogies




Consistent if we look at release level


Late propagation is highly correlated with defects










But no more than defects in non-cloned code

We now have data, infrastructure and computational power for larger, better studies

Comparing Approaches

Comparative Stability of Cloned and Non-cloned Code: An Empirical Study

Manishankar Mondal (1), Chanchal K. Roy (1), Md. Saidur Rahman (1), Ripon K. Saha (1), Jens Krinke (2), Kevin A. Schneider (1)

(1) Department of Computer Science, University of Saskatchewan, Canada
(2) University College London, UK

(1) {mshankar.mondal, chanchal.roy, saeed.cs, ripon.saha, kevin.schneider}@usask.ca

ABSTRACT

Code cloning is a controversial software engineering practice due to contradictory claims regarding its effect on software maintenance. Code stability is a recently introduced measurement technique that has been used to determine the impact of code cloning by quantifying the changeability of a code region. Although most of the existing stability analysis studies agree that cloned code is more stable than non-cloned code, the studies have two major flaws: (i) each study only considered a single stability measurement (e.g., lines of code changed, frequency of change, age of change); and, (ii) only a small number of subject systems were analyzed and these were of limited variety.

In this paper, we present a comprehensive empirical study on code stability using three different stability measuring methods. We use a recently introduced hybrid clone detection tool, NiCAD, to detect the clones and analyze their stability in four dimensions: by clone type, by measuring method, by programming language, and by system size and age. Our four-dimensional investigation on 12 diverse subject systems written in three programming languages considering three clone types reveals that: (i) Type-1 and Type-2 clones are unstable, but Type-3 clones are not; (ii) clones in Java and C systems are not as stable as clones in C# systems; (iii) a system's development strategy might play a key role in defining its comparative code stability scenario; and, (iv) cloned and non-cloned regions of a subject system do not follow a consistent change pattern.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement (Restructuring, Reverse Engineering and Reengineering).

General Terms
Measurement and Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'12 March 25-29, 2012, Riva del Garda, Italy.
Copyright 2012 ACM 978-1-4503-0857-1/12/03 ...$10.00.

Keywords
Code Stability; Modification Frequency; Average Last Change Date; Average Age; Clone Types

1. INTRODUCTION

Frequent copy-paste activity by programmers during software development is common. Copying a code fragment from one location and pasting it to another location with or without modifications cause multiple copies of exact or closely similar code fragments to co-exist in software systems. These code fragments are known as clones. Whatever may be the reasons behind cloning, the impact of clones on software maintenance and evolution is of great concern.

The common belief is that, the presence of duplicate code poses additional challenges to software maintenance by making inconsistent changes more difficult, introducing bugs and as a result increasing maintenance efforts. From this point of view, some researchers have identified clones as "bad smells" and their studies showed that clones have negative impact on software quality and maintenance [7, 14, 15]. On the other hand, there has been a good number of empirical evidence in favour of clones concluding that clones are not harmful [1, 6, 9, 10, 18]. Instead, clones can be useful from different points of views [8].

A widely used term to assess the impact of clones on software maintenance is stability [6, 11, 12, 14]. Because if cloned code is more stable (changes less frequently) as compared to non-cloned code during software evolution, it can be concluded that cloned code does not significantly increase maintenance efforts. Different researchers have defined and evaluated stability from different viewpoints which can be broadly divided into two categories:

(1) Stability measurement in terms of changes: Some methodologies [6, 11, 14, 5] have measured stability by quantifying the changes to a code region using two general approaches: (i) determination of the ratio of the number of lines added, modified and deleted to the total number of lines in a code region (cloned or non-cloned) [11, 14, 5] and (ii) determination of the frequency of modifications to the cloned and non-cloned code [6] with the hypothesis that the higher the modification frequency of a code region is the less stable it is.

(2) Stability measurement in terms of age: This approach [12] determines the average last changed dates of cloned and non-cloned code of a subject system. The hypothesis is that the older the average last change date of a code region is, the more stable it is.
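The two families of stability measures described above can be illustrated with a small sketch. The function names, the input shapes, the sample figures, and the choice to average dates through their ordinal day numbers are assumptions made for illustration; they are not taken from the cited studies.

```python
from datetime import date


def change_ratio(added, modified, deleted, total_lines):
    """Change-based stability: share of lines added, modified, or deleted
    relative to the total lines of a code region (cloned or non-cloned)."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return (added + modified + deleted) / total_lines


def average_last_change_date(last_change_dates):
    """Age-based stability: mean of the last-change dates of a region.
    Averaging on ordinal day numbers is an illustrative assumption."""
    if not last_change_dates:
        raise ValueError("need at least one date")
    ordinals = [d.toordinal() for d in last_change_dates]
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))


# Hypothetical figures for one subject system.
cloned_ratio = change_ratio(added=2, modified=3, deleted=5, total_lines=100)
non_cloned_ratio = change_ratio(added=40, modified=30, deleted=30, total_lines=400)
cloned_age = average_last_change_date([date(2010, 1, 1), date(2010, 1, 3)])
```

Under the hypotheses quoted above, a lower change ratio or an older average last-change date would be read as a sign of a more stable region.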

Genealogy Extractors

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

Ripon K. Saha, Chanchal K. Roy, Kevin A. Schneider
Department of Computer Science, University of Saskatchewan, Canada

{ripon.saha, chanchal.roy, kevin.schneider}@usask.ca

Abstract: Extracting code clone genealogies across multiple versions of a program and classifying them according to their change patterns underlies the study of code clone evolution. While there are a few studies in the area, the approaches do not handle near-miss clones well and the associated tools are often computationally expensive. To address these limitations, we present a framework for automatically extracting both exact and near-miss clone genealogies across multiple versions of a program and for identifying their change patterns using a few key similarity factors. We have developed a prototype clone genealogy extractor, applied it to three open source projects including the Linux Kernel, and evaluated its accuracy in terms of precision and recall. Our experience shows that the prototype is scalable, adaptable to different clone detection tools, and can automatically identify evolution patterns of both exact and near-miss clones by constructing their genealogies.

Index Terms: clone genealogy extractor; mapping; clone evolution.

I. INTRODUCTION

The investigation and analysis of code clones has attracted considerable attention from the software engineering research community in recent years. Researchers have presented evidence that code clones have both positive [10], [22] and negative [16] consequences for maintenance activities and thus, in general, code clones are neither good nor bad. It is also not possible or practical to eliminate certain clone classes from a software system [10]. Consequently, the identification and management of software clones, and the evaluation of their impact has become an essential part of software maintenance. Knowing the evolution of clones throughout a system's history is important for properly comprehending and managing the system's clones [9].

There has been quite a bit of research on studying code clone evolution. Most of these studies investigate retrospectively how clones are modified by constructing a clone genealogy. A clone genealogy describes how the code fragments of a clone class propagate through versions during the evolution of the subject system. Therefore, accurately mapping clones between versions of a program, and classifying their change patterns are the fundamental tasks for studying clone evolution.

Researchers have used three different approaches to map clones across multiple versions of a program. In the first approach [2], [10], [20], clones are detected in all the versions of interest and then clones are mapped between consecutive versions based on heuristics. In the second approach [1], clones are detected in the first version of interest, and then they are mapped to consecutive versions based on change logs provided by source code repositories such as svn. In the third approach [15], [6], clones are mapped during clone detection based on source code changes between revisions. A combination of the first and second approaches has also been used in some studies [3].

Although intuitive, each of these approaches has some limitations. In the first approach, a number of the similarity metrics used to map clones have quadratic time complexities [9]. In addition, if a clone fragment changes significantly in the next version and goes beyond the given similarity threshold of the clone genealogy extractor, a mapping may not be identified. In the second approach, only clones identified in the first version are mapped. Therefore, we do not know what happens to clones introduced in later versions. The third approach ("incremental approach") avoids some of the limitations of the previous two approaches by combining detection and mapping, and works well for mapping clones in many versions. By integrating clone detection and clone mapping this approach can be faster than the approaches that require clone detection to be conducted separately for each version. Although this incremental approach is fast enough both for detection and mapping for a given set of revisions, it might not be as beneficial at the release level [6] because there might be a significant difference between the releases. Furthermore, in the sole available incremental tool, iClones [6] (available for academic purpose), when a new revision or release is added for mapping, the whole detection and mapping process should be repeated since clones are both detected and mapped simultaneously. Clone management is likely being conducted on a changing system, and it is a disadvantage for an approach to require detecting clones for all revision/versions each time a new revision/version is added. Another issue with the incremental mapping is that it cannot utilize the results obtained with a classical non-incremental clone detection tool as the detection of clones and their mapping is done simultaneously. Most of the existing clone detection tools are non-incremental. There is also no representative tool available. Depending on the task at hand and the availability of tools, one might want to study cloning evolution with several clone detection tools. It is thus important to have a clone evolution analysis tool in place independent of the clone detection tools. Scalability of the incremental approaches is another great challenge because of huge memory requirements.

Again, while most of these approaches [1], [2], [3], [10], [20] are based on the state of the art detection and mapping techniques, they only focused on Type-1 and Type-2 clones.
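The first approach (detect clones in every version, then map them between consecutive versions with a heuristic) can be sketched as follows. This is a minimal illustration and not the algorithm of any cited tool: the text similarity from Python's standard `difflib`, the greedy best-match rule, the function name `map_clones`, and the threshold of 0.7 are all assumptions.

```python
from difflib import SequenceMatcher


def map_clones(old_fragments, new_fragments, threshold=0.7):
    """Map each clone fragment of version n to its most similar fragment
    in version n+1, or to None when nothing reaches the threshold.

    Both arguments are dicts of fragment id -> source text (hypothetical
    input shape). Every pair is compared, so the cost grows with the
    product of the two fragment counts."""
    mapping = {}
    for old_id, old_text in old_fragments.items():
        best_id, best_score = None, threshold
        for new_id, new_text in new_fragments.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_id, best_score = new_id, score
        mapping[old_id] = best_id
    return mapping


version_1 = {"a": "int x = foo(y);\nreturn x;"}
version_2 = {"a2": "int x = foo(z);\nreturn x;", "b": "zzzzzzzz"}

mapped = map_clones(version_1, version_2)
unmapped = map_clones({"c": "abcdef"}, {"b": "zzzzzzzz"})
```

A fragment that changes beyond the threshold ends up mapped to `None`, which is one way a genealogy can be cut short when a clone is heavily edited between versions.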

Tooling

Clone Detection In Modern IDEs

https://blogs.msdn.microsoft.com/zainnab/2012/06/28/visual-studio-2012-new-features-code-clone-analysis/

Clone Tracking Should Also Be Put Into Practice


Clone Region Descriptors: Representing and Tracking Duplication in Source Code

EKWA DUALA-EKOKO and MARTIN P. ROBILLARD
McGill University

Source code duplication, commonly known as code cloning, is considered an obstacle to software maintenance because changes to a cloned region often require consistent changes to other regions of the source code. Research has provided evidence that the elimination of clones may not always be practical, feasible, or cost-effective. We present a clone management approach that describes clone regions in a robust way that is independent from the exact text of clone regions or their location in a file, and that provides support for tracking clones in evolving software. Our technique relies on the concept of abstract clone region descriptors (CRDs), which describe clone regions using a combination of their syntactic, structural, and lexical information. We present our definition of CRDs, and describe a clone tracking system capable of producing CRDs from the output of different clone detection tools, notifying developers of modifications to clone regions, and supporting updates to the documented clone relationships. We evaluated the performance and usefulness of our approach across three clone detection tools and five subject systems, and the results indicate that CRDs are a practical and robust representation for tracking code clones in evolving software.
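The idea of describing a clone region independently of its exact text or file location can be illustrated with a small sketch. The fields below (file, class, method signature, nested block kinds) and the dictionary shape of a detector-reported region are hypothetical simplifications; the actual CRD definition from the article is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneRegionDescriptor:
    """Hypothetical, simplified descriptor: identifies a clone region by
    its enclosing structure rather than by line numbers or exact text."""
    file_name: str
    class_name: str
    method_signature: str
    block_anchors: tuple = ()  # e.g. ("for", "if") for nested blocks


def describe(region):
    """Build a descriptor from a detector-reported region (a dict here).
    Line offsets are deliberately ignored."""
    return CloneRegionDescriptor(
        file_name=region["file"],
        class_name=region["class"],
        method_signature=region["method"],
        block_anchors=tuple(region.get("blocks", ())),
    )


before = {"file": "Parser.java", "class": "Parser", "method": "parse(String)",
          "blocks": ["for", "if"], "start_line": 40, "end_line": 55}
# The same region after unrelated code was inserted above it.
after = dict(before, start_line=72, end_line=90)

same_region = describe(before) == describe(after)
```

Because line numbers are left out of the descriptor, the region is still recognised after it moves within the file, whereas a change to the enclosing method signature would yield a different descriptor.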

Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement

General Terms: Design, Experimentation

Additional Key Words and Phrases: Source code duplication, code clones, clone detection, refactoring, clone management

ACM Reference Format:
Duala-Ekoko, E. and Robillard, M. P. 2010. Clone region descriptors: Representing and tracking duplication in source code. ACM Trans. Softw. Eng. Methodol. 20, 1, Article 3 (June 2010), 31 pages.
DOI = 10.1145/1767751.1767754 http://doi.acm.org/10.1145/1767751.1767754

This work was supported by NSERC.
This article is a revised and extended version of an article presented at ICSE 2007 in Minneapolis, MN.
Authors' address: School of Computer Science, McGill University, 3480 University Street, McConnell Engineering Building no. 318, Montreal, P.Q., Canada, H3A 2A7; email: {ekwa, martin}@mcgill.ca.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481.
© 2010 ACM 0163-5948/2010/06-ART3 $10.00
DOI 10.1145/1767751.1767754 http://doi.acm.org/10.1145/1767751.1767754

ACM Transactions on Software Engineering and Methodology, Vol. 20, No. 1, Article 3, Publication date: June 2010.

Applying Clone Change Notification System into anIndustrial Development Process

Yuki Yamanaka ∗, Eunjong Choi ∗, Norihiro Yoshida †, Katsuro Inoue ∗, Tateki Sano ‡

∗ Graduate School of Information Science and Technology, Osaka University, Japan
{y-yuuki, ejchoi, inoue}@ist.osaka-u.ac.jp

† Graduate School of Information Science, Nara Institute of Science and Technology, Japan
[email protected]

‡ Software Process Innovation and Standardization Division, NEC Corporation, Japan
[email protected]

Abstract—Programmers tend to write code clones unintentionally even when they could easily avoid them. Clone change management is a crucial issue in open source software (OSS) development as well as in industrial software development (e.g., development of social infrastructure, financial systems, and medical equipment). When an industrial developer fixes a defect, he/she has to find the code clones corresponding to the code fragment that contains it. So far, several studies have analyzed clone evolution in OSS. However, to our knowledge, few studies have reported on applying a clone change notification system to an industrial development process. In this paper, we introduce a system for notifying developers of the creation and change of code clones, and then report on our experience with a 40-day application of it to a development process at NEC Corporation. In the industrial application, a developer successfully identified ten unintentionally developed clones that should be refactored.

Index Terms—Code Clone, Software Maintenance, Refactoring

I. INTRODUCTION

A code clone is a code fragment that has similar or identical code fragments in source code. Many code clone detection tools [1], [2], [3] have been proposed to capture various aspects of source code similarity. A code clone detection tool generally finds all source code clones that match its own definition of code clone; therefore, a tool may report a large number of code clones for large-scale software. On the other hand, software developers are interested in only the subset of code clones that are relevant to their activities [4]. For example, although refactoring [5] is a promising activity for improving the maintainability of code clones, a code clone is not always appropriate for refactoring. One reason is that developers sometimes have to repeatedly write code clones that cannot be merged because of the limited expressiveness of a programming language [6], [7], [8]. However, a clone set (i.e., a set of code clones identical or similar to each other) indicates considerable opportunities for developers to merge code clones into one or a few program units (e.g., Java methods) by refactoring [6], [7], [8].

Refactoring aimed at merging code clones is required not only in open source projects but also in industry. A development team at NEC Corporation, a Japanese multinational IT company, has been developing a web application. Because the team plans long-term maintenance as well as reuse in other system developments, the developers are highly motivated to merge code clones into a single module.

However, the cost of refactoring cannot be ignored, especially in industry. Regression testing after refactoring is costly, since it must confirm that behavior is preserved. The development team at NEC also considers the cost of refactoring. Basically, they do not touch source code after the large-scale system test for releasing a major version of the software, because refactoring after the large-scale test means re-running that costly test. Therefore, they need to learn of newly appeared clones regularly, especially before the large-scale system test.

In this paper, we present a clone change notification system, Clone Notifier (see Figure 3), for promoting efficient clone management (e.g., refactoring, simultaneous editing). Clone Notifier regularly notifies developers of newly appeared and changed clones. As an industrial application, we applied Clone Notifier to the development process of the web application at NEC. The result shows 119 newly appeared clone sets, ten of which were recognized as refactoring candidates by an experienced project manager (i.e., he recognized that each of the ten clone sets should be merged into a single module).

As an ex-post analysis, we investigated the characteristics of the clone sets recognized as refactoring candidates by the experienced project manager at NEC. The aim of the analysis is to collect data for developing a technique that recommends refactoring candidates from all newly appeared and changed clones. Such recommendation promises to help developers reduce the cost of finding clone sets that should be merged into a single module.

The rest of the paper is organized as follows: Section II provides a brief explanation of CCFinder, a code clone detection tool. Section III describes the categorization of code clones and clone sets based on the evolution patterns between two versions of source code. Section IV explains our clone change notification system, Clone Notifier. Section V describes the results of the industrial application and feedback from the project manager. Section VI explains the ex-post analysis. Section VII discusses threats to validity. Section VIII presents some related

978-1-4673-3092-3/13/$31.00 © 2013 IEEE

ICPC 2013, San Francisco, CA, USA
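The excerpt above describes notifying developers of newly appeared and changed clone sets between two versions of the source code. A minimal sketch of that comparison step is given below. It is not the Clone Notifier implementation: it assumes each clone set already has a stable identifier across versions and that each fragment is reduced to a fingerprint string, and the `classify_clone_sets` function and the sample data are hypothetical.

```python
def classify_clone_sets(old, new):
    """Compare clone sets of two versions.

    old, new: dict mapping a clone-set id to a frozenset of fragment
    fingerprints (both the ids and the fingerprints are assumed inputs).
    """
    report = {"new": [], "changed": [], "deleted": [], "stable": []}
    for set_id in sorted(new):
        if set_id not in old:
            report["new"].append(set_id)      # newly appeared clone set
        elif old[set_id] != new[set_id]:
            report["changed"].append(set_id)  # fragments added, removed, or edited
        else:
            report["stable"].append(set_id)
    report["deleted"] = sorted(s for s in old if s not in new)
    return report

# Hypothetical clone sets detected in two consecutive versions.
previous = {
    "cs1": frozenset({"a", "b"}),
    "cs2": frozenset({"c", "d"}),
    "cs3": frozenset({"e", "f"}),
}
current = {
    "cs1": frozenset({"a", "b"}),
    "cs2": frozenset({"c", "d", "g"}),
    "cs4": frozenset({"h", "i"}),
}

report = classify_clone_sets(previous, current)
print(report["new"], report["changed"])  # ['cs4'] ['cs2']
```

Only the "new" and "changed" lists would be sent to developers in a notification of this kind; matching clone sets across versions, which this sketch takes as given, is the harder part in practice.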

AMIC (Automatic Mining of Important Clones)

http://sr-p2irc-big2.usask.ca/amic/

Above all, in Continuous Integration [Duvall et al., 2007]

[Figure: continuous integration workflow, after Duvall et al., 2007. Actors: Developers, SCM Server, CI Server. Arrows: Push changes, Poll, Feedback. Stages: Compile, Test, Integrate, Check, Deploy…]

Survey in ING NL

[Figure: bar chart "Metrics Collected to Monitor Source Code Quality"; axis: % of respondents, 0% to 100%. Metrics listed: Amount of duplicated codes, Cyclomatic complexity, Number of function parameters, Lines of Code (LOC), Comment words, Number of source files, Other. Bar values as extracted: 15%, 16%, 18%, 44%, 51%, 69%, 78%.]

Cloning From Forums…

Tomorrow 9:00AM: Stack Overflow: A Code Laundering Platform? Le An, Ons Mlouki, Foutse Khomh and Giuliano Antoniol

Conclusion


[Figure: Clone class A and Clone class B]

[Figure: bar chart "Evolution Patterns"; axis: 0% to 80%. Bar values as extracted: 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%.]

Tracking Entities

Late Propagation

Clone changes

Clones and bugs



