Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… ·...

26
EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Lecture 25 Clone Detection CCFinder

Transcript of Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… ·...

Page 1: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Lecture 25Clone Detection

CCFinder

Page 2: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Today’s Agenda (1)

• Recap of Polymetric Views

• Class Presentation

• Suchitra (advocate)

• Reza (skeptic)

Page 3: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Today’s Agenda (2)

• CCFinder, Kamiya et al. TSE 2002

Page 4: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Recap of Polymetric Views

• Polymetric view is a customizable software visualization tool enriched software metrics.

• This tool targets initial understanding of a legacy system.

• This tool can help programmers develop a high-level mental model.

• It is simple, powerful, scalable, and customizable; however, it requires some training to parse these generated views.

Page 5: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Class Presentation

• Suchitra (advocate)

• Reza (skeptic)

Page 6: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

CCFinder

• CCFinder: A multilinguistic token-based code clone detection system for large scale source code, Kamiya et al. TSE 2002

Page 7: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Definition of Code Clones

• There is no precise or consistent definition on what clones are.

• “a code portion in source files that is identical or similar to another code”

• Clone are often operationally defined by a definition of a clone detector.

Page 8: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

When and Why do programmers create clones?

Page 9: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

When and Why do programmers create clones?

• What we have is slight different what we want.

• When reusing code as a mental macro template

• Due to programming language limitations

• Legacy code is well-tested and often reliable.

• Management reasons

• A team does not want to create a dependency on another team’s code.

• A team does not support other teams’ usage scenarios and customization

• Automatic code generation

Page 10: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Why is code cloning a problem during software evolution?

Page 11: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Why is code cloning a problem during software evolution?

• When a fault is found in one system, it may have to be propagated to other counterpart systems.

• When cloned systems require similar changes, all systems need to be modified consistently.

• If you miss to update these clones consistently, missed updates could lead to a potential bug.

• Redundant development efforts

• Code plagiarism

Page 12: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Research problem addressed by CCFinder

• How can we find clones written in popular programming languages in a fast & scalable way?

• industrial strength

• million-line size system within affordable computation time and memory

• can use heuristics for finding helpful clones

• robust to renaming & small edits

• limited uses of language-dependent clone detection

Page 13: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Approach

• Language-dependent parts

• Lexical analysis

• Rule-based source transformation

• Language-independent parts:

• Suffix-tree matching algorithm for matching token sequences

Page 14: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Rule-based Transformations

• Remove package names

• Supplement callees

• Remove initialization lists

• Separate class definitions

• Remove accessibility keywords

• Convert to compound block

Page 15: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Parameter Replacement!"#$% &%"'()#$ *)(+ '"" (+% (&'$,-#&.'()#$ &/"%, ), !#.0'&%1(# ' !"#$% &%"'()#$ *)(+ ' ,/2,%( #- (+% (&'$,-#&.'()#$ &/"%,3

!"# $%&'()* +,' -./01/&(23 40,2% 5/('*/26 40,2% 40/**%*

4#& 5/'$()('()6% %6'"/'()#$7 ,%"%!()#$7 '$1 -)"(%&)$8 #-!"#$%,7 *% 1%-)$% ,%6%&'" .%(&)!, -#& !"#$% !"',,%,3 9+%,%.%(&)!, %$'2"% /, (# '$,*%& 5/%,()#$,7 ,/!+ ', :*+)!+ !"#$%!"',, *)"" 2&)$8 (+% "'&8%,( &%1/!()#$ #- !#1% 2; &%*&)()$8*)(+ ' ,+'&%1 &#/()$%<= #& :*+)!+ !"#$% !#1% 0#&()#$ ),*)1%"; #& -&%5/%$("; /,%1 )$ (+% ,;,(%.<=

!"#"$ %&'()*+ %,- ./01 %,- .20

!"#>$? ), (+% "%$8(+ #- ' !#1% 0#&()#$ #- $3 !"#>%? -#& !"#$%!"',, % ), (+% .'@)./. !"#>$? -#& %'!+ $ )$ %3 9+% "%$8(+!'$ 2% .%',/&%1 2; (+% $/.2%& #- (#A%$,7 #& (+% ,)B%.%',/&%, ,/!+ ', CDE >(+% $/.2%& #- ")$%,7 )$!"/1)$8 $/""'$1 !#..%$( ")$%,?7 #& FCDE >(+% $/.2%& #- ")$%,7 %@!%0($/"" #& !#..%$( ")$%,?3 G$ #/& '00&#'!+7 (+% $/.2%& #-(#A%$, #- %'!+ !#1% 0#&()#$ #- ' !"#$% !"',, ), )1%$()!'"*+%$ )( ), .%',/&%1 #$ ' (&'$,-#&.%1 (#A%$ ,%5/%$!%H

!"#$%" &' "()* ++,$-.&/* " #0('$($-10$2'$+ '3!&-45"2&. +3.& +(3-& .&'&+'$3- 2%2'&# ,3/ ("/1& 2+"(& 230/+& +3.& 678

,9:) ;) '<= >?@= ABC=D EADAF=C=D D=EGA>=F=HC)

,9:) 7) #ACD9I J<?K9H: C<= J>ACC=D EG?C C?L=H4MN4C?L=H)

Authorized licensed use limited to: IEEE Xplore. Downloaded on April 22, 2009 at 09:02 from IEEE Xplore. Restrictions apply.

!"#$"%& '( )*+ &,-& .! /-0-11"1 &' &," $-.% 2.-#'%-1 '( &,"$-&0.34 5," 6'2" /'0&.'%! (0'$ 1.%" 7 &' 8 -%2 (0'$ 1.%" 77&' 78 $-9" - 61'%" /-.04 5," 6'2" /'0&.'%! (0'$ 1.%" : &' 7;-%2 (0'$ 7< &' =7 $-9" -%'&,"0 61'%" /-.04 >?@/'0&.'%! '(&,'!" 61'%"! 6-% -1!' @" 61'%" /-.0!A ,'B"C"0D B" -0" '%1E.%&"0"!&"2 .% &,'!" '( &," $-3.$?$ 1"%#&, -%2 &," &''1 2'"!%'& 0"/'0& &,".0 !?@/'0&.'%!4

F"0"D - 61'%"G0"1-&.'% .! 2"(.%"2 B.&, &," &0-%!('0$-&.'%0?1"! -%2 &," /-0-$"&"0G0"/1-6"$"%& 2"!60.@"2 -@'C"4H&,"0 61'%" 0"1-&.'%! 6-% @" 2"(.%"2 B.&, 2.(("0"%&&0-%!('0$-&.'% 0?1"! '0 @E %"#1"6&.%# &," /-0-$"&"00"/1-6"$"%&4 I% &," 6-!" !&?2."! 2"!60.@"2 .% >"6&.'% JD -

!"# $%%% &'()*(+&$,)* ,) *,-&.('% %)/$)%%'$)/0 1,23 4#0 ),3 50 6728 4994

&(:2% 4&;<=>?@;A<BC@= 'DEF> ?@; 6<G<

-CH3 43 *<AIEF J@KF3 -CH3 L3 &MF B;<=>?@;AFK J@KF NO BMF B;<=>?@;A<BC@= ;DEF>3

Authorized licensed use limited to: IEEE Xplore. Downloaded on April 22, 2009 at 09:02 from IEEE Xplore. Restrictions apply.

Page 16: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Other minor contributions

• Similar to duploc’s scatter-plot visualization

• Suggestions of metrics for clones

Page 17: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Evaluation (1)

• Research questions

• RQ1: Is CCFinder scalable and can be applied to industry size programs?

• e.g. Two versions of OpenOffice. 10 million lines in total. 68 minutes

• e.g. FreeBSD, NetBSD, and OpenBSD

• RQ2: What is the impact of each transformation rule?

Page 18: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Evaluation (2)

• RQ3: Can CCFinder be used for investigating where and how similar code fragments are used among similar software systems such as FreeBSD, NetBSD, and Linux?

• A hypothesis: FreeBSD and NetBSD are more similar to each other than Linux.

• Results: about 40% of source files in FreeBSD have clones with NetBSD; whereas less than 5% of source fules in FreeBSD or NetBSD have clones with Linux.

Page 19: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Other Existing Clone Detection Techniques (1)

• String

• Baker’s Dup: a lexer and a line-based string matching tool: it removes white spaces and comments; replaces identifiers; concatenates all files; hashes each line for comparison; and extracts a set of pairs of longest matches using a suffix tree algorithm

• Token

• CCFinder transforms tokens using a language specific rules and performs a token-by-token comparison

Page 20: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Other Existing Clone Detection Techniques (2)

• AST

• Baxter et al.’s CloneDr parses source code to build an abstract syntax tree, compares its subtrees by characterization metrics.

• Jiang et al. and Koschke et al.

• PDG

• Komondoor and Horwitz clone detector finds isomorphic PDG subgraphs using program slicing

• Krinke uses a k-length patch matching to find similar PDG subgraphs.

• PDG-based clone detectors are robst to reordered statements, code insertion and deletion, intertwined code, non-contiguous code.

Page 21: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Other Existing Clone Detection Techniques (3)

• Metric-based

• Metric-based clone detectors compare various metrics called fingerprinting functions. They find clones at a particular syntactic granularity such as a class, a function, or a method because these fingerprints are often defined for a particular syntactic unit.

Page 22: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

My general thoughts on CCFinder

• CCFinder is a robust and scalable clone detector.

• As there is no consistent definition of code clones, finding X% of clones in one system does not mean very much; however,

• Its case studies show that CCFinder can be applied to industrial size programs.

• Its case studies show that CCFinder can be used for checking hypotheses about the origin of a system.

Page 23: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Revisiting this course’s goal (1)

• I hope you had a fun learning about state-of-the-art methods and tools in software evolution research.

• You have learned how to break down challenges in constructing and evolving software.

• You have learned how to cope with software engineering problems systematically.

• Now you probably know that building and evolving large scale software systems is challenging, yet there are systematic solutions (tool support and techniques) out there.

Page 24: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Revisiting this course’s goal (2)

• I hope you gained confidence in doing research. Why? I believe that research skills are important for both practitioners and researchers.

• I hope you gained perspectives in identifying and formulating research questions.

• I hope you now have learned how to identify open problems through a literature survey.

• I hope you are more comfortable about reading research papers critically and evaluating research works.

• I hope you learned the importance of evaluation component and how to evaluate research solutions.

Page 25: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Preview for Next Lecture

• We will continue with code duplication research.

• Empirical studies of code clone genealogies, Kim et al. FSE 2005

Page 26: Lecture 25 - web.cs.ucla.eduweb.cs.ucla.edu/~miryung/teaching/EE382V-Fall2009/PDF-Lecture Sli… · EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Recap of Polymetric

EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Announcement

• The peer review form is available on the blackboard.

• Please take your graded homework -- practical uses of software evolution research, part 1.

• Your grade review period ends on Apr 27th 11:50 PM.