CCFinder: A Mullinguisc Token- Based Code Clone Detec.on...

28
CCFinder: A Mul.linguis.c Token- Based Code Clone Detec.on System for Large Scale Source Code Saima Sultana Tithi 03/15/2017 1

Transcript of CCFinder: A Mullinguisc Token- Based Code Clone Detec.on...

CCFinder: A Mul.linguis.c Token-Based Code Clone Detec.on System

for Large Scale Source Code

Saima Sultana Tithi 03/15/2017

1

Outline

• Overview• Duplicatedcodedetec2onprocess• Advantages• Results• Discussion

2

About the paper

• DevelopedanalgorithmtodetectduplicatedcodeinasystemandimplementedatoolnamedCCFinder(CodeCloneFinder)•  Totalcita2ons:1306• Publishedin:IEEETransac2onsonSoJwareEngineering,Volume-28,issue-7• Publica2ondate:July2002• Authors:

§  ToshihiroKamiya,OsakaUniversity,Japan§  ShinjiKusumoto,OsakaUniversity,Japan§ KatsuroInoue,OsakaUniversity,Japan

3

Clone detec.ng process Sourcefiles(input)

LexicalAnalysis

TokenSequence

Transforma2on

TransformedTokenSequence

MatchDetec2on Forma[ng

Clone-pairs(output)

CloneDetec*on

4

Step 1: Lexical Analysis Sourcefiles(input)

LexicalAnalysis

TokenSequence

Transforma2on

TransformedTokenSequence

MatchDetec2on Forma[ng

Clone-pairs(output)

CloneDetec*on

5

Step 1: Lexical Analysis

•  Eachlineofsourcefilesisdividedintotokenscorrespondingtoalexicalruleoftheprogramminglanguage

sum=3+2;

tokenize/parsing

Token TokenCategory

sum Iden2fier

= Assignmentoperator

3 Integerliteral

+ Addi2onoperator

2 Integerliteral

; Endofstatement 6

Step 2: Transforma.on Sourcefiles(input)

LexicalAnalysis

TokenSequence

Transforma2on

TransformedTokenSequence

MatchDetec2on Forma[ng

Clone-pairs(output)

CloneDetec*on

7

Step 2: Transforma.on

• Transforma2onhas2steps§ Transforma)onbyTransforma)onRules:Thetokensequenceistransformedbasedonthetransforma2onrules

§ Parameterreplacement:AJertransforma2onbyrules,eachiden2fierrelatedtotypes,variables,andconstantsisreplacedwithaspecialtoken

8

Example of Transforma.on Rules

Removenamespaceacribu2ons std::ios_base::hexhexRemovetemplateparameters vector<int>vectorRemoveaccessibilitykeywords protectedvoidfoo()voidfoo()Converttocompoundblock if(a==1)b=2;if(a==1){b=2};…

•  Theauthorsdevelopedtransforma2onrulesforallprogramminglanguagessupportedbyCCFinder,whichwereC,C++,Java,COBOL

9

Step 2: Transforma.on 1.  void print_numbers (const set<int>& s) {

2.  int c = 0;

3.  set<int>::const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector<string>& v) {

11.  int c = 0;

12.  vector<string>::const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

1.  void print_numbers (const set & s) {

2.  int c = 0;

3.  const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector & v) {

11.  int c = 0;

12.  const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

SampleCode Transformedcodebytransforma*onrules10

Step 2: Transforma.on 1.  void print_numbers (const set<int>& s) {

2.  int c = 0;

3.  set<int>::const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector<string>& v) {

11.  int c = 0;

12.  vector<string>::const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

1.  void print_numbers (const set & s) {

2.  int c = 0;

3.  const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector & v) {

11.  int c = 0;

12.  const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

SampleCode Transformedcodebytransforma*onrules11

Step 2: Transforma.on 1.  void print_numbers (const set & s) {

2.  int c = 0;

3.  const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector & v) {

11.  int c = 0;

12.  const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

1.  $p $p ($p $p & $p) {

2.  $p $p = $p;

3.  $p $p = $p.$p();

4.  for (; $p != $p. $p(); ++ $p) {

5.  $p << $p << $p

6.  << *$p << $p;

7.  ++ $p;

8.  }

9.  }

10.  $p $p ($p $p & $p) {

11.  $p $p = $p ;

12.  $p $p = $p.$p();

13.  for (; $p != $p. $p(); ++ $p) {

14.  $p << $p << $p

15.  << *$p << $p;

16.  ++ $p;

17.  }

18.  }

Transformedcodebytransforma*onrules

Thecodea:erparameterreplacement

12

Step 3: Match Detec.on Sourcefiles(input)

LexicalAnalysis

TokenSequence

Transforma2on

TransformedTokenSequence

MatchDetec2on Forma[ng

Clone-pairs(output)

CloneDetec*on

13

Step 3: Match Detec.on

• Detectsimilarcodesegmentsbasedonsuffix-treematchingalgorithmTransformedTokenSequence:$p $p ($p $p & $p) {$p $p = $p; $p $p = $p.$p(); for (; $p != $p. $p(); ++ $p) {$p << $p << $p << *$p << $p;++ $p;}} $p $p ($p $p & $p) {$p $p = $p; $p $p = $p.$p(); for (; $p != $p. $p(); ++ $p) {$p << $p << $p << *$p << $p;++ $p;}}

Createsuffixtreefrominputsequence

Longestcommonsubsequence:•  $p $p ($p $p & $p) {$p $p = $p; $p $p = $p.$p(); for (; $p != $p. $p(); ++ $p) {$p << $p

<< $p << *$p << $p;++ $p;}} •  $p $p ($p $p & $p) {$p $p = $p; $p $p = $p.$p(); for (; $p != $p. $p(); ++ $p) {$p << $p

<< $p << *$p << $p;++ $p;}} 14

Step 4: FormaOng Sourcefiles(input)

LexicalAnalysis

TokenSequence

Transforma2on

TransformedTokenSequence

MatchDetec2on Forma[ng

Clone-pairs(output)

CloneDetec*on

15

Step 4: FormaOng •  Fromtheoutputofsuffix-treematchingalgorithm,allclonesareconvertedtolinenumbersoftheoriginalcode• Here,line1-9andline10-18isaclonepair

1.  void print_numbers (const set<int>& s) {

2.  int c = 0;

3.  set<int>::const_iterator i = s.begin();

4.  for (; i != s.end(); ++i) {

5.  cout << c << ", "

6.  << *i << endl;

7.  ++c;

8.  }

9.  }

10.  void print_lines (const vector<string>& v) {

11.  int c = 0;

12.  vector<string>::const_iterator i = v.begin();

13.  for (; i != v.end(); ++i) {

14.  cout << c << ", "

15.  << *i << endl;

16.  ++c;

17.  }

18.  }

16

Advantages of using transforma.on step

public class MultiButtonUI extends ButtonUI {

public static ComponentUI createUI(JComponent a) {

ComponentUI mui = new MultiButtonUI();

return MultiLookAndFeel.createUIs(mui,

((MultiButtonUI)mui).uis, a);

}

public class MultiColorChooserUI extends ColorChooserUI {

public static ComponentUI createUI(JComponent a) {

ComponentUI mui = new MultiColorChooserUI();

return MultiLookAndFeel.createUIs(mui,

((MultiColorChooserUI)mui).uis, a);

} 17

Advantages of using transforma.on step

public class MultiButtonUI extends ButtonUI {

public static ComponentUI createUI(JComponent a) {

ComponentUI mui = new MultiButtonUI();

return MultiLookAndFeel.createUIs(mui,

((MultiButtonUI)mui).uis, a);

}

public class MultiColorChooserUI extends ColorChooserUI {

public static ComponentUI createUI(JComponent a) {

ComponentUI mui = new MultiColorChooserUI();

return MultiLookAndFeel.createUIs(mui,

((MultiColorChooserUI)mui).uis, a);

} 18

Advantages of using transforma.on step for (int i = 0; i < n; i++) {

if (a == 1) {

b = 2; }

}

for (int i = 0; i < n; i++) {

if (a == 1)

b = 2;

}

19

Advantages of using transforma.on step for (int i = 0; i < n; i++) {

if (a == 1) {

b = 2; }

}

for (int i = 0; i < n; i++) {

if (a == 1)

b = 2;

}

20

Implementa.on

• ImplementedinC++• Supports4programminglanguages:C,C++,Java,COBOL• Timeandspacecomplexityis𝑂(𝑛),wherenistotallengthofsourcefile

21

Results

• AppliedCCFinderonFreeBSD4.0(2.2Mlines),Linux2.4(2.4Mlines),NetBSD1.5(2.6Mlines)•  Time:108minutes

22

Cloneclasses Coverage(%LOC) Coverage(%file)

FreeBSD&Linux 1,091 0.8%FreeBSD0.9%Linux

3.1%FreeBSD4.6%Linux

FreeBSD&NetBSD 25,621 18.6%FreeBSD15.2%NetBSD

40.1%FreeBSD36.1%NetBSD

Linux&NetBSD 1,000 0.6%Linux0.6%NetBSD

3.3%Linux2.1%NetBSD

23

Figure:Scacerplotofclonepairshavingatleast30sametokens(about13lines)

Later Works

• BasedonCCFinder,theauthorsdevelopedanothertoolAIST-CCFinderXin2005• CCFinderXisfreelyavailablefrom:hcp://www.ccfinder.net/ccfinderxos.html,hcps://github.com/petersenna/ccfinderx-core•  SomeothertoolsfromtheauthorsofCCFinder:

•  D-CCFinder(distributedCCFinder)•  Gemini(addGUItoviewtheoutputofCCFinder)•  Aries(refactorcodebasedonclonedetec2on)•  Agec(clonedetec2onfromJavabytecode)

24

Discussion

•  Strengthsofthispaper•  Clearexplana2onofthemethod•  Appliesthetoolondifferentcodebasesandshowsalltheresultsintermsof2meprofile,memoryprofile,numberofclonepairs,andpercentageofclones

25

Discussion

• Weaknessesofthispaper:•  DidnotcompareCCFinderwithotherexis2ngtoolswithrespecttorunning2meormemoryconsump2on

•  DidnotapplyCCFinderonanybenchmarkdatasetandcalculatetheaccuracyoftheresult

26

Discussion

• HowtoimproveCCFinder?•  Tocomputetokensequencematching,CCFinderusessuffix-treebasedmatchingalgorithm,butsuffix-treeisnotspaceefficientforlargecodebases.AccordingtotheauthorsofSourcererCC,CCFinderrunsoutofmemoryforlargecodebases.Assuffix-arraybasedmatchingalgorithmismorespaceefficient,insteadofsuffix-treebasedmatchingalgorithms,suffix-arraybasedmatchingalgorithmcanbeused.

27

ThankYou!

28