Detection of Plagiarism In University Projects Using Metrics-Based Similarity

Dagstuhl Seminar on Duplication, Redundancy, and Similarity in Software

July, 2006

Ettore Merlo, Ecole Polytechnique de Montréal


Ettore Merlo (Ecole Polytechnique de Montréal), (C) Copyright, July 2006

Context

• Detect plagiarism in first-year programming projects at university
  – Programming skills have to be developed during courses


Plagiarism Detection

• Comparison of sets of syntactic blocks
• Spectral analysis of similarity
  – Increasing thresholds
  – Spectral shape parameters are computed
• Projects are ranked by similarity spectrum
• The most similar projects are considered as candidates for plagiarism


Plagiarism Problem

• Detect code transformations that require little programming effort yet make the source code look different
  – Identifiers changed by editing operations
  – Source code layout changed (comments, indentation, order of procedures, functions, and methods, file structure)
  – Constants changed (initialization, loops)


Metrics-Based Similarity

• Definition
  – Two code fragments are similar if their associated vectors of metrics satisfy some similarity criterion


Similarity Identification Process

[Figure: processing pipeline. Source code → Parsing and Analysis → Abstract Syntax Tree → Metrics Extraction → Metrics → Clones Extraction → Clones. Each fragment is described by a vector of k metrics: F1: (m11, m12, …, m1k), …, Fj: (mj1, mj2, …, mjk).]


Metrics Extraction

• Metrics for similarity detection
  – Volume
  – Complexity
  – Module/function interface
  – Call graph structure
  – Local memory
  – Global memory
  – Dataflow


Metrics Matching

• similar(fi, fj) holds when | mk(fi) – mk(fj) | <= thk
  – for all k within the size of the metrics vector
  – mk(f) is the k-th metric of fragment f and thk is the threshold for that metric
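The matching criterion above could be written as the following minimal Python sketch. The function signature, the example vectors, and the ordering of metrics as (CALLS, LOCALS, STMNT) are assumptions made for illustration, not part of the original slides.

```python
from typing import Sequence


def similar(m_i: Sequence[float], m_j: Sequence[float], th: Sequence[float]) -> bool:
    """True when every metric differs by at most its own threshold."""
    if not (len(m_i) == len(m_j) == len(th)):
        raise ValueError("metric vectors and thresholds must have the same length")
    return all(abs(a - b) <= t for a, b, t in zip(m_i, m_j, th))


# Hypothetical metric vectors, ordered as (CALLS, LOCALS, STMNT).
f1 = (2, 3, 10)
f2 = (3, 3, 12)
f3 = (2, 3, 15)
th = (1, 1, 3)

f1_matches_f2 = similar(f1, f2, th)  # every difference is within its threshold
f1_matches_f3 = similar(f1, f3, th)  # the third metric differs by 5 > 3
```

Because the test is a conjunction over all metrics, a single metric outside its threshold is enough to reject a pair.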


Metrics Matching Complexity

• n = | fragments_set |
• Exact solution algorithms show a worst-case O(n^2) complexity in general
• Linear complexity exact solutions exist for specific sub-problems
• Opportunistic strategies and heuristics may reduce the average-case complexity
• Approximate solutions may reduce the worst-case complexity


Threshold-Based Quantization


Threshold-Based Quantization (2)

• Clusters represent the following hyper-parallelepiped:
• Clusters represent a partition of all fragments
• Complexity is O(M·n) where:
  – M is the cardinality of metrics
  – n is the total number of fragments
  – often M << n
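One way to obtain such a partition in a single pass is to map each metrics vector to a grid cell and group fragments by cell. The sketch below assumes a floor-based grid (cell index = floor(metric / threshold) on each axis) and a hash table for grouping, so the O(M·n) cost is an expected cost; both choices, and all names, are illustrative assumptions rather than the method shown on the slide.

```python
from collections import defaultdict


def quantize(fragments, th):
    """Group fragments by the grid cell of their metrics vector.

    fragments: dict mapping a fragment name to its metrics tuple.
    th: per-metric cell widths (thresholds), all positive.
    Assumption: cell index = floor(metric / threshold) on every axis.
    """
    clusters = defaultdict(list)
    for name, metrics in fragments.items():
        # O(M) work per fragment, so one pass over n fragments costs O(M*n).
        cell = tuple(int(m // t) for m, t in zip(metrics, th))
        clusters[cell].append(name)
    return dict(clusters)


# Hypothetical fragments with two metrics each.
fragments = {"a": (2, 10), "b": (3, 11), "c": (7, 30)}
clusters = quantize(fragments, th=(2, 3))
```

Every fragment lands in exactly one cell, so the clusters form a partition of the fragment set.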


Quantization Error

• Fragments in neighboring clusters may be closer than (thi / 2) and still be in different clusters

• Errors for threshold level (thi) disappear for threshold levels (k·thi), (k > 1)
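A small numeric illustration of the quantization error, assuming the same floor-based grid as a stand-in for the actual quantization (the values and the helper name `cell` are hypothetical):

```python
def cell(value, th):
    """Index of the quantization cell holding value (assumed floor-based grid)."""
    return int(value // th)


x, y = 2, 3  # two values of one metric, only 1 apart (less than th / 2)
th = 3

split_at_th = cell(x, th) != cell(y, th)            # cells 0 and 1: separated
merged_at_2th = cell(x, 2 * th) == cell(y, 2 * th)  # both in cell 0: merged
```

In this example the two values sit on opposite sides of a cell boundary at level th, and the boundary is gone at level 2·th.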


Project Comparison

• Compute structural similarity spectrum
  – Compute similarity for increasing threshold levels in s steps
• Quantize projects for the current threshold level
• Traverse current clusters to check for commonality in compared project
• Count common structurally-similar fragments under current threshold level
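The steps above can be sketched as follows. Several details are assumptions made for illustration: thresholds grow linearly with the step number, quantization uses a floor-based grid, and a fragment counts as "common" when its cell holds at least one fragment from each project. The slides do not specify these choices.

```python
from collections import defaultdict


def common_fragments(p1, p2, th):
    """Count fragments that share a quantization cell with the other project."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [count in p1, count in p2]
    for side, project in enumerate((p1, p2)):
        for metrics in project:
            key = tuple(int(m // t) for m, t in zip(metrics, th))
            cells[key][side] += 1
    return sum(a + b for a, b in cells.values() if a and b)


def similarity_spectrum(p1, p2, base_th, steps):
    """Common-fragment counts for thresholds base_th, 2*base_th, ..., steps*base_th."""
    return [
        common_fragments(p1, p2, [k * t for t in base_th])
        for k in range(1, steps + 1)
    ]


# Hypothetical projects: each fragment is a tuple of two metrics.
p1 = [(2, 10), (7, 30)]
p2 = [(3, 11), (8, 33)]
spectrum = similarity_spectrum(p1, p2, base_th=(2, 3), steps=3)
```

Each step touches every fragment of both projects once and does O(M) work per fragment, which matches the O(s·M·(n1 + n2)) shape when hashing is treated as constant time.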


Project Comparison (2)

• Complexity: O(s·M·(n1 + n2))
  – n1, n2: size of projects
  – M: cardinality of metrics
  – s: threshold steps
• Rationale:
  – Plagiarism is hard to hide deeply when little programming effort is invested
  – Surface differences are quickly ignored by thresholds of increasing levels


Project Comparison (3)

• Typical spectrum


Parameters

• Granularity: functions and methods
• Steps: 5
• Metrics and thresholds:
  – CALLS: 1
  – LOCALS: 1
  – NONLCALS: 1
  – PARNUM: 1
  – STMNT: 3
  – NBRANCHES: 1
  – NLOOPS: 1


Plagiarism problem

• Projects are composed of a variable number of fragments
  – Problem similar to class comparison or to software evolution analysis
• Identify projects with high spectral similarity
  – p = number of projects
  – Galaxy approach
    • O(p)
  – Pair comparison
    • O(p^2)
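The pair-comparison option can be sketched as scoring every unordered pair of projects and ranking the pairs, which needs p·(p − 1)/2 comparisons. The project names, the one-number summaries, and the `closeness` score below are hypothetical stand-ins for real similarity spectra; the Galaxy approach is not sketched because its algorithm is not preserved in this transcript.

```python
from itertools import combinations


def rank_pairs(projects, score):
    """Score every unordered pair of projects, most similar pair first."""
    scored = [
        (score(projects[a], projects[b]), a, b)
        for a, b in combinations(sorted(projects), 2)
    ]
    return sorted(scored, key=lambda item: -item[0])


def closeness(x, y):
    """Stand-in similarity score: higher means more alike."""
    return -abs(x - y)


# Hypothetical projects, each summarized by a single number for illustration.
projects = {"team1": 10, "team2": 11, "team3": 40, "team4": 12}
ranking = rank_pairs(projects, closeness)
```

The pairs at the top of `ranking` would be the candidates handed over for manual review.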


Galaxy

• Algorithm:


Procedural Projects


OO Projects


Clone Visualization

• Visual display of the differences between source code fragments

• DP-matching algorithm on tokens


Matching Algorithms

• Compute the sets of lexical changes
  – Dynamic programming
  – Sub-optimal and heuristic ones


Matching Example

int restore_stack ( object info ) {

int restore_list ( int index , object info ) {
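A token-level match of the two signatures from this example can be computed with a standard longest-common-subsequence dynamic program. This is a generic sketch, not necessarily the DP-matching variant used by the author; the function name `token_diff` and the whitespace tokenization are assumptions.

```python
def token_diff(a, b):
    """Align two token lists with an LCS dynamic program; return (op, token) pairs."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start, emitting kept, deleted, and inserted tokens.
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", t) for t in a[i:])
    ops.extend(("+", t) for t in b[j:])
    return ops


old = "int restore_stack ( object info ) {".split()
new = "int restore_list ( int index , object info ) {".split()
changes = [(op, tok) for op, tok in token_diff(old, new) if op != "="]
```

For these two lines the differences reduce to one renamed identifier and one added parameter, the kind of low-effort edit a visual diff is meant to expose.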


Remarks

• Similarity contrast is very good for procedural code
• Distribution of similarity for OO code is less sharp
  – Reference classes were given as a part of the projects
  – Methods tend to be smaller
  – More methods tend to be similar
  – Class structure could be taken into consideration
  – Inter-class relationship could be taken into account


Administrative Approach

• Identify the most similar projects
• Do not make any hypothesis about the causes of similarity
• Shift the burden of explanation onto the authors of a project


Conclusions

• A metrics-based plagiarism detection approach in an academic environment has been presented

• The presented approach has been successfully used to discourage plagiarism in course projects


Further Contacts

Ettore Merlo
Ecole Polytechnique de Montréal
tel: +1 (514) 340 4711 ext. 5758

fax: +1 (514) 340 [email protected]