CodeSimian
CS491B – Andrew Weng
Motivation
• Academic integrity is a universal issue
• Plagiarism is still common today
  • Kaavya Viswanathan (Harvard student)
    • Book contains many plagiarized passages
  • Yoshihiko Wada (painter, Japan)
    • Artwork plagiarized from Alberto Sughi
  • Scott D. Miller (Wesley College president)
    • Plagiarized material found on his website
Is Plagiarism Harmful?
• Who does plagiarism really hurt?
  • The student
  • The class
  • The university
• Plagiarism is not only a matter of protecting intellectual property rights
Plagiarism Detection
Benefits of Utilizing Plagiarism Detection
• Prevention
• Enforcement
• Objective standpoint
Platform Overview
• Developed on Visual Studio .NET 2005
• Coded in Microsoft Visual C# .NET
• Windows Forms application
• Simple and familiar GUI (Windows)
• Intended focus is ease of use
Theoretical Overview
CodeSimian is based on two primary principles
• Kolmogorov Complexity
• Information Distance
Kolmogorov Complexity
• Simple definition: the length of the shortest program, written for a universal Turing machine, that produces a specified output
• Purely theoretical
• Not computable in general, so it can only be approximated in practice
Kolmogorov Complexity
Define x to be a desired output string
K(x) = the length of the shortest program that produces x
K(x|y) = the length of the shortest program that produces x given y as an input
K(xy) = the length of the shortest program that produces x concatenated with y
Kolmogorov Complexity
Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1:
π = 3.1415926535897932384626433832795…
n = 0.5234958723957329875320935293853…
K(π) is a small, finite number: the length of the code required to generate the digits of π indefinitely.
Perhaps something as simple as the implementation of Leibniz’s formula:
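$$\frac{\pi}{4} = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots$$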
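To make the point about K(π) concrete, here is a minimal Python sketch of Leibniz's series. It is only an illustration: the function name `leibniz_pi` and the number of terms are assumptions, and the series converges slowly, so this is not a practical way to compute π. What matters is that the program stays a few lines long no matter how many digits are wanted.

```python
def leibniz_pi(terms: int) -> float:
    """Approximate pi by summing the first `terms` terms of Leibniz's series."""
    total = 0.0
    for n in range(terms):
        # n-th term of the series: (-1)^n / (2n + 1)
        total += (-1) ** n / (2 * n + 1)
    # The series sums to pi / 4, so scale by 4.
    return 4 * total


approx = leibniz_pi(100_000)
print(approx)
```

Asking for more precision only changes the argument, not the length of the code.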
Kolmogorov Complexity
n = 0.5234958723957329875320935293853…
To generate the full output of a truly random number n, the program would have to be infinitely long.
The code would essentially be System.out.println("0.52349587…");
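The contrast between a patterned output and a random one can be seen, very roughly, with an off-the-shelf compressor standing in for K. The sketch below is an illustration and not part of CodeSimian: it uses Python's `zlib`, and the helper name `compressed_size` and both sample inputs are assumptions made for the example.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


regular = b"0123456789" * 1000       # 10,000 bytes with an obvious pattern
random_bytes = os.urandom(10_000)    # 10,000 bytes from the OS random source

regular_size = compressed_size(regular)
random_size = compressed_size(random_bytes)

# The patterned input shrinks to a tiny fraction of its length,
# while the random input stays at roughly its original size.
print(regular_size, random_size)
```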
Kolmogorov Complexity
So how does this apply to plagiarism detection?
Define x = π and y = π/4
K(x|y) would be a very small value: given y, one can compute π with a single multiplication by 4.
Information Distance
The distance (or difference) between two objects
Formula used:
$$d(x, y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)}$$
Information Distance
• Similarity Factor
K(x) − K(x|y) is the amount of information about x that is already contained in y. Normalizing it by the amount of information in x and y together gives a percentage of similarity
$$s(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}$$
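As a small worked illustration, the sketch below evaluates s(x, y) = (K(x) − K(x|y)) / K(xy) and d(x, y) = 1 − s(x, y) for made-up sizes. The function names and the numbers are assumptions chosen for the example; they are not measurements from CodeSimian.

```python
def similarity_from_sizes(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy)."""
    return (k_x - k_x_given_y) / k_xy


def distance_from_sizes(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - similarity_from_sizes(k_x, k_x_given_y, k_xy)


# Near-copy: knowing y leaves little of x to describe, and xy is barely
# larger than x alone (hypothetical sizes).
near_copy = similarity_from_sizes(k_x=100, k_x_given_y=10, k_xy=110)

# Unrelated: y does not help describe x, and xy costs about K(x) + K(y).
unrelated = similarity_from_sizes(k_x=100, k_x_given_y=100, k_xy=200)

print(near_copy, unrelated)
```

With these inputs the near-copy case scores 90/110 and the unrelated case scores 0.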
Implementation
What does CodeSimian do to obtain the similarity factors?
1. Parse and Tokenize the code
2. Compress the tokenized strings
3. Compare the compressed strings
Parsing the Code
• Utilized ANTLR to parse and tokenize the code
• ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)
Tokenizing the Code
• The tokenized output is a string of characters, each of which represents a token within the code
• For example, { int c = 0; } contains 7 “letters”:
1. Open bracket
2. Integer type declaration
3. Variable name
4. Assignment operator
5. Integer value
6. Statement end
7. Close bracket
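The idea of one "letter" per token can be sketched in a few lines of Python. This is a simplified stand-in built on regular expressions, not the ANTLR-based tokenizer the slides describe, and the token classes and letter codes in `TOKEN_SPEC` are hypothetical choices made for the example.

```python
import re

# Hypothetical token classes and one-letter codes (not CodeSimian's actual alphabet).
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|float|double|boolean|char)\b"),  # type declaration
    ("V", r"[A-Za-z_]\w*"),                                 # variable name
    ("N", r"\d+"),                                          # integer value
    ("=", r"="),                                            # assignment operator
    (";", r";"),                                            # statement end
    ("{", r"\{"),                                           # open bracket
    ("}", r"\}"),                                           # close bracket
]

# One named group per token class; earlier classes win when several could match.
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(TOKEN_SPEC))
)


def tokenize(source: str) -> str:
    """Map each recognized token to its letter; whitespace and unknown characters are skipped."""
    letters = []
    for match in MASTER.finditer(source):
        index = int(match.lastgroup[1:])  # "g3" -> 3
        letters.append(TOKEN_SPEC[index][0])
    return "".join(letters)


print(tokenize("{ int c = 0; }"))
print(tokenize("{ int counter = 42; }"))
```

Both calls produce the same seven-letter string, which is why renaming variables or changing literal values does not alter the tokenized form.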
Compressing the String
This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers
• As the string is read, a library of previously seen sequences is built up.
• When a repeated sequence is detected, the compressor emits a pointer into the library instead of the sequence itself
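The slides do not say which Lempel-Ziv variant CodeSimian uses. As an illustration of the library-and-pointer idea, the sketch below implements a plain LZ78-style compressor whose library is never trimmed; the function names and the output format, a list of (library index, next character) pairs, are assumptions.

```python
def lz78_compress(text: str) -> list:
    """Return (library_index, next_char) pairs; index 0 stands for the empty phrase."""
    library = {}   # phrase -> index; unbounded, entries are never evicted
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in library:
            phrase = candidate  # keep extending a phrase already in the library
        else:
            output.append((library.get(phrase, 0), ch))
            library[candidate] = len(library) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit a bare pointer to it.
        output.append((library[phrase], ""))
    return output


def lz78_decompress(pairs: list) -> str:
    """Rebuild the text by replaying the library construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        pieces.append(phrase)
        phrases.append(phrase)
    return "".join(pieces)


sample = "{TV=N;}" * 20
pairs = lz78_compress(sample)
print(len(sample), len(pairs))
```

Repetitive input yields far fewer pairs than characters, and `lz78_decompress` recovers the original string from them.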
Compressing the String
• Normally, limits exist on the library size and on the length of the “words” stored in it
• Memory utilization and efficiency are not important for this application
• Lempel-Ziv with unbounded buffers is therefore suitable for this application
Comparing the Compressed String
• K(x) is approximated by the size of the tokenized and then compressed code x.
• K(x|y) is approximated by the size of the tokenized and then compressed code x, given y as a “free” library
• K(xy) is approximated by the size of the tokenized and then compressed concatenation of x and y.
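These three approximations can be tried out with a general-purpose compressor. The sketch below is an assumption-laden stand-in rather than CodeSimian's implementation: it uses Python's `zlib`, passes y as a preset dictionary (`zdict`) to play the role of the "free" library, and is limited by zlib's 32 KB window, unlike the unbounded buffers described above. The function names and the seeded random token strings are made up for the example.

```python
import random
import zlib


def k(x: bytes) -> int:
    """Approximate K(x) by the compressed size of x."""
    return len(zlib.compress(x, 9))


def k_given(x: bytes, y: bytes) -> int:
    """Approximate K(x|y): compress x with y preloaded as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=y)
    return len(compressor.compress(x) + compressor.flush())


def k_joint(x: bytes, y: bytes) -> int:
    """Approximate K(xy) by the compressed size of x followed by y."""
    return k(x + y)


def similarity_factor(x: bytes, y: bytes) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy), using the approximations above."""
    return (k(x) - k_given(x, y)) / k_joint(x, y)


def fake_tokens(seed: int, length: int = 2000) -> bytes:
    """Stand-in for a tokenized source file: a seeded random string of token letters."""
    rng = random.Random(seed)
    return "".join(rng.choices("{}TV=N;()", k=length)).encode("ascii")


original = fake_tokens(1)
unrelated = fake_tokens(2)

same_score = similarity_factor(original, original)
unrelated_score = similarity_factor(original, unrelated)
print(same_score, unrelated_score)
```

An identical pair scores close to 1 and an unrelated pair scores close to 0; real submissions would fall somewhere in between.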
Results
Using the test on trivial examples:
• LinkedList.java
• LinkedList2.java
• LinkedList3.java
• Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output.
• All files came out as >85% similar
Results
Using the test on a small real-world sample
Professor Kang’s CS201 HW1
• Relatively simple homework assignment
• 30-50% similarity average
• 95% similarity detected on one pair of submissions
• Confirmed by Professor Kang as correct
Results
Using the test on another small real-world sample
Professor Kang’s CS201 HW4
• More complex homework assignment involving 2-3 files, with the Java files broken down according to function
• Possible problem: specialized function files may produce false positives
• 30-70% similarity average
• 95+% similarity detected on pairs of submissions
• Confirmed by Professor Kang as correct
Results
• Things to note…
• The results showed a similarity of 80% on one pair of submissions, which the application deems significant but which is not necessarily conclusive
• Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes
Conclusions
• Successful test cases
• Simple and straightforward to use
• Based on an objective principle which works!
Future Work
• Enhancing the application to be able to compare internal “blocks” of code
• Improving the compression algorithm to better handle and adapt to “approximate matches”
• Improving the functionality of the GUI
• Providing the ability to print reports for whole directories