Be Wise, Plagiarize - Lex Jansen · Karma Tarap | Oct 2012 | Be Wise, Plagiarize . Tokenization...
Transcript of Be Wise, Plagiarize - Lex Jansen · Karma Tarap | Oct 2012 | Be Wise, Plagiarize . Tokenization...
Standard similarity detection Karma Tarap, Programmer | Budapest, Oct 2012
Be Wise, Plagiarize
Disclaimer
The opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of Novartis. Novartis does not guarantee the accuracy or reliability of the information provided herein
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Plagiarism Detection
“Plagiarism detection is the process of locating instances of plagiarism within a work or document.” – Wikipedia
§ Plagiarism detection algorithms: 1. Well researched area. Used in:
- Academia to identify cheating - Industry to identify copyright infringements
2. Has the goal: “How similar are a set of documents”
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Standard programs
§ Standard programs are an essential component of clinical trial reporting.
1. Are the standards being used? 2. What is the degree of modifications required?
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Goals
On a fundamental level, we are interested in finding:
“How similar are a set of documents?”
How can we program this?
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Apply plagiarism detection techniques to our standard similarity problem.
The main difference being: In “plagiarism detection” a high score = bad. Whereas, in our case a high score = good.
Some considerations
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
proc sort data=class; by age; run;
data class.proc ; sort = ' by age ' ; run ;
/*proc sort data=class; by age run;*/
A word by word comparison would yield a high match for all of the above, despite being functionally different.
Lets consider the following 3 code snippets:
1 2 3
Some considerations (purpose)
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
proc sort data=class; by age; run;
data class.proc ; sort = ' by age ' ; run ;
/*proc sort data=class; by age ;run;*/
A word by word comparison would yield a high match for all of the above, despite being functionally different.
Lets consider the following 3 code snippets:
1 2 3
Purpose matters
Some considerations (context)
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
proc sort data=class; by age; run;
data class.proc ; sort = ' by age ' ; run ;
/*proc sort data=class; by age ;run;*/
A word by word comparison does not take into consideration, special meaning generated by context.
Lets consider the following 3 code snippets:
1 2 3
Context matters
Some considerations (order)
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
proc sort data=class; by age; run;
;proc sort data=class; by age; run;
/*proc sort data=class; by age run;*/
Comparing files based on the index of the word yields a complete mismatch of the above programs
Lets consider the following 3 code snippets:
1 2 3
Order doesn’t matter
Some considerations (cont.)
§ The issues identified in this approach can be classified as follows:
1. Purpose – The purpose of the word 2. Context – The context of the word given the surrounding words 3. Ordering – Changes of order of sections in a file
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Tokenization
Tokens
§ Tokens are the basic elements of a language
§ SAS defines four basic token types: 1. Literal - One or more characters enclosed in single or double
quotation marks. 2. Name - One or more characters beginning with a letter or an
underscore. 3. Number - A numeric value. 4. Special character - Any character that is not a letter, number, or
underscore
We will need to extend this a little further (keywords , macro...)
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Tokenization flow
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
code tokens
mapping
Abbreviated tokens
Now resistant to datastep and variable name changes!
Tokenization is the process of breaking a language into tokens.
Some considerations (cont.)
§ The issues identified in this approach can be classified as follows:
1. Purpose – The purpose of the word 2. Context – The context of the word given the surrounding words 3. Ordering – Changes of order of sections in a file
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
n-grams
n-grams
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
§ An n-gram is a contiguous sequence of n items from a given sequence of text
§ Converting our tokens to n-grams allows us to compare sections of code.
n-grams sliding window
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
5 4 7 4 3 4 3 4 9 4 4 7
Let n = 4
S0 = {5, 4, 7, 4}
n-grams sliding window
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
5 4 7 4 3 4 3 4 9 4 4 7
Let n = 4
S1 = {4, 7, 4, 3}
S0 = {5, 4, 7, 4}
n-grams sliding window
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
5 4 7 4 3 4 3 4 9 4 4 7
Let n = 4
S1 = {4, 7, 4, 3}
S0 = {5, 4, 7, 4}
S2 = {7, 4, 3, 4}
We can now compare n-grams of files instead of single tokens.
Sn = {......}
Some considerations (cont.)
§ The issues identified in this approach can be classified as follows:
1. Purpose – The purpose of the word 2. Context – The context of the word given the surrounding words 3. Ordering – Changes of order of sections in a file
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Jaccard’s Index
We will also now look at scoring.
Jaccard’s Index
§ Jaccard’s Index is a statistic for comparing the similarity of sets.
§ Intersect of files A and B, divide by their union.
§ Has a bound of 0 to 1.
§ By comparing n-grams irrespective of their position, we have an order independent comparison.
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Jaccard’s Index (cont.)
An example:
§ File A: {5, 4, 7, 4} {3, 4, 3, 4} {9, 4, 4, 7}
§ File B: {5, 4, 7, 4} {3, 4, 3, 4} {3, 4, 5, 7}
A∪B= Total distinct n-grams=4, A∩B= total matched n-grams=2
§ J(A,B)=2/4 =.5
Similarity between file A and file B is 50%
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Recap
§ Apply plagiarism detection techniques to our standard similarity problem
1. Purpose – Tokenization 2. Context – n-grams 3. Ordering – Jaccard’s Index
§ Implement solution in Proc Groovy (SAS 9.3) • Full code provided in the paper appendix
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Results
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Sets sensitivity of match
High level summary checks if standards are being used
Low level breakdown identifies standards that require updating.
Discussion
1. Are the standards being used? • Is the user aware they exist? • Is the outputs/datasets required not standard? • Is the standard not flexible enough?
2. What is the degree of modifications required? • Few modifications suggest the standard programs are robust • Many changes suggest the programs need updating
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Questions?
Be wise, plagiarize
Karma Tarap | Oct 2012 | Be Wise, Plagiarize
Sample Groovy code
Karma Tarap | Oct 2012 | Be Wise, Plagiarize