Be Wise, Plagiarize - Lex Jansen · Karma Tarap | Oct 2012 | Be Wise, Plagiarize . Tokenization...

Standard similarity detection Karma Tarap, Programmer | Budapest, Oct 2012

Be Wise, Plagiarize

Disclaimer

The opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of Novartis. Novartis does not guarantee the accuracy or reliability of the information provided herein

Karma Tarap | Oct 2012 | Be Wise, Plagiarize

Plagiarism Detection

“Plagiarism detection is the process of locating instances of plagiarism within a work or document.” – Wikipedia

§ Plagiarism detection algorithms: 1.  Well researched area. Used in:

-  Academia to identify cheating -  Industry to identify copyright infringements

2.  Has the goal: “How similar are a set of documents”


Standard programs

§ Standard programs are an essential component of clinical trial reporting.

1.  Are the standards being used? 2.  What is the degree of modifications required?


Goals

On a fundamental level, we are interested in finding:

“How similar are a set of documents?”

How can we program this?


Apply plagiarism detection techniques to our standard similarity problem.

The main difference being: In “plagiarism detection” a high score = bad. Whereas, in our case a high score = good.

Some considerations


proc sort data=class; by age; run;

data class.proc ; sort = ' by age ' ; run ;

/*proc sort data=class; by age run;*/

A word by word comparison would yield a high match for all of the above, despite being functionally different.

Lets consider the following 3 code snippets:

1 2 3

Some considerations (purpose)




/*proc sort data=class; by age ;run;*/

A word by word comparison would yield a high match for all of the above, despite being functionally different.


1 2 3

Purpose matters

Some considerations (context)




/*proc sort data=class; by age ;run;*/

A word by word comparison does not take into consideration, special meaning generated by context.


1 2 3

Context matters

Some considerations (order)



;proc sort data=class; by age; run;

/*proc sort data=class; by age run;*/

Comparing files based on the index of the word yields a complete mismatch of the above programs


1 2 3

Order doesn’t matter

Some considerations (cont.)

§ The issues identified in this approach can be classified as follows:

1.  Purpose – The purpose of the word 2.  Context – The context of the word given the surrounding words 3.  Ordering – Changes of order of sections in a file


Tokenization

Tokens

§ Tokens are the basic elements of a language

§ SAS defines four basic token types: 1.  Literal - One or more characters enclosed in single or double

quotation marks. 2.  Name - One or more characters beginning with a letter or an

underscore. 3.  Number - A numeric value. 4.  Special character - Any character that is not a letter, number, or

underscore

We will need to extend this a little further (keywords , macro...)


Tokenization flow


code tokens

mapping

Abbreviated tokens

Now resistant to datastep and variable name changes!

Tokenization is the process of breaking a language into tokens.





n-grams

n-grams


§ An n-gram is a contiguous sequence of n items from a given sequence of text

§ Converting our tokens to n-grams allows us to compare sections of code.

n-grams sliding window


5 4 7 4 3 4 3 4 9 4 4 7

Let n = 4

S0 = {5, 4, 7, 4}



5 4 7 4 3 4 3 4 9 4 4 7

Let n = 4

S1 = {4, 7, 4, 3}

S0 = {5, 4, 7, 4}



5 4 7 4 3 4 3 4 9 4 4 7

Let n = 4

S1 = {4, 7, 4, 3}

S0 = {5, 4, 7, 4}

S2 = {7, 4, 3, 4}

We can now compare n-grams of files instead of single tokens.

Sn = {......}





Jaccard’s Index

We will also now look at scoring.

Jaccard’s Index

§ Jaccard’s Index is a statistic for comparing the similarity of sets.

§  Intersect of files A and B, divide by their union.

§ Has a bound of 0 to 1.

§ By comparing n-grams irrespective of their position, we have an order independent comparison.


Jaccard’s Index (cont.)

An example:

§ File A: {5, 4, 7, 4} {3, 4, 3, 4} {9, 4, 4, 7}

§ File B: {5, 4, 7, 4} {3, 4, 3, 4} {3, 4, 5, 7}

A∪B= Total distinct n-grams=4, A∩B= total matched n-grams=2

§ J(A,B)=2/4 =.5

Similarity between file A and file B is 50%


Recap

§ Apply plagiarism detection techniques to our standard similarity problem

1.  Purpose – Tokenization 2.  Context – n-grams 3.  Ordering – Jaccard’s Index

§  Implement solution in Proc Groovy (SAS 9.3) •  Full code provided in the paper appendix


Results


Sets sensitivity of match

High level summary checks if standards are being used

Low level breakdown identifies standards that require updating.

Discussion

1.  Are the standards being used? •  Is the user aware they exist? •  Is the outputs/datasets required not standard? •  Is the standard not flexible enough?

2.  What is the degree of modifications required? •  Few modifications suggest the standard programs are robust •  Many changes suggest the programs need updating


Questions?

Be wise, plagiarize


Sample Groovy code


Be Wise, Plagiarize - Lex Jansen · Karma Tarap | Oct 2012 | Be Wise, Plagiarize . Tokenization...

Documents

Transcript of Be Wise, Plagiarize - Lex Jansen · Karma Tarap | Oct 2012 | Be Wise, Plagiarize . Tokenization...