Be Wise, Plagiarize

Standard similarity detection

Karma Tarap, Programmer | Budapest, Oct 2012


Disclaimer

The opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of Novartis. Novartis does not guarantee the accuracy or reliability of the information provided herein.

Karma Tarap | Oct 2012 | Be Wise, Plagiarize


Plagiarism Detection

“Plagiarism detection is the process of locating instances of plagiarism within a work or document.” – Wikipedia

§ Plagiarism detection algorithms:

1.  Well-researched area. Used in:
    -  Academia to identify cheating
    -  Industry to identify copyright infringements
2.  Has the goal: “How similar are a set of documents?”


Standard programs

§ Standard programs are an essential component of clinical trial reporting.

1.  Are the standards being used?
2.  What is the degree of modifications required?


Goals

On a fundamental level, we are interested in finding:

“How similar are a set of documents?”

How can we program this?


Apply plagiarism detection techniques to our standard similarity problem.

The main difference: in “plagiarism detection” a high score is bad, whereas in our case a high score is good.


Some considerations

Let's consider the following 3 code snippets:

1.  proc sort data=class; by age; run;

2.  data class.proc ; sort = ' by age ' ; run ;

3.  /*proc sort data=class; by age; run;*/

A word-by-word comparison would yield a high match for all of the above, even though they are functionally different.
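The pitfall can be reproduced in a few lines. The sketch below is in Python rather than the Proc Groovy used for the actual solution, and the helper names `words` and `word_overlap` are hypothetical. It compares only the distinct words of each snippet, which is one simple reading of "word by word".

```python
import re

SNIPPETS = [
    "proc sort data=class; by age; run;",
    "data class.proc ; sort = ' by age ' ; run ;",
    "/*proc sort data=class; by age; run;*/",
]


def words(code):
    """Return the set of lower-cased words, ignoring all punctuation."""
    return set(re.findall(r"[A-Za-z_]\w*", code.lower()))


def word_overlap(a, b):
    """Share of distinct words common to both snippets (0 to 1)."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


# Compare every pair of snippets.
for i in range(len(SNIPPETS)):
    for j in range(i + 1, len(SNIPPETS)):
        print(i + 1, j + 1, word_overlap(SNIPPETS[i], SNIPPETS[j]))
```

Every pair scores 1.0, although one snippet sorts a dataset, one assigns a character string in a data step, and one is a comment.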


Some considerations (purpose)

Let's consider the following 3 code snippets:

1.  proc sort data=class; by age; run;

2.  data class.proc ; sort = ' by age ' ; run ;

3.  /*proc sort data=class; by age; run;*/

A word-by-word comparison would yield a high match for all of the above, even though they are functionally different.

Purpose matters


Some considerations (context)

Let's consider the following 3 code snippets:

1.  proc sort data=class; by age; run;

2.  data class.proc ; sort = ' by age ' ; run ;

3.  /*proc sort data=class; by age; run;*/

A word-by-word comparison does not take into account the special meaning generated by context.

Context matters


Some considerations (order)

Let's consider the following 3 code snippets:

1.  proc sort data=class; by age; run;

2.  ;proc sort data=class; by age; run;

3.  /*proc sort data=class; by age; run;*/

Comparing files based on the index of each word yields a complete mismatch of the above programs.

Order doesn’t matter
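The effect of a single leading semicolon can be sketched as follows. This is an illustration in Python, not the talk's Proc Groovy code; `split_tokens`, `positional_match` and `set_match` are hypothetical helpers, and splitting into words plus single punctuation characters is an assumption about what "index of the word" means.

```python
import re


def split_tokens(code):
    """Split code into words and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", code)


def positional_match(a, b):
    """Fraction of positions at which both token lists hold the same token."""
    ta, tb = split_tokens(a), split_tokens(b)
    same = sum(x == y for x, y in zip(ta, tb))
    return same / max(len(ta), len(tb))


def set_match(a, b):
    """Fraction of distinct tokens shared, regardless of position."""
    sa, sb = set(split_tokens(a)), set(split_tokens(b))
    return len(sa & sb) / len(sa | sb)


original = "proc sort data=class; by age; run;"
shifted = ";proc sort data=class; by age; run;"

print(positional_match(original, shifted))  # 0.0
print(set_match(original, shifted))         # 1.0
```

The extra semicolon shifts every later token by one position, so the index-based score drops to zero while a position-free comparison still sees identical content.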


Some considerations (cont.)

§ The issues identified in this approach can be classified as follows:

1.  Purpose – The purpose of the word (addressed by tokenization)
2.  Context – The context of the word given the surrounding words
3.  Ordering – Changes of order of sections in a file


Tokens

§ Tokens are the basic elements of a language

§ SAS defines four basic token types:

1.  Literal - One or more characters enclosed in single or double quotation marks.
2.  Name - One or more characters beginning with a letter or an underscore.
3.  Number - A numeric value.
4.  Special character - Any character that is not a letter, number, or underscore.

We will need to extend this a little further (keywords, macro...)
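The four token types can be recognised with one regular expression. The following is a minimal sketch in Python, not the Proc Groovy implementation from the paper; the name `tokenize` and the exact patterns are assumptions, and it deliberately ignores comments, macro syntax and special numeric forms such as dates or hex constants.

```python
import re

# One named alternative per SAS token type, tried in this order.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<literal>'[^']*'|"[^"]*")   # characters in single or double quotes
  | (?P<name>[A-Za-z_]\w*)         # starts with a letter or an underscore
  | (?P<number>\d+(?:\.\d+)?)      # plain integer or decimal value
  | (?P<special>\S)                # any other non-blank character
    """,
    re.VERBOSE,
)


def tokenize(code):
    """Return a list of (token_type, text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(code)]


for kind, text in tokenize("data class.proc ; sort = ' by age ' ; run ;"):
    print(kind, repr(text))
```

In this output `' by age '` is a single literal token, so the words inside the quotes are no longer mistaken for a BY statement.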


Tokenization flow

Tokenization is the process of breaking a language into tokens.

Flow: code → tokens → mapping → abbreviated tokens

Now resistant to datastep and variable name changes!
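One way to picture the mapping step is shown below, in Python rather than the paper's Proc Groovy. The keyword list and the abbreviation codes (`N`, `L`, `#`) are hypothetical and chosen only for illustration; the talk does not spell out its actual mapping.

```python
import re

# Hypothetical keyword list; a real tool would need a far fuller one.
KEYWORDS = {"proc", "sort", "data", "by", "run"}

# The four SAS token types listed earlier, tried in this order.
TOKEN = re.compile(
    r"""(?P<literal>'[^']*'|"[^"]*")"""
    r"""|(?P<name>[A-Za-z_]\w*)"""
    r"""|(?P<number>\d+(?:\.\d+)?)"""
    r"""|(?P<special>\S)"""
)


def abbreviate(code):
    """Map each token to a short code that ignores user-chosen names."""
    out = []
    for m in TOKEN.finditer(code):
        kind, text = m.lastgroup, m.group()
        if kind == "name" and text.lower() in KEYWORDS:
            out.append(text.upper())  # keywords keep their identity
        elif kind == "name":
            out.append("N")           # any dataset or variable name
        elif kind == "literal":
            out.append("L")           # any quoted string
        elif kind == "number":
            out.append("#")           # any numeric value
        else:
            out.append(text)          # special characters stay as they are
    return out


print(abbreviate("proc sort data=class; by age; run;"))
print(abbreviate("proc sort data=demo; by sex; run;"))
```

Both calls print the same abbreviated sequence, because `class`/`demo` and `age`/`sex` all collapse to `N`. This sketch treats a word as a keyword wherever it appears, so a dataset that happens to be named `proc` would still be mislabelled; deciding that from position is left out here.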


Some considerations (cont.)

§ The issues identified in this approach can be classified as follows:

1.  Purpose – The purpose of the word
2.  Context – The context of the word given the surrounding words (addressed by n-grams)
3.  Ordering – Changes of order of sections in a file


n-grams


§ An n-gram is a contiguous sequence of n items from a given sequence of text

§ Converting our tokens to n-grams allows us to compare sections of code.


n-grams sliding window

5 4 7 4 3 4 3 4 9 4 4 7

Let n = 4

S0 = {5, 4, 7, 4}

S1 = {4, 7, 4, 3}

S2 = {7, 4, 3, 4}

Sn = {......}

We can now compare n-grams of files instead of single tokens.
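The sliding window can be written as a short function. This is a sketch in Python (the talk's own code is in Proc Groovy), with `ngrams` as a hypothetical helper name; it uses the token sequence from the slide.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, moving one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = [5, 4, 7, 4, 3, 4, 3, 4, 9, 4, 4, 7]
windows = ngrams(tokens, 4)

print(windows[0])    # (5, 4, 7, 4)
print(windows[1])    # (4, 7, 4, 3)
print(windows[2])    # (7, 4, 3, 4)
print(len(windows))  # 9
```

A sequence of 12 tokens with n = 4 yields 9 windows; a sequence shorter than n yields none.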


Some considerations (cont.)

§ The issues identified in this approach can be classified as follows:

1.  Purpose – The purpose of the word
2.  Context – The context of the word given the surrounding words
3.  Ordering – Changes of order of sections in a file (addressed by Jaccard’s Index)

We will also now look at scoring.


Jaccard’s Index

§ Jaccard’s Index is a statistic for comparing the similarity of sets.

§ It is the size of the intersection of files A and B divided by the size of their union: J(A,B) = |A∩B| / |A∪B|.

§ It is bounded between 0 and 1.

§ By comparing n-grams irrespective of their position, we have an order-independent comparison.


Jaccard’s Index (cont.)

An example:

§ File A: {5, 4, 7, 4} {3, 4, 3, 4} {9, 4, 4, 7}

§ File B: {5, 4, 7, 4} {3, 4, 3, 4} {3, 4, 5, 7}

|A∪B| = total distinct n-grams = 4, |A∩B| = total matched n-grams = 2

§ J(A,B) = 2/4 = 0.5

Similarity between file A and file B is 50%
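The worked example can be checked with Python's built-in sets. This is a sketch only, not the Proc Groovy code from the paper; `jaccard` is a hypothetical helper, and returning 1.0 for two empty inputs is an assumption the talk does not address.

```python
def jaccard(a, b):
    """Jaccard's Index of two collections of n-grams, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty files count as identical
    return len(a & b) / len(a | b)


file_a = [(5, 4, 7, 4), (3, 4, 3, 4), (9, 4, 4, 7)]
file_b = [(5, 4, 7, 4), (3, 4, 3, 4), (3, 4, 5, 7)]

print(jaccard(file_a, file_b))                  # 0.5
print(jaccard(file_a, list(reversed(file_a))))  # 1.0
```

The second call shows the order independence: reversing the n-grams of file A leaves the score at 1.0, because sets ignore position.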


Recap

§ Apply plagiarism detection techniques to our standard similarity problem

1.  Purpose – Tokenization
2.  Context – n-grams
3.  Ordering – Jaccard’s Index

§ Implement solution in Proc Groovy (SAS 9.3)
    •  Full code provided in the paper appendix


Results


Sets sensitivity of match

High-level summary checks whether standards are being used.

Low-level breakdown identifies standards that require updating.


Discussion

1.  Are the standards being used?
    •  Is the user aware they exist?
    •  Are the required outputs/datasets non-standard?
    •  Is the standard not flexible enough?

2.  What is the degree of modifications required?
    •  Few modifications suggest the standard programs are robust
    •  Many changes suggest the programs need updating


Questions?

Be wise, plagiarize


Sample Groovy code
