Automatic plagiarism detection system for specialized corpora


Page 1: Automatic plagiarism detection system for specialized corpora

Authors

University Politehnica of Bucharest

Automatic Plagiarism Detection System for Specialized Corpora

Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea – [email protected], Razvan Rughiniș

Overview

• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions

12.04.23 Bachelor's Thesis Session - July 2012 2

Introduction

• Plagiarism: unauthorized appropriation of the language or thoughts of another author and the representation of that author's work as pertaining to one's own without according proper credit to the original author

• Lots of documents => automatic detection needed

• Information Retrieval
  – Stemming (e.g. beauty, beautiful, beautifulness => beauti)
  – Vector Space Model
  – tf-idf weighting, cosine similarity
• Measuring results
  – precision, recall, granularity => F-measure
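
The Vector Space Model, tf-idf weighting and cosine similarity listed above can be illustrated with a minimal Python sketch. It assumes documents are already tokenized (and, ideally, stemmed); the function names `tfidf_vectors` and `cosine` are illustrative and are not taken from AuthentiCop.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf-idf weight} dict."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the canny edge detector uses a gaussian filter".split(),
    "the canny detector is based on a gaussian filter".split(),
    "plagiarism detection needs a reference collection".split(),
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # two texts on the same topic
sim_far = cosine(vecs[0], vecs[2])    # unrelated texts
```

Terms that occur in every document (here "a") get an idf of zero, so they do not contribute to the similarity; this is why the unrelated pair scores 0 while the related pair scores higher.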

12.04.23 CSCS 2013 – Bucharest, Romania 3

Existing solutions

• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: need a good corpus (annotated by people, how to find plagiarized documents, etc.)

• AuthentiCop – developed for specialized corpora, also evaluated on general texts

• Used corpora:
  – PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and social software misuse” at CLEF)
  – Bachelor theses @ A&C

System Architecture

• Web interface for accessing AuthentiCop
  – Simple to add documents (text, PDF) and to highlight suspicious elements

System architecture

• Logical separation
  – Front-end (PHP, JavaScript + AJAX, jQuery)
  – Back-end (C++)
  – Cross-language communication
• Scalable solution, easy to update
  – Web server (front-end) and the plagiarism detection modules (back-end) may run on different machines
  – Plagiarism detection can be distributed on different machines (distributed workers)
• Several external open-source libraries are used (e.g. Apache Tika, CLucene, etc.)

System architecture

• Example: sequence of steps for processing PDF files
• Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution

Detection of plagiarism

• Different problems
  – Intrinsic plagiarism (analyze only the suspicious document)
  – External plagiarism (also has a reference collection to check against)
    • How large is the collection? Online sources?
    • Source identification
    • Text alignment

Detection of plagiarism

Steps for external plagiarism detection

1. Candidate selection
  – Find pairs of suspicious texts
  – Combines source identification with text alignment
2. Detailed analysis
3. Post-processing

Algorithms for candidate selection

• Selection of plausible plagiarism pairs (suspicious text, source text)
• Using stop-word elimination, tf-idf & cosine similarity
• Initial hypothesis
• “Similarity Search Problem”: All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
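
The prefix-filtering idea shared by All-Pairs and ppjoin can be sketched as follows. This is a simplified illustration under stated assumptions: it uses Jaccard similarity on token sets rather than the tf-idf cosine mentioned above, it omits ppjoin's positional filter, and the name `all_pairs_prefix_filter` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def all_pairs_prefix_filter(records, threshold):
    """Return {(i, j): jaccard} for all record pairs with Jaccard >= threshold."""
    # Global token order: rarest tokens first, so prefixes hold discriminative tokens.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records whose prefix contains it
    results = {}
    for i, tokens in enumerate(ordered):
        if not tokens:
            continue
        # Two records can only reach the threshold if their prefixes share a token.
        # The small epsilon guards against floating-point round-up in ceil().
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens) - 1e-9) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verification step: compute the exact similarity only for candidates.
        for j in candidates:
            sim = jaccard(set(tokens), set(ordered[j]))
            if sim >= threshold:
                results[(j, i)] = sim
    return results


records = [
    "a b c d e".split(),
    "a b c d f".split(),
    "x y z".split(),
    "a b c d e".split(),
]
pairs = all_pairs_prefix_filter(records, 0.6)
```

The point of the filter is that most document pairs never reach the verification step, which is what makes an all-pairs comparison feasible on a large collection.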

Algorithms for candidate selection

• FastDocode (presented at PAN 2010) + caching + sub-linear merging
• New approach
  – Text segments => fingerprints & indexing with Apache CLucene
  – Compute the number of inversions

N-gram length  Segment dimension  Retention rate  TP    FP     FN     Time (h)  Plagdet
3              150                10%             5413  44522  11469  ~1        0.162
4              150                10%             4913  10297  11969  ~2        0.306
4              150                30%             7633  35169  9249   ~4.5      0.256
5              150                20%             5194  6256   11688  ~3        0.367

Method (used on 1000 documents)  TP   FP    FN    Prec.  Recall  Plagdet
Fingerprinting & indexing        685  494   761   0.581  0.474   0.522
FastDocode#3                     634  4097  812   0.134  0.438   0.205
FastDocode#4                     424  815   1022  0.342  0.293   0.316
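
The "fingerprints & indexing" approach and the inversion count can be sketched roughly as below. The slides do not specify the hashing scheme, the retention rule, or how the inversion count is used, so everything here is an assumption: word n-grams are hashed with MD5, `keep_every` is a hypothetical sampling parameter, a plain dictionary stands in for the CLucene index, and a low inversion count is read as "shared fingerprints appear in the same order in both texts".

```python
import hashlib
from collections import defaultdict


def fingerprints(words, n=4, keep_every=1):
    """Yield (position, hash) for word n-grams; keep hashes divisible by keep_every."""
    for pos in range(len(words) - n + 1):
        gram = " ".join(words[pos:pos + n])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        if h % keep_every == 0:
            yield pos, h


def build_index(sources, n=4, keep_every=1):
    """Inverted index: fingerprint hash -> list of (source id, position)."""
    index = defaultdict(list)
    for doc_id, words in sources.items():
        for pos, h in fingerprints(words, n, keep_every):
            index[h].append((doc_id, pos))
    return index


def matched_positions(index, suspicious, n=4, keep_every=1):
    """Per source: positions of shared fingerprints, in suspicious-document order."""
    hits = defaultdict(list)
    for _, h in fingerprints(suspicious, n, keep_every):
        for doc_id, pos in index.get(h, ()):
            hits[doc_id].append(pos)
    return hits


def count_inversions(seq):
    """Number of pairs (i < j) with seq[i] > seq[j]; O(n^2), fine for a sketch."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


sources = {
    "wiki": "the raw image is convolved with a gaussian filter to reduce noise".split(),
    "other": "the system uses several parameters that were determined empirically".split(),
}
index = build_index(sources, n=3)
suspicious = "first the raw image is convolved with a gaussian filter".split()
hits = matched_positions(index, suspicious, n=3)
inversions = count_inversions(hits["wiki"])
```

In this toy run only the "wiki" source shares fingerprints with the suspicious text, and they occur in the same order, so the inversion count is zero.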

Algorithms for detailed analysis

• DotPlot: “Sequence Alignment Problem”.
• Modified FastDocode
  – Extending the analysis to the right and to the left, starting from common words/passages
  – Using passages instead of words as seeds for the comparison
  – tf-idf weighting & cosine similarity
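
Extending to the left and right from common passages can be illustrated with a small seed-and-extend sketch. It is a simplification: seeds are exact word n-grams and the extension stops at the first differing word, whereas the modified FastDocode described above also relies on tf-idf weighting and cosine similarity. The function name and parameters are illustrative only.

```python
def seed_and_extend(susp, src, seed_len=3, min_len=5):
    """Return maximal exact word matches as (susp_start, src_start, length) tuples."""
    # Index every seed (word n-gram) of the source by its start position.
    seeds = {}
    for j in range(len(src) - seed_len + 1):
        seeds.setdefault(tuple(src[j:j + seed_len]), []).append(j)

    passages = set()
    for i in range(len(susp) - seed_len + 1):
        for j in seeds.get(tuple(susp[i:i + seed_len]), ()):
            start_i, start_j = i, j
            # Extend to the left while the preceding words agree.
            while start_i > 0 and start_j > 0 and susp[start_i - 1] == src[start_j - 1]:
                start_i -= 1
                start_j -= 1
            end_i, end_j = i + seed_len, j + seed_len
            # Extend to the right while the following words agree.
            while end_i < len(susp) and end_j < len(src) and susp[end_i] == src[end_j]:
                end_i += 1
                end_j += 1
            if end_i - start_i >= min_len:
                passages.add((start_i, start_j, end_i - start_i))
    return sorted(passages)


susp = "we note that the raw image is convolved with a gaussian filter in our work".split()
src = "first the raw image is convolved with a gaussian filter to remove noise".split()
passages = seed_and_extend(susp, src)
```

Each reported passage corresponds to a diagonal run in the dot plot of the two word sequences, which is how this step relates to the DotPlot view of sequence alignment.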

Image source: Wikipedia

Algorithms for post-processing

• Semantic analysis using LSA
  – Built a semantic space with papers from Computer Science (and pages from Wikipedia)
  – Gensim framework in Python
• Smith-Waterman Algorithm
  – Dynamic programming
  – Similar to the longest common subsequence
  – Insert and delete operations may have any cost (they may be greater than 1)
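
A word-level Smith-Waterman scorer with configurable insert and delete costs might look like the sketch below. The default scores are arbitrary illustrative values, not the parameters used in AuthentiCop, and only the best score and its end cell are returned (no traceback).

```python
def smith_waterman(a, b, match=2, mismatch=-1, insert_cost=-2, delete_cost=-2):
    """Best local alignment score between sequences a and b, plus its end cell."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + delete_cost    # word of a left unaligned
            left = score[i][j - 1] + insert_cost  # word of b left unaligned
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos


original = "the raw image is convolved with a gaussian filter".split()
edited = "the raw image is then convolved with a gaussian filter".split()
best_score, end_cell = smith_waterman(original, edited)
```

With these defaults the nine shared words score 18 and the single inserted word ("then") costs 2, so the best local alignment spans both sentences instead of breaking into two shorter matches.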

Results

• Corpus: PAN 2011 (~22k documents)
• Run time on a laptop: ~20 hours
• Results:
• Official results from PAN 2011:

Plagdet         Recall          Precision       Granularity
0.221929185084  0.202996955425  0.366482242839  1.26150173611
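
The four official figures are consistent with the usual PAN definition of plagdet: the F-measure (F1) divided by log2(1 + granularity). A minimal sketch that recomputes the score from the reported precision, recall and granularity follows; the function names are illustrative.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """F1 discounted by log2(1 + granularity); equals F1 when granularity is 1."""
    return f_measure(precision, recall) / math.log2(1 + granularity)


# Official PAN 2011 figures from the table above.
official = plagdet(0.366482242839, 0.202996955425, 1.26150173611)  # ~0.2219

# "Fingerprinting & indexing" row of the candidate-selection table.
p, r = precision_recall(685, 494, 761)
candidate_f1 = f_measure(p, r)  # ~0.522
```

A granularity above 1 means that a single plagiarized passage is reported as several separate detections, which is why the official plagdet (0.222) is lower than the plain F1 of the same precision and recall (about 0.261).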

Results

• Specific corpus for CS:
  – 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
  – 307 BSc theses in English

Plagiarized text:
The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.

Original text from Wikipedia:
Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.

• Some elements are incorrectly identified as plagiarism: quotes, bibliographic references

Conclusions

• Improving the corpus
• The system uses several parameters that were determined empirically => use machine learning for finding the best values
• Increase the speed of the processing
• Improve the method: “bag of words” + information about the position of the words
• Need better post-processing for real documents (like scientific papers or theses)

Thank you!

• Questions?

• Discussion
