Plagiarism: Taxonomy, Tools and Detection Techniques HUSSAIN A... · Plagiarism: Taxonomy, Tools...

55
Introduction Plagiarism and Its Types: A Taxonomy Plagiarism Detection Approaches: A Taxonomy Plagiarism Detection Tools Issues and Challenges Conceptual Framework for Plagiarism Detection Conclusions References Plagiarism: Taxonomy, Tools and Detection Techniques Dhruba K Bhattacharyya, FIETE Tezpur University [email protected] October 28, 2016 Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

Transcript of Plagiarism: Taxonomy, Tools and Detection Techniques HUSSAIN A... · Plagiarism: Taxonomy, Tools...

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism: Taxonomy, Tools and DetectionTechniques

Dhruba K Bhattacharyya, FIETE

Tezpur University

[email protected]

October 28, 2016

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Overview

1 Introduction

2 Plagiarism and Its Types: A Taxonomy

3 Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Methods

4 Plagiarism Detection Tools

5 Issues and Challenges

6 Conceptual Framework for Plagiarism Detection

7 Conclusions

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Introduction I

Plagiarism is the presentation of another’s words, work or ideaas one’s own[4]. It has two components[4]

1 Taking the words, work or ideas from some source(s).2 Presenting it without acknowledgments of the source(s) from

where words, works or ideas are taken.There are mainly two types of plagiarisms typically found tooccur[8]

1 Textual plagiarism2 Source code plagiarism.

[4] Melissa S Anderson and Nicholas H Steneck. The problem of plagiarism. In Urologic Oncology: Seminars andOriginal Investigations, volume 29, pages 90-94. Elsevier, 2011.

[8] Netra Charya, Kushagra Doshi, Smit Bawkar, and Radha Shankarmani. Intrinsic plagiarism detection in digitaldata 2(3), 23-30, 2015.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Introduction IIPlagiarism can occur between two same or two different naturallanguages. Based on language homogeneity or heterogeneity of thetextual documents being compared, the plagiarism detection canbe divided into two basic types[3].

1 Monolingual Plagiarism Detection: This type of detectiondeals with homogeneous language settings e.g.,English-English

2 Cross-Lingual Plagiarism Detection: This detection approachis able to perform in heterogeneous language settings e.g.,English-Chinese.

[3] Salha M Alzahrani, Naomie Salim, and Ajith Abraham. Understanding plagiarism linguistic patterns, textualfeatures, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications andReviews), 42(2):133-149, 2012.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Introduction III

There are mainly two types of plagiarism detection approachesavailable based on whether external resources or references areused or not during plagiarism detection[3].a Intrinsic plagiarism detection : Where no external references are

used.b Extrinsic plagiarism detection : Where external references are

used.

[3] Salha M Alzahrani, Naomie Salim, and Ajith Abraham. Understanding plagiarism linguistic patterns, textualfeatures, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications andReviews), 42(2):133-149, 2012.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Contributions I

It reports a comprehensive and systematic survey on a largenumber of methods of plagiarism detection and analyzes theirpros and cons.It includes discussion on a large number of tools on plagiarismdetection and reports their features. It also compares thesetools based on a set of crucial parameters.It introduces the framework of a hybrid detection approach toidentify text similarity based on both intrinsic and extrinsicplagiarism.Finally, in includes a list of issues and research challenges.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Definition of plagiarismAs stated in [4], plagiarism can be defined as an appropriation ofthe ideas, words, process or results of another person withoutproper acknowledgment, credit or citation. Plagiarism can appearin a research article or program in following ways:a Claiming another person’s work as your own.b Use of another person’s work without giving credit.c Majority of someone’s contribution as your own, whether credit

is given or not.d Restructuring the other works and claiming as your own work.e providing wrong acknowledgment of other works in your work.

[4] Melissa S Anderson and Nicholas H Steneck. The problem of plagiarism. In Urologic Oncology: Seminars andOriginal Investigations, volume 29, pages 90-94. Elsevier, 2011.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Types of plagiarism

Plagiarism can appear in different forms in a document, work,production or program. Two basic types of plagiarisms[2].

1 Textual plagiarism: It is commonly seen in education andresearch. Figure 1 (a) shows an example of textual plagiarism.

2 Source Code plagiarism: Here, codes written by others arecopied or reused or modified or converted a part of codes andclaimed as one’s own. Figure 1 (b) shows the example anexample of source code.

[2] Asim M El Tahir Ali, Hussam M Dahwa Abdulla, and Vaclav Snasel. Overview and comparison of plagiarismdetection tools. In DATESO, pages 161-172. Citeseer, 2011..

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Example of plagiarism

(a)

(b)Figure 1: Examples of (a) Textual (b) Source Code Plagiarism

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

Taxonomy of plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Different types of Textual Plagiarism I

Textual plagiarism further can be divided into seven sub categoriesbased on its forms and application[8], [6]. We discuss each of thesein brief, next.

1 Deliberate copy-paste/clone plagiarism: This type of textualplagiarism refers to copying other works and presenting as ifyour own work with or without acknowledging the originalsource.

2 Paraphrasing plagiarism: This form of plagiarism can occur intwo ways as given below.

[6]C Barnbaum. Plagiarism: A student’s guide to recognizing it and avoiding it.[online].[cit. 2010-12-14], 2009.[8]Netra Charya, Kushagra Doshi, Smit Bawkar, and Radha Shankarmani. Intrinsic plagiarism detection in digitaldata 2(3), 23-30, 2015.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Different types of Textual Plagiarism IIi Simple paraphrasing: It refers to use of other idea, words or

work, and presenting it in different ways by switching words,changing sentence construction and changing in grammar style.

ii Mosaic/Hybrid/patchwork paraphrasing: This form of textualplagiarism generally occurs when you combine multiple researchcontributions of some others and present it in a different way bychanging structure and pattern of sentence, replacing wordswith synonyms and by applying a different grammar stylewithout citing the source(s).

3 Metaphor plagiarism: Metaphors are used to present othersidea in a clear and better manner.

4 Idea plagiarism: Here, idea or solution is borrowed from othersource(s) and claiming as your own in a research paper.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Different types of Textual Plagiarism III

5 Self/recycled plagiarism: In this form, an author uses his/herown previous published work in a new research paper forpublication.

6 404 Error / Illegitimate Source plagiarism: Here, an authorcites some references but the sources are invalid.

7 Retweet plagiarism: In this form an author cites the referenceof proper source but his/her presentation is very similar in thescene of original content wordings, sentence structures and/orgrammar usage.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Different types of Source Code Plagiarism I

This type of plagiarism, as shown in previous figure can be dividedinto four subtypes.

1 Manipulation from Vicinity plagiarism: Here, a developermanipulates a program by (i) inserting, (ii) deleting, or (iii)substituting some codes in an existing program, with orwithout acknowledging the original source and claiming it ashis/her own program.

2 Reordering structure plagiarism: In this type, the developerreorders the statements or functions of a program or changessyntax of a program without referring the original source.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Different types of Source Code Plagiarism II

3 No change plagiarism: Here, the developer adds or removeswhite spaces or comments or indentation of the program andclaims the program as his/her own program.

4 Language switching plagiarism: In this type, the developerchanges the languages, or a program written in one languageis rewritten in another language and declare it as his/her own.

Based on characteristics, plagiarism can also be categorized intoliteral and intelligent plagiarism. Literal plagiarism consists ofcopy-paste/clone, paraphrasing, self/recycled, and retweetplagiarism. The other form of plagiarism can be considered asintelligent type of plagiarism.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Plagiarism Detection Approach I

Plagiarism can occur between two same or two different naturallanguages. Based on language homogeneity or heterogeneity of thetextual documents being compared, the plagiarism detection can bedivided into two basic types[3] i.e., monolingual and cross-lingual.

1 Monolingual Plagiarism Detection: This type of detectiondeals with homogeneous language settings e.g.,English-English. Most detection methods are of this category.It can be further divided into two subtypes based on the useof external references during detection.

[3] Salha M Alzahrani, Naomie Salim, and Ajith Abraham. Understanding plagiarism linguistic patterns, textualfeatures, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications andReviews), 42(2):133-149, 2012.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Plagiarism Detection Approach II

a Intrinsic Plagiarism Detection: This detection approach analysesthe writing style or uniqueness of the author and attempts todetect plagiarism based on own-conformity or deviation betweenthe text segments. It does not require any external sources fordetection.

b Extrinsic Plagiarism Detection: Unlike the intrinsic approach, thisapproach compares a submitted research article against manyother available relevant digital resources in repositories or in theWeb for detection of plagiarism.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Plagiarism Detection Approach III

2 Cross-Lingual Plagiarism Detection: This detection approachis able to perform in heterogeneous language settings e.g.,English-Chinese. There are only a few cross-lingual plagiarismdetection methods available due to difficulty in findingproximity between two text segments for different languages.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Ways of Plagiarism Detection

There are different textual features like lexical feature,syntactic feature, semantic feature and structural feature,which can be used to detect similarity between twodocuments.Source code similarity detection can be carried out in variousways, such as (i) string matching, (ii) token matching, (iii)parse tree matching, (iv) program dependency graph (PDG)matching, (v) similarity-score matching and (vi) byhybridization of the above[1].

[1] Vale Alekya and S Sai Satyanarayana Reddy. Survey of programming plagiarism detection 2(6), 188-193, 2014.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

A schematic view of a generic plagiarism detection method

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

Plagiarism Detection Methods: A Taxonomy

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Character-Based Methods I

It utilizes character-based, word-based, and syntax-basedfeatures to find similarity between a query document and anexisting document.The similarity between a pair of documents may be estimatedusing both exact matching and approximate matching.Most of the detection techniques are developed based onn-gram or word n-gram based exact string similarity findingapproach.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Character-Based Methods II

Grozea et al. [15] use character 16-gram matching, whereasthe authors of [7] use word 8-gram matching.One can use string similarity metric or vector similarity metricfor finding similarity.

[7] Chiara Basile, Dario Benedetto, Emanuele Caglioti, Giampaolo Cristadoro, and MD Esposti. A plagiarismdetection procedure in three steps: Selection, matches and aAsquaresaAI. In Proc. SEPLN, pages 19-23, 2009.

[15] Cristian Grozea, Christian Gehl, and Marius Popescu. Encoplot: Pairwise sequence matching in linear timeapplied to plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social SoftwareMisuse, page 10, 2009.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Vector-Based Method

Here, lexical and syntax features are extracted and categorizedas tokens rather than strings.The similarity can be computed using various vector similaritymeasures like Jaccard, Dice’s, Overlap, Cosine, Euclidean andManhattan coefficients.Cosine coefficient is useful in detecting partial plagiarismwithout sharing document content.Hence it is useful to detect plagiarism in documents wheresubmission is considered as confidential[24].

[24] Haijun Zhang and Tommy WS Chow. A coarse-to-fine framework to efficiently thwart plagiarism. PatternRecognition, 44(2):471-487, 2011.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Syntax-Based Methods I

This methods exploit syntactical features like part of speech(POS) of phrase and words in different statements to detectplagiarism.The elements of basic POS tags are verbs, nouns, pronouns,adjectives, adverbs, prepositions, conjunctions, andinterjections.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Syntax-Based Methods II

In [11], [10], the authors use POS tag features followed bystring similarity metric to analyze and calculate similaritybetween texts.Documents containing same POS tag features are carried outfor further analysis and for identification of source of aplagiarism[10].

[10] Mohamed Elhadi and Amjad Al-Tobi. Use of text syntactical structures in detection of document duplicates.In Digital Information Management, 2008. ICDIM 2008. Third International Conference on, pages 520-525. IEEE,2008.

[11] Mohamed Elhadi and Amjad Al-Tobi. Duplicate detection in documents and webpages using improved longestcommon subsequence and documents syntactical structures. In Computer Sciences and Convergence InformationTechnology, 2009. ICCIT’09. Fourth International Conference on, pages 679-684. IEEE, 2009. IEEE, 2008.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Semantic-Based Methods I

Two sentences may be same but the order of their words maybe different. For example, a sentence may be constructed byjust transforming from active voice to passive voice.WordNet[12] is used in this context to find the semanticsimilarity between words or sentences.The degree of similarity between two words used inknowledge-based measures by Gelbukh[22] is calculated usinginformation from a dictionary.

[12] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

[22] Sulema Torres and Alexander Gelbukh. Comparing similarity measures for original wsd lesk algorithm.Research in Computing Science, 43:155-166, 2009.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Semantic-Based Methods II

Resnik[20] used WordNet to calculate the semantic similarity,whereas, Leacock’s et al.,[17] determine semantic similarity bycounting the number of nodes of shortest path between twoconcepts.

[17] Claudia Leacock, George A Miller, and Martin Chodorow. Using corpus statistics and wordnet relations forsense identification. Computational Linguistics, 24(1):147-165, 1998.

[20] Philip Resnik et al. Semantic similarity in a taxonomy: An information-based measure and its application toproblems of ambiguity in natural language. J. Artif. Intell. Res.(JAIR), 11:95-130, 1999.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Fuzzy-Based Methods

Here, the words in a document are represented using a set ofwords of similar meaning and sets are considered as fuzzysince each word of the documents is associated with a degreeof similarity[23].This method is attractive because it can detect similaritybetween documents with uncertainty.In [16], the degree of similarity between two documents or anytwo Web documents are identified by using fuzzy IR approach.

[12] Tommy WS Chow and MKM Rahman. Multilayer som with tree-structured data for efficient documentretrieval and plagiarism detection. IEEE Transactions on Neural Networks, 20(9):1385-1402, 2009.

[19] MKM Rahman, Wang Pi Yang, Tommy WS Chow, and Sitao Wu. A flexible multi-layer self-organizing mapfor generic processing of tree-structured data. Pattern Recognition, 40(5):1406-1424, 2007.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Structure-Based Methods

Unlike other methods discussed, structure based method usescontextual similarity.Contextual information is generally handled usingtree-structure feature representation as can be found inML-SOM [19].In [9], the authors detect plagiarism in two steps. First stepperforms document clustering and candidate retrieval usingtree-structure feature representation and second step detectsplagiarism by utilizing ML-SOM.

[9] Tommy WS Chow and MKM Rahman. Multilayer som with tree-structured data for efficient document retrievaland plagiarism detection. IEEE Transactions on Neural Networks, 20(9):1385-1402, 2009.

[19] MKM Rahman, Wang Pi Yang, Tommy WS Chow, and Sitao Wu. A flexible multi-layer self-organizing mapfor generic processing of tree-structured data. Pattern Recognition, 40(5):1406-1424, 2007.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Stylometric-Based Methods

These methods aim to quantify the writing styles of theauthor to detect plagiarism.It computes similarity score between two sections orparagraphs or sentences based on stylometric features of theauthors.These methods are instances of intrinsic plagiarism.The style representation formula may be writer specific orreader specific[27].One can find usefulness of outlier mining in this context todetect plagiarism in a document under this approach.

[27] Sven Meyer Zu Eissen, Benno Stein, and Marion Kulig. Plagiarism detection without reference collections. InAdvances in data analysis, pages 359âĂŞ366. Springer, 2007.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Methods for Cross-Lingual Plagiarism Detection

It requires in-depth knowledge of multiple languages.This type of methods work based on cross-lingual textfeatures.Various types of these methods include (1) cross-lingualsyntax-based methods, (2) cross-lingual dictionary-basedmethods, and (3) cross-lingual semantic-based methods [3].In [15], a statistical model is used to evaluate the similaritybetween two documents regardless of the order in which theterms appear in suspected and original documents[18].

[18] Ahmed Hamza Osman, Naomie Salim, and Albaraa Abuobieda. Survey of text plagiarism detection. ComputerEngineering and Applications Journal (ComEngApp), 1(1):37-45, 2012.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Grammar Semantics Hybrid Plagiarism Detection Methods

These methods eliminate the limitations of semantic-basedmethods.They are capable of detecting copy/paste and paraphrasingplagiarism accurately.A semantic-based method usually cannot detect anddetermine the location of plagiarised part of the document butsuch grammar-based method can address this issue efficiently[5], [3].

[3] Salha M Alzahrani, Naomie Salim, and Ajith Abraham. Understanding plagiarism linguistic patterns, textualfeatures, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications andReviews), 42(2):133-149, 2012.

[5] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, and Qin-Bao Song. A survey on natural language text copydetection. Journal of software, 14(10):1753-1760, 2003.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Classification and Cluster-Based Methods

In information retrieval process, supervised and unsupervisedgrouping of documents play an important role.It helps in reducing the document comparison timesignificantly during plagiarism detection[26].Some methods [25], [21] use keywords or specific words tocluster the similar sections of documents.

[21] Antonio Si, Hong Va Leong, and Rynson WH Lau. Check: a document plagiarism detection system. InProceedings of the 1997 ACM symposium on Applied computing, pages 70-77. ACM, 1997.

[25] Manuel Zini, Marco Fabbri, Massimo Moneglia, and Alessandro Panunzi. Plagiarism detection throughmultilevel text comparison. In 2006 Second International Conference on Automated Production of Cross MediaContent for Multi-Channel Distribution (AXMEDIS’06), pages 181-185. IEEE, 2006.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Plagiarism Detection Methods

Citation-Based Methods

It detects plagiarism in documents based on citations.In [13], a novel method is proposed to detect plagiarism onthe basis of citation.Citation-based methods belong to semantic plagiarismdetection techniques because these techniques use semanticscontained in the citation in a document[14].The similarity between two document is computed based onthe similar patterns in the citation sequences[14].

[13] Bela Gipp and Joran Beel. Citation based plagiarism detection: a new approach to identify plagiarized worklanguage independently. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, pages 273-274.ACM, 2010.

[14] Bela Gipp and Norman Meuschke. Citation pattern matching algorithms for citation-based plagiarismdetection: greedy citation tiling, citation chunking and longest common citation sequence. In Proceedings of the11th ACM symposium on Document engineering, pages 249-258. ACM, 2011.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

Plagiarism Detection Techniques: A General Comparison

Plagiarism Detection Tools Classification

Plagiarism Detection Tools: A General Comparison

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Issues and Challenges I

Most prominent methods have been able to address the majorissues related to (i) salient syntactic and semantic featureextraction, (ii) handling of both monolingual and cross-lingualplagiarism detection, and (iii) detecting plagiarism in bothtext data and program source code with or without usingreferences.Due to the rapid growth of digital technology to support itsreproduction, storage and dissemination, some importantissues and research challenges are still left unattended.We highlight some issues and challenges that need to beaddressed by computer science and linguistic researchers.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Issues and Challenges II

(a) A detection method for both text data and source code thatensures both proof of correctness and proof of completeness isstill missing, and hence an important issue.

(b) A proximity measure that guarantees detection of plagiarizedtext segment(s) in both intrinsic and extrinsic detectionframework with high accuracy, is still not available.

(c) Developing a cross-lingual plagiarism checking tool that canperform without external references but ensures high accuracyis a challenging task.

(d) Developing a repository that maintains references based onauthor footprints, which is complete and accurate is anotherchallenging task.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Issues and Challenges III

(e) Developing a plagiarism checker that accepts an idea narratedby user and generates a detail plagiarism report (with similarityif detected from 1%-99%) with correct sources, is animportant issue.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conceptual Framework for Plagiarism Detection I

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conceptual Framework for Plagiarism Detection II

This conceptual framework is a hybrid approach to detect textplagiarism with high accuracy at an early stage.The approach is designed based on the merit of both intrinsicand extrinsic plagiarism detection approaches.Initially it takes a candidate document Dc as input, computesits salient global features, segments Dc and extract localfeatures for each segmented text, and finally forms ak-dimensional (k = m(global) + n(local)) feature vector Dc

i .

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conceptual Framework for Plagiarism Detection III

The global feature values of the feature vector are used toidentify the a subset of relevant source documents for a giveninput candidate document, whereas, the local feature valuesare utilized during matching to identify the possible plagiarisedtext segments as well as the relevant source documents.Our approach assume the possession of such k dimensionalfeature vector against each prototype document Dp

j in therepository.The following features make our approach more attractivewhile comparing its other counterparts.

a It exploits the merits of both intrinsic and extrinsic plagiarismin a cost effective manner.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conceptual Framework for Plagiarism Detection IV

b The nearest-neighbour based outlier mining technique is ableto detect a plagiarised text segment or paragraph at an earlystage in terms of non-conformity.

c The meta-level (i) initial reference filtering and (ii) the finalselection of sources or subset of references relevant to theplagiarised text segment, makes our approach more economicand computationally cost-effective.

d The flexibility of allowing one to incorporate or delete featureswhile constructing vector to represent a text document ingeneral and the uniqueness of a text segment.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conclusions I

We reported an exhaustive survey on plagiarism detectionmethods and tools in a systematic way.We presented a taxonomy of various forms of plagiarism occurin text data and source code.Next, we reported a large number of methods and tools undervarious categories and compared and analyzed their pros andcons.We present a hybrid plagiarism detection approach based onthe merit of both intrinsic and extrinsic plagiarism detectionapproaches.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Conclusions II

Finally, we have highlighted a list of issues and researchchallenges towards developing a plagiarism checker that iscomplete and correct for both monolingual and cross-lingualtext data and for source code.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References I

[1] Vale Alekya and S Sai Satyanarayana Reddy. Survey of programmingplagiarism detection, 2(6), 188–193, 2014.

[2] Asim M El Tahir Ali, Hussam M Dahwa Abdulla, and Vaclav Snasel.Overview and comparison of plagiarism detection tools. In DATESO,pages 161–172. Citeseer, 2011.

[3] Salha M Alzahrani, Naomie Salim, and Ajith Abraham. Understandingplagiarism linguistic patterns, textual features, and detection methods.IEEE Transactions on Systems, Man, and Cybernetics, Part C(Applications and Reviews), 42(2):133–149, 2012.

[4] Melissa S Anderson and Nicholas H Steneck. The problem of plagiarism.In Urologic Oncology: Seminars and Original Investigations, volume 29,pages 90–94. Elsevier, 2011.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References II

[5] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, and Qin-Bao Song. A surveyon natural language text copy detection. Journal of software,14(10):1753–1760, 2003.

[6] C Barnbaum. Plagiarism: A student’s guide to recognizing it and avoidingit.[online].[cit. 2010-12-14], 2009.

[7] Chiara Basile, Dario Benedetto, Emanuele Caglioti, Giampaolo Cristadoro,and MD Esposti. A plagiarism detection procedure in three steps:Selection, matches and âĂIJsquaresâĂİ. In Proc. SEPLN, pages 19–23,2009.

[8] Netra Charya, Kushagra Doshi, Smit Bawkar, and Radha Shankarmani.Intrinsic plagiarism detection in digital data, 2(3), 23–30, 2015.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References III

[9] Tommy WS Chow and MKM Rahman. Multilayer som withtree-structured data for efficient document retrieval and plagiarismdetection. IEEE Transactions on Neural Networks, 20(9):1385–1402,2009.

[10] Mohamed Elhadi and Amjad Al-Tobi. Use of text syntactical structures indetection of document duplicates. In Digital Information Management,2008. ICDIM 2008. Third International Conference on, pages 520–525.IEEE, 2008.

[11] Mohamed Elhadi and Amjad Al-Tobi. Duplicate detection in documentsand webpages using improved longest common subsequence anddocuments syntactical structures. In Computer Sciences and ConvergenceInformation Technology, 2009. ICCIT’09. Fourth International Conferenceon, pages 679–684. IEEE, 2009.

[12] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References IV

[13] Bela Gipp and Joran Beel. Citation based plagiarism detection: a newapproach to identify plagiarized work language independently. InProceedings of the 21st ACM conference on Hypertext and hypermedia,pages 273–274. ACM, 2010.

[14] Bela Gipp and Norman Meuschke. Citation pattern matching algorithmsfor citation-based plagiarism detection: greedy citation tiling, citationchunking and longest common citation sequence. In Proceedings of the11th ACM symposium on Document engineering, pages 249–258. ACM,2011.

[15] Cristian Grozea, Christian Gehl, and Marius Popescu. Encoplot: Pairwisesequence matching in linear time applied to plagiarism detection. In 3rdPAN Workshop. Uncovering Plagiarism, Authorship and Social SoftwareMisuse, page 10, 2009.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References V

[16] Jonathan Koberstein and Yiu-Kai Ng. Using word clusters to detectsimilar web documents. In International Conference on KnowledgeScience, Engineering and Management, pages 215–228. Springer, 2006.

[17] Claudia Leacock, George A Miller, and Martin Chodorow. Using corpusstatistics and wordnet relations for sense identification. ComputationalLinguistics, 24(1):147–165, 1998.

[18] Ahmed Hamza Osman, Naomie Salim, and Albaraa Abuobieda. Survey oftext plagiarism detection. Computer Engineering and Applications Journal(ComEngApp), 1(1):37–45, 2012.

[19] MKM Rahman, Wang Pi Yang, Tommy WS Chow, and Sitao Wu. Aflexible multi-layer self-organizing map for generic processing oftree-structured data. Pattern Recognition, 40(5):1406–1424, 2007.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References VI[20] Philip Resnik et al. Semantic similarity in a taxonomy: An

information-based measure and its application to problems of ambiguity innatural language. J. Artif. Intell. Res.(JAIR), 11:95–130, 1999.

[21] Antonio Si, Hong Va Leong, and Rynson WH Lau. Check: a documentplagiarism detection system. In Proceedings of the 1997 ACM symposiumon Applied computing, pages 70–77. ACM, 1997.

[22] Sulema Torres and Alexander Gelbukh. Comparing similarity measures fororiginal wsd lesk algorithm. Research in Computing Science, 43:155–166,2009.

[23] Rajiv Yerra and Yiu-Kai Ng. A sentence-based copy detection approachfor web documents. In International Conference on Fuzzy Systems andKnowledge Discovery, pages 557–570. Springer, 2005.

[24] Haijun Zhang and Tommy WS Chow. A coarse-to-fine framework toefficiently thwart plagiarism. Pattern Recognition, 44(2):471–487, 2011.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

References VII

[25] Manuel Zini, Marco Fabbri, Massimo Moneglia, and Alessandro Panunzi.Plagiarism detection through multilevel text comparison. In 2006 SecondInternational Conference on Automated Production of Cross MediaContent for Multi-Channel Distribution (AXMEDIS’06), pages 181–185.IEEE, 2006.

[26] Du Zou, Wei-Jiang Long, and Zhang Ling. A cluster-based plagiarismdetection method. In Notebook Papers of CLEF 2010 LABs andWorkshops, 22-23 September, 2010.

[27] Sven Meyer Zu Eissen, Benno Stein, and Marion Kulig. Plagiarismdetection without reference collections. In Advances in data analysis,pages 359–366. Springer, 2007.

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism

IntroductionPlagiarism and Its Types: A Taxonomy

Plagiarism Detection Approaches: A TaxonomyPlagiarism Detection Tools

Issues and ChallengesConceptual Framework for Plagiarism Detection

ConclusionsReferences

Thank You...

Dhruba K Bhattacharyya, FIETE Survey on Plagiarism