My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers
SUPERVISED BY: Dr Hitham M. Abo Bakr
Implementing Plagiarism Detection Engine
For English Academic Papers
By
Muhamed Gameel Abd El Aziz
Ahmed Motair El Said Mater
Mohamed Hessien Mohamed
Shreif Hosni Zidan Esmail
Manar Mohamed Said Ahmed
Doaa Abd El Hamid Abd El Hamid
Implementing Plagiarism Detection Engine for English Academic Papers 1
Abstract
Plagiarism has become a serious issue nowadays due to the vast resources easily available on the web, which makes developing a plagiarism detection tool a useful and challenging task, given the scalability issues involved.
Our project implements a Plagiarism Detection Engine oriented toward English academic papers, using text Information Retrieval methods, a relational database, and Natural Language Processing techniques.
The main parts of the project are:
Gathering and cleaning data: crawling the web to collect academic papers and parsing them to extract information about each paper, building a large dataset of scientific paper content.
Tokenization: parsing, tokenizing, and preprocessing documents.
Plagiarism engine: checking the similarity between the input document and the database to detect potential plagiarism.
Table of Contents
Abstract ___________________________________________________________________________ 1
Table of Contents ___________________________________________________________________ 2
Table of Figures ____________________________________________________________________ 4
Table of Tables _____________________________________________________________________ 7
Chapter 1 Introduction ___________________________________________________________ 8
1.1 What is Plagiarism? _________________________________________________________________8
1.2 What is Self-Plagiarism? _____________________________________________________________8
1.3 Plagiarism on the Internet ____________________________________________________________8
1.4 Plagiarism Detection System __________________________________________________________8
1.4.1 Local similarity: __________________________________________________________________ 8
1.4.2 Global similarity: _________________________________________________________________ 9
1.4.3 Fingerprinting ___________________________________________________________________ 9
1.4.4 String Matching __________________________________________________________________ 9
1.4.5 Bag of words _____________________________________________________________________ 9
1.4.6 Citation-based Analysis ____________________________________________________________ 9
1.4.7 Stylometry _______________________________________________________________________ 9
Chapter 2 Background Theory ____________________________________________________ 10
2.1 Linear Algebra Basics ______________________________________________________________ 10
2.1.1 Vectors _________________________________________________________________________ 10
2.2 Information Retrieval (IR) __________________________________________________________ 11
2.3 Regular Expression ________________________________________________________________ 15
2.4 NLTK Toolkit ____________________________________________________________________ 16
2.5 Node.js __________________________________________________________________________ 16
2.6 Express.js ________________________________________________________________________ 16
2.7 Sockets.io ________________________________________________________________________ 16
2.8 Languages Used ___________________________________________________________________ 16
Chapter 3 Design and Architecture _________________________________________________ 17
3.1 Extract, Transform and Load (ETL) ___________________________________________________ 17
3.2 Plagiarism Engine _________________________________________________________________ 17
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization __________________ 18
3.2.2 Semantic Analysis (Vector Space Model VSM Representation) _________________________ 18
3.2.3 Calculating Similarity ____________________________________________________________ 18
3.2.4 Clustering ______________________________________________________________________ 18
3.2.5 Communicating Results___________________________________________________________ 19
Chapter 4 Implementation ________________________________________________________ 20
4.1 Extract, Transform and Load (ETL) _________________________________________________ 20
4.1.1 The Crawler ____________________________________________________________________ 20
4.1.2 The Parser ______________________________________________________________________ 20
4.1.3 The Data Extracted from the paper _________________________________________________ 20
4.1.4 The Parser Implementation _______________________________________________________ 21
4.1.5 How it works ____________________________________________________________________ 21
4.1.6 Steps of Parsing _________________________________________________________________ 22
4.1.7 The Paper Class _________________________________________________________________ 26
4.1.8 The Paragraph Structure _________________________________________________________ 27
4.1.9 Parsing the First Page in Details (ex: an IEEE Paper) _________________________________ 27
4.1.10 Parsing the Other Pages in Details (ex: an IEEE Paper) _______________________________ 37
4.2 The Natural Language Processing (NLP) ______________________________________________ 42
4.2.1 Introduction ____________________________________________________________________ 42
4.2.2 The Implementation Overview _____________________________________________________ 42
4.2.3 The Text Processing Procedure ____________________________________________________ 42
4.2.4 Example of the Text Processing ____________________________________________________ 45
4.3 Term Weighting __________________________________________________________________ 47
4.3.1 Lost Connection to Database Problem ______________________________________________ 47
4.3.2 Process Paragraph _______________________________________________________________ 48
4.3.3 Generating Terms _______________________________________________________________ 48
4.3.4 Populating term, paragraphVector Tables ___________________________________________ 51
4.3.5 Executing VSM Algorithm ________________________________________________________ 52
4.4 Testing Plagiarism ________________________________________________________________ 53
4.4.1 Process Paragraph _______________________________________________________________ 53
4.4.2 Calculate Similarity ______________________________________________________________ 54
4.4.3 Get Results _____________________________________________________________________ 54
4.5 The VSM Algorithm _______________________________________________________________ 55
4.5.1 Calculating similarity ____________________________________________________________ 55
4.5.2 K-means and Clustering __________________________________________________________ 56
4.6 Server Side _______________________________________________________________________ 59
4.6.1 Handling Routing ________________________________________________________________ 59
4.6.2 Running Python System __________________________________________________________ 60
4.7 Client Side _______________________________________________________________________ 62
4.8 The GUI of the System _____________________________________________________________ 63
Chapter 5 Results and Discussion __________________________________________________ 66
5.1 Dataset of the Parser _______________________________________________________________ 66
5.2 Exploring dataset _________________________________________________________________ 68
5.2.1 Small dataset (15K) ______________________________________________________________ 68
5.2.2 Big dataset (50K) ________________________________________________________________ 69
5.3 Performance _____________________________________________________________________ 70
5.4 Detecting plagiarism _______________________________________________________________ 72
5.4.1 Percentage score functions:________________________________________________________________ 72
5.5 Discussing results _________________________________________________________________ 74
Chapter 6 Conclusion ___________________________________________________________ 75
Chapter 7 Appendix _____________________________________________________________ 76
7.1 Entity-Relation Diagram (ERD) _____________________________________________________ 76
7.2 Stored procedures _________________________________________________________________ 77
References _______________________________________________________________________ 84
Table of Figures
Figure 1.1 Plagiarism Detection Approaches _____________________________________________________8
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3) ______ 10
Figure 2.2 Geometric representation of documents ______________________________________________ 12
Figure 3.1 High level block diagram __________________________________________________________ 17
Figure 3.2 Detailed block diagram of the Plagiarism Engine ______________________________________ 17
Figure 4.1 Overview for the Crawler and Parser ________________________________________________ 20
Figure 4.2 UML of the Parser Application _____________________________________________________ 21
Figure 4.3 the Flow Chart of the Parser _______________________________________________________ 21
Figure 4.4 The main function of Parsing ______________________________________________________ 22
Figure 4.5 The First Page of an IEEE Paper (as Blocks) _________________________________________ 22
Figure 4.6 First Page of a Science Direct Paper _________________________________________________ 23
Figure 4.7 First Page of a Springer Paper _____________________________________________
Figure 4.8 The function of parseOtherPages ___________________________________________________ 24
Figure 4.9 Block of String before Enhancing ___________________________________________________ 25
Figure 4.10 The Paragraphs after enhancing ___________________________________________________ 25
Figure 4.11 the Paper Structure _____________________________________________________________ 26
Figure 4.12 the Paragraph Structure _________________________________________________________ 27
Figure 4.13 Different forms for an IEEE Top Header ____________________________________________ 27
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper ____________________________ 28
Figure 4.15 The supported Regex of the IEEE Header formats ____________________________________ 29
Figure 4.16 The Function of extracting the Volume Number ______________________________________ 30
Figure 4.17 The Function of Extracting the Issue Number ________________________________________ 30
Figure 4.18 The Function of Extracting the DOI ________________________________________________ 30
Figure 4.19 The Function of Extracting the Start and End Pages __________________________________ 31
Figure 4.20 The Function of Extracting the Journal Title_________________________________________ 32
Figure 4.21 Parsing the rest of blocks in the first Page ___________________________________________ 32
Figure 4.22 The Function of Extracting the DOI and PII _________________________________________ 33
Figure 4.23 The Function of Extracting the ISSN _______________________________________________ 33
Figure 4.24 The Function of extracting the paper Dates __________________________________________ 34
Figure 4.25 The Function of Extracting the Keywords ___________________________________________ 35
Figure 4.26 The Function of Extracting the Keywords ___________________________________________ 36
Figure 4.27 The Function of Extracting the Title and the Authors __________________________________ 36
Figure 4.28 Defining the Style of the Header _______________________________________________
Figure 4.29 the Function of Extracting the Figure Captions ______________________________________ 39
Figure 4.30 the Function of separating the lists _________________________________________________ 40
Figure 4.31 the Function of Extracting the Paragraph ___________________________________________ 40
Figure 4.32 The Function of Extracting the Paragraph __________________________________________ 41
Figure 4.33 Process Text Function ___________________________________________________________ 42
Figure 4.34 Tokenizing words Function _______________________________________________________ 42
Figure 4.35 Tokenization Example ___________________________________________________________ 43
Figure 4.36 POS Function __________________________________________________________________ 43
Figure 4.37 POS Output Example ____________________________________________________________ 43
Figure 4.38 WordNet POS Function __________________________________________________________ 43
Figure 4.39 Removing Punctuations Function __________________________________________________ 44
Figure 4.40 Removing Stop Words Function ___________________________________________________ 44
Figure 4.41 Stop Words list _________________________________________________________________ 44
Figure 4.42 Lemmatization Function _________________________________________________________ 45
Figure 4.43 Paragraph before Text Processing _________________________________________________ 45
Figure 4.44 Paragraph after Text Processing ___________________________________________________ 46
Figure 4.45 Retrieving Paragraphs ___________________________________________________________ 47
Figure 4.46 Process Paragraph Function ______________________________________________________ 48
Figure 4.47 Generate k-gram Terms Function __________________________________________________ 49
Figure 4.48 Paragraph Example _____________________________________________________________ 49
Figure 4.49 1-gram terms ___________________________________________________________________ 50
Figure 4.50 2-gram terms ___________________________________________________________________ 50
Figure 4.51 3-gram terms ___________________________________________________________________ 50
Figure 4.52 4-gram terms ___________________________________________________________________ 50
Figure 4.53 5-gram terms ___________________________________________________________________ 50
Figure 4.54 Calculate Term Frequency _______________________________________________________ 51
Figure 4.55 insert Terms in Database _________________________________________________________ 51
Figure 4.56 insert Paragraph Vector in Database _______________________________________________ 51
Figure 4.57 Executing the VSM Algorithm_____________________________________________________ 52
Figure 4.58 tokenizing and link paragraphs together _____________________________________________ 53
Figure 4.59 Process input paragraphs _________________________________________________________ 53
Figure 4.60 Populate input paragraph vector ___________________________________________________ 53
Figure 4.61 Calculate Similarity _____________________________________________________________ 54
Figure 4.62 Get Results ____________________________________________________________________ 54
Figure 4.63 Flowchart of the Kmeans text clustering algorithm ____________________________________ 57
Figure 4.64 Home Page Routing _____________________________________________________________ 59
Figure 4.65 Pre-Process Page Routing ________________________________________________________ 59
Figure 4.66 Communicating between the Server and the Core Engine for testing plagiarism ____________ 60
Figure 4.67 Communicating between the Server and the Core Engine for Pre-processing ______________ 61
Figure 4.68 Longest Common Subsequence LCS Algorithm _________________________________________
Figure 4.69 Longest Common Subsequence LCS Algorithm _________________________________________
Figure 4.70 Submitting an input document_____________________________________________________ 63
Figure 4.71 The Results of the Process Part 1 __________________________________________________ 64
Figure 4.72 The Results of the Process Part 2 __________________________________________________ 65
Figure 5.1 Number of Papers Published per Year in IEEE ________________________________________ 66
Figure 5.2 Number of Papers Published per Year in Springer _____________________________________ 67
Figure 5.3 Number of Papers Published per Year in Science Direct _________________________________ 67
Figure 5.4 Response time against number of paragraphs tested on small dataset ______________________ 70
Figure 5.5 Screenshot of the System Performance from the System GUI _____________________________ 71
Figure 7.1 ERD of the plagiarism Engine database ______________________________________________ 76
Table of Tables
Table 1 Statistics of the Parser ________________________________________________________________66
Table 2 Dataset Statistics ____________________________________________________________________68
Table 3 Unique Terms count in each Paragraph _________________________________________________68
Table 4 Unique Terms count in Dataset ________________________________________________________68
Table 5 Dataset Statistics ____________________________________________________________________69
Table 6 Unique Terms count in each Paragraph _________________________________________________69
Table 7 Unique Terms count in Dataset ________________________________________________________69
Table 8 Processing time of each module in Plagiarism Engine ______________________________________70
Table 9 Parameters _________________________________________________________________________72
Table 10 Testing Paragraphs and Results _______________________________________________________73
Chapter 1 Introduction
1.1 What is Plagiarism?
Plagiarism is the act of academic theft: copying words from a book or a scientific paper and publishing them as one's own work. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also called plagiarism.
1.2 What is Self-Plagiarism?
Self-plagiarism occurs when someone reuses a portion of an article or work he has published before without citing that he is doing so; this portion could be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work may already have been transferred to its publisher. Such articles and works are called duplicate or multiple publications.
1.3 Plagiarism on the Internet
Today, blogs, Facebook pages, and some websites copy and paste information, violating many copyrights. Several tools are used to deter this, such as disabling right-click to prevent copying, placing copyright warnings on every page of the website as banners or pictures, and using the DMCA copyright law to report copyright infringement; such a report can be sent to the website owner or to the ISP hosting the website, after which the infringing website may be taken down.
1.4 Plagiarism Detection System
A plagiarism detection system tests whether a given material contains plagiarism; this material could be a scientific article, a technical report, an essay, or other text. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced by others with the same meaning.
Figure 1.1 Plagiarism Detection Approaches
1.4.1 Local similarity:
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example checking whether two students cheated on an assignment.
1.4.2 Global similarity:
Global similarity systems check the similarity of a small set of input paragraphs against a large dataset, for example checking whether a submitted paper is plagiarized from an already published paper.
1.4.3 Fingerprinting
In this approach, the dataset consists of sets of multiple n-grams from documents. These n-grams are selected randomly as substrings of each document; each set of n-grams represents a fingerprint for that document, and its elements are called minutiae. All these fingerprints are indexed in the database. The input text is processed in the same way and compared with the fingerprints in the database; if it matches some of them, it plagiarizes some documents. [1]
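As an illustration of the fingerprinting idea, here is a minimal winnowing-style sketch in Python (the function names, the parameters k and w, and the use of MD5 are our own illustrative choices, not part of this project): every k-gram of words is hashed, and the minimum hash of each sliding window is kept as a minutia, so documents sharing a long enough run of text are guaranteed to share fingerprints.

```python
import hashlib

def fingerprints(text, k=5, w=4):
    """Hash every k-gram of words; keep the minimum hash of each sliding
    window of w hashes as the document's fingerprint set (its minutiae)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
    if len(hashes) <= w:
        return set(hashes)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def overlap(fp_a, fp_b):
    """Fraction of a's fingerprints that also occur in b."""
    return len(fp_a & fp_b) / len(fp_a) if fp_a else 0.0

original = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog in the field today"
print(overlap(fingerprints(original), fingerprints(copied)) > 0)  # shared run of 5-grams
```

The matching step would then compare the input text's fingerprint set against all indexed fingerprints, exactly as described above.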
1.4.4 String Matching
String matching is one of the central problems in plagiarism detection systems: to detect verbatim plagiarism you have to make an exact match, but comparing the document under test against the whole database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]
1.4.5 Bag of words
This approach is an adaptation of vector space retrieval, where the document is represented as a bag of words; these words are inserted into the database as n-grams along with their locations in the document and their frequencies in this and other documents. The document to be tested is represented as a bag of words too and compared with the n-grams in the database. [3]
1.4.6 Citation-based Analysis
This is the only approach that doesn't rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It's not widely used in commercial software, but prototypes of it exist.
1.4.7 Stylometry
Stylometry analyzes only the suspicious document itself, detecting plagiarized passages through differences in linguistic characteristics.
This method isn't accurate on small documents, as it needs to analyze large passages, up to thousands of words per chunk, to extract reliable linguistic properties [4].
Our project uses global similarity with the bag-of-words approach: the system has a dataset of many scientific papers divided into paragraphs, and the input text is likewise divided into paragraphs and compared against this large dataset of paragraphs.
Chapter 2 Background Theory
2.1 Linear Algebra Basics
Since we use the Vector Space Model to represent and retrieve text documents, some basic linear algebra is needed.
2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or equivalently a mathematical object consisting of ordered values.
1. Representation in 2D and 3D
1) Graphical (Geometric) representation
A vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D
space.
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3).
Source: Wikimedia commons.
2) Cartesian representation
Vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the
endpoint of a vector can be identified with an ordered list of n real numbers (n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)
2. Operations on vectors
1) Scalar product
r·a = (r·a_x, r·a_y, r·a_z)
2) Sum
a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtract
a − b = (a_x − b_x, a_y − b_y, a_z − b_z)
4) Dot product
Algebraic definition: a · b = a_x·b_x + a_y·b_y + a_z·b_z
Geometric definition: a · b = ‖a‖‖b‖ cos θ
Where: ‖a‖ is the magnitude of vector a, ‖b‖ is the magnitude of vector b, and θ is the angle between a and b.
The scalar projection of a vector a in the direction of another vector b is given by: a_b = a · b̂
Where: b̂ = b/‖b‖ is the normalized vector (unit vector) of b.
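The operations above translate directly into a few lines of Python (the helper names are ours, for illustration only):

```python
def dot(a, b):
    """Algebraic dot product: sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Vector magnitude: ||a|| = sqrt(a . a)."""
    return dot(a, a) ** 0.5

def project(a, b):
    """Scalar projection of a onto b: a . b_hat, with b_hat = b / ||b||."""
    return dot(a, b) / magnitude(b)

a, b = (2, 3), (4, 0)
print(dot(a, b))      # 2*4 + 3*0 = 8
print(project(a, b))  # projecting onto the x-axis recovers a_x = 2.0
```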
2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature, usually text, that satisfies an information need or is relevant to a query, from a large collection of data" [6].
As the definition suggests, IR differs from an ordinary select query in that the information retrieved is unstructured and doesn't always exactly match the query.
Information Retrieval methods are used in search engines, in text classification (such as spam filtering), and, in our case, in a plagiarism engine.
1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in term weights space.
1) Term Frequency weighting (TF)
The simplest VSM weighting is plain Term Frequency; all other weighting functions are modifications of it.
In TF weighting we represent each text by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, let's assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d)
The dot product similarity between d1 and d2 is d1 · d2 = (1, 1) · (2, 0) = 1×2 + 1×0 = 2
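The play/ground example can be reproduced with a short Python sketch (the helper names and the toy vocabulary list are ours):

```python
from collections import Counter

TERMS = ["play", "ground"]  # the toy 2-term vocabulary from the example

def tf_vector(text):
    """Term Frequency vector: one dimension per term in the vocabulary."""
    counts = Counter(text.split())
    return [counts[t] for t in TERMS]

def dot(a, b):
    """Dot product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

d1, d2, d3 = tf_vector("play ground"), tf_vector("play play"), tf_vector("ground")
print(d1, d2, d3)   # [1, 1] [2, 0] [0, 1]
print(dot(d1, d2))  # 1*2 + 1*0 = 2, as computed above
```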
2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document Frequency df(w) is the number of documents that contain the word w.
TF-IDF adds an Inverse Document Frequency factor to penalize common terms: they have a high probability [7] of appearing in any document, so they don't strongly indicate plagiarism, unlike rarer terms, which have lower probability and carry more information.
weight(w, d) = count(w, d) × (1 / df(w))
So for the above example:
df (play) = 2
df (ground) = 2
and in this case all the weights will be scaled by a half.
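Continuing with the same toy dataset, the TF-IDF weighting defined above can be checked with a small sketch (the helper names are ours):

```python
from collections import Counter

docs = ["play ground", "play play", "ground"]

def df(word):
    """Document frequency: number of documents containing the word."""
    return sum(1 for d in docs if word in d.split())

def tfidf_weight(word, doc):
    """weight(w, d) = count(w, d) * 1 / df(w), as defined above."""
    return Counter(doc.split())[word] / df(word)

print(df("play"), df("ground"))           # 2 2 -> every weight is halved
print(tfidf_weight("play", "play play"))  # 2 * 1/2 = 1.0
```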
2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ( ln[1 + ln[1 + count(w, d)]] / (1 − b + b·|d|/avdl) ) × log( (M + 1) / df(w) )
Where: weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d (i.e. its term frequency)
Figure 2.2 Geometric representation of documents (d1, d2, and d3 plotted as vectors on the "play" and "ground" axes)
b is a document length normalization parameter ∈ [0, 1]
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. the document frequency
The document length normalization term 1 − b + b·|d|/avdl linearly penalizes long documents whose length is larger than the average document length (avdl), and rewards short documents whose length is smaller than the average.
The parameter b controls the normalization: if b equals zero there is no normalization at all; if b equals 1 the normalization is linear with offset zero and slope 1.
The Inverse Document Frequency (IDF) term log((M + 1)/df(w)) penalizes common terms as explained above. The document frequency is normalized by the number of documents, since how common a term is depends not only on its document frequency but also on the size of the dataset: a term that appears in 10 documents out of 100 is much more common than a term that appears in 10 documents out of 1000. The logarithm smooths the IDF weighting, i.e. reduces the variation in weight when the document frequency varies a lot.
The term frequency (TF) term ln[1 + ln[1 + count(w, d)]] uses a double natural logarithm to achieve a sublinear transformation (i.e. smoothing the TF curve) and avoid over-scoring documents with heavily repeated words, since the first occurrence of a term should carry the highest weight.
Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity with any input query containing that term, even higher than a genuinely more similar document.
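The pivoted length normalization formula maps directly to code; below is an illustrative sketch (the default b = 0.5 is our placeholder, not the project's tuned value):

```python
import math

def pln_weight(count_wd, doc_len, avdl, df_w, M, b=0.5):
    """Pivoted length normalization weight of a word in a document:
    double-log TF, divided by the length normalizer, times IDF."""
    tf = math.log(1 + math.log(1 + count_wd))  # sublinear TF transformation
    norm = 1 - b + b * doc_len / avdl          # document length normalizer
    idf = math.log((M + 1) / df_w)             # penalizes common terms
    return tf / norm * idf

# a term absent from the document contributes nothing
print(pln_weight(0, 120, 100, 10, 1000))  # 0.0
# longer-than-average documents are penalized relative to shorter ones
print(pln_weight(3, 200, 100, 10, 1000) < pln_weight(3, 50, 100, 10, 1000))  # True
```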
2) Okapi BM25 [9]
BM stands for Best Match; the weights are defined as follows:
weight(w, d) = ( (k + 1)·count(w, d) / (count(w, d) + k·(1 − b + b·|d|/avdl)) ) × log( (M + 1) / df(w) )
Where all symbols are defined as in Pivoted Length Normalization and k ∈ [0, ∞).
It is similar to Pivoted Length Normalization, but instead of nested natural logarithms it uses division and the parameter k to achieve the sublinear transformation.
It was originally developed from the probabilistic model; however, it's very similar to the Vector Space Model.
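The BM25 weighting can be sketched the same way (k = 1.2 and b = 0.75 are common textbook defaults, not necessarily the project's values):

```python
import math

def bm25_weight(count_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    """Okapi BM25: saturating TF (bounded above by k + 1) times the same IDF."""
    tf = (k + 1) * count_wd / (count_wd + k * (1 - b + b * doc_len / avdl))
    idf = math.log((M + 1) / df_w)
    return tf * idf

# TF saturates: repeating a term 100 times cannot exceed the (k + 1) bound
w1 = bm25_weight(1, 100, 100, 10, 1000)
w100 = bm25_weight(100, 100, 100, 10, 1000)
print(w1 < w100 < (1.2 + 1) * math.log(1001 / 10))  # True
```

Unlike the double-log TF above, this saturation has a hard ceiling, which is the practical difference between the two functions.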
3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.
1) Dot product similarity
similarity(q, d) = ∑_{w ∈ q∩d} count(w, q) × weight(w, d)
Where: similarity(q, d) is the similarity score between document d and input query q.
The score is simply the sum, over each word that appears in both the document and the query, of the product of the word's term weights.
It's very popular because it's general and can be used with any fancy term weighting.
2) Cosine similarity
similarity(q, d) = ( ∑_{w ∈ q∩d} count(w, q) × weight(w, d) ) / ( |q| × |d| )
Where: |q| is the magnitude of the query vector and |d| is the magnitude of the document vector.
It's basically the dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between the vectors.
This function has built-in document (and query) length normalization.
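Both similarity functions can be sketched over sparse dictionaries mapping words to values, which mirrors iterating only over w ∈ q ∩ d (the names are ours):

```python
def dot_similarity(q_counts, d_weights):
    """Sum of count(w, q) * weight(w, d) over words present in both."""
    return sum(c * d_weights[w] for w, c in q_counts.items() if w in d_weights)

def cosine_similarity(q_counts, d_weights):
    """Dot product divided by the product of the two vector magnitudes."""
    qmag = sum(c * c for c in q_counts.values()) ** 0.5
    dmag = sum(v * v for v in d_weights.values()) ** 0.5
    return dot_similarity(q_counts, d_weights) / (qmag * dmag)

q = {"play": 1, "ground": 1}  # query 'play ground'
d = {"play": 2.0}             # document 'play play' under TF weighting
print(dot_similarity(q, d))   # 1 * 2.0 = 2.0
```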
4. Clustering
Clustering is an unsupervised1 machine learning method and a powerful data mining technique; it is the process of grouping similar objects together.
This technique can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters.
This is achieved by clustering similar paragraphs together and measuring the similarity between each new query and the centroids of the clusters, then measuring the similarity between the query and the paragraphs of one cluster only; this is much faster than measuring the similarity of the query against every paragraph in the dataset.
1 Because the data are not labeled.
We use the K-means algorithm (centroid-based clustering), an iterative improvement algorithm that groups the dataset into a predefined number of clusters K.
It goes like this [10]:
1 Select K random points from the dataset as the initial guess of the centroids (cluster centers).
2 Assign each record in the dataset to the closest centroid based on a given similarity function.
3 Move each centroid closer to the points assigned to it by calculating the mean value of the points in its cluster.
4 If a local optimum is reached (i.e. the centroids stopped moving), stop; else repeat from step 2.
Since K-means is sensitive to the initial choice of centroids and can get stuck in local optima, we repeat it with different initial centroids and keep the best result (the one with the least mean squared error).
Time complexity is O(tknd), where t is the number of iterations until convergence, k is the number of clusters, n is the number of records in the data set, and d is the number of dimensions.
Since usually tk ≪ n, the algorithm is considered a linear-time algorithm.
Note: this is the typical time complexity when applying the algorithm to a dataset with a negligible number of dimensions. In our case we have a very large number of dimensions (all terms in the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it (not all dimensions), so the time complexity differs, as we discuss in detail in the implementation section.
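The four steps above can be sketched in Python (an illustrative sketch on small dense points; the actual engine works on sparse term vectors, and the optional init parameter is added here only to make the run deterministic):

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Basic K-means; returns the final centroids and the cluster index
    assigned to each point."""
    rng = random.Random(seed)
    # Step 1: initial guess of the centroids (random points from the data set).
    centroids = [list(p) for p in (init or rng.sample(points, k))]
    assign = None
    for _ in range(max_iters):
        # Step 2: assign each record to the closest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop once the assignments (hence the centroids) stop changing.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign
```

Repeating this with different seeds and keeping the run with the least mean squared error mitigates the sensitivity to the initial centroids noted above.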
2.3 Regular Expression
A regular expression is a sequence of characters and symbols written to detect patterns, where each symbol in the sequence has a meaning, e.g. + means one or more, * means zero or more, and - means a range (A-Z: all capital letters from A to Z).
For example, a birth date could be written as May 15th, 1993, so the pattern of the date is [Month] [day][st, nd, rd, or th], [year], and a regular expression for it is [A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}.
First, the month is one of 12 fixed words; they could be written explicitly, or simply matched as a sequence of 3 to 9 letters, since the shortest month is May (3 letters) and the longest is September (9 letters). Then comes a space; then the day, a number of 1 or 2 digits, followed by one of the four suffixes (st, nd, rd, th) and a comma; then a space; then the year, a number of 4 digits.
This is not the only format for dates, so a date expression could be considerably more complicated than this.
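The expression above can be tried directly in Python's re module (a small illustration of the pattern just described):

```python
import re

# [Month] [day][st|nd|rd|th], [year] -- e.g. "May 15th, 1993"
date_re = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}")

def find_date(text):
    """Return the first date matching the pattern, or None."""
    m = date_re.search(text)
    return m.group() if m else None
```

search() scans the whole string, so the date is found wherever it occurs in a sentence.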
2.4 NLTK Toolkit
NLTK is a Python module for Natural Language Processing (NLP) used for text processing. It has algorithms for sentence and word tokenization, contains a large number of corpora, includes its own WordNet corpus, and is used for Part-of-Speech (POS) tagging, stemming, and lemmatization.
2.5 Node.js
Node.js is a runtime environment built on Chrome's V8 JavaScript engine for developing server-side web applications; it uses an event-driven, non-blocking I/O model.
2.6 Express.js
Express.js is the standard web application server framework for Node.js. It is a very thin layer, with many features available as plugins.
2.7 Socket.IO
Socket.IO is a library for real-time web applications. It enables bi-directional communication between the web client and the server, primarily using the WebSocket protocol with polling as a fallback option.
2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
In this part we build the database: many scientific papers are downloaded by the Crawler software (Extract), then passed to the Parser software, where all the paper information and text content are extracted as paragraphs (Transform) and inserted into the database (Load).
3.2 Plagiarism Engine
The Plagiarism Engine preprocesses a huge dataset of academic English papers and analyzes it using Natural Language Processing techniques to extract useful information, then measures the similarity between an input query and the dataset using Information Retrieval methods to detect both identical and paraphrased plagiarism in a fast and intelligent way.
[Diagram residue omitted. Figure 3.1 shows the high-level flow: academic papers pass through the ETL/Parsing stage, the resulting parsed papers feed the plagiarism engine, and results for an input query are communicated back. Figure 3.2 details the engine: parsed papers go through NLP and vectorization, semantic analysis, and VSM representation; a vectorized input query is matched against a clustering centroid to find its cluster, similarity is calculated using the lexical database (WordNet/NLTK) and the extracted features, and the potentially plagiarized paragraphs are returned.]
Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization
The text processing part works on these data to extract the most important words from the paragraphs and to ignore the common words; then k-gram terms are generated from these words, and each bag of words is linked to its corresponding paragraph in the database.
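These two steps can be sketched as follows (an illustrative sketch; the stop-word list here is a tiny assumed subset of the common words that are actually ignored):

```python
# A tiny assumed subset of common (stop) words to ignore.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def important_words(text):
    """Keep only the important words of a paragraph, lowercased."""
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

def k_grams(words, k):
    """Generate all contiguous k-word terms from a token list."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
```

Running both over a paragraph yields the bag of k-gram terms that gets linked to that paragraph.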
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
Input: simple term-frequency vector representation stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length) stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which that term appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than simple term frequency: a TF-IDF normalized vector representation using both pivoted length normalization and BM25, as discussed later.
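As an illustration, one common formulation of the BM25 term weight looks like this (the report's exact variant and parameter values may differ):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight of one term in one paragraph (a common formulation).
    tf: term frequency in the paragraph; df: document frequency of the term;
    n_docs: number of paragraphs; doc_len/avg_len: paragraph length vs average."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating TF component with length normalization controlled by b.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

The weight grows (sub-linearly) with term frequency, shrinks for terms common across the dataset, and is discounted for paragraphs longer than average.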
3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: similarity between the input paragraph and the relevant paragraphs in the dataset, stored in the similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and detect possible plagiarism if the similarity measure between the input paragraph and any paragraph from the dataset exceeds a predetermined threshold.
We implemented both the Okapi BM25 and pivoted length normalization similarity functions.
The system first measures similarity on 5-gram vectors, then 4-grams, and so on; whenever it finds a high similarity at one k-gram level, it limits its scope to the paragraphs with high similarity at the preceding k-gram level, to increase performance.
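This narrowing strategy can be sketched as follows (an illustrative sketch; the nested-dict layout of the vectors and the similarity callback are assumptions, not the engine's actual data structures):

```python
def cascade_check(query_vectors, dataset_vectors, similarity, threshold):
    """query_vectors[k] is the query's k-gram vector; dataset_vectors[k][pid]
    is paragraph pid's k-gram vector. Start at 5-grams and narrow the
    candidate set whenever high-similarity paragraphs are found."""
    candidates = set(dataset_vectors[1])  # all paragraph ids
    scores_by_k = {}
    for k in (5, 4, 3, 2, 1):
        scores = {pid: similarity(query_vectors[k], dataset_vectors[k][pid])
                  for pid in candidates if pid in dataset_vectors[k]}
        high = {pid for pid, s in scores.items() if s >= threshold}
        if high:
            candidates = high  # limit the scope at the following k-gram levels
        scores_by_k[k] = scores
    return scores_by_k
```

Once a high-similarity set is found at the 5-gram level, all lower k-gram levels score only those candidates instead of the whole dataset.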
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights stored in
paragraphVector table.
Output: the cluster of each paragraph stored in clusterId column in paragraph table, and the
centroids of the clusters stored in centroid table.
We clustered similar paragraphs together so that similarity is measured only against similar paragraphs, which speeds up the similarity checking step.
An input paragraph is first measured against the centroids to determine its cluster; then the regular similarity measure is applied against all dataset paragraphs in that same cluster.
3.2.5 Communicating Results
This is the interface where the user can check a document for plagiarism by inserting it into a text box. The document is parsed the same way as by the system's Parser: it is split into paragraphs, which pass through the text processing part and are compared against the dataset in the database. The results appear as a plagiarism percentage for the document, highlighting the plagiarized parts alongside the matching documents.
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
Figure 4.1 Overview for the Crawler and Parser
4.1.1 The Crawler
The Crawler is software that downloads the scientific papers from the web into a folder for each publisher, where the Parser will start working on them.
4.1.2 The Parser
The Parser is software that takes a PDF document (a scientific paper) as input, extracts the paper information and content, and inserts them into the system's database.
4.1.3 The Data Extracted from the Paper
a. Paper Information
1. Paper Title
2. Paper authors
3. Journal and its ISSN
4. Volume, Issue, Paper Date and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting Page and Ending Page
b. Abstract and Keywords
c. Table of Contents
d. Figure and Table captions
e. Paper text content (as Paragraphs)
4.1.4 The Parser Implementation
Figure 4.2 UML of the Parser Application
4.1.5 How it works
The Parser consists of a parent class (Parser) and child classes (IEEE, Springer, APEM, and Science Direct). The parent class has the general functions that parse the PDF document and extract the table of contents, the figure and table captions, and the text content of the paper; the child classes have specific functions and Regular Expressions for each publisher's structure to extract the paper information (Title, Authors, DOI ...).
[Flow chart: Start → Check for new papers → Choose the suitable parser → Parse the papers → on success, move to the Processed directory; on failure, move to the Unprocessed directory.]
Figure 4.3 the Flow Chart of the Parser
Each publisher has its own folder where its scientific papers are downloaded by the Crawler; the Parser monitors each folder for new documents and uses the suitable child class to parse each new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the Processed directory; otherwise, the file is moved to the Unprocessed directory and the error is logged, so the developer can check whether it is a new structure that should be supported in the Parser, or whether something went wrong that has to be fixed.
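The monitor-parse-move cycle just described can be sketched as a single polling pass (an illustrative sketch; the folder names and the parse callback are assumptions, not the project's actual Java implementation):

```python
import os
import shutil

PUBLISHER_DIRS = ["ieee", "springer", "apem", "sciencedirect"]  # illustrative names

def scan_once(base, parse):
    """One polling pass over the publisher folders: parse every PDF found,
    then move it to processed/ on success or unprocessed/ on failure."""
    for pub in PUBLISHER_DIRS:
        folder = os.path.join(base, pub)
        for name in sorted(os.listdir(folder)):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            try:
                parse(pub, path)  # dispatch to the suitable per-publisher parser
                dest = os.path.join(base, "processed", name)
            except Exception:
                # in the real system the error would also be logged here
                dest = os.path.join(base, "unprocessed", name)
            shutil.move(path, dest)
```

Calling scan_once in a loop (with a sleep between passes) gives the continuous folder monitoring described above.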
4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks Function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as blocks of strings. This function loops through the file page by page; for each page it extracts the content into an ArrayList<String> object called page and adds this page, keyed by its page number, to a HashMap<Integer, ArrayList<String>> object called pages.
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
    public void parsePaper(String publisher) throws Exception {
        extractBlocks();
        try {
            parseFirstPage();
        } catch (Exception e) {
            throw new Exception("Error Not Processed");
        }
        parserOtherPages();
        paper.enhaceParagraphs();
        try {
            paper.insertPaperInDatabase(publisher);
        } catch (SQLException e) {
            throw new Exception("Error Database");
        }
    }
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parserFirstPage Function)
Each Publisher accepts his scientific paper in a specific structure which differs from publisher to
publisher, and the difference lies in the first page where the paper information are written, so there has to
be parser for each publisher designed to support its structure, so this function which is an abstract
function in the parent class is implemented in each child class for each publisher.
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Paper of a Springer Paper
These are the different structures of Science Direct and Springer papers, shown to illustrate the difference, which lies in the organization and structure of the information, e.g.:
1. This is a header of a Springer Paper
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE Paper
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
4.1.6.3 Extracting the Paper text content (parserOtherPages Function)
This function uses the general Parser functions: it loops over all the pages and the blocks of string in each page and extracts the data from the blocks, which could be table of contents entries, figure and table captions, lists, or paragraphs.
Each block passes through several stages:
1) First, test whether the block is a figure caption.
2) Then test whether it is a table caption.
3) Then test whether it has a header (table of contents entry).
4) Then test whether the block has lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block is returned to the function, which continues with the remaining stages.
    void parserOtherPages() {
        for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
            Integer pageNumber = entrySet.getKey();
            ArrayList<String> page = entrySet.getValue();
            Iterator<String> it = page.iterator();
            while (it.hasNext()) {
                String block = it.next().trim();
                boolean isFigureCaption = false, isTableCaption = false;
                boolean isList = false, isEmptyParagraph = false;
                isFigureCaption = parseFigureCaption(block, pageNumber);
                isTableCaption = parseTableCaption(block, pageNumber);
                block = parseHeaders(block);
                isList = parseLists(block, pageNumber);
                isEmptyParagraph = "".equals(block);
                if (!isFigureCaption && !isTableCaption
                        && !isEmptyParagraph && !isList)
                    parseParagraph(block, pageNumber);
            }
        }
    }
Figure 4.8 The function of parseOtherPages
5) Finally, if the block is not one of the previous types (not a figure or table caption, has no list, or had its header extracted and the rest returned), then it is a paragraph and is extracted as such.
4.1.6.4 Enhancing the Paragraphs (enhanceParagraph Function)
As shown in Figure 4.9, some paragraphs are not in good shape when extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there are also extra spaces between words that have to be removed.
2) The paragraph is extracted as lines (each with a newline character at the end) rather than one continuous string, so it has to be refined.
3) Some of the paper info is in uppercase, so it is capitalized.
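The first two enhancement steps can be sketched with regular expressions (an illustrative sketch, not the project's Java implementation; step 3, capitalization, is omitted here):

```python
import re

def enhance_paragraph(raw):
    """Rejoin words hyphenated across line breaks, flatten the lines into
    one continuous string, and collapse repeated spaces."""
    text = re.sub(r"-\n", "", raw)        # 1) rejoin hyphen-split words
    text = text.replace("\n", " ")        # 2) one continuous string
    return re.sub(r" +", " ", text).strip()  # 1) remove the extra spaces
```

This mirrors what the Paper class's enhancement function does before the paragraphs reach the NLP stage.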
Figure 4.9 Block of String before Enhancing
Page Number: 1 The Content: However, as the number of metal layers increases and interconnect dimensions decrease, the parasitic capacitance increases associated with fill metal have become more significant, which can lead to timing and signal integrity problems.
Page Number: 1 The Content: Previous research has primarily focused on two important aspects of fill metal realization: 1) the development of fill metal generation methods – which we discuss further in Section II and 2) the modeling and analysis of capacitance increases due to fill metal – Several studies have examined the parasitic capacitance associated with fill metal for small scale interconnect test structures in order to provide general guidelines on fill metal geometry selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after enhancing
4.1.6.5 Finally inserting all these data in the Database
When the Parser starts, an object of type Paper is created, and every piece of information and data extracted from the scientific paper is assigned to its attribute in this object. At the end of parsing, all of this information is inserted into the database by calling this function, which proceeds as follows:
1) Retrieve the journal ID from the database by name or ISSN; if the journal is already there, its ID is returned. Otherwise it is considered a new journal, inserted, and its new ID is returned.
2) Test whether the paper was already inserted in the database before; if it is found, the Parser throws an exception stating that it was already inserted. If it is a new paper, it is inserted with its information (title, volume, issue ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents, figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
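Step 1's look-up-or-insert logic can be sketched with SQLite (an illustrative sketch only: the table and column names here are assumptions, not the project's actual schema):

```python
import sqlite3

def make_db():
    # Minimal in-memory schema for the illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT, issn TEXT)")
    return conn

def journal_id(conn, name, issn):
    """Return the journal's id, inserting a new row first if neither the
    name nor the ISSN is already present."""
    cur = conn.execute("SELECT id FROM journal WHERE name = ? OR issn = ?",
                       (name, issn))
    row = cur.fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO journal (name, issn) VALUES (?, ?)",
                       (name, issn))
    return cur.lastrowid
```

The same look-up-or-insert pattern applies to step 2 for the paper itself, keyed by title or DOI.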
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and data of the paper, the function enhaceParagraphs(), which is responsible for improving the text and making each paragraph ready for the next processing step in the Natural Language Processing part, and the function insertPaperInDatabase(), which tests whether the paper was already inserted in the database before and, if it is a new paper, inserts it with all of its data: the paragraphs, figure and table captions, and the paper information.

    public class Paper {
        public String title = "";
        public int volume = -1, issue = -1;
        public int startingPage = -1, endingPage = -1;
        public String journal = "", ISSN = "";
        public String DOI = "";
        public ArrayList<String> headers = new ArrayList<>();
        public ArrayList<String> authors = new ArrayList<>();
        public ArrayList<String> keywords = new ArrayList<>();
        public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
        public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
        public ArrayList<Paragraph> paragraphs = new ArrayList<>();
        public String date = "";
        public String dateReceived = "NULL", dateRevised = "NULL";
        public String dateAccepted = "NULL", dateOnlinePublishing = "NULL";
        public void enhaceParagraphs()
        public void insertPaperInDatabase(String publisher)
    }

Figure 4.11 the Paper Structure
Note that when the Parser finds a new PDF document in the publisher folders, it creates a new object of type Paper; while parsing the document, each piece of information extracted is assigned to its attribute in this object. At the end of the parsing process, this Paper object executes its two member functions: enhaceParagraphs() to refine the paragraph content, then insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of the extracted paragraph and the number of the page from which it was extracted.
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.14, the page is divided into blocks of string, and each block holds one or more pieces of the paper information. This function in the parser is implemented specifically per publisher, so the function of the IEEE parser won't work for the Springer parser and so on; it is implemented to parse only the first page, extract the paper information from it, and assign it to the attributes of the Paper object.
Even within a single publisher there are differences in the location of the paper information on the page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of paper header.
Figure 4.13 Different forms for an IEEE Top Header
    public class Paragraph {
        public int pageNum;
        public String content;
    }

Figure 4.12 the Paragraph Structure
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the first block in the paper. The block is sent to a parsePaperHeader() function, which has a different Regular Expression for every form that the parser supports, as shown in Figure 4.15.
When the function receives the block, the block passes through the different expressions; if it matches one of the supported formats, the function starts extracting the information, otherwise it throws an exception stating that this header format isn't supported and the developer has to add support for it.
As shown in Figure 4.13, the header may contain information such as the starting page number (which may be at the start or the end of the line), the journal title, the volume number, the issue number, and the date. Each piece may or may not be present in the header, so according to the format of the header the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(), parseStartingPage()) are called to extract this information.
    // ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
    String header1_Exp = "^([A-Z]+ ET AL\\. " + volume_Exp + ", " + issue_Exp
            + "[ ]*\\/[ ]*" + paperDate_Exp + "[ ]*\\/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";
    // ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
    String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "\\/"
            + volume_Exp + ", " + issue_Exp + "\\/" + paperDate_Exp + " [A-Z]+ ET AL\\.)$";
    // ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
    // ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
    // ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
    // ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
    // ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
    String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* " + volume_Exp
            + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";
    // ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
    String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";
    // ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
    String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";
    // ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
    String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";
    // ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
    String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";
    // ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
    String header8_Exp = "^(" + journalTitle_Exp + "[ ]*\\/[ ]*" + dateExp + ")$";
Figure 4.15 The supported Regex of the IEEE Header formats
4.1.9.2 Extracting the Volume from the Header
The IEEE Parser uses volume_Exp = VOL(\\.)* [A-Z\\-]*[0-9]+ to detect the volume part of the header and passes it to the parseVolume() function, which then uses another expression to extract the number from that part. For example, for a header form containing (VOL. 18), the parser detects the part (VOL. 18), then extracts the number from this result (18) and converts it from String to int to be assigned to the volume attribute in the Paper object.
4.1.9.3 Extracting the Issue number from the Header
The IEEE Parser uses issue_Exp = NO(\\.|,) [0-9]+ to detect the issue part of the header and passes it to the parseIssue() function, which then uses another expression to extract the number from that part. For the same example presented in the volume section, the parser detects the part (NO. 3), then extracts the number from this result (3) and converts it from String to int to be assigned to the issue attribute in the Paper object.
4.1.9.4 Extracting the PaperDate from the Header
    @Override
    void parseVolume(String volume) {
        Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
        if (matcher.find()) {
            Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
            while (numMatcher.find())
                paper.volume = Integer.parseInt(numMatcher.group());
        }
    }
Figure 4.16 The Function of extracting the Volume Number
    @Override
    void parseIssue(String issue) {
        Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
        if (matcher.find()) {
            Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
            if (numMatcher.find())
                paper.issue = Integer.parseInt(numMatcher.group());
        }
    }
Figure 4.17 The Function of Extracting the Issue Number
    @Override
    void parsePaperDate(String date) {
        Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
        if (matcher.find())
            paper.date = matcher.group().replaceAll("^\\/", "").trim();
    }

Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE Parser uses date_Exp = [A-Z]{0,9}[\\/\\- ]*[A-Z]{3,9}(\\.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part from the header and assigns it to the date attribute in the Paper object. The date can be written in different formats (2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all of these forms.
Note that after each piece of information is extracted, it is removed from the header block string; so after removing the volume, issue, and date, the information left in the header is the journal title and the starting page, and the starting page can be at the start or the end of the header.
4.1.9.5 Extracting the Start and End Page numbers from the Header
Now we know that the header contains only the journal title and the start page number, so the IEEE Parser uses startPage_Exp = ^[0-9]+|[0-9]+$, an expression that extracts a number lying at the start or the end of the checked string; wherever the start page number lies in the header, it is detected and extracted, and like the other information it is assigned to its attribute in the Paper object.
The end page is very simple: the IEEE Parser adds the number of pages of the paper to the start page number and assigns the result to the end page of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Finally, for the journal title, the IEEE Parser uses journalTitle_Exp = [A-Z \\:\\-\\—\\/\\)\\(\\.\\,]+ to extract the journal title part from the header, then passes the title to the parseJournalTitle() function.
The title may contain extra words that aren't needed, such as ([author name] et al.), or it may end with separating characters (a comma or a forward slash), so these must be removed first; the rest is then assigned to the journal attribute in the Paper object.
    @Override
    void parseStartingPage(String startingPage) {
        Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
        if (matcher.find())
            paper.startingPage = Integer.parseInt(matcher.group().trim());
        parseEndingPage(startingPage);
    }

    @Override
    void parseEndingPage(String endingPage) {
        paper.endingPage = paper.startingPage + pages.size();
    }
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the first page’s blocks
    @Override
    void parseJournal(String journal) {
        journal = journal.replaceAll("( \\/|, )", "").trim();
        Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
        if (matcher.find()) {
            String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
            if (journalName.charAt(journalName.length() - 1) == '/')
                paper.journal = journalName.substring(0, journalName.length() - 1);
            else
                paper.journal = journalName;
        }
    }
Figure 4.20 The Function of Extracting the Journal Title
    Iterator<String> it = pageOne.iterator();
    while (it.hasNext()) {
        String mainBlock = it.next();
        String block = mainBlock.replaceAll("[ ]+", " ");
        if (Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find()) {
            parseDOI(block);
            blockList.add(mainBlock);
        }
        if (ISSN_Pattern.matcher(block).find()) {
            parseISSN(block);
            blockList.add(mainBlock);
        }
        if (Pattern.compile("Index Terms").matcher(block).find()) {
            parseKeywords(block);
            blockList.add(mainBlock);
        }
        if (Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find()) {
            parseAbstract(block);
            blockList.add(mainBlock);
        }
        if (date_Pattern.matcher(block.toUpperCase()).find() && !datesFound) {
            parseDates(block);
            if (!paper.dateAccepted.equals("NULL")
                    || !paper.dateOnlinePublishing.equals("NULL")
                    || !paper.dateReceived.equals("NULL")
                    || !paper.dateRevised.equals("NULL")) {
                blockList.add(mainBlock);
                datesFound = true;
            }
        }
    }
    removeUnimportantBlocks();
    for (String blockList1 : blockList)
        pageOne.remove(blockList1);
Figure 4.21 Parsing the rest of blocks in the first Page
After parsing the header block and extracting all the information from it, the IEEE Parser continues to parse the other blocks, searching for the rest of the information. Due to the differences in structure, the location of this information can differ from structure to structure, so the best way to extract it is to loop through all the first-page blocks; using the Regular Expressions of this information (DOI, ISSN ...), the Parser can locate it, and it also tries to detect some other blocks such as the abstract, keywords, nomenclature, and the paper dates (when it was received, accepted, revised, and published online).
In every loop iteration, if a piece of information is detected, the block is passed to the suitable function to extract it; once the information is extracted the block is no longer needed, so the Parser adds it to a TreeSet<String> (blockList), and after finishing all the iterations over the blocks of page one, these blocks are removed from the page's blocks.
There may also be other blocks that don't hold important information, such as the publisher's website or the publisher's logo with its name underneath; these all have to be detected and removed as well, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to contain the DOI (Digital Object Identifier) or PII (Publisher Item Identifier) using IEEE_DOI_Exp = [0-9]{2}\\.[0-9]{4}\\/[A-Z\\-]+\\.[0-9]+\\.[0-9]+ and PII_Exp = [0-9]{4}\\-[0-9xX]{4}\\([0-9]{2}\\)[0-9]{5}\\-(x|X|[0-9]), the IEEE Parser passes the block to the parseDOI() function, and the DOI or PII is extracted. If it is a DOI, it is concatenated with the DOI resolver domain (http://dx.doi.org/); if it is a PII, it is concatenated with (http://dx.doi.org/10.1109/S); the result is assigned to the DOI attribute in the Paper object.
4.1.9.9 Extracting the ISSN
    @Override
    void parseDOI(String DOI) {
        Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
        while (matcher.find())
            paper.DOI = "http://dx.doi.org/" + matcher.group();
        matcher = Pattern.compile(PII_Exp).matcher(DOI);
        while (matcher.find())
            paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
    }
Figure 4.22 The Function of Extracting the DOI and PII
    void parseISSN(String ISSN) {
        Matcher matcher = ISSN_Pattern.matcher(ISSN);
        while (matcher.find())
            paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
    }
Figure 4.23 The Function of Extracting the ISSN
Similarly, if a block is detected to contain the ISSN using ISSN_Exp = [0-9]{4}(\\–|\\-|\\‐| )[0-9]{3}[0-9xX], the IEEE Parser passes the block to the parseISSN() function, which extracts the ISSN and assigns it to the ISSN attribute in the Paper object.
4.1.9.10 Extracting the Dates of the Paper:
If a block through the iteration is detected to have dates using the date_Exp = ([0-9]{1,2}(\\-| )[A-Z]{3,9}[\\.]*(\\-| )[0-9]{4})|[A-Z]{3,9}[\\.]*( [0-9]{1,2},)* [0-9]{4}) so The IEEE Parser will pass the block to the function parseDates(), this
block have dates related to the Paper such as when it’s received in the publisher and when it’s revised,
void parseDates(String dates){ dates = dates.replaceAll(separatedWord_Fixing, "")
.replaceAll(newLine_Removal, " ").toUpperCase(); Matcher matcher = receivedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateReceived = dateMatcher.group().trim(); } matcher = revisedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateRevised = dateMatcher.group().trim(); } matcher = acceptedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateAccepted = dateMatcher.group().trim(); } matcher = publishingDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateOnlinePublishing = dateMatcher.group().trim(); } }
Figure 4.24 The Function of extracting the paper Dates
For each of these dates there is a Regular Expression to detect it. Note that not all papers include these dates, but most of them do, so whenever they are present they are extracted and assigned to the corresponding attributes in the paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the Regular Expression for the date itself has to be fairly involved to detect all of these formats.
Also, the word before the date can be written in different forms: (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and may be lowercase or capitalized. The Regular Expressions are therefore constructed to detect all forms of those words, and to handle character case we transform the string to uppercase before comparing.
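To make this concrete, here is a Python sketch (an illustration, not the project's exact code) of one date regex that covers all four formats after uppercasing:

```python
import re

# One pattern covering "30 OCTOBER 2007", "17 AUG. 2007",
# "28-JULY-2009", and "OCTOBER 6, 2006" (input uppercased first).
date_exp = re.compile(
    r"([0-9]{1,2}(-| )[A-Z]{3,9}\.?(-| )[0-9]{4})"
    r"|([A-Z]{3,9}\.?( [0-9]{1,2},)? [0-9]{4})"
)

samples = ["30 October 2007", "17 Aug. 2007", "28-July-2009", "October 6, 2006"]
for s in samples:
    assert date_exp.search(s.upper()) is not None
```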
4.1.9.11 Extracting the Keywords
If the Keywords block is detected, the IEEE Parser passes it to the parseKeywords() function. The keywords may appear inside the Abstract block, so the first step is to crop the Keywords part if it is attached to the abstract. The block may also be split across two lines, or contain a word split across two lines with a hyphen, so these line breaks have to be removed and the words rejoined. Finally, some papers separate the keywords with a comma (,) and others with a semicolon (;); the split keywords are then added to the keywords list in the paper object.
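The cleanup steps above can be sketched in Python (split_keywords is a hypothetical helper, not the project's Java function):

```python
import re

# Hypothetical sketch of the keyword-splitting logic: crop the label,
# rejoin hyphenated line breaks, then split on comma or semicolon.
def split_keywords(block):
    block = re.sub(r"(Index Terms|Keywords)[\s:-]*", "", block)  # crop the label
    block = block.replace("-\r\n", "")    # rejoin words hyphenated across lines
    block = block.replace("\r\n", " ")    # join remaining line breaks
    return [w.strip() for w in re.split(r"[,;]", block) if w.strip()]

assert split_keywords("Keywords: plagiarism, detec-\r\ntion; NLP") == \
    ["plagiarism", "detection", "NLP"]
```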
4.1.9.12 Extracting the Abstract
For the abstract block: during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, that block is passed to the parseAbstract() function and is treated as the first paragraph of the page, under the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the Nomenclature, so these have to be cropped first and parsed separately.
@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\\r\\n|Index Terms|\\-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
Now, after extracting all the information of the paper and removing those blocks, the next blocks contain the Title of the paper, then the Authors, then the Introduction.
First, the title is passed to the parseTitle() procedure; if it is split over more than one line, the newline characters are removed and the result is assigned to the title attribute.
Next, the authors are passed to the parseAuthors() procedure, where they are separated by comma, semicolon, or some other separator according to the publisher's style, and each author is added to the authors list in the paper object.
@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(\\-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract
void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "").replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replaceAll(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if(!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information has been extracted, and the remaining blocks on the first page are the introduction and the rest of the page content. Once the parseFirstPage() procedure finishes executing, the parseOtherPages() procedure starts. As demonstrated before, it loops over all the blocks of strings in the pages and extracts all the possible data from them: headers (for the table of contents), figure and table captions, and lists; anything else is treated as a paragraph. All of these procedures are part of the parent Parser.
4.1.10.1 Extracting the Headers
This procedure is quite general and works for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction). Headers may be numbered with Roman numerals or Arabic numbers, the number may or may not be followed by a dot, and the header text may be uppercase or capitalized, so the function first detects which style is used.
For the level 2 headers there are also different styles, and another function detects them. It supports three types, for example (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header): the header may be listed alphabetically, as number-dot-number, or as number-dot-number-dot, followed by the title of the header.
Once the header styles are identified, the headers' Regular Expressions are created and tested on every passed block to detect headers. These Regexes are not constant; they change as parsing progresses. For example, once (1. Introduction) is detected, the next expected header becomes (2. Another Header), i.e. the number is incremented.
Note also that a header always comes at the start of a block of string, and the rest of the string is a paragraph (or the header may already have been extracted as its own block). The function therefore extracts only the header and returns the rest of the block so it can be parsed as a paragraph.
There are also headers without numbers, such as Abstract, References, Acknowledgements, Appendix, and others; those headers are detected separately with a dedicated Regex.
This procedure can also detect level 3 and level 4 headers, whose style is inferred from the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header). All detected headers are added to the headers list in the paper object.
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent Parser uses figure_Exp = ^(Fig\\.|Figure)[ ]+[0-9]+(\\.|\\:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. They may appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II); the numbering may be Arabic or Roman. After extraction, the caption is added to the list of captions as a paragraph with its page number.
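Transcribed into Python for illustration, the two patterns accept the caption styles listed above:

```python
import re

# The two caption patterns from the text, transcribed to Python.
figure_exp = re.compile(r"^(Fig\.|Figure)[ ]+[0-9]+(\.|\:)")
table_exp = re.compile(r"^(TABLE|Table)[ ]+([0-9]+|[IVX]+)")

assert figure_exp.match("Figure 1. System architecture")
assert figure_exp.match("Fig. 3: Experimental results")
assert table_exp.match("TABLE 1")
assert table_exp.match("Table II")
assert figure_exp.match("The figure shows results") is None  # must start the block
```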
private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\\r\\n| )*");

void detect_Header1Mode(String block){
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block){
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + "\\.1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + "\\.1\\. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Detecting the Style of the Headers
4.1.10.3 Extracting the Lists
In this procedure, the parent Parser detects lists in the text content and separates each list as a paragraph of its own. It supports numeric, bulleted (dot), and dashed lists. The block may also have a paragraph before the list and a paragraph after it, so these have to be separated; each paragraph (if found) and each list is added as a paragraph, with its page number, to the paragraph list in the paper object.
private boolean parseFigureCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while(matcher.find()){
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                                  .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while(matcher.find()){
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                                 .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
4.1.10.4 Extracting the Paragraph
Pattern newList_Pattern = Pattern.compile("\\r\\n[ ]*([0-9]|\\-|\\.|\\·|\\•)");
Pattern numericList1_Pattern = Pattern.compile("^[0-9](\\.|\\))[ ]+[A-Z]");
Pattern numericList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*[0-9](\\.|\\))[ ]+[A-Z]");
Pattern dotList1_Pattern = Pattern.compile("^(\\.|\\·|\\•)[ ]+[A-Za-z]");
Pattern dotList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*(\\.|\\·|\\•)[ ]+");
Pattern dashList1_Pattern = Pattern.compile("^\\-[ ]+[A-Za-z]+");
Pattern dashList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*\\-[ ]+[A-Z]");

private boolean parseLists(String block, int pageNumber){
    Matcher orderList1_Matcher = numericList1_Pattern.matcher(block);
    Matcher orderList2_Matcher = numericList2_Pattern.matcher(block);
    if(orderList1_Matcher.find() || orderList2_Matcher.find())
        return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern);
    Matcher dotList1_Matcher = dotList1_Pattern.matcher(block);
    Matcher dotList2_Matcher = dotList2_Pattern.matcher(block);
    if(dotList1_Matcher.find() || dotList2_Matcher.find())
        return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern);
    Matcher dashList1_Matcher = dashList1_Pattern.matcher(block);
    Matcher dashList2_Matcher = dashList2_Pattern.matcher(block);
    if(dashList1_Matcher.find() || dashList2_Matcher.find())
        return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern);
    return false;
}
Figure 4.30 The Function of Separating the Lists
void parseParagraph(String block, int pageNumber){
    Matcher matcher = newParagraph_Pattern.matcher(block);
    String lastHeader = "", content;
    int startIndex = 0, endIndex;
    while(matcher.find()){
        endIndex = matcher.start();
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        content = block.substring(startIndex, endIndex+1);
Figure 4.31 The Function of Extracting the Paragraph (first part)
In this procedure, the parent Parser detects paragraphs using paragraph_Exp = .\\r\\n[ ]+[A-Z]. As mentioned before, a block reaches this procedure only after being tested as a figure or table caption or a list; it may also start with a header, which is extracted first, with the rest of the block returned. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure.
Note that the block may contain more than one paragraph, so all of them have to be detected and separated, and each is added to the paragraphs list, with its page number, in the paper object.
        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length()-1;
    }
    if(paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size()-1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}
Figure 4.32 The Function of Extracting the Paragraph (continued)
4.2 The Natural Language Processing (NLP)
4.2.1 Introduction
In this section, the text extracted from the scientific papers has to be refined. We focus on the important words in the text, such as nouns and verbs, and ignore the stop words, such as prepositions and articles, so that plagiarism can be detected efficiently even if the user tries to play with the wording.
4.2.2 The Implementation Overview
First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove Punctuations
5. Remove Stop words
6. Lemmatization
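As a minimal, self-contained sketch of these six steps (using a toy tokenizer, stop-word set, and suffix-stripping lemmatizer in place of the NLTK components the project actually uses):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def process_text(document):
    # Steps 1-2 and 4: lowercase, then tokenize; the regex keeps only word
    # characters, which also drops punctuation tokens.
    words = re.findall(r"[a-z0-9']+", document.lower())
    # Step 5: remove stop words (POS tagging, step 3, is omitted here).
    words = [w for w in words if w not in STOP_WORDS]
    # Step 6: toy lemmatization by stripping a plural "s".
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

assert process_text("The ideas of plagiarism.") == ["idea", "plagiarism"]
```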
4.2.3 The Text Processing Procedure
4.2.3.1 Lowercase
In this step, all the text is converted to lowercase, so we do not store redundant data for the same word written in different cases (Play, play).
4.2.3.2 Tokenization
def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words
Figure 4.33 Process Text Function
def tokenizeWords(sentence):
    return word_tokenize(sentence)
Figure 4.34 Tokenizing words Function
Here, we split the text into words using the Treebank tokenization algorithm provided by NLTK. This algorithm splits the text into words in an intelligent, rule-based way and also separates words from surrounding punctuation.
For Example:
4.2.3.3 Part of Speech (POS) tagging
The purpose of POS tagging is to find the grammatical role of each word in the sentence: it detects whether a word is a verb, noun, adjective, or adverb. This information helps return words to their base forms; verbs, for example, are reduced to their infinitives.
We use the WordNet database to get the words' base forms.
1. i’m → [ 'i', "'m" ]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']
Figure 4.35 Tokenization Example
words = ['at', '5', 'am', 'tomorrow', 'morning', 'the', 'weather', 'will', 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)
Figure 4.36 POS Function
[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'), ('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will', 'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]
Figure 4.37 POS Output Example
def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuations
In this step, the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.
4.2.3.5 Remove Stop words
In this step, the stop words (common function words) are removed.
def removePunctuation(words):
    # keep only tokens longer than one character, which drops punctuation tokens
    new_words = []
    for word in words:
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words
Figure 4.39 Removing Punctuations Function
def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words
Figure 4.40 Removing Stop Words Function
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Figure 4.41 Stop Words list
4.2.3.6 Lemmatization
In this step, we use the information obtained from POS tagging to get the base forms of the words, by passing each word and its WordNet position to the lemmatize function.
After processing, the paragraph contains only the important words that carry the real meaning of the paragraph.
4.2.4 Example of the Text Processing
def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0], getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words
Figure 4.42 Lemmatization Function
Plagiarism is the wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work. The idea remains problematic with unclear definitions and unclear rules. The modern concept of plagiarism as immoral and originality as an ideal emerged in Europe only in the 18th century, particularly with the Romantic movement. Plagiarism is considered academic dishonesty and a breach of journalistic ethics. It is subject to sanctions like penalties, suspension, and even expulsion. Recently, cases of 'extreme plagiarism' have been identified in academia. Plagiarism is not in itself a crime, but can constitute copyright infringement. In academia and industry, it is a serious ethical offense. Plagiarism and copyright infringement overlap to a considerable extent, but they are not equivalent concepts, and many types of plagiarism do not constitute copyright infringement, which is defined by copyright law and may be adjudicated by courts. Plagiarism is not defined or punished by law, but rather by institutions (including professional associations, educational institutions, and commercial entities, such as publishing companies).
Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication', 'another', 'author', "'s", 'language', 'thought', 'idea', 'expression', 'representation', 'one', "'s", 'original', 'work', 'idea', 'remain', 'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern', 'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge', 'europe', '18th', 'century', 'particularly', 'romantic', 'movement', 'plagiarism', 'consider', 'academic', 'dishonesty', 'breach', 'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty', 'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme", 'plagiarism', 'identify', 'academia', 'plagiarism', 'crime', 'constitute', 'copyright', 'infringement', 'academia', 'industry', 'serious', 'ethical', 'offense', 'plagiarism', 'copyright', 'infringement', 'overlap', 'considerable', 'extent', 'equivalent', 'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright', 'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate', 'court', 'plagiarism', 'define', 'punish', 'law', 'rather', 'institution', 'include', 'professional', 'association', 'educational', 'institution', 'commercial', 'entity', 'publish', 'company']
Figure 4.44 Paragraph after Text Processing
4.3 Term Weighting
In this section we calculate the term weights for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.
4.3.1 Lost Connection to Database Problem
First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs, and the connection must stay open all that time, so we face the problem of losing the connection to the database when its internal timeout expires.
1) Increasing timeout solution
This problem could be solved by increasing the timeout, but this solution is limited, as we might have a very large number of paragraphs whose processing exceeds whatever timeout was set.
2) Better solution
We retrieve 100 paragraphs, process them, and close the connection. Then we open a new connection and retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.
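The batching idea can be sketched independently of MySQL (fetch and handle are hypothetical stand-ins for the SELECT and the per-paragraph processing):

```python
def process_in_batches(total_unprocessed, batch_size, fetch, handle):
    # Open a fresh connection per batch (omitted here), fetch at most
    # batch_size unprocessed rows, process them, then reconnect.
    done = 0
    while done < total_unprocessed:
        remain = min(batch_size, total_unprocessed - done)
        rows = fetch(remain)   # e.g. SELECT ... WHERE processed = false LIMIT remain
        for row in rows:
            handle(row)
            done += 1
    return done

assert process_in_batches(250, 100, fetch=range, handle=lambda r: None) == 250
```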
cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()

pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100:
        remain = 100
    rows = connection1.run("SELECT paragraphId,content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()
Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph
Each paragraph is passed to the processText() procedure to get an array of refined words. If the array is empty, it means there were no important words in the paragraph, and the paragraph is deleted.
The returned words are used to generate k-gram terms, which populate the term table and the paragraphVector table.
Finally, we update the length of the paragraph with the number of words returned from processText() and mark the paragraph as processed.
4.3.3 Generating Terms
To generate terms we call the generateTerms() procedure and pass it the bag of words and the list of k-gram sizes we want to generate.
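The core of the k-gram generation is a sliding window over the word list; a compact Python sketch equivalent to createTerms() in Figure 4.47:

```python
def create_terms(words, k):
    # Slide a window of size k over the word list and join each window
    # into one term string.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

words = ["physic", "one", "old", "academic"]
assert create_terms(words, 1) == words
assert create_terms(words, 2) == ["physic one", "one old", "old academic"]
assert create_terms(words, 3) == ["physic one old", "one old academic"]
```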
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000:
        remain = 1000
    rows = connection1.run("SELECT paragraphId,content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;",
                            (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s"
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()
Figure 4.46 Process Paragraph Function
An example of k-gram generation:
data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i:
            continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t
Figure 4.47 Generate k-gram Terms Function
Physics is one of the oldest academic disciplines, perhaps the oldest through its inclusion of astronomy. Over the last two millennia, physics was a part of natural philosophy along with chemistry, biology, and certain branches of mathematics.
Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old', 'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part', 'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch', 'mathematics']
Figure 4.49 1-gram terms
['physic one', 'one old', 'old academic', 'academic discipline', 'discipline perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part natural', 'natural philosophy', 'philosophy along', 'along chemistry', 'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']
Figure 4.50 2-gram terms
['physic one old', 'one old academic', 'old academic discipline', 'academic discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last two millennium', 'two millennium physic', 'millennium physic part', 'physic part natural', 'part natural philosophy', 'natural philosophy along', 'philosophy along chemistry', 'along chemistry biology', 'chemistry biology certain', 'biology certain branch', 'certain branch mathematics']
Figure 4.51 3-gram terms
['physic one old academic', 'one old academic discipline', 'old academic discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last', 'inclusion astronomy last two', 'astronomy last two millennium', 'last two millennium physic', 'two millennium physic part', 'millennium physic part natural', 'physic part natural philosophy', 'part natural philosophy along', 'natural philosophy along chemistry', 'philosophy along chemistry biology', 'along chemistry biology certain', 'chemistry biology certain branch', 'biology certain branch mathematics']
Figure 4.52 4-gram terms
['physic one old academic discipline', 'one old academic discipline perhaps', 'old academic discipline perhaps old', 'academic discipline perhaps old inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion astronomy last two millennium', 'astronomy last two millennium physic', 'last two millennium physic part', 'two millennium physic part natural', 'millennium physic part natural philosophy', 'physic part natural philosophy along', 'part natural philosophy along chemistry', 'natural philosophy along chemistry biology', 'philosophy along chemistry biology certain', 'along chemistry biology certain branch', 'chemistry biology certain branch mathematics']
Figure 4.53 5-gram terms
4.3.4 Populating the term and paragraphVector Tables
After generating the terms, we use them to populate the term and paragraphVector tables.
4.3.4.1 Calculate Term Frequency
We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.
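nltk.FreqDist is essentially a counter of occurrences; the standard library collections.Counter behaves the same way for this purpose:

```python
from collections import Counter

# Count how often each 1-gram term occurs in a (processed) paragraph.
terms = ["physic", "old", "physic", "part", "old", "old"]
tf = Counter(terms)

assert tf["old"] == 3
assert tf["physic"] == 2
assert tf["part"] == 1
```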
4.3.4.2 Inserting Terms
We insert each term together with its k-gram size.
4.3.4.3 Inserting ParagraphVector
In this step we link each term with its paragraph and its term frequency by inserting these into the paragraphVector table.
tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])
Figure 4.54 Calculate Term Frequency
query1 = ("INSERT INTO term (kgram, term) VALUES (%s, %s)"
          " ON DUPLICATE KEY UPDATE kgram = kgram, term = term;")
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()
Figure 4.55 Inserting Terms into the Database
query2 = ("INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram)"
          " VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s), %s, %s);")
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term], str(kgram))
                for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()
Figure 4.56 Inserting the Paragraph Vector into the Database
4.3.5 Executing VSM Algorithm
After all paragraphs have been processed and inserted into the database, we run some stored SQL procedures to update the inverseDocFreq, BM25, and pivotNorm columns in the term and paragraphVector tables.
Now the system is ready: all terms are weighted and available for plagiarism testing.
connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))
Figure 4.57 Executing the VSM Algorithm
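The project's exact SQL formula is not shown, but the parameters (0.75, 1.5) passed to update_BM25 above match the usual Okapi BM25 constants b and k1; a hedged Python sketch of the standard per-term weight:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    # Standard Okapi BM25: an IDF factor times a saturating TF factor
    # normalised by document length.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

w1 = bm25_weight(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
w5 = bm25_weight(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
assert 0 < w1 < w5  # weight grows (sub-linearly) with term frequency
```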
4.4 Testing Plagiarism
When a user submits a text or a file to test for plagiarism, the text must first be split into paragraphs, and an inputPaper row is inserted to relate these paragraphs together.
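A minimal sketch of the paragraph splitting (the project's actual tokenizer is not shown; this assumes paragraphs are separated by blank lines):

```python
import re

def tokenize_paragraphs(text):
    # Split the submitted text on blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
assert tokenize_paragraphs(text) == ["First paragraph.", "Second paragraph.", "Third."]
```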
4.4.1 Process Paragraph
Each paragraph is then processed in a similar way to the pre-processing: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.
connection.run("INSERT INTO inputPaper (inputPaperId) VALUES('');")
paragraphs = tokenizeParagrapgs(text)
Figure 4.58 Tokenizing and Linking the Paragraphs Together
for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1:
        continue
    cursor = connection.run("INSERT INTO inputParagraph (content,inputPaperId)"
                            " VALUES (%s,%s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)
Figure 4.59 Process input paragraphs
def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term Frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = ("INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) "
             "SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s;")
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram))
                    for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()
Figure 4.60 Populate input paragraph vector
4.4.2 Calculate Similarity
After all data has been inserted, we call the calculateSimilarity stored procedure in SQL to
calculate the similarity between all inserted paragraphs and the original paragraphs. The
procedure uses the BM25 weights, and the thresholds for the different k-grams are passed to it.
4.4.3 Get Results
Finally, we fetch the results from the similarity table, format them as JSON, and send them to
the client interface.
connection.callProcedure('calculateSimilarity', ('BM25', 0, 0, 2, 5, 8))
Figure 4.61 Calculate Similarity
def getResults(connection, paperId, paragraphsNum, beginTime):
    cursor = connection.callProcedure('getResults', (paperId,))
    columnNames = ('inParagraph', 'originalParagraph', 'page', 'paperTitle',
                   'volume', 'issue', 'journal', 'publisher', 'issn', 'similarity',
                   'paragraphMagnitude', 'inputParagraphMagnitude')
    for results in cursor.stored_results():
        all = results.fetchall()
        finalResults = []
        for result in all:
            temp = []
            for index, name in enumerate(columnNames):
                value = result[index + 2]
                if name == 'inParagraph' or name == 'originalParagraph':
                    value = value.strip()
                temp.append((name, value))
            temp = dict(temp)
            finalResults.append(temp)
Figure 4.62 Get Results
4.5 The VSM Algorithm
4.5.1 Calculating similarity
The similarity between an input paragraph and the paragraphs in the dataset is calculated using
the dot product. To optimize performance without affecting accuracy, we represent each paragraph
as 1-grams through 5-grams and calculate similarity on every representation, but after each
calculation we choose the paragraphs to which we limit further calculations, as discussed below.
4.5.1.1 Pseudo code of the dot product

dotProduct(k-gram, threshold, limiting conditions):
    Select the paragraphs to limit the calculation to, according to the limiting conditions
    and the calculating similarity plan.
    For each (paragraph, newInputParagraph) pair from the selected paragraphs:
        score = Σ_{kgram ∈ q∩d} count(kgram, inputParagraph) × weight(kgram, paragraph)
        If score > threshold:
            Insert score into the similarity table
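In plain Python the scoring loop of this pseudo code might be sketched as follows (illustrative only; in the system this sum is computed by a SQL join between paragraphVector and inputParagraphVector followed by aggregation):

```python
def dot_product(input_counts, paragraph_weights, threshold):
    """Sparse dot product over the common k-gram terms only:
    sum of count(term, inputParagraph) * weight(term, paragraph).
    Returns the score when it exceeds the threshold, else None."""
    score = sum(count * paragraph_weights[term]
                for term, count in input_counts.items()
                if term in paragraph_weights)
    return score if score > threshold else None
```

Only terms present in both vectors contribute, which is why the cost is driven by the number of common terms in the following complexity analysis.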
4.5.1.2 Time complexity analysis

Selecting the paragraphs to which we limit our calculations is done by joining the similarity
table with the paragraph table, which takes Θ(n log m), where m is the maximum of the paragraph
table size and the similarity table size, and n is the minimum of them.
The most costly parts are the inner join of the paragraphVector and inputParagraphVector tables,
and the aggregation of the resulting table.
Inner join time complexity: Θ(M log N + |M ∩ N|)
Aggregation time complexity: Θ(|M ∩ N| · log(#ip · #p))
Where:
M: size of the inputParagraphVector table, i.e. the sum of unique terms in each input paragraph.
N: size of the paragraphVector table, i.e. the sum of unique terms in each paragraph.
|M ∩ N|: sum of common terms between each (paragraph, input paragraph) pair.
#p: number of paragraphs in the dataset that have common terms with input paragraphs.
#ip: number of input paragraphs that have common terms with dataset paragraphs.
To optimize the dot product we need to minimize unnecessary |M ∩ N|, or limit the number of
paragraphs and input paragraphs that we operate on.
4.5.1.3 Design discussions based on complexity analysis

It is recommended to periodically empty the similarity and inputParagraph tables and store them
in separate backup storage, to speed up the join operation in the paragraph-selection step (the
first step).
Calculating similarity is only logarithmically proportional to the dataset size, which means we
can collect as much data as we can without a fatal impact on the system's response performance.
Since the complexity of the dot product operation depends on the number of common terms, we want
a calculating similarity plan which minimizes unnecessary common terms, and joins paragraphs with
many common terms only when they have a high probability of being plagiarized.
This is done by starting the similarity calculation with large k-grams, since matches on them
are much less probable than on lower ones; if possible plagiarism is found, we limit the
remaining calculations to the suspected paragraphs.
If we compare paragraphs on 1-grams or 2-grams we will find many common terms that do not imply
plagiarism, but if we compare them on 5-grams or 4-grams we will find common terms only when
plagiarism is highly probable.
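To make this concrete, a tiny illustrative helper (hypothetical; not the project's generateTerms) that produces the k-word shingles of a token list:

```python
def kgrams(words, k):
    """All k-word shingles of a token list. Two texts sharing a 5-gram
    are far stronger plagiarism evidence than two sharing a 1-gram."""
    return [' '.join(words[i:i + k]) for i in range(len(words) - k + 1)]
```

A 30-word paragraph yields 26 different 5-grams, and even a single shared 5-gram between two paragraphs is already a strong signal.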
4.5.1.4 Calculating similarity plan

1. Calculate similarity on 5-grams.
2. If any match is found: limit later calculations to the matched paragraphs. Else: do later
   calculations on all paragraphs.
3. Repeat for 4-grams, 3-grams, and 2-grams.
4. If any match was found in the previous calculations: calculate similarity on 1-grams.
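One possible reading of this plan as Python pseudocode (the calc callback is a hypothetical stand-in for one dotProduct call; candidates=None means "all paragraphs"):

```python
def similarity_plan(calc, thresholds):
    """Cascade from 5-grams down to 2-grams. Whenever a level finds
    matches, later levels are restricted to those paragraphs; the noisy
    1-gram pass runs only if some higher level matched."""
    candidates = None          # None means: consider all paragraphs
    matched_any = False
    for k in (5, 4, 3, 2):
        matches = calc(k, thresholds[k], candidates)
        if matches:
            candidates = matches
            matched_any = True
    if matched_any:
        calc(1, thresholds[1], candidates)
```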
4.5.2 K-means and Clustering
Text document clustering differs from ordinary clustering because there are as many dimensions
as there are terms in the dataset. To avoid iterating over every term, we use the fact that each
document/paragraph contains only a small number of terms, and 'iterate' (or join, since we are
using an RDBMS) only over the unique terms of each document.
The text clustering K-means algorithm is described in the following flowchart.
1. Start: choose c random centroids from the dataset; set i = 0.
2. Calculate the similarity between each (centroid, paragraph) pair using the dot product.
3. Assign each paragraph to the centroid with maximum similarity.
4. Move each centroid to the mean of the points assigned to it.
5. Cost = ( Σ_c Σ_{p∈c} Σ_{w∈p∪c} (weight(w, p) − weight(w, c))² ) / (number of paragraphs in dataset)
6. Increment i; while the maximum step a centroid moved > epsilon and i ≤ maxIter, repeat from
   step 2; otherwise end.

Figure 4.63 Flowchart of the K-means text clustering algorithm

4.5.2.1 Time complexity analysis
Define:
K: number of clusters / centroids
P: number of paragraphs
N: sum of unique terms in each paragraph, also the number of rows in the paragraphVector table:
N = Σ_p Σ_{t∈p} 1
M: sum of unique terms in each centroid, also the number of rows in the centroid table:
M = Σ_c Σ_{t∈c} 1
where t is a term, p a paragraph, and c a centroid.
|M ∩ N|: sum of unique terms that appear in both a paragraph and a centroid, for each
(paragraph, centroid) pair; also the number of rows resulting from joining the paragraph and
centroid tables on term:
|M ∩ N| = Σ_p Σ_c Σ_{t∈p and t∈c} 1 = Σ_p Σ_c Σ_{t∈p∩c} 1
maxIter: maximum number of iterations before the program terminates even if it has not
converged yet.
The most expensive part of the main loop is the inner join between paragraphs and centroids on
term, followed by the aggregation that calculates similarity to assign clusters; note that the
mathematical operations in the aggregation part (multiplication and summation) may have larger
hidden constants.
Assuming a B-tree index on the primary key, the time complexity of the join is:
Best case, with no duplicates: Ω(M log N).
General case: Θ(M log N + |M ∩ N|).
If both tables were indexed, the join complexity could be linear (M + N) instead of M log N,
but the cost of inserting records into the centroid table in the centroid-moving step would then
be n log n instead of just n.
The time complexity of the aggregation is Θ(|M ∩ N| · log(K·P)).
So the time complexity of the K-means algorithm is:
O(maxIter · (M log N + |M ∩ N| · log(K·P)))
The time complexity of the whole clustering operation is a small integer multiple of the K-means
complexity, since K-means is repeated multiple times to avoid local optima.
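A minimal Python sketch of the assignment step of this K-means, assuming sparse term-weight dictionaries (illustrative; the system performs the equivalent join on term and aggregation in SQL):

```python
def assign_clusters(paragraphs, centroids):
    """For each sparse paragraph vector, pick the centroid with maximum
    dot-product similarity, iterating only over the terms the paragraph
    actually contains instead of looping over all dimensions."""
    assignment = {}
    for pid, vec in paragraphs.items():
        best, best_score = None, -1.0
        for cid, cvec in centroids.items():
            score = sum(w * cvec.get(t, 0.0) for t, w in vec.items())
            if score > best_score:
                best, best_score = cid, score
        assignment[pid] = best
    return assignment
```

Note that a paragraph sharing no term with any centroid scores zero everywhere and is assigned arbitrarily, which is the issue discussed next.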
Issue:
Each paragraph in the dataset is assigned to a centroid by measuring the similarity between
both; if a paragraph has no common term with any of the centroids, not only will it have zero
similarity, but it will have n
4.6 Server Side
4.6.1 Handling Routing
This is the home route for testing a document for plagiarism. First we check whether the
pre-processing is running; if it is, we do not allow the user to submit any test in the meantime.
app.get('/', function(req, res) {
    updateFooterInfo();
    var q = "SELECT trainingOn FROM siteInfo WHERE id = 1;";
    connection.query(q, function(err, rows, field) {
        if (err) return res.status(404).send("Error: " + err);
        if (rows.length === 0 || !rows[0].trainingOn) {
            res.render('home');
        } else if (rows[0].trainingOn) {
            res.render('no_test_now');
        }
    });
});
Figure 4.64 Home Page Routing
app.get('/admin/pre_process', function(req, res) {
    updateFooterInfo();
    var q = "SELECT trainingOn FROM siteInfo WHERE id = 1;";
    connection.query(q, function(err, rows, field) {
        if (err) return res.status(404).send("Error: " + err);
        if (rows.length === 0 || !rows[0].trainingOn) {
            var query = "SELECT COUNT(*) AS num FROM paragraph WHERE processed=false;";
            connection.query(query, function(err, rows, field) {
                if (err) return res.status(404).send("Error: " + err);
                var paragraphNums = rows[0].num;
                res.render('train', {
                    paragraphNums: paragraphNums,
                    word: "Unprocessed Paragraphs",
                    load: false
                });
            });
        } else if (rows[0].trainingOn) {
            res.render('train', {word: 'running ...', load: true});
        }
    });
});
Figure 4.65 Pre-Process Page Routing
This is the route for the system's pre-processing page, where we can follow the progress of the
pre-processing if it is already running and processing the paragraphs in the database.
Otherwise, it shows the status of the database, including the number of unprocessed paragraphs,
and a button to start the pre-processing.
4.6.2 Running Python System
When the user connects to the server, a socket session id is assigned to him. Then, if the
server receives a submitTest message from the client, the server reads the text sent from the
client, runs the Python code for testing plagiarism, and passes the text to it.
While the Python code is running, it keeps sending progressUpdateTest messages to update the
progress bar shown in the client; when the Python code sends a doneTest message, the server
passes the results and score of the plagiarism test to the client.
io.on('connection', function(client) {
    client.on('submitTest', function(data) {
        var text = data.text;
        pyshell = new PythonShell('/python/main.py', {mode: 'text', args: ['test', 'browser']});
        pyshell.send(text);
        pyshell.on('message', function(message) {
            message = JSON.parse(message);
            var results;
            if (message.done)
                client.emit('doneTest', {
                    data: message.data,
                    score: message.score,
                    time: message.time,
                    scoreValue: message.scoreValue
                });
            else
                client.emit('progressUpdateTest', {
                    kind: message.kind,
                    percent: message.percent
                });
        });
        pyshell.end(function(err) {
            if (err) throw err;
        });
    });
});
Figure 4.66 Communicating between the Server and
the Core Engine for testing plagiarism
After a socket session id has been assigned to the user, if the server receives a submitTrain
message from the client, the server runs the Python code for vectorization.
While the Python code is running, it keeps sending progressUpdateTrain messages to update the
progress bar shown in the client; when the Python code sends a doneTrain message, the server
passes on the status and timings of the pre-processing.
io.on('connection', function(client) {
    client.on('submitTrain', function(data) {
        var q = "INSERT INTO siteInfo (id,trainingOn) VALUES(1, true) " +
                "ON DUPLICATE KEY UPDATE trainingOn = true";
        connection.query(q, function(err, rows, field) {
            if (err) console.error("Error: " + err); // no `res` object in a socket handler
        });
        pyshell = new PythonShell('/python/main.py', {mode: 'text', args: ['train', 'browser']});
        pyshell.on('message', function(message) {
            message = JSON.parse(message);
            if (message.done) {
                io.emit('doneTrain', {
                    processedParagraphs: message.processedParagraphs,
                    remainParagraphs: message.remainParagraphs,
                    insertTime: message.insertTime,
                    inverseDocVecTime: message.inverseDocVecTime,
                    bm25Time: message.bm25Time,
                    pivotNormTime: message.pivotNormTime,
                    totalTime: message.totalTime
                });
            } else {
                io.emit('progressUpdateTrain', {kind: message.kind, percent: message.percent});
            }
        });
        pyshell.end(function(err) {
            if (err) throw err;
        });
    });
})
Figure 4.67 Communicating between the Server and
the Core Engine for Pre-processing
4.7 Client Side
After the input text is completely tested and the plagiarized parts are identified, we use the
LCS (Longest Common Subsequence) algorithm to detect the similar parts between the input and
the matched paragraph.
function LCS(a, b) {
    var m = a.length, n = b.length, C = [], i, j;
    for (i = 0; i <= m; i++) C.push([0]);
    for (j = 0; j < n; j++) C[0].push(0);
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            C[i+1][j+1] = a[i] === b[j] ? C[i][j] + 1 : Math.max(C[i+1][j], C[i][j+1]);
    return (function bt(i, j) {
        if (i * j === 0) { return ""; }
        if (a[i-1] === b[j-1]) { return bt(i-1, j-1) + a[i-1] + ' '; }
        return (C[i][j-1] > C[i-1][j]) ? bt(i, j-1) : bt(i-1, j) + '\n';
    }(m, n));
}
Figure 4.68 Longest Common Subsequence (LCS) Algorithm
for (var i = 0; i < data.data.length; i++) {
    var p1 = data.data[i].inParagraph;
    var p2 = data.data[i].originalParagraph;
    var lcs_for_1 = LCS(p1.split(' '), p2.split(' '));
    lcs_for_1 = lcs_for_1.split('\n').filter(function(s) { return s; })
                         .map(function(s) { return s.trim(); });
    var tempStr = '', currentCursor = 0, charCounter = 0;
    for (j = 0; j < lcs_for_1.length; j++) {
        var sub = lcs_for_1[j];
        tempStr += p1.slice(currentCursor, p1.indexOf(sub));
        var t = p1.slice(p1.indexOf(sub), p1.indexOf(sub) + sub.length);
        tempStr += '<span class="highlight-text-1">' + t + '</span>';
        charCounter += t.length;
        currentCursor = p1.indexOf(sub) + sub.length;
    }
    tempStr += p1.slice(currentCursor, p1.length);
}
Figure 4.69 Highlighting the common parts found by LCS
The detected parts returned from LCS are used to highlight the input paragraph and the matched
paragraph, and to calculate a similarity score.
4.8 The GUI of the System
This is the interface of the system, where the user inputs the text to be tested and starts the
analysis; the system then runs until it finishes, and the results are shown.
Figure 4.70 Submitting an input document
Figure 4.71 The Results of the Process Part 1
The results appear as shown in Figure 4.72:
1. The yellow highlighted text is the plagiarized text in the input text.
2. The red highlighted text is the source of the plagiarized text.
Our system also shows the time of the process, the percentage of similarity between the two
texts, and information about the paper that contains the source text.
Figure 4.72 The Results of the Process Part 2
Chapter 5 Results and Discussion
5.1 Dataset of the Parser
After parsing the scientific papers that were downloaded by the Crawler and passed to the
Parser, the statistics of the resulting dataset are as follows.
Table 1 Statistics of the Parser

Publisher      | Num of Journals | Num of Papers | Avg Pages/Paper | Avg Paragraphs/Paper | Avg Words/Paper
IEEE           | 130             | 609           | 10              | 149                  | 9363
Springer       | 59              | 541           | 25              | 386                  | 21603
Science Direct | 4               | 1206          | 27              | 406                  | 21092
Figure 5.1 Number of Papers Published per Year in IEEE
(Bar chart: number of papers on the y-axis against year of publishing on the x-axis.)
Figure 5.2 Number of Papers Published per Year in Springer
Figure 5.3 Number of Papers Published per Year in Science Direct
We tested the plagiarism engine on vast real-world data without the clustering process; the
dataset contains text from English academic papers collected from IEEE, Springer, and Science
Direct.
5.2 Exploring dataset
We tested the system on two datasets that differ in size and compared the performance of the
system on both; the small one has 15K paragraphs and the big one has 50K paragraphs.
Here are some useful and insightful statistics about the two datasets.
5.2.1 Small dataset (15K)

Table 2 Dataset Statistics

Number of paragraphs       | 15342
Number of paragraphs       | 15065
Total length of paragraphs | 462103
Average paragraph length   | 30.6739
Table 3 Unique Terms count in each Paragraph

k-gram                                             | Unique k-gram terms per paragraph (sum) | Average Document Frequency
1-gram                                             | 358140                                  | 11.0725
2-gram                                             | 417189                                  | 1.5651
3-gram                                             | 419472                                  | 1.1462
4-gram                                             | 410394                                  | 1.0766
5-gram                                             | 399109                                  | 1.0513
Sum of all k-grams (size of paragraphVector table) | 2004304                                 | -
Table 4 Unique Terms count in Dataset

k-gram                                  | Unique k-gram terms in dataset
1-gram                                  | 32345
2-gram                                  | 266553
3-gram                                  | 365979
4-gram                                  | 381179
5-gram                                  | 379651
Sum of all k-grams (size of term table) | 1425707
5.2.2 Big dataset (50K)

Table 5 Dataset Statistics

Number of paragraphs       | 50792
Total length of paragraphs | 1561846
Average paragraph length   | 30.7498
Table 6 Unique Terms count in each Paragraph

k-gram                                             | Unique k-gram terms per paragraph (sum) | Average Document Frequency
1-gram                                             | 1206539                                 | 13.6523
2-gram                                             | 1402768                                 | 1.7401
3-gram                                             | 1409069                                 | 1.1597
4-gram                                             | 1379124                                 | 1.0835
5-gram                                             | 1342290                                 | 1.0572
Sum of all k-grams (size of paragraphVector table) | 6739790                                 | -
Table 7 Unique Terms count in Dataset

k-gram                                  | Unique k-gram terms in dataset
1-gram                                  | 88376
2-gram                                  | 806165
3-gram                                  | 1215077
4-gram                                  | 1272892
5-gram                                  | 1269639
Sum of all k-grams (size of term table) | 4652149
5.3 Performance
The data was preprocessed on a machine with the following specifications:
Processor: 4x Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Memory: 3895MB
Operating System: Arch Linux
Kernel: 4.4.1
And the performance was found as follows:
Figure 5.4 Response time against number of paragraphs tested on small dataset
Table 8 Processing time of each module in Plagiarism Engine

Module                                              | Small dataset (s) | Big dataset (s)
Natural language processing and vectorization       | 2899.04           | 15982.93
Calculating document frequency and other statistics | 36.77             | 171.05
Calculating BM25 weights                            | 65.49             | 256.52
Calculating pivoted length normalization weights    | 53.07             | 225.14
Total pre-processing time                           | 3054.45           | 16658.95
Typical response time                               | 10-20             | 60
(Chart: response time in seconds on the y-axis against the number of tested paragraphs,
0 to 120, on the x-axis.)
Figure 5.5 Screenshot of the System Performance from the System GUI
Note that these measurements include the time of initiating connections, reading/writing from
the database, and other overhead, not only the processing time.
5.4 Detecting plagiarism
We tested the plagiarism engine once, since the size of the dataset has no considerable effect
on the accuracy of the system and both datasets give similar results; we tested the system with
input paragraphs plagiarized from a randomly chosen paragraph of the dataset.
Original paragraph: “. Finally, our proposed fault detection schemes and almost all of the
previously reported ones have been implemented on the recent Xilinx Virtex FPGAs, and their area and
delay overheads have been derived and compared. The FPGA implementation results show the low area
and delay overheads for the proposed fault detection schemes.”
With id = 2766
We will use only “Finally, our proposed fault detection schemes and almost all of the previously
reported ones have been implemented on the recent Xilinx Virtex FPGAs” to test the system.
We chose parameters as follows:
Table 9 Parameters
value
k for BM25 1.5
b for BM25 0.75
b for pivoted length normalization 0.75
Threshold for 5-gram 0
Threshold for 4-gram 0
Threshold for 3-gram 2
Threshold for 2-gram 5
Threshold for 1-gram 8
The paragraphs we used in testing and the results of testing are shown in the table below.
5.4.1 Percentage score functions
We used the dot product similarity function with BM25 weighting for evaluating similarity;
however, we need a normalized similarity function to get a percentage score that is easy for
users to understand, so we used cosine similarity and the Longest Common Subsequence (LCS).
Cosine similarity: a function similar to the dot product with built-in normalization; it is the
dot product score divided by the maximum possible similarity. It is 100% if the input paragraph
has the same lemmas as all non-stop words of the original paragraph; changing the phrase tense
or reordering it will not affect the cosine similarity score, while copying only half the
original phrase gives a score of about 50%, and so on.
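A sketch of this normalization with simple term-weight dictionaries (hypothetical names; the system computes the equivalent with BM25-weighted vectors in SQL): the score is the dot product divided by the maximum possible score, i.e. the original paragraph matched against itself.

```python
def percentage_score(input_vec, original_vec):
    """Dot product divided by the maximum possible similarity (the
    original paragraph against itself): with unit weights, copying
    half of the original phrase yields a score of 50%."""
    dot = sum(w * original_vec.get(t, 0.0) for t, w in input_vec.items())
    max_score = sum(w * w for w in original_vec.values())
    return dot / max_score if max_score else 0.0
```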
LCS: the LCS algorithm finds the common parts between the input and the original paragraph and
highlights them; the percentage score is the number of common words divided by the number of
words in the original paragraph. Unlike cosine similarity, LCS compares the words literally
without NLP analysis, so it cannot detect paraphrased plagiarism.
Table 10 Testing Paragraphs and Results

1. "Finally, our proposed fault detection schemes and almost all of the previously reported
   ones have been implemented on the recent Xilinx Virtex FPGAs"
   Description: Copied. Detected: Yes. Cosine similarity: 50%. LCS similarity: 43%.
2. "Finally, we propose fault detection schemes and most of the previous reports will have
   implementation on the recent Xilinx Virtex FPGAs"
   Description: Slightly paraphrased (changed phrase tense and stop words). Detected: Yes.
   Cosine similarity: 40%. LCS similarity: 22%.
3. "At last, we suggest error detection schemes and nearly all of the detected errors were
   applied on the past methods."
   Description: Highly paraphrased (words replaced with synonyms or negated antonyms).
   Detected: No.
4. "almost all of the previously reported ones and our proposed fault detection schemes have
   been implemented on the recent Xilinx Virtex FPGAs"
   Description: Slightly rearranged. Detected: Yes. Cosine similarity: 40%. LCS similarity: 28%.
5. "most of the ones reported previously and fault detection schemes that we proposed have been
   recently implemented on the Virtex Xilinx FPGAs"
   Description: Moderately rearranged (no 4 successive non-stop words as in the original
   paragraph). Detected: Yes. Cosine similarity: 40%. LCS similarity: 13%.
6. "schemes of fault detection that we proposed and nearly all of the ones previously reported
   have been recently Xilinx implemented on Virtex FPGAs"
   Description: Highly rearranged (no 3 words in the same order as the original paragraph).
   Detected: Yes. Cosine similarity: 40%. LCS similarity: 24%.
7. "proposed, our fault schemes Finally detection and almost reported all of the implemented
   previously ones Xilinx have been FPGAs on the recent Virtex"
   Description: Extremely rearranged (no two words in the same order as the original paragraph).
   Detected: No.
5.5 Discussing results
The system has a reasonably fast response, and the similarity-calculation step (which happens to
take a small fraction of the total response time) is the only step that depends on the size of
the dataset. Luckily, it is only logarithmically proportional to the dataset size, which implies
that the system is scalable to large datasets and will have a fast response on big data.
Increasing the volume of the data neither hurt the system's response performance nor affected
its accuracy in plagiarism detection.
The system can detect slightly paraphrased plagiarism because we use shallow NLP techniques to
lemmatize and remove stop words, so the system can detect plagiarized sentences with changes in
grammatical tense, sentence structure, or stop words.
Shallow NLP cannot detect strongly paraphrased sentences or idea plagiarism; to detect those
sentences we would have to use complex deep NLP methods, which would consume huge preprocessing
and response time. Note that the shallow NLP processing already occupies the majority of the
system's time complexity.
Since we use the bag-of-words method, our system is insensitive to word order; however, to
optimize performance we use the k-gram fingerprinting method, so the system deals with k-grams
(sequences of consecutive words) rather than each term alone. As discussed in the implementation
section, we calculate similarity on 2- to 5-grams, so the system can detect a rearranged
plagiarized sentence as long as at least two non-stop words remain in the original order and
those words are not very common.
This explains why the system detects the first three rearranged sentences but not the last one.
Note that the two words "Virtex FPGAs" are rare enough to imply plagiarism, and the system did
detect another paragraph from the same paper that contains this modified bi-gram; however, the
last sentence shares no 2-gram with the original sentence, so it is not detected.
We properly implemented the clustering algorithm and tested it on a very small dataset,
obtaining satisfying clustering results but very slow performance (as expected from a data
mining algorithm), so we could not use it on a real dataset.
If the system were used in a real-world application and hosted on a high-performance server,
the clustering module could then be applied, since the offline processing would not be a problem
and could be done on HPCC (High Performance Computing Clusters), and it would be very useful for
speeding up the response time.
Chapter 6 Conclusion
Plagiarism is the act of stealing someone else's work and claiming it as one's own, and we now
have a system to detect this act.
Our system consists of three parts: ETL (Parser, Crawler), Plagiarism Engine, and GUI for
testing a document.
For the ETL: it extracts papers from Open Access Journals on the internet, parses those papers
by extracting the paper info and the paper data, and loads the data into the database.
For the Plagiarism Engine: it performs text processing on the paragraphs, then tokenizes the
result and generates k-gram terms.
For the GUI: it receives the document to be tested, splits the document into paragraphs,
processes the text and generates k-grams, then compares the k-grams against the database using
the VSM algorithm for similarity, and highlights the matching parts that contain plagiarism.
Chapter 7 Appendix
7.1 Entity-Relation Diagram (ERD)
Figure 7.1 ERD of the plagiarism Engine database
3. paper table: contains info about the paper, such as the paper title, author, Digital Object
   Identifier (DOI), and other information.
4. paragraph table: contains the text of each paragraph, the length of the paragraph (after
   stop word removal), a normalizing magnitude of the paragraph (the similarity between the
   paragraph and an identical copy of it), a foreign key to the paperId, and other information.
5. paragraphVector table: contains a vectorized representation of paragraphs with different
   weights (TF, pivoted length normalization, BM25).
6. similarity table: contains the calculated similarity between paragraphs and input paragraphs.
7. inputParagraph(Vector) tables: similar to the original tables, but contain less information.
7.2 Stored procedures
CREATE DEFINER=`root`@`localhost` PROCEDURE `calculateMagnitude`()
BEGIN
    UPDATE paragraph
    INNER JOIN (SELECT paragraphId, SUM(BM25 * termFreq) AS magnitude
                FROM paragraphVector
                WHERE kgram = 1
                GROUP BY paragraphId) AS PV
        ON paragraph.paragraphId = PV.paragraphId
    SET paragraph.magnitude = PV.magnitude;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `calculateSimilarity`(
    IN method varchar(9), IN paperId INT,
    IN threshold5 VARCHAR(10), IN threshold4 VARCHAR(10), IN threshold3 VARCHAR(10),
    IN threshold2 VARCHAR(10), IN threshold1 VARCHAR(10))
BEGIN
    CALL dotProduct(method, paperId, '5', '', 0, threshold5);
    CALL dotProduct(method, paperId, '4', '5', 0, threshold4);
    CALL dotProduct(method, paperId, '3', '4, 5', 0, threshold3);
    CALL dotProduct(method, paperId, '2', '3, 4, 5', 0, threshold2);
    CALL dotProduct(method, paperId, '1', '2, 3, 4, 5', 1, threshold1);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `clustering`(IN c INT, IN maxIter INT)
BEGIN
    DECLARE minCost FLOAT;
    DECLARE i INT;
    DECLARE cost FLOAT;
    SET minCost := 10000000000;
    SET i = 0;
    WHILE (i < 2) DO
        SET i = i + 1;
        CALL kmeans(c, maxIter, cost);
        IF cost < minCost THEN
            SET @realStable := @stable;
            DROP TABLE IF EXISTS centroid;
            DROP TABLE IF EXISTS paragraphCluster;
            CREATE TABLE centroid LIKE tempCentroid;
            INSERT INTO centroid (SELECT * FROM tempCentroid);
            CREATE TEMPORARY TABLE paragraphCluster AS
                (SELECT paragraphId, clusterId FROM paragraph);
            SET minCost := cost;
        END IF;
    END WHILE;
    SET @minCost := minCost;
    UPDATE paragraph AS P
    INNER JOIN paragraphCluster AS PC ON P.paragraphId = PC.paragraphId
    SET P.clusterId = PC.clusterId;
    DROP TABLE IF EXISTS tempCentroid;
    DROP TABLE IF EXISTS paragraphCluster;
    ALTER TABLE centroid ADD INDEX termId_centroidIndex USING BTREE (termId);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `dotProduct`(
    IN method varchar(9), IN paperId INT, IN kgram varchar(1),
    IN conditionkgram varchar(50), IN conditionkgramOnly INT(1), IN threshold VARCHAR(10))
BEGIN
    DECLARE conditionstr VARCHAR(150) DEFAULT '';
    DECLARE conditionjoin VARCHAR(350) DEFAULT '';
    IF (conditionkgram != '') THEN
        IF (conditionkgramOnly = 0) THEN
            SET conditionjoin = 'LEFT OUTER JOIN similarity AS S2 ON (P.paragraphId = S2.paragraphId AND IP.inputParagraphId = S2.inputParagraphId)';
            SET conditionstr = CONCAT('AND (S2.kgram IN(', conditionkgram, ') OR S2.kgram IS NULL)');
        ELSE
            SET conditionjoin = 'INNER JOIN similarity AS S2 ON (P.paragraphId = S2.paragraphId AND IP.inputParagraphId = S2.inputParagraphId)';
            SET conditionstr = CONCAT('AND S2.kgram IN( ', conditionkgram, ' ) ');
        END IF;
    END IF;
    SET @s = CONCAT('INSERT INTO similarity (paragraphId, inputParagraphId, kgram, similarity)
        SELECT DISTINCT sim.paragraphId, sim.inputParagraphId, sim.kgram, sim.score
        FROM paragraph AS P
        INNER JOIN (
            SELECT paragraphId, inputParagraphId, IPV.kgram,
                   SUM(PV.', method, ' * IPV.termFreq) AS score
            FROM paragraphVector AS PV
            INNER JOIN inputParagraphVector AS IPV ON PV.termId = IPV.termId
            WHERE IPV.kgram = ', kgram, '
            GROUP BY paragraphId, inputParagraphId, IPV.kgram
            HAVING score >= ', threshold, '
        ) AS sim ON P.paragraphId = sim.paragraphId
        INNER JOIN (
            SELECT * FROM inputParagraph WHERE inputPaperId = ', paperId, '
        ) AS IP ON sim.inputParagraphId = IP.inputParagraphId
        ', conditionjoin, ' ', conditionstr, '
        AND P.clusterId = IP.clusterId;');
    PREPARE stmt FROM @s;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `fastUpdateParagraph`(IN paragraphId INT)
BEGIN
    PREPARE stmt1 FROM
        'UPDATE term
         INNER JOIN (SELECT termId, COUNT(paragraphId) AS idf
                     FROM paragraphVector WHERE paragraphId = ?
                     GROUP BY termId) AS PV ON term.termId = PV.termId
         SET term.inverseDocFreq = IFNULL(term.inverseDocFreq, 0) + PV.idf';
    SET @id := paragraphId;
    EXECUTE stmt1 USING @id;
    DEALLOCATE PREPARE stmt1;
    PREPARE stmt2 FROM
        'UPDATE paragraphVector AS PV, paragraph AS P, term AS T, datasetInfo AS I
         SET PV.BM25 = ((I.k+1) * PV.termFreq)
                       / (PV.termFreq + I.k*(1-I.b+I.b*(P.length/I.avdl)))
                       * LOG10((I.numDoc+1)/T.InverseDocFreq)
         WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId
           AND P.paragraphId = ?';
    EXECUTE stmt2 USING @id;
    DEALLOCATE PREPARE stmt2;
    PREPARE stmt3 FROM
        'UPDATE paragraphVector AS PV, paragraph AS P, term AS T, datasetInfo AS I
         SET PV.pivotNorm = (LN(1+LN(1+PV.termFreq)) / (1-I.b+I.b*(P.length/I.avdl)))
                            * LOG10((I.numDoc+1)/T.inverseDocFreq)
         WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId
           AND P.paragraphId = ?';
    EXECUTE stmt3 USING @id;
    DEALLOCATE PREPARE stmt3;
    PREPARE stmt4 FROM
        'UPDATE datasetInfo AS D, paragraph AS P
         SET D.numDoc = D.numDoc + 1,
             D.avdl = (P.length + D.totalLength) / D.numDoc,
             D.totalLength = D.totalLength + P.length
         WHERE D.id = 1 AND P.paragraphId = ?';
    EXECUTE stmt4 USING @id;
    DEALLOCATE PREPARE stmt4;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `findCluster`()
BEGIN
    CREATE OR REPLACE VIEW centroidSimilarity AS (
        SELECT centroidId, inputParagraphId,
               SUM(centroid.value * inputParagraphVector.termFreq) AS score
        FROM centroid
        INNER JOIN inputParagraphVector ON centroid.termId = inputParagraphVector.termId
        WHERE inputParagraphVector.kgram = 1
        GROUP BY centroidId, inputParagraphId
    );
    UPDATE inputParagraph AS IP
    INNER JOIN (SELECT CS1.inputParagraphId, CS1.centroidId
                FROM centroidSimilarity AS CS1
                INNER JOIN (SELECT inputParagraphId, MAX(score) AS m
                            FROM centroidSimilarity
                            GROUP BY inputParagraphId) AS CS2
                    ON CS1.inputParagraphId = CS2.inputParagraphId
                WHERE CS1.score = CS2.m) AS maxSimilarity
        ON IP.inputParagraphId = maxSimilarity.inputParagraphId
    SET IP.clusterId = maxSimilarity.centroidId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `getResults`(IN inPaperId INT)
BEGIN
    SELECT s1.inputParagraphId, s1.paragraphId, s3.inParagraph, s4.originalParagraph,
           s4.pageNumber, s5.paperTitle, s5.DOI, s5.volume, s5.issue,
           s6.journal, s6.publisher, s6.ISSN, s7.author, s1.similarity, s4.magnitude
    FROM similarity AS s1
    INNER JOIN (SELECT paragraphId, m.inputParagraphId
                FROM similarity
                INNER JOIN (SELECT inputParagraphId, MAX(similarity) AS ms
                            FROM similarity
                            GROUP BY inputParagraphId) AS m
                    ON similarity.inputParagraphId = m.inputParagraphId
                   AND similarity.similarity = m.ms) AS s2
        ON s1.inputParagraphId = s2.inputParagraphId AND s1.paragraphId = s2.paragraphId
    INNER JOIN (SELECT inputParagraphId, inputPaperId, content AS inParagraph
                FROM inputParagraph) s3
        ON s1.inputParagraphId = s3.inputParagraphId
    INNER JOIN (SELECT paperId, paragraphId, pageNumber, magnitude, content AS originalParagraph
                FROM paragraph) s4
        ON s1.paragraphId = s4.paragraphId
    INNER JOIN (SELECT paperId, DOI, title AS paperTitle, journalId, volume, issue
                FROM paper) s5
        ON s4.paperId = s5.paperId
    INNER JOIN (SELECT journalId, journal, publisher, ISSN
                FROM publisher) s6
        ON s5.journalId = s6.journalId
    INNER JOIN (SELECT paperId, GROUP_CONCAT(author SEPARATOR ', ') AS author
                FROM author
                GROUP BY paperId) s7
        ON s7.paperId = s4.paperId
    WHERE s1.kgram = 1 AND s3.inputPaperId = inPaperId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `kmeans`(IN c INT, IN maxIter INT, OUT cost FLOAT)
BEGIN
    DECLARE i INT;
    DECLARE epsilon FLOAT; -- must be FLOAT: an INT would truncate the 0.01 threshold to 0
    SET epsilon = 0.01;
    SET @stable := 0;
    DROP TABLE IF EXISTS tempCentroid;
    PREPARE stmt1 FROM 'CREATE TABLE tempCentroid AS (
        SELECT C.Id AS centroidId, P.termId AS termId, P.BM25 AS value
        FROM paragraphVector AS P
        INNER JOIN (SELECT paragraphId AS Id
                    FROM paragraph
                    ORDER BY RAND()
                    LIMIT ?) AS C
            ON P.paragraphId = C.Id
        WHERE P.kgram = 1)';
    SET @c := c;
    EXECUTE stmt1 USING @c;
    DEALLOCATE PREPARE stmt1;

    DROP TABLE IF EXISTS oldCentroid;
    CREATE TEMPORARY TABLE oldCentroid LIKE tempCentroid;

    SET i = 0;
    SET @ID := 0;
    PREPARE stmt2 FROM 'SET @ID := (SELECT DISTINCT centroidId
                                    FROM tempCentroid
                                    ORDER BY centroidId ASC
                                    LIMIT ?,1);';
    WHILE (i <= c) DO
        SET @i := i;
        SET i = i + 1;
        EXECUTE stmt2 USING @i;
        UPDATE tempCentroid SET centroidId = @i+1 WHERE centroidId = @ID;
    END WHILE;
    DEALLOCATE PREPARE stmt2;

    SET i = 0;
    mainLoop: WHILE (i < maxIter) DO
        SET i = i + 1;
        CREATE OR REPLACE VIEW paragraphCentroidSimilarity AS (
            SELECT paragraphId, centroidId,
                   SUM(paragraphVector.BM25 * tempCentroid.value) AS similarity
            FROM paragraphVector
            INNER JOIN tempCentroid ON paragraphVector.termId = tempCentroid.termId
            WHERE paragraphVector.kgram = 1
            GROUP BY paragraphId, centroidId);

        UPDATE paragraph AS P
        INNER JOIN (SELECT CS1.paragraphId, CS1.centroidId
                    FROM paragraphCentroidSimilarity AS CS1
                    INNER JOIN (SELECT paragraphId, MAX(similarity) AS m
                                FROM paragraphCentroidSimilarity
                                GROUP BY paragraphId) AS CS2
                        ON CS1.paragraphId = CS2.paragraphId
                    WHERE CS1.similarity = CS2.m) AS maxSimilarity
            ON P.paragraphId = maxSimilarity.paragraphId
        SET P.clusterId = maxSimilarity.centroidId;

        TRUNCATE oldCentroid;
        INSERT INTO oldCentroid (SELECT * FROM tempCentroid);

        CREATE OR REPLACE VIEW newCentroid AS (
            SELECT P.clusterId, PV.termId, SUM(PV.BM25) AS newValue
            FROM paragraph AS P
            INNER JOIN paragraphVector AS PV ON (P.paragraphId = PV.paragraphId)
            WHERE PV.kgram = 1
            GROUP BY P.clusterId, PV.termId);
        TRUNCATE tempCentroid;
        INSERT INTO tempCentroid (SELECT * FROM newCentroid);

        CREATE OR REPLACE VIEW clusterSize AS (
            SELECT P.clusterId AS centroidId, COUNT(P.paragraphId) AS size
            FROM paragraph AS P
            GROUP BY P.clusterId);

        UPDATE tempCentroid AS C
        INNER JOIN clusterSize AS CS ON C.centroidId = CS.centroidId
        SET C.value = C.value / CS.size;

        SET @step := (SELECT MAX(sub.delta)
                      FROM (SELECT SUM(ABS(C1.value - IFNULL(C2.value, 0))) AS delta
                            FROM tempCentroid AS C1
                            LEFT OUTER JOIN oldCentroid AS C2
                                ON (C1.centroidId = C2.centroidId AND C1.termId = C2.termId)
                            GROUP BY C1.centroidId) AS sub);

        IF @step < epsilon THEN
            SET @stable = 1;
            LEAVE mainLoop;
        END IF;
    END WHILE mainLoop;

    SET @m := (SELECT COUNT(*) FROM paragraph);
    SET cost := (SELECT (SUM(POWER((PV.BM25 - IFNULL(C.value, 0)), 2)) / @m)
                 FROM tempCentroid AS C
                 INNER JOIN paragraph AS P ON C.centroidId = P.clusterId
                 RIGHT OUTER JOIN paragraphVector AS PV
                     ON (P.paragraphId = PV.paragraphId AND C.termId = PV.termId));

    DROP VIEW IF EXISTS paragraphCentroidSimilarity;
    DROP TABLE IF EXISTS oldCentroid;
    DROP VIEW IF EXISTS newCentroid;
    DROP VIEW IF EXISTS clusterSize;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_BM25`(IN b FLOAT, IN k FLOAT)
BEGIN
    DECLARE numDoc INT;
    DECLARE avdl FLOAT;
    DECLARE totalLength INT;
    SELECT datasetInfo.numDoc INTO numDoc FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.totalLength INTO totalLength FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.avdl INTO avdl FROM datasetInfo WHERE id = 1;
    SET @k := k;
    SET @b := b;
    PREPARE stmt1 FROM 'INSERT INTO datasetInfo (id, k, b) VALUES (1, ?, ?)
        ON DUPLICATE KEY UPDATE k = VALUES(k), b = VALUES(b)';
    EXECUTE stmt1 USING @k, @b;
    UPDATE paragraphVector AS PV, paragraph AS P, term AS T
    SET PV.BM25 = ((k+1) * PV.termFreq) / (PV.termFreq + k*(1 - b + b*(P.length/avdl)))
                  * LOG10((numDoc+1)/T.inverseDocFreq)
    WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_inverseDocFreq`()
BEGIN
    UPDATE term
    INNER JOIN (SELECT termId, COUNT(paragraphId) AS idf
                FROM paragraphVector
                GROUP BY termId) AS PV
        ON term.termId = PV.termId
    SET term.inverseDocFreq = PV.idf;

    SET @numDoc := (SELECT COUNT(*) FROM paragraph);
    SET @totalLength := (SELECT SUM(length) FROM paragraph);
    SET @avdl := @totalLength/@numDoc;
    INSERT INTO datasetInfo (id, numDoc, avdl, totalLength)
    VALUES (1, @numDoc, @avdl, @totalLength)
    ON DUPLICATE KEY UPDATE numDoc = VALUES(numDoc), avdl = VALUES(avdl),
                            totalLength = VALUES(totalLength);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_pivotNorm`(IN b FLOAT)
BEGIN
    DECLARE numDoc INT;
    DECLARE avdl FLOAT;
    DECLARE totalLength INT;
    SELECT datasetInfo.numDoc INTO numDoc FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.totalLength INTO totalLength FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.avdl INTO avdl FROM datasetInfo WHERE id = 1;
    SET @b := b;
    PREPARE stmt1 FROM 'INSERT INTO datasetInfo (id, b) VALUES (1, ?)
        ON DUPLICATE KEY UPDATE b = VALUES(b)';
    EXECUTE stmt1 USING @b;

    UPDATE paragraphVector AS PV, paragraph AS P, term AS T
    SET PV.pivotNorm = (LN(1+LN(1+PV.termFreq)) / (1 - b + b*(P.length/avdl)))
                       * LOG10((numDoc+1)/T.inverseDocFreq)
    WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId;
END
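For reference, the BM25 weight that `update_BM25` and `fastUpdateParagraph` assign to each (paragraph, term) pair can be checked outside SQL. Below is a minimal Python sketch (not part of the project code); the function name `bm25_weight` and the default values k = 1.2 and b = 0.75 are illustrative, since the procedures take k and b as parameters:

```python
import math

def bm25_weight(tf, dl, avdl, n_docs, doc_freq, k=1.2, b=0.75):
    """Mirror of the SQL expression:
    ((k+1)*tf) / (tf + k*(1 - b + b*(dl/avdl))) * log10((N+1)/df)."""
    length_norm = 1 - b + b * (dl / avdl)   # document length normalization
    return ((k + 1) * tf) / (tf + k * length_norm) \
           * math.log10((n_docs + 1) / doc_freq)

# a term appearing 3 times in an average-length paragraph,
# occurring in 10 of 1000 paragraphs
w = bm25_weight(tf=3, dl=100, avdl=100, n_docs=1000, doc_freq=10)
```

Note how the term-frequency factor saturates: no matter how often a term repeats, the weight stays below (k+1) times the IDF factor, which is why BM25 resists keyword stuffing.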
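Similarly, `update_pivotNorm` implements the pivoted document length normalization weighting of [8]. A hedged Python sketch of the same per-term weight (the name `pivot_norm_weight` and the default b = 0.2 are ours, not the project's):

```python
import math

def pivot_norm_weight(tf, dl, avdl, n_docs, doc_freq, b=0.2):
    """Mirror of the SQL expression:
    ln(1+ln(1+tf)) / (1 - b + b*(dl/avdl)) * log10((N+1)/df)."""
    return (math.log(1 + math.log(1 + tf))     # double-log tf dampening
            / (1 - b + b * (dl / avdl))        # pivoted length normalization
            * math.log10((n_docs + 1) / doc_freq))
```

The double logarithm dampens raw term frequency even more aggressively than BM25, while the denominator penalizes paragraphs longer than the collection average `avdl`.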
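The `kmeans` procedure follows the usual alternation: assign each paragraph to the centroid with the highest dot-product similarity, recompute each centroid as the mean of its members, and stop when no centroid moves more than epsilon (measured as an L1 distance, as in the `@step` query). The following in-memory Python sketch of that loop is ours, not the project's; dictionaries of `{term_id: weight}` stand in for the `paragraphVector` and `tempCentroid` tables:

```python
def kmeans_sparse(vectors, centroids, max_iter=50, epsilon=0.01):
    """vectors: {doc_id: {term_id: weight}}, centroids: {cid: {term_id: weight}}.
    Returns (assignments, final centroids)."""
    def dot(u, v):
        return sum(u.get(t, 0.0) * w for t, w in v.items())

    assign = {}
    for _ in range(max_iter):
        # assignment step: highest-similarity centroid per document
        assign = {d: max(centroids, key=lambda c: dot(centroids[c], v))
                  for d, v in vectors.items()}
        # update step: centroid = mean of its member vectors
        new = {}
        for c in centroids:
            members = [vectors[d] for d in assign if assign[d] == c]
            if not members:
                new[c] = centroids[c]   # keep an empty cluster's centroid
                continue
            summed = {}
            for v in members:
                for t, w in v.items():
                    summed[t] = summed.get(t, 0.0) + w
            new[c] = {t: w / len(members) for t, w in summed.items()}
        # convergence: max L1 distance moved by any centroid
        step = max(sum(abs(new[c].get(t, 0.0) - centroids[c].get(t, 0.0))
                       for t in set(new[c]) | set(centroids[c]))
                   for c in centroids)
        centroids = new
        if step < epsilon:
            break
    return assign, centroids
```

With two clearly separated term groups, e.g. documents dominated by term `'a'` versus term `'b'` and one seed centroid in each, the loop converges after the first update and each document ends up in the expected cluster.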
References
[1] T. Hoad and J. Zobel, "Methods for Identifying Versioned and Plagiarised Documents," Journal of the American
Society for Information Science and Technology, vol. 54, no. 3, pp. 203–215, 2003.
[2] K. Monostori, A. Zaslavsky and H. Schmidt, "Document Overlap Detection System for Distributed Digital Libraries,"
Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 226–227, 2000.
[3] A. Si, H. V. Leong and R. W. H. Lau, "CHECK: A Document Plagiarism Detection System," SAC ’97: Proceedings of the
1997 ACM Symposium on Applied Computing, pp. 70–77, 1997.
[4] C. Noah, H. Marcus, J. Nick, S. Cole, T. Tony and W.-D. Zach, "Plagiarism Detection," 17 March 2014. [Online].
Available: www.cs.carleton.edu/cs_comps/1314/dlibenno/final-results/plagcomps.pdf.
[5] "Euclidean vector," Wikipedia.
[6] D. M. Christopher, R. Prabhakar and S. Hinrich, "Introduction to Information Retrieval," Cambridge University Press,
2008.
[7] S. T. Piantadosi, "Zipf’s word frequency law in natural language: a critical review and future
directions," 2 June 2015.
[8] S. Amit, B. Chris and M. Mandar, "Pivoted Document Length Normalization".
[9] S. Robertson and H. Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond".
[10] S. Ullman, "Unsupervised Learning: Clustering," 2014. [Online]. Available:
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf.