My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers
SUPERVISED BY: Dr Hitham M. Abo Bakr
Implementing Plagiarism Detection Engine
For English Academic Papers
By
Muhamed Gameel Abd El Aziz
Ahmed Motair El Said Mater
Mohamed Hessien Mohamed
Shreif Hosni Zidan Esmail
Manar Mohamed Said Ahmed
Doaa Abd El Hamid Abd El Hamid
Implementing Plagiarism Detection Engine for English Academic Papers 1
Abstract
Plagiarism has become a serious issue nowadays due to the vast resources easily available on the web, which makes developing a plagiarism detection tool a useful and challenging task, given the scalability issues involved.
Our project implements a Plagiarism Detection Engine oriented toward English academic papers, using text Information Retrieval methods, a relational database, and Natural Language Processing techniques.
The main parts of the project are:
Gathering and cleaning data: crawling the web to collect academic papers and parsing them to extract information about each paper, building a large dataset of scientific paper content.
Tokenization: parsing, tokenizing, and preprocessing documents.
Plagiarism engine: checking the similarity between the input document and the database to detect potential plagiarism.
Table of Contents
Abstract ___________________________________________________________________________ 1
Table of Contents ___________________________________________________________________ 2
Table of Figures ____________________________________________________________________ 4
Table of Tables _____________________________________________________________________ 7
Chapter 1 Introduction ___________________________________________________________ 8
1.1 What is Plagiarism? _________________________________________________________________8
1.2 What is Self-Plagiarism? _____________________________________________________________8
1.3 Plagiarism on the Internet ____________________________________________________________8
1.4 Plagiarism Detection System __________________________________________________________8
1.4.1 Local similarity: __________________________________________________________________ 8
1.4.2 Global similarity: _________________________________________________________________ 9
1.4.3 Fingerprinting ___________________________________________________________________ 9
1.4.4 String Matching __________________________________________________________________ 9
1.4.5 Bag of words _____________________________________________________________________ 9
1.4.6 Citation-based Analysis ____________________________________________________________ 9
1.4.7 Stylometry _______________________________________________________________________ 9
Chapter 2 Background Theory ____________________________________________________ 10
2.1 Linear Algebra Basics ______________________________________________________________ 10
2.1.1 Vectors _________________________________________________________________________ 10
2.2 Information Retrieval (IR) __________________________________________________________ 11
2.3 Regular Expression ________________________________________________________________ 15
2.4 NLTK Toolkit ____________________________________________________________________ 16
2.5 Node.js __________________________________________________________________________ 16
2.6 Express.js ________________________________________________________________________ 16
2.7 Sockets.io ________________________________________________________________________ 16
2.8 Languages Used ___________________________________________________________________ 16
Chapter 3 Design and Architecture _________________________________________________ 17
3.1 Extract, Transform and Load (ETL) ___________________________________________________ 17
3.2 Plagiarism Engine _________________________________________________________________ 17
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization __________________ 18
3.2.2 Semantic Analysis (Vector Space Model VSM Representation) _________________________ 18
3.2.3 Calculating Similarity ____________________________________________________________ 18
3.2.4 Clustering ______________________________________________________________________ 18
3.2.5 Communicating Results___________________________________________________________ 19
Chapter 4 Implementation ________________________________________________________ 20
4.1 Extract, Transform and Load (ETL) _________________________________________________ 20
4.1.1 The Crawler ____________________________________________________________________ 20
4.1.2 The Parser ______________________________________________________________________ 20
4.1.3 The Data Extracted from the paper _________________________________________________ 20
4.1.4 The Parser Implementation _______________________________________________________ 21
4.1.5 How it works ____________________________________________________________________ 21
4.1.6 Steps of Parsing _________________________________________________________________ 22
4.1.7 The Paper Class _________________________________________________________________ 26
4.1.8 The Paragraph Structure _________________________________________________________ 27
4.1.9 Parsing the First Page in Details (ex: an IEEE Paper) _________________________________ 27
4.1.10 Parsing the Other Pages in Details (ex: an IEEE Paper) _______________________________ 37
4.2 The Natural Language Processing (NLP) ______________________________________________ 42
4.2.1 Introduction ____________________________________________________________________ 42
4.2.2 The Implementation Overview _____________________________________________________ 42
4.2.3 The Text Processing Procedure ____________________________________________________ 42
4.2.4 Example of the Text Processing ____________________________________________________ 45
4.3 Term Weighting __________________________________________________________________ 47
4.3.1 Lost Connection to Database Problem ______________________________________________ 47
4.3.2 Process Paragraph _______________________________________________________________ 48
4.3.3 Generating Terms _______________________________________________________________ 48
4.3.4 Populating term, paragraphVector Tables ___________________________________________ 51
4.3.5 Executing VSM Algorithm ________________________________________________________ 52
4.4 Testing Plagiarism ________________________________________________________________ 53
4.4.1 Process Paragraph _______________________________________________________________ 53
4.4.2 Calculate Similarity ______________________________________________________________ 54
4.4.3 Get Results _____________________________________________________________________ 54
4.5 The VSM Algorithm _______________________________________________________________ 55
4.5.1 Calculating similarity ____________________________________________________________ 55
4.5.2 K-means and Clustering __________________________________________________________ 56
4.6 Server Side _______________________________________________________________________ 59
4.6.1 Handling Routing ________________________________________________________________ 59
4.6.2 Running Python System __________________________________________________________ 60
4.7 Client Side _______________________________________________________________________ 62
4.8 The GUI of the System _____________________________________________________________ 63
Chapter 5 Results and Discussion __________________________________________________ 66
5.1 Dataset of the Parser _______________________________________________________________ 66
5.2 Exploring dataset _________________________________________________________________ 68
5.2.1 Small dataset (15K) ______________________________________________________________ 68
5.2.2 Big dataset (50K) ________________________________________________________________ 69
5.3 Performance _____________________________________________________________________ 70
5.4 Detecting plagiarism _______________________________________________________________ 72
5.4.1 Percentage score functions:________________________________________________________________ 72
5.5 Discussing results _________________________________________________________________ 74
Chapter 6 Conclusion ___________________________________________________________ 75
Chapter 7 Appendix _____________________________________________________________ 76
7.1 Entity-Relation Diagram (ERD) _____________________________________________________ 76
7.2 Stored procedures _________________________________________________________________ 77
References _______________________________________________________________________ 84
Table of Figures
Figure 1.1 Plagiarism Detection Approaches _____________________________________________________8
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3) ______ 10
Figure 2.2 Geometric representation of documents ______________________________________________ 12
Figure 3.1 High level block diagram __________________________________________________________ 17
Figure 3.2 Detailed block diagram of the Plagiarism Engine ______________________________________ 17
Figure 4.1 Overview for the Crawler and Parser ________________________________________________ 20
Figure 4.2 UML of the Parser Application _____________________________________________________ 21
Figure 4.3 the Flow Chart of the Parser _______________________________________________________ 21
Figure 4.4 The main function of Parsing ______________________________________________________ 22
Figure 4.5 The First Page of an IEEE Paper (as Blocks) _________________________________________ 22
Figure 4.6 First Page of a Science Direct Paper _________________________________________________ 23
Figure 4.7 First Page of a Springer Paper _____________________________________________
Figure 4.8 The function of parseOtherPages ___________________________________________________ 24
Figure 4.9 Block of String before Enhancing ___________________________________________________ 25
Figure 4.10 The Paragraphs after enhancing ___________________________________________________ 25
Figure 4.11 the Paper Structure _____________________________________________________________ 26
Figure 4.12 the Paragraph Structure _________________________________________________________ 27
Figure 4.13 Different forms for an IEEE Top Header ____________________________________________ 27
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper ____________________________ 28
Figure 4.15 The supported Regex of the IEEE Header formats ____________________________________ 29
Figure 4.16 The Function of extracting the Volume Number ______________________________________ 30
Figure 4.17 The Function of Extracting the Issue Number ________________________________________ 30
Figure 4.18 The Function of Extracting the DOI ________________________________________________ 30
Figure 4.19 The Function of Extracting the Start and End Pages __________________________________ 31
Figure 4.20 The Function of Extracting the Journal Title_________________________________________ 32
Figure 4.21 Parsing the rest of blocks in the first Page ___________________________________________ 32
Figure 4.22 The Function of Extracting the DOI and PII _________________________________________ 33
Figure 4.23 The Function of Extracting the ISSN _______________________________________________ 33
Figure 4.24 The Function of extracting the paper Dates __________________________________________ 34
Figure 4.25 The Function of Extracting the Keywords ___________________________________________ 35
Figure 4.26 The Function of Extracting the Keywords ___________________________________________ 36
Figure 4.27 The Function of Extracting the Title and the Authors __________________________________ 36
Figure 4.28 Defining the Style of the Header _______________________________________________
Figure 4.29 the Function of Extracting the Figure Captions ______________________________________ 39
Figure 4.30 the Function of separating the lists _________________________________________________ 40
Figure 4.31 the Function of Extracting the Paragraph ___________________________________________ 40
Figure 4.32 The Function of Extracting the Paragraph __________________________________________ 41
Figure 4.33 Process Text Function ___________________________________________________________ 42
Figure 4.34 Tokenizing words Function _______________________________________________________ 42
Figure 4.35 Tokenization Example ___________________________________________________________ 43
Figure 4.36 POS Function __________________________________________________________________ 43
Figure 4.37 POS Output Example ____________________________________________________________ 43
Figure 4.38 WordNet POS Function __________________________________________________________ 43
Figure 4.39 Removing Punctuations Function __________________________________________________ 44
Figure 4.40 Removing Stop Words Function ___________________________________________________ 44
Figure 4.41 Stop Words list _________________________________________________________________ 44
Figure 4.42 Lemmatization Function _________________________________________________________ 45
Figure 4.43 Paragraph before Text Processing _________________________________________________ 45
Figure 4.44 Paragraph after Text Processing ___________________________________________________ 46
Figure 4.45 Retrieving Paragraphs ___________________________________________________________ 47
Figure 4.46 Process Paragraph Function ______________________________________________________ 48
Figure 4.47 Generate k-gram Terms Function __________________________________________________ 49
Figure 4.48 Paragraph Example _____________________________________________________________ 49
Figure 4.49 1-gram terms ___________________________________________________________________ 50
Figure 4.50 2-gram terms ___________________________________________________________________ 50
Figure 4.51 3-gram terms ___________________________________________________________________ 50
Figure 4.52 4-gram terms ___________________________________________________________________ 50
Figure 4.53 5-gram terms ___________________________________________________________________ 50
Figure 4.54 Calculate Term Frequency _______________________________________________________ 51
Figure 4.55 insert Terms in Database _________________________________________________________ 51
Figure 4.56 insert Paragraph Vector in Database _______________________________________________ 51
Figure 4.57 Executing the VSM Algorithm_____________________________________________________ 52
Figure 4.58 tokenizing and link paragraphs together _____________________________________________ 53
Figure 4.59 Process input paragraphs _________________________________________________________ 53
Figure 4.60 Populate input paragraph vector ___________________________________________________ 53
Figure 4.61 Calculate Similarity _____________________________________________________________ 54
Figure 4.62 Get Results ____________________________________________________________________ 54
Figure 4.63 Flowchart of the Kmeans text clustering algorithm ____________________________________ 57
Figure 4.64 Home Page Routing _____________________________________________________________ 59
Figure 4.65 Pre-Process Page Routing ________________________________________________________ 59
Figure 4.66 Communicating between the Server and the Core Engine for testing plagiarism ____________ 60
Figure 4.67 Communicating between the Server and the Core Engine for Pre-processing ______________ 61
Figure 4.68 Longest Common Subsequence LCS Algorithm _________________________________________
Figure 4.69 Longest Common Subsequence LCS Algorithm _________________________________________
Figure 4.70 Submitting an input document_____________________________________________________ 63
Figure 4.71 The Results of the Process Part 1 __________________________________________________ 64
Figure 4.72 The Results of the Process Part 2 __________________________________________________ 65
Figure 5.1 Number of Papers Published per Year in IEEE ________________________________________ 66
Figure 5.2 Number of Papers Published per Year in Springer _____________________________________ 67
Figure 5.3 Number of Papers Published per Year in Science Direct _________________________________ 67
Figure 5.4 Response time against number of paragraphs tested on small dataset ______________________ 70
Figure 5.5 Screenshot of the System Performance from the System GUI _____________________________ 71
Figure 7.1 ERD of the plagiarism Engine database ______________________________________________ 76
Table of Tables
Table 1 Statistics of the Parser ________________________________________________________________66
Table 2 Dataset Statistics ____________________________________________________________________68
Table 3 Unique Terms count in each Paragraph _________________________________________________68
Table 4 Unique Terms count in Dataset ________________________________________________________68
Table 5 Dataset Statistics ____________________________________________________________________69
Table 6 Unique Terms count in each Paragraph _________________________________________________69
Table 7 Unique Terms count in Dataset ________________________________________________________69
Table 8 Processing time of each module in Plagiarism Engine ______________________________________70
Table 9 Parameters _________________________________________________________________________72
Table 10 Testing Paragraphs and Results _______________________________________________________73
Chapter 1 Introduction
1.1 What is Plagiarism?
Plagiarism is the act of academic theft: copying words from a book or a scientific paper and publishing them as one's own work. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also called plagiarism.
1.2 What is Self-Plagiarism?
Self-plagiarism occurs when someone reuses a portion of an article or work he has published before without citing that he is doing so; this portion could be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work may already have been transferred to its publisher. Such articles and works are called duplicate or multiple publications.
1.3 Plagiarism on the Internet
Today, blogs, Facebook pages, and some websites copy and paste information, violating many copyrights. Several tools are used to deter this, such as disabling right-click to prevent copying, placing copyright warnings on every page of the website as banners or pictures, and using the DMCA copyright law to report copyright infringement; such a report can be sent to the website owner or to the ISP hosting the website, after which the infringing website may be taken down.
1.4 Plagiarism Detection System
A plagiarism detection system tests whether a given material contains plagiarism; this material could be a scientific article, a technical report, an essay, or other text. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced by others with the same meaning.
Figure 1.1 Plagiarism Detection Approaches
1.4.1 Local similarity:
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example checking whether two students cheated on an assignment.
1.4.2 Global similarity:
Global similarity systems check the similarity of a small set of input paragraphs against a large dataset, for example checking whether a submitted paper is plagiarized from an already published paper.
1.4.3 Fingerprinting
In this approach, the dataset consists of sets of multiple n-grams from documents. These n-grams are selected randomly as substrings of each document; each set of n-grams represents a fingerprint for that document, and its elements are called minutiae. All these fingerprints are indexed in the database. The input text is processed in the same way and compared with the fingerprints in the database; if it matches some of them, it plagiarizes some documents. [1]
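As an illustration of the fingerprinting idea, here is a minimal winnowing-style sketch in Python (the function names, the parameters k and w, and the use of MD5 are our own illustrative choices, not part of this project): every k-gram of words is hashed, and the minimum hash of each sliding window is kept as a minutia, so documents sharing a long enough run of text are guaranteed to share fingerprints.

```python
import hashlib

def fingerprints(text, k=5, w=4):
    """Hash every k-gram of words; keep the minimum hash of each sliding
    window of w hashes as the document's fingerprint set (its minutiae)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
    if len(hashes) <= w:
        return set(hashes)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def overlap(fp_a, fp_b):
    """Fraction of a's fingerprints that also occur in b."""
    return len(fp_a & fp_b) / len(fp_a) if fp_a else 0.0

original = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog in the field today"
print(overlap(fingerprints(original), fingerprints(copied)) > 0)  # shared run of 5-grams
```

The matching step would then compare the input text's fingerprint set against all indexed fingerprints, exactly as described above.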
1.4.4 String Matching
String matching is one of the central problems in plagiarism detection systems: to detect verbatim plagiarism you have to make an exact match, but comparing the document under test against the whole database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]
1.4.5 Bag of words
This approach is an adaptation of vector space retrieval, where the document is represented as a bag of words; these words are inserted into the database as n-grams along with their locations in the document and their frequencies in this and other documents. The document to be tested is represented as a bag of words too and compared with the n-grams in the database. [3]
1.4.6 Citation-based Analysis
This is the only approach that doesn't rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It's not widely used in commercial software, but prototypes of it exist.
1.4.7 Stylometry
Stylometry analyzes only the suspicious document itself, detecting plagiarized passages through differences in linguistic characteristics.
This method isn't accurate on small documents, as it needs to analyze large passages, up to thousands of words per chunk, to extract reliable linguistic properties [4].
Our project uses global similarity with the bag-of-words approach: the system has a dataset of many scientific papers divided into paragraphs, and the input text is likewise divided into paragraphs and compared against this large dataset of paragraphs.
Chapter 2 Background Theory
2.1 Linear Algebra Basics
Since we use the Vector Space Model to represent and retrieve text documents, some basic linear algebra is needed.
2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or equivalently a mathematical object consisting of ordered values.
1. Representation in 2D and 3D
1) Graphical (Geometric) representation
A vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D
space.
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3).
Source: Wikimedia commons.
2) Cartesian representation
Vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the
endpoint of a vector can be identified with an ordered list of n real numbers (n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)
2. Operations on vectors
1) Scalar product
r·a = (r·a_x, r·a_y, r·a_z)
2) Sum
a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtract
a − b = (a_x − b_x, a_y − b_y, a_z − b_z)
4) Dot product
Algebraic definition: a · b = a_x·b_x + a_y·b_y + a_z·b_z
Geometric definition: a · b = ‖a‖‖b‖ cos θ
Where: ‖a‖ is the magnitude of vector a, ‖b‖ is the magnitude of vector b, and θ is the angle between a and b.
The scalar projection of a vector a in the direction of another vector b is given by: a_b = a · b̂
Where: b̂ = b/‖b‖ is the normalized vector (unit vector) of b.
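The operations above translate directly into a few lines of Python (the helper names are ours, for illustration only):

```python
def dot(a, b):
    """Algebraic dot product: sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Vector magnitude: ||a|| = sqrt(a . a)."""
    return dot(a, a) ** 0.5

def project(a, b):
    """Scalar projection of a onto b: a . b_hat, with b_hat = b / ||b||."""
    return dot(a, b) / magnitude(b)

a, b = (2, 3), (4, 0)
print(dot(a, b))      # 2*4 + 3*0 = 8
print(project(a, b))  # projecting onto the x-axis recovers a_x = 2.0
```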
2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature, usually text, that satisfies an information need or is relevant to a query, from a large collection of data" [6].
As the definition suggests, IR differs from an ordinary select query in that the information retrieved is unstructured and doesn't always exactly match the query.
Information Retrieval methods are used in search engines, in text classification (such as spam filtering), and, in our case, in a plagiarism engine.
1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in term weights space.
1) Term Frequency weighting (TF)
The simplest VSM weighting is plain Term Frequency; all other weighting functions are modifications of it.
In TF weighting we represent each text by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, let's assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d)
The dot product similarity between d1 and d2 is d1 · d2 = (1, 1) · (2, 0) = 1×2 + 1×0 = 2
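The play/ground example can be reproduced with a short Python sketch (the helper names and the toy vocabulary list are ours):

```python
from collections import Counter

TERMS = ["play", "ground"]  # the toy 2-term vocabulary from the example

def tf_vector(text):
    """Term Frequency vector: one dimension per term in the vocabulary."""
    counts = Counter(text.split())
    return [counts[t] for t in TERMS]

def dot(a, b):
    """Dot product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

d1, d2, d3 = tf_vector("play ground"), tf_vector("play play"), tf_vector("ground")
print(d1, d2, d3)   # [1, 1] [2, 0] [0, 1]
print(dot(d1, d2))  # 1*2 + 1*0 = 2, as computed above
```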
2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document Frequency df(w) is the number of documents that contain the word w.
TF-IDF adds an Inverse Document Frequency factor to penalize common terms: they have a high probability [7] of appearing in any document, so they don't strongly indicate plagiarism, unlike rarer terms, which have lower probability and carry more information.
weight(w, d) = count(w, d) × (1 / df(w))
So for the above example:
df (play) = 2
df (ground) = 2
and in this case all the weights will be scaled by a half.
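Continuing with the same toy dataset, the TF-IDF weighting defined above can be checked with a small sketch (the helper names are ours):

```python
from collections import Counter

docs = ["play ground", "play play", "ground"]

def df(word):
    """Document frequency: number of documents containing the word."""
    return sum(1 for d in docs if word in d.split())

def tfidf_weight(word, doc):
    """weight(w, d) = count(w, d) * 1 / df(w), as defined above."""
    return Counter(doc.split())[word] / df(word)

print(df("play"), df("ground"))           # 2 2 -> every weight is halved
print(tfidf_weight("play", "play play"))  # 2 * 1/2 = 1.0
```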
2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ( ln[1 + ln[1 + count(w, d)]] / (1 − b + b·|d|/avdl) ) × log( (M + 1) / df(w) )
Where: weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d (i.e. its term frequency)
Figure 2.2 Geometric representation of documents (d1, d2, and d3 plotted as vectors on the "play" and "ground" axes)
b is a document length normalization parameter ∈ [0, 1]
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. the document frequency
The document length normalization term 1 − b + b·|d|/avdl linearly penalizes long documents whose length is larger than the average document length (avdl), and rewards short documents whose length is smaller than the average.
The parameter b controls the normalization: if b equals zero there is no normalization at all; if b equals 1 the normalization is linear with offset zero and slope 1.
The Inverse Document Frequency (IDF) term log((M + 1)/df(w)) penalizes common terms as explained above. The document frequency is normalized by the number of documents, since how common a term is depends not only on its document frequency but also on the size of the dataset: a term that appears in 10 documents out of 100 is much more common than a term that appears in 10 documents out of 1000. The logarithm smooths the IDF weighting, i.e. reduces the variation in weight when the document frequency varies a lot.
The term frequency (TF) term ln[1 + ln[1 + count(w, d)]] uses a double natural logarithm to achieve a sublinear transformation (i.e. smoothing the TF curve) and avoid over-scoring documents with heavily repeated words, since the first occurrence of a term should carry the highest weight.
Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity with any input query containing that term, even higher than a genuinely more similar document.
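The pivoted length normalization formula maps directly to code; below is an illustrative sketch (the default b = 0.5 is our placeholder, not the project's tuned value):

```python
import math

def pln_weight(count_wd, doc_len, avdl, df_w, M, b=0.5):
    """Pivoted length normalization weight of a word in a document:
    double-log TF, divided by the length normalizer, times IDF."""
    tf = math.log(1 + math.log(1 + count_wd))  # sublinear TF transformation
    norm = 1 - b + b * doc_len / avdl          # document length normalizer
    idf = math.log((M + 1) / df_w)             # penalizes common terms
    return tf / norm * idf

# a term absent from the document contributes nothing
print(pln_weight(0, 120, 100, 10, 1000))  # 0.0
# longer-than-average documents are penalized relative to shorter ones
print(pln_weight(3, 200, 100, 10, 1000) < pln_weight(3, 50, 100, 10, 1000))  # True
```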
2) Okapi BM25 [9]
BM stands for Best Match; the weights are defined as follows:
weight(w, d) = ( (k + 1)·count(w, d) / (count(w, d) + k·(1 − b + b·|d|/avdl)) ) × log( (M + 1) / df(w) )
Where all symbols are defined as in Pivoted Length Normalization and k ∈ [0, ∞).
It is similar to Pivoted Length Normalization, but instead of nested natural logarithms it uses division and the parameter k to achieve the sublinear transformation.
It was originally developed from the probabilistic model; however, it's very similar to the Vector Space Model.
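The BM25 weighting can be sketched the same way (k = 1.2 and b = 0.75 are common textbook defaults, not necessarily the project's values):

```python
import math

def bm25_weight(count_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    """Okapi BM25: saturating TF (bounded above by k + 1) times the same IDF."""
    tf = (k + 1) * count_wd / (count_wd + k * (1 - b + b * doc_len / avdl))
    idf = math.log((M + 1) / df_w)
    return tf * idf

# TF saturates: repeating a term 100 times cannot exceed the (k + 1) bound
w1 = bm25_weight(1, 100, 100, 10, 1000)
w100 = bm25_weight(100, 100, 100, 10, 1000)
print(w1 < w100 < (1.2 + 1) * math.log(1001 / 10))  # True
```

Unlike the double-log TF above, this saturation has a hard ceiling, which is the practical difference between the two functions.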
3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.
1) Dot product similarity
similarity(q, d) = ∑_{w ∈ q∩d} count(w, q) × weight(w, d)
Where: similarity(q, d) is the similarity score between document d and input query q.
The score is simply the sum, over each word that appears in both the document and the query, of the product of the word's term weights.
It's very popular because it's general and can be used with any fancy term weighting.
2) Cosine similarity
similarity(q, d) = ( ∑_{w ∈ q∩d} count(w, q) × weight(w, d) ) / ( |q| × |d| )
Where: |q| is the magnitude of the query vector and |d| is the magnitude of the document vector.
It's basically the dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between the vectors.
This function has built-in document (and query) length normalization.
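Both similarity functions can be sketched over sparse dictionaries mapping words to values, which mirrors iterating only over w ∈ q ∩ d (the names are ours):

```python
def dot_similarity(q_counts, d_weights):
    """Sum of count(w, q) * weight(w, d) over words present in both."""
    return sum(c * d_weights[w] for w, c in q_counts.items() if w in d_weights)

def cosine_similarity(q_counts, d_weights):
    """Dot product divided by the product of the two vector magnitudes."""
    qmag = sum(c * c for c in q_counts.values()) ** 0.5
    dmag = sum(v * v for v in d_weights.values()) ** 0.5
    return dot_similarity(q_counts, d_weights) / (qmag * dmag)

q = {"play": 1, "ground": 1}  # query 'play ground'
d = {"play": 2.0}             # document 'play play' under TF weighting
print(dot_similarity(q, d))   # 1 * 2.0 = 2.0
```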
4. Clustering
Clustering is an unsupervised1 machine learning method and a powerful data mining technique; it is the process of grouping similar objects together.
This technique can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters.
This is achieved by clustering similar paragraphs together and measuring the similarity between each new query and the centroids of the clusters, then measuring the similarity between the query and the paragraphs of one cluster only; this is much faster than measuring the similarity of the query against every paragraph in the dataset.
1 Because the data are not labeled.
We use the K-means algorithm (centroid-based clustering), an iterative improvement algorithm that groups the dataset into a predefined number of clusters K.
It goes like this [10]:
1 Select K random points from the dataset as the initial guess of the centroids (cluster centers).
2 Assign each record in the dataset to the closest centroid based on a given similarity function.
3 Move each centroid closer to the points assigned to it by calculating the mean value of the points in its cluster.
4 If a local optimum is reached (i.e. the centroids stopped moving), stop; else repeat from step 2.
Since K-means is sensitive to the initial choice of centroids and can get stuck in local optima, we repeat it with different initial centroids and keep the best result (the one with the least mean squared error).
Time complexity is O(tknd), where t is the number of iterations until convergence, k is the number of clusters, n is the number of records in the data set, and d is the number of dimensions.
Since usually tk ≪ n, the algorithm is considered a linear-time algorithm.
Note: this is the typical time complexity when applying the algorithm to a dataset with a negligible number of dimensions. In our case we have a very large number of dimensions (all terms in the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it (not all dimensions), so the time complexity differs, as we discuss in detail in the implementation section.
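The four steps above can be sketched in Python (an illustrative sketch on small dense points; the actual engine works on sparse term vectors, and the optional init parameter is added here only to make the run deterministic):

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Basic K-means; returns the final centroids and the cluster index
    assigned to each point."""
    rng = random.Random(seed)
    # Step 1: initial guess of the centroids (random points from the data set).
    centroids = [list(p) for p in (init or rng.sample(points, k))]
    assign = None
    for _ in range(max_iters):
        # Step 2: assign each record to the closest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop once the assignments (hence the centroids) stop changing.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign
```

Repeating this with different seeds and keeping the run with the least mean squared error mitigates the sensitivity to the initial centroids noted above.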
2.3 Regular Expression
A regular expression is a sequence of characters and symbols written to detect patterns, where each symbol in the sequence has a meaning, e.g. + means one or more, * means zero or more, and - means a range (A-Z: all capital letters from A to Z).
For example, a birth date could be written as May 15th, 1993, so the pattern of the date is [Month] [day][st, nd, rd, or th], [year], and a regular expression for it is [A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}.
First, the month is one of 12 fixed words; they could be written explicitly, or simply matched as a sequence of 3 to 9 letters, since the shortest month is May (3 letters) and the longest is September (9 letters). Then comes a space; then the day, a number of 1 or 2 digits, followed by one of the four suffixes (st, nd, rd, th) and a comma; then a space; then the year, a number of 4 digits.
This is not the only format for dates, so a date expression could be considerably more complicated than this.
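The expression above can be tried directly in Python's re module (a small illustration of the pattern just described):

```python
import re

# [Month] [day][st|nd|rd|th], [year] -- e.g. "May 15th, 1993"
date_re = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}")

def find_date(text):
    """Return the first date matching the pattern, or None."""
    m = date_re.search(text)
    return m.group() if m else None
```

search() scans the whole string, so the date is found wherever it occurs in a sentence.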
2.4 NLTK Toolkit
NLTK is a Python module for Natural Language Processing (NLP) used for text processing. It has algorithms for sentence and word tokenization, contains a large number of corpora, includes its own WordNet corpus, and is used for Part-of-Speech (POS) tagging, stemming, and lemmatization.
2.5 Node.js
Node.js is a runtime environment built on Chrome's V8 JavaScript engine for developing server-side web applications; it uses an event-driven, non-blocking I/O model.
2.6 Express.js
Express.js is the standard web application server framework for Node.js. It is a very thin layer, with many features available as plugins.
2.7 Socket.IO
Socket.IO is a library for real-time web applications. It enables bi-directional communication between the web client and the server, primarily using the WebSocket protocol with polling as a fallback option.
2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
In this part we build the database: many scientific papers are downloaded by the Crawler software (Extract), then passed to the Parser software, where all the paper information and text content are extracted as paragraphs (Transform) and inserted into the database (Load).
3.2 Plagiarism Engine
The Plagiarism Engine preprocesses a huge dataset of academic English papers and analyzes it using Natural Language Processing techniques to extract useful information, then measures the similarity between an input query and the dataset using Information Retrieval methods to detect both identical and paraphrased plagiarism in a fast and intelligent way.
[Diagram residue omitted. Figure 3.1 shows the high-level flow: academic papers pass through the ETL/Parsing stage, the resulting parsed papers feed the plagiarism engine, and results for an input query are communicated back. Figure 3.2 details the engine: parsed papers go through NLP and vectorization, semantic analysis, and VSM representation; a vectorized input query is matched against a clustering centroid to find its cluster, similarity is calculated using the lexical database (WordNet/NLTK) and the extracted features, and the potentially plagiarized paragraphs are returned.]
Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization
The text processing part works on these data to extract the most important words from the paragraphs and to ignore the common words; then k-gram terms are generated from these words, and each bag of words is linked to its corresponding paragraph in the database.
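These two steps can be sketched as follows (an illustrative sketch; the stop-word list here is a tiny assumed subset of the common words that are actually ignored):

```python
# A tiny assumed subset of common (stop) words to ignore.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def important_words(text):
    """Keep only the important words of a paragraph, lowercased."""
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

def k_grams(words, k):
    """Generate all contiguous k-word terms from a token list."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
```

Running both over a paragraph yields the bag of k-gram terms that gets linked to that paragraph.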
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
Input: simple term-frequency vector representation stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length) stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which that term appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than simple term frequency: a TF-IDF normalized vector representation using both pivoted length normalization and BM25, as discussed later.
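As an illustration, one common formulation of the BM25 term weight looks like this (the report's exact variant and parameter values may differ):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight of one term in one paragraph (a common formulation).
    tf: term frequency in the paragraph; df: document frequency of the term;
    n_docs: number of paragraphs; doc_len/avg_len: paragraph length vs average."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating TF component with length normalization controlled by b.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

The weight grows (sub-linearly) with term frequency, shrinks for terms common across the dataset, and is discounted for paragraphs longer than average.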
3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: similarity between the input paragraph and the relevant paragraphs in the dataset, stored in the similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and detect possible plagiarism if the similarity measure between the input paragraph and any paragraph from the dataset exceeds a predetermined threshold.
We implemented both the Okapi BM25 and pivoted length normalization similarity functions.
The system first measures similarity on 5-gram vectors, then 4-grams, and so on; whenever it finds a high similarity at one k-gram level, it limits its scope to the paragraphs with high similarity at the preceding k-gram level, to increase performance.
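This narrowing strategy can be sketched as follows (an illustrative sketch; the nested-dict layout of the vectors and the similarity callback are assumptions, not the engine's actual data structures):

```python
def cascade_check(query_vectors, dataset_vectors, similarity, threshold):
    """query_vectors[k] is the query's k-gram vector; dataset_vectors[k][pid]
    is paragraph pid's k-gram vector. Start at 5-grams and narrow the
    candidate set whenever high-similarity paragraphs are found."""
    candidates = set(dataset_vectors[1])  # all paragraph ids
    scores_by_k = {}
    for k in (5, 4, 3, 2, 1):
        scores = {pid: similarity(query_vectors[k], dataset_vectors[k][pid])
                  for pid in candidates if pid in dataset_vectors[k]}
        high = {pid for pid, s in scores.items() if s >= threshold}
        if high:
            candidates = high  # limit the scope at the following k-gram levels
        scores_by_k[k] = scores
    return scores_by_k
```

Once a high-similarity set is found at the 5-gram level, all lower k-gram levels score only those candidates instead of the whole dataset.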
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights stored in
paragraphVector table.
Output: the cluster of each paragraph stored in clusterId column in paragraph table, and the
centroids of the clusters stored in centroid table.
We clustered similar paragraphs together so that similarity is measured only against similar paragraphs, which speeds up the similarity checking step.
An input paragraph is first measured against the centroids to determine its cluster; then the regular similarity measure is applied against all dataset paragraphs in that same cluster.
3.2.5 Communicating Results
This is the interface where the user can check a document for plagiarism by inserting it into a text box. The document is parsed the same way as by the system's Parser: it is split into paragraphs, which pass through the text processing part and are compared against the dataset in the database. The results appear as a plagiarism percentage for the document, highlighting the plagiarized parts alongside the matching documents.
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
Figure 4.1 Overview for the Crawler and Parser
4.1.1 The Crawler
The Crawler is software that downloads the scientific papers from the web into a folder for each publisher, where the Parser will start working on them.
4.1.2 The Parser
The Parser is software that takes a PDF document (a scientific paper) as input, extracts the paper information and content, and inserts them into the system's database.
4.1.3 The Data Extracted from the Paper
a. Paper Information
1. Paper Title
2. Paper authors
3. Journal and its ISSN
4. Volume, Issue, Paper Date and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting Page and Ending Page
b. Abstract and Keywords
c. Table of Contents
d. Figure and Table captions
e. Paper text content (as Paragraphs)
4.1.4 The Parser Implementation
Figure 4.2 UML of the Parser Application
4.1.5 How it works
The Parser consists of a parent class (Parser) and child classes (IEEE, Springer, APEM, and Science Direct). The parent class has the general functions that parse the PDF document and extract the table of contents, the figure and table captions, and the text content of the paper; the child classes have specific functions and Regular Expressions for each publisher's structure to extract the paper information (Title, Authors, DOI ...).
[Flow chart: Start → Check for new papers → Choose the suitable parser → Parse the papers → on success, move to the Processed directory; on failure, move to the Unprocessed directory.]
Figure 4.3 the Flow Chart of the Parser
Each publisher has its own folder where its scientific papers are downloaded by the Crawler; the Parser monitors each folder for new documents and uses the suitable child class to parse each new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the Processed directory; otherwise, the file is moved to the Unprocessed directory and the error is logged, so the developer can check whether it is a new structure that should be supported in the Parser, or whether something went wrong that has to be fixed.
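The monitor-parse-move cycle just described can be sketched as a single polling pass (an illustrative sketch; the folder names and the parse callback are assumptions, not the project's actual Java implementation):

```python
import os
import shutil

PUBLISHER_DIRS = ["ieee", "springer", "apem", "sciencedirect"]  # illustrative names

def scan_once(base, parse):
    """One polling pass over the publisher folders: parse every PDF found,
    then move it to processed/ on success or unprocessed/ on failure."""
    for pub in PUBLISHER_DIRS:
        folder = os.path.join(base, pub)
        for name in sorted(os.listdir(folder)):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            try:
                parse(pub, path)  # dispatch to the suitable per-publisher parser
                dest = os.path.join(base, "processed", name)
            except Exception:
                # in the real system the error would also be logged here
                dest = os.path.join(base, "unprocessed", name)
            shutil.move(path, dest)
```

Calling scan_once in a loop (with a sleep between passes) gives the continuous folder monitoring described above.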
4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks Function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as blocks of strings. This function loops through the file page by page; for each page it extracts the content into an ArrayList<String> object called page and adds this page, keyed by its page number, to a HashMap<Integer, ArrayList<String>> object called pages.
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
    public void parsePaper(String publisher) throws Exception {
        extractBlocks();
        try {
            parseFirstPage();
        } catch (Exception e) {
            throw new Exception("Error Not Processed");
        }
        parserOtherPages();
        paper.enhaceParagraphs();
        try {
            paper.insertPaperInDatabase(publisher);
        } catch (SQLException e) {
            throw new Exception("Error Database");
        }
    }
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parserFirstPage Function)
Each Publisher accepts his scientific paper in a specific structure which differs from publisher to
publisher, and the difference lies in the first page where the paper information are written, so there has to
be parser for each publisher designed to support its structure, so this function which is an abstract
function in the parent class is implemented in each child class for each publisher.
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Paper of a Springer Paper
These are the different structures of Science Direct and Springer papers, shown to illustrate the difference, which lies in the organization and structure of the information, e.g.:
1. This is a header of a Springer Paper
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE Paper
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
4.1.6.3 Extracting the Paper text content (parserOtherPages Function)
This function uses the general Parser functions: it loops over all the pages and the blocks of string in each page and extracts the data from the blocks, which could be table of contents entries, figure and table captions, lists, or paragraphs.
Each block passes through several stages:
1) First, test whether the block is a figure caption.
2) Then test whether it is a table caption.
3) Then test whether it has a header (table of contents entry).
4) Then test whether the block has lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block is returned to the function, which continues with the remaining stages.
    void parserOtherPages() {
        for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
            Integer pageNumber = entrySet.getKey();
            ArrayList<String> page = entrySet.getValue();
            Iterator<String> it = page.iterator();
            while (it.hasNext()) {
                String block = it.next().trim();
                boolean isFigureCaption = false, isTableCaption = false;
                boolean isList = false, isEmptyParagraph = false;
                isFigureCaption = parseFigureCaption(block, pageNumber);
                isTableCaption = parseTableCaption(block, pageNumber);
                block = parseHeaders(block);
                isList = parseLists(block, pageNumber);
                isEmptyParagraph = "".equals(block);
                if (!isFigureCaption && !isTableCaption
                        && !isEmptyParagraph && !isList)
                    parseParagraph(block, pageNumber);
            }
        }
    }
Figure 4.8 The function of parseOtherPages
5) Finally, if the block is not one of the previous types (not a figure or table caption, has no list, or had its header extracted and the rest returned), then it is a paragraph and is extracted as such.
4.1.6.4 Enhancing the Paragraphs (enhanceParagraph Function)
As shown in Figure 4.9, some paragraphs are not in good shape when extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there are also extra spaces between words that have to be removed.
2) The paragraph is extracted as lines (each with a newline character at the end) rather than one continuous string, so it has to be refined.
3) Some of the paper info is in uppercase, so it is capitalized.
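The first two enhancement steps can be sketched with regular expressions (an illustrative sketch, not the project's Java implementation; step 3, capitalization, is omitted here):

```python
import re

def enhance_paragraph(raw):
    """Rejoin words hyphenated across line breaks, flatten the lines into
    one continuous string, and collapse repeated spaces."""
    text = re.sub(r"-\n", "", raw)        # 1) rejoin hyphen-split words
    text = text.replace("\n", " ")        # 2) one continuous string
    return re.sub(r" +", " ", text).strip()  # 1) remove the extra spaces
```

This mirrors what the Paper class's enhancement function does before the paragraphs reach the NLP stage.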
Figure 4.9 Block of String before Enhancing
Page Number: 1 The Content: However, as the number of metal layers increases and interconnect dimensions decrease, the parasitic capacitance increases associated with fill metal have become more significant, which can lead to timing and signal integrity problems.
Page Number: 1 The Content: Previous research has primarily focused on two important aspects of fill metal realization: 1) the development of fill metal generation methods – which we discuss further in Section II and 2) the modeling and analysis of capacitance increases due to fill metal – Several studies have examined the parasitic capacitance associated with fill metal for small scale interconnect test structures in order to provide general guidelines on fill metal geometry selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after enhancing
4.1.6.5 Finally inserting all these data in the Database
When the Parser starts, an object of type Paper is created, and every piece of information and data extracted from the scientific paper is assigned to its attribute in this object. At the end of parsing, all of this information is inserted into the database by calling this function, which proceeds as follows:
1) Retrieve the journal ID from the database by name or ISSN; if the journal is already there, its ID is returned. Otherwise it is considered a new journal, inserted, and its new ID is returned.
2) Test whether the paper was already inserted in the database before; if it is found, the Parser throws an exception stating that it was already inserted. If it is a new paper, it is inserted with its information (title, volume, issue ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents, figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
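Step 1's look-up-or-insert logic can be sketched with SQLite (an illustrative sketch only: the table and column names here are assumptions, not the project's actual schema):

```python
import sqlite3

def make_db():
    # Minimal in-memory schema for the illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT, issn TEXT)")
    return conn

def journal_id(conn, name, issn):
    """Return the journal's id, inserting a new row first if neither the
    name nor the ISSN is already present."""
    cur = conn.execute("SELECT id FROM journal WHERE name = ? OR issn = ?",
                       (name, issn))
    row = cur.fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO journal (name, issn) VALUES (?, ?)",
                       (name, issn))
    return cur.lastrowid
```

The same look-up-or-insert pattern applies to step 2 for the paper itself, keyed by title or DOI.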
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and data of the paper, the function enhaceParagraphs(), which is responsible for improving the text and making each paragraph ready for the next processing step in the Natural Language Processing part, and the function insertPaperInDatabase(), which tests whether the paper was already inserted in the database before and, if it is a new paper, inserts it with all of its data: the paragraphs, figure and table captions, and the paper information.

    public class Paper {
        public String title = "";
        public int volume = -1, issue = -1;
        public int startingPage = -1, endingPage = -1;
        public String journal = "", ISSN = "";
        public String DOI = "";
        public ArrayList<String> headers = new ArrayList<>();
        public ArrayList<String> authors = new ArrayList<>();
        public ArrayList<String> keywords = new ArrayList<>();
        public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
        public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
        public ArrayList<Paragraph> paragraphs = new ArrayList<>();
        public String date = "";
        public String dateReceived = "NULL", dateRevised = "NULL";
        public String dateAccepted = "NULL", dateOnlinePublishing = "NULL";
        public void enhaceParagraphs()
        public void insertPaperInDatabase(String publisher)
    }

Figure 4.11 the Paper Structure
Note that when the Parser finds a new PDF document in the publisher folders, it creates a new object of type Paper; while parsing the document, each piece of information extracted is assigned to its attribute in this object. At the end of the parsing process, this Paper object executes its two member functions: enhaceParagraphs() to refine the paragraph content, then insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of the extracted paragraph and the number of the page from which it was extracted.
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.14, the page is divided into blocks of string, and each block holds one or more pieces of the paper information. This function in the parser is implemented specifically per publisher, so the function of the IEEE parser won't work for the Springer parser and so on; it is implemented to parse only the first page, extract the paper information from it, and assign it to the attributes of the Paper object.
Even within a single publisher there are differences in the location of the paper information on the page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of paper header.
Figure 4.13 Different forms for an IEEE Top Header
    public class Paragraph {
        public int pageNum;
        public String content;
    }

Figure 4.12 the Paragraph Structure
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the first block in the paper. The block is sent to a parsePaperHeader() function, which has a different Regular Expression for every form that the parser supports, as shown in Figure 4.15.
When the function receives the block, the block passes through the different expressions; if it matches one of the supported formats, the function starts extracting the information, otherwise it throws an exception stating that this header format isn't supported and the developer has to add support for it.
As shown in Figure 4.13, the header may contain information such as the starting page number (which may be at the start or the end of the line), the journal title, the volume number, the issue number, and the date. Each piece may or may not be present in the header, so according to the format of the header the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(), parseStartingPage()) are called to extract this information.
    // ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
    String header1_Exp = "^([A-Z]+ ET AL\\. " + volume_Exp + ", " + issue_Exp
            + "[ ]*\\/[ ]*" + paperDate_Exp + "[ ]*\\/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";
    // ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
    String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "\\/"
            + volume_Exp + ", " + issue_Exp + "\\/" + paperDate_Exp + " [A-Z]+ ET AL\\.)$";
    // ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
    // ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
    // ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
    // ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
    // ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
    String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* " + volume_Exp
            + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";
    // ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
    String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";
    // ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
    String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";
    // ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
    String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";
    // ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
    String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";
    // ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
    String header8_Exp = "^(" + journalTitle_Exp + "[ ]*\\/[ ]*" + dateExp + ")$";
Figure 4.15 The supported Regex of the IEEE Header formats
4.1.9.2 Extracting the Volume from the Header
The IEEE Parser uses volume_Exp = VOL(\\.)* [A-Z\\-]*[0-9]+ to detect the volume part of the header and passes it to the parseVolume() function, which then uses another expression to extract the number from that part. For example, for a header form containing (VOL. 18), the parser detects the part (VOL. 18), then extracts the number from this result (18) and converts it from String to int to be assigned to the volume attribute in the Paper object.
4.1.9.3 Extracting the Issue number from the Header
The IEEE Parser uses issue_Exp = NO(\\.|,) [0-9]+ to detect the issue part of the header and passes it to the parseIssue() function, which then uses another expression to extract the number from that part. For the same example presented in the volume section, the parser detects the part (NO. 3), then extracts the number from this result (3) and converts it from String to int to be assigned to the issue attribute in the Paper object.
4.1.9.4 Extracting the PaperDate from the Header
    @Override
    void parseVolume(String volume) {
        Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
        if (matcher.find()) {
            Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
            while (numMatcher.find())
                paper.volume = Integer.parseInt(numMatcher.group());
        }
    }
Figure 4.16 The Function of extracting the Volume Number
    @Override
    void parseIssue(String issue) {
        Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
        if (matcher.find()) {
            Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
            if (numMatcher.find())
                paper.issue = Integer.parseInt(numMatcher.group());
        }
    }
Figure 4.17 The Function of Extracting the Issue Number
    @Override
    void parsePaperDate(String date) {
        Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
        if (matcher.find())
            paper.date = matcher.group().replaceAll("^\\/", "").trim();
    }

Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE Parser uses date_Exp = [A-Z]{0,9}[\\/\\- ]*[A-Z]{3,9}(\\.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part from the header and assigns it to the date attribute in the Paper object. The date can be written in different formats (2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all of these forms.
Note that after each piece of information is extracted, it is removed from the header block string; so after removing the volume, issue, and date, the information left in the header is the journal title and the starting page, and the starting page can be at the start or the end of the header.
4.1.9.5 Extracting the Start and End Page numbers from the Header
Now we know that the header contains only the journal title and the start page number, so the IEEE Parser uses startPage_Exp = ^[0-9]+|[0-9]+$, an expression that extracts a number lying at the start or the end of the checked string; wherever the start page number lies in the header, it is detected and extracted, and like the other information it is assigned to its attribute in the Paper object.
The end page is very simple: the IEEE Parser adds the number of pages of the paper to the start page number and assigns the result to the end page of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Finally, for the journal title, the IEEE Parser uses journalTitle_Exp = [A-Z \\:\\-\\—\\/\\)\\(\\.\\,]+ to extract the journal title part from the header, then passes the title to the parseJournalTitle() function.
The title may contain extra words that aren't needed, such as ([author name] et al.), or it may end with separating characters (a comma or a forward slash), so these must be removed first; the rest is then assigned to the journal attribute in the Paper object.
    @Override
    void parseStartingPage(String startingPage) {
        Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
        if (matcher.find())
            paper.startingPage = Integer.parseInt(matcher.group().trim());
        parseEndingPage(startingPage);
    }

    @Override
    void parseEndingPage(String endingPage) {
        paper.endingPage = paper.startingPage + pages.size();
    }
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the first page’s blocks
    @Override
    void parseJournal(String journal) {
        journal = journal.replaceAll("( \\/|, )", "").trim();
        Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
        if (matcher.find()) {
            String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
            if (journalName.charAt(journalName.length() - 1) == '/')
                paper.journal = journalName.substring(0, journalName.length() - 1);
            else
                paper.journal = journalName;
        }
    }
Figure 4.20 The Function of Extracting the Journal Title
    Iterator<String> it = pageOne.iterator();
    while (it.hasNext()) {
        String mainBlock = it.next();
        String block = mainBlock.replaceAll("[ ]+", " ");
        if (Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find()) {
            parseDOI(block);
            blockList.add(mainBlock);
        }
        if (ISSN_Pattern.matcher(block).find()) {
            parseISSN(block);
            blockList.add(mainBlock);
        }
        if (Pattern.compile("Index Terms").matcher(block).find()) {
            parseKeywords(block);
            blockList.add(mainBlock);
        }
        if (Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find()) {
            parseAbstract(block);
            blockList.add(mainBlock);
        }
        if (date_Pattern.matcher(block.toUpperCase()).find() && !datesFound) {
            parseDates(block);
            if (!paper.dateAccepted.equals("NULL")
                    || !paper.dateOnlinePublishing.equals("NULL")
                    || !paper.dateReceived.equals("NULL")
                    || !paper.dateRevised.equals("NULL")) {
                blockList.add(mainBlock);
                datesFound = true;
            }
        }
    }
    removeUnimportantBlocks();
    for (String blockList1 : blockList)
        pageOne.remove(blockList1);
Figure 4.21 Parsing the rest of blocks in the first Page
After parsing the header block and extracting all the information from it, the IEEE Parser continues to parse the other blocks, searching for the rest of the information. Due to the differences in structure, the location of this information can differ from structure to structure, so the best way to extract it is to loop through all the first-page blocks; using the Regular Expressions of this information (DOI, ISSN ...), the Parser can locate it, and it also tries to detect some other blocks such as the abstract, keywords, nomenclature, and the paper dates (when it was received, accepted, revised, and published online).
In every loop iteration, if a piece of information is detected, the block is passed to the suitable function to extract it; once the information is extracted the block is no longer needed, so the Parser adds it to a TreeSet<String> (blockList), and after finishing all the iterations over the blocks of page one, these blocks are removed from the page's blocks.
There may also be other blocks that don't hold important information, such as the publisher's website or the publisher's logo with its name underneath; these all have to be detected and removed as well, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to contain the DOI (Digital Object Identifier) or PII (Publisher Item Identifier) using IEEE_DOI_Exp = [0-9]{2}\\.[0-9]{4}\\/[A-Z\\-]+\\.[0-9]+\\.[0-9]+ and PII_Exp = [0-9]{4}\\-[0-9xX]{4}\\([0-9]{2}\\)[0-9]{5}\\-(x|X|[0-9]), the IEEE Parser passes the block to the parseDOI() function, and the DOI or PII is extracted. If it is a DOI, it is concatenated with the DOI resolver domain (http://dx.doi.org/); if it is a PII, it is concatenated with (http://dx.doi.org/10.1109/S); the result is assigned to the DOI attribute in the Paper object.
4.1.9.9 Extracting the ISSN
    @Override
    void parseDOI(String DOI) {
        Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
        while (matcher.find())
            paper.DOI = "http://dx.doi.org/" + matcher.group();
        matcher = Pattern.compile(PII_Exp).matcher(DOI);
        while (matcher.find())
            paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
    }
Figure 4.22 The Function of Extracting the DOI and PII
    void parseISSN(String ISSN) {
        Matcher matcher = ISSN_Pattern.matcher(ISSN);
        while (matcher.find())
            paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
    }
Figure 4.23 The Function of Extracting the ISSN
Similarly, if a block is detected to contain the ISSN using ISSN_Exp = [0-9]{4}(\\–|\\-|\\‐| )[0-9]{3}[0-9xX], the IEEE Parser passes the block to the parseISSN() function, which extracts the ISSN and assigns it to the ISSN attribute in the Paper object.
4.1.9.10 Extracting the Dates of the Paper:
If a block through the iteration is detected to have dates using the date_Exp = ([0-9]{1,2}(\\-| )[A-Z]{3,9}[\\.]*(\\-| )[0-9]{4})|[A-Z]{3,9}[\\.]*( [0-9]{1,2},)* [0-9]{4}) so The IEEE Parser will pass the block to the function parseDates(), this
block have dates related to the Paper such as when it’s received in the publisher and when it’s revised,
void parseDates(String dates){ dates = dates.replaceAll(separatedWord_Fixing, "")
.replaceAll(newLine_Removal, " ").toUpperCase(); Matcher matcher = receivedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateReceived = dateMatcher.group().trim(); } matcher = revisedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateRevised = dateMatcher.group().trim(); } matcher = acceptedDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateAccepted = dateMatcher.group().trim(); } matcher = publishingDate_Pattern.matcher(dates); while(matcher.find()){ String stMatch = matcher.group(); Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch); while(dateMatcher.find()) paper.dateOnlinePublishing = dateMatcher.group().trim(); } }
Figure 4.24 The Function of extracting the paper Dates
For each of these dates there is a Regular Expression to detect it. Note that not all papers include these dates, but most of them do, so whenever they are present they are extracted and assigned to the corresponding attributes in the paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the Regular Expression for the date itself has to be fairly involved to detect all of these formats.
Also, the word before the date can be written in different forms: (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and may be lowercase or capitalized. The Regular Expressions are therefore constructed to detect all forms of those words, and to handle character case we transform the string to uppercase before comparing.
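To make this concrete, here is a Python sketch (an illustration, not the project's exact code) of one date regex that covers all four formats after uppercasing:

```python
import re

# One pattern covering "30 OCTOBER 2007", "17 AUG. 2007",
# "28-JULY-2009", and "OCTOBER 6, 2006" (input uppercased first).
date_exp = re.compile(
    r"([0-9]{1,2}(-| )[A-Z]{3,9}\.?(-| )[0-9]{4})"
    r"|([A-Z]{3,9}\.?( [0-9]{1,2},)? [0-9]{4})"
)

samples = ["30 October 2007", "17 Aug. 2007", "28-July-2009", "October 6, 2006"]
for s in samples:
    assert date_exp.search(s.upper()) is not None
```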
4.1.9.11 Extracting the Keywords
If the Keywords block is detected, the IEEE Parser passes it to the parseKeywords() function. The keywords may appear inside the Abstract block, so the first step is to crop the Keywords part if it is attached to the abstract. The block may also be split across two lines, or contain a word split across two lines with a hyphen, so these line breaks have to be removed and the words rejoined. Finally, some papers separate the keywords with a comma (,) and others with a semicolon (;); the split keywords are then added to the keywords list in the paper object.
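The cleanup steps above can be sketched in Python (split_keywords is a hypothetical helper, not the project's Java function):

```python
import re

# Hypothetical sketch of the keyword-splitting logic: crop the label,
# rejoin hyphenated line breaks, then split on comma or semicolon.
def split_keywords(block):
    block = re.sub(r"(Index Terms|Keywords)[\s:-]*", "", block)  # crop the label
    block = block.replace("-\r\n", "")    # rejoin words hyphenated across lines
    block = block.replace("\r\n", " ")    # join remaining line breaks
    return [w.strip() for w in re.split(r"[,;]", block) if w.strip()]

assert split_keywords("Keywords: plagiarism, detec-\r\ntion; NLP") == \
    ["plagiarism", "detection", "NLP"]
```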
4.1.9.12 Extracting the Abstract
For the abstract block: during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, that block is passed to the parseAbstract() function and is treated as the first paragraph of the page, under the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the Nomenclature, so these have to be cropped first and parsed separately.
@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\\r\\n|Index Terms|\\-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
Now, after extracting all the information of the paper and removing those blocks, the next blocks contain the Title of the paper, then the Authors, then the Introduction.
First, the title is passed to the parseTitle() procedure; if it is split over more than one line, the newline characters are removed and the result is assigned to the title attribute.
Next, the authors are passed to the parseAuthors() procedure, where they are separated by comma, semicolon, or some other separator according to the publisher's style, and each author is added to the authors list in the paper object.
@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(\\-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract
void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "").replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replaceAll(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if(!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information has been extracted, and the remaining blocks on the first page are the introduction and the rest of the page content. Once the parseFirstPage() procedure finishes executing, the parseOtherPages() procedure starts. As demonstrated before, it loops over all the blocks of strings in the pages and extracts all the possible data from them: headers (for the table of contents), figure and table captions, and lists; anything else is treated as a paragraph. All of these procedures are part of the parent Parser.
4.1.10.1 Extracting the Headers
This procedure is quite general and works for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction). Headers may be numbered with Roman numerals or Arabic numbers, the number may or may not be followed by a dot, and the header text may be uppercase or capitalized, so the function first detects which style is used.
For the level 2 headers there are also different styles, and another function detects them. It supports three types, for example (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header): the header may be listed alphabetically, as number-dot-number, or as number-dot-number-dot, followed by the title of the header.
Once the header styles are identified, the headers' Regular Expressions are created and tested on every passed block to detect headers. These Regexes are not constant; they change as parsing progresses. For example, once (1. Introduction) is detected, the next expected header becomes (2. Another Header), i.e. the number is incremented.
Note also that a header always comes at the start of a block of string, and the rest of the string is a paragraph (or the header may already have been extracted as its own block). The function therefore extracts only the header and returns the rest of the block so it can be parsed as a paragraph.
There are also headers without numbers, such as Abstract, References, Acknowledgements, Appendix, and others; those headers are detected separately with a dedicated Regex.
This procedure can also detect level 3 and level 4 headers, whose style is inferred from the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header). All detected headers are added to the headers list in the paper object.
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent Parser uses figure_Exp = ^(Fig\\.|Figure)[ ]+[0-9]+(\\.|\\:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. They may appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II); the numbering may be Arabic or Roman. After extraction, the caption is added to the list of captions as a paragraph with its page number.
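Transcribed into Python for illustration, the two patterns accept the caption styles listed above:

```python
import re

# The two caption patterns from the text, transcribed to Python.
figure_exp = re.compile(r"^(Fig\.|Figure)[ ]+[0-9]+(\.|\:)")
table_exp = re.compile(r"^(TABLE|Table)[ ]+([0-9]+|[IVX]+)")

assert figure_exp.match("Figure 1. System architecture")
assert figure_exp.match("Fig. 3: Experimental results")
assert table_exp.match("TABLE 1")
assert table_exp.match("Table II")
assert figure_exp.match("The figure shows results") is None  # must start the block
```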
private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\\r\\n| )*");

void detect_Header1Mode(String block){
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block){
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + "\\.1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + "\\.1\\. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Detecting the Style of the Headers
4.1.10.3 Extracting the Lists
In this procedure, the parent Parser detects lists in the text content and separates each list as a paragraph of its own. It supports numeric, bulleted (dot), and dashed lists. The block may also have a paragraph before the list and a paragraph after it, so these have to be separated; each paragraph (if found) and each list is added as a paragraph, with its page number, to the paragraph list in the paper object.
private boolean parseFigureCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while(matcher.find()){
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                                  .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while(matcher.find()){
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                                 .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
4.1.10.4 Extracting the Paragraph
Pattern newList_Pattern = Pattern.compile("\\r\\n[ ]*([0-9]|\\-|\\.|\\·|\\•)");
Pattern numericList1_Pattern = Pattern.compile("^[0-9](\\.|\\))[ ]+[A-Z]");
Pattern numericList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*[0-9](\\.|\\))[ ]+[A-Z]");
Pattern dotList1_Pattern = Pattern.compile("^(\\.|\\·|\\•)[ ]+[A-Za-z]");
Pattern dotList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*(\\.|\\·|\\•)[ ]+");
Pattern dashList1_Pattern = Pattern.compile("^\\-[ ]+[A-Za-z]+");
Pattern dashList2_Pattern = Pattern.compile("(\\.|\\:)\\r\\n[ ]*\\-[ ]+[A-Z]");

private boolean parseLists(String block, int pageNumber){
    Matcher orderList1_Matcher = numericList1_Pattern.matcher(block);
    Matcher orderList2_Matcher = numericList2_Pattern.matcher(block);
    if(orderList1_Matcher.find() || orderList2_Matcher.find())
        return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern);
    Matcher dotList1_Matcher = dotList1_Pattern.matcher(block);
    Matcher dotList2_Matcher = dotList2_Pattern.matcher(block);
    if(dotList1_Matcher.find() || dotList2_Matcher.find())
        return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern);
    Matcher dashList1_Matcher = dashList1_Pattern.matcher(block);
    Matcher dashList2_Matcher = dashList2_Pattern.matcher(block);
    if(dashList1_Matcher.find() || dashList2_Matcher.find())
        return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern);
    return false;
}
Figure 4.30 The Function of Separating the Lists
void parseParagraph(String block, int pageNumber){
    Matcher matcher = newParagraph_Pattern.matcher(block);
    String lastHeader = "", content;
    int startIndex = 0, endIndex;
    while(matcher.find()){
        endIndex = matcher.start();
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        content = block.substring(startIndex, endIndex+1);
Figure 4.31 The Function of Extracting the Paragraph (first part)
In this procedure, the parent Parser detects paragraphs using paragraph_Exp = .\\r\\n[ ]+[A-Z]. As mentioned before, a block reaches this procedure only after being tested as a figure or table caption or a list; it may also start with a header, which is extracted first, with the rest of the block returned. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure.
Note that the block may contain more than one paragraph, so all of them have to be detected and separated, and each is added to the paragraphs list, with its page number, in the paper object.
        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length()-1;
    }
    if(paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size()-1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}
Figure 4.32 The Function of Extracting the Paragraph (continued)
4.2 The Natural Language Processing (NLP)
4.2.1 Introduction
In this section, the text extracted from the scientific papers has to be refined. We focus on the important words in the text, such as nouns and verbs, and ignore the stop words, such as prepositions and articles, so that plagiarism can be detected efficiently even if the user tries to play with the wording.
4.2.2 The Implementation Overview
First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove Punctuations
5. Remove Stop words
6. Lemmatization
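As a minimal, self-contained sketch of these six steps (using a toy tokenizer, stop-word set, and suffix-stripping lemmatizer in place of the NLTK components the project actually uses):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def process_text(document):
    # Steps 1-2 and 4: lowercase, then tokenize; the regex keeps only word
    # characters, which also drops punctuation tokens.
    words = re.findall(r"[a-z0-9']+", document.lower())
    # Step 5: remove stop words (POS tagging, step 3, is omitted here).
    words = [w for w in words if w not in STOP_WORDS]
    # Step 6: toy lemmatization by stripping a plural "s".
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

assert process_text("The ideas of plagiarism.") == ["idea", "plagiarism"]
```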
4.2.3 The Text Processing Procedure
4.2.3.1 Lowercase
In this step, all the text is converted to lowercase, so we do not store redundant data for the same word written in different cases (Play, play).
4.2.3.2 Tokenization
def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words
Figure 4.33 Process Text Function
def tokenizeWords(sentence):
    return word_tokenize(sentence)
Figure 4.34 Tokenizing words Function
Here, we split the text into words using the Treebank tokenization algorithm provided by NLTK. This algorithm splits the text into words in an intelligent, rule-based way and also separates words from surrounding punctuation.
For Example:
4.2.3.3 Part of Speech (POS) tagging
The purpose of POS tagging is to find the grammatical role of each word in the sentence: it detects whether a word is a verb, noun, adjective, or adverb. This information helps return words to their base forms; verbs, for example, are reduced to their infinitives.
We use the WordNet database to get the words' base forms.
1. i’m → [ 'i', "'m" ]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']
Figure 4.35 Tokenization Example
words = ['at', '5', 'am', 'tomorrow', 'morning', 'the', 'weather', 'will', 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)
Figure 4.36 POS Function
[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'), ('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will', 'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]
Figure 4.37 POS Output Example
def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuations
In this step, the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.
4.2.3.5 Remove Stop words
In this step, the stop words (common function words) are removed.
def removePunctuation(words):
    # keep only tokens longer than one character, which drops punctuation tokens
    new_words = []
    for word in words:
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words
Figure 4.39 Removing Punctuations Function
def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words
Figure 4.40 Removing Stop Words Function
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Figure 4.41 Stop Words list
4.2.3.6 Lemmatization
In this step, we use the information obtained from POS tagging to get the base forms of the words, by passing each word and its WordNet position to the lemmatize function.
After processing, the paragraph contains only the important words that carry the real meaning of the paragraph.
4.2.4 Example of the Text Processing
def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0], getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words
Figure 4.42 Lemmatization Function
Plagiarism is the wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work. The idea remains problematic with unclear definitions and unclear rules. The modern concept of plagiarism as immoral and originality as an ideal emerged in Europe only in the 18th century, particularly with the Romantic movement. Plagiarism is considered academic dishonesty and a breach of journalistic ethics. It is subject to sanctions like penalties, suspension, and even expulsion. Recently, cases of 'extreme plagiarism' have been identified in academia. Plagiarism is not in itself a crime, but can constitute copyright infringement. In academia and industry, it is a serious ethical offense. Plagiarism and copyright infringement overlap to a considerable extent, but they are not equivalent concepts, and many types of plagiarism do not constitute copyright infringement, which is defined by copyright law and may be adjudicated by courts. Plagiarism is not defined or punished by law, but rather by institutions (including professional associations, educational institutions, and commercial entities, such as publishing companies).
Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication', 'another', 'author', "'s", 'language', 'thought', 'idea', 'expression', 'representation', 'one', "'s", 'original', 'work', 'idea', 'remain', 'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern', 'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge', 'europe', '18th', 'century', 'particularly', 'romantic', 'movement', 'plagiarism', 'consider', 'academic', 'dishonesty', 'breach', 'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty', 'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme", 'plagiarism', 'identify', 'academia', 'plagiarism', 'crime', 'constitute', 'copyright', 'infringement', 'academia', 'industry', 'serious', 'ethical', 'offense', 'plagiarism', 'copyright', 'infringement', 'overlap', 'considerable', 'extent', 'equivalent', 'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright', 'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate', 'court', 'plagiarism', 'define', 'punish', 'law', 'rather', 'institution', 'include', 'professional', 'association', 'educational', 'institution', 'commercial', 'entity', 'publish', 'company']
Figure 4.44 Paragraph after Text Processing
4.3 Term Weighting
In this section we calculate the term weights for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.
4.3.1 Lost Connection to Database Problem
First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs, and the connection must stay open all that time, so we face the problem of losing the connection to the database when its internal timeout expires.
1) Increasing timeout solution
This problem could be solved by increasing the timeout, but this solution is limited, as we might have a very large number of paragraphs whose processing exceeds whatever timeout was set.
2) Better solution
We retrieve 100 paragraphs, process them, and close the connection. Then we open a new connection and retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.
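The batching idea can be sketched independently of MySQL (fetch and handle are hypothetical stand-ins for the SELECT and the per-paragraph processing):

```python
def process_in_batches(total_unprocessed, batch_size, fetch, handle):
    # Open a fresh connection per batch (omitted here), fetch at most
    # batch_size unprocessed rows, process them, then reconnect.
    done = 0
    while done < total_unprocessed:
        remain = min(batch_size, total_unprocessed - done)
        rows = fetch(remain)   # e.g. SELECT ... WHERE processed = false LIMIT remain
        for row in rows:
            handle(row)
            done += 1
    return done

assert process_in_batches(250, 100, fetch=range, handle=lambda r: None) == 250
```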
cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()

pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100:
        remain = 100
    rows = connection1.run("SELECT paragraphId,content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()
Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph
Each paragraph is passed to the processText() procedure to get an array of refined words. If the array is empty, it means there were no important words in the paragraph, and the paragraph is deleted.
The returned words are used to generate k-gram terms, which populate the term table and the paragraphVector table.
Finally, we update the length of the paragraph with the number of words returned from processText() and mark the paragraph as processed.
4.3.3 Generating Terms
To generate terms we call the generateTerms() procedure and pass it the bag of words and the list of k-gram sizes we want to generate.
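The core of the k-gram generation is a sliding window over the word list; a compact Python sketch equivalent to createTerms() in Figure 4.47:

```python
def create_terms(words, k):
    # Slide a window of size k over the word list and join each window
    # into one term string.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

words = ["physic", "one", "old", "academic"]
assert create_terms(words, 1) == words
assert create_terms(words, 2) == ["physic one", "one old", "old academic"]
assert create_terms(words, 3) == ["physic one old", "one old academic"]
```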
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000:
        remain = 1000
    rows = connection1.run("SELECT paragraphId,content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;",
                            (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s"
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()
Figure 4.46 Process Paragraph Function
An example of k-gram generation:
data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i:
            continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t
Figure 4.47 Generate k-gram Terms Function
Physics is one of the oldest academic disciplines, perhaps the oldest through its inclusion of astronomy. Over the last two millennia, physics was a part of natural philosophy along with chemistry, biology, and certain branches of mathematics.
Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old', 'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part', 'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch', 'mathematics']
Figure 4.49 1-gram terms
['physic one', 'one old', 'old academic', 'academic discipline', 'discipline perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part natural', 'natural philosophy', 'philosophy along', 'along chemistry', 'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']
Figure 4.50 2-gram terms
['physic one old', 'one old academic', 'old academic discipline', 'academic discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last two millennium', 'two millennium physic', 'millennium physic part', 'physic part natural', 'part natural philosophy', 'natural philosophy along', 'philosophy along chemistry', 'along chemistry biology', 'chemistry biology certain', 'biology certain branch', 'certain branch mathematics']
Figure 4.51 3-gram terms
['physic one old academic', 'one old academic discipline', 'old academic discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last', 'inclusion astronomy last two', 'astronomy last two millennium', 'last two millennium physic', 'two millennium physic part', 'millennium physic part natural', 'physic part natural philosophy', 'part natural philosophy along', 'natural philosophy along chemistry', 'philosophy along chemistry biology', 'along chemistry biology certain', 'chemistry biology certain branch', 'biology certain branch mathematics']
Figure 4.52 4-gram terms
['physic one old academic discipline', 'one old academic discipline perhaps', 'old academic discipline perhaps old', 'academic discipline perhaps old inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion astronomy last two millennium', 'astronomy last two millennium physic', 'last two millennium physic part', 'two millennium physic part natural', 'millennium physic part natural philosophy', 'physic part natural philosophy along', 'part natural philosophy along chemistry', 'natural philosophy along chemistry biology', 'philosophy along chemistry biology certain', 'along chemistry biology certain branch', 'chemistry biology certain branch mathematics']
Figure 4.53 5-gram terms
4.3.4 Populating the term and paragraphVector Tables
After generating the terms, we use them to populate the term and paragraphVector tables.
4.3.4.1 Calculate Term Frequency
We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.
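nltk.FreqDist is essentially a counter of occurrences; the standard library collections.Counter behaves the same way for this purpose:

```python
from collections import Counter

# Count how often each 1-gram term occurs in a (processed) paragraph.
terms = ["physic", "old", "physic", "part", "old", "old"]
tf = Counter(terms)

assert tf["old"] == 3
assert tf["physic"] == 2
assert tf["part"] == 1
```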
4.3.4.2 Inserting Terms
We insert each term together with its k-gram size.
4.3.4.3 Inserting ParagraphVector
In this step we link each term with its paragraph and its term frequency by inserting these into the paragraphVector table.
tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])
Figure 4.54 Calculate Term Frequency
query1 = ("INSERT INTO term (kgram, term) VALUES (%s, %s)"
          " ON DUPLICATE KEY UPDATE kgram = kgram, term = term;")
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()
Figure 4.55 Inserting Terms into the Database
query2 = ("INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram)"
          " VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s), %s, %s);")
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term], str(kgram))
                for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()
Figure 4.56 Inserting the Paragraph Vector into the Database
4.3.5 Executing VSM Algorithm
After all paragraphs have been processed and inserted into the database, we run some stored SQL procedures to update the inverseDocFreq, BM25, and pivotNorm columns in the term and paragraphVector tables.
Now the system is ready: all terms are weighted and available for plagiarism testing.
connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))
Figure 4.57 Executing the VSM Algorithm
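The project's exact SQL formula is not shown, but the parameters (0.75, 1.5) passed to update_BM25 above match the usual Okapi BM25 constants b and k1; a hedged Python sketch of the standard per-term weight:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    # Standard Okapi BM25: an IDF factor times a saturating TF factor
    # normalised by document length.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

w1 = bm25_weight(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
w5 = bm25_weight(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
assert 0 < w1 < w5  # weight grows (sub-linearly) with term frequency
```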
4.4 Testing Plagiarism
When a user submits a text or a file to test for plagiarism, the text must first be split into paragraphs, and an inputPaper row is inserted to relate these paragraphs together.
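A minimal sketch of the paragraph splitting (the project's actual tokenizer is not shown; this assumes paragraphs are separated by blank lines):

```python
import re

def tokenize_paragraphs(text):
    # Split the submitted text on blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
assert tokenize_paragraphs(text) == ["First paragraph.", "Second paragraph.", "Third."]
```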
4.4.1 Process Paragraph
Each paragraph is then processed in a similar way to the pre-processing: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.
connection.run("INSERT INTO inputPaper (inputPaperId) VALUES('');")
paragraphs = tokenizeParagrapgs(text)
Figure 4.58 Tokenizing and Linking the Paragraphs Together
for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1:
        continue
    cursor = connection.run("INSERT INTO inputParagraph (content,inputPaperId)"
                            " VALUES (%s,%s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)
Figure 4.59 Process input paragraphs
def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term Frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = ("INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) "
             "SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s;")
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram))
                    for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()
Figure 4.60 Populate input paragraph vector
4.4.2 Calculate Similarity
After all data has been inserted, we call the calculateSimilarity stored procedure in SQL to
calculate the similarity between all inserted paragraphs and the original paragraphs. The
procedure uses the BM25 weights, and the thresholds for the different k-grams are passed to it.
4.4.3 Get Results
Finally, we fetch the results from the similarity table, format them as JSON, and send them to
the client interface.
connection.callProcedure('calculateSimilarity', ('BM25', 0, 0, 2, 5, 8))
Figure 4.61 Calculate Similarity
def getResults(connection, paperId, paragraphsNum, beginTime):
    cursor = connection.callProcedure('getResults', (paperId,))
    columnNames = ('inParagraph', 'originalParagraph', 'page', 'paperTitle',
                   'volume', 'issue', 'journal', 'publisher', 'issn', 'similarity',
                   'paragraphMagnitude', 'inputParagraphMagnitude')
    for results in cursor.stored_results():
        all = results.fetchall()
        finalResults = []
        for result in all:
            temp = []
            for index, name in enumerate(columnNames):
                value = result[index + 2]
                if name == 'inParagraph' or name == 'originalParagraph':
                    value = value.strip()
                temp.append((name, value))
            temp = dict(temp)
            finalResults.append(temp)
Figure 4.62 Get Results
4.5 The VSM Algorithm
4.5.1 Calculating similarity
The similarity between an input paragraph and the paragraphs in the dataset is calculated using
the dot product. To optimize performance without affecting accuracy, we represent each paragraph
as 1-grams through 5-grams and calculate similarity on every representation, but after each
calculation we choose the paragraphs to which we limit further calculations, as discussed below.
4.5.1.1 Pseudo code of the dot product

dotProduct(k-gram, threshold, limiting conditions):
    Select the paragraphs to limit the calculation to, according to the limiting conditions
    and the calculating similarity plan.
    For each (paragraph, newInputParagraph) pair from the selected paragraphs:
        score = Σ_{kgram ∈ q∩d} count(kgram, inputParagraph) × weight(kgram, paragraph)
        If score > threshold:
            Insert score into the similarity table
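In plain Python the scoring loop of this pseudo code might be sketched as follows (illustrative only; in the system this sum is computed by a SQL join between paragraphVector and inputParagraphVector followed by aggregation):

```python
def dot_product(input_counts, paragraph_weights, threshold):
    """Sparse dot product over the common k-gram terms only:
    sum of count(term, inputParagraph) * weight(term, paragraph).
    Returns the score when it exceeds the threshold, else None."""
    score = sum(count * paragraph_weights[term]
                for term, count in input_counts.items()
                if term in paragraph_weights)
    return score if score > threshold else None
```

Only terms present in both vectors contribute, which is why the cost is driven by the number of common terms in the following complexity analysis.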
4.5.1.2 Time complexity analysis

Selecting the paragraphs to which we limit our calculations is done by joining the similarity
table with the paragraph table, which takes Θ(n log m), where m is the maximum of the paragraph
table size and the similarity table size, and n is the minimum of them.
The most costly parts are the inner join of the paragraphVector and inputParagraphVector tables,
and the aggregation of the resulting table.
Inner join time complexity: Θ(M log N + |M ∩ N|)
Aggregation time complexity: Θ(|M ∩ N| · log(#ip · #p))
Where:
M: size of the inputParagraphVector table, i.e. the sum of unique terms in each input paragraph.
N: size of the paragraphVector table, i.e. the sum of unique terms in each paragraph.
|M ∩ N|: sum of common terms between each (paragraph, input paragraph) pair.
#p: number of paragraphs in the dataset that have common terms with input paragraphs.
#ip: number of input paragraphs that have common terms with dataset paragraphs.
To optimize the dot product we need to minimize unnecessary |M ∩ N|, or limit the number of
paragraphs and input paragraphs that we operate on.
4.5.1.3 Design discussions based on complexity analysis

It is recommended to periodically empty the similarity and inputParagraph tables and store them
in separate backup storage, to speed up the join operation in the paragraph-selection step (the
first step).
Calculating similarity is only logarithmically proportional to the dataset size, which means we
can collect as much data as we can without a fatal impact on the system's response performance.
Since the complexity of the dot product operation depends on the number of common terms, we want
a calculating similarity plan which minimizes unnecessary common terms, and joins paragraphs with
many common terms only when they have a high probability of being plagiarized.
This is done by starting the similarity calculation with large k-grams, since matches on them
are much less probable than on lower ones; if possible plagiarism is found, we limit the
remaining calculations to the suspected paragraphs.
If we compare paragraphs on 1-grams or 2-grams we will find many common terms that do not imply
plagiarism, but if we compare them on 5-grams or 4-grams we will find common terms only when
plagiarism is highly probable.
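To make this concrete, a tiny illustrative helper (hypothetical; not the project's generateTerms) that produces the k-word shingles of a token list:

```python
def kgrams(words, k):
    """All k-word shingles of a token list. Two texts sharing a 5-gram
    are far stronger plagiarism evidence than two sharing a 1-gram."""
    return [' '.join(words[i:i + k]) for i in range(len(words) - k + 1)]
```

A 30-word paragraph yields 26 different 5-grams, and even a single shared 5-gram between two paragraphs is already a strong signal.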
4.5.1.4 Calculating similarity plan

1. Calculate similarity on 5-grams.
2. If any match is found: limit later calculations to the matched paragraphs. Else: do later
   calculations on all paragraphs.
3. Repeat for 4-grams, 3-grams, and 2-grams.
4. If any match was found in the previous calculations: calculate similarity on 1-grams.
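One possible reading of this plan as Python pseudocode (the calc callback is a hypothetical stand-in for one dotProduct call; candidates=None means "all paragraphs"):

```python
def similarity_plan(calc, thresholds):
    """Cascade from 5-grams down to 2-grams. Whenever a level finds
    matches, later levels are restricted to those paragraphs; the noisy
    1-gram pass runs only if some higher level matched."""
    candidates = None          # None means: consider all paragraphs
    matched_any = False
    for k in (5, 4, 3, 2):
        matches = calc(k, thresholds[k], candidates)
        if matches:
            candidates = matches
            matched_any = True
    if matched_any:
        calc(1, thresholds[1], candidates)
```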
4.5.2 K-means and Clustering
Text document clustering differs from ordinary clustering because there are as many dimensions
as there are terms in the dataset. To avoid iterating over every term, we use the fact that each
document/paragraph contains only a small number of terms, and 'iterate' (or join, since we are
using an RDBMS) only over the unique terms of each document.
The text clustering K-means algorithm is described in the following flowchart.
1. Start: choose c random centroids from the dataset; set i = 0.
2. Calculate the similarity between each (centroid, paragraph) pair using the dot product.
3. Assign each paragraph to the centroid with maximum similarity.
4. Move each centroid to the mean of the points assigned to it.
5. Cost = ( Σ_c Σ_{p∈c} Σ_{w∈p∪c} (weight(w, p) − weight(w, c))² ) / (number of paragraphs in dataset)
6. Increment i; while the maximum step a centroid moved > epsilon and i ≤ maxIter, repeat from
   step 2; otherwise end.

Figure 4.63 Flowchart of the K-means text clustering algorithm

4.5.2.1 Time complexity analysis
Define:
K: number of clusters / centroids
P: number of paragraphs
N: sum of unique terms in each paragraph, also the number of rows in the paragraphVector table:
N = Σ_p Σ_{t∈p} 1
M: sum of unique terms in each centroid, also the number of rows in the centroid table:
M = Σ_c Σ_{t∈c} 1
where t is a term, p a paragraph, and c a centroid.
|M ∩ N|: sum of unique terms that appear in both a paragraph and a centroid, for each
(paragraph, centroid) pair; also the number of rows resulting from joining the paragraph and
centroid tables on term:
|M ∩ N| = Σ_p Σ_c Σ_{t∈p and t∈c} 1 = Σ_p Σ_c Σ_{t∈p∩c} 1
maxIter: maximum number of iterations before the program terminates even if it has not
converged yet.
The most expensive part of the main loop is the inner join between paragraphs and centroids on
term, followed by the aggregation that calculates similarity to assign clusters; note that the
mathematical operations in the aggregation part (multiplication and summation) may have larger
hidden constants.
Assuming a B-tree index on the primary key, the time complexity of the join is:
Best case, with no duplicates: Ω(M log N).
General case: Θ(M log N + |M ∩ N|).
If both tables were indexed, the join complexity could be linear (M + N) instead of M log N,
but the cost of inserting records into the centroid table in the centroid-moving step would then
be n log n instead of just n.
The time complexity of the aggregation is Θ(|M ∩ N| · log(K·P)).
So the time complexity of the K-means algorithm is:
O(maxIter · (M log N + |M ∩ N| · log(K·P)))
The time complexity of the whole clustering operation is a small integer multiple of the K-means
complexity, since K-means is repeated multiple times to avoid local optima.
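A minimal Python sketch of the assignment step of this K-means, assuming sparse term-weight dictionaries (illustrative; the system performs the equivalent join on term and aggregation in SQL):

```python
def assign_clusters(paragraphs, centroids):
    """For each sparse paragraph vector, pick the centroid with maximum
    dot-product similarity, iterating only over the terms the paragraph
    actually contains instead of looping over all dimensions."""
    assignment = {}
    for pid, vec in paragraphs.items():
        best, best_score = None, -1.0
        for cid, cvec in centroids.items():
            score = sum(w * cvec.get(t, 0.0) for t, w in vec.items())
            if score > best_score:
                best, best_score = cid, score
        assignment[pid] = best
    return assignment
```

Note that a paragraph sharing no term with any centroid scores zero everywhere and is assigned arbitrarily, which is the issue discussed next.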
Issue:
Each paragraph in the dataset is assigned to a centroid by measuring the similarity between
both; if a paragraph has no common term with any of the centroids, not only will it have zero
similarity, but it will have n
4.6 Server Side
4.6.1 Handling Routing
This is the home route for testing a document for plagiarism. First we check whether the
pre-processing is running; if it is, we do not allow the user to submit any test in the meantime.
app.get('/', function(req, res) {
    updateFooterInfo();
    var q = "SELECT trainingOn FROM siteInfo WHERE id = 1;";
    connection.query(q, function(err, rows, field) {
        if (err) return res.status(404).send("Error: " + err);
        if (rows.length === 0 || !rows[0].trainingOn) {
            res.render('home');
        } else if (rows[0].trainingOn) {
            res.render('no_test_now');
        }
    });
});
Figure 4.64 Home Page Routing
app.get('/admin/pre_process', function(req, res) {
    updateFooterInfo();
    var q = "SELECT trainingOn FROM siteInfo WHERE id = 1;";
    connection.query(q, function(err, rows, field) {
        if (err) return res.status(404).send("Error: " + err);
        if (rows.length === 0 || !rows[0].trainingOn) {
            var query = "SELECT COUNT(*) AS num FROM paragraph WHERE processed=false;";
            connection.query(query, function(err, rows, field) {
                if (err) return res.status(404).send("Error: " + err);
                var paragraphNums = rows[0].num;
                res.render('train', {
                    paragraphNums: paragraphNums,
                    word: "Unprocessed Paragraphs",
                    load: false
                });
            });
        } else if (rows[0].trainingOn) {
            res.render('train', {word: 'running ...', load: true});
        }
    });
});
Figure 4.65 Pre-Process Page Routing
This is the route for the system's pre-processing page, where we can follow the progress of the
pre-processing if it is already running and processing the paragraphs in the database.
Otherwise, it shows the status of the database, including the number of unprocessed paragraphs,
and a button to start the pre-processing.
4.6.2 Running Python System
When the user connects to the server, a socket session id is assigned to him. Then, if the
server receives a submitTest message from the client, the server reads the text sent from the
client, runs the Python code for testing plagiarism, and passes the text to it.
While the Python code is running, it keeps sending progressUpdateTest messages to update the
progress bar shown in the client; when the Python code sends a doneTest message, the server
passes the results and score of the plagiarism test to the client.
io.on('connection', function(client) {
    client.on('submitTest', function(data) {
        var text = data.text;
        pyshell = new PythonShell('/python/main.py', {mode: 'text', args: ['test', 'browser']});
        pyshell.send(text);
        pyshell.on('message', function(message) {
            message = JSON.parse(message);
            var results;
            if (message.done)
                client.emit('doneTest', {
                    data: message.data,
                    score: message.score,
                    time: message.time,
                    scoreValue: message.scoreValue
                });
            else
                client.emit('progressUpdateTest', {
                    kind: message.kind,
                    percent: message.percent
                });
        });
        pyshell.end(function(err) {
            if (err) throw err;
        });
    });
});
Figure 4.66 Communicating between the Server and
the Core Engine for testing plagiarism
After a socket session id has been assigned to the user, if the server receives a submitTrain
message from the client, the server runs the Python code for vectorization.
While the Python code is running, it keeps sending progressUpdateTrain messages to update the
progress bar shown in the client; when the Python code sends a doneTrain message, the server
passes on the status and timings of the pre-processing.
io.on('connection', function(client) {
    client.on('submitTrain', function(data) {
        var q = "INSERT INTO siteInfo (id,trainingOn) VALUES(1, true) " +
                "ON DUPLICATE KEY UPDATE trainingOn = true";
        connection.query(q, function(err, rows, field) {
            if (err) console.error("Error: " + err); // no `res` object in a socket handler
        });
        pyshell = new PythonShell('/python/main.py', {mode: 'text', args: ['train', 'browser']});
        pyshell.on('message', function(message) {
            message = JSON.parse(message);
            if (message.done) {
                io.emit('doneTrain', {
                    processedParagraphs: message.processedParagraphs,
                    remainParagraphs: message.remainParagraphs,
                    insertTime: message.insertTime,
                    inverseDocVecTime: message.inverseDocVecTime,
                    bm25Time: message.bm25Time,
                    pivotNormTime: message.pivotNormTime,
                    totalTime: message.totalTime
                });
            } else {
                io.emit('progressUpdateTrain', {kind: message.kind, percent: message.percent});
            }
        });
        pyshell.end(function(err) {
            if (err) throw err;
        });
    });
})
Figure 4.67 Communicating between the Server and
the Core Engine for Pre-processing
4.7 Client Side
After the input text is completely tested and the plagiarized parts are identified, we use the
LCS (Longest Common Subsequence) algorithm to detect the similar parts between the input and
the matched paragraph.
function LCS(a, b) {
    var m = a.length, n = b.length, C = [], i, j;
    for (i = 0; i <= m; i++) C.push([0]);
    for (j = 0; j < n; j++) C[0].push(0);
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            C[i+1][j+1] = a[i] === b[j] ? C[i][j] + 1 : Math.max(C[i+1][j], C[i][j+1]);
    return (function bt(i, j) {
        if (i * j === 0) { return ""; }
        if (a[i-1] === b[j-1]) { return bt(i-1, j-1) + a[i-1] + ' '; }
        return (C[i][j-1] > C[i-1][j]) ? bt(i, j-1) : bt(i-1, j) + '\n';
    }(m, n));
}
Figure 4.68 Longest Common Subsequence (LCS) Algorithm
for (var i = 0; i < data.data.length; i++) {
    var p1 = data.data[i].inParagraph;
    var p2 = data.data[i].originalParagraph;
    var lcs_for_1 = LCS(p1.split(' '), p2.split(' '));
    lcs_for_1 = lcs_for_1.split('\n').filter(function(s) { return s; })
                         .map(function(s) { return s.trim(); });
    var tempStr = '', currentCursor = 0, charCounter = 0;
    for (j = 0; j < lcs_for_1.length; j++) {
        var sub = lcs_for_1[j];
        tempStr += p1.slice(currentCursor, p1.indexOf(sub));
        var t = p1.slice(p1.indexOf(sub), p1.indexOf(sub) + sub.length);
        tempStr += '<span class="highlight-text-1">' + t + '</span>';
        charCounter += t.length;
        currentCursor = p1.indexOf(sub) + sub.length;
    }
    tempStr += p1.slice(currentCursor, p1.length);
}
Figure 4.69 Highlighting the common parts found by LCS
The detected parts returned from LCS are used to highlight the input paragraph and the matched
paragraph, and to calculate a similarity score.
4.8 The GUI of the System
This is the interface of the system, where the user inputs the text to be tested and starts the
analysis; the system then runs until it finishes, and the results are shown.
Figure 4.70 Submitting an input document
Figure 4.71 The Results of the Process Part 1
The results appear as shown in Figure 4.72:
1. The yellow highlighted text is the plagiarized text in the input text.
2. The red highlighted text is the source of the plagiarized text.
Our system also shows the time of the process, the percentage of similarity between the two
texts, and information about the paper that contains the source text.
Figure 4.72 The Results of the Process Part 2
Chapter 5 Results and Discussion
5.1 Dataset of the Parser
After parsing the scientific papers that were downloaded by the Crawler and passed to the
Parser, the statistics of the resulting dataset are as follows.
Table 1 Statistics of the Parser

Publisher      | Num of Journals | Num of Papers | Avg Pages/Paper | Avg Paragraphs/Paper | Avg Words/Paper
IEEE           | 130             | 609           | 10              | 149                  | 9363
Springer       | 59              | 541           | 25              | 386                  | 21603
Science Direct | 4               | 1206          | 27              | 406                  | 21092
Figure 5.1 Number of Papers Published per Year in IEEE
(Bar chart: number of papers on the y-axis against year of publishing on the x-axis.)
Figure 5.2 Number of Papers Published per Year in Springer
Figure 5.3 Number of Papers Published per Year in Science Direct
We tested the plagiarism engine on vast real-world data without the clustering process; the
dataset contains text from English academic papers collected from IEEE, Springer, and Science
Direct.
5.2 Exploring dataset
We tested the system on two datasets that differ in size and compared the performance of the
system on both; the small one has 15K paragraphs and the big one has 50K paragraphs.
Here are some useful and insightful statistics about the two datasets.
5.2.1 Small dataset (15K)

Table 2 Dataset Statistics

Number of paragraphs       | 15342
Number of paragraphs       | 15065
Total length of paragraphs | 462103
Average paragraph length   | 30.6739
Table 3 Unique Terms count in each Paragraph

k-gram                                             | Unique k-gram terms per paragraph (sum) | Average Document Frequency
1-gram                                             | 358140                                  | 11.0725
2-gram                                             | 417189                                  | 1.5651
3-gram                                             | 419472                                  | 1.1462
4-gram                                             | 410394                                  | 1.0766
5-gram                                             | 399109                                  | 1.0513
Sum of all k-grams (size of paragraphVector table) | 2004304                                 | -
Table 4 Unique Terms count in Dataset

k-gram                                  | Unique k-gram terms in dataset
1-gram                                  | 32345
2-gram                                  | 266553
3-gram                                  | 365979
4-gram                                  | 381179
5-gram                                  | 379651
Sum of all k-grams (size of term table) | 1425707
5.2.2 Big dataset (50K)

Table 5 Dataset Statistics

Number of paragraphs       | 50792
Total length of paragraphs | 1561846
Average paragraph length   | 30.7498
Table 6 Unique Terms count in each Paragraph

k-gram                                             | Unique k-gram terms per paragraph (sum) | Average Document Frequency
1-gram                                             | 1206539                                 | 13.6523
2-gram                                             | 1402768                                 | 1.7401
3-gram                                             | 1409069                                 | 1.1597
4-gram                                             | 1379124                                 | 1.0835
5-gram                                             | 1342290                                 | 1.0572
Sum of all k-grams (size of paragraphVector table) | 6739790                                 | -
Table 7 Unique Terms count in Dataset

k-gram                                  | Unique k-gram terms in dataset
1-gram                                  | 88376
2-gram                                  | 806165
3-gram                                  | 1215077
4-gram                                  | 1272892
5-gram                                  | 1269639
Sum of all k-grams (size of term table) | 4652149
5.3 Performance
The data was preprocessed on a machine with the following specifications:
Processor: 4x Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Memory: 3895MB
Operating System: Arch Linux
Kernel: 4.4.1
And the performance was found as follows:
Figure 5.4 Response time against number of paragraphs tested on small dataset
Table 8 Processing time of each module in Plagiarism Engine

Module                                              | Small dataset (s) | Big dataset (s)
Natural language processing and vectorization       | 2899.04           | 15982.93
Calculating document frequency and other statistics | 36.77             | 171.05
Calculating BM25 weights                            | 65.49             | 256.52
Calculating pivoted length normalization weights    | 53.07             | 225.14
Total pre-processing time                           | 3054.45           | 16658.95
Typical response time                               | 10-20             | 60
(Chart: response time in seconds on the y-axis against the number of tested paragraphs,
0 to 120, on the x-axis.)
Figure 5.5 Screenshot of the System Performance from the System GUI
Note that these measurements include the time of initiating connections, reading/writing from
the database, and other overhead, not only the processing time.
5.4 Detecting plagiarism
We tested the plagiarism engine once, since the size of the dataset has no considerable effect
on the accuracy of the system and both datasets give similar results; we tested the system with
input paragraphs plagiarized from a randomly chosen paragraph of the dataset.
Original paragraph: “. Finally, our proposed fault detection schemes and almost all of the
previously reported ones have been implemented on the recent Xilinx Virtex FPGAs, and their area and
delay overheads have been derived and compared. The FPGA implementation results show the low area
and delay overheads for the proposed fault detection schemes.”
With id = 2766
We will use only “Finally, our proposed fault detection schemes and almost all of the previously
reported ones have been implemented on the recent Xilinx Virtex FPGAs” to test the system.
We chose parameters as follows:
Table 9 Parameters
value
k for BM25 1.5
b for BM25 0.75
b for pivoted length normalization 0.75
Threshold for 5-gram 0
Threshold for 4-gram 0
Threshold for 3-gram 2
Threshold for 2-gram 5
Threshold for 1-gram 8
The paragraphs we used in testing and the results of testing are shown in the table below.
5.4.1 Percentage score functions
We used the dot product similarity function with BM25 weighting for evaluating similarity;
however, we need a normalized similarity function to get a percentage score that is easy for
users to understand, so we used cosine similarity and the Longest Common Subsequence (LCS).
Cosine similarity: a function similar to the dot product with built-in normalization; it is the
dot product score divided by the maximum possible similarity. It is 100% if the input paragraph
has the same lemmas as all non-stop words of the original paragraph; changing the phrase tense
or reordering it will not affect the cosine similarity score, while copying only half the
original phrase gives a score of about 50%, and so on.
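A sketch of this normalization with simple term-weight dictionaries (hypothetical names; the system computes the equivalent with BM25-weighted vectors in SQL): the score is the dot product divided by the maximum possible score, i.e. the original paragraph matched against itself.

```python
def percentage_score(input_vec, original_vec):
    """Dot product divided by the maximum possible similarity (the
    original paragraph against itself): with unit weights, copying
    half of the original phrase yields a score of 50%."""
    dot = sum(w * original_vec.get(t, 0.0) for t, w in input_vec.items())
    max_score = sum(w * w for w in original_vec.values())
    return dot / max_score if max_score else 0.0
```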
LCS: the LCS algorithm finds the common parts between the input and the original paragraph and
highlights them; the percentage score is the number of common words divided by the number of
words in the original paragraph. Unlike cosine similarity, LCS compares the words literally
without NLP analysis, so it cannot detect paraphrased plagiarism.
Table 10 Testing Paragraphs and Results

1. "Finally, our proposed fault detection schemes and almost all of the previously reported
   ones have been implemented on the recent Xilinx Virtex FPGAs"
   Description: Copied. Detected: Yes. Cosine similarity: 50%. LCS similarity: 43%.
2. "Finally, we propose fault detection schemes and most of the previous reports will have
   implementation on the recent Xilinx Virtex FPGAs"
   Description: Slightly paraphrased (changed phrase tense and stop words). Detected: Yes.
   Cosine similarity: 40%. LCS similarity: 22%.
3. "At last, we suggest error detection schemes and nearly all of the detected errors were
   applied on the past methods."
   Description: Highly paraphrased (words replaced with synonyms or negated antonyms).
   Detected: No.
4. "almost all of the previously reported ones and our proposed fault detection schemes have
   been implemented on the recent Xilinx Virtex FPGAs"
   Description: Slightly rearranged. Detected: Yes. Cosine similarity: 40%. LCS similarity: 28%.
5. "most of the ones reported previously and fault detection schemes that we proposed have been
   recently implemented on the Virtex Xilinx FPGAs"
   Description: Moderately rearranged (no 4 successive non-stop words as in the original
   paragraph). Detected: Yes. Cosine similarity: 40%. LCS similarity: 13%.
6. "schemes of fault detection that we proposed and nearly all of the ones previously reported
   have been recently Xilinx implemented on Virtex FPGAs"
   Description: Highly rearranged (no 3 words in the same order as the original paragraph).
   Detected: Yes. Cosine similarity: 40%. LCS similarity: 24%.
7. "proposed, our fault schemes Finally detection and almost reported all of the implemented
   previously ones Xilinx have been FPGAs on the recent Virtex"
   Description: Extremely rearranged (no two words in the same order as the original paragraph).
   Detected: No.
5.5 Discussing results
The system has a reasonably fast response, and the similarity-calculation step (which happens to
take a small fraction of the total response time) is the only step that depends on the size of
the dataset. Luckily, it is only logarithmically proportional to the dataset size, which implies
that the system is scalable to large datasets and will have a fast response on big data.
Increasing the volume of the data neither hurt the system's response performance nor affected
its accuracy in plagiarism detection.
The system can detect slightly paraphrased plagiarism because we use shallow NLP techniques to
lemmatize and remove stop words, so the system can detect plagiarized sentences with changes in
grammatical tense, sentence structure, or stop words.
Shallow NLP cannot detect strongly paraphrased sentences or idea plagiarism; to detect those
sentences we would have to use complex deep NLP methods, which would consume huge preprocessing
and response time. Note that the shallow NLP processing already occupies the majority of the
system's time complexity.
Since we use the bag-of-words method, our system is insensitive to word order; however, to
optimize performance we use the k-gram fingerprinting method, so the system deals with k-grams
(sequences of consecutive words) rather than each term alone. As discussed in the implementation
section, we calculate similarity on 2- to 5-grams, so the system can detect a rearranged
plagiarized sentence as long as at least two non-stop words remain in the original order and
those words are not very common.
This explains why the system detects the first three rearranged sentences but not the last one.
Note that the two words "Virtex FPGAs" are rare enough to imply plagiarism, and the system did
detect another paragraph from the same paper that contains this modified bi-gram; however, the
last sentence shares no 2-gram with the original sentence, so it is not detected.
We properly implemented the clustering algorithm and tested it on a very small dataset,
obtaining satisfying clustering results but very slow performance (as expected from a data
mining algorithm), so we could not use it on a real dataset.
If the system were used in a real-world application and hosted on a high-performance server,
the clustering module could then be applied, since the offline processing would not be a problem
and could be done on HPCC (High Performance Computing Clusters), and it would be very useful for
speeding up the response time.
Chapter 6 Conclusion
Plagiarism is the act of stealing someone else's work and claiming it as one's own, and we now
have a system to detect this act.
Our system consists of three parts: ETL (Parser, Crawler), Plagiarism Engine, and GUI for
testing a document.
For the ETL: it extracts papers from Open Access Journals on the internet, parses those papers
by extracting the paper info and the paper data, and loads the data into the database.
For the Plagiarism Engine: it performs text processing on the paragraphs, then tokenizes the
result and generates k-gram terms.
For the GUI: it receives the document to be tested, splits the document into paragraphs,
processes the text and generates k-grams, then compares the k-grams against the database using
the VSM algorithm for similarity, and highlights the matching parts that contain plagiarism.
Chapter 7 Appendix
7.1 Entity-Relation Diagram (ERD)
Figure 7.1 ERD of the plagiarism Engine database
3. paper table: contains info about the paper, such as the paper title, author, Digital Object
   Identifier (DOI), and other information.
4. paragraph table: contains the text of each paragraph, the length of the paragraph (after
   stop word removal), a normalizing magnitude of the paragraph (the similarity between the
   paragraph and an identical copy of it), a foreign key to the paperId, and other information.
5. paragraphVector table: contains a vectorized representation of paragraphs with different
   weights (TF, pivoted length normalization, BM25).
6. similarity table: contains the calculated similarity between paragraphs and input paragraphs.
7. inputParagraph(Vector) tables: similar to the original tables, but contain less information.
7.2 Stored procedures
CREATE DEFINER=`root`@`localhost` PROCEDURE `calculateMagnitude`()
BEGIN
    UPDATE paragraph
    INNER JOIN (SELECT paragraphId, SUM(BM25 * termFreq) AS magnitude
                FROM paragraphVector
                WHERE kgram = 1
                GROUP BY paragraphId) AS PV
        ON paragraph.paragraphId = PV.paragraphId
    SET paragraph.magnitude = PV.magnitude;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `calculateSimilarity`(
    IN method varchar(9), IN paperId INT,
    IN threshold5 VARCHAR(10), IN threshold4 VARCHAR(10), IN threshold3 VARCHAR(10),
    IN threshold2 VARCHAR(10), IN threshold1 VARCHAR(10))
BEGIN
    CALL dotProduct(method, paperId, '5', '', 0, threshold5);
    CALL dotProduct(method, paperId, '4', '5', 0, threshold4);
    CALL dotProduct(method, paperId, '3', '4, 5', 0, threshold3);
    CALL dotProduct(method, paperId, '2', '3, 4, 5', 0, threshold2);
    CALL dotProduct(method, paperId, '1', '2, 3, 4, 5', 1, threshold1);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `clustering`(IN c INT, IN maxIter INT)
BEGIN
    DECLARE minCost FLOAT;
    DECLARE i INT;
    DECLARE cost FLOAT;
    SET minCost := 10000000000;
    SET i = 0;
    WHILE (i < 2) DO
        SET i = i + 1;
        CALL kmeans(c, maxIter, cost);
        IF cost < minCost THEN
            SET @realStable := @stable;
            DROP TABLE IF EXISTS centroid;
            DROP TABLE IF EXISTS paragraphCluster;
            CREATE TABLE centroid LIKE tempCentroid;
            INSERT INTO centroid (SELECT * FROM tempCentroid);
            CREATE TEMPORARY TABLE paragraphCluster AS
                (SELECT paragraphId, clusterId FROM paragraph);
            SET minCost := cost;
        END IF;
    END WHILE;
    SET @minCost := minCost;
    UPDATE paragraph AS P
    INNER JOIN paragraphCluster AS PC ON P.paragraphId = PC.paragraphId
    SET P.clusterId = PC.clusterId;
    DROP TABLE IF EXISTS tempCentroid;
    DROP TABLE IF EXISTS paragraphCluster;
    ALTER TABLE centroid ADD INDEX termId_centroidIndex USING BTREE (termId);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `dotProduct`(
    IN method varchar(9), IN paperId INT, IN kgram varchar(1),
    IN conditionkgram varchar(50), IN conditionkgramOnly INT(1), IN threshold VARCHAR(10))
BEGIN
    DECLARE conditionstr VARCHAR(150) DEFAULT '';
    DECLARE conditionjoin VARCHAR(350) DEFAULT '';
    IF (conditionkgram != '') THEN
        IF (conditionkgramOnly = 0) THEN
            SET conditionjoin = 'LEFT OUTER JOIN similarity AS S2 ON (P.paragraphId = S2.paragraphId AND IP.inputParagraphId = S2.inputParagraphId)';
            SET conditionstr = CONCAT('AND (S2.kgram IN(', conditionkgram, ') OR S2.kgram IS NULL)');
        ELSE
            SET conditionjoin = 'INNER JOIN similarity AS S2 ON (P.paragraphId = S2.paragraphId AND IP.inputParagraphId = S2.inputParagraphId)';
            SET conditionstr = CONCAT('AND S2.kgram IN( ', conditionkgram, ' ) ');
        END IF;
    END IF;
    SET @s = CONCAT('INSERT INTO similarity (paragraphId, inputParagraphId, kgram, similarity)
        SELECT DISTINCT sim.paragraphId, sim.inputParagraphId, sim.kgram, sim.score
        FROM paragraph AS P
        INNER JOIN (
            SELECT paragraphId, inputParagraphId, IPV.kgram,
                   SUM(PV.', method, ' * IPV.termFreq) AS score
            FROM paragraphVector AS PV
            INNER JOIN inputParagraphVector AS IPV ON PV.termId = IPV.termId
            WHERE IPV.kgram = ', kgram, '
            GROUP BY paragraphId, inputParagraphId, IPV.kgram
            HAVING score >= ', threshold, '
        ) AS sim ON P.paragraphId = sim.paragraphId
        INNER JOIN (
            SELECT * FROM inputParagraph WHERE inputPaperId = ', paperId, '
        ) AS IP ON sim.inputParagraphId = IP.inputParagraphId
        ', conditionjoin, ' ', conditionstr, '
        AND P.clusterId = IP.clusterId;');
    PREPARE stmt FROM @s;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `fastUpdateParagraph`(IN paragraphId INT)
BEGIN
    PREPARE stmt1 FROM
        'UPDATE term
         INNER JOIN (SELECT termId, COUNT(paragraphId) AS idf
                     FROM paragraphVector WHERE paragraphId = ?
                     GROUP BY termId) AS PV ON term.termId = PV.termId
         SET term.inverseDocFreq = IFNULL(term.inverseDocFreq, 0) + PV.idf';
    SET @id := paragraphId;
    EXECUTE stmt1 USING @id;
    DEALLOCATE PREPARE stmt1;
    PREPARE stmt2 FROM
        'UPDATE paragraphVector AS PV, paragraph AS P, term AS T, datasetInfo AS I
         SET PV.BM25 = ((I.k+1) * PV.termFreq)
                       / (PV.termFreq + I.k*(1-I.b+I.b*(P.length/I.avdl)))
                       * LOG10((I.numDoc+1)/T.InverseDocFreq)
         WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId
           AND P.paragraphId = ?';
    EXECUTE stmt2 USING @id;
    DEALLOCATE PREPARE stmt2;
    PREPARE stmt3 FROM
        'UPDATE paragraphVector AS PV, paragraph AS P, term AS T, datasetInfo AS I
         SET PV.pivotNorm = (LN(1+LN(1+PV.termFreq)) / (1-I.b+I.b*(P.length/I.avdl)))
                            * LOG10((I.numDoc+1)/T.inverseDocFreq)
         WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId
           AND P.paragraphId = ?';
    EXECUTE stmt3 USING @id;
    DEALLOCATE PREPARE stmt3;
    PREPARE stmt4 FROM
        'UPDATE datasetInfo AS D, paragraph AS P
         SET D.numDoc = D.numDoc + 1,
             D.avdl = (P.length + D.totalLength) / D.numDoc,
             D.totalLength = D.totalLength + P.length
         WHERE D.id = 1 AND P.paragraphId = ?';
    EXECUTE stmt4 USING @id;
    DEALLOCATE PREPARE stmt4;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `findCluster`()
BEGIN
    CREATE OR REPLACE VIEW centroidSimilarity AS (
        SELECT centroidId, inputParagraphId,
               SUM(centroid.value * inputParagraphVector.termFreq) AS score
        FROM centroid
        INNER JOIN inputParagraphVector ON centroid.termId = inputParagraphVector.termId
        WHERE inputParagraphVector.kgram = 1
        GROUP BY centroidId, inputParagraphId
    );
    UPDATE inputParagraph AS IP
    INNER JOIN (SELECT CS1.inputParagraphId, CS1.centroidId
                FROM centroidSimilarity AS CS1
                INNER JOIN (SELECT inputParagraphId, MAX(score) AS m
                            FROM centroidSimilarity
                            GROUP BY inputParagraphId) AS CS2
                    ON CS1.inputParagraphId = CS2.inputParagraphId
                WHERE CS1.score = CS2.m) AS maxSimilarity
        ON IP.inputParagraphId = maxSimilarity.inputParagraphId
    SET IP.clusterId = maxSimilarity.centroidId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `getResults`(IN inPaperId INT)
BEGIN
    SELECT s1.inputParagraphId, s1.paragraphId, s3.inParagraph, s4.originalParagraph,
           s4.pageNumber, s5.paperTitle, s5.DOI, s5.volume, s5.issue,
           s6.journal, s6.publisher, s6.ISSN, s7.author, s1.similarity, s4.magnitude
    FROM similarity AS s1
    INNER JOIN (SELECT paragraphId, m.inputParagraphId
                FROM similarity
                INNER JOIN (SELECT inputParagraphId, MAX(similarity) AS ms
                            FROM similarity
                            GROUP BY inputParagraphId) AS m
                    ON similarity.inputParagraphId = m.inputParagraphId
                   AND similarity.similarity = m.ms) AS s2
        ON s1.inputParagraphId = s2.inputParagraphId AND s1.paragraphId = s2.paragraphId
    INNER JOIN (SELECT inputParagraphId, inputPaperId, content AS inParagraph
                FROM inputParagraph) s3
        ON s1.inputParagraphId = s3.inputParagraphId
    INNER JOIN (SELECT paperId, paragraphId, pageNumber, magnitude, content AS originalParagraph
                FROM paragraph) s4
        ON s1.paragraphId = s4.paragraphId
    INNER JOIN (SELECT paperId, DOI, title AS paperTitle, journalId, volume, issue
                FROM paper) s5
        ON s4.paperId = s5.paperId
    INNER JOIN (SELECT journalId, journal, publisher, ISSN
                FROM publisher) s6
        ON s5.journalId = s6.journalId
    INNER JOIN (SELECT paperId, GROUP_CONCAT(author SEPARATOR ', ') AS author
                FROM author
                GROUP BY paperId) s7
        ON s7.paperId = s4.paperId
    WHERE s1.kgram = 1 AND s3.inputPaperId = inPaperId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `kmeans`(IN c INT, IN maxIter INT, OUT cost FLOAT)
BEGIN
    DECLARE i INT;
    DECLARE epsilon FLOAT; -- must be FLOAT: an INT would truncate the 0.01 threshold to 0
    SET epsilon = 0.01;
    SET @stable := 0;
    DROP TABLE IF EXISTS tempCentroid;
    PREPARE stmt1 FROM 'CREATE TABLE tempCentroid AS (
        SELECT C.Id AS centroidId, P.termId AS termId, P.BM25 AS value
        FROM paragraphVector AS P
        INNER JOIN (SELECT paragraphId AS Id
                    FROM paragraph
                    ORDER BY RAND()
                    LIMIT ?) AS C
            ON P.paragraphId = C.Id
        WHERE P.kgram = 1)';
    SET @c := c;
    EXECUTE stmt1 USING @c;
    DEALLOCATE PREPARE stmt1;

    DROP TABLE IF EXISTS oldCentroid;
    CREATE TEMPORARY TABLE oldCentroid LIKE tempCentroid;

    SET i = 0;
    SET @ID := 0;
    PREPARE stmt2 FROM 'SET @ID := (SELECT DISTINCT centroidId
                                    FROM tempCentroid
                                    ORDER BY centroidId ASC
                                    LIMIT ?,1);';
    WHILE (i <= c) DO
        SET @i := i;
        SET i = i + 1;
        EXECUTE stmt2 USING @i;
        UPDATE tempCentroid SET centroidId = @i+1 WHERE centroidId = @ID;
    END WHILE;
    DEALLOCATE PREPARE stmt2;

    SET i = 0;
    mainLoop: WHILE (i < maxIter) DO
        SET i = i + 1;
        CREATE OR REPLACE VIEW paragraphCentroidSimilarity AS (
            SELECT paragraphId, centroidId,
                   SUM(paragraphVector.BM25 * tempCentroid.value) AS similarity
            FROM paragraphVector
            INNER JOIN tempCentroid ON paragraphVector.termId = tempCentroid.termId
            WHERE paragraphVector.kgram = 1
            GROUP BY paragraphId, centroidId);

        UPDATE paragraph AS P
        INNER JOIN (SELECT CS1.paragraphId, CS1.centroidId
                    FROM paragraphCentroidSimilarity AS CS1
                    INNER JOIN (SELECT paragraphId, MAX(similarity) AS m
                                FROM paragraphCentroidSimilarity
                                GROUP BY paragraphId) AS CS2
                        ON CS1.paragraphId = CS2.paragraphId
                    WHERE CS1.similarity = CS2.m) AS maxSimilarity
            ON P.paragraphId = maxSimilarity.paragraphId
        SET P.clusterId = maxSimilarity.centroidId;

        TRUNCATE oldCentroid;
        INSERT INTO oldCentroid (SELECT * FROM tempCentroid);

        CREATE OR REPLACE VIEW newCentroid AS (
            SELECT P.clusterId, PV.termId, SUM(PV.BM25) AS newValue
            FROM paragraph AS P
            INNER JOIN paragraphVector AS PV ON (P.paragraphId = PV.paragraphId)
            WHERE PV.kgram = 1
            GROUP BY P.clusterId, PV.termId);
        TRUNCATE tempCentroid;
        INSERT INTO tempCentroid (SELECT * FROM newCentroid);

        CREATE OR REPLACE VIEW clusterSize AS (
            SELECT P.clusterId AS centroidId, COUNT(P.paragraphId) AS size
            FROM paragraph AS P
            GROUP BY P.clusterId);

        UPDATE tempCentroid AS C
        INNER JOIN clusterSize AS CS ON C.centroidId = CS.centroidId
        SET C.value = C.value / CS.size;

        SET @step := (SELECT MAX(sub.delta)
                      FROM (SELECT SUM(ABS(C1.value - IFNULL(C2.value, 0))) AS delta
                            FROM tempCentroid AS C1
                            LEFT OUTER JOIN oldCentroid AS C2
                                ON (C1.centroidId = C2.centroidId AND C1.termId = C2.termId)
                            GROUP BY C1.centroidId) AS sub);

        IF @step < epsilon THEN
            SET @stable = 1;
            LEAVE mainLoop;
        END IF;
    END WHILE mainLoop;

    SET @m := (SELECT COUNT(*) FROM paragraph);
    SET cost := (SELECT (SUM(POWER((PV.BM25 - IFNULL(C.value, 0)), 2)) / @m)
                 FROM tempCentroid AS C
                 INNER JOIN paragraph AS P ON C.centroidId = P.clusterId
                 RIGHT OUTER JOIN paragraphVector AS PV
                     ON (P.paragraphId = PV.paragraphId AND C.termId = PV.termId));

    DROP VIEW IF EXISTS paragraphCentroidSimilarity;
    DROP TABLE IF EXISTS oldCentroid;
    DROP VIEW IF EXISTS newCentroid;
    DROP VIEW IF EXISTS clusterSize;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_BM25`(IN b FLOAT, IN k FLOAT)
BEGIN
    DECLARE numDoc INT;
    DECLARE avdl FLOAT;
    DECLARE totalLength INT;
    SELECT datasetInfo.numDoc INTO numDoc FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.totalLength INTO totalLength FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.avdl INTO avdl FROM datasetInfo WHERE id = 1;
    SET @k := k;
    SET @b := b;
    PREPARE stmt1 FROM 'INSERT INTO datasetInfo (id, k, b) VALUES (1, ?, ?)
        ON DUPLICATE KEY UPDATE k = VALUES(k), b = VALUES(b)';
    EXECUTE stmt1 USING @k, @b;
    UPDATE paragraphVector AS PV, paragraph AS P, term AS T
    SET PV.BM25 = ((k+1) * PV.termFreq) / (PV.termFreq + k*(1 - b + b*(P.length/avdl)))
                  * LOG10((numDoc+1)/T.inverseDocFreq)
    WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId;
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_inverseDocFreq`()
BEGIN
    UPDATE term
    INNER JOIN (SELECT termId, COUNT(paragraphId) AS idf
                FROM paragraphVector
                GROUP BY termId) AS PV
        ON term.termId = PV.termId
    SET term.inverseDocFreq = PV.idf;

    SET @numDoc := (SELECT COUNT(*) FROM paragraph);
    SET @totalLength := (SELECT SUM(length) FROM paragraph);
    SET @avdl := @totalLength/@numDoc;
    INSERT INTO datasetInfo (id, numDoc, avdl, totalLength)
    VALUES (1, @numDoc, @avdl, @totalLength)
    ON DUPLICATE KEY UPDATE numDoc = VALUES(numDoc), avdl = VALUES(avdl),
                            totalLength = VALUES(totalLength);
END

CREATE DEFINER=`root`@`localhost` PROCEDURE `update_pivotNorm`(IN b FLOAT)
BEGIN
    DECLARE numDoc INT;
    DECLARE avdl FLOAT;
    DECLARE totalLength INT;
    SELECT datasetInfo.numDoc INTO numDoc FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.totalLength INTO totalLength FROM datasetInfo WHERE id = 1;
    SELECT datasetInfo.avdl INTO avdl FROM datasetInfo WHERE id = 1;
    SET @b := b;
    PREPARE stmt1 FROM 'INSERT INTO datasetInfo (id, b) VALUES (1, ?)
        ON DUPLICATE KEY UPDATE b = VALUES(b)';
    EXECUTE stmt1 USING @b;

    UPDATE paragraphVector AS PV, paragraph AS P, term AS T
    SET PV.pivotNorm = (LN(1+LN(1+PV.termFreq)) / (1 - b + b*(P.length/avdl)))
                       * LOG10((numDoc+1)/T.inverseDocFreq)
    WHERE PV.paragraphId = P.paragraphId AND PV.termId = T.termId;
END
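For reference, the BM25 weight that `update_BM25` and `fastUpdateParagraph` assign to each (paragraph, term) pair can be checked outside SQL. Below is a minimal Python sketch (not part of the project code); the function name `bm25_weight` and the default values k = 1.2 and b = 0.75 are illustrative, since the procedures take k and b as parameters:

```python
import math

def bm25_weight(tf, dl, avdl, n_docs, doc_freq, k=1.2, b=0.75):
    """Mirror of the SQL expression:
    ((k+1)*tf) / (tf + k*(1 - b + b*(dl/avdl))) * log10((N+1)/df)."""
    length_norm = 1 - b + b * (dl / avdl)   # document length normalization
    return ((k + 1) * tf) / (tf + k * length_norm) \
           * math.log10((n_docs + 1) / doc_freq)

# a term appearing 3 times in an average-length paragraph,
# occurring in 10 of 1000 paragraphs
w = bm25_weight(tf=3, dl=100, avdl=100, n_docs=1000, doc_freq=10)
```

Note how the term-frequency factor saturates: no matter how often a term repeats, the weight stays below (k+1) times the IDF factor, which is why BM25 resists keyword stuffing.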
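Similarly, `update_pivotNorm` implements the pivoted document length normalization weighting of [8]. A hedged Python sketch of the same per-term weight (the name `pivot_norm_weight` and the default b = 0.2 are ours, not the project's):

```python
import math

def pivot_norm_weight(tf, dl, avdl, n_docs, doc_freq, b=0.2):
    """Mirror of the SQL expression:
    ln(1+ln(1+tf)) / (1 - b + b*(dl/avdl)) * log10((N+1)/df)."""
    return (math.log(1 + math.log(1 + tf))     # double-log tf dampening
            / (1 - b + b * (dl / avdl))        # pivoted length normalization
            * math.log10((n_docs + 1) / doc_freq))
```

The double logarithm dampens raw term frequency even more aggressively than BM25, while the denominator penalizes paragraphs longer than the collection average `avdl`.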
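The `kmeans` procedure follows the usual alternation: assign each paragraph to the centroid with the highest dot-product similarity, recompute each centroid as the mean of its members, and stop when no centroid moves more than epsilon (measured as an L1 distance, as in the `@step` query). The following in-memory Python sketch of that loop is ours, not the project's; dictionaries of `{term_id: weight}` stand in for the `paragraphVector` and `tempCentroid` tables:

```python
def kmeans_sparse(vectors, centroids, max_iter=50, epsilon=0.01):
    """vectors: {doc_id: {term_id: weight}}, centroids: {cid: {term_id: weight}}.
    Returns (assignments, final centroids)."""
    def dot(u, v):
        return sum(u.get(t, 0.0) * w for t, w in v.items())

    assign = {}
    for _ in range(max_iter):
        # assignment step: highest-similarity centroid per document
        assign = {d: max(centroids, key=lambda c: dot(centroids[c], v))
                  for d, v in vectors.items()}
        # update step: centroid = mean of its member vectors
        new = {}
        for c in centroids:
            members = [vectors[d] for d in assign if assign[d] == c]
            if not members:
                new[c] = centroids[c]   # keep an empty cluster's centroid
                continue
            summed = {}
            for v in members:
                for t, w in v.items():
                    summed[t] = summed.get(t, 0.0) + w
            new[c] = {t: w / len(members) for t, w in summed.items()}
        # convergence: max L1 distance moved by any centroid
        step = max(sum(abs(new[c].get(t, 0.0) - centroids[c].get(t, 0.0))
                       for t in set(new[c]) | set(centroids[c]))
                   for c in centroids)
        centroids = new
        if step < epsilon:
            break
    return assign, centroids
```

With two clearly separated term groups, e.g. documents dominated by term `'a'` versus term `'b'` and one seed centroid in each, the loop converges after the first update and each document ends up in the expected cluster.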
References
[1] T. Hoad and J. Zobel, "Methods for Identifying Versioned and Plagiarised Documents," Journal of the American
Society for Information Science and Technology, vol. 54, no. 3, pp. 203–215, 2003.
[2] K. Monostori, A. Zaslavsky and H. Schmidt, "Document Overlap Detection System for Distributed Digital Libraries,"
Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 226–227, 2000.
[3] A. Si, H. V. Leong and R. W. H. Lau, "CHECK: A Document Plagiarism Detection System," SAC ’97: Proceedings of the
1997 ACM Symposium on Applied Computing, pp. 70–77, 1997.
[4] C. Noah, H. Marcus, J. Nick, S. Cole, T. Tony and W.-D. Zach, "Plagiarism Detection," 17 March 2014. [Online].
Available: www.cs.carleton.edu/cs_comps/1314/dlibenno/final-results/plagcomps.pdf.
[5] "Euclidean vector," Wikipedia.
[6] D. M. Christopher, R. Prabhakar and S. Hinrich, "Introduction to Information Retrieval," Cambridge University Press,
2008.
[7] S. T. Piantadosi, "Zipf’s word frequency law in natural language: a critical review and future
directions," 2 June 2015.
[8] S. Amit, B. Chris and M. Mandar, "Pivoted Document Length Normalization".
[9] S. Robertson and H. Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond".
[10] S. Ullman, "Unsupervised Learning: Clustering," 2014. [Online]. Available:
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf.