A Pipeline for Scalable Text Reuse Analysis

Milad Alshomary

05.07.2018

Bauhaus Universität


Overview


● Motivation

● A Pipeline for Scalable Text Reuse Extraction

● Application on Wikipedia

● Application on Wikipedia and Common Crawl

● Conclusion

Motivation

Text Reuse (TR)

● Quoting

● Verbatim

● Paraphrasing

● Translation

● Summarization

TR Detection Applications

● METER project (Measuring Text Reuse)

● Plagiarism detection

Wikipedia vs The World

● Digital Encyclopedia

● Collaborative environment

● Giant public source of information

● Free to use


[Figure labels: Quality Flaws; Scientific community]


- Web pages = Wikipedia text + advertisements

Research Questions

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?


A Pipeline for Scalable Text Reuse Extraction

Text Reuse Pipeline

[Diagram: datasets D1 and D2 are fed into the TR Pipeline]

➔ Input: Two datasets

➔ Output: Text reuse cases

Text Preprocessing

➔ Content extraction

➔ Chunking

➔ Feature extraction

Candidate Elimination

➔ Pairwise scan

➔ Text reuse heuristics

Text Alignment

➔ Detailed scan of text reuse

➔ Picapica framework
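The three stages can be sketched as a small program. This is only an illustrative skeleton: the function names, the fixed-size word chunking, the word-set Jaccard check and the shared-word "alignment" are all stand-ins chosen for brevity, not the heuristics or the Picapica alignment used in the talk.

```python
from itertools import product


def preprocess(text, chunk_size=50):
    """Text preprocessing (stand-in): lowercase and split into word chunks."""
    words = text.lower().split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def is_candidate(chunks_a, chunks_b, threshold=0.2):
    """Candidate elimination (stand-in heuristic): word-set Jaccard overlap."""
    set_a = {w for chunk in chunks_a for w in chunk}
    set_b = {w for chunk in chunks_b for w in chunk}
    if not set_a or not set_b:
        return False
    return len(set_a & set_b) / len(set_a | set_b) >= threshold


def align(chunks_a, chunks_b, min_shared=5):
    """Text alignment (stand-in): chunk pairs sharing at least min_shared words."""
    cases = []
    for (i, a), (j, b) in product(enumerate(chunks_a), enumerate(chunks_b)):
        if len(set(a) & set(b)) >= min_shared:
            cases.append((i, j))
    return cases


def text_reuse_pipeline(d1, d2):
    """Run the three stages on two datasets given as {doc_id: text} dicts."""
    pre1 = {k: preprocess(v) for k, v in d1.items()}
    pre2 = {k: preprocess(v) for k, v in d2.items()}
    results = {}
    for (id1, c1), (id2, c2) in product(pre1.items(), pre2.items()):
        if is_candidate(c1, c2):  # the expensive alignment runs only on candidates
            cases = align(c1, c2)
            if cases:
                results[(id1, id2)] = cases
    return results
```

The point of the structure is that the cheap candidate check filters document pairs before the costly alignment step is run.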

Candidate Elimination

Keys for scaling up:

➔ Cluster computing

➔ Heuristics-based candidate elimination algorithms

For the candidacy function, we proposed the following methods:

- Cosine similarity of TF-IDF (semantic)

- Paragraph embedding (semantic)

- Stopword n-grams (structure)

- Weighted average of stopword n-grams and paragraph embedding (semantic + structure)

[Diagram: documents d11, d12, …, d1n of D1 are paired with documents d21, d22, …, d2n of D2; candidacy(d11, d21) → [0, 1]]
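As an illustration of the first method, a candidacy score based on the cosine similarity of TF-IDF vectors could look like the sketch below. The function names and the smoothed IDF formula are assumptions made for this example; the slides do not specify the exact weighting that was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): terms in every document keep a weight > 0.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def candidacy(vec_a, vec_b):
    """Cosine similarity of two sparse vectors; in [0, 1] for non-negative weights."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because all weights are non-negative, the score stays within the [0, 1] range shown in the diagram.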

Generate a TR sample from Wikipedia:

- Sample 1k documents from Wikipedia

- Use the Picapica framework to find TR cases

- 232 documents

- ~90% have < 10 alignments (TR cases)

[Diagram: Wikipedia → sample 1k documents → document sample → text alignment using the Picapica framework → TR sample]

Evaluation of the “candidacy” function:

- For each document in the TR sample:

- Sort all Wikipedia articles according to the proposed “candidacy” function.

- Compute precision/recall at thresholds of [1, 101, .., 100k]

- A True Positive (TP) is a pair of documents that have TR.

[Diagram: ranked list cut at thresholds T1, T2, T3, with the resulting recall values (r1, r2) and precision values (p1, p2)]
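The evaluation described above can be sketched as follows, assuming the articles have already been sorted by descending candidacy score. The function name and argument layout are hypothetical.

```python
def precision_recall_at(ranked_ids, relevant_ids, thresholds):
    """Precision and recall of the top-k ranked documents for each k in thresholds."""
    relevant = set(relevant_ids)
    results = {}
    for k in thresholds:
        top_k = ranked_ids[:k]
        # A true positive is a retrieved document that really shares text reuse.
        tp = sum(1 for doc_id in top_k if doc_id in relevant)
        precision = tp / len(top_k) if top_k else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        results[k] = (precision, recall)
    return results
```

For one TR sample document, `ranked_ids` would be all Wikipedia articles ordered by the candidacy function and `relevant_ids` the articles known to share text reuse with it.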

Semantic hashing function:

- Hashes documents into binary hashes.

- Similar documents get similar or identical binary hashes.

[Diagram: two similar documents are both mapped to the hash 011001]

Using the semantic hashing function:

- Hash all documents.

- Build an inverted index.

- Hash each document's chunks.

- Apply the candidacy function only to documents that share at least one hash.

[Diagram: chunk hashes of D1 and D2 (001001, 011001, 001000, 011001, 011001) connected through an inverted index]
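A minimal sketch of the inverted-index step, assuming each document is already represented by the list of binary hashes of its chunks (given here as strings). The function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_hashes):
    """Map each hash to the set of document ids that have a chunk with that hash."""
    index = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def candidate_pairs(hashes_d1, hashes_d2):
    """Pairs (id from D1, id from D2) that share at least one chunk hash."""
    index = build_inverted_index(hashes_d2)
    pairs = set()
    for doc_id, hashes in hashes_d1.items():
        for h in hashes:
            for other_id in index.get(h, ()):
                pairs.add((doc_id, other_id))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be scored with the candidacy function, which avoids the full pairwise scan.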

Proposed semantic hashing methods:

- Random Projection (data independent)

- Variational Deep Semantic Hashing, VDSH (data dependent)

[Diagram: random projection maps documents di and dj to short binary hashes such as 001 and 100]

[Diagram: VDSH is first learned from data, then used to transform a document into a hash such as 011001]
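Random projection hashing is commonly implemented by taking the sign of the dot product between a document vector and a set of random hyperplanes. The sketch below follows that common formulation; the Gaussian hyperplanes, the seed and the function names are assumptions, and the talk's exact variant may differ.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in a `dim`-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def random_projection_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)
```

The method is data independent because the hyperplanes are drawn without looking at the documents. Vectors that point in similar directions fall on the same side of most hyperplanes and therefore tend to share most bits.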

Hashing methods evaluation:

- Use the same TR sample for evaluation.

- Hash all documents using the proposed hashing function.

- Compute precision and recall.

[Diagram: a TR sample document and the collection, each labeled with a 3-bit hash (101, 001, 111, 110, 000, 100); the documents that share the hash 101 are retrieved]

Example: Precision = 2/3, Recall = 1.0
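The example figures can be reproduced with a small sketch. The document ids, their hashes and the set of true TR partners below are hypothetical toy data chosen so that three documents collide with the query hash and two of them are true partners; they are not taken from the slide.

```python
def evaluate_hash_retrieval(query_hash, collection_hashes, true_partners):
    """Precision and recall when retrieving documents with an identical hash."""
    retrieved = {doc_id for doc_id, h in collection_hashes.items() if h == query_hash}
    tp = len(retrieved & set(true_partners))
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(true_partners) if true_partners else 0.0
    return precision, recall


# Hypothetical toy collection: three documents collide with the query hash "101".
collection = {
    "a": "101", "b": "101", "c": "101",
    "d": "001", "e": "111", "f": "110", "g": "000", "h": "100",
}
precision, recall = evaluate_hash_retrieval("101", collection, {"a", "b"})
```

With this toy data, two of the three retrieved documents are true partners and no true partner is missed.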

Hashing methods evaluation: results

Method              Bits   Precision     Recall
Random projection   8      3.1 × 10^-4   0.8741
Random projection   16     9.9 × 10^-4   0.324
VDSH                8      2.8 × 10^-4   0.88
VDSH                16     4.5 × 10^-3   0.73

● VDSH with 16 bits retains 73% of the recall

● By experiment: reduces the computations needed by 3 orders of magnitude

Application on Wikipedia

Text Reuse in Wikipedia

[Diagram: Wikipedia articles are compared against Wikipedia by the TR Pipeline]

- 100 million text reuse cases

- 360k Wikipedia articles

What kinds of text reuse occur in Wikipedia?

- Reasons behind text reuse:

(1) Two texts describe the same topic.

(2) Two texts describe two different topics that share similar characteristics.

[Diagram: Text Reuse divides into Structure Text Reuse and Content Text Reuse]

- Vertical alignment → Content TR

- Horizontal alignment → Structure TR

[Figure: examples of a vertical relation and a horizontal relation]

Application on Wikipedia and Common Crawl

Wikipedia vs The Web

[Diagram: WWW → extracted web content → web sample]

- Crawling

- Content extraction

- Keeping only English pages

- 10% random sample

Web sample:

- 59 million web pages.

- 1.4 million websites.

- 70% of these websites contain fewer than 10 web pages.

[Figure: number of websites plotted against number of web pages]

[Diagram: the web sample and Wikipedia are fed into the TR Pipeline]

- 1.6 million text reuse cases.

- 15k pages reuse Wikipedia text.

- 4.8k websites reuse Wikipedia text.

Monthly revenue estimation:

- Rough estimate of ad revenue

- Based on CPM (cost per mille)

- Sampled 100 web pages and manually checked for the presence of advertisements.

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page

Per website (all websites):

Website            Monthly revenue   Percentage of reuse   Monthly Wikipedia value
pdxretro.com       $195              0.012                 $2.5
seqrchquarry.com   $8,850            0.096                 $850
asiatees.com       $36,000           0.017                 $613
….                 …..               …..                   ….

Total: $1.2 million (the rough estimate of the monthly revenue of Wikipedia content)
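The rows of the per-website table are consistent with multiplying a site's monthly revenue by the fraction of its content that reuses Wikipedia. That relation is inferred from the numbers, not stated on the slide, and the function name below is hypothetical.

```python
def wikipedia_value(monthly_revenue, reuse_fraction):
    """Share of a site's monthly revenue attributed to reused Wikipedia content."""
    return monthly_revenue * reuse_fraction


# Row from the table: $8,850 monthly revenue with a reuse fraction of 0.096.
value = wikipedia_value(8850, 0.096)
```

Small deviations for other rows (for example 36,000 × 0.017 = 612 against the listed $613) would be explained by the reuse fractions being rounded for display.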

Per website (highly reusing):

- Percentage of pages reusing Wikipedia >= 0.5

- 87 websites.

- Estimated monthly revenue: $15k

Page 67: A Pipeline for Scalable Text Milad Alshomary Reuse Analysis · A Pipeline for Scalable Text Reuse Analysis Milad Alshomary 05.07.2018 Bauhaus Universität 1. Milad Alshomary Pipeline

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page

Reused Wikipedia page   Average page views   Average CPM   Average monthly revenue
Nuclear renaissance     645                  $2.8          $1.806
Second Chechen War      34655                $2.8          $97
Enumerated powers       12858                $2.8          $36
…                       …                    …             …
Total                                                      $900k

Note: average page views are extracted from the Wikipedia API.
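The per-page rows are consistent with the usual CPM formula, revenue = page views / 1000 × CPM. A minimal Python sketch of that calculation follows; the function name is hypothetical, and it assumes that one page view counts as one ad impression.

```python
def monthly_ad_revenue(page_views: float, cpm_usd: float) -> float:
    """Estimated ad revenue for a page.

    CPM is the price per 1,000 impressions. Assumption: each page view
    counts as exactly one impression.
    """
    if page_views < 0 or cpm_usd < 0:
        raise ValueError("page_views and cpm_usd must be non-negative")
    return page_views / 1000.0 * cpm_usd


# Example rows from the slide: (reused Wikipedia page, average page views, average CPM).
PAGES = [
    ("Nuclear renaissance", 645, 2.8),
    ("Second Chechen War", 34655, 2.8),
    ("Enumerated powers", 12858, 2.8),
]

if __name__ == "__main__":
    for title, views, cpm in PAGES:
        print(f"{title}: ${monthly_ad_revenue(views, cpm):.3f}")
```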


Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page

Reused Wikipedia page   Average page views   Average CPM   Average monthly revenue
Nuclear renaissance     645                  $2.8          $1.806
Second Chechen War      34655                $2.8          $97
Enumerated powers       12858                $2.8          $36
…                       …                    …             …
Total                                                      $900k

Note: the average CPM is estimated from marketing reports.


Wikipedia vs The Web

Monthly revenue:

Web sample    Number of reusing web pages   Revenue (per web page)
59 million    15k                           $900k
590 million   150k                          $9 million
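The second row of the table is ten times the first, which suggests a linear extrapolation from the 59 million sample to a sample ten times larger. The sketch below shows that scaling in Python. It assumes the reuse rate and the revenue per reusing page stay constant as the sample grows, and the function name is illustrative only.

```python
from typing import Tuple


def extrapolate(
    sample_pages: int,
    reusing_pages: float,
    revenue_usd: float,
    target_pages: int,
) -> Tuple[float, float]:
    """Scale sample counts linearly to a larger web sample.

    Assumption: reuse rate and revenue per reusing page do not change
    with the size of the sample.
    """
    if sample_pages <= 0:
        raise ValueError("sample_pages must be positive")
    scale = target_pages / sample_pages
    return reusing_pages * scale, revenue_usd * scale


if __name__ == "__main__":
    pages, revenue = extrapolate(59_000_000, 15_000, 900_000, 590_000_000)
    print(f"{pages:,.0f} reusing pages, ${revenue:,.0f} per month")
```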


Conclusion


Summary

● Pipeline for TR Extraction: Text Preprocessing, Candidate Elimination, Text Alignment

● Text Reuse in Wikipedia: Structure Text Reuse, Content Text Reuse

● Text Reuse between Wikipedia and the Web (estimated monthly revenue):

Per website (all websites)     $1.2 million
Per website (highly reusing)   $15k
Per web page                   $900k


Future Work

[Diagram labels: TR Pipeline, Wikipedia, ?]

● Using the pipeline to extract and analyze TR between Wikipedia and the scientific community.

● Experiments on the Text Alignment subtask.

● Further analysis of the extracted Text Reuse cases.

● More accurate estimation of the monthly revenue generated by Wikipedia content.


Backup Slides


- Candidate Elimination functions:


- Stopword n-grams procedure:

Wiki paragraphs, extract stopwords, generate stopword n-grams, filter n-grams, build a binary count vector from the filtered stopword n-grams.

Top 50 frequent stopwords:

the, of, and, a, in, to, is, was, it, for, with, he, be, on, i, that, by, at, you, 's, are, not, his, this, from, but, had, which, she, they, or, an, were, we, their, been, has, have, will, would, her, there, can, all, as, if, who, what, said

Filter n-grams:

- Let C = {the, of, and, a, in, to, 's} be the stopwords that increase false positives.

- An n-gram X is accepted if:

  - It doesn't contain more than n-1 stopwords from C

  - The maximal sequence of stopwords belonging to C is less than n-2

Binary count vector:

- The binary count vector ignores how often a specific n-gram occurs in a paragraph.

- We apply the scoring function to the binary count vector.
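The stopword n-gram steps listed on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the input is assumed to be already tokenized (with 's as its own token), the binary count vector is represented as a set of accepted n-grams, and all function names are hypothetical. The scoring function is not specified on the slide and is left out.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Gram = Tuple[str, ...]

# Stopword list as transcribed from the slide.
STOPWORDS: FrozenSet[str] = frozenset(
    """the of and a in to is was it for with he be on i that by at you 's
    are not his this from but had which she they or an were we their been
    has have will would her there can all as if who what said""".split()
)

# Very frequent stopwords that increase false positives (set C on the slide).
C: FrozenSet[str] = frozenset({"the", "of", "and", "a", "in", "to", "'s"})


def stopword_ngrams(tokens: Iterable[str], n: int) -> List[Gram]:
    """Keep only stopwords, in order, then slide a window of length n."""
    seq = [t.lower() for t in tokens if t.lower() in STOPWORDS]
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def longest_run_in_c(gram: Gram) -> int:
    """Length of the longest run of consecutive words that belong to C."""
    longest = current = 0
    for word in gram:
        current = current + 1 if word in C else 0
        longest = max(longest, current)
    return longest


def is_accepted(gram: Gram) -> bool:
    """Filter as stated on the slide: at most n-1 words from C, run shorter than n-2."""
    n = len(gram)
    in_c = sum(1 for word in gram if word in C)
    return in_c <= n - 1 and longest_run_in_c(gram) < n - 2


def binary_profile(tokens: Iterable[str], n: int) -> Set[Gram]:
    """Set of accepted n-grams; membership plays the role of a binary count."""
    return {g for g in stopword_ngrams(tokens, n) if is_accepted(g)}


if __name__ == "__main__":
    text = "the cat is on the mat and it was in a box"
    print(sorted(binary_profile(text.split(), 4)))
```

Representing the vector as a set discards frequencies by construction, which matches the slide's remark that the binary count vector ignores how often an n-gram occurs.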


- VDSH explained:

VDSH USAGE


- Performance of candidacy functions at different thresholds:

[Figure: two precision-recall plots (x-axis: recall, y-axis: precision), one for sample documents with at most 10 aligned documents and one for sample documents with more than 10 aligned documents. Thresholds range from 1 to 1000 in steps of 5.]


- Candidate Elimination procedure over the cluster:


- Hash-based Candidate Elimination procedure over the cluster:


- Heuristics:

- H1: if ne_sim ∈ (0.5, 1.0] AND 10grams_sim > 0.5 AND (s_percent_reused < 0.5 OR t_percent_reused < 0.5), the case is content reuse; otherwise it is structure reuse

- Only 6,700 content reuse cases

- Validation on two random samples of size 100:

                                 Structure reuse   Content reuse
Sample1                          100%              58%
Sample2 (Text1 or Text2 > 200)   100%              73%
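Heuristic H1 can be written as a single boolean rule. The Python sketch below mirrors the conditions exactly as stated on the slide. The parameter names follow the slide (with 10grams_sim renamed to ten_grams_sim, since Python identifiers cannot start with a digit), and the function name is hypothetical. How the four scores are computed is not specified here, so they are taken as given inputs.

```python
def classify_h1(
    ne_sim: float,
    ten_grams_sim: float,
    s_percent_reused: float,
    t_percent_reused: float,
) -> str:
    """Apply heuristic H1 to one text reuse case.

    Returns "content reuse" when all three conditions hold,
    and "structure reuse" otherwise.
    """
    is_content = (
        0.5 < ne_sim <= 1.0                      # ne_sim in (0.5, 1.0]
        and ten_grams_sim > 0.5                  # 10grams_sim > 0.5
        and (s_percent_reused < 0.5 or t_percent_reused < 0.5)
    )
    return "content reuse" if is_content else "structure reuse"


if __name__ == "__main__":
    print(classify_h1(0.8, 0.7, 0.3, 0.9))
    print(classify_h1(0.8, 0.7, 0.6, 0.9))
```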