Analyzing a Large Corpus of Crowdsourced Plagiarism

Analyzing a Large Corpus of Crowdsourced Plagiarism Master’s Thesis Michael Völske Fakultät Medien Bauhaus-Universität Weimar 06.06.2013 Supervised by: Prof. Dr. Benno Stein Prof. Dr. Volker Rodehorst Advisors: Dr. Steven Burrows Dr. Matthias Hagen Dr. Martin Potthast


Analyzing a Large Corpus of Crowdsourced Plagiarism

Master’s Thesis

Michael Völske

Fakultät Medien

Bauhaus-Universität Weimar

06.06.2013

Supervised by:

Prof. Dr. Benno Stein
Prof. Dr. Volker Rodehorst

Advisors:

Dr. Steven Burrows
Dr. Matthias Hagen
Dr. Martin Potthast


Outline

· Introduction

· The Webis-TRC-12 Dataset

· Categorizing Crowdsourced Text Reuse

· Search Missions For Source Retrieval

· Summary


Introduction

Modeling Text Reuse From the Web

[Figure: text reuse from the web modeled as the steps Search, Copy & Paste, and Paraphrase.]

[Potthast 2011]

3 [∧] © Michael Völske 2013


Introduction

Previous Plagiarism Corpora

[Figure: pipeline for generating artificial plagiarism, with the stages Source Selection, Automatic Excerpt, Automatic Modification, and Insertion.]

· Automatically generated artificial plagiarism

· PAN-PC-10/11: >25,000 documents; >60,000 plagiarism cases

· Automatic transformations do not preserve semantics

The Webis-TRC-12 Dataset


The Webis-TRC-12 Dataset

Construction Overview

[Figure: components of the corpus construction setup: Topic, Author, Editor, ChatNoir search engine (SE), ClueWeb, ClueWeb API, Revision log, Query log.]


The Webis-TRC-12 Dataset

Construction Overview: Topics

Example topic:

Obama’s family.

Write about President Barack Obama’s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama’s parents and grandparents come from? Also include a brief biography of Obama’s mother.

· Based on TREC Web Track topics 2009–2011

· 150 topics, 297 essays

· Target essay length: 5000 words



The Webis-TRC-12 Dataset

Construction Overview: Authors

[Figure: number of essays (y-axis) per author ID (x-axis).]

Author Demographics (n=12)

Age (Median): 37
Years Writing (Median): 8
Academic degree: Postgrad 33%, Undergrad 25%
English: Native 67%, Second Language 33%

· Crowdsourcing: 27 authors in total

· Professional writers hired on oDesk, plus volunteers

· Fluent English speakers


The Webis-TRC-12 Dataset

Construction Overview: Sources

· ClueWeb09: 500 million English pages

· Representative sample of the web

· Commonly used in search engine evaluation (TREC)


The Webis-TRC-12 Dataset

Construction Overview: Search Engine

[Potthast et al. 2012a]

· Used for source retrieval

· Indexes ClueWeb

· Records fine-grained interaction log


The Webis-TRC-12 Dataset

Construction Overview: Editor

[Potthast et al. 2012b]

· Custom web-based rich text editor

· Records sources of reused text passages

· New revision after every 300 ms of inactivity

· Detailed revision history
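The inactivity rule can be illustrated offline. The Python sketch below is only an assumption about how such a rule might behave, not the editor's actual implementation: given timestamps of edit events, it treats an event as closing a revision when the next event follows at least 300 ms later. The function name `revision_boundaries` and the event format (a sorted list of millisecond timestamps) are hypothetical.

```python
from typing import List

INACTIVITY_MS = 300  # inactivity threshold stated on the slide


def revision_boundaries(event_times_ms: List[int],
                        threshold_ms: int = INACTIVITY_MS) -> List[int]:
    """Return the timestamps of the edit events that close a revision.

    An event closes a revision when the next event arrives at least
    `threshold_ms` later, or when it is the last event overall.
    Assumes `event_times_ms` is sorted in ascending order.
    """
    boundaries = []
    for i, t in enumerate(event_times_ms):
        is_last = i == len(event_times_ms) - 1
        if is_last or event_times_ms[i + 1] - t >= threshold_ms:
            boundaries.append(t)
    return boundaries


# Hypothetical keystroke timestamps in milliseconds.
events = [0, 100, 150, 600, 650, 1200]
print(revision_boundaries(events))  # → [150, 650, 1200]
```

With this rule, a burst of fast typing collapses into a single revision, which is why the revision history can be detailed without storing every keystroke.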


The Webis-TRC-12 Dataset

Three Main Data Sources

[Figure: the construction setup (Topic, Author, Editor, ChatNoir SE, ClueWeb API) with the data sources Revision log, ClueWeb, and Query log.]


The Webis-TRC-12 Dataset

Research Questions

1. Different text reuse approaches distinguishable?

2. Relation to existing plagiarism categorizations?

3. Influence of text reuse task on search engine interaction?

4. Detectable via query log analysis?

We expect new research insights and impact on

· text reuse detection
· query formulation
· paraphrasing


Categorizing Crowdsourced Text Reuse


Categorizing Crowdsourced Text Reuse

Build-Up Versus Boil-Down Reuse

Build-up reuse (left) versus boil-down reuse (right).

· text length (y-axis) over text revision (x-axis)
· colors: different source documents (original text is white)
· blue dots: position of the writer’s last edit
· Build-up: 45%; boil-down: 40%; mixed: 12%
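The slides do not state how essays were assigned to the build-up, boil-down, or mixed groups. As a rough illustration only, the Python sketch below labels a text-length series with a hypothetical heuristic: if the final text is much shorter than the longest intermediate revision, material was presumably pasted first and cut down later (boil-down); otherwise the text grew toward its final length (build-up). The function name `reuse_style` and the `shrink_ratio` threshold are assumptions, and the sketch does not model the mixed case.

```python
from typing import Sequence


def reuse_style(lengths: Sequence[int], shrink_ratio: float = 0.8) -> str:
    """Label a text-length series as 'build-up' or 'boil-down'.

    `lengths` holds the text length at each revision, in order.
    Hypothetical rule: boil-down if the final length is below
    `shrink_ratio` times the peak length, build-up otherwise.
    """
    if not lengths:
        raise ValueError("need at least one revision")
    peak = max(lengths)
    if peak == 0:
        return "build-up"  # empty text throughout; nothing was cut
    return "boil-down" if lengths[-1] / peak < shrink_ratio else "build-up"


# Hypothetical length series (in words) for two essays.
print(reuse_style([0, 500, 1200, 2600, 4100, 5000]))    # build-up
print(reuse_style([0, 9000, 12000, 8000, 6000, 5000]))  # boil-down
```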


Categorizing Crowdsourced Text Reuse

Build-Up Versus Boil-Down Reuse

[Figure: three plots of text length (y-axis) over edit number (x-axis): Author 6 (12 topics), Author 20 (9 topics), Author 21 (21 topics).]

Build-up reuse: Averaged editing histories by authors.

· one author per plot
· gray lines: individual essays
· black line: average

Categorizing Crowdsourced Text Reuse

Build-Up Versus Boil-Down Reuse

[Figure: three plots of text length (y-axis) over edit number (x-axis): Author 2 (66 topics), Author 7 (20 topics), Author 24 (27 topics).]

Boil-down reuse: Averaged editing histories by authors.

· one author per plot
· gray lines: individual essays
· black line: average


Categorizing Crowdsourced Text Reuse

Classification Scheme for Text Reuse

                     Interleaving: low   Interleaving: high
Paraphrasing: high   Find-Replace        Remix
Paraphrasing: low    Clone, Ctrl-C       Mashup

· Quantify: n-gram similarity and ratio of passages to sources [details]

· Measure for all essays

· Hypothesis: will show evidence of authors’ individual text reuse styles

Classification Scheme for Text Reuse.

· types of plagiarism as distinguished by Turnitin [Turnitin 2012]

· interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving
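To make the two-factor reading concrete, the following Python sketch maps a pair of measurements to one cell of the scheme. It assumes that paraphrasing is measured by n-gram similarity (high similarity means little paraphrasing) and interleaving by passages per source. The function name `reuse_category` and both thresholds are hypothetical; the slides do not define cut-off values.

```python
def reuse_category(similarity: float, passages_per_source: float,
                   sim_threshold: float = 0.5,
                   pps_threshold: float = 2.0) -> str:
    """Map the two measures to a cell of the classification scheme.

    High n-gram similarity means little paraphrasing; many passages
    per source means strong interleaving. Both thresholds are
    hypothetical values chosen for illustration.
    """
    paraphrased = similarity < sim_threshold
    interleaved = passages_per_source >= pps_threshold
    if paraphrased:
        return "Remix" if interleaved else "Find-Replace"
    return "Mashup" if interleaved else "Clone, Ctrl-C"


# Hypothetical measurements for four essays.
print(reuse_category(0.9, 1.0))  # Clone, Ctrl-C
print(reuse_category(0.9, 4.0))  # Mashup
print(reuse_category(0.2, 1.0))  # Find-Replace
print(reuse_category(0.2, 4.0))  # Remix
```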


Categorizing Crowdsourced Text Reuse

Classification Scheme for Text Reuse

[Figure: plot of median 5-gram passage similarity (paraphrasing; y-axis, 0 to 1) over passages per source (interleaving; x-axis, ticks at 1, 2, 4, 8), with a legend of authors.]

Author (essay count): A0 (2), A1 (3), A2 (66), A3 (1), A4 (1), A5 (37), A6 (11), A7 (20), A8 (1), A9 (1), A10 (12), A11 (1), A12 (1), A13 (1), A14 (1), A15 (3), A16 (1), A17 (33), A18 (32), A19 (7), A20 (8), A21 (21), A22 (1), A23 (1), A24 (27), A25 (1), A26 (1)


Search Missions For Source Retrieval


Search Missions For Source Retrieval

Distribution of Queries Over Time

[Figure: grid of small plots, rows A to F and columns 1 to 25, one cell per essay; each cell is labeled with its total number of queries.]

Distribution of queries over time.

· fraction of posed queries (y-axis) over elapsed time (x-axis) between the first query and essay completion

· each cell represents one of 150 essays
· the numbers denote the total number of posed queries
· the cells are sorted by area under the curve
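One way to read such a cell is as a cumulative curve: time is normalized so that the first query is at 0 and essay completion at 1, and the y-value is the fraction of queries posed so far. The Python sketch below follows that reading, which is an assumption (the slides do not say whether the fraction is cumulative), and computes the area under the resulting step function as a sort key. The function names and the input format are hypothetical.

```python
from typing import List, Sequence, Tuple


def query_curve(query_times: Sequence[float],
                end_time: float) -> List[Tuple[float, float]]:
    """Cumulative fraction of queries over normalized elapsed time.

    Time runs from the first query (0.0) to essay completion (1.0).
    """
    if not query_times:
        raise ValueError("need at least one query")
    times = sorted(query_times)
    start, span = times[0], end_time - times[0]
    if span <= 0:
        raise ValueError("end_time must be after the first query")
    n = len(times)
    return [((t - start) / span, (i + 1) / n) for i, t in enumerate(times)]


def area_under_curve(curve: Sequence[Tuple[float, float]]) -> float:
    """Area under the step function defined by `curve` on [0, 1]."""
    area = 0.0
    for i, (x, y) in enumerate(curve):
        next_x = curve[i + 1][0] if i + 1 < len(curve) else 1.0
        area += y * (next_x - x)
    return area


# Hypothetical essay: four queries at minutes 0, 10, 20, 30; done at 40.
curve = query_curve([0, 10, 20, 30], end_time=40)
print(area_under_curve(curve))  # → 0.625
```

Under this reading, a large area means most queries were posed early in the writing process, and a small area means querying continued until late.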


Search Missions For Source Retrieval

Distribution of Queries Over Time

[Figure: three example cells with 64, 112, and 165 posed queries.]

Distribution of queries over time.

Search Missions For Source Retrieval

Correlation of Editing and Querying

[Figure: four plots with both axes ranging from 0 to 1: Author 5 (18 topics), Author 2 (33 topics), Author 24 (13 topics), Author 20 (9 topics).]

Correlation of editing and querying behavior.

· averaged editing histories by authors
· distribution of queries over time

Summary

1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse

2. Evidence of two fundamental editing strategies: build-up & boil-down

3. New classification scheme for documents in a text reuse corpus

4. Relationship between editing behavior and search engine use

Future Work

1. Interleaving and paraphrasing in the time dimension

2. Authors’ text reuse strategies across multiple documents

3. Paraphrasing study: track individual passages over time

Thank you for your attention!


References

[Potthast 2011] Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December 2011.

[Potthast et al. 2012a] Martin Potthast, Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel, Martin Tippmann, and Clement Welsch. ChatNoir: A Search Engine for the ClueWeb09 Corpus. SIGIR 2012, Portland, Oregon, August 2012.

[Potthast et al. 2012b] Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, September 2012.


Example topic:

Obama’s family.

Write about President Barack Obama’s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama’s parents and grandparents come from? Also include a brief biography of Obama’s mother.

Original topic 001 of the TREC Web Track 2009:

Query. obama family tree

Description. Find information on President Barack Obama’s family history, including genealogy, national origins, places and dates of birth, etc.

Sub-topic 1. Find the TIME magazine photo essay “Barack Obama’s Family Tree.”

Sub-topic 2. Where did Barack Obama’s parents and grandparents come from?

Sub-topic 3. Find biographical information on Barack Obama’s mother.

Type          Frequency rank  Severity rank  Description
Clone         1               1              Exact copy of another author’s work
Mashup        2               3              A mix of material copied verbatim from several sources
Ctrl-C        3               2              Significant portions of text copied from a single source
Remix         4               9              Paraphrasing from several sources and making the content fit together seamlessly
Recycle       5               5              Self-plagiarism
Re-Tweet      6               10             Proper citation, but closely follows a single source
Find-Replace  7               7              Near copy of a single source, with key phrases changed
Aggregator    8               4              Proper citation, but (almost) no original work
404 Error     9               6              Citations to non-existent or inaccurate information about sources
Hybrid        10              8              Combining properly cited sources with plagiarism in one paper

[Turnitin 2012]

Details: Paraphrasing & Interleaving

How to quantify?

· Measure at the passage level
· Passage: block of text reused from the same source
· Paraphrasing: simple n-gram similarity


Details: Paraphrasing & Interleaving

Paraphrasing: N-Gram similarity

$$\phi_n(N_c, N_s) := \frac{|N_c \cap N_s|}{|N_c|}$$

· $N_c$: n-grams in the passage
· $N_s$: n-grams in the source

We choose n = 5.
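The similarity above is the share of the passage's n-grams that also occur in the source. A minimal Python sketch follows, assuming word-level n-grams and naive lowercase whitespace tokenization; the slides do not specify the tokenization, so both are assumptions, as are the function names.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of word n-grams, using naive lowercase whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(passage: str, source: str, n: int = 5) -> float:
    """phi_n(N_c, N_s) = |N_c ∩ N_s| / |N_c|."""
    n_c, n_s = ngrams(passage, n), ngrams(source, n)
    if not n_c:
        # Passage shorter than n tokens; returning 0.0 is a convention
        # assumed here, not taken from the slides.
        return 0.0
    return len(n_c & n_s) / len(n_c)


source = "the quick brown fox jumps over the lazy dog"
passage = "the quick brown fox jumps over a sleepy cat"
print(ngram_similarity(passage, source))  # → 0.4
```

Because the denominator counts only the passage's n-grams, the measure is asymmetric: a short verbatim excerpt from a long source still scores 1.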


Details: Paraphrasing & Interleaving

Interleaving: Passages per source

$$\mathrm{pps}(C, S) := \frac{|C|}{|S|}$$

· $C$: passages
· $S$: sources
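As a small illustration of the ratio, the Python sketch below assumes an essay is represented by the source ID of each reused passage, in essay order. This representation and the function name `passages_per_source` are hypothetical.

```python
from typing import Sequence


def passages_per_source(passage_sources: Sequence[str]) -> float:
    """pps(C, S) = |C| / |S|.

    `passage_sources` lists the source ID of each reused passage, so
    |C| is its length and |S| is the number of distinct IDs.
    """
    if not passage_sources:
        raise ValueError("essay has no reused passages")
    return len(passage_sources) / len(set(passage_sources))


# Hypothetical essays: one revisits its sources, one uses each once.
print(passages_per_source(["s1", "s2", "s1", "s3", "s2", "s1"]))  # → 2.0
print(passages_per_source(["s1", "s2", "s3"]))                    # → 1.0
```

A value of 1 means every source contributes exactly one contiguous passage; larger values indicate that the author returned to the same sources repeatedly, which is what the scheme calls interleaving.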
