
Detecting student copying in a corpus of science laboratory reports: simple and smart approaches

Eric Atwell, Paul Gent, Julia Medori, Clive Souter

University of Leeds

Overview

• Project Aim: develop a system for detecting student copying in Biomedical Science laboratory practical reports

• Test Corpus: 220 student courseworks

• Experiments to test and compare “simple” Zipping and Bigrams, and “smart” commercial-strength Turnitin, CopyCatch, CopyFind

• Conclusions

Intro: Aims

• Biomedical Sciences want to detect student copying… and deter copying in future

• We were also interested to see how “simple” and “smart” approaches compared

• … is “smart” necessary/useful?

BMS lab report assessment

• BMS students learn by laboratory experiments and report on results

• Peer-assessed in large classes

• ? opportunity to copy ?

• Plagiarism-detection system should reduce this temptation

• … and quantify the problem

Interdisciplinary team:

• Eric Atwell, School of Computing

• Paul Gent, School of Biomedical Sciences

• Julia Medori, forensic linguist, Computational Linguistics Lab, Trinity College Dublin

• Clive Souter, Centre for Joint Honours in Science

Requirements analysis

• Specific properties of lab reports (diagrams, tables can't be checked)

• BMS teaching staff: what to look for, how much overlap is “acceptable” (answers correlate to questions)

Survey of available systems

• CheatChecker: n-grams matched

• SIM, YAP: longest common subsequences

• CopyCatch: “unusual” similarities, used for forensic linguistics

• CopyFind: aimed at science, not humanities

• Turnitin: most famous, but unclear?

• Clough: PhD project at Sheffield using NLP

Test Corpus of 220 Biomedical Sciences Student Lab Reports

• 103 1st-year (specific questions)

• 94 2nd-year (longer, less constrained, more originality)

Adding 23 artificial files

• 21 artificial TestFiles: 5%, 10%… T1 + 95%, 90%… T2

• testfileAB v testfileABC

• Proven plagiarism case: Scanfile1, Scanfile2

Format and content of lab reports

• Lab report MUST have: Intro, Methods, Results, Discussion

• Unlike other genres in plagiarism detection research:

• Limited originality (unlike Humanities essays)

• Unlimited vocabulary (unlike programming exercises)

Experiments to test and compare systems

• “simple” v. “smart” commercial-strength

• Zipping, Bigrams, Turnitin, CopyCatch, CopyFind

Zipping: simple 1

• File compressors merge/delete repeated substrings

• So… compare size of zip(A+B) with sizes of zip(A) and zip(B)

• If zip(A+B) is small, this shows copying between A and B

• Simple pipeline of unix tools
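
The zipping idea can be sketched in a few lines. This is only an illustration, not the project's unix pipeline: it assumes Python's zlib as the compressor, and the names zsize, zip_similarity and rank_pairs are made up for this example. The score is the share of the smaller file's compressed size that is saved when the two files are compressed together.

```python
import zlib
from itertools import combinations


def zsize(text: str) -> int:
    """Compressed size in bytes of a text (zlib, maximum compression)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def zip_similarity(a: str, b: str) -> float:
    """Bytes saved by compressing A+B together, relative to the smaller file.

    Close to 1: most of one file is already "explained" by the other.
    Close to 0: the files share little repeated material.
    """
    za, zb = zsize(a), zsize(b)
    zab = zsize(a + "\n" + b)
    return (za + zb - zab) / min(za, zb)


def rank_pairs(reports: dict[str, str]) -> list[tuple[float, str, str]]:
    """Score every pair of files and sort, most suspicious pair first."""
    scored = [
        (zip_similarity(reports[x], reports[y]), x, y)
        for x, y in combinations(sorted(reports), 2)
    ]
    return sorted(scored, reverse=True)
```

One caveat of this stand-in: zlib only looks back 32 KB, so copying between two very long reports could be missed. A compressor with a larger window would be needed there.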

Bigrams: simple 2

• 2 similar files share high-frequency bigrams: letters, words

• So… compare top N bigrams in A, B

• Top 10, 20, 30 letter-bigrams are shared IF A, B are very similar

• Top 30 word-bigram flags more “subtle” copying
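
A minimal sketch of the top-N bigram comparison, assuming a crude regular-expression tokeniser; the function names top_bigrams and bigram_overlap are hypothetical. As on the slide, only the lists of most frequent bigrams are compared, not the frequencies themselves.

```python
import re
from collections import Counter


def top_bigrams(text: str, n: int = 30, unit: str = "word") -> set:
    """Return the n most frequent word or letter bigrams of a text."""
    lowered = text.lower()
    if unit == "word":
        tokens = re.findall(r"[a-z0-9']+", lowered)
    else:
        # Letter bigrams: collapse runs of whitespace, then split into characters.
        tokens = list(re.sub(r"\s+", " ", lowered))
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram for bigram, _ in counts.most_common(n)}


def bigram_overlap(a: str, b: str, n: int = 30, unit: str = "word") -> float:
    """Fraction of the top-n bigram list that two texts have in common."""
    top_a, top_b = top_bigrams(a, n, unit), top_bigrams(b, n, unit)
    if not top_a or not top_b:
        return 0.0
    return len(top_a & top_b) / min(len(top_a), len(top_b))
```

Calling bigram_overlap(a, b, n=30, unit="word") for every pair of reports and sorting by score would give a ranked list of filename-pairs.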

Turnitin.com

• Commercial Web-based service; Leeds Univ has subscribed

• Cumbersome: cut and paste text, 24-hour turnaround

• Originality Report measures overlap between A and all others

• So… lots of small overlaps score worse than 1 big overlap

CopyCatch

• C++/Java PC package, works on a directory (text, Word, RTF, HTML)

• Commercial – pay per PC

• Finds most similar pairs of files

• Many extra features for forensic linguists - confusing for others?

• Flags most corpus copying cases

CopyFind

• Freeware from Virginia Univ

• Also works on a PC directory, more cumbersome than CopyCatch

• Default threshold: copying missed

• Threshold can be lowered to catch this, BUT also more red herrings

Results

Usability: Submission

• Zipping: filenames in program

• Bigrams: give directory name

• Turnitin: Copy+Paste each text

• CopyCatch: select files from file-manager window

• CopyFind: give list of filenames

Usability: output

• Zipping: sorted list of filename-pairs

• Bigrams: sorted list of filename-pairs

• Turnitin: Originality Reports with similarities highlighted

• CopyCatch: sorted list of pairs + stats + texts side-by-side with similarities highlighted

• CopyFind: reports for each pair above threshold, similarities highlighted

Discussion

• “simple” as good as “smart” at finding cases of copying

• Surprising: Zipping just uses file-compression, Bigrams does not compare frequencies, only list of common bigrams…

• “smart” systems are easier to use

Is “smart” necessary/useful?

• Zipping/Bigrams are not smart

• But… Lab report MUST have: Intro, Methods, Results, Discussion…

• …copying in Discussion is “more suspicious” for human judges

• ? This could be used to downgrade red herrings with overlap in Intro, Methods (“copied” from coursework specification)
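
One way the section idea might be tried out is sketched below. Everything in it is an assumption for illustration: the weights are invented, section headings are assumed to sit on a line of their own, and a simple shared-vocabulary measure stands in for whichever similarity score is actually used.

```python
import re

# Hypothetical weights: overlap in Discussion counts most, Intro/Methods least.
SECTION_WEIGHTS = {
    "introduction": 0.5,
    "methods": 0.5,
    "results": 1.0,
    "discussion": 2.0,
}


def split_sections(report: str) -> dict[str, str]:
    """Split a report into sections, assuming each heading is on its own line."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in report.splitlines():
        heading = line.strip().lower().rstrip(":")
        if heading in SECTION_WEIGHTS:
            current = heading
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


def word_overlap(a: str, b: str) -> float:
    """Shared vocabulary of two texts divided by their combined vocabulary."""
    words_a = set(re.findall(r"[a-z']+", a.lower()))
    words_b = set(re.findall(r"[a-z']+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def weighted_score(report_a: str, report_b: str) -> float:
    """Weighted average of per-section overlap for sections both reports have."""
    sections_a, sections_b = split_sections(report_a), split_sections(report_b)
    total = weight_sum = 0.0
    for name, weight in SECTION_WEIGHTS.items():
        if name in sections_a and name in sections_b:
            total += weight * word_overlap(sections_a[name], sections_b[name])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

With these weights, a pair that shares only its Methods text scores lower than a pair that shares only its Discussion text, which is the downgrading effect suggested above.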

Conclusions

• Biomedical Sciences have a system for future real use (CopyCatch)

• Corpus for further research

• ? solution: “simple”, plus extra to eliminate red herrings ?

• Future work: look at other genres, other forensic linguistics problems

Acknowledgements

• This project previously presented at AICS’02: Artificial Intelligence and Cognitive Science

• (but this is NOT plagiarism as it is acknowledged !?!)

• Funded by HEFCE C&IT grant

• Julia Medori: available for further consultancy!