Detecting student copying in a corpus of science laboratory reports: simple and smart approaches
Eric Atwell, Paul Gent, Julia Medori, Clive Souter
University of Leeds
Overview
• Project Aim: develop a system for detecting student copying in Biomedical Science laboratory practical reports
• Test Corpus: 220 student courseworks
• Experiments to test and compare “simple” Zipping and Bigrams, and “smart” commercial-strength Turnitin, CopyCatch, CopyFind
• Conclusions
Intro: Aims
• Biomedical Sciences want to detect student copying… and deter copying in future
• We were also interested to see how “simple” and “smart” approaches compared
• … is “smart” necessary/useful?
BMS lab report assessment
• BMS students learn by laboratory experiments and report on results
• Peer-assessed in large classes
• ?opportunity to copy?
• Plagiarism-detection system should reduce this temptation
• … and quantify the problem
Interdisciplinary team:
• Eric Atwell, School of Computing
• Paul Gent, School of Biomedical Sciences
• Julia Medori, forensic linguist, Computational Linguistics Lab, Trinity College Dublin
• Clive Souter, Centre for Joint Honours in Science
Requirements analysis
• Specific properties of lab reports (diagrams, tables can't be checked)
• BMS teaching staff: what to look for, how much overlap is “acceptable” (answers correlate to questions)
Survey of available systems
• CheatChecker: n-grams matched
• SIM, YAP: longest common subsequences
• CopyCatch: “unusual” similarities, used for forensic linguistics
• CopyFind: aimed at science, not humanities
• Turnitin: most famous, but unclear?
• Clough: PhD project at Sheffield using NLP
Test Corpus of 220 Biomedical Sciences Student Lab Reports
• 103 1st-year (specific questions)
• 94 2nd-year (longer, less constrained, more originality)
Adding 23 artificial files
• 21 artificial TestFiles: 5%, 10%… T1 + 95%, 90%… T2
• testfileAB v testfileABC
• Proven plagiarism case: Scanfile1, Scanfile2
Format and content of lab reports
• Lab report MUST have: Intro, Methods, Results, Discussion
• Unlike other genres in plagiarism detection research:
• Limited originality (unlike Humanities essays)
• Unlimited vocabulary (unlike programming exercises)
Experiments to test and compare systems
• “simple” v. “smart” commercial-strength
• Zipping
• Bigrams
• Turnitin
• CopyCatch
• CopyFind
Zipping: simple 1
• File compressors merge/delete repeated substrings
• So… compare size of zip(A+B) with sizes of zip(A) and zip(B)
• If zip(A+B) is much smaller than zip(A) + zip(B), this suggests copying between A and B
• Simple pipeline of unix tools
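The zipping comparison above can be sketched in a few lines. The slides describe a pipeline of unix tools; this sketch swaps in Python's zlib module instead, so it is an illustration of the idea and not the project's actual pipeline. The function names zipped_size and zip_similarity are hypothetical, and normalising by the smaller compressed size is an assumption, since the slides do not say how the sizes were compared.

```python
import zlib


def zipped_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def zip_similarity(a: str, b: str) -> float:
    """Bytes saved by compressing A+B together, relative to the smaller file.

    Assumption: dividing by the smaller compressed size is one plausible
    normalisation; other ratios would rank pairs in a similar way.
    """
    size_a, size_b = zipped_size(a), zipped_size(b)
    saved = size_a + size_b - zipped_size(a + b)
    return saved / min(size_a, size_b)
```

A higher score means the compressor found more of B already present in A. Note that zlib only looks back 32 KB, so for long reports a compressor with a larger window may be needed.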
Bigrams: simple 2
• 2 similar files share high-frequency bigrams: letters, words
• So… compare top N bigrams in A,B
• Top 10,20,30 letter-bigrams are shared IF A,B are very similar
• Top 30 word-bigram flags more “subtle” copying
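A minimal sketch of the top-N bigram comparison, assuming a simple lowercase tokeniser. The names top_bigrams and shared_top_bigrams are hypothetical, and the tokenisation rule is an assumption. As in the slides, only the lists of most frequent bigrams are compared, not their frequencies.

```python
import re
from collections import Counter


def top_bigrams(text: str, n: int, unit: str = "word") -> set:
    """Return the n most frequent bigrams of words or letters in text."""
    lowered = text.lower()
    if unit == "word":
        # Assumption: a token is a run of letters, digits or apostrophes.
        tokens = re.findall(r"[a-z0-9']+", lowered)
    else:
        tokens = list(lowered)
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram for bigram, _ in counts.most_common(n)}


def shared_top_bigrams(a: str, b: str, n: int = 30, unit: str = "word") -> int:
    """Count how many of the top-n bigrams appear in both files."""
    return len(top_bigrams(a, n, unit) & top_bigrams(b, n, unit))
```

Calling shared_top_bigrams(a, b, n=30, unit="word") for every pair of files, and sorting by the count, gives a ranked list of candidate pairs.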
Turnitin.com
• Commercial Web-based service; Leeds Univ has subscribed
• Cumbersome: cut and paste text, 24-hour turnaround
• Originality Report measures overlap between A and all others
• So… lots of small overlaps score worse than 1 big overlap
CopyCatch
• C++/Java PC package, works on a directory (text, Word, RTF, HTML)
• Commercial – pay per PC
• Finds most similar pairs of files
• Many extra features for forensic linguists - confusing for others?
• Flags most corpus copying cases
CopyFind
• Freeware from Virginia Univ
• Also works on a PC directory, more cumbersome than CopyCatch
• Default threshold: copying missed
• Threshold can be lowered to catch this, BUT also more red herrings
Usability: Submission
• Zipping: filenames in program
• Bigrams: give directory name
• Turnitin: Copy+Paste each text
• CopyCatch: select files from file-manager window
• CopyFind: give list of filenames
Usability: output
• Zipping: sorted list of filename-pairs
• Bigrams: sorted list of filename-pairs
• Turnitin: Originality Reports with similarities highlighted
• CopyCatch: sorted list of pairs + stats + texts side-by-side with similarities highlighted
• CopyFind: reports for each pair above threshold, similarities highlighted
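The "sorted list of filename-pairs" output, and the effect of a reporting threshold, can be sketched as follows. This is an illustration only: rank_pairs is a hypothetical name, and word_overlap (Jaccard overlap of word sets) is a stand-in score, not the measure used by any of the systems above.

```python
from itertools import combinations


def word_overlap(a: str, b: str) -> float:
    """Stand-in score: Jaccard overlap between the two files' word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_pairs(texts: dict, score=word_overlap, threshold: float = 0.0) -> list:
    """Score every pair of files and return (score, name_a, name_b) tuples,
    highest score first, keeping only pairs at or above the threshold."""
    scored = [
        (score(texts[x], texts[y]), x, y)
        for x, y in combinations(sorted(texts), 2)
    ]
    return sorted((p for p in scored if p[0] >= threshold), reverse=True)
```

Lowering the threshold lengthens the list: more genuine cases are caught, but more red herrings appear as well.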
Discussion
• “simple” as good as “smart” at finding cases of copying
• Surprising: Zipping just uses file-compression, Bigrams does not compare frequencies, only list of common bigrams…
• “smart” systems are easier to use
Is “smart” necessary/useful?
• Zipping/Bigrams are not smart
• But… Lab report MUST have: Intro, Methods, Results, Discussion…
• …copying in Discussion is “more suspicious” for human judges
• ? This could be used to downgrade red herrings with overlap in Intro, Methods (“copied” from coursework specification)
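One way the section idea above might be realised is to score each section separately and weight Discussion most heavily. This is a sketch under stated assumptions, not something the project implemented: the weights are invented for illustration, each report is assumed to be already split into a dict of section texts, and word_overlap is a stand-in similarity measure.

```python
# Hypothetical weights: overlap in Intro/Methods counts less, Discussion more.
SECTION_WEIGHTS = {"intro": 0.5, "methods": 0.5, "results": 1.0, "discussion": 2.0}


def word_overlap(a: str, b: str) -> float:
    """Stand-in score: Jaccard overlap between the two texts' word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def weighted_score(report_a: dict, report_b: dict, weights=SECTION_WEIGHTS) -> float:
    """Weighted average of per-section overlap between two reports."""
    total = sum(weights.values())
    weighted = sum(
        weight * word_overlap(report_a.get(name, ""), report_b.get(name, ""))
        for name, weight in weights.items()
    )
    return weighted / total
```

With these weights, two reports that share only their Intro score lower than two reports that share only their Discussion, which pushes specification-driven overlap down the ranked list.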
Conclusions
• Biomedical Sciences have a system for future real use (CopyCatch)
• Corpus for further research
• ? solution: “simple”, plus extra to eliminate red herrings ?
• Future work: look at other genres, other forensic linguistics problems