RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text
Liam Dugan*, Daphne Ippolito*, Arun Kirubarajan*, Chris Callison-Burch
University of Pennsylvania
{ldugan, daphnei, kiruba, ccb}@seas.upenn.edu
Abstract
In recent years, large neural networks for natural language generation (NLG) have made leaps and bounds in their ability to generate fluent text. However, the tasks of evaluating quality differences between NLG systems and understanding how humans perceive the generated text remain both crucial and difficult. In this system demonstration, we present Real or Fake Text (RoFT), a website that tackles both of these challenges by inviting users to try their hand at detecting machine-generated text in a variety of domains. We introduce a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated. We show preliminary results of using RoFT to evaluate detection of machine-generated news articles.
1 Introduction
Despite considerable advancements in building natural language generation (NLG) systems that can output extremely fluent English text, there is still not very much understanding of how humans perceive machine-generated text. Such an understanding is crucial both to evaluate improvements in NLG systems and to analyze the societal ramifications of machine-generated text becoming increasingly easy to produce.
When evaluating NLG systems, it is considered standard practice to ask evaluators to rate generated text on criteria such as fluency, naturalness, or relevance to a prompt on a Likert scale (van der Lee et al., 2019). Preference studies, where a rater is shown two generated excerpts and asked which one they prefer, are also common. Some recent work has focused on the detection problem: how capable are humans of distinguishing textual excerpts generated by a system from those written by another human (Ippolito et al., 2020; Zellers et al., 2019)?

∗ Authors listed alphabetically contributed equally.

Figure 1: A word cloud of common words that annotators used to describe why they thought sentences were machine-generated.
However, due to the prohibitive cost of running human evaluation studies, most prior work in this area has been rather limited in scope. For example, analyses usually show results on only a single category of text (news articles, stories, webtext, etc.). This could be problematic since different domains have different levels of named entities, world facts, narrative coherence, and other properties that impact the success of NLG systems. In addition, most papers only evaluate on a very limited selection of decoding strategy hyperparameters. Holtzman et al. (2019) and Ippolito et al. (2020) both show that the decoding strategy chosen at inference time can have a significant impact on the quality of generated text.
In this work, we introduce the Real or Fake Text (RoFT) system, a novel application for simultaneously collecting quality annotations of machine-generated text while allowing the public to assess and improve their skill at detecting machine-generated text.
In RoFT, we propose to use the task of detecting when text is machine-generated as a quality criterion for comparing NLG systems. Following Ippolito et al. (2020), we make the counter-intuitive assumption that the worse annotators are at detecting that text is machine-generated, the better the NLG system is at generating text.
In RoFT’s detection task, annotators are shown a passage of text one sentence at a time. The first several sentences are from a real human-written text source and the next several sentences are a machine-generated continuation. The user’s goal is to guess where the boundary is. When they think that a sentence is machine-generated, they are asked to give an explanation for their choice, and afterwards the true boundary is revealed.
In the remainder of this paper, we discuss why we think this task is interesting from a research perspective and describe the technical details behind our implementation. We show preliminary results that showcase the types of analyses that are possible with the collected data, and finally we discuss plans for future work.
The RoFT website is located at http://www.roft.io/. The source code is available under an MIT License at https://github.com/kirubarajan/roft.
2 Research Motivations
The purpose behind RoFT is to collect annotations on the scale needed to probe the quality of text generated under a variety of NLG conditions and systems. In this section, we describe the research questions we aim to answer using RoFT data.
2.1 Length Threshold for Detection
State-of-the-art generative models tend to produce text that is locally fluent but lacking in long-term structure or coherence. Intuition suggests that fluent NLG systems ought to produce text that is high quality for long durations (measured in number of sentences). As such, we are interested in using the boundary detection task—whether annotators can detect the boundary between human-written text and a machine-generated continuation—as a comparison method for NLG systems. We hypothesize that for better quality systems, the generated text will be able to fool humans for more sentences.
2.2 Text Genre/Style
Generative language models have now been trained and fine-tuned on a great diversity of genres and styles of text, from Reddit posts (Keskar et al., 2019) and short stories (Fan et al., 2018) to Wikipedia (Liu et al., 2018) and news articles (Zellers et al., 2019). Each of these datasets has its own distinct challenges for generation; for example, in the story domain it is acceptable for a generator to make up facts, while this would be unacceptable in a Wikipedia article. We are interested in how these differences might impact the ability of humans to detect machine-generated text.
2.3 Reasons Text is Low Quality
A study by van der Lee et al. (2019) found that less than 3% of recent papers on NLG ask for free-text comments when performing human evaluations. And yet, understanding why humans think text is low quality can be very important for diagnosing problems in NLG systems (Reiter and Belz, 2009). Therefore, the RoFT platform collects free-form textual explanations from our annotators on their decisions. Such data, though inevitably noisy, could provide insights into the types of errors that NLG systems introduce, the types of errors humans are sensitive to, and even the types of errors human-written corpora contain (when a rater inadvertently predicts that a human-written sentence is machine-generated).
2.4 Human Factor
The boundary detection task posed by RoFT is an artificial one. We do not expect that real-world uses of machine-generated text would involve a tidy split of prompt sentences followed by a machine-generated continuation. However, we believe that even an artificial framing such as RoFT’s has the potential both to educate the public on what to look for in machine-generated text and to give researchers insights into how humans perceive and react to such text. We are particularly interested in whether annotators improve over time and in how demographics (for example, paid crowd worker vs. university student) impact detection skill.
3 System Overview
This section gives an overview of RoFT’s design, including the task that annotators are asked to complete and methods for encouraging organic traffic.
3.1 Task Definition
The RoFT annotation task is posed as a game. Users first choose which category they would like to play in, where different categories correspond to different text domains or NLG systems. The “game” then consists of a series of rounds. Each round starts with the user being presented with a single sentence that is guaranteed to be human-written. For example, this might be the first sentence of a New York Times article. Afterwards, users may choose to display more sentences, one at a time. At each step, they must decide if they believe that the most recent sentence is still written by a human. When the user decides they are confident that a machine has written the most recent sentence (i.e. they have found the “boundary sentence”), the round ends. The user is then asked to provide a natural language explanation of what prompted their decision. In essence, the annotators’ goal is to identify the exact sentence where a machine “takes over” and the text is no longer human-written. Figure 2 gives screenshots of the flow of a single round.
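To make the round structure concrete, the following minimal Python sketch walks through the interaction loop described above. The function names and the callback-based structure are our own illustration and are not taken from the actual RoFT implementation.

```python
from typing import Callable, List, Optional, Tuple

def run_round(
    sentences: List[str],
    decide: Callable[[str], bool],   # returns True when the annotator flags "machine-generated"
    explain: Callable[[], str],      # collects the free-text explanation
) -> Tuple[Optional[int], Optional[str]]:
    """Reveal sentences one at a time until the annotator flags one as
    machine-generated, then collect an explanation (illustrative sketch only)."""
    for index, sentence in enumerate(sentences):
        print(sentence)              # in RoFT this is rendered in the web UI
        if decide(sentence):         # annotator believes this sentence is generated
            return index, explain()  # round ends at the flagged sentence
    return None, None                # annotator never flagged a sentence
```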
3.2 Implementation
The RoFT annotation website is designed to collect data needed to answer a variety of research questions, including those posed in Section 2. In particular, our system stores detailed metadata for each annotation. These include the order in which a user completed annotations, the type of user account associated with each annotation (e.g. paid worker or organic traffic), the NLG system used to produce each generation, and the amount of time each annotation takes. The system was developed in Python using the Django Framework and a SQL database. The use of a relational database enables sophisticated queries to be made on the collected annotations for analysis. We plan to make dumps of the database available to other researchers to further promote research into the evaluation of generated text.
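As an illustration of the kind of relational schema this metadata implies, the Django model below is one plausible sketch; all class and field names here are our own assumptions rather than the actual RoFT source code.

```python
from django.db import models

class Annotation(models.Model):
    """Hypothetical schema for a single boundary annotation (illustrative only)."""
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    account_type = models.CharField(max_length=32)      # e.g. "paid" vs. "organic"
    nlg_system = models.CharField(max_length=64)        # e.g. "GROVER" or "GPT-2 XL"
    decoding_p = models.FloatField()                     # nucleus sampling p for this example
    true_boundary = models.IntegerField()                # index of first generated sentence
    guessed_boundary = models.IntegerField(null=True)    # sentence the annotator flagged
    explanation = models.TextField(blank=True)           # free-form comment
    session_order = models.IntegerField()                # order within the user's session
    seconds_taken = models.FloatField()                  # time spent on this annotation
    created_at = models.DateTimeField(auto_now_add=True)
```

Keeping the decoding hyperparameters and session order on each row is what would allow per-p and per-position aggregations like those in Section 4 to be expressed as straightforward database queries.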
3.3 Gamification
Since the cost of collecting human annotations via a crowd platform such as Amazon Mechanical Turk can be prohibitively expensive for large studies, we aimed to build the RoFT website in a manner that would encourage sustained participation without the need for a financial incentive.
Each user has a Profile page (shown in Figure 3) where they can see statistics on the total number
(a) The user is shown an initial sentence and then one sentence of continuation at a time. At each step, the user decides if the latest sentence is human-written or machine-generated and presses the appropriate button.
(b) When the user decides that the most recent sentence is machine-generated, they are asked to provide an explanation for their decision.
(c) The true boundary is then revealed. In this case, the user would be alerted that they received 5 points since they guessed the boundary correctly.
Figure 2: The user interface for annotation.
Figure 3: A user’s profile page.
of annotations they’ve done, how many points they have earned, and how many questions they have answered perfectly. There is also a leaderboard where users can check how their point count compares to other raters. The leaderboard encourages users to do more annotations, since this is the only way to move up in the rankings.
We received unsolicited compliments from our initial annotators such as “Interesting, fun task” and “Really convincing passages.” We intend to add further gamification elements, including leaderboards broken down by text domain, comprehensive statistics on user progress and skill, and the ability to see and up-vote the free-text comments of other users.
3.4 Generations
We ultimately plan to use RoFT to study differences in detection performance across a variety of NLG systems and text domains. The initial version of RoFT includes two complementary categories of text: news and fictional stories. Users have the option to choose which category they would like to annotate.
For the news category, prompts are drawn from the New York Times Annotated Corpus (Sandhaus, 2008) and are truncated to between 1 and 10 sentences long. GROVER (Zellers et al., 2019) is then conditioned on these starting sentences and asked to complete the article. Finally, the outputs from GROVER are truncated so that the total number of sentences for each example is 10.
The data on fictional stories was prepared similarly, except that the Reddit Writing Prompts dataset (Fan et al., 2018) was used for the prompts and the GPT-2 XL model (Radford et al., 2019) was used for generation.
Each category contains over 1,500 examples, where for each example the number of human-written context sentences as well as the values of the decoding strategy hyperparameters were chosen randomly. For our initial seeding of data, nucleus sampling (Holtzman et al., 2019) was used for all decoding, where the p hyperparameter, which controls the diversity of the generated text, was randomly selected to be anywhere from p = 0 (argmax) to p = 1.0 (full random sampling).
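A minimal sketch of this preparation pipeline is shown below. Here generate_continuation is a hypothetical stand-in for calling GROVER or GPT-2 XL, and the uniform sampling of p is our reading of "randomly selected"; neither reflects the actual RoFT data-generation scripts.

```python
import random

def build_example(article_sentences, generate_continuation, total_sentences=10):
    """Assemble one RoFT example as described above (illustrative sketch).

    article_sentences: the sentences of a human-written article, in order.
    generate_continuation: callable returning a list of generated sentences
        given a prompt and a nucleus sampling parameter (hypothetical)."""
    # randomly choose how many human-written sentences to keep as the prompt
    num_prompt = random.randint(1, total_sentences)
    prompt = article_sentences[:num_prompt]

    # randomly choose the nucleus sampling hyperparameter between 0 (argmax)
    # and 1.0 (full random sampling); the exact sampling scheme is an assumption
    p = random.uniform(0.0, 1.0)

    # generate a continuation, then truncate so that the whole passage
    # (prompt plus continuation) contains exactly total_sentences sentences
    continuation = generate_continuation(prompt, top_p=p)
    continuation = continuation[: total_sentences - num_prompt]

    return {
        "prompt": prompt,
        "continuation": continuation,
        "boundary": num_prompt,   # index of the first machine-generated sentence
        "p": p,
    }
```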
4 Case Study
To show the efficacy of RoFT as an evaluation tool, we present a case study from our initial pilot of over 3,000 annotations of generations from the news article domain.
4.1 Data Collection
While our eventual hope is for the RoFT website to have enough organic traffic for useful data to be collected, for the purposes of this study, two hundred Amazon Mechanical Turk workers were paid to complete 10 annotations each on the website. In total, we collected 3,244 annotations (7.9% of annotators continued past the minimum of 10 questions they were required to do to get paid). 10% of the examples the crowd workers saw were designated attention check questions in which the prompt explicitly stated they should select “human-written” at every step. About 25% of crowd workers failed this check, and after filtering out these annotators, we were left with a total of 1,848 high-quality annotations, which we will refer to as the filtered annotation set.
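The filtering step amounts to dropping all work from annotators who failed an attention check. The sketch below illustrates this; the dictionary keys are assumed field names, and the choice to also exclude the attention-check items themselves from the filtered set is our assumption.

```python
def filter_annotations(annotations):
    """Keep only annotations from workers who never failed an attention check
    (illustrative sketch; field names are assumptions, not the RoFT schema)."""
    failed_workers = {
        a["worker_id"]
        for a in annotations
        if a["is_attention_check"] and not a["passed_check"]
    }
    # retain non-attention-check annotations from workers who never failed a check
    return [
        a for a in annotations
        if a["worker_id"] not in failed_workers and not a["is_attention_check"]
    ]
```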
4.2 Inter-Annotator Agreement
There were 768 examples for which at least two crowd workers provided annotations (645 of which had at least three annotations). This led to 6,115 pairs of annotations on the same examples. Of these, 18.3% predicted the exact same sentence as the boundary, and 28.4% predicted boundaries at most one sentence apart from each other. When considering only the filtered annotation set, there were 2,064 pairs of annotations. Of these, 18.6% predicted the exact same sentence as the boundary, and 28.3% predicted boundaries at most one sentence apart from each other.
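These agreement figures can be computed by pairing up annotations of the same example; a sketch of that computation is shown below (the input format is an assumption on our part).

```python
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(annotations):
    """Given (example_id, guessed_boundary) pairs, return the fraction of
    annotation pairs that agree exactly and within one sentence (sketch)."""
    by_example = defaultdict(list)
    for example_id, guess in annotations:
        by_example[example_id].append(guess)

    pairs = exact = within_one = 0
    for guesses in by_example.values():
        for a, b in combinations(guesses, 2):   # every pair of raters on the same example
            pairs += 1
            exact += (a == b)
            within_one += (abs(a - b) <= 1)

    return exact / pairs, within_one / pairs
```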
Figure 4: A histogram of the filtered annotation set grouped by the distance (in number of sentences) between the sentence selected by the annotator and the true boundary sentence.
4.3 Evaluation Measures
We consider three methods for evaluating annotator ability.
4.3.1 Accuracy
Among annotators who passed our attention check, 15.8% of the filtered annotations correctly identified the exact boundary between human-written and machine-generated text. Additionally, the average annotation from our filtered set was 1.989 sentences after the true boundary. This is consistent with our intuition, namely that current state-of-the-art NLG systems are capable of fooling humans, but typically only for one or two sentences.
4.3.2 Distance from Boundary
In Figure 4, we show a histogram of the number of annotations at each possible distance from the true boundary. As a note, values closer to zero are more likely by construction because there are more opportunities for these distances to be selected. For example, a distance of -9 is only possible if the generation boundary is at the 10th sentence, while a distance of 0 is possible in every configuration. However, the observed distribution of distances is far from symmetric, with the left tail (composed of annotators picking human-written sentences) dropping off precipitously while the right tail (composed of machine-generated sentences) decreases more linearly. This indicates that our annotators are successfully picking up on clues in the generated text, and thus the sentence-by-sentence structure of the RoFT experiment is an effective way to evaluate text. These preliminary results bode well for future large-scale use of the tool.
4.3.3 Points Awarded
While accuracy may be a simple and intuitive metric for assessing performance, it is sub-optimal for our purposes as it does not give partial credit for guesses that are after the boundary, despite such guesses being successful identifications of generated text. Average distance (in sentences) from the boundary is not sufficient either, as it does not weight all guesses before the boundary equally negatively and thus over-penalizes too-early annotations on examples with late-occurring boundaries.
To combat these issues, we developed a point system to better capture annotator ability. After each annotation, a user is assigned points based on their performance: 5 points for guessing exactly on the boundary and a linearly decreasing number of points for each sentence beyond the boundary. No points are awarded for guesses that appear before the boundary. We use the average points per annotation as our metric for the experiments shown in Figure 5.
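Written out as code, the scoring rule looks like the following; the one-point-per-sentence decrement is our reading of "linearly decreasing" and should be treated as an assumption.

```python
def points_for_guess(guess_index, boundary_index, max_points=5):
    """Score one annotation: full credit on the boundary, linearly less after it,
    nothing before it (the per-sentence decrement is an assumption)."""
    if guess_index < boundary_index:
        return 0                                           # guessed too early: no credit
    return max(0, max_points - (guess_index - boundary_index))
```

Under this rule, flagging the boundary sentence itself yields 5 points, while flagging two sentences after it yields 3.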
4.4 Skill Range of Annotators
There was a significant range in detection ability across the crowd workers. The top 5% of the filtered worker pool earned an average of 3.34 points per annotation while the bottom 5% earned an average of 0.35. Since it is difficult to separate out the influence of inherent skill from that of misaligned incentives (AMT workers were paid for completion, not correctness), more research is necessary to understand differences in annotator ability.
4.5 Impact of Decoding Strategy
During our small-scale case study, we did not see a noticeable correlation between the values of the nucleus sampling (Holtzman et al., 2019) hyperparameter p and the detection accuracy of humans, as reported in Figure 5b. This is likely due to the low number of annotations per value of p (n=180), and we hope to run a more comprehensive version of this experiment with more data in the future.
4.6 Impact of Revealing the Boundary
As part of the gamification aspect of the RoFT platform, we reveal the true boundary to our annotators after every annotation they complete. This feature adds a level of interactivity to the process and is
Figure 5: In (a) we show the average number of points (Section 4.3) received per annotation in the filtered annotation set, grouped by the temporal order in which annotations were shown to the annotators, from 0 (first) to 9 (last). In (b) we show the average number of points received per item in the filtered annotation set for each value of p used for decoding. Error bars are standard deviation. No statistically significant trends were observed in this preliminary study.
crucial for ensuring that the RoFT experiment is enjoyable and appeals to the general public. To better understand how this decision affected annotator skill, we analyzed whether our annotators became more accurate as they completed more annotations. Figure 5a shows that over a session of 10 annotations, annotators exhibit little to no improvement at the annotation task over time. Future studies using the RoFT platform will further investigate whether human annotators can be trained to detect generated text over long periods of time and multiple gameplay sessions.
4.7 Free-form Comments
Our proposed annotation system allows annotators to provide a natural language explanation of why they made a particular decision (e.g. classifying a sentence as human-written or machine-generated). Due to minimal oversight, many annotators re-used or copy/pasted their comments across annotations. After filtering for duplicates, we collected over 1,200 unique comments out of around 3,000 annotations. Manual inspection shows that many annotations relied on similar clues, such as problems with entailment, formatting (e.g. punctuation), and repetition. These responses can be used to inform future improvements to existing NLG systems and decoding strategies. Additionally, it is possible to use data mining techniques to extract an error taxonomy from the provided natural language descriptions of errors.
5 Related Work
Nearly all papers in NLG do some form of human evaluation, usually using Amazon Mechanical Turk
Sample Annotation
Seems like a conversational statement that doesnt logically follow from a book title reference
not relevant to preceding sentences
I don’t think that a human would write about tarot cards in an obituary and it says obituaries plural.
The sentence is too short and simple, sweating computerized.
First time I heard of dinosaur-eating mammals
The sentence is left hanging.
Repeated the second line again and To is written as TO
Table 1: Examples of explanations crowd workers gave for why they thought a sentence was machine-generated.
(van der Lee et al., 2019). Typically the interfaces for these evaluations are simple web forms. van der Lee et al. (2019) offer a survey of many of these methods. Custom-designed websites for collecting or displaying human evaluations of generated text have become increasingly prominent in the open-ended dialog domain, with ChatEval (Sedoc et al., 2018) and ConvAI (Pavlopoulos et al., 2019) being two examples.
However, RoFT was primarily influenced by other “real or fake” websites that attempt to gamify the detection task, such as http://www.whichfaceisreal.com/ for generated face images and https://faketrump.ai/ for generated Tweets. Our task is similar to the one used for human evaluation in Ippolito et al. (2020), except in their task the text shown to raters was either entirely human-written or entirely machine-generated.
The boundary detection task we propose was inspired by the Dialogue Breakdown Detection Challenge (Higashinaka et al., 2016), in which the goal is to automatically detect the first system utterance in a conversation between a human and a chatbot system that causes a dialogue breakdown.
6 Conclusion and Future Work
In this work, we have introduced RoFT and shown how it can be used to collect annotations on how well human raters can tell when an article transitions from being human-written to being machine-generated.
Ultimately, we plan to use RoFT to conduct a large-scale systematic study of the impact of decoding strategy, fine-tuning dataset, prompt genre, and other factors on the detectability of machine-generated text. We also intend to collect and release a large dataset of natural language explanations for why humans think text is machine-generated. We hope that these will provide insights both into problems with the human-written text we use as prompts and into the types of errors that NLG systems make.
Such a study will require tens of thousands of human annotations. We hope that by gamifying the annotation process and encouraging organic traffic to the website, we can ultimately bypass the need for crowd workers who, since they are paid by the annotation, are disincentivized from taking the time to provide high-quality annotations.
We believe that RoFT provides a powerful tool for understanding the strengths and limitations of a great variety of NLG systems, and we look forward to working with researchers interested in testing out their own model outputs within the RoFT evaluation framework.
Acknowledgements
This research is based upon work supported in part by the DARPA KAIROS Program (contract FA8750-19-2-1004), the DARPA LwLL Program (contract FA8750-19-2-0201), and the IARPA BETTER Program (contract 2019-19051600004). Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, IARPA, or the U.S. Government. The RoFT website is also supported by a grant from the Google Cloud Platform research credits program.
We thank the members of our lab for their feedback on the design of the RoFT user interface.
References

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. 2016. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3146–3150.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. Salesforce Einstein.ai blog.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

John Pavlopoulos, Nithum Thain, Lucas Dixon, and Ion Androutsopoulos. 2019. ConvAI at SemEval-2019 Task 6: Offensive language identification and categorization with Perspective and BERT. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 571–576.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

Joao Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. 2018. ChatEval: A tool for the systematic evaluation of chatbots. In Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG), pages 42–44.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9054–9065.