NLG Lecture 16: Evaluation
Jon Oberlander
2
Evaluation is hot: reminder - Why “averages” can hurt
Compare training and testing on individual judgments with training and testing on averaged judgments …
Best results: – Minimise ranking error by avoiding compromises
such as mixtures of learned rules
3
Evaluation is hot: reminder - Current accuracy is state of the art
Iacobelli, Gill, Nowson & Oberlander. 2011.
4
Evaluation is hot: reminder - Crag Evaluation: Results
5
Plan
Distinguish some types of NLG evaluation. Automatic, intrinsic evaluation:
– What’s best? – Why are corpus-based gold standards problematic?
Task-based, extrinsic evaluations – Are they too expensive?
A way to feed back from task-based evaluations to help select best automatic, intrinsic metrics.
Adapted from Belz 2009 6
Some preliminaries - Belz 2009
The user-oriented vs. developer-oriented distinction concerns evaluation purpose. – Developer-oriented evaluations focus on functionality … and seek to
assess the quality of a system’s (or component’s) outputs. – User-oriented evaluations … look at a set of requirements ( …
acceptable processing time, maintenance cost, etc.) of the user (embedding application or person) and assess how well different technological alternatives fulfill them.
Another common distinction is about evaluation methods: – Intrinsic evaluations assess properties of systems in their own right,
for example, comparing their outputs to reference outputs in a corpus – Extrinsic evaluations assess the effect of a system on something that
is external to it, for example, the effect on human performance at a given task or the value added to an application.
Note also: – Subjective user evaluation (did you like the output/the system?) – Objective user evaluation (how fast/accurate are users on tasks?)
Adapted from Belz 2009 7
Intrinsic, developer evaluations hold sway
1. Assessment by trained assessors of the quality of system outputs according to different quality criteria, typically using rating scales;
2. Automatic measurements of the degree of similarity between system outputs and reference outputs; and
3. Human assessment of the degree of similarity between system outputs and reference outputs.
Adapted from Belz 2009 8
Why it’s extrinsic evaluation that really matters
“If we don’t include application purpose in task definitions then not only do we not know which applications (if indeed any) systems are good for,
we also don’t know whether the task definition (including output representations) is appropriate for the application purpose we have in mind.” (p113)
Adapted from Belz 2009 9
Why we often settle for less …
For understanding, there is usually a single target output. – But for generation, there are many acceptable outputs, so which
outputs we measure similarity to matters …
Metrics like BLEU and ROUGE are only “surrogate measures” – We test them via their “correlation with human ratings of
quality, using Pearson’s product-moment correlation coefficient, and sometimes Spearman’s rank-order correlation coefficient”
– We don’t then test the human ratings (cf. personality in Lecture 15)!
– “If human judgment says a system is good, then if an automatic measure says the system is good, it simply confirms human judgment; if the automatic measure says the system is bad, then the measure is a bad one”
– But if intrinsic conflicts with extrinsic, we should be worried
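The correlation test described above can be sketched in a few lines of Python. The scores below are invented for illustration; only the procedure (Pearson on the raw scores, Spearman as Pearson on the ranks) reflects the slide.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    """Rank values from 1, averaging tied groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: BLEU scores for five systems vs. mean human ratings.
bleu = [0.12, 0.25, 0.31, 0.40, 0.55]
human = [2.1, 3.0, 2.8, 3.9, 4.4]
print(pearson(bleu, human), spearman(bleu, human))
```

Note that this validates the metric against human ratings, not the ratings themselves, which is exactly the gap the slide points out.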
Adapted from Belz 2009 10
Intrinsic vs extrinsic - reasons to be concerned
Law et al. (2005) – “in intrinsic assessments doctors rated … graphs more highly
than … texts, in an extrinsic diagnostic performance test they performed better with the texts than the graphs.”
Engelhardt, Bailey, and Ferreira (2006) – “subjects rated over-descriptions as highly as concise
descriptions, but performed worse at a visual identification task with over-descriptions than with concise descriptions.”
Adapted from Belz 2009 11
Further reasons for concern
“Stable averages of human quality judgments, let alone high levels of agreement, are hard to achieve” – See Walker et al - Lectures 14 and 15.
Does a human top line always mean machines must perform more poorly? – “In NLG, domain experts have been shown to prefer system-
generated language to alternatives produced by human experts”
“The explanation routinely given for not carrying out extrinsic evaluations is that they are too time-consuming and expensive.” – Later on, we will question the validity of that position.
Adapted from Gatt, Belz and Kow 2009 12
Shared tasks in NLG - TUNA-REG (Gatt, Belz and Kow 2009)
Adapted from Gatt, Belz and Kow 2009 13
The TUNA-REG evaluation context
“Each file in the TUNA corpus consists of a single pairing of a domain (a representation of 7 entities and their attributes) and a human-authored description for one of the entities (the target referent).
Descriptions were collected in an online elicitation experiment which was advertised mainly on a website hosted at the University of Zurich Web Experimentation List
Participants were shown pictures of the entities in the given domain and were asked to type a description of the target referent
In the +LOC condition, participants were told that they could refer to entities using any of their properties (including their location on the screen). In the −LOC condition, they were discouraged from doing so, though not prevented.”
Adapted from Gatt, Belz and Kow 2009 14
Test set, task
The test set … consists of 112 items, each with a different domain paired with two human-authored descriptions.
GRE algorithms take as input a domain (entities and attributes), plus the intended referent; output a set of attributes of the referent, which distinguish it from “confusers”
TUNA-REG adds realization
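The GRE input/output contract above can be illustrated with a toy incremental attribute-selection algorithm in the spirit of Dale and Reiter — this is not any submitted system, and the domain and preference order are invented:

```python
def select_attributes(domain, target, preference):
    """Incrementally pick attributes of `target` until all confusers
    (the other entities in the domain) are ruled out."""
    confusers = {e for e in domain if e != target}
    chosen = {}
    for attr in preference:
        value = domain[target].get(attr)
        if value is None:
            continue
        ruled_out = {e for e in confusers if domain[e].get(attr) != value}
        if ruled_out:  # only keep attributes that do some distinguishing work
            chosen[attr] = value
            confusers -= ruled_out
        if not confusers:
            break
    return chosen

# Toy TUNA-style domain: entities with their attributes.
domain = {
    "e1": {"type": "fan",  "colour": "red",  "size": "small"},
    "e2": {"type": "fan",  "colour": "blue", "size": "small"},
    "e3": {"type": "desk", "colour": "red",  "size": "large"},
}
print(select_attributes(domain, "e1", ["type", "colour", "size"]))
# {'type': 'fan', 'colour': 'red'}
```

A realiser (the part TUNA-REG adds) would then turn the selected attribute set into a word string such as "the red fan".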
Adapted from Gatt, Belz and Kow 2009 15
Teams
IS: – extended full-brevity algorithm which uses a nearest
neighbour technique to select the attribute set (AS) most similar to a given writer’s previous ASs
GRAPH: – existing graph-based attribute selection component, which
represents a domain as a weighted graph, and uses a cost function for attributes. The team developed a new realiser which uses a set of templates derived from the descriptions in the TUNA corpus.
NIL-UCM: – The three systems submitted by this group use a standard
evolutionary algorithm for attribute selection
USP:
– The system submitted by this group, USP-EACH, is a frequency-based greedy attribute selection strategy which takes into account the +/ − LOC attribute in the TUNA data.
Adapted from Gatt, Belz and Kow 2009 16
Evaluation measures - especially automatic, intrinsic metrics
Accuracy – measures the percentage of cases where a system’s output word string
was identical to the corresponding description in the corpus.
String-edit distance (SE)
– is the classic Levenshtein distance measure and computes the minimal number of insertions, deletions and substitutions required to transform one string into another.
BLEU-x – is an n-gram based string comparison measure, … computes the
proportion of word n-grams of length x and less that a system output shares with several reference outputs.
NIST – where BLEU gives equal weight to all n-grams, NIST gives more
importance to less frequent n-grams, which are taken to be more informative. … maximum NIST score depends on the size of the test set.
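The string-edit and n-gram measures above are easy to sketch. Real BLEU adds clipping, a brevity penalty, and a geometric mean over n-gram sizes, and scores against several references; this is a simplified single-reference illustration:

```python
def string_edit_distance(a, b):
    """Classic Levenshtein distance, here over word sequences."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

def ngram_precision(candidate, reference, n):
    """Proportion of candidate n-grams also found in the reference
    (a simplified, unclipped version of what BLEU counts)."""
    def grams(s):
        toks = s.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    cand = grams(candidate)
    ref = set(grams(reference))
    return sum(g in ref for g in cand) / len(cand) if cand else 0.0

print(string_edit_distance("the small red fan", "the red fan"))   # 1
print(ngram_precision("the small red fan", "the red fan", 2))     # 1/3
```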
Adapted from Gatt, Belz and Kow 2009 17
Results - ranked by String Edit Distance
Adapted from Gatt, Belz and Kow 2009 18
Human, intrinsic
“Intrinsic human evaluation involved descriptions for all 112 test data items from all six submitted systems, as well as from the two sets of human-authored descriptions.”
8 participants provided 112 judgments each (14 per system each) – “Q1: How clear is this description? Try to imagine someone
who could see the same grid with the same pictures, but didn’t know which of the pictures was the target. How easily would they be able to find it, based on the phrase given?
– Q2: How fluent is this description? Here your task is to judge how well the phrase reads. Is it good, clear English?”
Answers on a slider.
Adapted from Gatt, Belz and Kow 2009 19
Results - ranked by Adequacy
Adapted from Gatt, Belz and Kow 2009 20
Task-based evaluation
“In each of their 5 practice trials and 56 real trials, participants were shown a system output (i.e. a WORD-STRING), together with its corresponding domain, displayed as the set of corresponding images on the screen.
In this experiment the intended referent was not highlighted in the on-screen display, and the participant’s task was to identify the intended referent among the pictures by mouse-clicking on it.”
Adapted from Gatt, Belz and Kow 2009 21
Results - ordered by identification accuracy
Adapted from Gatt, Belz and Kow 2009 22
How do the various measures correlate?
Human intrinsic judgments of adequacy correlate significantly with identification accuracy, and (inversely) with speed. – They also correlate significantly with the automatic intrinsic
metric of accuracy. Accuracy has some relationship to task measures.
– But otherwise, automatic measures have little to say!
Adapted from Foster 2008 23
Are corpus-based intrinsic measures OK? Foster 2008
“When automatically evaluating generated output, the goal is to find metrics that can easily be computed and that can also be shown to correlate with human judgements of quality.
Many automated generation evaluations measure the similarity between the generated output and a corpus of gold-standard target outputs, often using measures such as precision and recall.
Such measures of corpus similarity are straightforward to compute and easy to interpret; however, they are not always appropriate for generation systems.
Several recent studies … have shown that strict corpus-similarity measures tend to favour repetitive generation strategies that do not diverge much, on average, from the corpus data, while human judges often prefer output with more variety.”
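The kind of corpus-similarity measure Foster describes can be sketched as set overlap over tokens (the example strings are invented):

```python
def precision_recall(generated, reference):
    """Set-based precision and recall of generated tokens against a
    gold-standard reference output."""
    gen, ref = set(generated.split()), set(reference.split())
    overlap = len(gen & ref)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall

# A conservative output that stays close to the corpus scores well;
# a more varied output may score lower even if humans prefer it.
p, r = precision_recall("the red fan", "the small red fan")
print(p, r)  # 1.0 0.75
```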
24
Comic (1/7)
Adapted from Foster 2008 25
Animating an embodied conversational agent
“The most common display used by the speaker was a downward nod
User-preference evaluation had the single largest differential effect on the displays used. – When the speaker described features of the design that the
user was expected to like, he was relatively more likely to turn to the right and to raise his eyebrows
– on features that the user was expected to dislike, on the other hand, there was a higher probability of left leaning, lowered eyebrows, and narrowed eyes …”
Adapted from Foster 2008 26
Strategies for head animation schedules
The rule-based strategy – includes displays only on derivation-tree nodes corresponding to
specific tile-design properties: that is, manufacturer and series names, colours, and decorative motifs.
– for every node associated with a positive evaluation, this strategy selects a right turn and brow raise, etc.
Data-driven strategies consider all nodes in the syntactic tree for a sentence as possible sites for a facial display. – The system considers the set of displays that occurred on all nodes in
the corpus with the same syntactic, semantic, and pragmatic context – Majority strategy selects the most common option in all cases; – Weighted strategy makes a stochastic choice among all of the
options based on the relative frequency. • Suppose the speaker made no motion 80% of the time, a downward nod
15% of the time, and a downward nod with a brow raise the other 5% of the time.
• For nodes with this context, the majority strategy would always choose no motion, while the weighted strategy would choose no motion with probability 0.8, etc.
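The two data-driven strategies can be sketched directly from the frequency table in this example (the display labels and frequencies are the slide's hypothetical context, not corpus figures):

```python
import random
from collections import Counter

# Relative frequencies of displays observed on nodes with one context.
frequencies = {"no motion": 0.80, "nod": 0.15, "nod + brow raise": 0.05}

def majority_choice(freqs):
    """Majority strategy: always pick the most common display."""
    return max(freqs, key=freqs.get)

def weighted_choice(freqs, rng=random):
    """Weighted strategy: stochastic choice in proportion to frequency."""
    options = list(freqs)
    return rng.choices(options, weights=[freqs[o] for o in options])[0]

print(majority_choice(frequencies))  # no motion
counts = Counter(weighted_choice(frequencies) for _ in range(10_000))
print(counts)  # roughly 8000 / 1500 / 500
```

The weighted strategy is what preserves variety: over many sentences it reproduces the corpus distribution, where the majority strategy collapses every context to a single display.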
Adapted from Foster 2008 27
Results - how the three strategies compare with corpus
Adapted from Foster 2008 28
Results - how the generated data measure up, within themselves
Adapted from Foster 2008 29
Relating these measures back to human judgments
We also have available users’ preferences among alternatives (Foster and Oberlander 2007): – Original is best; rule-based does rather well – Weighted is better than majority
“None of the corpus-reproduction metrics had any relationship to the users’ preferences, while the number and diversity of displays per sentence appear to have contributed much more strongly to the choices made by the human judges.”
Do not use similarity to corpus as your gold standard!
Adapted from Koller et al. 2009 30
Shared tasks in NLG - GIVE - Koller et al. 2009
“Subjects try to solve a treasure hunt in a virtual 3D world that they have not seen before. The computer has a complete symbolic representation of the virtual world.
The challenge for the NLG system is to generate, in real time, natural-language instructions that will guide the users to the successful completion of their task.”
Adapted from Koller et al. 2009 31
Generating Instructions in Virtual Environments
“Embedding the NLG task in a virtual world encourages the participating research teams to consider communication in a situated setting.
Experiments have shown that human instruction givers make the instruction follower move to a different location in order to use a simpler referring expression (RE) - see Stoia et al. 2006.
The virtual environments scenario is so open-ended that it - and specifically the instruction-giving task - can potentially be of interest to a wide range of NLG researchers. This is most obvious for research in sentence planning (GRE, aggregation, lexical choice) and realization (the real-time nature of the task imposes high demands on the system’s efficiency).”
Adapted from Koller et al. 2009 32
The GIVE software architecture
“1. The client, which displays the 3D world to users and allows them to interact with it;
2. The NLG servers, which generate the natural- language instructions; and
3. The Matchmaker, which establishes connections between clients and NLG servers.”
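A minimal sketch of the Matchmaker's role in this architecture (purely illustrative: the class, its method names, and the round-robin assignment policy are all assumptions, and the real Matchmaker works over network connections):

```python
import itertools

class Matchmaker:
    """Assigns each connecting client to one of the registered NLG
    servers, cycling through them so games are spread across systems."""
    def __init__(self):
        self.servers = []
        self._cycle = None

    def register_server(self, server_id):
        self.servers.append(server_id)
        self._cycle = itertools.cycle(self.servers)  # restart the rotation

    def connect_client(self, client_id):
        if not self.servers:
            raise RuntimeError("no NLG servers registered")
        return client_id, next(self._cycle)

mm = Matchmaker()
mm.register_server("Austin")
mm.register_server("Union")
print(mm.connect_client("player-1"))  # ('player-1', 'Austin')
print(mm.connect_client("player-2"))  # ('player-2', 'Union')
```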
Adapted from Koller et al. 2009 33
Evaluation World 1
Adapted from Koller et al. 2009 34
Evaluation World 2
Adapted from Koller et al. 2009 35
Evaluation World 3
Adapted from Koller et al. 2009 36
Teams and data gathering
“Five NLG systems were evaluated in GIVE-1: 1. one system from the University of Texas at Austin (Austin in
the graphics below); 2. one system from Union College in Schenectady, NY (Union); 3. one system from the Universidad Complutense de Madrid
(Madrid); 4. two systems from the University of Twente: one serious
contribution (Twente) and one more playful one (Warm-Cold).
Over the course of three months, we collected 1143 valid games.
A game counted as valid if the game client didn’t crash, the game wasn’t marked as a test game by the developers, and the player completed the tutorial.”
Adapted from Koller et al. 2009 37
Adapted from Koller et al. 2009 38
Adapted from Giuliani et al. In Submission 39
JAST - evaluating a Joint Action robot
Adapted from Giuliani et al. In Submission 40
Experimented with dialogue strategies and with reference generation
Adapted from Giuliani et al. In Submission 41
Task-based evaluation: Subjective and objective results
Adapted from Giuliani et al. In Submission 42
Reconnecting task with available intrinsic metrics - PARADISE
The PARADISE evaluation framework (Walker et al., 2000) explores the relationship between the subjective and objective factors.
PARADISE uses stepwise multiple linear regression to predict subjective user satisfaction,
based on measures representing the performance dimensions of task success, dialogue quality, and dialogue efficiency, resulting in a predictor function.
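The regression at the heart of PARADISE can be sketched with ordinary least squares. The stepwise predictor selection is omitted, and the per-dialogue measures below are invented; only the shape of the predictor function (satisfaction as a weighted sum of success, quality, and cost measures) reflects the framework.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y,
    solved with Gaussian elimination and back substitution."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# Hypothetical per-dialogue rows: [1, task_success, dialogue_quality,
# dialogue_cost] (the leading 1 fits an intercept), and satisfaction y.
X = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 2, 1, 0]]
y = [3.0, 4.0, 0.5, 5.5, 8.0]
print([round(w, 3) for w in ols(X, y)])  # [1.0, 2.0, 3.0, -0.5]
```

The fitted weights are what PARADISE exploits: once satisfaction is predicted from objective measures, those measures (and the intrinsic metrics that correlate with them) can stand in for expensive user studies.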
43
Summary
Much work has focussed on automatic, intrinsic evaluation. Some metrics (e.g. NIST-5) are related to human, intrinsic
evaluations. – But they’re still only a surrogate for extrinsic evaluation!
Temptation to use automatic corpus-based metrics should be resisted - some other automatic metrics may be superior, especially when variation is valued.
Task-based, extrinsic evaluations are the best, and are not as expensive as has sometimes been claimed.
PARADISE can allow findings from task-based evaluations to feed back into appropriate engineering choices and selection of appropriate automatic, intrinsic metrics.
44
References
Belz, A. (2009) That's nice ... what can you do with it? Computational Linguistics, 35, 111-118.
Byron, D., Koller, A., Striegnitz, K., Cassell, J., Dale, R., Moore, J. and Oberlander, J. (2009) Report on the First NLG Challenge on Generating Instructions in Virtual Environments (GIVE). In Proc of the 12th European Workshop on Natural Language Generation, Athens, March 2009.
Foster, M.E. (2008). Automated metrics that agree with human judgements on generated output for an embodied conversational agent. In Proceedings of INLG 2008, Salt Fork, Ohio, June 2008.
Foster, M.E. and Oberlander, J. (2007) Corpus-based generation of head and eyebrow motion for an embodied conversational agent. International Journal of Language Resources and Evaluation, 41, 305-323
Gatt, A., Belz, A. and Kow, E. (2009) The TUNA-REG Challenge 2009: Overview and evaluation results. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG-09).
Giuliani, M. et al. (in submission) Situated Reference in a Hybrid Human-Robot Interaction System.
Koller, A., Striegnitz, K., Byron, D., Cassell, J., Dale, R., Dalzel-Job, S., Oberlander, J. and Moore, J. (2009) Validating the web-based evaluation of NLG systems. In Proceedings of ACL-47. Singapore, August 2009.
Walker, M., Kamm, C. and Litman, D. (2000) Towards developing general models of usability with PARADISE. Natural Language Engineering, 6, 363-377.