
NLG Lecture 16: Evaluation

Jon Oberlander


Evaluation is hot: reminder - Why “averages” can hurt

  Compare training and testing on individual judgments with training and testing on averaged judgments …

  Best results:
–  Minimise ranking error by avoiding compromises such as mixtures of learned rules


Evaluation is hot: reminder - Current accuracy is state of the art

Iacobelli, Gill, Nowson & Oberlander. 2011.


Evaluation is hot: reminder - Crag Evaluation: Results


Plan

  Distinguish some types of NLG evaluation.

  Automatic, intrinsic evaluation:
–  What’s best?
–  Why are corpus-based gold standards problematic?

  Task-based, extrinsic evaluations:
–  Are they too expensive?

  A way to feed back from task-based evaluations to help select the best automatic, intrinsic metrics.

Adapted from Belz 2009

Some preliminaries - Belz 2009

  The user-oriented vs. developer-oriented distinction concerns evaluation purpose.
–  Developer-oriented evaluations focus on functionality … and seek to assess the quality of a system’s (or component’s) outputs.
–  User-oriented evaluations … look at a set of requirements (… acceptable processing time, maintenance cost, etc.) of the user (embedding application or person) and assess how well different technological alternatives fulfill them.

  Another common distinction is about evaluation methods:
–  Intrinsic evaluations assess properties of systems in their own right, for example, comparing their outputs to reference outputs in a corpus.
–  Extrinsic evaluations assess the effect of a system on something that is external to it, for example, the effect on human performance at a given task or the value added to an application.

  Note also:
–  Subjective user evaluation (did you like the output/the system?)
–  Objective user evaluation (how fast/accurate are users on tasks?)

Adapted from Belz 2009

Intrinsic, developer evaluations hold sway

1.  Assessment by trained assessors of the quality of system outputs according to different quality criteria, typically using rating scales;

2.  Automatic measurements of the degree of similarity between system outputs and reference outputs; and

3.  Human assessment of the degree of similarity between system outputs and reference outputs.

Adapted from Belz 2009

Why it’s extrinsic evaluation that really matters

  “If we don’t include application purpose in task definitions then not only do we not know which applications (if indeed any) systems are good for,

  we also don’t know whether the task definition (including output representations) is appropriate for the application purpose we have in mind.” (p113)

Adapted from Belz 2009

Why we often settle for less …

  For understanding, there is usually a single target output.
–  But for generation, with multiple possible outputs, similarity to outputs matters …

  Metrics like BLEU and ROUGE are only “surrogate measures”
–  We test them via their “correlation with human ratings of quality, using Pearson’s product-moment correlation coefficient, and sometimes Spearman’s rank-order correlation coefficient”
–  We don’t then test the human ratings (cf. personality in Lecture 15)!
–  “If human judgment says a system is good, then if an automatic measure says the system is good, it simply confirms human judgment; if the automatic measure says the system is bad, then the measure is a bad one”
–  But if an intrinsic measure conflicts with an extrinsic one, we should be worried

Adapted from Belz 2009
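
To make the validation step concrete: a minimal sketch (invented per-system scores, scipy assumed) of how a metric’s correlation with human ratings is typically computed.

    # Sketch: validating an automatic metric against human quality ratings.
    # The paired score lists below are invented for illustration.
    from scipy.stats import pearsonr, spearmanr

    metric_scores = [0.62, 0.48, 0.71, 0.55, 0.80, 0.33]   # e.g. BLEU per system
    human_ratings = [3.9, 3.1, 4.2, 3.4, 4.6, 2.5]         # e.g. mean fluency per system

    r, r_p = pearsonr(metric_scores, human_ratings)        # linear association
    rho, rho_p = spearmanr(metric_scores, human_ratings)   # rank-order association

    print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
    print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")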

Intrinsic vs extrinsic - reasons to be concerned

  Law et al. (2005)
–  “in intrinsic assessments doctors rated … graphs more highly than … texts, in an extrinsic diagnostic performance test they performed better with the texts than the graphs.”

  Engelhardt, Bailey, and Ferreira (2006)
–  “subjects rated over-descriptions as highly as concise descriptions, but performed worse at a visual identification task with over-descriptions than with concise descriptions.”

Adapted from Belz 2009

Further reasons for concern

  “Stable averages of human quality judgments, let alone high levels of agreement, are hard to achieve”
–  See Walker et al. - Lectures 14 and 15.

  Does a human top line always mean machines must perform more poorly?
–  “In NLG, domain experts have been shown to prefer system-generated language to alternatives produced by human experts”

  “The explanation routinely given for not carrying out extrinsic evaluations is that they are too time-consuming and expensive.”
–  Later on, we will question the validity of that position.

Adapted from Gatt, Belz and Kow 2009

Shared tasks in NLG - TUNA-REG (Gatt, Belz and Kow 2009)

Adapted from Gatt, Belz and Kow 2009

The TUNA-REG evaluation context

  “Each file in the TUNA corpus consists of a single pairing of a domain (a representation of 7 entities and their attributes) and a human-authored description for one of the entities (the target referent).

  Descriptions were collected in an online elicitation experiment which was advertised mainly on a website hosted at the University of Zurich Web Experimentation List

  Participants were shown pictures of the entities in the given domain and were asked to type a description of the target referent

  In the +LOC condition, participants were told that they could refer to entities using any of their properties (including their location on the screen). In the −LOC condition, they were discouraged from doing so, though not prevented.”

Adapted from Gatt, Belz and Kow 2009

Test set, task

  The test set … consists of 112 items, each with a different domain paired with two human-authored descriptions.

  GRE algorithms take as input a domain (entities and attributes), plus the intended referent; output a set of attributes of the referent, which distinguish it from “confusers”

  TUNA-REG adds realization

Adapted from Gatt, Belz and Kow 2009
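
To make the input/output contract concrete, here is a minimal greedy attribute-selection sketch in the spirit of classic GRE algorithms; it is not one of the TUNA-REG submissions, and the toy domain, entity ids and attribute values are invented.

    # Sketch: pick attributes of the target until every "confuser" is ruled out.
    # The domain representation and preference order are illustrative only.

    def select_attributes(domain, target_id):
        target = domain[target_id]
        confusers = {eid: attrs for eid, attrs in domain.items() if eid != target_id}
        selected = {}
        for attr, value in target.items():            # fixed preference order
            if not confusers:
                break
            ruled_out = [eid for eid, attrs in confusers.items()
                         if attrs.get(attr) != value]
            if ruled_out:                              # attribute helps discriminate
                selected[attr] = value
                for eid in ruled_out:
                    del confusers[eid]
        return selected                                # a distinguishing attribute set

    domain = {
        "e1": {"type": "desk", "colour": "red", "size": "large"},
        "e2": {"type": "desk", "colour": "blue", "size": "large"},
        "e3": {"type": "fan", "colour": "red", "size": "small"},
    }
    print(select_attributes(domain, "e1"))   # {'type': 'desk', 'colour': 'red'}

A realiser would then map the selected attribute set to a word string such as “the red desk”.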

Teams

  IS:
–  extended full-brevity algorithm which uses a nearest-neighbour technique to select the attribute set (AS) most similar to a given writer’s previous ASs

  GRAPH:
–  existing graph-based attribute selection component, which represents a domain as a weighted graph, and uses a cost function for attributes. The team developed a new realiser which uses a set of templates derived from the descriptions in the TUNA corpus.

  NIL-UCM:
–  The three systems submitted by this group use a standard evolutionary algorithm for attribute selection.

  USP:
–  The system submitted by this group, USP-EACH, is a frequency-based greedy attribute selection strategy which takes into account the +/−LOC attribute in the TUNA data.

Adapted from Gatt, Belz and Kow 2009

Evaluation measures - especially automatic, intrinsic metrics

  Accuracy
–  measures the percentage of cases where a system’s output word string was identical to the corresponding description in the corpus.

  String-edit distance (SE)
–  is the classic Levenshtein distance measure and computes the minimal number of insertions, deletions and substitutions required to transform one string into another.

  BLEU-x
–  is an n-gram based string comparison measure, … computes the proportion of word n-grams of length x and less that a system output shares with several reference outputs.

  NIST
–  where BLEU gives equal weight to all n-grams, NIST gives more importance to less frequent n-grams, which are taken to be more informative. … maximum NIST score depends on the size of the test set.

Adapted from Gatt, Belz and Kow 2009
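
For concreteness, a minimal sketch of the first two measures, exact-match accuracy and word-level Levenshtein distance, over invented outputs; BLEU and NIST are better taken from an existing toolkit (e.g. NLTK) than re-implemented here.

    # Sketch: exact-match accuracy and word-level Levenshtein (string-edit) distance.
    # The system outputs and references below are invented.

    def levenshtein(a, b):
        """Minimal number of insertions, deletions and substitutions."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    system_outputs = ["the large red desk", "the fan on the left"]
    references     = ["the large red desk", "the small fan on the left"]

    accuracy = sum(s == r for s, r in zip(system_outputs, references)) / len(references)
    edit_distances = [levenshtein(s.split(), r.split())
                      for s, r in zip(system_outputs, references)]

    print(f"Accuracy: {accuracy:.2f}")          # 0.50
    print(f"Edit distances: {edit_distances}")  # [0, 1]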

Results - ranked by String Edit Distance

Adapted from Gatt, Belz and Kow 2009

Human, intrinsic

  “Intrinsic human evaluation involved descriptions for all 112 test data items from all six submitted systems, as well as from the two sets of human-authored descriptions.”

  8 participants provided 112 judgments each (14 per system)
–  “Q1: How clear is this description? Try to imagine someone who could see the same grid with the same pictures, but didn’t know which of the pictures was the target. How easily would they be able to find it, based on the phrase given?
–  Q2: How fluent is this description? Here your task is to judge how well the phrase reads. Is it good, clear English?”

  Answers on a slider.

Adapted from Gatt, Belz and Kow 2009

Results - ranked by Adequacy

Adapted from Gatt, Belz and Kow 2009

Task-based evaluation

  “In each of their 5 practice trials and 56 real trials, participants were shown a system output (i.e. a WORD-STRING), together with its corresponding domain, displayed as the set of corresponding images on the screen.

  In this experiment the intended referent was not highlighted in the on-screen display, and the participant’s task was to identify the intended referent among the pictures by mouse-clicking on it.”

Adapted from Gatt, Belz and Kow 2009

Results - ordered by identification accuracy

Adapted from Gatt, Belz and Kow 2009

How do the various measures correlate?

  Human intrinsic judgments of adequacy correlate significantly with identification accuracy, and (inversely) with speed.
–  They also correlate significantly with the automatic intrinsic metric of accuracy.

  Accuracy has some relationship to task measures.
–  But otherwise, automatic measures have little to say!

Adapted from Foster 2008

Are corpus-based intrinsic measures OK? Foster 2008

  “When automatically evaluating generated output, the goal is to find metrics that can easily be computed and that can also be shown to correlate with human judgements of quality.

  Many automated generation evaluations measure the similarity between the generated output and a corpus of gold-standard target outputs, often using measures such as precision and recall.

  Such measures of corpus similarity are straightforward to compute and easy to interpret; however, they are not always appropriate for generation systems.

  Several recent studies … have shown that strict corpus-similarity measures tend to favour repetitive generation strategies that do not diverge much, on average, from the corpus data, while human judges often prefer output with more variety.”
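
As a reminder of what such corpus-similarity scores compute, here is a minimal precision/recall sketch; the generated and gold-standard labels are invented (they could be words, facial displays, and so on).

    # Sketch: precision and recall of generated items against a gold standard.
    from collections import Counter

    generated = ["nod", "nod", "brow_raise", "lean_left"]
    gold      = ["nod", "brow_raise", "brow_raise", "narrow_eyes"]

    overlap = sum((Counter(generated) & Counter(gold)).values())  # matched items
    precision = overlap / len(generated)  # how much of the output is in the corpus
    recall    = overlap / len(gold)       # how much of the corpus is reproduced

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.50, 0.50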


COMIC (1/7)

Adapted from Foster 2008

Animating an embodied conversational agent

  “The most common display used by the speaker was a downward nod

  User-preference evaluation had the single largest differential effect on the displays used.
–  When the speaker described features of the design that the user was expected to like, he was relatively more likely to turn to the right and to raise his eyebrows
–  on features that the user was expected to dislike, on the other hand, there was a higher probability of left leaning, lowered eyebrows, and narrowed eyes …”

Adapted from Foster 2008

Strategies for head animation schedules

  The rule-based strategy
–  includes displays only on derivation-tree nodes corresponding to specific tile-design properties: that is, manufacturer and series names, colours, and decorative motifs.
–  for every node associated with a positive evaluation, this strategy selects a right turn and brow raise, etc.

  Data-driven strategies consider all nodes in the syntactic tree for a sentence as possible sites for a facial display.
–  The system considers the set of displays that occurred on all nodes in the corpus with the same syntactic, semantic, and pragmatic context.
–  Majority strategy selects the most common option in all cases;
–  Weighted strategy makes a stochastic choice among all of the options based on the relative frequency (see the sketch below).
•  Suppose the speaker made no motion 80% of the time, a downward nod 15% of the time, and a downward nod with a brow raise the other 5% of the time.
•  For nodes with this context, the majority strategy would always choose no motion, while the weighted strategy would choose no motion with probability 0.8, etc.

Adapted from Foster 2008
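
A minimal sketch of those two data-driven choices, using the 80/15/5 distribution from the example above; the context and counts are otherwise invented.

    # Sketch: majority vs. weighted (stochastic) choice of a facial display
    # for one syntactic/semantic/pragmatic context.
    import random

    display_counts = {"none": 80, "nod": 15, "nod+brow_raise": 5}

    def majority_choice(counts):
        # always pick the single most frequent option
        return max(counts, key=counts.get)

    def weighted_choice(counts):
        # sample in proportion to relative frequency
        options, weights = zip(*counts.items())
        return random.choices(options, weights=weights, k=1)[0]

    print(majority_choice(display_counts))                       # always "none"
    print([weighted_choice(display_counts) for _ in range(5)])   # mostly "none"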

Results - how the three strategies compare with corpus

Adapted from Foster 2008

Results - how the generated data measure up, within themselves

Adapted from Foster 2008

Relating these measures back to human judgments

  We also have available users’ preferences among alternatives (Foster and Oberlander 2007):
–  Original is best; rule-based does rather well
–  Weighted is better than majority

  “None of the corpus-reproduction metrics had any relationship to the users’ preferences, while the number and diversity of displays per sentence appear to have contributed much more strongly to the choices made by the human judges.”

  Do not use similarity to corpus as your gold standard!

Adapted from Koller et al. 2009

Shared tasks in NLG - GIVE - Koller et al. 2009

  "Subjects try to solve a treasure hunt in a virtual 3D world that they have not seen before. The computer has a complete symbolic representation of the virtual world.

  The challenge for the NLG system is to generate, in real time, natural-language instructions that will guide the users to the successful completion of their task."

Adapted from Koller et al. 2009

Generating Instructions in Virtual Environments

  “Embedding the NLG task in a virtual world encourages the participating research teams to consider communication in a situated setting.

  Experiments have shown that human instruction givers make the instruction follower move to a different location in order to use a simpler referring expression (RE) - see Stoia et al. 2006.

  The virtual environments scenario is so open-ended, it - and specifically the instruction-giving task - can potentially be of interest to a wide range of NLG researchers. This is most obvious for research in sentence planning (GRE, aggregation, lexical choice) and realization (the real-time nature of the task imposes high demands on the system’s efficiency).”

Adapted from Koller et al. 2009

The GIVE software architecture

1.  The client, which displays the 3D world to users and allows them to interact with it;

2.  The NLG servers, which generate the natural-language instructions; and

3.  The Matchmaker, which establishes connections between clients and NLG servers.
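
Purely as an illustration of the division of labour (this is not the actual GIVE code, and all names and behaviours below are invented), the three roles might be sketched like this:

    # Illustrative sketch only: client, NLG server and matchmaker roles.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class NLGServer:
        name: str
        def next_instruction(self, world_state: dict) -> str:
            # a real server plans and realises an instruction in real time
            return f"[{self.name}] Walk forward and press the blue button."

    @dataclass
    class Client:
        player_id: str
        def display(self, instruction: str) -> None:
            # a real client renders the 3D world and shows the instruction
            print(f"{self.player_id} sees: {instruction}")

    class Matchmaker:
        def __init__(self, servers: List[NLGServer]):
            self.servers = servers
            self._next = 0
        def connect(self, client: Client) -> NLGServer:
            # e.g. simple round-robin assignment of servers to incoming clients
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            return server

    matchmaker = Matchmaker([NLGServer("Austin"), NLGServer("Twente")])
    client = Client("player-42")
    server = matchmaker.connect(client)
    client.display(server.next_instruction(world_state={}))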

Adapted from Koller et al. 2009

Evaluation World 1

Adapted from Koller et al. 2009

Evaluation World 2

Adapted from Koller et al. 2009

Evaluation World 3

Adapted from Koller et al. 2009

Teams and data gathering

  “Five NLG systems were evaluated in GIVE-1:
1.  one system from the University of Texas at Austin (Austin in the graphics below);
2.  one system from Union College in Schenectady, NY (Union);
3.  one system from the Universidad Complutense de Madrid (Madrid);
4.  two systems from the University of Twente: one serious contribution (Twente) and one more playful one (Warm-Cold).

  Over the course of three months, we collected 1143 valid games.

  A game counted as valid if the game client didn’t crash, the game wasn’t marked as a test game by the developers, and the player completed the tutorial.”
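
A minimal sketch of that validity filter, over hypothetical game records:

    # Sketch: keep only valid games, using the criteria listed above.
    # The record format is hypothetical.
    games = [
        {"id": 1, "client_crashed": False, "is_test_game": False, "tutorial_completed": True},
        {"id": 2, "client_crashed": True,  "is_test_game": False, "tutorial_completed": True},
        {"id": 3, "client_crashed": False, "is_test_game": True,  "tutorial_completed": True},
    ]

    def is_valid(game):
        return (not game["client_crashed"]
                and not game["is_test_game"]
                and game["tutorial_completed"])

    valid_games = [g for g in games if is_valid(g)]
    print(len(valid_games))   # 1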


Adapted from Giuliani et al. (in submission)

JAST - evaluating a Joint Action robot

Adapted from Giuliani et al. (in submission)

Experimented with dialogue strategies and with reference generation

Adapted from Giuliani et al. (in submission)

Task-based evaluation: Subjective and objective results

Adapted from Giuliani et al. (in submission)

Reconnecting task with available intrinsic metrics - PARADISE

  The PARADISE evaluation framework (Walker et al., 2000) explores the relationship between the subjective and objective factors.

  PARADISE uses stepwise multiple linear regression to predict subjective user satisfaction, based on measures representing the performance dimensions of task success, dialogue quality, and dialogue efficiency, resulting in a predictor function.
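
A minimal sketch of the idea with invented per-dialogue data; it fits a plain least-squares predictor over all three measures rather than the full stepwise procedure Walker et al. use.

    # Sketch of the PARADISE idea: fit a linear predictor of user satisfaction
    # from task success, dialogue quality and efficiency measures (invented data).
    import numpy as np

    task_success = np.array([1.0, 0.5, 1.0, 0.0, 1.0, 0.5])   # task success measure
    asr_errors   = np.array([1,   4,   0,   6,   1,   3])     # dialogue quality measure
    n_turns      = np.array([12,  20,  10,  25,  9,   18])    # dialogue efficiency measure
    satisfaction = np.array([4.5, 3.0, 4.8, 1.5, 4.6, 2.8])   # subjective target

    X = np.column_stack([np.ones(len(satisfaction)), task_success, asr_errors, n_turns])
    coef, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)

    pred = X @ coef
    r2 = 1 - ((satisfaction - pred) ** 2).sum() / ((satisfaction - satisfaction.mean()) ** 2).sum()

    names = ["intercept", "task_success", "asr_errors", "n_turns"]
    print("predictor function:", ", ".join(f"{n} = {c:.2f}" for n, c in zip(names, coef)))
    print(f"R^2 on this toy data = {r2:.2f}")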


Summary

  Much work has focussed on automatic, intrinsic evaluation.

  Some metrics (e.g. NIST-5) are related to human, intrinsic evaluations.
–  But they’re still only a surrogate for extrinsic evaluation!

  The temptation to use automatic corpus-based metrics should be resisted - some other automatic metrics may be superior, especially when variation is valued.

  Task-based, extrinsic evaluations are the best, and are not as expensive as has sometimes been claimed.

  PARADISE can allow findings from task-based evaluations to feed back into appropriate engineering choices and the selection of appropriate automatic, intrinsic metrics.


References

  Belz, A. (2009) That's nice ... what can you do with it? Computational Linguistics, 35, 111-118.

  Byron, D., Koller, A., Striegnitz, K., Cassell, J., Dale, R., Moore, J. and Oberlander, J. (2009) Report on the First NLG Challenge on Generating Instructions in Virtual Environments (GIVE). In Proceedings of the 12th European Workshop on Natural Language Generation, Athens, March 2009.

  Foster, M.E. (2008) Automated metrics that agree with human judgements on generated output for an embodied conversational agent. In Proceedings of INLG 2008, Salt Fork, Ohio, June 2008.

  Foster, M.E. and Oberlander, J. (2007) Corpus-based generation of head and eyebrow motion for an embodied conversational agent. International Journal of Language Resources and Evaluation, 41, 305-323.

  Gatt, A., Belz, A. and Kow, E. (2009) The TUNA-REG Challenge 2009: Overview and evaluation results. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG-09).

  Giuliani, M. et al. (in submission) Situated reference in a hybrid human-robot interaction system.

  Koller, A., Striegnitz, K., Byron, D., Cassell, J., Dale, R., Dalzel-Job, S., Oberlander, J. and Moore, J. (2009) Validating the web-based evaluation of NLG systems. In Proceedings of ACL-47, Singapore, August 2009.

  Walker, M., Kamm, C. and Litman, D. (2000) Towards developing general models of usability with PARADISE. Natural Language Engineering, 6, 363-377.