
    Declaration

I hereby declare that this MSc dissertation is of my own composition and that it contains no material previously submitted for the award of any other degree. The work reported in this MSc dissertation has been carried out by myself, except where due acknowledgement is made in the text.

    (Athanasios N. Karasimos)


Acknowledgements

I wish to express my gratitude to all those people without whom this task would have been much harder to accomplish.

Above all, thank you Amy and Colin, for being so patient, helpful, supportive and co-operative, and for being there every time I needed you. You made many things clear and easy with your useful and clever advice. It has been said that the beginning and the end of a dissertation are its supervisors.

Many thanks to all the people who participated in both experiments. Without their participation the evaluation would have been impossible. I appreciate the time you spent on my experiments.

Special thanks to my classmate Behzad, to whom I owe my dissertation's topic and to whom I am grateful for all those useful conversations. Many thanks to Ellen Burk for guiding me out of the statistics labyrinth.

Finally, on the Greek front, thanks to Aggeliki, Efi, Stavroula and Stathis for their care and support throughout the whole year. Additionally, special thanks to Alexander Melengoglou, who offered his valuable knowledge and comments. I could not be here now without the strong support and love of my family. Thank you Anna and Sotiria for your corrections.


To my mother,
my uncle
and especially to my sister,
Helen.


    Abstract

Half of the problem in Natural Language Generation (NLG) is the evaluation of an NLG system. In the last few decades research on evaluation has increased and made some serious steps in this direction. This study describes an evaluation of a multilingual personalized information objects system (M-PIRO), which dynamically generates descriptions for exhibits in a virtual museum exhibition. In the evaluation, learning outcomes of subjects who read two sets of texts about coins and vessels were compared to those of subjects who read the same text sets with a different text structure. The aim was to show that the text type factors, comparison and aggregation, are essential for better performance. Several types of data were collected through post-session tests of factual recall and a questionnaire about the evaluated system. Results showed that performance measures did differ between subjects in the two conditions (presence and absence of the text type factors); additionally, the data analysis revealed that text difficulty and the subjects' impression of learning were also statistically significant. These issues are all considered in order to determine whether the goal of M-PIRO is achieved and to suggest some improvements to it. The study concludes with an outline of future work.


    Contents

Declaration.......................................................................................................................ii
Acknowledgements.........................................................................................................iii
Abstract.............................................................................................................................v
Contents ..........................................................................................................................vi
Index of tables, graphs and pictures ..........................................................................viii

    1. Introduction .................................................................................................................1

    1.1. Natural Language Generation Systems ...............................................................1

    1.2. Evaluating Natural Language Generation Systems.............................................2

    1.3. Purpose and Outline of the study.........................................................................5

    2. The M-PIRO NLG System............................................................................................6

2.1. The ILEX NLG System...........................................................................................6

    2.1.1. The ILEX Dynamic Hypertext System ............................................................6

    2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version....................7

2.2. The M-PIRO System...............................................................................................9

    2.2.1. The M-PIRO Domain and Generation Architecture ........................................9

    2.2.2. The M-PIRO Authoring Tool .........................................................................12

3. Aggregation and Comparison in M-PIRO ..................................................................14

    3.1. Aggregation........................................................................................................14

    3.1.1. What is aggregation?....................................................................................14

    3.1.2. The implementation of aggregation in the M-PIRO System..........................15

    3.2. Comparison........................................................................................................17

    3.2.1. What is comparison?....................................................................................17

3.2.2. The implementation of comparison in the M-PIRO System................................19

    4. The Pilot Experiment................................................................................................21

    4.1. Introduction........................................................................................................21

    4.2. Method ...............................................................................................................23

    4.2.1. Designing and choosing the exhibit texts ....................................................23

    4.2.2. Subjects ........................................................................................................26

    4.2.3. Procedure......................................................................................................26

    4.3. Results and Discussion.......................................................................................28


    5. The Main Experiment...............................................................................................32

    5.1. Introduction........................................................................................................32

    5.2. Method ...............................................................................................................32

    5.2.1. Designing and choosing the exhibit texts ....................................................32

    5.2.2. Subjects ........................................................................................................35

    5.2.3 Procedure.......................................................................................................36

    5.3. Results ................................................................................................................37

    6. General Discussion ....................................................................................................48

    6.1. The results of both experiments .........................................................................48

    6.1.1. Interpreting the results..................................................................................48

    6.1.2. Ordering effect: a possible flaw in experimental design..............................51

    6.2. Suggestions and improvements ..........................................................................53

    6.3. Future work........................................................................................................56

    6.4. Conclusion .........................................................................................................58

    Bibliography ..................................................................................................................64

Appendix I: The M-PIRO generated texts for the Main Experiment.............................64

    Coins Text Sequence [English] with comparison and aggregation..........................64

    Coins Text Sequence [English] without comparison and aggregation.....................67

    Vessels Text Sequence [English] with comparison and aggregation .......................70

    Vessels Text Sequence [English] without comparison and aggregation ..................73

    Coins Text Sequence [Greek] with comparison and aggregation ............................77

    Coins Text Sequence [Greek] without comparison and aggregation .......................80

    Vessels Text Sequence [Greek] with comparison and aggregation..........................83

    Vessels Text Sequence [Greek] without comparison and aggregation.....................87

    Appendix II: What did you learn from the virtual exhibition? ....................................... 91

    The Questions for the Coins Text Sequence [English] .............................................91

    The Questions for the Vessels Text Sequence [English] ...........................................93

    The Questions for the Coins Text Sequence [Greek] ................................................95

    The Questions for the Vessels Text Sequence [Greek]..............................................97

    Questionnaire..........................................................................................................100

Appendix III: The Statistical guide ...............................................................................101


Index of tables, graphs and pictures

Table 2.1. The M-PIRO pipeline generation architecture 9

    Table 2.2. Part of M-PIRO entity hierarchy organization of types and levels 10

Table 3.1. The relationships between two representations 17

Table 3.2. Short and long conjunctions for similarity and contrast 18

Table 3.3. The comparison ordering based on information importance for the M-PIRO system 18

Table 4.1. Part of the lekythos text generated by M-PIRO with and without comparison and aggregation 23

Table 4.2. Some questions from the factual recall test (all the questions are in Appendix II). The correct answers are in bold type 24

Table 4.3. The results of the participants in both versions of the pilot experiment 26

Graph 4.4. The performance of the participants based on the option of comparison and aggregation 27

Graph 4.5. The performance of the participants based on the group factor 28

Table 4.6. The questionnaire results of the pilot experiment 29

Picture 5.1. A web page from the vessels sequence that contains the Spherical Corinthian Aryballos 31

    Table 5.2. Two examples of the vessels texts with more complicated comparisons 33

    Table 5.3. The results of the participants in both languages of the main experiment 36

    Graph 5.4. The score performance per person depending on text type factors (Greek Version) 37

    Graph 5.5. The score performance per person depending on text type factors (English Version) 38

    Graph 5.6. The results per participant depending on the group factor [Greek version] 39

    Graph 5.7. The results per participant depending on the group factor [English version] 39

Graph 5.8. The performance for all the participants depending on the genre factor [both versions] 40

    Graph 5.9. Box plots for performance depending on genre and text type factors [English version] 41

    Graph 5.10. Box plots for performance depending on genre and text type factors [Greek version] 41

    Graph 5.11. The performance of Greek participants depending on text difficulty (coins vs. vessels) 42

Graph 5.12. The performance of English participants depending on text difficulty (coins vs. vessels) 42

    Graph 5.13. The questionnaire results summary of the English participants for both groups 43

    Graph 5.14. The questionnaire results summary of the Greek participants for both groups 43

    Graph 6.1. Box plots for the performance of the participants depending on the language factor 49


    Chapter 1

    Introduction

1.1. Natural Language Generation Systems

Natural Language Generation (NLG) is the assembly of a text word by word using knowledge of morphology, syntax, semantics and text structure (O'Donnell et al., 2001). As a branch of computational linguistics, cognitive science and artificial intelligence, NLG is the process of constructing natural language outputs from non-linguistic inputs (symbolic or numeric), in particular of mapping some underlying representation of information to a meaningful, understandable and specific presentation of that information in spoken and/or textual linguistic form (in human languages). A complete NLG system has to take many decisions to produce an appropriate output. The goal of NLG can be characterized as the inverse of that of Natural Language Understanding (NLU): in NLU the concern is to map from a text to some underlying representation of its meaning (Mellish & Dale, 1998; Jurafsky & Martin, 2000: 764-794; O'Donnell et al., 2001). The generation process in an NLG system typically consists of the following five main stages:

    Content determination, in which the system decides what information should be

    included as appropriate in the text, and what should be omitted; this selection depends

    upon a variety of contextual factors and the particular user to whom it is to be directed.

    Document structuring, in which it is decided how the text should be organized and

    structured; this means that (for the information already included) the NLG system has to

    choose the appropriate structure to convey the information.

Lexical selection, in which the system chooses the particular words, word types and phrases that are required to communicate the specified information; it may also be possible to vary the words used for stylistic effect.


Sentence structure1, which involves the processes of aggregation, in which the system must apportion the selected content into phrase, clause and sentence-sized chunks and often, in the interests of fluency, place several pieces of information into one sentence, and referring expression generation, in which the system decides which properties of an entity to use when referring to it.

Finally, surface realization, in which the system maps the underlying text specification into natural text made of grammatically correct sentences.
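To make this division of labour concrete, the following minimal Python sketch chains the five stages as plain functions. The fact representation, the selection rule and the wording are assumptions made purely for illustration; they do not reproduce the internals of any particular system.

```python
# A minimal, illustrative NLG pipeline following the five stages listed above.
# The data structures and rules are toy assumptions, not any real system's design.

def content_determination(facts, user_interests):
    # decide what to say: keep only facts the (hypothetical) user cares about
    return [f for f in facts if f["attribute"] in user_interests]

def document_structuring(facts):
    # decide the order: classify the object first, then give its attributes
    return sorted(facts, key=lambda f: 0 if f["attribute"] == "type" else 1)

def lexical_selection(fact):
    # choose words: map abstract attributes onto verb phrases
    verbs = {"type": "is", "material": "is made of", "period": "was created during"}
    return verbs[fact["attribute"]], fact["value"]

def sentence_structure(facts):
    # build clause-sized chunks; a real system would also aggregate them here
    return [lexical_selection(f) for f in facts]

def surface_realisation(entity, clauses):
    # turn the clause specifications into grammatical sentences
    return " ".join(f"This {entity} {verb} {value}." for verb, value in clauses)

facts = [
    {"attribute": "material", "value": "clay"},
    {"attribute": "type", "value": "a stamnos"},
    {"attribute": "period", "value": "the classical period"},
]
selected = content_determination(facts, {"type", "material", "period"})
ordered = document_structuring(selected)
clauses = sentence_structure(ordered)
print(surface_realisation("exhibit", clauses))
# This exhibit is a stamnos. This exhibit is made of clay.
# This exhibit was created during the classical period.
```

The same pipeline shape, with much richer intermediate representations, reappears in the M-PIRO architecture shown in Table 2.1.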

The M-PIRO NLG system, which will be discussed in Chapter 2 and evaluated here, is a dynamic hypertext2 natural language generation system.

    1.2. Evaluating Natural Language Generation Systems

Over the last 15 years, the level of interest and concern expressed by natural language processing (NLP) researchers with regard to evaluation has increased substantially. In early NLG work, the quality of a system's output was assessed by the system authors themselves, who underestimated the value of evaluation for the improvement of an NLG system. In contrast, it has nowadays become widely accepted in the language processing community that NLP researchers should appreciate the evaluation of a system and pay attention to its results, since evaluation plays an essential role in the development of NLG systems, techniques and strategies.

Mellish and Dale (1998: 349) claim that NLG is exactly half of the problem of natural language processing work; the other half is the evaluation of a system. According to them, the first serious work in the field of evaluation took place in 1990, at a workshop held on the theme of the evaluation of natural language generation systems, and in the papers of Meteer and McDonald (1991) and Moore (1991). The main problem which they tried to address is the distinction between the evaluation of a system and the evaluation of its underlying theories. Dealing with these and some other essential

1 For some researchers (Reiter & Dale, 2000, Building Natural Language Generation Systems; Melengoglou, 2002) lexical selection, aggregation and referring expression generation are part of microplanning.
2 Dynamic hypertext refers to an NLG system which creates its output dynamically at run-time, when the user requests it; such an output text is generated on demand, not pre-written by a human author.


problems, the empirical work in this field increased noticeably, building on the above papers.

Evaluation can have many objectives and can consider several different dimensions of an NLG system or theory. Sometimes evaluation objectives combine aspects of a system and its underlying theory, making the task harder. Mellish and Dale (1998) suggest that the evaluation of NLG techniques can be broken down into three main categories. Evaluating properties of the theory is the assessment of the characteristics of some theory underlying an NLG system or a part of it; the implementation of this theory, e.g. Rhetorical Structure Theory, helps us to characterize the theory as appropriate or not for the system. Evaluating properties of the system is the assessment of specific characteristics of some NLG system; it might be a comparison of two NLG systems or their algorithms, of the performance of an NLG system in two different versions, or of the output of the generator with the characteristics of a corpus of target texts. Finally, applications potential is the evaluation of the potential utility of an NLG system in some environment, for instance whether its use provides a better solution than some other approach (Mellish & Dale, 1998: 353-354).

Previous approaches to the evaluation of NLG theories are very few. The main problem is that a good NLG system is based on a theory; nevertheless, during its construction several practical problems must be solved and, therefore, the solutions may be unconnected to the underlying theory. There have been some evaluations of grammars based on particular theories, such as Rhetorical Structure Theory (Mann & Thompson, 1988). Robin tested his revision-based theory with a natural corpus for the domain of baseball summaries. As Mellish and Dale (1998: 355) report, this kind of evaluation is dangerous, since most reported work on NLG evaluates its theories indirectly through the systems that implement them.

In contrast to the evaluation of NLG theories, the question "how good is my NLG system?" is much easier to answer. There are different kinds of system aspects that can be evaluated. Accuracy evaluation means the assessment of the relationship between input and output, i.e. whether the generated text conveys the desired meaning to the reader (Jordan et al., 1993). Fluency and intelligibility evaluation concerns the quality of the generated text and includes notions such as syntactic correctness, stylistic appropriateness,


organization and coherence. Despite the difficulty of measuring these notions, Minnis (1993) made some proposals for evaluation. There are quite a few evaluations in this field, such as Bangalore et al. (2000), who evaluated the FERGUS system and found that the understandability and quality of its output were correlated with each other3. Finally, a task evaluation involves observing how well a task is performed by using the NLG system. Usually, task evaluation is used for the evaluation of Machine Translation (MT) systems, as in the IDAS assessment (Levine & Mellish, 1995); however, it has also been used for other kinds of NLG systems, such as PEBA-II (Milosavljevic & Oberlander, 1998), ILEX4 (Cox et al., 1999) and AMVF (Carenini and Moore, 2000).

Finally, there are some issues and general problems which arise when evaluating an NLG system. Firstly, the major problem in evaluating an NLG system is that of assessing its output, and the question arises of what the output should be. A fluency and intelligibility evaluation can deal with this issue, but it lacks objective criteria, whereas the results of a task evaluation only indirectly reflect the properties of the system. Secondly, it is necessary to evaluate measurable attributes of the performance of the system, and thirdly, these attributes must be compared with something else, otherwise it is hard to say that something is good or bad, easy or difficult, acceptable or not. Additionally, it is essential to obtain adequate test data for the evaluation; an evaluation without sufficient data will unsurprisingly fail. Finally, the human subjects may not agree with each other, and it would be unwise to take their raw judgements into account without handling this disagreement; therefore the authors should guide5 them to give specific, objective and clear opinions rather than vague criticisms and thoughts.

3 They evaluated two different versions of FERGUS (Flexible Empiricist/Rationalist Generation Using Syntax) using evaluation metrics (accuracy) which are useful as relative quantitative assessments of different models.
4 For more details see section 2.1.2 of the second chapter.
5 However, the subjects should not be guided towards what the authors desire, for instance made to say what the authors want to hear.


    1.3. Purpose and Outline of the study

The purpose of this dissertation is to present and describe the evaluation of the M-PIRO NLG system and to draw useful conclusions about further improvement of the system and the future development of NLG systems.

The hypotheses of this evaluation are the following: firstly, we expect that a text with comparison and aggregation will help the subjects perform better and learn more. Secondly, performance will differ depending on the difficulty of a text. Thirdly, the subjects will characterize a text with comparison and aggregation as more fluent and natural than a text without these factors. Finally, they will feel more comfortable and more certain that they learn more from a text with the above factors.

The remainder of this dissertation is organized as follows:

Chapter 2 provides an overview of the ILEX NLG system (2.1.1) and its evaluation (2.1.2). It then presents the M-PIRO NLG system, describing its domain and generation architecture (2.2.1) and its authoring tool (2.2.2).

Chapter 3 examines both comparison and aggregation in the M-PIRO system. It also provides a literature background and a description of the implementation of these factors in the current system.

    Chapter 4 gives a description of the pilot experiment that preceded the main

    evaluation experiment. After the presentation of the main purpose (4.1), the method is

    illustrated (4.2) and the results are presented and analyzed (4.3).

Chapter 5 presents the design of the main experiment of the evaluation. As the purpose and the design were largely the same as those of the pilot (5.1-5.2), the emphasis is on the analysis of the results (5.3).

Chapter 6 closes the dissertation with a general discussion of the results of this study (6.1.1) and of an ordering effect noticed in the experimental design (6.1.2). Furthermore, there are some suggestions and improvements (6.2) and future work (6.3) concerning the M-PIRO system.


    Chapter 2

    The M-PIRO NLG System

    2.1. The ILEX NLG System

2.1.1. The ILEX Dynamic Hypertext System

ILEX6 (the Intelligent Labelling Explorer) is a dynamic hypertext generation system developed at the University of Edinburgh (1997-2000) in collaboration with the National Museum of Scotland. Its task was to generate labels (descriptions) for the objects of a virtual museum exhibition in several languages, using a single knowledge database storing information in a language-neutral form. ILEX generated labels which were personalized, as they were tailored opportunistically depending on the user's interests and the user's interaction history with the system.

The user model of ILEX provides label generation for the categories of child, adult and expert. It models users in terms of their relation to information, such as the interest, the importance and the level of assimilation of the information, and provides values for each predicate type. Since the authors cannot predict the exact nature of the user, ILEX allows users to control the displayed generated text for the objects directly, and gives them the freedom to browse the collection of objects in any order; based on the authors' pre-assumptions about the values of their relation to information, the system responds to the user's requests and adapts the structure of its labels to the user. Therefore, ILEX's aim is to produce exhibit descriptions that a real curator might give, so that visitors feel as if they are in a real museum with a guide.

The opportunistic text tailoring is achieved in ILEX via the use of referring expressions, comparison expressions, nominal anaphora and approaches derived from Rhetorical Structure Theory. Built on Rhetorical Structure Theory7, the aggregation is organized into nucleus-satellite relations (Like most Arts and Crafts style jewels, it has an elaborate design) and multi-nuclei relations (This jewel is a necklace and was

6 For an extended discussion of this system, see O'Donnell et al. (2001); see also Milosavljevic and Oberlander (1998), Cox et al. (1999), O'Donnell et al. (2000).
7 For more details and an extensive discussion of the theory, see Mann and Thompson (1988).


made by a British designer called Edward Spencer). For the comparison expressions it uses the user's navigation log; it introduces an already known concept (This necklace is also in Arts and Crafts style), makes simple comparisons with the previously visited exhibit (For instance the previous item uses oval-shaped stones (in other words it features rounded stones). However this necklace does not feature rounded stones) and steers clear of repeating information which has already been mentioned (Cox et al., 1999; Melengoglou, 2002).

Nevertheless, ILEX is not without problems and flaws (Isard et al., 2003). Much of its information was captured by interviewing a curator and then hand-coding taxonomic information and other assertions. The authors used typed text strings literally rather than knowledge-base objects stored in some language-neutral form; therefore it was hard to present information in any language other than the original input language. Related to this, there are some problems with the linguistic grammars (the English and Spanish ones work well, but for the other languages the grammars would have to be rebuilt or reconstructed). Furthermore, the same-level values of the adult, expert and child types do not essentially change the text structure. Finally, the system is less modular than desirable.

2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version

The Cox, O'Donnell and Oberlander (1999) paper describes an evaluation of ILEX, the intelligent labelling explorer; in that evaluation, learning outcomes of subjects who used the dynamic ILEX system were analysed and contrasted with those of subjects who used a static version of the system. Their goal was to attempt to isolate learning effects specifically due to dynamic hypertext generation (Cox et al., 1999). In previous work (Levine & Mellish, 1995) the IDAS system, which used natural language generation techniques for the automatic generation of hypertexts, was evaluated; there, the users' task was to retrieve relevant information to answer specific questions. However, the authors did not use any comparison group for their assessment, and built their results and discussion on the users' page visits and navigation logs.

Since Cox, O'Donnell and Oberlander's aim was not to compare their dynamic hypermedia with a traditional media system or to observe aspects of hypermedia such as


configurations of links, they used two different versions of ILEX: a traditionally configured version with static pages and no user modelling against the original intelligent system with dynamically generated text containing referring expressions and comparisons based on a user model (Cox et al., 1999). Both versions contained the same

six jewels. Three instruments were devised for use in the evaluation: 1. a recall test of factual knowledge about jewels in the exhibition, 2. a curator task8, and 3. a usability questionnaire. Twenty subjects were allocated to the dynamic ILEX system and ten subjects to the static version of the system. The results were quite interesting. Both groups scored similarly on the two tests in performance terms; however, processing of the log data revealed that the dynamic system users made more visits to the case of jewels than the static subjects, and made proportionately more navigation-related button clicks than their static ILEX counterparts (Cox et al., 1999: 7). Based on these results they claim that the users did not benefit from the properties of the dynamic version, since they did not score better; additionally, the learning pattern and performance varied, and learning was achieved in different ways depending on the log data.

Nevertheless, there were some flaws in this experiment, since the same number of users was not used for both versions. Moreover, the subjects were not exposed to the same experimental conditions, since they used different versions; therefore a main effect for groups9 could essentially have occurred, and it was not mentioned whether it existed or not. Furthermore, the required time was too long for only six exhibits and might potentially have affected the performance results, since the time conditions were not realistic (in a normal case no one would spend ninety minutes on a twelve-paragraph text about six exhibits). They claimed that any learning effect is almost entirely dependent on the navigation route, just as Levine & Mellish (1993) did; I maintain that the learning effects are beyond any log and navigation route, since there are many factors that learning depends on (Mellish & Dale, 1998; Jurafsky & Martin, 2000).

Other experiments have been carried out to test different properties of an NLG system and to evaluate the system on that basis. Properties of the underlying theory, properties of the system and the applications potential have all been evaluated. Nonetheless, the

8 This task consisted of presentations of jewels unseen in the exhibition; the subjects were asked to examine each exhibit and classify it by answering multiple-choice questions.
9 For more about statistics terminology see Appendix III.


lack of a specific evaluation theory and the disagreement over subjective qualities like fluency, readability, good style and appropriateness constitute an essential drawback of the evaluation of an NLG system. Some researchers have failed to properly evaluate systems because they used immeasurable phenomena or subjective criteria. According to Mellish & Dale's (1998) approach to evaluation theory, there are some important issues and problems that must be solved in an evaluation design; these will be discussed in the pilot experiment section.

    2.2. The M-PIRO System

The M-PIRO10 NLG system (Multilingual Personalized Information Objects) is a more recent project of the Information Societies Programme of the European Union that also generates descriptions for virtual museum objects (exhibits). It is a descendant of the ILEX system and has focused on developing language-engineering technology for personalized information objects, specifically on multilingual information delivery (Isard et al., 2003). It incorporates high-quality speech output, an authoring tool, improved user modelling and a modular core generation domain (Androutsopoulos et al., 2002).

    2.2.1. The M-PIRO Domain and Generation Architecture

domain model (domain database + domain semantics)
   → CONTENT SELECTION: selection of facts to convey to the user
   → information to be conveyed
   → TEXT PLANNING: ordering of facts + document structure (RST)
   → text structure
   → MICRO-PLANNING: lexicalisation + referring expression generation
   → document specifications
   → SURFACE REALISATION: text generation
   → exhibit description

Table 2.1: The M-PIRO pipeline generation architecture

10 The project's consortium consisted of the University of Edinburgh (Scotland), ITC-irst (Italy), NCSR Demokritos (Greece), the University of Athens (Greece) and the Foundation of the Hellenic World (Greece).


Table 2.1 outlines the stages of the M-PIRO NLG architecture (Androutsopoulos et al., 2002) and the process of generating a text.

    Domain authoring

The knowledge base contains all the necessary information about entities and relationships; entities can be abstract or concrete. The major task is the hierarchy of entity types, i.e. entities and sub-entities (e.g. exhibit and material; statue and marble), which can contain further levels of an entity; for example, the entity statue has complex statue, kouros, imperial portrait, etc. Similarly, the relations between entities are expressed using fields: the domain author can define fields for each entity, which are then inherited by all entity types below it in the hierarchy (Isard et al., 2003). For example, creation-period applies to statue and to all its descendants, such as complex statue, kouros and imperial portrait, and it must be filled by an entity of the historical-period group (archaic, classical, hellenistic and roman).

Basic Type:             copy, a-location, exhibit
Entity Type (exhibit):  statue, coin, jewel, relief
Entity Level (statue):  complex-statue, kouros, portrait, imperial portrait
Entity (instances):     exhibit7, exhibit17

Table 2.2: Part of the M-PIRO entity hierarchy organization of types and levels
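As a rough illustration of the field-inheritance mechanism described above, here is a minimal Python sketch; the type and field names follow the running example and Table 2.2, but the class-based representation itself is a hypothetical simplification, not M-PIRO's actual data model.

```python
# Hypothetical sketch of an entity-type hierarchy with inherited fields.
# Type and field names follow the running example; the representation is invented.

class EntityType:
    def __init__(self, name, parent=None, fields=None):
        self.name = name
        self.parent = parent
        self.own_fields = fields or {}   # field name -> group of allowed fillers

    def fields(self):
        # every field defined on a type is inherited by all types below it
        inherited = self.parent.fields() if self.parent else {}
        return {**inherited, **self.own_fields}

exhibit = EntityType("exhibit", fields={"made-of": "material"})
statue = EntityType("statue", parent=exhibit,
                    fields={"creation-period": "historical-period"})
kouros = EntityType("kouros", parent=statue)

print(kouros.fields())
# {'made-of': 'material', 'creation-period': 'historical-period'}
```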

    Microplanning expressions

Each field has associated information that specifies how the relationship it represents can be expressed as a sentence. As mentioned in the introduction, microplanning involves lexical selection, aggregation and referring expression generation; the specifications for these are known as microplanning expressions. Either clause plans are created, in which a verb is selected using a pull-down menu of verbs, or templates, which build the expression using strings and references


to the two entities whose relationship is expressed by the field. Furthermore, microplanning is populated from the language-dependent lexicon, which contains entries for nouns and verbs for lexical selection. Articles and prepositions are domain-independent and are therefore stored as a separate resource.
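The two kinds of microplanning expression mentioned above can be pictured roughly as follows. The field, entities and wording come from the running examples, but the functions are a hypothetical sketch rather than the authoring tool's actual format.

```python
# Hypothetical sketch of the two kinds of microplanning expression for the
# made-of field: a clause plan built around a chosen verb, and a template
# built from literal strings plus references to the two related entities.

def clause_plan(owner, filler):
    # in the authoring tool the verb is picked from a pull-down menu
    return {"verb": "make", "voice": "passive", "subject": owner, "object": filler}

def template(owner, filler):
    return "".join(["This ", owner, " is made of ", filler, "."])

print(clause_plan("exhibit7", "marble"))
print(template("exhibit", "marble"))   # This exhibit is made of marble.
```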

    Three languages

M-PIRO can generate text in three languages: English, Greek and Italian. The grammar for English is based mainly on the ILEX grammar; however, the grammar for Greek was constructed from scratch using the English one as a base, and the Italian grammar was based on the ILEX Spanish one. As already mentioned, the lexicon now has a larger domain-independent part and a full inflection system, especially for Greek and Italian, due to their rich morphological systems. Moreover, M-PIRO supports high-quality speech output (Festival11 for English and Italian and DEMOSTHeNES12 for Greek).

    User Modeling

One major advantage of the system is that the user modelling information is stored separately from the domain and linguistic resources, in a personalization server. Each time the user interacts with the system, he gives his personal details; thus, the system always has access to the user's personal profile and to information on the history of the user's interactions with the collection. User types for adults, experts and children were also defined by the authors. Each entity type field has values for interest, importance and repetitions for each user type. The repetitions value is used to calculate the assimilation score and rate per user (a low repetition rate for experts, a high one for children). The microplanning expressions and the lexicon entries depend on, and change with, the user type. There is a comparison module based on a list of important information and there is an aggregation module that uses techniques such as simple conjunction,

11 Developed by the University of Edinburgh. For more details see the official web pages of Festival (http://www.cstr.ed.ac.uk/projects/festival/) and M-PIRO (http://www.ltg.ed.ac.uk/mpiro, http://mpiro.ime.gr).
12 Developed by the University of Athens. For more details see the official web pages of DEMOSTHeNES (http://laertis.di.uoa.gr/speech/synthesis/demosthenes/) and M-PIRO.


relative clauses and syntactic embedding to join single facts together; these two factors will be discussed in more detail in chapter 3.
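The role of the per-user-type values can be pictured with the small sketch below; the numbers and the selection rule are invented for illustration and are not M-PIRO's actual settings.

```python
# Invented illustration of per-user-type field parameters and the repetition
# threshold that drives the assimilation score; the values are not M-PIRO's.

USER_TYPES = {
    "child":  {"interest": 2, "importance": 1, "repetitions": 3},
    "adult":  {"interest": 2, "importance": 2, "repetitions": 2},
    "expert": {"interest": 3, "importance": 3, "repetitions": 1},
}

def should_repeat(times_already_conveyed, user_type):
    # a fact keeps being mentioned until the repetition threshold is reached,
    # after which it counts as assimilated for this user and is dropped
    return times_already_conveyed < USER_TYPES[user_type]["repetitions"]

print(should_repeat(1, "expert"))  # False: experts assimilate quickly
print(should_repeat(1, "child"))   # True: children tolerate more repetition
```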

    Modularity

M-PIRO's system architecture is significantly more modular than that of its predecessor ILEX, which lacks modularity. In particular, the linguistic resources, the database and the user-modelling subsystem are now separate from the components that perform the natural language generation and speech synthesis, giving the system a satisfactory level of modularity; of course, it is not possible to move to a new application domain without specifying both what will be talked about and what vocabulary will be used when talking about it.

    2.2.2. The M-PIRO Authoring Tool

According to the authors (Androutsopoulos et al., 2002; Spiliotopoulos et al., 2002; Isard et al., 2003), compared with domain authoring this is a simpler process of defining specific instances of entities and filling the entities' fields with the corresponding information. The authoring tool previews the output of the generation system on the basis of the current state of the database. It is tailored to allow domain experts to manipulate not only the contents of the database, but also its structure and the domain-dependent linguistic resources that control how the information in the database is rendered in natural language. The difficult part of the authoring is done by an expert, e.g. a museum curator, who designs the hierarchy and adds the basic types, entity types, microplans, etc. The easier part of the authoring, which is what I have already referred to, is done by non-experts, who add particular entities. So an expert will add the entity type amphora, but a non-expert can then add lots of particular amphora entities, e.g. exhibit1, exhibit18, and use the microplans which the expert has built to add information about each particular entity.

Domain and exhibit authoring can be used together to check information (given by a designer or curator) and create a text. For example, domain authoring can define the basic types material and exhibit, a relation made-of, a specific material [marble], an entity type statue that is a subtype of exhibit and an entity type imperial portrait that is a


subtype of statue [portrait of Octavius Augustus]; the generated text will be "This exhibit is an imperial portrait. It is made of marble." The tool is designed to be used by people, such as museum curators, who have no experience in language technology [one of the basic usability rules of Nielsen (2000)]. Finally, they can create the types of visitors (adult, expert, child) and attach fields and microplanning expressions to their properties.


    Chapter 3

Aggregation and Comparison in M-PIRO

    3.1. Aggregation

3.1.1. What is aggregation?13

In a Natural Language Generation system, aggregation is part of the microplanning stage, where texts are composed of verb-based, clause-sized propositions. These propositions are likely to contain repetitions and redundancies, which make the text seem boring, non-fluent and unnatural to human readers (Melengoglou, 2002). This problem is overcome by the use of aggregation. The question "what is aggregation?" has been answered in varied ways in the literature, and many researchers have tried to define it. As a recap of these attempts at a definition, aggregation can be considered to be the generation of fluent, more readable and less boring text by eliminating redundancy and semantically combining text components at any level, in order to achieve a more concise and coherent text. The effect of aggregation can be seen very clearly in the following example from Reape and Mellish (1999), in which two propositions with obvious common features are combined to produce a single sentence:

    1. The car is here

    2. The car is blue

    [1+2]. The blue car is here

The goal of aggregation is to produce a text which is concise, coherent, cohesive and fluent; indeed, the goals that aggregation tries to achieve cover most of the territory of the default communicative goals of NLG systems in general. Linguistic theories separate aggregation into several types, such as conceptual aggregation (reducing the number of propositions in the text while increasing the complexity of conceptual role values), discourse aggregation (any operation that improves the discourse structure), semantic aggregation (the combination of two semantic entities into

13 For extended discussion, see Reape & Mellish (1999).


one, or semantic grouping and logical transformations), syntactic aggregation (grouping subjects or predicates, the most common form), lexical aggregation (lexicalization, or the choice of the particular lexemes that realize lexical predicates and structures) and referential aggregation (referring expressions).

The input to aggregation is usually a tree containing the ordering of the facts and the dependencies that relate them. In this tree, aggregation detects shared components among neighbouring text-tree nodes and combines them in an attempt to remove redundancies and repetitions from the resulting text. The most common type of aggregation is simple conjunction or disjunction, which joins facts together by means of coordination or contrast, using connectives like and, but and or. Another common type is syntactic embedding, which subordinates a clause to a proposition surrounded by commas (Alexander, the king of Macedonia, conquered the Persians). Generally, according to Melengoglou et al. (2003), the choice of particular aggregation operations seems to be highly domain-specific.
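A toy illustration of the shared-component detection just described; the string-based fact representation is an assumption made for this sketch only.

```python
# Toy sketch of simple-conjunction aggregation: two clause-sized facts that
# share a subject keep the subject once and coordinate their predicates.
# The (subject, predicate) representation is invented for illustration.

def aggregate(fact_a, fact_b):
    subj_a, pred_a = fact_a
    subj_b, pred_b = fact_b
    if subj_a == subj_b:
        return f"{subj_a} {pred_a} and {pred_b}."
    return f"{subj_a} {pred_a}. {subj_b} {pred_b}."

print(aggregate(("This coin", "originates from Patras"),
                ("This coin", "is now exhibited in the Numismatic Museum of Athens")))
# This coin originates from Patras and is now exhibited in the Numismatic Museum of Athens.
```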

3.1.2. The implementation of aggregation in the M-PIRO System

Aggregation in the M-PIRO project receives as input a sequence of semantic representations of facts, which are either classifying facts or facts presenting attributes. The facts are connected by rhetorical relations, which have to be made explicit to the aggregation module, as otherwise it is very probable that the intended meaning will be lost in the generated text; conversely, aggregation will never affect the meaning of a text when the relations are specified. The aggregation module can combine a classifying fact with a fact presenting an attribute into a complex sentence, and two facts presenting attributes into a compound sentence.

Melengoglou (2002) built the M-PIRO aggregation module rules, which are capable of producing a text that is more concise and readable. They are grouped into three major combinations. Aggregating identity-attribute pairs includes type-comma (This exhibit is a drachma, created during the classical period), type-qualifier (This portrait depicts Alexander the Great, a king from Macedonia) and type-semicolon (In the background you can see rows of columns, temples and other buildings; in the foreground there is a ship and a statue of a male form). Aggregating attribute pairs


includes simple conjunction (This coin originates from Patras and it is now exhibited in the Numismatic Museum of Athens) and shared subject-predicate (This stamnos was decorated by the painter of Dinos with the red figure technique and is made of clay). Finally, aggregating nucleus-satellite pairs includes syntactic embedding (This is another relief, a tomb stele).

There is a hierarchy among the rules, since it is essential that the system selects the proper aggregation rule to apply and rejects the others. Some rules have higher priority because the resulting text is less redundant and more readable and they help to clarify the meaning. According to these priorities, syntactic embedding is the most important rule in the set, while simple conjunction is the least significant one; additionally, there is no clear preference between type-comma and shared subject-predicate. After the choice of the appropriate rule, its parameters must be specified, such as the maximum number of facts the system should convey in a sentence and the verification of sentence quality.

When applying the rules in sequence, there are two main steps in the aggregation algorithm. The first concerns the important user-modelling parameter of the maximum facts per sentence, which determines the number of facts that Exprimo should convey to a particular type of user in each sentence; short sentences may be suitable for small children, but for adults longer sentences may be better suited, since the use of short sentences becomes irritating and boring to them (Melengoglou et al., 2003). Moreover, conflicts in the application of two adjacent aggregation rules must be eliminated; the system should therefore correctly adapt the right aggregation rule to the new linguistic structure and give up the other. This choice is necessary for sentence quality, since there are two kinds of restrictions: user-modelling restrictions and text quality restrictions. For all these restrictions, Melengoglou's module had some suggestions and proposals that solved the problems and the conflicts. The following sentences illustrate the effect of different values of the max facts per sentence parameter on four propositions generated by the M-PIRO system.

    Max facts = 1:

    This exhibit is a stamnos. It was decorated by the painter of Dinos. It was

    painted with the red figure technique. It is made of clay


    Max facts = 2:

    This exhibit is a stamnos, decorated by the painter of Dinos. It was painted with

    the red figure technique and is made of clay

    Max facts = 3:

This exhibit is a stamnos, decorated by the painter of Dinos with the red figure technique. It is made of clay

    Max facts = 4:

    This exhibit is a stamnos; it was decorated by the painter of Dinos with the red

    figure technique and is made of clay

Table 3.1: A set of propositions generated by M-PIRO with different values of the max facts per sentence parameter
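A simplified way to picture the max facts per sentence parameter is to chunk the ordered facts into groups of at most that size before realisation, as in the sketch below; real aggregation also chooses among the prioritised rules and checks sentence quality, which this sketch deliberately ignores.

```python
# Simplified sketch of the "max facts per sentence" parameter: facts are
# chunked into groups of at most max_facts and each group becomes one
# sentence. Rule priorities and conflict handling are deliberately ignored.

def chunk(facts, max_facts):
    return [facts[i:i + max_facts] for i in range(0, len(facts), max_facts)]

facts = ["is a stamnos",
         "was decorated by the painter of Dinos",
         "was painted with the red figure technique",
         "is made of clay"]

for max_facts in (1, 2, 4):
    sentences = ["This exhibit " + " and ".join(group) + "."
                 for group in chunk(facts, max_facts)]
    print(max_facts, "->", " ".join(sentences))
```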

    3.2. Comparison

    3.2.1. What is comparison?

Comparison is like the ancient Roman god Janus, who has two faces: comparison consists of two related but quite different aspects, similarity and contrast. Similarity is prototypically signalled by connectives like also and too, and contrast by connectives like whereas and while. Unfortunately, the literature contains few in-depth studies of comparison in general. The Rhetorical Structure Theory of Mann and Thompson (1988) includes only a very elementary discussion of comparison and of relations applicable to an NLG system. The lack of articles and of an extended discussion of comparison in linguistic theories made the task of implementation in an NLG system even harder and produced a few controversial suggestions and solutions. Comparison is used as a means of enhancing the experience of the user, both by facilitating learning and by broadening the user's knowledge through introducing them to new and relevant items in the domain.

    Similarity deals with two propositions which contain some common components.

    1. Michael is German and teaches linguistics.

    2. Maria is Italian and teaches linguistics.

    [1+2=>]. Michael is German and teaches linguistics. Maria is

    Italian; she also teaches linguistics.

    On the other hand, contrast deals with two propositions which contain contrary

    components or two different aspects of a relation or entity.


3. Michael is German. He teaches linguistics.
4. Angelina is English. She teaches history.
[3+4=>]. Michael is German and teaches linguistics. Angelina is English; but she teaches history (or: She is also a teacher; however, she teaches history).

Comparisons are seldom thought of as something in and of themselves; rather, they are considered to be part of a larger explanation. According to Milosavljevic (1999) they are in fact a central part of descriptions. She claims that it is widely accepted that when describing a new concept to a hearer, the hearer's understanding of the new concept can be encouraged by drawing analogies with understood concepts or solutions to problems (1999: 28). In addition, Tversky's theory (Keane et al., 2001) defines the similarity of two entities, A and B, to one another as a weighted function of the intersection of the features of A and B, less the sum of weighted functions of the distinctive features of each of the entities; in particular, a new entity is characterized on the basis of the properties it shares with already known entities. The relationships between two representations may be a. commonalities (a property-pair of the entities matches), b. alignable differences (a property-pair of the entities has different values) and c. non-alignable differences (a value of a property-pair is absent) [see table 3.2].

                            stamnos                       drachma
Commonalities               creation period: classical    creation period: classical
Alignable differences       material: clay                material: silver
Non-alignable differences   painted by: Dinos             -----

Table 3.2: The relationships between two representations
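Tversky's contrast model, on which the account in Keane et al. (2001) builds, is commonly written as the weighted feature-set formula below; f measures the salience of a feature set and the weights are free parameters of the model, not values used in M-PIRO.

```latex
\mathrm{sim}(A, B) = \theta\, f(A \cap B) \;-\; \alpha\, f(A \setminus B) \;-\; \beta\, f(B \setminus A),
\qquad \theta, \alpha, \beta \ge 0
```

Commonalities (the shared features) raise the similarity score, while the distinctive features of each entity, which correspond to the alignable and non-alignable differences of Table 3.2, lower it.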

Discourse analysis and pragmatics have dealt with the problem of comparison, but only superficially. Various groups of conjunctions have been proposed: for similarity, expressions such as both, similarly, another way in which these two ... are similar, in the same way, these ... are similar in that and likewise; and for contrast, expressions such as different in many ways, is different, whereas, another difference, but, also differ in, however and while (more details in table 3.3). Nevertheless, the conditions for combining information, the preference for some conjunctions over others, the appropriateness of a conjunction and the resulting change in text structure have not been examined deeply and


seriously, so that the comparison modules of recent NLG systems still remain quite simplistic and sometimes show weaknesses due to the monotony of their expressions and a lack of variety.

SIMILARITY
  Short conjunctions: Similarly, | Likewise, | ...the same... | ...the same as... | ...also... | ..., too. | both
  Long conjunctions: In the same way, | X is similar to Y in that (they)... | X and Y are similar in that (they)... | Like X, Y [verb]... | In like manner, | One way in which X is similar to Y is (that)... | Another way in which X is similar to Y is (that)...

CONTRAST
  Short conjunctions: However, | In contrast, | By contrast, | ..., but | ..., yet | On the other hand, | nevertheless,
  Long conjunctions: even though + [sentence] | although + [sentence] | whereas + [sentence] | unlike + [sentence] | while + [sentence]

Table 3.3: Short and long conjunctions for similarity and contrast

    3.2.2. The implementation of comparison in M-PIRO System

An essential aim in generating comparisons was to avoid making individual, full-clause references back to previously seen exhibits, which tend to be boring, distracting and irritating and would make the system's educational goal disputable and controversial. Therefore, the M-PIRO comparison module attempts to overcome this problem by using the class hierarchy to group previously examined exhibits into broader categories and to make either group references or short individual references, when the system knows that the name of the exhibit is sufficient to make a unique reference (Androutsopoulos et al., 2002) [see Table 3.4].

Comparison ordering (importance)

ENGLISH: sculpted-by, potter-is, painted-by, original-location, creation-period, painting-technique-used, made-of
GREEK:   potter-is, original-location, creation-period, painting-technique-used, made-of

Table 3.4.: The comparison ordering based on information importance for the M-PIRO system


The system can make two kinds of comparison: similarity ("like the stamnos, this lekythos was created during the classical period") and contrast ("unlike the previous coins, which were made of silver, this stater is made of gold"). When the system is asked for the next exhibit, it stores the information about the exhibit that can be compared (see the previous table); in particular, it locates the target predicates. As a first step towards forming a comparison, Exprimo completes the domain class hierarchy tree for the previously examined exhibits that belong to the same exhibit subclass as the current exhibit (assume that the user has visited only exhibits from the subclass vessel). The system collects all the potential comparators for each past exhibit. The next step is to remove, firstly, subsets (to avoid making full-clause references to previous exhibits) and, secondly, similar subclasses that are not directly related to the exhibit. Finally, the system removes the weakest relatives by checking for identical comparators higher in the hierarchy, and distant relatives by checking the direct relatives of the exhibit in the current focus. For example, suppose the system checks the comparator made-of of an archaic stater (silver); the previous exhibit had the form "classical tetradrachm made-of silver" and the exhibit before that "drachma made-of silver". The super-class is now coins [made-of silver] and, therefore, the system removes the individual entries (tetradrachm and drachma), uses the super-class entry (coins) and generates the comparison "like the previous coins, this stater is made of silver". However, if, for instance, both previous exhibits had been the same sub-entity, such as tetradrachm, and both made of silver, it would be pointless to keep the super-class entry; the generated comparison would then be "like the tetradrachms, this stater is made of silver". Finally, the system uses for similarity the phrases "another ..." and "like the (previous) X", and for contrast "unlike the (previous) X".
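
The grouping step described above can be sketched roughly as follows (an illustrative Python sketch under the assumption of a tiny, made-up class hierarchy; it is not the actual Exprimo implementation):

    # Rough sketch of grouping past exhibits under a common superclass before
    # generating a comparison (illustrative only; not the actual Exprimo code).
    PARENT = {"tetradrachm": "coin", "drachma": "coin", "stater": "coin"}  # hypothetical is-a links

    history = [  # previously visited exhibits and their comparator values
        {"type": "tetradrachm", "made-of": "silver"},
        {"type": "drachma", "made-of": "silver"},
    ]

    def made_of_comparison(current_type, material):
        # Keep only the past exhibits that share the material with the current one.
        matches = [e["type"] for e in history if e.get("made-of") == material]
        if not matches:
            return None
        if len(set(matches)) == 1:
            group = matches[0] + "s"          # all comparators are the same subclass: name it directly
        else:
            group = PARENT[matches[0]] + "s"  # otherwise fall back to the common superclass
        return f"Like the previous {group}, this {current_type} is made of {material}."

    print(made_of_comparison("stater", "silver"))
    # Like the previous coins, this stater is made of silver.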


    Chapter 4

    The Pilot Experiment

    4.1. Introduction

Ultimately, the pilot experiment was intended to examine not only the subjects' performance, both within and between subjects, but also what kind of text structure the subjects preferred and how much they thought they had learnt from a text with or without comparison and aggregation. It was decided that the best way to answer all these questions would be to give the subjects two thematically different text sets produced by the M-PIRO system, one with the options of comparison and aggregation enabled and the other without them; the subjects were then tested in a factual recall test and asked to decide which set was more natural and fluent to them. There were, however, a lot of parameters to consider and test before proceeding to the main experiment.

Experiments conducted in earlier studies involving the evaluation of natural language generation systems aimed at testing different kinds of system properties, such as accuracy (Jordan, Dorr & Benoit 1993), fluency/intelligibility (Minnis 1998) and task performance (IDAS: Levine & Mellish, 1995; ILEX: Cox et al., 1999), and could therefore provide different points of reference for the purpose of the present evaluation experiment. These experiments and approaches evaluated different aspects and properties of an NLG system. Mellish & Dale (1998: 349-350) tried to distinguish between the evaluation of systems and the evaluation of the theories underlying them, and to distinguish both of these from task evaluation; each of these aspects is considered by looking at how evaluation has been carried out in the field so far. Although evaluation work has increased substantially over the last few decades (the 1980s and after), not much has been done on evaluation, either in the linguistics field or in the experimental design of natural language generation systems. According to Mellish & Dale (1998) and Bangalore et al. (2000), the problem is the confusion caused by the failure to distinguish properly between natural language understanding and natural language generation; perhaps the most important reason is that, from a practical perspective, we are faced with a world where there is a great deal of textual material whose value might be


leveraged by the successful, or even partial, achievement of the goals of NLU (Mellish & Dale, 1998: 352).

The only work on task evaluation was done by Levine & Mellish (1995) for IDAS and by Cox et al. (1999) for the ILEX system. As already noted in section 2.1.2, the evaluation of ILEX failed to support the hypothesis that the dynamic hypertext version would improve the performance of the subjects. It was therefore very important to address these issues and problems (Mellish & Dale, 1998) by evaluating M-PIRO carefully and testing the solutions before running the main experiment. For the purpose of this study, the two variables to be tested were comparison and aggregation.14 The major problem in evaluating an NLG system is that of assessing the output. Since there is no objective criterion for comparing the appropriateness of the text, it was decided to assess the output with and without the two variables. It was therefore critical to choose appropriate text output for the experiment and to expose all the subjects to the same conditions [unlike the Cox et al. (1999) experiment]. Furthermore, the knowledge that the human subjects would gain from the texts was measured, since the human subjects are the most valuable resource. Finally, the last problem that had to be solved before the main experiment was how to handle disagreement among human judges, since humans will not always agree about subjective qualities like good style, coherence and readability (Mellish & Dale, 1999: 363); it is therefore preferable to avoid relying on explicit human judgements of such qualities. After finding answers to all these questions and choosing the best parameters, the only way to test them was a pilot experiment, which would show how adequate the data, texts and design were.

14 Comparison is the main factor on which the user's history and the interaction with other texts are built, while aggregation, apart from making a text more fluent and natural, improves the interaction within the text, since the information is not presented as "bricks and tiles" scattered without order.


    4.2. Method

    4.2.1. Designing and choosing the exhibit texts

The M-PIRO Authoring Tool15 was used to choose the exhibits, to design and run the pilot experiment, and to preview the generated texts. The authoring tool is very useful not only for introducing new entities, sub-entities, information and exhibits, but also for inspecting the knowledge database for each exhibit and for choosing which exhibit order would be optimal for the experiment.

At the core of the experimental procedure were the two text variables that help us evaluate the system: aggregation and comparison. The first decision was how these variables were going to be used in the experiment and whether they would be tested together or separately. There could be a text with comparison and aggregation, a text with comparison only and no aggregation, a text with aggregation and no comparison, and a text with neither comparison nor aggregation. The ILEX evaluation showed that comparison alone (comparing the current exhibit with what the user had already visited) did not support the user-history hypothesis and failed to help users perform better and learn more than users whose history had been turned off (Cox et al., 1999). Moreover, the aggregation option combines facts within an exhibit's text and could not, by itself, help the user remember more details. Therefore, it was decided to use one text set with comparison and aggregation and another without comparison and aggregation. Consequently, for the first text set the option "Disable the user's history" was not selected and, in the user's profile, "four facts per sentence (max)" was selected for aggregation; for the second text set the user's history was disabled and only one fact per sentence was selected.
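
Schematically, the two generation conditions can be summarised as follows (a hypothetical rendering of the settings described above; the real options are selected in the Authoring Tool's graphical interface rather than in code):

    # Hypothetical summary of the two generation conditions used in the pilot
    # (the real settings are chosen in the M-PIRO Authoring Tool's GUI).
    CONDITIONS = {
        "with_comparison_and_aggregation": {
            "disable_users_history": False,   # comparisons with previously seen exhibits enabled
            "max_facts_per_sentence": 4,      # aggregation: up to four facts per sentence
        },
        "without_comparison_and_aggregation": {
            "disable_users_history": True,    # no comparisons with earlier exhibits
            "max_facts_per_sentence": 1,      # one fact per sentence, i.e. no aggregation
        },
    }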

Continuing with the design of the experiment, it was decided to use the user-model profile for adults. This profile has as default values four facts per sentence (two more than the child user profile) and two repetitions for assimilation (one more than the expert user profile). Because of the time limitations of the experiments (both pilot and main), only one profile could be used and, therefore, the participants were only adults.

15 For more details see section 2.2.3, The M-PIRO Authoring Tool.


Moreover, none of the adult subjects was to be an expert in, or to have been at some point acquainted with, the subjects of numismatics,16 angiology17 or archaeology.

The second core part of the experiment was the exhibit texts and the decision whether the variables were going to be tested within or between subjects. Although in the evaluation of ILEX some subjects used the dynamic version and the others the static one, it was decided to give each subject two text sets, one with comparison and aggregation and the other without them. However, these sets could not contain exactly the same exhibits, as that would make it impossible to evaluate the performance. Therefore, two completely different text sets were chosen, each containing six exhibits, all belonging to the same main entity. Exhibits from different entities, such as statues, portraits and jewels, were not mixed in one set, since that would make the text much more difficult for the user and M-PIRO could not produce comparison pairs between largely unrelated exhibits. The first text set contained only coin exhibits,18 namely a drachma, a classical tetradrachm, a tetradrachm, an archaic stater, a Hellenistic stater and a coin of the reign of Commodus. The second text set contained only vessel exhibits, namely a Hadra ware hydria, a black kantharos, a cylix, a classical cylix painted by Eucharides, a lekythos painted by Amasis and a lekythos.

Lekythos (with comparison and aggregation)

This exhibit is another lekythos. Like the black kantharos, it was created during the classical period. It dates from between 470 and 460 B.C. It shows an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man. This exhibit is a lekythos. The lekythos was originally from Attica. Like this previous lekythos, it was originally from Attica.

    Lekythos (without comparison and aggregation)

This exhibit is a lekythos. It was created during the classical period. It dates from between 475 and 470 B.C. It depicts an athlete preparing to perform a jump. This lekythos was painted with the red figure technique. The origin of the long jump lies in the challenge presented by traversing the cliffs, ravines and rough terrain of the Greek countryside, and, accordingly, by the challenge of waging war on this terrain. It was a complicated sport in which the athletes used special weights, the halteres, to increase their momentum and the distance of the jump. On this lekythos, the athlete holds the weights in his hands and is about to jump away from the springboard. In order to win, he needs not only great speed and strong legs but also precise coordination between his hands and feet as they make contact with the springboard. This is why the long jump was occasionally accompanied by music, which helped the jumper pace his rhythm. This lekythos was originally from Attica.

Table 4.1.: Part of the lekythos texts generated by M-PIRO with and without comparison and aggregation

16 Numismatics is the science whose field is the history and study of (ancient) coins and medals.
17 Angiology is the science whose field is the history and study of ancient vessels.
18 The whole texts for the Coins and the Vessels in both versions are in Appendix I (English and Greek texts).

Moreover, we considered that the exhibits of each text set had to belong to a different main entity; if both text sets had contained, for example, different exhibits that were all vessels, the design would have been biased, since the users would no longer be naïve when they read the second text set. In addition, it was interesting to see how the subjects would perform on easy and difficult texts, in particular on the easy text set about coins and the difficult text set about vessels. Hence, half of the participants read the coins text sequence with comparison and aggregation and the vessels text sequence without them, and the other participants read the coins text sequence without comparison and aggregation and the vessels text sequence with them.

Finally, two instruments were devised for use in the evaluation: a recall test of factual knowledge about the coins and vessels in the exhibition, and a usability questionnaire. The test was administered to subjects on paper. The factual recall test, which was introduced to the subjects under the title "What did you learn from the virtual exhibition", was a multiple-choice test. Almost half of the questions were about combined and contrasted facts across the exhibits, with varying degrees of difficulty. Four examples (two from each question set) are shown below:

8. During which period were the two cylixes created? (archaic, classical, Hellenistic, Roman)

12. Which color is the background of an exhibit decorated with the red-painting technique? (red, black, white, clay, blue)

3. The tetradrachms are made of ... . (gold, silver, bronze, marble, there is no info in the texts, different material for each one)

12. Whose picture is on the Hellenistic stater? (King Perseus, Alexander the Great, Athena, Apollo)

Table 4.2.: Some questions from the factual recall test (all the questions are in Appendix II). The correct answers are in bold type.


    4.2.2. Subjects

The subjects who took part in the pilot experiment were 8 native speakers of Greek and English aged 23 to 31, four for each language version of M-PIRO. Four of them were male and four were female. At the time of the experiment they were all MSc or PhD students at the University of Edinburgh; the Greek subjects had spent 1-5 years studying in Britain, which did not affect their understanding of Greek scientific text. Although all the Greek participants had taken some elementary courses in Ancient Greek history and archaeology in Gymnasium and Lyceum classes, they were naïve, as they had no previous experience either with the subject of vessels and coins or with the text output of a natural language generation system. Similarly, the English participants were also naïve, as they had no previous experience with archaeological texts about vessels and coins or with the text output of a natural language generation system. Additionally, only one of the English participants had taken a course in the Greek language and was therefore familiar with the Greek words. Furthermore, none of the subjects had any history of reading problems (such as dyslexia) or comprehension problems. Finally, all the participants were naïve as far as the goal of the experiment was concerned.

    4.2.3. Procedure

The experiment took place in the Computer Micro Lab room of the Department of Theoretical and Applied Linguistics of the University of Edinburgh and in my flat. The subjects were usually alone, in very quiet conditions, so that no one could disturb them while they were reading the texts. The M-PIRO text output was printed on A4 pages.

Before the experiment started, the subjects were given a short introduction to its nature. They were told that they were going to read two different text sets, each describing six museum exhibits and generated by an NLG system. Moreover, they were informed about what kind of information was in the texts; nevertheless, nothing was explained to them about the text structure or the text difficulty. Additionally, they were told that they should try to learn and remember the descriptions and references related to the exhibits, since they were going to answer a set of 13 multiple-choice questions for each text set without having any text in front of them. They


had fifteen minutes to read each exhibit text sequence and could ask anything before they started reading. When it was specified that their task would be to answer some questions, all the participants wanted to know what kind of questions they were going to respond to and whether they had to learn the texts by heart. In response, they were given some examples and told that they should read the texts like any other text or document. When the time ran out, or when they felt that they could answer the questions, they were given further instructions. They were told that they had to choose only one answer, unless it was stated in the question that there were two possible right answers. In particular, they were asked not to answer any question about which they had no clue or recollection, since it was possible to choose the right answer by chance, and that would not be good for the statistical analysis. It was pointed out that they were obviously not expected to remember everything and that they should not feel uncomfortable about the questions they could not answer. The subjects were encouraged to ask any questions before the beginning of the experiment and to make any comments after the end of the experiment.

Firstly, they read the first text set and answered the corresponding questions; afterwards, they read the second text set and answered its questions. I chose randomly which text set they would read first; therefore, some participants read the text set with comparison and aggregation first and the others the text set without comparison and aggregation first; likewise, some participants read the coins texts first and the others the vessels texts first. This procedure covered all possible combinations of text order, since it was necessary to examine the ordering effect and the possible flaw it introduces into the experimental design.19 Testing the order of the two text groups in the pilot experiment would provide important information for the design of the main experiment. Finally, the participants were asked to fill in a questionnaire for both text sets; for this part of the session, they had the opportunity to check the texts again and answer the questions without time pressure. At the end, they were interviewed: they were informed about the purpose of the experiment and discussed their own critical comments and ideas about it.
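
The counterbalancing described above can be summarised schematically as follows (an illustrative sketch only; the actual assignment of the pilot subjects was done by hand at random):

    # Schematic counterbalancing of genre order and condition order for the pilot.
    import itertools, random

    genre_order = ["coins first", "vessels first"]
    coins_condition = ["with comparison and aggregation", "without comparison and aggregation"]

    # Four combinations; with eight subjects, each combination can be used twice.
    combinations = list(itertools.product(genre_order, coins_condition)) * 2
    random.shuffle(combinations)

    for subject_id, (order, condition) in enumerate(combinations, start=1):
        print(subject_id, order, "| coins read", condition)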

19 More details and discussion of this problem appear in a later section of this dissertation; see section 6.1.2, Ordering effects: a possible flaw in experimental design.


    4.3. Results and Discussion

The 8 double-session result sheets, one for each participant, were marked by the author, saved in Word (Microsoft Office 2000) and then exported to Excel and SPSS 11.5 for further analysis, processing and the creation of tables and graphs.

                       Coins     Vessels     Difference
Group A (Eng)          12        11.5        -1.5
  Std. Deviation       2.82      2.12        +0.70
Group B (Eng)          11        6           +5
  Std. Deviation       1.41      0           +1.41
Group A (Gr)           9         11          +2
  Std. Deviation       2.82      1.41        +1.41
Group B (Gr)           7         5.5         +1.5
  Std. Deviation       2.82      2.12        +0.70

Table 4.3.: The results of the participants in both versions of the pilot experiment

There were 13 multiple-choice questions and the highest possible score was 15. The results of the experiment hinted not only that the participants scored better on the questions about the texts with comparison and aggregation, but also that they preferred the texts with these options as more fluent and natural. There was, however, one exception among the participants: one participant scored much better on the text without comparison and aggregation, and she claimed in the questionnaire that "this funny comparison thing" between the vessels texts made her tired and that she did not like the text, as it did not help her remember details because of the repetitions. Nevertheless, this was an isolated exception. Furthermore, despite the grouping effect among the Greek participants, it seems that comparison and aggregation had a greater effect on the vessels texts, since the differences between the groups were 6 and 5.5. This fact supports the hypothesis that comparison and aggregation help the user to learn more and remember details better, as users do not usually have many difficulties with easy texts. Furthermore, the standard deviations did not allow us to draw clearer conclusions: it was expected that the standard deviation of the mean for the texts with comparison and aggregation would be smaller than that for the texts without them, but unfortunately this was not supported by the data of the pilot.


[Graph 4.1.: Pilot experiment performance. Bar chart of each subject's score (0-16) on the text set with comparison and aggregation versus the text set without comparison and aggregation; the first four subjects are the Greek participants and the other four the English.]

[Graph 4.2.: Pilot experiment performance based on the group factor. Bar chart of each subject's score (0-16) on coins with comparison-aggregation, vessels without comparison-aggregation, coins without comparison-aggregation and vessels with comparison-aggregation.]

Based on the results presented above, and despite the fact that the text type factors (comparison and aggregation) did not reach statistical significance (F(1, 6) = 2.748, p = .148), I decided to use both text sets per participant in the main experiment. The main effect of group (p < .05; p = .031) was the main cause of this lack of significance, and it was therefore expected not to appear in the main experiment, especially since the within-subjects performance supported one of my hypotheses. As the graphs illustrate, there was clearly an effect of the text type factors on the participants' performance on the questions. Moreover, the participants characterized the coins texts as easy and the vessels texts as


difficult or very difficult. Finally, with one exception, the text set with comparison and aggregation was almost always preferred.

The results of the pilot did not allow many useful assumptions about what to expect in the main experiment, as there did not seem to be consistency among the groups. For instance, looking at the scores of each group in both versions, it is quite surprising that the English participants performed better than the Greek ones. Although a comparison between these two versions is unfair because of the subjects' different background knowledge and the way it interfered in the pilot, the results raised some questions which may be answered in the main experiment. In addition, the comments of the participants provided helpful guidelines for the main experiment. They found the vessels text questions too difficult and suggested adding some dummy answers to the multiple-choice options; they also thought that more questions would not be a problem. Furthermore, it was suggested that each exhibit's text should be split over two or three pages. Additionally, they considered that the reading time should be about twenty minutes, because less would not be enough and more would become boring. Finally, the participants who read the vessels texts first (with or without comparison and aggregation) complained that the difficulty of the texts exhausted them, since they were tough and almost all the information was completely new to them, and it was therefore harder to read the second text set about the coins; so they would have preferred to read the coins text set first. This ordering effect may be a possible flaw for the main experiment and is discussed in detail in section 6.1.2. It was, however, a major problem which had to be solved somehow.

                                            Coins text      Vessels text
How interesting did you find the text?      Interesting     Neutral
How difficult were the questions?           Easy            Difficult
Did you enjoy the texts?                    Yes             Yes
Which text is more fluent and natural?      7 subjects chose the text with comparison and aggregation; 1 subject chose the text without comparison

Table 4.6.: The questionnaire results of the pilot experiment

To sum up, the results of the pilot study made it fairly clear that a text with comparison and aggregation supports the subjects' reading, helping them to remember, learn and


perform better than a text without these factors. This outcome was considered encouraging for the main experiment. It would be interesting to find out whether there would be a difference in performance within and between subjects and, if so, how big it would be, depending on the difficulty of the text itself. The data obtained from the pilot experiment were not enough to support firm conclusions; nevertheless, they supported some of our hypotheses and suggested some new ones, such as whether comparison and aggregation make a text easier, whether the subjects' impression of learning depends on the text type, and what leads participants to judge a text as more fluent and natural.


    Chapter 5

    The Main Experiment

    5.1. Introduction

The main experiment set out to evaluate the two language versions (English and Greek) of the M-PIRO NLG system by testing the text structure factors of comparison and aggregation; it would support or refute our four hypotheses. As the pilot experiment results revealed, the participants not only scored better on the texts with comparison and aggregation, but also showed a bigger score difference on the difficult text sets.

What is now anticipated is to observe a difference in performance depending on the text type and not on other factors such as group, genre or language. This would hopefully show that the participants' scores were related to the text factors (comparison and aggregation) and that there was no statistical significance for the group, genre and language factors. Moreover, it was anticipated that the

    participants would characterize as natural and fluent the text with the text struc