7/31/2019 karasimos_mpiro
1/118
Athanasios N. Karasimos, Evaluation of M-PIRO System
Declaration
I hereby declare that this MSc dissertation is of my own composition and that it contains
no material previously submitted for the award of any other degree. The work reported in
this MSc dissertation has been executed by myself, except where due acknowledgement
is made in the text.
(Athanasios N. Karasimos)
Acknowledgements
I wish to express my gratitude to all those people without whom this task would
have been much harder to accomplish.
Above all, thank you Amy and Colin, for being so patient, helpful, supportive and
co-operative, and for being there every time I needed you. You made many things clear
and easy with your useful and clever advice. It has been said that the beginning and the
end of a dissertation are its supervisors.
Many thanks to all those people who participated in both experiments; without their
participation the evaluation would have been impossible. I appreciate the time you
spent on my experiments.
And special thanks to my classmate Behzad, to whom I owe my dissertation's topic
and to whom I am grateful for all those useful conversations. Many thanks to Ellen Burk
for guiding me out of the statistics labyrinth.
Finally, on the Greek front, thanks to Aggeliki, Efi, Stavroula and Stathis for their
care and support throughout the whole year. Additionally, special thanks to Alexander
Melengoglou, who offered his valuable knowledge and comments. I could not be here
now without the strong support and love of my family. Thank you Anna and Sotiria for
your corrections.
To my mother,
my uncle
and especially to my sister,
Helen.
Abstract
Half of the problem in Natural Language Generation (NLG) is the evaluation of an
NLG system. In the last few decades research on evaluation has increased and taken
some serious steps in this direction. This study describes an evaluation of a
multilingual personalized information objects system (M-PIRO), which dynamically
generates descriptions for exhibits in a virtual museum exhibition. In the evaluation,
learning outcomes of subjects who read two sets of texts about coins and vessels were
compared to those of subjects who read the same text sets with a different text
structure. The aim was to show that the text type factors of comparison and
aggregation are essential for better performance. Several types of data were collected
by post-session tests of factual recall knowledge and a questionnaire about the evaluated
system. Results showed that performance measures did differ between subjects in the
two conditions (presence and absence of the text type factors); additionally, the data
analysis revealed that text difficulty and the subjects' impression of learning were also
statistically significant. These issues are all considered in order to determine whether
the goal of M-PIRO is achieved and to suggest some improvements to it. The study
concludes with an outline of future work.
Contents
Declaration.......................................................................................................................ii
Acknowledgements.........................................................................................................iii
Abstract............................................................................................................................v
Contents ..........................................................................................................................vi
Index of tables, graphs and pictures ..........................................................................viii
1. Introduction .................................................................................................................1
1.1. Natural Language Generation Systems ...............................................................1
1.2. Evaluating Natural Language Generation Systems.............................................2
1.3. Purpose and Outline of the study.........................................................................5
2. The M-PIRO NLG System............................................................................................6
2.1. The ILEX NLG System...........................................................................................6
2.1.1. The ILEX Dynamic Hypertext System ............................................................6
2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version....................7
2.2. The M-PIRO System...............................................................................................9
2.2.1. The M-PIRO Domain and Generation Architecture ........................................9
2.2.2. The M-PIRO Authoring Tool .........................................................................12
3. Aggregation and Comparison in the M-PIRO ..........................................................14
3.1. Aggregation........................................................................................................14
3.1.1. What is aggregation?....................................................................................14
3.1.2. The implementation of aggregation in the M-PIRO System..........................15
3.2. Comparison........................................................................................................17
3.2.1. What is comparison?....................................................................................17
3.2.2. The implementation of comparison in M-PIRO System................................19
4. The Pilot Experiment................................................................................................21
4.1. Introduction........................................................................................................21
4.2. Method ...............................................................................................................23
4.2.1. Designing and choosing the exhibit texts ....................................................23
4.2.2. Subjects ........................................................................................................26
4.2.3. Procedure......................................................................................................26
4.3. Results and Discussion.......................................................................................28
5. The Main Experiment...............................................................................................32
5.1. Introduction........................................................................................................32
5.2. Method ...............................................................................................................32
5.2.1. Designing and choosing the exhibit texts ....................................................32
5.2.2. Subjects ........................................................................................................35
5.2.3 Procedure.......................................................................................................36
5.3. Results ................................................................................................................37
6. General Discussion ....................................................................................................48
6.1. The results of both experiments .........................................................................48
6.1.1. Interpreting the results..................................................................................48
6.1.2. Ordering effect: a possible flaw in experimental design..............................51
6.2. Suggestions and improvements ..........................................................................53
6.3. Future work........................................................................................................56
6.4. Conclusion .........................................................................................................58
Bibliography ..................................................................................................................64
Appendix I: The M-PIRO generated texts for the Main Experiment..............................64
Coins Text Sequence [English] with comparison and aggregation..........................64
Coins Text Sequence [English] without comparison and aggregation.....................67
Vessels Text Sequence [English] with comparison and aggregation .......................70
Vessels Text Sequence [English] without comparison and aggregation ..................73
Coins Text Sequence [Greek] with comparison and aggregation ............................77
Coins Text Sequence [Greek] without comparison and aggregation .......................80
Vessels Text Sequence [Greek] with comparison and aggregation..........................83
Vessels Text Sequence [Greek] without comparison and aggregation.....................87
Appendix II: What did you learn from the virtual exhibition? ....................................... 91
The Questions for the Coins Text Sequence [English] .............................................91
The Questions for the Vessels Text Sequence [English] ...........................................93
The Questions for the Coins Text Sequence [Greek] ................................................95
The Questions for the Vessels Text Sequence [Greek]..............................................97
Questionnaire..........................................................................................................100
Appendix III: The Statistical guide ..............................................................................101
Index of tables, graphs and pictures

Table 2.1. The M-PIRO pipeline generation architecture 9
Table 2.2. Part of M-PIRO entity hierarchy organization of types and levels 10
Table 3.1. The relationships between two representations 17
Table 3.2. Short and long conjunctions for similarity and contrast 18
Table 3.3. The comparison ordering based on information importance for M-PIRO system 18
Table 4.1. Part of the lekythos text generated by M-PIRO with and without comparison and aggregation 23
Table 4.2. Some questions from the factual recall test (all the questions are in Appendix II). The correct answers are in bold type 24
Table 4.3. The results of the participants in both versions of the pilot experiment 26
Graph 4.4. The performance of the participants based on the option of comparison and aggregation 27
Graph 4.5. The performance of the participants based on the group factor 28
Table 4.6. The questionnaire results of the pilot experiment 29
Picture 5.1. A web page from the vessels sequence that contains the Spherical Corinthian Aryballos 31
Table 5.2. Two examples of the vessels texts with more complicated comparisons 33
Table 5.3. The results of the participants in both languages of the main experiment 36
Graph 5.4. The score performance per person depending on text type factors (Greek Version) 37
Graph 5.5. The score performance per person depending on text type factors (English Version) 38
Graph 5.6. The results per participant depending on the group factor [Greek version] 39
Graph 5.7. The results per participant depending on the group factor [English version] 39
Graph 5.8. The performance for all the participants depending on the genre factor [both versions] 40
Graph 5.9. Box plots for performance depending on genre and text type factors [English version] 41
Graph 5.10. Box plots for performance depending on genre and text type factors [Greek version] 41
Graph 5.11. The performance of Greek participants depending on text difficulty (coins vs. vessels) 42
Graph 5.12. The performance of English participants depending on text difficulty (coins vs. vessels) 42
Graph 5.13. The questionnaire results summary of the English participants for both groups 43
Graph 5.14. The questionnaire results summary of the Greek participants for both groups 43
Graph 6.1. Box plots for the performance of the participants depending on the language factor 49
Chapter 1
Introduction
1.1. Natural Language Generation Systems
Natural Language Generation (NLG) is the assembly of text word-by-word
using knowledge of morphology, syntax, semantics and text structure (O'Donnell et
al., 2001). As a branch of computational linguistics, cognitive science and artificial
intelligence, NLG is the process of constructing natural language outputs from non-
linguistic inputs (symbolic or numeric), in particular of mapping some
underlying representation of information to a meaningful, understandable and specific
presentation of that information in spoken and/or textual linguistic form (in human
languages). A complete NLG system has to take many decisions to produce an
appropriate output. The goal of NLG can be characterized as the inverse of that of
Natural Language Understanding (NLU): in NLU the concern is to map from
text to some underlying representation of its meaning (Mellish & Dale, 1998;
Jurafsky & Martin, 2000: 764-794; O'Donnell et al., 2001). The generation process in
an NLG system typically consists of the following five main stages:
Content determination, in which the system decides what information should be
included in the text and what should be omitted; this selection depends on a variety of
contextual factors and on the particular user to whom the text is directed.
Document structuring, in which it is decided how the text should be organized and
structured; that is, for the information already selected, the NLG system has to choose
an appropriate structure to convey it.
Lexical selection, in which the system chooses the particular words, word types and
phrases that are required to communicate the specified information; it may also be
possible to vary the words used for stylistic effect.
Sentence structure[1], which involves the processes of aggregation, in which the
system must apportion the selected content into phrase-, clause- and sentence-sized
chunks and often, in the interests of fluency, place several pieces of information into
one sentence, and referring expression generation, in which the system determines
which properties of an entity to use when referring to that entity.
And Surface realization, in which the system maps the underlying text structure
onto natural text consisting of grammatically correct sentences.
The M-PIRO NLG system, which will be discussed in Chapter 2 and evaluated, is a
dynamic hypertext[2] natural language generation system.
1.2. Evaluating Natural Language Generation Systems
Over the last 15 years, the level of interest and concern expressed by natural
language processing (NLP) researchers with regard to evaluation has increased
substantially. In early NLG work, the quality of a system's output was assessed
by the system authors themselves; they underestimated the value of evaluation
for the improvement of an NLG system. In contrast, it has nowadays become
widely accepted in the language processing community that NLP researchers should
take the evaluation of a system seriously and pay attention to its results, since it plays
an essential role in the development of NLG systems, techniques and strategies.
Mellish and Dale (1998: 349) claim that NLG is exactly half of the problem of
natural language processing work; the other half is the evaluation of a system.
According to them, the first serious work in the field of evaluation took place in 1990,
at a workshop held on the theme of the Evaluation of Natural Language Generation
Systems, and in the papers of Meteer and McDonald (1991) and Moore (1991). The main
problem they tried to address was that of distinguishing the evaluation of a system
from the evaluation of its underlying theories. In dealing with this and some other
essential problems, empirical work in the field increased noticeably, building on these
papers.

[1] For some researchers (Reiter & Dale, 2000, Building a Natural Language Generation System;
Melengoglou, 2002) lexical selection, aggregation and referring expression generation are part of
microplanning.
[2] Dynamic hypertext refers to an NLG system which creates its output dynamically at run-time,
when the user requests it; such an output text is generated on demand, not pre-written by a human author.
Evaluation can have many objectives and can consider several different dimensions
of an NLG system or theory. Sometimes evaluation objectives combine aspects of
a system and its underlying theory, making the task harder. Mellish and Dale (1998)
suggest that the evaluation of NLG techniques can be broken down into three main
categories. Evaluating properties of the Theory is the assessment of the
characteristics of some theory underlying an NLG system or a part of it; the
implementation of this theory, e.g. Rhetorical Structure Theory, helps us to characterize
the theory as appropriate or not for the system. Evaluating properties of the System is
the assessment of specific characteristics of some NLG system; it might be a
comparison of two NLG systems or their algorithms, of the performance of an NLG
system in two different versions, or of the output of the generator against the
characteristics of a corpus of target texts. Finally, Applications Potential is the
evaluation of the potential utility of an NLG system in some environment, for instance
whether its use provides a better solution than some other approach (Mellish & Dale,
1998: 353-354).
Previous approaches to the evaluation of NLG theories are very few. The main
problem is that a good NLG system is based on a theory; nevertheless, during its
construction several practical problems must be solved and, therefore, the solutions may
be unconnected to the underlying theory. There have been some evaluations of grammars
based on particular theories, like Rhetorical Structure Theory (Mann & Thompson, 1988).
Robin tested his revision-based theory with a natural corpus for the domain of baseball
summaries. As Mellish and Dale (1998: 355) report, this kind of evaluation is
dangerous, since most reported work on NLG evaluates its theory indirectly through
the systems that implement it.
In contrast to the evaluation of NLG theories, the question "how good is my NLG
system?" is much easier to answer. There are different kinds of system aspects that can
be evaluated. Accuracy evaluation means assessing the relationship between input
and output, i.e. whether the generated text conveys the desired meaning to the reader
(Jordan et al., 1993). Fluency and intelligibility evaluation concerns the quality of
generated text and includes notions such as syntactic correctness, stylistic appropriateness,
organization and coherence. Although it is unclear how these notions can be measured,
Minnis (1993) made some proposals for evaluation. There are quite a few evaluations in
this field; for example, Bangalore et al. (2000) evaluated the system FERGUS and found
that the understandability and quality of its output were correlated with each other[3].
Finally, a task evaluation involves observing how well a task is performed using the
NLG system. Usually, task evaluation is used for the evaluation of Machine Translation
(MT) systems, such as the IDAS assessment (Levine & Mellish, 1995); however, it has
also been used for other kinds of NLG systems, such as PEBA-II (Milosavljevic &
Oberlander, 1998), ILEX[4] (Cox et al., 1999) and AMVF (Carenini and Moore, 2000).
Finally, there are some issues and general problems which arise when evaluating an
NLG system. Firstly, the major problem in evaluating an NLG system is that of
assessing its output, and the question arises of what the output should be. A fluency
and intelligibility evaluation can deal with this issue, but it lacks objective criteria,
whereas the results of a task evaluation only indirectly reflect the properties of the
system. Secondly, it is necessary to evaluate measurable attributes of the system's
performance, and thirdly, these attributes must be compared with something else;
otherwise it is hard to say that something is good or bad, easy or difficult, acceptable or
not. Additionally, it is essential to obtain adequate test data for the evaluation: an
evaluation without sufficient data will unsurprisingly fail. Finally, human judges may
disagree, and it would be unwise simply to ignore their disagreement; the authors should
therefore guide[5] them to give specific, objective and clear opinions, not vague
criticisms and thoughts.
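One common way to handle such judgements, rather than discarding them, is to quantify the agreement between judges. The sketch below computes Cohen's kappa over invented fluency ratings from two hypothetical judges; kappa is my illustrative choice here, not a measure used by the works cited above.

```python
# A minimal sketch of measuring inter-judge agreement with Cohen's kappa.
# The ratings are invented for illustration.

from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges over the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of items on which the judges agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both judges rated independently at their own rates.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Fluency judgements ("good"/"bad") from two hypothetical judges:
judge1 = ["good", "good", "bad", "good", "bad", "good"]
judge2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohen_kappa(judge1, judge2), 2))  # → 0.67
```

A kappa near 0 means the judges agree no more than chance would predict, which is exactly the situation in which their judgements need the kind of guidance described above.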
[3] They evaluated two different versions of FERGUS (Flexible Empiricist/Rationalist Generation
Using Syntax) using evaluation metrics (accuracy) which are useful as relative quantitative assessments
of different models.
[4] For more details see section 2.1.2 of the second chapter.
[5] However, the subjects should not be guided towards what the authors desire, for instance to say
what the authors want to hear.
1.3. Purpose and Outline of the study
The purpose of this dissertation is to present and describe the evaluation of the M-
PIRO NLG system and to draw useful conclusions about further improvement of the
system and the future development of NLG systems.
The hypotheses of this evaluation are the following: firstly, we expect that a text
with comparison and aggregation will help the subjects perform better and learn more.
Secondly, performance will differ depending on the difficulty of a text. Thirdly, the
subjects will characterize a text with comparison and aggregation as more fluent and
natural than a text without these factors. Finally, they will feel more comfortable and
more certain that they learn more from a text with the above factors.
The remainder of this dissertation is organized as follows:
Chapter 2 provides an overview of the ILEX NLG system (2.1.1) and its evaluation
(2.1.2). It then presents the M-PIRO NLG system, describing its domain and generation
architecture (2.2.1) and its authoring tool (2.2.2).
Chapter 3 examines both comparison and aggregation in the M-PIRO system. It also
provides the literature background and a description of the implementation of these
factors in the current system.
Chapter 4 gives a description of the pilot experiment that preceded the main
evaluation experiment. After the presentation of the main purpose (4.1), the method is
illustrated (4.2) and the results are presented and analyzed (4.3).
Chapter 5 presents the design of the main experiment for the evaluation. As the
purpose and design were largely the same as those of the pilot (5.1-5.2), the emphasis
falls on the analysis of the results (5.3).
Chapter 6 closes the dissertation with a general discussion of the results of this
study (6.1.1) and of an ordering effect noticed in the experimental design (6.1.2).
Furthermore, it offers some suggestions and improvements (6.2) and future work (6.3)
for the M-PIRO system.
Chapter 2
The M-PIRO NLG System
2.1. The ILEX NLG System
2.1.1. The ILEX Dynamic Hypertext System
ILEX[6] (the Intelligent Labeling Explorer) is a dynamic hypertext generation system
developed at the University of Edinburgh (1997-2000) in collaboration with the
National Museum of Scotland. Its task was to generate labels (descriptions) for objects
in a virtual museum exhibition in several languages, using a single knowledge database
storing information in a language-neutral form. ILEX generated labels which were
personalized, in that they were tailored opportunistically depending on the user's
interests and the user's interaction history with the system.
The user model of ILEX provides label generation for the categories of child, adult
and expert. It models users in terms of their relation to information, such as the
interest, the importance and the level of assimilation of the information, and provides
values for each predicate type. Since the authors cannot predict the exact nature of the
user, ILEX allows users to control directly the generated text displayed for the
objects, and the freedom to browse the collection of objects in any order; based on the
authors' pre-assumptions about the values of their relation to information, the system
processes the users' requests and adapts the structure of its labels to the user.
ILEX's aim, therefore, is to produce exhibit descriptions that a real curator might
give, so that visitors feel as if they were in a real museum with a guide.
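The idea of scoring each predicate type for a user category can be sketched as follows. The categories, scores, thresholds and fact texts are invented for illustration; they do not reproduce ILEX's actual user model.

```python
# A toy illustration of user modelling in the spirit described above:
# each predicate type carries interest, importance and assimilation
# values for a user category. All numbers and names here are invented.

from dataclasses import dataclass, field

@dataclass
class PredicateValues:
    interest: float       # how interesting this fact type is to the user
    importance: float     # how important it is to convey
    assimilation: float   # how likely the user already knows it (0-1)

@dataclass
class UserModel:
    category: str                      # e.g. "child", "adult", "expert"
    predicates: dict = field(default_factory=dict)

    def select(self, facts):
        """Keep facts this user would find worthwhile and not yet know."""
        chosen = []
        for pred, text in facts:
            v = self.predicates.get(pred)
            if v and v.interest + v.importance > 1.0 and v.assimilation < 0.5:
                chosen.append(text)
        return chosen

expert = UserModel("expert", {
    # experts are assumed to know the material already:
    "material": PredicateValues(0.3, 0.4, 0.9),
    "painting-technique": PredicateValues(0.9, 0.8, 0.2),
})
facts = [("material", "it is made of clay"),
         ("painting-technique", "it was decorated with the red-figure technique")]
print(expert.select(facts))
# → ['it was decorated with the red-figure technique']
```

Changing only the per-category values changes which facts are selected, which is how a single knowledge base can yield different labels for a child, an adult and an expert.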
Opportunistic text tailoring is achieved in ILEX via the use of referring
expressions, comparison expressions, nominal anaphora and approaches derived from
Rhetorical Structure Theory. Built on Rhetorical Structure Theory[7], the aggregation is
organized into nucleus-satellite relations ("Like most Arts and Crafts style jewels, it has
an elaborate design") and multi-nuclei relations ("This jewel is a necklace and was
made by a British designer called Edward Spencer"). For comparison expressions it
uses the user's navigation log; it introduces an already known concept ("This necklace
is also in Arts and Crafts style"), makes simple comparisons with the previously visited
exhibit ("For instance the previous item uses oval-shaped stones (in other words it
features rounded stones). However this necklace does not feature rounded stones") and
steers clear of repeating information which has already been mentioned (Cox et al.,
1999; Melengoglou, 2002).

[6] For an extended discussion of this system, see O'Donnell et al. (2001); see also Milosavljevic and
Oberlander (1998), Cox et al. (1999), O'Donnell et al. (2000).
[7] For more details and an extensive discussion of the theory, see Mann and Thompson (1988).
Nevertheless, ILEX is not without problems and flaws (Isard et al., 2003). Much of
the information was captured by interviewing a curator and then hand-coding
taxonomic information and other assertions. The authors typed text strings literally
rather than representing them as knowledge-base objects stored in a language-neutral
form; it was therefore hard to present the information in any language other than the
original input language. Related to this, there are some problems with the linguistic
grammars (the English and Spanish ones work well, but for the other languages the
grammars would have to be rebuilt or reconstructed). Furthermore, the same-level
values of the adult, expert and child types do not essentially change the text structure.
Finally, the system is less modular than desirable.
2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version
The paper by Cox, O'Donnell and Oberlander (1999) describes an evaluation of ILEX,
the intelligent labelling explorer, in which the learning outcomes of subjects who
used the dynamic ILEX system were analysed and contrasted with those of subjects who
used a static version of the system. Their goal was to attempt to isolate learning effects
specifically due to dynamic hypertext generation (Cox et al., 1999). In previous work
(Levine & Mellish, 1995), the IDAS system, which used natural language generation
techniques in the automatic generation of hypertexts, was evaluated; the users' task was
to retrieve relevant information to answer specific questions. However, that evaluation
did not use any comparison group and built its results and discussion on the users'
page visits and navigation logs.
Since Cox, O'Donnell and Oberlander's aim was not to compare their dynamic
hypermedia with a traditional media system or to observe aspects of hypermedia such as
configurations of links, they used two different versions of ILEX: a traditionally
configured version with static pages and no user modeling, against the original
intelligent system with dynamically generated text containing referring expressions and
comparisons based on a user model (Cox et al., 1999). Both versions contained the same
six jewels. Three instruments were devised for use in the evaluation: 1. a recall test of
factual knowledge about jewels in the exhibition, 2. a curator task[8] and 3. a usability
questionnaire. Twenty subjects were allocated to the dynamic ILEX system and ten
subjects to the static version of the system. The results were quite interesting. Both
groups scored similarly on the two tests in performance terms; however, processing of
the log data revealed that the dynamic system users "made more visits to the case of
jewels than static subjects, and made proportionately more navigation-related button
clicks than their static ILEX counterparts" (Cox et al., 1999: 7). Based on these results
they claim that the users did not benefit from the dynamic version's properties, since they
did not score better; additionally, the learning pattern and performance varied and
learning was achieved in different ways depending on the log data.
Nevertheless, there were some flaws in this experiment, since the same number of
users was not used for both versions. Moreover, the subjects were not exposed to
the same experimental conditions, since they used different versions; a main effect for
groups[9] could therefore have occurred, and it is not mentioned whether one existed or
not. Furthermore, the required time was too long for only six exhibits and might
potentially have affected the performance results, since the time conditions were not
realistic (in a normal case no one would spend ninety minutes on a twelve-paragraph
text about six exhibits). They claimed that any learning effect is almost entirely
dependent on the navigation route, just as Levine & Mellish (1993) did; I maintain that
learning effects go beyond any log and navigation route, since there are many factors
that learning depends on (Mellish & Dale, 1998; Jurafsky & Martin, 2000).
Other experiments have been carried out to test different properties of an NLG
system and to evaluate the system accordingly. Properties of the underlying theory,
properties of the system and the applications potential have all been evaluated.
Nonetheless, the lack of a specific theory of evaluation and the disagreement over
subjective qualities like fluency, readability, good style and appropriateness constitute
an essential drawback of the evaluation of an NLG system. Some researchers have
failed to evaluate systems properly because they used immeasurable phenomena or
subjective criteria. According to Mellish & Dale's (1998) approach to evaluation
theory, there are some important issues and problems that must be solved in an
evaluation design; these will be discussed in the pilot experiment section.

[8] This task consisted of presentations of jewels not seen in the exhibition; the subjects were
asked to examine each exhibit and classify it by answering multiple-choice questions.
[9] For more about statistics terminology see Appendix III.
2.2. The M-PIRO System
The M-PIRO10 NLG system (Multilingual Personalized Information Objects) is a
more recent project of the Information Societies Programme of the European Union that
also generates descriptions of virtual museum objects (exhibits). It is a descendant of the
ILEX system and has focused on developing language-engineering technology for
personalized information objects, specifically on multilingual information delivery
(Isard et al., 2003). It incorporates high-quality speech output, an authoring tool,
improved user modeling and a modular core generation engine (Androutsopoulos et al.,
2002).
2.2.1. The M-PIRO Domain and Generation Architecture
domain model (domain database + domain semantics)
      |
CONTENT SELECTION     selection of facts to convey to the user
      |  information to be conveyed
TEXT PLANNING         ordering of facts + document structure (RST)
      |  text structure
MICRO-PLANNING        lexicalisation + referring expression generation
      |  document specifications
SURFACE REALISATION   text generation
      |
exhibit description

Table 2.1: The M-PIRO pipeline generation architecture
10 The project's consortium consisted of the University of Edinburgh (Scotland), ITC-irst (Italy),
NCSR Demokritos (Greece), the University of Athens (Greece) and the Foundation of the Hellenic
World (Greece).
Table 2.1 outlines the stages of the M-PIRO NLG architecture (Androutsopoulos
et al., 2002) and the process of generating a text.
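The pipeline of Table 2.1 can be illustrated with a toy sketch. This is not M-PIRO's actual code: the fact representation, the function names and the interest/priority fields are all invented here for illustration.

```python
# Toy sketch of the four pipeline stages of Table 2.1.
# Fact representation and "interest"/"priority" fields are hypothetical.

def content_selection(database, min_interest):
    """CONTENT SELECTION: pick the facts worth conveying to the user."""
    return [f for f in database if f["interest"] >= min_interest]

def text_planning(facts):
    """TEXT PLANNING: order the selected facts (here by a simple priority key)."""
    return sorted(facts, key=lambda f: f["priority"])

def micro_planning(facts):
    """MICRO-PLANNING: lexicalise each fact into an abstract clause spec."""
    return [("this exhibit", f["verb"], f["value"]) for f in facts]

def surface_realisation(clauses):
    """SURFACE REALISATION: render the clause specs as text."""
    return " ".join(f"{s.capitalize()} {v} {o}." for s, v, o in clauses)

database = [
    {"interest": 3, "priority": 1, "verb": "is", "value": "a stamnos"},
    {"interest": 2, "priority": 2, "verb": "is made of", "value": "clay"},
    {"interest": 1, "priority": 3, "verb": "was found in", "value": "Attica"},
]

text = surface_realisation(micro_planning(text_planning(content_selection(database, 2))))
print(text)  # This exhibit is a stamnos. This exhibit is made of clay.
```

Each stage consumes the previous stage's output, mirroring the intermediate representations named in the table (information to be conveyed, text structure, document specifications, exhibit description).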
Domain authoring

The knowledge base contains all the necessary information about entities and their
relationships; entities can be abstract or concrete. The major task is defining the
hierarchy of entity types, i.e. entities and sub-entities (e.g. exhibit and material; statue
and marble), which can contain further levels: for example, below the entity statue there
are complex statue, kouros, imperial portrait, etc. Similarly, the relations between
entities are expressed using fields: the domain author can define fields for each entity
type, which are then inherited by all entity types below it in the hierarchy (Isard et al.,
2003). For example, creation-period applies to statue and to all its descendants, such as
complex statue, kouros and imperial portrait, and it must be filled by an entity of the
historical-period group (archaic, classical, hellenistic or roman).
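The field-inheritance behaviour just described can be sketched as follows; the class and field names mirror the text's examples, but the modelling itself is a hypothetical illustration, not M-PIRO's implementation.

```python
# Sketch of an entity-type hierarchy in which a field defined on a type
# (e.g. creation-period on statue) is inherited by every type below it.

class EntityType:
    def __init__(self, name, parent=None, fields=None):
        self.name = name
        self.parent = parent
        self.own_fields = dict(fields or {})

    def fields(self):
        """Collect fields from all ancestors, letting a type extend its parent."""
        result = self.parent.fields() if self.parent else {}
        result.update(self.own_fields)
        return result

exhibit = EntityType("exhibit")
# creation-period must be filled by an entity of the historical-period group
statue = EntityType("statue", parent=exhibit,
                    fields={"creation-period": "historical-period"})
kouros = EntityType("kouros", parent=statue)               # a level below statue
imperial_portrait = EntityType("imperial portrait", parent=statue)

print(kouros.fields())             # {'creation-period': 'historical-period'}
print(imperial_portrait.fields())  # {'creation-period': 'historical-period'}
```

Both sub-types see creation-period without declaring it, which is the inheritance rule the domain author relies on when building the hierarchy.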
Basic Type    Entity Type    Entity Level         Entity
copy
a-location
exhibit       statue         complex-statue
                             kouros
                             portrait             exhibit7
                             imperial portrait    exhibit17
              coin
              jewel
              relief

Table 2.2: Part of the M-PIRO entity hierarchy organization of types and levels
Microplanning expressions

Each field has associated information that specifies how the relationship it
represents can be expressed as a sentence. As mentioned in the introduction,
microplanning involves lexical selection, aggregation and referring expression
generation; the specifications for these are known as microplanning expressions. Either
clause plans are created, in which a verb is selected from a pull-down menu of verbs,
or templates are used, which build the expression out of strings and references
to the two entities whose relationship is expressed by the field. Furthermore,
microplanning draws on the language-dependent lexicon, which contains entries for
nouns and verbs for lexical selection. Articles and prepositions are domain-independent
and are therefore stored as a separate resource.
Three languages

M-PIRO can generate text in three languages: English, Greek and Italian. The
grammar for English is based mainly on the ILEX grammar; the grammar for Greek,
however, was constructed from scratch with the English one as its starting point, and the
Italian grammar was based on the ILEX Spanish one. As already mentioned, the lexicon
now has a larger domain-independent part and a full inflection system, especially for
Greek and Italian, owing to their rich morphological systems. Moreover, M-PIRO
supports high-quality speech output (Festival11 for English and Italian and
DEMOSTHeNES12 for Greek).
User Modeling

One major advantage of the system is that the user modelling information is stored
separately from the domain and linguistic resources, in a personalization server. Each
time users interact with the system, they give their personal details; thus, the system
always has access to the user's personal profile and to information on the history of the
user's interactions with the collection. User types for adults, experts and children were
also defined by the authors. Each entity-type field has values for interest, importance
and repetitions for each user type. The repetitions value is used to calculate the
assimilation score and rate per user (a low repetition rate for experts, a high one for
children). The microplanning expressions and the lexicon entries depend on the user
type and change accordingly. There is a comparison module based on a list of important
information, and there is an aggregation module that uses techniques such as simple conjunction,
11 Developed by the University of Edinburgh. For more details see the official web pages of Festival
(http://www.cstr.ed.ac.uk/projects/festival/) and M-PIRO (http://www.ltg.ed.ac.uk/mpiro,
http://mpiro.ime.gr).
12 Developed by the University of Athens. For more details see the official web pages of
DEMOSTHeNES (http://laertis.di.uoa.gr/speech/synthesis/demosthenes/) and M-PIRO.
relative clauses and syntactic embedding to join single facts together; these two factors
will be discussed in more detail in chapter 3.
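As a rough illustration of the per-user-type parameters just described, consider the sketch below. The adult values (four facts per sentence, two repetitions) and the child's two facts / expert's single repetition follow the figures given in chapter 4; the remaining values are invented, and the real profiles live in M-PIRO's personalization server, not in a dictionary like this.

```python
# Hypothetical per-user-type parameters along the lines described above.

USER_TYPES = {
    #          facts packed per sentence | repetitions before assimilation
    "child":  {"max_facts_per_sentence": 2, "repetitions": 3},  # repetitions invented
    "adult":  {"max_facts_per_sentence": 4, "repetitions": 2},
    "expert": {"max_facts_per_sentence": 4, "repetitions": 1},  # max facts invented
}

def is_assimilated(times_seen, user_type):
    """A fact counts as assimilated once restated often enough for this user type."""
    return times_seen >= USER_TYPES[user_type]["repetitions"]

print(is_assimilated(2, "expert"))  # True  (experts need few repetitions)
print(is_assimilated(2, "child"))   # False (children need more)
```

Because the profile is consulted on every interaction, the same fact can be repeated for a child but suppressed for an expert who has already assimilated it.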
Modularity

M-PIRO's system architecture is significantly more modular than that of its
predecessor ILEX. In particular, the linguistic resources, the database and the
user-modeling subsystems are now separate from the subsystems that perform natural
language generation and speech synthesis, giving the system a satisfactory level of
modularity; of course, it is still not possible to move to a new application domain
without specifying both what will be talked about and what vocabulary will be used
when talking about it.
2.2.2. The M-PIRO Authoring Tool

According to the authors (Androutsopoulos et al., 2002; Spiliotopoulos et al., 2002;
Isard et al., 2003), compared with domain authoring this is a simpler process of
defining specific instances of entities and filling the entities' fields with the
corresponding information. The authoring tool previews the output of the generation
system on the basis of the current state of the database. It is tailored to allow domain
experts to manipulate not only the contents of the database, but also its structure and
the domain-dependent linguistic resources that control how the information in the
database is rendered in natural language. The difficult part of the authoring is done by
an expert, e.g. a museum curator, who designs the hierarchy and adds the basic types,
entity types, microplans, etc. The easier part of the authoring, the one already referred
to, is done by non-experts, who add particular entities. So an expert will add the entity
type amphora, but a non-expert can then add many particular amphora entities, e.g.
exhibit1 and exhibit18, and use the microplans which the expert has built to add
information about each particular entity.
Domain and exhibit authoring can be used together to check information (given by
a designer or curator) and create a text. For example, the domain author can define the
basic types material and exhibit, a relation made-of, a specific material [marble], an
entity type statue that is a subtype of exhibit and an entity type imperial portrait that is a
subtype of statue [a portrait of Octavian Augustus]; the resulting text will be This
exhibit is an imperial portrait. It is made of marble. The tool is designed to be used by
people, such as museum curators, who have no experience in language technology [one
of the basic usability rules of Nielsen (2000)]. Finally, authors can create the types of
visitors (adult, expert, child) and attach fields and microplanning expressions to their
properties.
Chapter 3
Aggregation and Comparison in M-PIRO
3.1. Aggregation
3.1.1. What is aggregation?13
In a Natural Language Generation system, aggregation is part of the microplanning
stage, where texts are composed of verb-based, clause-sized propositions. These
propositions are likely to contain repetitions and redundancies, which make the text
seem boring, non-fluent and unnatural to human readers (Melengoglou, 2002).
Aggregation is used to overcome this problem. The question "what is aggregation?"
has been answered in various ways in the literature, and many researchers have tried to
define it. Summing up these attempts, aggregation can be considered the generation of
more fluent, more readable and less boring text by eliminating redundancy and
semantically combining text components at any level, in order to achieve a more concise
and coherent text. The effect of aggregation can be seen very clearly in the following
example from Reape and Mellish (1999), in which two propositions with obvious
common features are combined to produce a single sentence:
1. The car is here
2. The car is blue
[1+2]. The blue car is here
The goal of aggregation is to produce a text which is concise, coherent, cohesive and
fluent; indeed, the goals that aggregation tries to achieve cover most of the territory of
the default communicative goals of NLG systems in general. Linguistic theories
separate aggregation into several types, such as conceptual aggregation (reducing
the number of propositions in the text while increasing the complexity of conceptual
role values), discourse aggregation (any operation that improves a discourse
structure), semantic aggregation (the combination of two semantic entities into
13 For extended discussion, see Reape & Mellish (1999).
one, or semantic grouping and logical transformations), syntactic aggregation (grouping
subjects or predicates, the most common form), lexical aggregation (lexicalization, or
the choice of particular lexemes to realize lexical predicates and structures) and
referential aggregation (referring expressions).
The input to aggregation is usually a tree containing the ordering of facts and the
dependencies that relate them. In this tree, aggregation detects shared components
among neighboring text-tree nodes and combines them in an attempt to remove
redundancies and repetitions from the resulting text. The most common type of
aggregation is simple conjunction or disjunction, which joins facts together by means
of coordination or contrast, using connectives like and, but and or. Another common
type is syntactic embedding, which subordinates a clause to a proposition surrounded
by commas (Alexander, the king of Macedonia, conquered the Persians). Generally,
according to Melengoglou et al. (2003), the choice of particular aggregation
operations seems to be highly domain-specific.
3.1.2. The implementation of aggregation in the M-PIRO System

Aggregation in the M-PIRO project receives as input a sequence of semantic
representations of facts, which are either classifying facts or facts presenting attributes.
The facts are connected by rhetorical relations, which have to be made explicit to the
aggregation module, as otherwise the intended meaning could well be lost in the
generated text. Conversely, aggregation will never affect the meaning of a text when the
relations are specific. The aggregation module can combine a classifying fact with a
fact presenting an attribute into a complex sentence, and two facts presenting attributes
into a compound sentence.

Melengoglou (2002) built the M-PIRO aggregation module rules, which are capable
of producing a more concise and readable text. They are grouped into three major
combinations. Aggregating identity-attribute pairs includes type-comma (This exhibit
is a drachma, created during the classical period), type-qualifier (This portrait depicts
Alexander the Great, a king from Macedonia) and type-semicolon (In the background
you can see rows of columns, temples and other buildings; in the foreground there is a
ship and a statue of a male form). Aggregating attribute pairs
includes simple conjunction (This coin originates from Patras and it is now exhibited in
the Numismatic Museum of Athens) and shared subject-predicate (This stamnos was
decorated by the painter of Dinos with the red figure technique and is made of clay).
Finally, aggregating nucleus-satellite pairs includes syntactic embedding (This is
another relief, a tomb stele).
There is a hierarchy among the rules, since the system must select the proper
aggregation rule to apply and reject the others. Some rules therefore have higher
priority, as their resulting text is less redundant and more readable and they help to
clarify the meaning. According to these priorities, syntactic embedding is the most
important rule in the set, while simple conjunction is the least significant one.
Additionally, there is no clear preference between type-comma and shared
subject-predicate. After the appropriate rule has been chosen, its parameters must be
specified, such as the maximum number of facts the system should convey in a sentence
and the verification of sentence quality.
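The priority-based selection described above might be sketched like this. The rule names come from the text, but the relative order of type-comma and shared subject-predicate is arbitrary here (the text reports no clear preference between them), and the test of which rules are applicable to a given pair of facts is not modelled.

```python
# Sketch of priority-based aggregation rule selection: apply the
# highest-priority applicable rule and reject the others.

RULE_PRIORITY = [
    "syntactic-embedding",        # most important
    "type-comma",                 # order relative to the next rule is arbitrary
    "shared-subject-predicate",
    "simple-conjunction",         # least significant
]

def choose_rule(applicable):
    """Return the highest-priority rule among those that apply, or None."""
    for rule in RULE_PRIORITY:
        if rule in applicable:
            return rule
    return None  # no rule applies: leave the facts unaggregated

print(choose_rule({"simple-conjunction", "type-comma"}))  # type-comma
print(choose_rule({"simple-conjunction"}))                # simple-conjunction
```

A real module would also resolve conflicts between rules chosen for adjacent fact pairs, as discussed below.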
Applying the rules in sequence, there are two main steps in the aggregation
algorithm. The first concerns the important user-modelling parameter of maximum
facts per sentence, which determines the number of facts that Exprimo should convey to
a particular type of user in each sentence; short sentences may be suitable for small
children, but for adults long sentences may be better suited, as the use of short sentences
becomes irritating and boring to them (Melengoglou et al., 2003). Moreover, conflicts
in the application of two adjacent aggregation rules must be eliminated; the system
should therefore correctly adapt the right aggregation rule to the new linguistic
structure and give up the other. This choice is necessary for sentence quality, since
there are two kinds of restrictions: user-modelling restrictions and text-quality
restrictions. For all these restrictions, Melengoglou's module offered suggestions and
proposals that solved the problems and conflicts. The following sentences illustrate the
effect of different values of the max facts per sentence parameter on four propositions
generated by the M-PIRO system.
Max facts = 1:
This exhibit is a stamnos. It was decorated by the painter of Dinos. It was
painted with the red figure technique. It is made of clay.
Max facts = 2:
This exhibit is a stamnos, decorated by the painter of Dinos. It was painted with
the red figure technique and is made of clay.

Max facts = 3:
This exhibit is a stamnos, decorated by the painter of Dinos with the red figure
technique. It is made of clay.

Max facts = 4:
This exhibit is a stamnos; it was decorated by the painter of Dinos with the red
figure technique and is made of clay.

Table 3.1: A set of propositions generated by M-PIRO with different values of the max facts per sentence
parameter
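The effect of the max facts per sentence parameter can be mimicked with a crude sketch that simply packs facts into groups of at most that size and joins them with "and"; real M-PIRO aggregation applies the rule set described in section 3.1.2 (type-comma, embedding, etc.), not plain conjunction.

```python
# Crude mimic of the "max facts per sentence" parameter: pack facts into
# sentences of at most max_facts clauses. Joining is plain conjunction,
# a simplification of M-PIRO's actual aggregation rules.

FACTS = [
    "this exhibit is a stamnos",
    "it was decorated by the painter of Dinos",
    "it was painted with the red figure technique",
    "it is made of clay",
]

def realise(facts, max_facts):
    sentences = []
    for i in range(0, len(facts), max_facts):
        joined = " and ".join(facts[i:i + max_facts])
        sentences.append(joined[0].upper() + joined[1:] + ".")
    return " ".join(sentences)

print(realise(FACTS, 1))  # four short sentences, one fact each
print(realise(FACTS, 4))  # one long sentence carrying all four facts
```

Varying max_facts from 1 to 4 reproduces the spread from choppy to dense seen in Table 3.1: the information is constant, only the packing changes.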
3.2. Comparison
3.2.1. What is comparison?
Comparison is like the ancient Roman god Ianus, who has two faces: comparison
consists of two related but quite different aspects, similarity and contrast. Similarity is
prototypically signaled by connectives like also and too, and contrast by connectives
like whereas and while. Unfortunately, the literature contains few in-depth studies of
comparison in general. The Rhetorical Structure Theory of Mann and Thompson (1998)
included only a very elementary discussion of comparison and of relations applicable to
an NLG system. The lack of articles and of extended discussion of comparison in
linguistic theories made the task of implementing it in an NLG system even harder and
gave rise to a few controversial suggestions and solutions. Comparison is used as a
means of enhancing the experience of users by both facilitating learning and broadening
their knowledge, introducing them to new and relevant items in the domain.
Similarity deals with two propositions which contain some common components.
1. Michael is German and teaches linguistics.
2. Maria is Italian and teaches linguistics.
[1+2=>]. Michael is German and teaches linguistics. Maria is
Italian; she also teaches linguistics.
On the other hand, contrast deals with two propositions which contain contrary
components or two different aspects of a relation or entity.
3. Michael is German. He teaches linguistics.
4. Angelina is English. She teaches history.
[3+4=>] Michael is German and teaches linguistics. Angelina
is English; but she teaches history (or: She is also a teacher;
however, she teaches history).
Comparisons are seldom thought of as something in and of themselves; rather, they
are considered to be part of a larger explanation. According to Milosavljevic (1999),
they are in fact a central part of descriptions. She claims that it is widely accepted that
when describing a new concept to a hearer, the hearer's understanding of the new
concept can be encouraged by drawing analogies with understood concepts or solutions
to problems (1999: 28). In addition, Tversky's theory (Keane et al., 2001) models the
similarity of two entities, A and B, to one another as a weighted function of the
intersection of the features of A and B, less the sum of weighted functions of the
distinctive features of each entity; in particular, a new entity is understood on the basis
of its similarity properties to known entities. The relationships between two
representations may be (a) commonalities (a property pair of the entities matches),
(b) alignable differences (a property pair of the entities has different values) and
(c) non-alignable differences (the value of a property pair is absent on one side)
[see table 3.2].
                           stamnos                      drachma
Commonalities              Creation period: classical   Creation period: classical
Alignable differences      Material: clay               Material: silver
Non-alignable differences  Painted by: Dinos            -----

Table 3.2: The relationships between two representations
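The three relationship kinds of Table 3.2 are easy to compute once exhibits are represented as sets of property-value pairs; the dictionary representation below is an assumption of this sketch, not the system's internal format.

```python
# Classify property pairs of two exhibits into the three relationship kinds
# of Table 3.2: commonalities, alignable differences, non-alignable differences.

def compare(a, b):
    commonalities, alignable, non_alignable = {}, {}, {}
    for prop in set(a) | set(b):
        if prop in a and prop in b:
            if a[prop] == b[prop]:
                commonalities[prop] = a[prop]          # matching property pair
            else:
                alignable[prop] = (a[prop], b[prop])   # same property, different values
        else:
            non_alignable[prop] = a.get(prop, b.get(prop))  # absent on one side
    return commonalities, alignable, non_alignable

stamnos = {"creation-period": "classical", "material": "clay", "painted-by": "Dinos"}
drachma = {"creation-period": "classical", "material": "silver"}

common, align, non_align = compare(stamnos, drachma)
print(common)     # {'creation-period': 'classical'}
print(align)      # {'material': ('clay', 'silver')}
print(non_align)  # {'painted-by': 'Dinos'}
```

Commonalities feed similarity statements, alignable differences feed contrast statements, and non-alignable differences are typically not compared at all.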
Discourse analysis and pragmatics have dealt with the problem of comparison, but
only superficially. Various conjunctions have been proposed for similarity, such as
both, similarly, another way in which these two ... are similar, in the same way, these
... are similar in that and likewise, and for contrast, such as different in many ways,
is different, whereas, another difference, but, also differ in, however and while (more
details in table 3.3). Nevertheless, the conditions for combining information, the
preference for some conjunctions over others, the appropriateness of a given
conjunction and the resulting changes to text structure have not been examined deeply and
seriously, so the comparison modules of recent NLG systems remain quite simplistic
and sometimes exhibit weaknesses due to monotonous expressions and a lack of
variety.
SIMILARITY
  Short conjunctions: Similarly, | Likewise, | ...the same... | ...the same as... |
    ...also... | ..., too. | both
  Long conjunctions:  In the same way, | X is similar to Y in that (they)... |
    X and Y are similar in that (they)... | Like X, Y [verb]... | In like manner, |
    One way in which X is similar to Y is (that)... |
    Another way in which X is similar to Y is (that)...

CONTRAST
  Short conjunctions: However, | In contrast, | By contrast, | ..., but | ..., yet |
    On the other hand, | nevertheless,
  Long conjunctions:  even though + [sentence] | although + [sentence] |
    whereas + [sentence] | unlike + [sentence] | while + [sentence]

Table 3.3: Short and long conjunctions for similarity and contrast
3.2.2. The implementation of comparison in the M-PIRO System

An essential aim in generating comparisons was to avoid making individual,
full-clause references to previously seen exhibits, which tend to be boring, distracting
and irritating, making the educational goal of comparison disputable and controversial.
The M-PIRO comparison module therefore attempts to overcome this problem by using
the class hierarchy to group previously examined exhibits into broader categories and
by making either group references or short individual references when the system
knows that the name of the exhibit is sufficient to make a unique reference
(Androutsopoulos et al., 2002) [see table 3.4].
Comparison ordering (by importance)

  ENGLISH: sculpted-by, potter-is, painted-by, original location,
           creation-period, painting-technique used, made-of
  GREEK:   potter-is, original location, creation-period,
           painting-technique used, made-of

Table 3.4: The comparison ordering based on information importance for the M-PIRO system
The system can make two kinds of comparison: similarity (like the stamnos, this
lekythos was created during the classical period) and contrast (unlike the previous
coins, which were made of silver, this stater is made of gold). When the next exhibit is
requested, the system stores the information about the exhibit that can be compared
(see the previous table); in particular, it locates the target predicates. As a first step
towards forming a comparison, Exprimo completes the domain class hierarchy tree for
the previously examined exhibits that belong to the same exhibit subclass as the current
exhibit (we assume that the user visited only exhibits from the subclass vessel). The
system includes all the potential comparators collected for each past exhibit. The next
step is to remove, firstly, subsets (to avoid making full-clause references to previous
exhibits) and, secondly, similar subclasses that are not directly related to the exhibit.
Finally, the system removes the weakest relatives, by checking for identical
comparators higher in the hierarchy, and distant relatives, by checking the direct
relatives of the exhibit in the current focus. For example, suppose the system checks the
comparator made-of of an archaic stater (silver); the previous exhibit had the form
classical tetradrachm made-of silver and the exhibit before that was a drachma made-of
silver. The superclass is now coins [made-of silver], and the system therefore removes
the other entities (tetradrachm and drachma), uses the superclass entry (coins) and
generates the comparison like the previous coins, this stater is made of silver. If,
however, both previous exhibits were of the same sub-entity, such as tetradrachm, and
both made of silver, it would be meaningless to keep the superclass entry; the generated
comparison would then be like the tetradrachms, this stater is made of silver. Finally,
the system uses for similarity the phrases another and like the (previous) X, and for
contrast unlike the (previous) X.
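The grouping step just described can be approximated in a few lines; the two-level hierarchy, the history format and the phrasing below are simplified assumptions rather than Exprimo's actual algorithm.

```python
# Approximation of the grouping step: previously seen exhibits that share the
# target property are referred to by their common superclass, unless they are
# all the same sub-entity, in which case the sub-entity name is kept.

SUPERCLASS = {"tetradrachm": "coin", "drachma": "coin", "stater": "coin"}

def similarity_phrase(history, current_type, prop, value):
    """Return a similarity comparison for the current exhibit, or None."""
    matches = [etype for etype, props in history if props.get(prop) == value]
    if not matches:
        return None
    if len(set(matches)) == 1:                   # same sub-entity: keep it
        reference = "the " + matches[0] + "s"
    else:                                        # mixed: fall back to superclass
        reference = "the previous " + SUPERCLASS[matches[0]] + "s"
    return "Like %s, this %s is made of %s." % (reference, current_type, value)

mixed = [("tetradrachm", {"made-of": "silver"}), ("drachma", {"made-of": "silver"})]
same = [("tetradrachm", {"made-of": "silver"}), ("tetradrachm", {"made-of": "silver"})]

print(similarity_phrase(mixed, "stater", "made-of", "silver"))
# Like the previous coins, this stater is made of silver.
print(similarity_phrase(same, "stater", "made-of", "silver"))
# Like the tetradrachms, this stater is made of silver.
```

The two calls reproduce the text's stater examples: mixed previous exhibits are collapsed into the superclass "coins", while identical previous exhibits keep their own name.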
Chapter 4
The Pilot Experiment
4.1. Introduction
Ultimately, the pilot experiment was intended to examine not only the subjects'
performance on the texts, both within and between subjects, but also what kind of text
structure the subjects preferred and how much they thought they had learnt from a text
with or without comparison and aggregation. It was decided that the best way to answer
all these questions would be to give the subjects two thematically different text sets
produced by the M-PIRO system, one with the options of comparison and aggregation
enabled and one without; subjects were then tested in a factual recall test and asked to
decide which set was more natural and fluent to them. There were, however, a lot of
parameters to consider and test before proceeding to the main experiment.
Experiments conducted in earlier studies that involved the evaluation of natural
language generation systems aimed at testing different kinds of system properties, such
as accuracy (Jordan, Dorr & Benoit 1993), fluency/intelligibility (Minnis 1998) and
task performance (IDAS: Levine & Mellish, 1995; ILEX: Cox et al., 1999), and could
therefore provide different points of reference for the purpose of this evaluation
experiment. These experiments and approaches evaluated different aspects and
properties of an NLG system. Mellish & Dale (1998: 349-350) tried to distinguish
between the evaluation of systems and the evaluation of their underlying theories, and
to distinguish both of these from task evaluation; each of these aspects is considered by
looking at how evaluation has been carried out in the field so far. Although evaluation
has increased substantially over the last few decades (the 1980s and after), not much
work has been done on evaluation, either in the field of linguistics or in the
experimental design theory of natural language generation systems. According to
Mellish & Dale (1998) and Bangalore et al. (2000), the problem is the confusion caused
by the inability to distinguish properly between natural language understanding and
natural language generation; perhaps the most important of the reasons is that from a
practical perspective we are faced with a world where there is a great deal of textual
material whose value might be
leveraged by the successful achievement of even part of the goals of NLU
(Mellish & Dale, 1998: 352).
The only work on task evaluation was done by Levine & Mellish (1995) for the IDAS
system and by Cox et al. (1999) for the ILEX system. As already noted in section 2.1.2,
the evaluation of ILEX failed to support the hypothesis that the dynamic hypertext
version would improve the performance of the subjects. It was therefore very important
to address the issues and problems raised by Mellish & Dale (1998) by evaluating
M-PIRO carefully and to test the solutions before running the main experiment. For the
purpose of this study, the two variables to be tested were comparison and
aggregation14. The major problem in evaluating an NLG system is that of assessing the
output. Since there is no objective criterion for comparing the appropriateness of the
text, it was decided to assess the output with and without the two variables.
Thereupon, it was critical to choose appropriate text output for the experiment and to
expose all the subjects to the same conditions [unlike the Cox et al. (1999) experiment].
Furthermore, the knowledge that the human subjects would gain from the texts was
measured, since the human subjects are the most valuable resource. Finally, the last
problem to be solved before the main experiment was how to handle disagreement
among human judges, since humans will not always agree about subjective qualities
like good style, coherence and readability (Mellish & Dale, 1998: 363). For this
reason, it is preferable to avoid relying on explicit human judgements. After finding
answers to all these questions and choosing the best parameters, the only way to test
them was with a pilot experiment, which would show how adequate the data, texts and
design were.
14 Comparison is the main factor on which the user's history and the interaction with other texts are
built, and aggregation, apart from making a text more fluent and natural, improves the interaction within
the text, since the pieces of information are not like bricks and tiles placed without order.
4.2. Method

4.2.1. Designing and choosing the exhibit texts

The M-PIRO Authoring Tool15 was used to choose the exhibits, to design and run
the pilot experiment, and to preview the texts that were generated. The authoring tool is
very useful not only for introducing new entities, sub-entities, information and exhibits,
but also for inspecting the knowledge database of each exhibit and for choosing which
exhibit order would be optimal for the experiment.
The core of the experimental procedure was the two text variables that help us
evaluate the system: aggregation and comparison. The first decision was how these
variables were going to be used in the experiment and whether they would be tested
together or separately. There could be a text with both comparison and aggregation, a
text with comparison only, a text with aggregation only, or a text with neither. The
ILEX evaluation showed that comparison alone (comparing with what the user had
already visited) did not support the user-history hypothesis and failed to help users
perform better and learn more than users whose history had been turned off (Cox et al.,
1999). Moreover, the aggregation option combines facts within an exhibit's text and
could not by itself help the user remember more details. Therefore, it was decided to use
one text with comparison and aggregation and another without either. Consequently,
for the first text the option "Disable the user's history" was not selected, and in the
user's profile four facts per sentence (max) were set for aggregation; for the second text
the user's history was disabled and only one fact per sentence was selected.
Continuing with the design of the experiment, it was decided to use the user model
profile for adults. This user profile has as default values four facts per sentence (two
more than the child user profile) and two repetitions for assimilation (one more than the
expert user profile). Because of the time limitations of the experiments (both pilot and
main), only one profile could be kept and, therefore, the participants were only

15 For more details see section 2.2.2, The M-PIRO Authoring Tool.
adults. Moreover, none of the adult subjects should be an expert in, or have at some
point been acquainted with, the subjects of numismatics16, angiology17 or archeology.
The second core part of the experiment was the exhibit texts and the decision
whether the variables were going to be tested within or between subjects. Although in
the evaluation of ILEX some subjects used the dynamic version and others the static
one, here it was determined to give each subject two text sets, one with comparison and
aggregation and the other without them. However, these sets could not contain exactly
the same exhibits, as that would make it impossible to evaluate performance.
Thereupon, two completely different text sets were chosen, each containing six exhibits,
with all the exhibits in a set belonging to the same main entity. Importing exhibits from
different entities into one set (e.g. statues, portraits and jewels) was avoided, since that
would make the text much more difficult for the user, and M-PIRO could not produce
comparison pairs between exhibits that are not related at some level. The first text set
had only coin exhibits18, namely a drachma, a classical tetradrachm, a tetradrachm, an
archaic stater, a Hellenistic stater and a coin of the reign of Commodus. The second
text set had only vessel exhibits, namely a Hadra ware hydria, a black kantharos, a
cylix, a classical cylix painted by Eucharides, a lekythos painted by Amasis and a
lekythos.
Lekythos (with comparison and aggregation)
This exhibit is another lekythos. Like the black kantharos, it was created during
the classical period. It dates from between 470 and 460 B.C. It shows an athlete
preparing to throw his javelin. This lekythos was painted with the red figure
technique. In antiquity, javelin throwing was intimately bound up with the Greek
way of life. Before it became a feature of sporting life, the javelin was one of the
weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden
spear about the height of a tall man. This exhibit is a lekythos. The lekythos was
originally from Attica. Like this previous lekythos, it was originally from Attica.
Lekythos (without comparison and aggregation)
This exhibit is a lekythos. It was created during the classical period. It dates from
between 475 and 470 B.C. It depicts an athlete preparing to perform a jump. This
lekythos was painted with the red figure technique. The origin of the long jump lies
in the challenge presented by traversing the cliffs, ravines and rough terrain of the
Greek countryside, and, accordingly, by the challenge of waging war on this terrain.
It was a complicated sport in which the athletes used special weights, the halteres, to
16 Numismatics is the science whose field is the history and study of (ancient) coins and medals.
17 Angiology is the science whose field is the history and study of ancient vessels.
18 The whole texts for the Coins and the Vessels in both versions are in Appendix I (English and Greek texts).
increase their momentum and the distance of the jump. On this lekythos, the athlete
holds the weights in his hands and is about to jump away from the springboard. In
order to win, he needs not only great speed and strong legs but also precise
coordination between his hands and feet as they make contact with the springboard.
This is why the long jump was occasionally accompanied by music, which helped
the jumper pace his rhythm. This lekythos was originally from Attica.
Table 4.1.: Part of the lekythos text generated by M-PIRO with and without comparison and
aggregation
Moreover, we considered that the exhibits of each text set must belong to a different
main entity; if both text sets contained, for example, different exhibits that were all
vessels, the design would be biased, since the users would no longer be naïve when
they read the second text set. In addition, it was interesting to see how the subjects
would perform on easy and difficult texts, in particular the easy text set of coins and
the difficult text set of vessels. Hence, half of the participants read the coins text
sequence with comparison and aggregation and the vessels text sequence without them,
and the other participants read the coins text sequence without comparison and
aggregation and the vessels text sequence with them.
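This counterbalancing can be sketched as follows. The snippet is an illustrative reconstruction of the assignment logic, not the procedure actually used in the study; the subject labels, condition names and the seeded shuffle are all assumptions:

```python
import random

# Each subject reads two text sets: one genre with comparison and
# aggregation and the other genre without, in one of four orders, so
# that genre, text type and reading order are all counterbalanced.
CONDITIONS = [
    ("coins with comp/aggr", "vessels without comp/aggr"),
    ("vessels without comp/aggr", "coins with comp/aggr"),
    ("coins without comp/aggr", "vessels with comp/aggr"),
    ("vessels with comp/aggr", "coins without comp/aggr"),
]

def assign(subjects, seed=0):
    """Shuffle the subjects, then rotate them through the four orders."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: CONDITIONS[i % len(CONDITIONS)] for i, s in enumerate(shuffled)}

plan = assign(["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"])
for subject, (first, second) in sorted(plan.items()):
    print(subject, "->", first, "then", second)
```

With eight subjects and four order conditions, each condition is used by exactly two subjects, which is the balance the paragraph above describes.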
Finally, two instruments were devised for use in the evaluation. They consisted of a
recall test of factual knowledge about the coins and vessels in the exhibition and a
usability questionnaire. The test was administered to the subjects on paper. The factual
recall test, which was introduced to the subjects under the title "What did you learn
from the virtual exhibition?", was a multiple-choice test. Almost half of the questions
were about combined and contrasted facts between the exhibits, with varying difficulty.
Four examples (two from each question set) are shown below:
8. During which period were the two cylixes created? (archaic, classical, Hellenistic,
Roman)
12. Which color is the background of a red-painting technique decorated exhibit?
(red, black, white, clay, blue)
3. The tetradrachms are made of ______. (gold, silver, bronze, marble, there is no
info in the texts, different material for each one)
12. Whose picture is on the Hellenistic stater? (King Perseus, Alexander the Great,
Athena, Apollo)
Table 4.2.: Some questions from the factual recall test (all the questions are in Appendix II). The
correct answers are in bold type.
4.2.2. Subjects
The subjects that took part in the pilot experiment were 8 native speakers of
Greek and English aged 23 to 31, four for each language version of M-PIRO. Four of
them were male and the other four were female. At the time of the experiment they were
all MSc or PhD students at the University of Edinburgh; the Greek subjects had spent
1-5 years studying in Britain, which did not affect their understanding of Greek
scientific text. Although all the Greek participants had taken some elementary courses
in Ancient Greek history and archaeology in Gymnasium and Lyceum classes, they
were naïve in the sense that they had no previous experience with either the subject of
vessels and coins or the text output of a natural language system. Similarly, the
English participants were also naïve, as they had no previous experience with
archaeological texts about vessels and coins or with natural language system text output.
Additionally, only one of the English participants had taken a course in the Greek
language and was therefore familiar with the Greek words. Furthermore, none of
the subjects had any history of reading problems (such as dyslexia) or comprehension
problems. Finally, all the participants were naïve as far as the goal of the experiment
was concerned.
4.2.3. Procedure
The experiment took place in the Computer Micro Lab room of the Department of
Theoretical and Applied Linguistics of the University of Edinburgh and in my flat. The
subjects were usually alone in very quiet conditions, so that no one could disturb them
while they were reading the texts. The M-PIRO text output was printed on A4 pages.
Before the experiment started, the subjects were given a short introduction about
its nature. They were told that they were going to read two different texts, in particular
two different text sets of six museum exhibits each, generated by an NLG
system. Moreover, they were informed about what kind of information was in the texts;
nevertheless, nothing was explained to them about the text structure or the text
difficulty. Additionally, they were told that they should try to learn and remember the
descriptions and references related to the exhibits, since they were going to answer a
set of 13 multiple-choice questions for each text set without having any text in front of them. They
had fifteen minutes to read each exhibit text sequence and they could ask anything
before starting to read. When it was specified that their task would be to answer some
questions, all the participants wanted to know what kind of questions they were going to
respond to and whether they had to learn the texts by heart or memorize them. Upon these
questions they were given some examples and told that they should read the texts like any
other text or document. When the time ran out, or when they felt that they could answer the
questions, they were given further instructions. They were told that they had to choose
only one answer unless it was written in the question that there were two possible
right answers. In particular, they were asked not to answer any question about which they
had no clue or remembered nothing, since it was possible to choose the
right answer by chance and that would distort the statistical analysis. It was
pointed out that it was natural not to remember everything and that they should not feel
uncomfortable about questions they could not answer. The subjects were
encouraged to ask any questions before the beginning of the experiment and to make any
comments after the end of the experiment.
First, they read the first text set and answered the corresponding questions;
afterwards, they read the second text set and answered its questions. I chose
randomly which text they would read first; therefore, some participants first read the
text set with comparison and aggregation and the others the text set without comparison
and aggregation; likewise, some participants first read the coins texts and the others
the vessels texts. This procedure covered all possible combinations of text order,
since it was necessary to examine the ordering effect and the possible flaw it
introduced into the experimental design.19 Testing the order of the two text groups in
the pilot experiment would give some important information for the design of the main
experiment. Finally, the participants were asked to fill in a questionnaire for both text
sets; for this session, they had the opportunity to check the texts again and answer the
questions without time pressure. At the end they were interviewed: they were informed
about the purpose of the experiment and discussed their own critical comments and
ideas about the experiment.
19 More details and discussion of this problem appear in a later section of this dissertation; see 6.1.2,
Ordering effects: a possible flaw in experimental design.
4.3. Results and Discussion
The 8 double-session data results, one for each participant, were marked by the
author, saved in Word (Microsoft Office 2000) and then exported to Excel and SPSS
11.5 for further analysis, processing and the creation of tables and graphs.
                  Coins   Vessels   Difference
Group A (Eng)     12      11.5      -1.5
Std. Deviation    2.82    2.12      +0.70
Group B (Eng)     11      6         +5
Std. Deviation    1.41    0         +1.41
Group A (Gr)      9       11        +2
Std. Deviation    2.82    1.41      +1.41
Group B (Gr)      7       5.5       +1.5
Std. Deviation    2.82    2.12      +0.70

Table 4.3.: The results of the participants in both versions of the pilot experiment
There were 13 multiple-choice questions and the highest possible score was 15. The
results of the experiment hinted not only that the participants scored better on the
questions for the text with comparison and aggregation, but also that they preferred the
text with these options as more fluent and natural. However, there was an exception
among the participants: one participant scored much better on the text without comparison
and aggregation, and she claimed in the questionnaire that "this funny comparison
thing" between the vessels texts made her tired; she did not like the text, as it did not
help her remember details because of the repetitions. Nevertheless, this case was a very
rare exception. Furthermore, despite the grouping effect among the Greek participants, it
seems that comparison and aggregation had a greater effect on the vessels texts,
since the differences between the groups were 6 and 5.5. This fact supports the
hypothesis that comparison and aggregation help the user learn more and remember
details better, as users do not usually have many difficulties with easy texts.
Furthermore, the standard deviation figures did not allow us to make clearer comments:
it was expected that the standard deviation of the mean for the texts with comparison and
aggregation would be smaller than that for the texts without them; unfortunately, this
was not supported by the data of the pilot.
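As a rough cross-check of the figures above, per-condition means and standard deviations can be recomputed from the per-subject scores shown in Graph 4.1. This is an illustrative sketch, not the SPSS analysis actually used in the study, and it assumes that the first eight values extracted from the graph belong to the with-comparison-and-aggregation series and the last eight to the series without:

```python
from statistics import mean, stdev

# Per-subject scores read off Graph 4.1 (assumed series assignment).
with_comp_aggr = [5, 9, 13, 9, 10, 12, 13, 10]
without_comp_aggr = [4, 7, 10, 8, 6, 6, 11, 15]

mean_with, sd_with = mean(with_comp_aggr), stdev(with_comp_aggr)
mean_without, sd_without = mean(without_comp_aggr), stdev(without_comp_aggr)

print(f"with comp/aggr:    mean={mean_with:.2f}, sd={sd_with:.2f}")
print(f"without comp/aggr: mean={mean_without:.2f}, sd={sd_without:.2f}")
```

Note that these are pooled across both subject groups, whereas Table 4.3 reports per-group figures, so the two need not agree.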
[Graph: "Pilot Experiment Performance" — bar chart of each subject's score (subjects 1-8,
score axis 0-16), one bar per text type. Extracted bar values: 5, 9, 13, 9, 10, 12, 13, 10,
4, 7, 10, 8, 6, 6, 11, 15. Legend: "Text with comparison and aggregation" / "Texts without
comparison and aggregation".]

Graph 4.1.: The performance of the participants based on the option of comparison and aggregation
(the first four are the Greek participants and the other four the English)
[Graph: "Pilot Experiment Performance" — bar chart of each subject's score (subjects 1-8,
score axis 0-16), grouped by condition. Legend: "Coins with comp.-aggr.", "Vessels without
comp.-aggr.", "Coins without comp.-aggr.", "Vessels with comp.-aggr.". Bar values are not
recoverable from the extraction.]

Graph 4.2.: The performance of the participants based on the group factor.
Based on the results posted and described above, and despite the fact that they were
not statistically significant (F(1, 6) = 2.748, p = .148) for the text type factors (comparison
and aggregation), I decided to work with both text sets per participant in the main
experiment. The main effect for groups (p = .031 < .05) was the main cause of this
statistical insignificance and, therefore, it was expected that it would not appear in the
main experiment, especially since the within-subjects performance supported one of
my hypotheses. As illustrated in the graphs, there was obviously an effect of the text
type factors on the participants' performance on the questions.
Moreover, the participants characterized the coins texts as easy and the vessels texts as
difficult or very difficult. Finally, the text set with comparison and aggregation was
almost always preferred (with the one exception noted above).
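For illustration, a simple paired comparison of the per-subject scores visible in Graph 4.1 can be computed by hand. This sketch does not reproduce the mixed-design analysis reported above, and it assumes that the first eight values extracted from the graph are the with-comparison scores and the last eight the without-comparison scores:

```python
from math import sqrt
from statistics import mean, stdev

# Per-subject scores read off Graph 4.1 (assumed series assignment).
with_scores = [5, 9, 13, 9, 10, 12, 13, 10]
without_scores = [4, 7, 10, 8, 6, 6, 11, 15]

# Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)), with df = n - 1.
diffs = [w - wo for w, wo in zip(with_scores, without_scores)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"paired t({n - 1}) = {t_stat:.3f}")
```

With only eight paired observations the test is underpowered, which is consistent with the pilot's non-significant text-type result.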
The results of the pilot did not allow for many useful assumptions about what to expect
in the main experiment, as there did not seem to be consistency among the groups. For
instance, when looking at the scores of each group in both versions, it is quite surprising
that the English participants performed better than the Greeks. Although a
comparison between these two versions is unfair because of the subjects' different
background knowledge and the way it interfered in the pilot, the results of the pilot raised
some questions which may be answered in the main experiment. In addition, the
comments of the participants gave helpful guidelines for the main experiment. They
found the vessels text questions too difficult and they suggested adding some dummy
answers to the multiple-choice questions; incidentally, they thought that more questions
would not be a problem. Furthermore, it was suggested that each exhibit's text should be
split over two or three pages. Additionally, they considered that the reading time should
be around twenty minutes, because less would not be enough and more would become
boring. Finally, the participants who read the vessels texts first (with or without
comparison and aggregation) complained that the difficulty of the text exhausted them,
since it was tough and almost all the information was completely new to them, and it was
therefore harder to read the second text set with the coins; so they would have preferred
to read the coins text set first. This ordering effect may be a possible flaw in the main
experiment and is discussed in detail in section 6.1.2. However, it was a major problem
which had to be solved somehow.
                                          Coins text     Vessels text
How interesting did you find the text?    Interesting    Neutral
How difficult were the questions?         Easy           Difficult
Did you enjoy the texts?                  Yes            Yes
Which text is more fluent and natural?    7 subjects chose the text with comparison and
                                          aggregation; 1 subject chose the text without
                                          comparison

Table 4.6.: The questionnaire results of the pilot experiment
To sum up, the results of the pilot study made it fairly clear that a text with
comparison and aggregation supports subjects in reading to remember, learn and
perform better than a text without these factors. This outcome was considered
encouraging for the main experiment. It would be interesting to find out whether there
would be a difference in performance within and between subjects and, if there were a
difference, how big it would be depending on the difficulty and stiffness of the text
itself. The data we got from the pilot experiment were not enough to make any
assumptions; nevertheless, they supported some of our theories/hypotheses and gave us
ideas for new hypotheses, such as whether comparison and aggregation make the text
easier, whether the feeling of learning depends on the text type, and what leads
participants to choose a text as more fluent and natural.
Chapter 5
The Main Experiment
5.1. Introduction
The main experiment set out to evaluate two language versions (English and Greek)
of the M-PIRO NLG system by testing the text structure factors of comparison and
aggregation; it would support or fail our four hypotheses. As the pilot experiment
results revealed, not only the participants not only scored better in the texts with
comparison and aggregation, but also showed a bigger score difference in the difficult
text sets.
What is now anticipated is to observe a difference in performance depending
on the text type and not on any other factors such as the group, the genre and
the language. This would hopefully show that the participants' scores were related to the
text factors (comparison and aggregation) and that there would be no statistical significance
for the group, genre and language factors. Moreover, it was anticipated that the
participants would characterize as natural and fluent the text with the text struc