7/31/2019 karasimos_mpiro
1/118
Athanasios N. Karasimos, Evaluation of M-PIRO System
Declaration
I hereby declare that this MSc dissertation is of my own composition and that it contains
no material previously submitted for the award of any other degree. The work reported in
this MSc dissertation has been executed by myself, except where due acknowledgement
is made in the text.
(Athanasios N. Karasimos)
Acknowledgements
I wish to express my gratitude to all those people without whom this task would
have been much harder to accomplish.
Above all, thank you Amy and Colin, for being so patient, helpful, supportive and
co-operative, and for being there every time I needed you. You made many things clear
and easy with your useful and clever advice. It has been said that the beginning and the
end of a dissertation are its supervisors.
Many thanks to all those people who participated in both experiments; without their
participation the evaluation would have been impossible. I appreciate the time you
spent on my experiments.
And special thanks to my classmate Behzad, to whom I owe my dissertation's topic
and to whom I am grateful for all those useful conversations. Many thanks to Ellen Burk
for guiding me out of the statistics labyrinth.
Finally, on the Greek front, thanks to Aggeliki, Efi, Stavroula and Stathis for their
care and support throughout the whole year. Additionally, special thanks to Alexander
Melengoglou, who offered his valuable knowledge and comments. I could not be here
now without the strong support and love of my family. Thank you Anna and Sotiria for
your corrections.
To my mother,
my uncle
and especially to my sister,
Helen.
Abstract
Half of the problem in Natural Language Generation (NLG) is the evaluation of an
NLG system. In the last few decades research on evaluation has increased and taken
some serious steps in this direction. This study describes an evaluation of a
multilingual personalized information objects system (M-PIRO), which dynamically
generates descriptions for exhibits in a virtual museum exhibition. In the evaluation,
learning outcomes of subjects who read two sets of texts about coins and vessels were
compared to those of subjects who read the same text sets with a different text
structure. The aim was to show that the text type factors of comparison and
aggregation are essential for better performance. Several types of data were collected
by post-session tests of factual recall knowledge and a questionnaire about the evaluated
system. Results showed that performance measures did differ between subjects in the
two conditions (presence and absence of the text type factors); additionally, the data
analysis revealed that text difficulty and the subjects' impression of learning were also
statistically significant. These issues are all considered in order to determine whether
the goal of M-PIRO is achieved and to suggest some improvements to it. The study
concludes with an outline of future work.
Contents
Declaration.......................................................................................................................ii
Acknowledgements.........................................................................................................iii
Abstract............................................................................................................................v
Contents ..........................................................................................................................vi
Index of tables, graphs and pictures ..........................................................................viii
1. Introduction .................................................................................................................1
1.1. Natural Language Generation Systems ...............................................................1
1.2. Evaluating Natural Language Generation Systems.............................................2
1.3. Purpose and Outline of the study.........................................................................5
2. The M-PIRO NLG System............................................................................................6
2.1. The ILEX NLG System...........................................................................................6
2.1.1. The ILEX Dynamic Hypertext System ............................................................6
2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version....................7
2.2. The M-PIRO System...............................................................................................9
2.2.1. The M-PIRO Domain and Generation Architecture ........................................9
2.2.2. The M-PIRO Authoring Tool .........................................................................12
3. Aggregation and Comparison in the M-PIRO ..........................................................14
3.1. Aggregation........................................................................................................14
3.1.1. What is aggregation?....................................................................................14
3.1.2. The implementation of aggregation in the M-PIRO System..........................15
3.2. Comparison........................................................................................................17
3.2.1. What is comparison?....................................................................................17
3.2.2. The implementation of comparison in M-PIRO System................................19
4. The Pilot Experiment................................................................................................21
4.1. Introduction........................................................................................................21
4.2. Method ...............................................................................................................23
4.2.1. Designing and choosing the exhibit texts ....................................................23
4.2.2. Subjects ........................................................................................................26
4.2.3. Procedure......................................................................................................26
4.3. Results and Discussion.......................................................................................28
5. The Main Experiment...............................................................................................32
5.1. Introduction........................................................................................................32
5.2. Method ...............................................................................................................32
5.2.1. Designing and choosing the exhibit texts ....................................................32
5.2.2. Subjects ........................................................................................................35
5.2.3 Procedure.......................................................................................................36
5.3. Results ................................................................................................................37
6. General Discussion ....................................................................................................48
6.1. The results of both experiments .........................................................................48
6.1.1. Interpreting the results..................................................................................48
6.1.2. Ordering effect: a possible flaw in experimental design..............................51
6.2. Suggestions and improvements ..........................................................................53
6.3. Future work........................................................................................................56
6.4. Conclusion .........................................................................................................58
Bibliography ..................................................................................................................64
Appendix I: The M-PIRO generated texts for the Main Experiment..............................64
Coins Text Sequence [English] with comparison and aggregation..........................64
Coins Text Sequence [English] without comparison and aggregation.....................67
Vessels Text Sequence [English] with comparison and aggregation .......................70
Vessels Text Sequence [English] without comparison and aggregation ..................73
Coins Text Sequence [Greek] with comparison and aggregation ............................77
Coins Text Sequence [Greek] without comparison and aggregation .......................80
Vessels Text Sequence [Greek] with comparison and aggregation..........................83
Vessels Text Sequence [Greek] without comparison and aggregation.....................87
Appendix II: What did you learn from the virtual exhibition? ....................................... 91
The Questions for the Coins Text Sequence [English] .............................................91
The Questions for the Vessels Text Sequence [English] ...........................................93
The Questions for the Coins Text Sequence [Greek] ................................................95
The Questions for the Vessels Text Sequence [Greek]..............................................97
Questionnaire..........................................................................................................100
Appendix III: The Statistical guide ..............................................................................101
Index of tables, graphs and pictures

Table 2.1. The M-PIRO pipeline generation architecture 9
Table 2.2. Part of M-PIRO entity hierarchy organization of types and levels 10
Table 3.1. The relationships between two representations 17
Table 3.2. Short and long conjunctions for similarity and contrast 18
Table 3.3. The comparison ordering based on information importance for M-PIRO system 18
Table 4.1. Part of the lekythos text generated by M-PIRO with and without comparison and aggregation 23
Table 4.2. Some questions from the factual recall test (all the questions are in Appendix II). The correct answers are in bold type 24
Table 4.3. The results of the participants in both versions of the pilot experiment 26
Graph 4.4. The performance of the participants based on the option of comparison and aggregation 27
Graph 4.5. The performance of the participants based on the group factor 28
Table 4.6. The questionnaire results of the pilot experiment 29
Picture 5.1. A web page from the vessels sequence that contains the Spherical Corinthian Aryballos 31
Table 5.2. Two examples of the vessels texts with more complicated comparisons 33
Table 5.3. The results of the participants in both languages of the main experiment 36
Graph 5.4. The score performance per person depending on text type factors (Greek Version) 37
Graph 5.5. The score performance per person depending on text type factors (English Version) 38
Graph 5.6. The results per participant depending on the group factor [Greek version] 39
Graph 5.7. The results per participant depending on the group factor [English version] 39
Graph 5.8. The performance for all the participants depending on the genre factor [both versions] 40
Graph 5.9. Box plots for performance depending on genre and text type factors [English version] 41
Graph 5.10. Box plots for performance depending on genre and text type factors [Greek version] 41
Graph 5.11. The performance of Greek participants depending on text difficulty (coins vs. vessels) 42
Graph 5.12. The performance of English participants depending on text difficulty (coins vs. vessels) 42
Graph 5.13. The questionnaire results summary of the English participants for both groups 43
Graph 5.14. The questionnaire results summary of the Greek participants for both groups 43
Graph 6.1. Box plots for the performance of the participants depending on the language factor 49
Chapter 1
Introduction
1.1. Natural Language Generation Systems
Natural Language Generation (NLG) is the assembly of text word-by-word
using knowledge of morphology, syntax, semantics and text structure (O'Donnell et
al., 2001). As a branch of computational linguistics, cognitive science and artificial
intelligence, NLG is the process of constructing natural language outputs from non-
linguistic inputs (symbolic or numeric), in particular of mapping some
underlying representation of information to a meaningful, understandable and specific
presentation of that information in spoken and/or textual linguistic form (in human
languages). A complete NLG system has to take many decisions to produce an
appropriate output. The goal of NLG can be characterized as the inverse of that of
Natural Language Understanding (NLU): in NLU the concern is to map from
text to some underlying representation of its meaning (Mellish & Dale, 1998;
Jurafsky & Martin, 2000: 764-794; O'Donnell et al., 2001). The generation process in
an NLG system typically consists of the following five main stages:
Content determination, in which the system decides what information should be
included in the text and what should be omitted; this selection depends on a variety of
contextual factors and on the particular user to whom the text is directed.
Document structuring, in which it is decided how the text should be organized and
structured; that is, for the information already selected, the NLG system has to choose
an appropriate structure to convey it.
Lexical selection, in which the system chooses the particular words, word types and
phrases that are required to communicate the specified information; it may also be
possible to vary the words used for stylistic effect.
Sentence structure[1], which involves the processes of aggregation, in which the
system must apportion the selected content into phrase-, clause- and sentence-sized
chunks and often, in the interests of fluency, place several pieces of information into
one sentence, and referring expression generation, in which the system determines
which properties of an entity to use when referring to that entity.
And Surface realization, in which the system maps the underlying text structure
onto natural text consisting of grammatically correct sentences.
The M-PIRO NLG system, which will be discussed in Chapter 2 and evaluated, is a
dynamic hypertext[2] natural language generation system.
1.2. Evaluating Natural Language Generation Systems
Over the last 15 years, the level of interest and concern expressed by natural
language processing (NLP) researchers with regard to evaluation has increased
substantially. In early NLG work, the quality of a system's output was assessed
by the system authors themselves; they underestimated the value of evaluation
for the improvement of an NLG system. In contrast, it has nowadays become
widely accepted in the language processing community that NLP researchers should
take the evaluation of a system seriously and pay attention to its results, since it plays
an essential role in the development of NLG systems, techniques and strategies.
Mellish and Dale (1998: 349) claim that NLG is exactly half of the problem of
natural language processing work; the other half is the evaluation of a system.
According to them, the first serious work in the field of evaluation took place in 1990,
at a workshop held on the theme of the Evaluation of Natural Language Generation
Systems, and in the papers of Meteer and McDonald (1991) and Moore (1991). The main
problem they tried to address was that of distinguishing the evaluation of a system
from the evaluation of its underlying theories. In dealing with this and some other
essential problems, empirical work in the field increased noticeably, building on these
papers.

[1] For some researchers (Reiter & Dale, 2000, Building a Natural Language Generation System;
Melengoglou, 2002) lexical selection, aggregation and referring expression generation are part of
microplanning.
[2] Dynamic hypertext refers to an NLG system which creates its output dynamically at run-time,
when the user requests it; such an output text is generated on demand, not pre-written by a human author.
Evaluation can have many objectives and can consider several different dimensions
of an NLG system or theory. Sometimes evaluation objectives combine aspects of
a system and its underlying theory, making the task harder. Mellish and Dale (1998)
suggest that the evaluation of NLG techniques can be broken down into three main
categories. Evaluating properties of the Theory is the assessment of the
characteristics of some theory underlying an NLG system or a part of it; the
implementation of this theory, e.g. Rhetorical Structure Theory, helps us to characterize
the theory as appropriate or not for the system. Evaluating properties of the System is
the assessment of specific characteristics of some NLG system; it might be a
comparison of two NLG systems or their algorithms, of the performance of an NLG
system in two different versions, or of the output of the generator against the
characteristics of a corpus of target texts. Finally, Applications Potential is the
evaluation of the potential utility of an NLG system in some environment, for instance
whether its use provides a better solution than some other approach (Mellish & Dale,
1998: 353-354).
Previous approaches to the evaluation of NLG theories are very few. The main
problem is that a good NLG system is based on a theory; nevertheless, during its
construction several practical problems must be solved and, therefore, the solutions may
be unconnected to the underlying theory. There have been some evaluations of grammars
based on particular theories, like Rhetorical Structure Theory (Mann & Thompson, 1988).
Robin tested his revision-based theory with a natural corpus for the domain of baseball
summaries. As Mellish and Dale (1998: 355) report, this kind of evaluation is
dangerous, since most reported work on NLG evaluates its theory indirectly through
the systems that implement it.
In contrast to the evaluation of NLG theories, the question "how good is my NLG
system?" is much easier to answer. There are different kinds of system aspects that can
be evaluated. Accuracy evaluation means assessing the relationship between input
and output, i.e. whether the generated text conveys the desired meaning to the reader
(Jordan et al., 1993). Fluency and intelligibility evaluation concerns the quality of
generated text and includes notions such as syntactic correctness, stylistic appropriateness,
organization and coherence. Although it is unclear how these notions can be measured,
Minnis (1993) made some proposals for evaluation. There are quite a few evaluations in
this field; for example, Bangalore et al. (2000) evaluated the system FERGUS and found
that the understandability and quality of its output were correlated with each other[3].
Finally, a task evaluation involves observing how well a task is performed using the
NLG system. Usually, task evaluation is used for the evaluation of Machine Translation
(MT) systems, such as the IDAS assessment (Levine & Mellish, 1995); however, it has
also been used for other kinds of NLG systems, such as PEBA-II (Milosavljevic &
Oberlander, 1998), ILEX[4] (Cox et al., 1999) and AMVF (Carenini and Moore, 2000).
Finally, there are some issues and general problems which arise when evaluating an
NLG system. Firstly, the major problem in evaluating an NLG system is that of
assessing its output, and the question arises of what the output should be. A fluency
and intelligibility evaluation can deal with this issue, but it lacks objective criteria,
whereas the results of a task evaluation only indirectly reflect the properties of the
system. Secondly, it is necessary to evaluate measurable attributes of the system's
performance, and thirdly, these attributes must be compared with something else;
otherwise it is hard to say that something is good or bad, easy or difficult, acceptable or
not. Additionally, it is essential to obtain adequate test data for the evaluation: an
evaluation without sufficient data will unsurprisingly fail. Finally, human judges may
disagree, and it would be unwise simply to ignore their disagreement; the authors should
therefore guide[5] them to give specific, objective and clear opinions, not vague
criticisms and thoughts.
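One common way to handle such judgements, rather than discarding them, is to quantify the agreement between judges. The sketch below computes Cohen's kappa over invented fluency ratings from two hypothetical judges; kappa is my illustrative choice here, not a measure used by the works cited above.

```python
# A minimal sketch of measuring inter-judge agreement with Cohen's kappa.
# The ratings are invented for illustration.

from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges over the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of items on which the judges agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both judges rated independently at their own rates.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Fluency judgements ("good"/"bad") from two hypothetical judges:
judge1 = ["good", "good", "bad", "good", "bad", "good"]
judge2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohen_kappa(judge1, judge2), 2))  # → 0.67
```

A kappa near 0 means the judges agree no more than chance would predict, which is exactly the situation in which their judgements need the kind of guidance described above.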
[3] They evaluated two different versions of FERGUS (Flexible Empiricist/Rationalist Generation
Using Syntax) using evaluation metrics (accuracy) which are useful as relative quantitative assessments
of different models.
[4] For more details see section 2.1.2 of the second chapter.
[5] However, the subjects should not be guided towards what the authors desire, for instance to say
what the authors want to hear.
1.3. Purpose and Outline of the study
The purpose of this dissertation is to present and describe the evaluation of the M-
PIRO NLG system and to draw useful conclusions about further improvement of the
system and the future development of NLG systems.
The hypotheses of this evaluation are the following: firstly, we expect that a text
with comparison and aggregation will help the subjects perform better and learn more.
Secondly, performance will differ depending on the difficulty of a text. Thirdly, the
subjects will characterize a text with comparison and aggregation as more fluent and
natural than a text without these factors. Finally, they will feel more comfortable and
more certain that they learn more from a text with the above factors.
The remainder of this dissertation is organized as follows:
Chapter 2 provides an overview of the ILEX NLG system (2.1.1) and its evaluation
(2.1.2). It then presents the M-PIRO NLG system, describing its domain and generation
architecture (2.2.1) and its authoring tool (2.2.2).
Chapter 3 examines both comparison and aggregation in the M-PIRO system. It also
provides the literature background and a description of the implementation of these
factors in the current system.
Chapter 4 gives a description of the pilot experiment that preceded the main
evaluation experiment. After the presentation of the main purpose (4.1), the method is
illustrated (4.2) and the results are presented and analyzed (4.3).
Chapter 5 presents the design of the main experiment for the evaluation. As the
purpose and design were largely the same as those of the pilot (5.1-5.2), the emphasis
falls on the analysis of the results (5.3).
Chapter 6 closes the dissertation with a general discussion of the results of this
study (6.1.1) and of an ordering effect noticed in the experimental design (6.1.2).
Furthermore, it offers some suggestions and improvements (6.2) and future work (6.3)
for the M-PIRO system.
Chapter 2
The M-PIRO NLG System
2.1. The ILEX NLG System
2.1.1. The ILEX Dynamic Hypertext System
ILEX[6] (the Intelligent Labeling Explorer) is a dynamic hypertext generation system
developed at the University of Edinburgh (1997-2000) in collaboration with the
National Museum of Scotland. Its task was to generate labels (descriptions) for objects
in a virtual museum exhibition in several languages, using a single knowledge database
storing information in a language-neutral form. ILEX generated labels which were
personalized, in that they were tailored opportunistically depending on the user's
interests and the user's interaction history with the system.
The user model of ILEX provides label generation for the categories of child, adult
and expert. It models users in terms of their relation to information, such as the
interest, the importance and the level of assimilation of the information, and provides
values for each predicate type. Since the authors cannot predict the exact nature of the
user, ILEX allows users to control directly the generated text displayed for the
objects, and the freedom to browse the collection of objects in any order; based on the
authors' pre-assumptions about the values of their relation to information, the system
processes the users' requests and adapts the structure of its labels to the user.
ILEX's aim, therefore, is to produce exhibit descriptions that a real curator might
give, so that visitors feel as if they were in a real museum with a guide.
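The idea of scoring each predicate type for a user category can be sketched as follows. The categories, scores, thresholds and fact texts are invented for illustration; they do not reproduce ILEX's actual user model.

```python
# A toy illustration of user modelling in the spirit described above:
# each predicate type carries interest, importance and assimilation
# values for a user category. All numbers and names here are invented.

from dataclasses import dataclass, field

@dataclass
class PredicateValues:
    interest: float       # how interesting this fact type is to the user
    importance: float     # how important it is to convey
    assimilation: float   # how likely the user already knows it (0-1)

@dataclass
class UserModel:
    category: str                      # e.g. "child", "adult", "expert"
    predicates: dict = field(default_factory=dict)

    def select(self, facts):
        """Keep facts this user would find worthwhile and not yet know."""
        chosen = []
        for pred, text in facts:
            v = self.predicates.get(pred)
            if v and v.interest + v.importance > 1.0 and v.assimilation < 0.5:
                chosen.append(text)
        return chosen

expert = UserModel("expert", {
    # experts are assumed to know the material already:
    "material": PredicateValues(0.3, 0.4, 0.9),
    "painting-technique": PredicateValues(0.9, 0.8, 0.2),
})
facts = [("material", "it is made of clay"),
         ("painting-technique", "it was decorated with the red-figure technique")]
print(expert.select(facts))
# → ['it was decorated with the red-figure technique']
```

Changing only the per-category values changes which facts are selected, which is how a single knowledge base can yield different labels for a child, an adult and an expert.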
Opportunistic text tailoring is achieved in ILEX via the use of referring
expressions, comparison expressions, nominal anaphora and approaches derived from
Rhetorical Structure Theory. Built on Rhetorical Structure Theory[7], the aggregation is
organized into nucleus-satellite relations ("Like most Arts and Crafts style jewels, it has
an elaborate design") and multi-nuclei relations ("This jewel is a necklace and was
made by a British designer called Edward Spencer"). For comparison expressions it
uses the user's navigation log; it introduces an already known concept ("This necklace
is also in Arts and Crafts style"), makes simple comparisons with the previously visited
exhibit ("For instance the previous item uses oval-shaped stones (in other words it
features rounded stones). However this necklace does not feature rounded stones") and
steers clear of repeating information which has already been mentioned (Cox et al.,
1999; Melengoglou, 2002).

[6] For an extended discussion of this system, see O'Donnell et al. (2001); see also Milosavljevic and
Oberlander (1998), Cox et al. (1999), O'Donnell et al. (2000).
[7] For more details and an extensive discussion of the theory, see Mann and Thompson (1988).
Nevertheless, ILEX is not without problems and flaws (Isard et al., 2003). Much of
the information was captured by interviewing a curator and then hand-coding
taxonomic information and other assertions. The authors typed text strings literally
rather than representing them as knowledge-base objects stored in a language-neutral
form; it was therefore hard to present the information in any language other than the
original input language. Related to this, there are some problems with the linguistic
grammars (the English and Spanish ones work well, but for the other languages the
grammars would have to be rebuilt or reconstructed). Furthermore, the same-level
values of the adult, expert and child types do not essentially change the text structure.
Finally, the system is less modular than desirable.
2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version
The paper by Cox, O'Donnell and Oberlander (1999) describes an evaluation of ILEX,
the intelligent labelling explorer, in which the learning outcomes of subjects who
used the dynamic ILEX system were analysed and contrasted with those of subjects who
used a static version of the system. Their goal was to attempt to isolate learning effects
specifically due to dynamic hypertext generation (Cox et al., 1999). In previous work
(Levine & Mellish, 1995), the IDAS system, which used natural language generation
techniques in the automatic generation of hypertexts, was evaluated; the users' task was
to retrieve relevant information to answer specific questions. However, that evaluation
did not use any comparison group and built its results and discussion on the users'
page visits and navigation logs.
Since Cox, O'Donnell and Oberlander's aim was not to compare their dynamic
hypermedia with a traditional media system or to observe aspects of hypermedia such as
configurations of links, they used two different versions of ILEX: a traditionally
configured version with static pages and no user modeling, against the original
intelligent system with dynamically generated text containing referring expressions and
comparisons based on a user model (Cox et al., 1999). Both versions contained the same
six jewels. Three instruments were devised for use in the evaluation: 1. a recall test of
factual knowledge about jewels in the exhibition, 2. a curator task[8] and 3. a usability
questionnaire. Twenty subjects were allocated to the dynamic ILEX system and ten
subjects to the static version of the system. The results were quite interesting. Both
groups scored similarly on the two tests in performance terms; however, processing of
the log data revealed that the dynamic system users "made more visits to the case of
jewels than static subjects, and made proportionately more navigation-related button
clicks than their static ILEX counterparts" (Cox et al., 1999: 7). Based on these results
they claim that the users did not benefit from the dynamic version's properties, since they
did not score better; additionally, the learning pattern and performance varied and
learning was achieved in different ways depending on the log data.
Nevertheless, there were some flaws in this experiment, since the same number of
users was not used for both versions. Moreover, the subjects were not exposed to
the same experimental conditions, since they used different versions; a main effect for
groups[9] could therefore have occurred, and it is not mentioned whether one existed or
not. Furthermore, the required time was too long for only six exhibits and might
potentially have affected the performance results, since the time conditions were not
realistic (in a normal case no one would spend ninety minutes on a twelve-paragraph
text about six exhibits). They claimed that any learning effect is almost entirely
dependent on the navigation route, just as Levine & Mellish (1993) did; I maintain that
learning effects go beyond any log and navigation route, since there are many factors
that learning depends on (Mellish & Dale, 1998; Jurafsky & Martin, 2000).
Other experiments have been carried out to test different properties of an NLG
system and to evaluate the system accordingly. Properties of the underlying theory,
properties of the system and the applications potential have all been evaluated.
Nonetheless, the lack of a specific theory of evaluation and the disagreement over
subjective qualities like fluency, readability, good style and appropriateness constitute
an essential drawback of the evaluation of an NLG system. Some researchers have
failed to evaluate systems properly because they used immeasurable phenomena or
subjective criteria. According to Mellish & Dale's (1998) approach to evaluation
theory, there are some important issues and problems that must be solved in an
evaluation design; these will be discussed in the pilot experiment section.

[8] This task consisted of presentations of jewels not seen in the exhibition; the subjects were
asked to examine each exhibit and classify it by answering multiple-choice questions.
[9] For more about statistics terminology see Appendix III.
2.2. The M-PIRO System
The M-PIRO10 NLG system (Multilingual Personalized Information Objects) is a
more recent project of the Information Societies Programme of the European Union that
also generates descriptions of virtual museum objects (exhibits). It is a descendant of the
ILEX system and has focused on developing language-engineering technology for
personalized information objects, specifically on multilingual information delivery
(Isard et al., 2003). It incorporates high-quality speech output, an authoring tool,
improved user modeling and a modular core generation engine (Androutsopoulos et al.,
2002).
2.2.1. The M-PIRO Domain and Generation Architecture
domain model (domain database + domain semantics)
      |
CONTENT SELECTION     selection of facts to convey to the user
      |  information to be conveyed
TEXT PLANNING         ordering of facts + document structure (RST)
      |  text structure
MICRO-PLANNING        lexicalisation + referring expression generation
      |  document specifications
SURFACE REALISATION   text generation
      |
exhibit description

Table 2.1: The M-PIRO pipeline generation architecture
10 The project's consortium consisted of the University of Edinburgh (Scotland), ITC-irst (Italy),
NCSR Demokritos (Greece), the University of Athens (Greece) and the Foundation of the Hellenic
World (Greece).
Table 2.1 outlines the stages of the M-PIRO NLG architecture (Androutsopoulos
et al., 2002) and the process of generating a text.
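The pipeline of Table 2.1 can be illustrated with a toy sketch. This is not M-PIRO's actual code: the fact representation, the function names and the interest/priority fields are all invented here for illustration.

```python
# Toy sketch of the four pipeline stages of Table 2.1.
# Fact representation and "interest"/"priority" fields are hypothetical.

def content_selection(database, min_interest):
    """CONTENT SELECTION: pick the facts worth conveying to the user."""
    return [f for f in database if f["interest"] >= min_interest]

def text_planning(facts):
    """TEXT PLANNING: order the selected facts (here by a simple priority key)."""
    return sorted(facts, key=lambda f: f["priority"])

def micro_planning(facts):
    """MICRO-PLANNING: lexicalise each fact into an abstract clause spec."""
    return [("this exhibit", f["verb"], f["value"]) for f in facts]

def surface_realisation(clauses):
    """SURFACE REALISATION: render the clause specs as text."""
    return " ".join(f"{s.capitalize()} {v} {o}." for s, v, o in clauses)

database = [
    {"interest": 3, "priority": 1, "verb": "is", "value": "a stamnos"},
    {"interest": 2, "priority": 2, "verb": "is made of", "value": "clay"},
    {"interest": 1, "priority": 3, "verb": "was found in", "value": "Attica"},
]

text = surface_realisation(micro_planning(text_planning(content_selection(database, 2))))
print(text)  # This exhibit is a stamnos. This exhibit is made of clay.
```

Each stage consumes the previous stage's output, mirroring the intermediate representations named in the table (information to be conveyed, text structure, document specifications, exhibit description).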
Domain authoring

The knowledge base contains all the necessary information about entities and their
relationships; entities can be abstract or concrete. The major task is defining the
hierarchy of entity types, i.e. entities and sub-entities (e.g. exhibit and material; statue
and marble), which can contain further levels: for example, below the entity statue there
are complex statue, kouros, imperial portrait, etc. Similarly, the relations between
entities are expressed using fields: the domain author can define fields for each entity
type, which are then inherited by all entity types below it in the hierarchy (Isard et al.,
2003). For example, creation-period applies to statue and to all its descendants, such as
complex statue, kouros and imperial portrait, and it must be filled by an entity of the
historical-period group (archaic, classical, hellenistic or roman).
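The field-inheritance behaviour just described can be sketched as follows; the class and field names mirror the text's examples, but the modelling itself is a hypothetical illustration, not M-PIRO's implementation.

```python
# Sketch of an entity-type hierarchy in which a field defined on a type
# (e.g. creation-period on statue) is inherited by every type below it.

class EntityType:
    def __init__(self, name, parent=None, fields=None):
        self.name = name
        self.parent = parent
        self.own_fields = dict(fields or {})

    def fields(self):
        """Collect fields from all ancestors, letting a type extend its parent."""
        result = self.parent.fields() if self.parent else {}
        result.update(self.own_fields)
        return result

exhibit = EntityType("exhibit")
# creation-period must be filled by an entity of the historical-period group
statue = EntityType("statue", parent=exhibit,
                    fields={"creation-period": "historical-period"})
kouros = EntityType("kouros", parent=statue)               # a level below statue
imperial_portrait = EntityType("imperial portrait", parent=statue)

print(kouros.fields())             # {'creation-period': 'historical-period'}
print(imperial_portrait.fields())  # {'creation-period': 'historical-period'}
```

Both sub-types see creation-period without declaring it, which is the inheritance rule the domain author relies on when building the hierarchy.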
Basic Type    Entity Type    Entity Level         Entity
copy
a-location
exhibit       statue         complex-statue
                             kouros
                             portrait             exhibit7
                             imperial portrait    exhibit17
              coin
              jewel
              relief

Table 2.2: Part of the M-PIRO entity hierarchy organization of types and levels
Microplanning expressions

Each field has associated information that specifies how the relationship it
represents can be expressed as a sentence. As mentioned in the introduction,
microplanning involves lexical selection, aggregation and referring expression
generation; the specifications for these are known as microplanning expressions. Either
clause plans are created, in which a verb is selected from a pull-down menu of verbs,
or templates are used, which build the expression out of strings and references
to the two entities whose relationship is expressed by the field. Furthermore,
microplanning draws on the language-dependent lexicon, which contains entries for
nouns and verbs for lexical selection. Articles and prepositions are domain-independent
and are therefore stored as a separate resource.
Three languages

M-PIRO can generate text in three languages: English, Greek and Italian. The
grammar for English is based mainly on the ILEX grammar; the grammar for Greek,
however, was constructed from scratch with the English one as its starting point, and the
Italian grammar was based on the ILEX Spanish one. As already mentioned, the lexicon
now has a larger domain-independent part and a full inflection system, especially for
Greek and Italian, owing to their rich morphological systems. Moreover, M-PIRO
supports high-quality speech output (Festival11 for English and Italian and
DEMOSTHeNES12 for Greek).
User Modeling

One major advantage of the system is that the user modelling information is stored
separately from the domain and linguistic resources, in a personalization server. Each
time users interact with the system, they give their personal details; thus, the system
always has access to the user's personal profile and to information on the history of the
user's interactions with the collection. User types for adults, experts and children were
also defined by the authors. Each entity-type field has values for interest, importance
and repetitions for each user type. The repetitions value is used to calculate the
assimilation score and rate per user (a low repetition rate for experts, a high one for
children). The microplanning expressions and the lexicon entries depend on the user
type and change accordingly. There is a comparison module based on a list of important
information, and there is an aggregation module that uses techniques such as simple conjunction,
11 Developed by the University of Edinburgh. For more details see the official web pages of Festival
(http://www.cstr.ed.ac.uk/projects/festival/) and M-PIRO (http://www.ltg.ed.ac.uk/mpiro,
http://mpiro.ime.gr).
12 Developed by the University of Athens. For more details see the official web pages of
DEMOSTHeNES (http://laertis.di.uoa.gr/speech/synthesis/demosthenes/) and M-PIRO.
relative clauses and syntactic embedding to join single facts together; these two factors
will be discussed in more detail in chapter 3.
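As a rough illustration of the per-user-type parameters just described, consider the sketch below. The adult values (four facts per sentence, two repetitions) and the child's two facts / expert's single repetition follow the figures given in chapter 4; the remaining values are invented, and the real profiles live in M-PIRO's personalization server, not in a dictionary like this.

```python
# Hypothetical per-user-type parameters along the lines described above.

USER_TYPES = {
    #          facts packed per sentence | repetitions before assimilation
    "child":  {"max_facts_per_sentence": 2, "repetitions": 3},  # repetitions invented
    "adult":  {"max_facts_per_sentence": 4, "repetitions": 2},
    "expert": {"max_facts_per_sentence": 4, "repetitions": 1},  # max facts invented
}

def is_assimilated(times_seen, user_type):
    """A fact counts as assimilated once restated often enough for this user type."""
    return times_seen >= USER_TYPES[user_type]["repetitions"]

print(is_assimilated(2, "expert"))  # True  (experts need few repetitions)
print(is_assimilated(2, "child"))   # False (children need more)
```

Because the profile is consulted on every interaction, the same fact can be repeated for a child but suppressed for an expert who has already assimilated it.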
Modularity

M-PIRO's system architecture is significantly more modular than that of its
predecessor ILEX. In particular, the linguistic resources, the database and the
user-modeling subsystems are now separate from the subsystems that perform natural
language generation and speech synthesis, giving the system a satisfactory level of
modularity; of course, it is still not possible to move to a new application domain
without specifying both what will be talked about and what vocabulary will be used
when talking about it.
2.2.2. The M-PIRO Authoring Tool

According to the authors (Androutsopoulos et al., 2002; Spiliotopoulos et al., 2002;
Isard et al., 2003), compared with domain authoring this is a simpler process of
defining specific instances of entities and filling the entities' fields with the
corresponding information. The authoring tool previews the output of the generation
system on the basis of the current state of the database. It is tailored to allow domain
experts to manipulate not only the contents of the database, but also its structure and
the domain-dependent linguistic resources that control how the information in the
database is rendered in natural language. The difficult part of the authoring is done by
an expert, e.g. a museum curator, who designs the hierarchy and adds the basic types,
entity types, microplans, etc. The easier part of the authoring, the one already referred
to, is done by non-experts, who add particular entities. So an expert will add the entity
type amphora, but a non-expert can then add many particular amphora entities, e.g.
exhibit1 and exhibit18, and use the microplans which the expert has built to add
information about each particular entity.
Domain and exhibit authoring can be used together to check information (given by
a designer or curator) and create a text. For example, the domain author can define the
basic types material and exhibit, a relation made-of, a specific material [marble], an
entity type statue that is a subtype of exhibit and an entity type imperial portrait that is a
subtype of statue [a portrait of Octavian Augustus]; the resulting text will be This
exhibit is an imperial portrait. It is made of marble. The tool is designed to be used by
people, such as museum curators, who have no experience in language technology [one
of the basic usability rules of Nielsen (2000)]. Finally, authors can create the types of
visitors (adult, expert, child) and attach fields and microplanning expressions to their
properties.
Chapter 3
Aggregation and Comparison in M-PIRO
3.1. Aggregation
3.1.1. What is aggregation?13
In a Natural Language Generation system, aggregation is part of the microplanning
stage, where texts are composed of verb-based, clause-sized propositions. These
propositions are likely to contain repetitions and redundancies, which make the text
seem boring, non-fluent and unnatural to human readers (Melengoglou, 2002).
Aggregation is used to overcome this problem. The question "what is aggregation?"
has been answered in various ways in the literature, and many researchers have tried to
define it. Summing up these attempts, aggregation can be considered the generation of
more fluent, more readable and less boring text by eliminating redundancy and
semantically combining text components at any level, in order to achieve a more concise
and coherent text. The effect of aggregation can be seen very clearly in the following
example from Reape and Mellish (1999), in which two propositions with obvious
common features are combined to produce a single sentence:
1. The car is here
2. The car is blue
[1+2]. The blue car is here
The goal of aggregation is to produce a text which is concise, coherent, cohesive and
fluent; indeed, the goals that aggregation tries to achieve cover most of the territory of
the default communicative goals of NLG systems in general. Linguistic theories
separate aggregation into several types, such as conceptual aggregation (reducing
the number of propositions in the text while increasing the complexity of conceptual
role values), discourse aggregation (any operation that improves a discourse
structure), semantic aggregation (the combination of two semantic entities into
13 For extended discussion, see Reape & Mellish (1999).
one, or semantic grouping and logical transformations), syntactic aggregation (grouping
subjects or predicates, the most common form), lexical aggregation (lexicalization, or
the choice of particular lexemes to realize lexical predicates and structures) and
referential aggregation (referring expressions).
The input to aggregation is usually a tree containing the ordering of facts and the
dependencies that relate them. In this tree, aggregation detects shared components
among neighboring text-tree nodes and combines them in an attempt to remove
redundancies and repetitions from the resulting text. The most common type of
aggregation is simple conjunction or disjunction, which joins facts together by means
of coordination or contrast, using connectives like and, but and or. Another common
type is syntactic embedding, which subordinates a clause to a proposition surrounded
by commas (Alexander, the king of Macedonia, conquered the Persians). Generally,
according to Melengoglou et al. (2003), the choice of particular aggregation
operations seems to be highly domain-specific.
3.1.2. The implementation of aggregation in the M-PIRO System

Aggregation in the M-PIRO project receives as input a sequence of semantic
representations of facts, which are either classifying facts or facts presenting attributes.
The facts are connected by rhetorical relations, which have to be made explicit to the
aggregation module, as otherwise the intended meaning could well be lost in the
generated text. Conversely, aggregation will never affect the meaning of a text when the
relations are specific. The aggregation module can combine a classifying fact with a
fact presenting an attribute into a complex sentence, and two facts presenting attributes
into a compound sentence.

Melengoglou (2002) built the M-PIRO aggregation module rules, which are capable
of producing a more concise and readable text. They are grouped into three major
combinations. Aggregating identity-attribute pairs includes type-comma (This exhibit
is a drachma, created during the classical period), type-qualifier (This portrait depicts
Alexander the Great, a king from Macedonia) and type-semicolon (In the background
you can see rows of columns, temples and other buildings; in the foreground there is a
ship and a statue of a male form). Aggregating attribute pairs
includes simple conjunction (This coin originates from Patras and it is now exhibited in
the Numismatic Museum of Athens) and shared subject-predicate (This stamnos was
decorated by the painter of Dinos with the red figure technique and is made of clay).
Finally, aggregating nucleus-satellite pairs includes syntactic embedding (This is
another relief, a tomb stele).
There is a hierarchy among the rules, since the system must select the proper
aggregation rule to apply and reject the others. Some rules therefore have higher
priority, as their resulting text is less redundant and more readable and they help to
clarify the meaning. According to these priorities, syntactic embedding is the most
important rule in the set, while simple conjunction is the least significant one.
Additionally, there is no clear preference between type-comma and shared
subject-predicate. After the appropriate rule has been chosen, its parameters must be
specified, such as the maximum number of facts the system should convey in a sentence
and the verification of sentence quality.
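The priority-based selection described above might be sketched like this. The rule names come from the text, but the relative order of type-comma and shared subject-predicate is arbitrary here (the text reports no clear preference between them), and the test of which rules are applicable to a given pair of facts is not modelled.

```python
# Sketch of priority-based aggregation rule selection: apply the
# highest-priority applicable rule and reject the others.

RULE_PRIORITY = [
    "syntactic-embedding",        # most important
    "type-comma",                 # order relative to the next rule is arbitrary
    "shared-subject-predicate",
    "simple-conjunction",         # least significant
]

def choose_rule(applicable):
    """Return the highest-priority rule among those that apply, or None."""
    for rule in RULE_PRIORITY:
        if rule in applicable:
            return rule
    return None  # no rule applies: leave the facts unaggregated

print(choose_rule({"simple-conjunction", "type-comma"}))  # type-comma
print(choose_rule({"simple-conjunction"}))                # simple-conjunction
```

A real module would also resolve conflicts between rules chosen for adjacent fact pairs, as discussed below.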
Applying the rules in sequence, there are two main steps in the aggregation
algorithm. The first concerns the important user-modelling parameter of maximum
facts per sentence, which determines the number of facts that Exprimo should convey to
a particular type of user in each sentence; short sentences may be suitable for small
children, but for adults long sentences may be better suited, as the use of short sentences
becomes irritating and boring to them (Melengoglou et al., 2003). Moreover, conflicts
in the application of two adjacent aggregation rules must be eliminated; the system
should therefore correctly adapt the right aggregation rule to the new linguistic
structure and give up the other. This choice is necessary for sentence quality, since
there are two kinds of restrictions: user-modelling restrictions and text-quality
restrictions. For all these restrictions, Melengoglou's module offered suggestions and
proposals that solved the problems and conflicts. The following sentences illustrate the
effect of different values of the max facts per sentence parameter on four propositions
generated by the M-PIRO system.
Max facts = 1:
This exhibit is a stamnos. It was decorated by the painter of Dinos. It was
painted with the red figure technique. It is made of clay.
Max facts = 2:
This exhibit is a stamnos, decorated by the painter of Dinos. It was painted with
the red figure technique and is made of clay.

Max facts = 3:
This exhibit is a stamnos, decorated by the painter of Dinos with the red figure
technique. It is made of clay.

Max facts = 4:
This exhibit is a stamnos; it was decorated by the painter of Dinos with the red
figure technique and is made of clay.

Table 3.1: A set of propositions generated by M-PIRO with different values of the max facts per sentence
parameter
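The effect of the max facts per sentence parameter can be mimicked with a crude sketch that simply packs facts into groups of at most that size and joins them with "and"; real M-PIRO aggregation applies the rule set described in section 3.1.2 (type-comma, embedding, etc.), not plain conjunction.

```python
# Crude mimic of the "max facts per sentence" parameter: pack facts into
# sentences of at most max_facts clauses. Joining is plain conjunction,
# a simplification of M-PIRO's actual aggregation rules.

FACTS = [
    "this exhibit is a stamnos",
    "it was decorated by the painter of Dinos",
    "it was painted with the red figure technique",
    "it is made of clay",
]

def realise(facts, max_facts):
    sentences = []
    for i in range(0, len(facts), max_facts):
        joined = " and ".join(facts[i:i + max_facts])
        sentences.append(joined[0].upper() + joined[1:] + ".")
    return " ".join(sentences)

print(realise(FACTS, 1))  # four short sentences, one fact each
print(realise(FACTS, 4))  # one long sentence carrying all four facts
```

Varying max_facts from 1 to 4 reproduces the spread from choppy to dense seen in Table 3.1: the information is constant, only the packing changes.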
3.2. Comparison
3.2.1. What is comparison?
Comparison is like the ancient Roman god Ianus, who has two faces: comparison
consists of two related but quite different aspects, similarity and contrast. Similarity is
prototypically signaled by connectives like also and too, and contrast by connectives
like whereas and while. Unfortunately, the literature contains few in-depth studies of
comparison in general. The Rhetorical Structure Theory of Mann and Thompson (1998)
included only a very elementary discussion of comparison and of relations applicable to
an NLG system. The lack of articles and of extended discussion of comparison in
linguistic theories made the task of implementing it in an NLG system even harder and
gave rise to a few controversial suggestions and solutions. Comparison is used as a
means of enhancing the experience of users by both facilitating learning and broadening
their knowledge, introducing them to new and relevant items in the domain.
Similarity deals with two propositions which contain some common components.
1. Michael is German and teaches linguistics.
2. Maria is Italian and teaches linguistics.
[1+2=>]. Michael is German and teaches linguistics. Maria is
Italian; she also teaches linguistics.
On the other hand, contrast deals with two propositions which contain contrary
components or two different aspects of a relation or entity.
3. Michael is German. He teaches linguistics.
4. Angelina is English. She teaches history.
[3+4=>] Michael is German and teaches linguistics. Angelina
is English; but she teaches history (or: She is also a teacher;
however, she teaches history).
Comparisons are seldom thought of as something in and of themselves; rather, they
are considered to be part of a larger explanation. According to Milosavljevic (1999),
they are in fact a central part of descriptions. She claims that it is widely accepted that
when describing a new concept to a hearer, the hearer's understanding of the new
concept can be encouraged by drawing analogies with understood concepts or solutions
to problems (1999: 28). In addition, Tversky's theory (Keane et al., 2001) models the
similarity of two entities, A and B, to one another as a weighted function of the
intersection of the features of A and B, less the sum of weighted functions of the
distinctive features of each entity; in particular, a new entity is understood on the basis
of its similarity properties to known entities. The relationships between two
representations may be (a) commonalities (a property pair of the entities matches),
(b) alignable differences (a property pair of the entities has different values) and
(c) non-alignable differences (the value of a property pair is absent on one side)
[see table 3.2].
                           stamnos                      drachma
Commonalities              Creation period: classical   Creation period: classical
Alignable differences      Material: clay               Material: silver
Non-alignable differences  Painted by: Dinos            -----

Table 3.2: The relationships between two representations
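The three relationship kinds of Table 3.2 are easy to compute once exhibits are represented as sets of property-value pairs; the dictionary representation below is an assumption of this sketch, not the system's internal format.

```python
# Classify property pairs of two exhibits into the three relationship kinds
# of Table 3.2: commonalities, alignable differences, non-alignable differences.

def compare(a, b):
    commonalities, alignable, non_alignable = {}, {}, {}
    for prop in set(a) | set(b):
        if prop in a and prop in b:
            if a[prop] == b[prop]:
                commonalities[prop] = a[prop]          # matching property pair
            else:
                alignable[prop] = (a[prop], b[prop])   # same property, different values
        else:
            non_alignable[prop] = a.get(prop, b.get(prop))  # absent on one side
    return commonalities, alignable, non_alignable

stamnos = {"creation-period": "classical", "material": "clay", "painted-by": "Dinos"}
drachma = {"creation-period": "classical", "material": "silver"}

common, align, non_align = compare(stamnos, drachma)
print(common)     # {'creation-period': 'classical'}
print(align)      # {'material': ('clay', 'silver')}
print(non_align)  # {'painted-by': 'Dinos'}
```

Commonalities feed similarity statements, alignable differences feed contrast statements, and non-alignable differences are typically not compared at all.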
Discourse analysis and pragmatics have dealt with the problem of comparison, but
only superficially. Various conjunctions have been proposed for similarity, such as
both, similarly, another way in which these two ... are similar, in the same way, these
... are similar in that and likewise, and for contrast, such as different in many ways,
is different, whereas, another difference, but, also differ in, however and while (more
details in table 3.3). Nevertheless, the conditions for combining information, the
preference for some conjunctions over others, the appropriateness of a given
conjunction and the resulting changes to text structure have not been examined deeply and
seriously, so the comparison modules of recent NLG systems remain quite simplistic
and sometimes exhibit weaknesses due to monotonous expressions and a lack of
variety.
SIMILARITY
  Short conjunctions: Similarly, | Likewise, | ...the same... | ...the same as... |
    ...also... | ..., too. | both
  Long conjunctions:  In the same way, | X is similar to Y in that (they)... |
    X and Y are similar in that (they)... | Like X, Y [verb]... | In like manner, |
    One way in which X is similar to Y is (that)... |
    Another way in which X is similar to Y is (that)...

CONTRAST
  Short conjunctions: However, | In contrast, | By contrast, | ..., but | ..., yet |
    On the other hand, | nevertheless,
  Long conjunctions:  even though + [sentence] | although + [sentence] |
    whereas + [sentence] | unlike + [sentence] | while + [sentence]

Table 3.3: Short and long conjunctions for similarity and contrast
3.2.2. The implementation of comparison in the M-PIRO System

An essential aim in generating comparisons was to avoid making individual,
full-clause references to previously seen exhibits, which tend to be boring, distracting
and irritating, making the educational goal of comparison disputable and controversial.
The M-PIRO comparison module therefore attempts to overcome this problem by using
the class hierarchy to group previously examined exhibits into broader categories and
by making either group references or short individual references when the system
knows that the name of the exhibit is sufficient to make a unique reference
(Androutsopoulos et al., 2002) [see table 3.4].
Comparison ordering (by importance)

  ENGLISH: sculpted-by, potter-is, painted-by, original location,
           creation-period, painting-technique used, made-of
  GREEK:   potter-is, original location, creation-period,
           painting-technique used, made-of

Table 3.4: The comparison ordering based on information importance for the M-PIRO system
The system can make two kinds of comparison: similarity (like the stamnos, this
lekythos was created during the classical period) and contrast (unlike the previous
coins, which were made of silver, this stater is made of gold). When the next exhibit is
requested, the system stores the information about the exhibit that can be compared
(see the previous table); in particular, it locates the target predicates. As a first step
towards forming a comparison, Exprimo completes the domain class hierarchy tree for
the previously examined exhibits that belong to the same exhibit subclass as the current
exhibit (we assume that the user visited only exhibits from the subclass vessel). The
system includes all the potential comparators collected for each past exhibit. The next
step is to remove, firstly, subsets (to avoid making full-clause references to previous
exhibits) and, secondly, similar subclasses that are not directly related to the exhibit.
Finally, the system removes the weakest relatives, by checking for identical
comparators higher in the hierarchy, and distant relatives, by checking the direct
relatives of the exhibit in the current focus. For example, suppose the system checks the
comparator made-of of an archaic stater (silver); the previous exhibit had the form
classical tetradrachm made-of silver and the exhibit before that was a drachma made-of
silver. The superclass is now coins [made-of silver], and the system therefore removes
the other entities (tetradrachm and drachma), uses the superclass entry (coins) and
generates the comparison like the previous coins, this stater is made of silver. If,
however, both previous exhibits were of the same sub-entity, such as tetradrachm, and
both made of silver, it would be meaningless to keep the superclass entry; the generated
comparison would then be like the tetradrachms, this stater is made of silver. Finally,
the system uses for similarity the phrases another and like the (previous) X, and for
contrast unlike the (previous) X.
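The grouping step just described can be approximated in a few lines; the two-level hierarchy, the history format and the phrasing below are simplified assumptions rather than Exprimo's actual algorithm.

```python
# Approximation of the grouping step: previously seen exhibits that share the
# target property are referred to by their common superclass, unless they are
# all the same sub-entity, in which case the sub-entity name is kept.

SUPERCLASS = {"tetradrachm": "coin", "drachma": "coin", "stater": "coin"}

def similarity_phrase(history, current_type, prop, value):
    """Return a similarity comparison for the current exhibit, or None."""
    matches = [etype for etype, props in history if props.get(prop) == value]
    if not matches:
        return None
    if len(set(matches)) == 1:                   # same sub-entity: keep it
        reference = "the " + matches[0] + "s"
    else:                                        # mixed: fall back to superclass
        reference = "the previous " + SUPERCLASS[matches[0]] + "s"
    return "Like %s, this %s is made of %s." % (reference, current_type, value)

mixed = [("tetradrachm", {"made-of": "silver"}), ("drachma", {"made-of": "silver"})]
same = [("tetradrachm", {"made-of": "silver"}), ("tetradrachm", {"made-of": "silver"})]

print(similarity_phrase(mixed, "stater", "made-of", "silver"))
# Like the previous coins, this stater is made of silver.
print(similarity_phrase(same, "stater", "made-of", "silver"))
# Like the tetradrachms, this stater is made of silver.
```

The two calls reproduce the text's stater examples: mixed previous exhibits are collapsed into the superclass "coins", while identical previous exhibits keep their own name.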
Chapter 4
The Pilot Experiment
4.1. Introduction
Ultimately, the pilot experiment was intended to examine not only the subjects'
performance on the texts, both within and between subjects, but also what kind of text
structure the subjects preferred and how much they thought they had learnt from a text
with or without comparison and aggregation. It was decided that the best way to answer
all these questions would be to give the subjects two thematically different text sets
produced by the M-PIRO system, one with the options of comparison and aggregation
enabled and one without; subjects were then tested in a factual recall test and asked to
decide which set was more natural and fluent to them. There were, however, a lot of
parameters to consider and test before proceeding to the main experiment.
Experiments conducted in earlier studies that involved the evaluation of natural
language generation systems aimed at testing different kinds of system properties, such
as accuracy (Jordan, Dorr & Benoit 1993), fluency/intelligibility (Minnis 1998) and
task performance (IDAS: Levine & Mellish, 1995; ILEX: Cox et al., 1999), and could
therefore provide different points of reference for the purpose of this evaluation
experiment. These experiments and approaches evaluated different aspects and
properties of an NLG system. Mellish & Dale (1998: 349-350) tried to distinguish
between the evaluation of systems and the evaluation of their underlying theories, and
to distinguish both of these from task evaluation; each of these aspects is considered by
looking at how evaluation has been carried out in the field so far. Although evaluation
has increased substantially over the last few decades (the 1980s and after), not much
work has been done on evaluation, either in the field of linguistics or in the
experimental design theory of natural language generation systems. According to
Mellish & Dale (1998) and Bangalore et al. (2000), the problem is the confusion caused
by the inability to distinguish properly between natural language understanding and
natural language generation; perhaps the most important of the reasons is that from a
practical perspective we are faced with a world where there is a great deal of textual
material whose value might be
leveraged by the successful achievement of even part of the goals of NLU
(Mellish & Dale, 1998: 352).
The only work on task evaluation was done by Levine & Mellish (1995) for the IDAS
system and by Cox et al. (1999) for the ILEX system. As already noted in section 2.1.2,
the evaluation of ILEX failed to support the hypothesis that the dynamic hypertext
version would improve the performance of the subjects. It was therefore very important
to address the issues and problems raised by Mellish & Dale (1998) by evaluating
M-PIRO carefully and to test the solutions before running the main experiment. For the
purpose of this study, the two variables to be tested were comparison and
aggregation14. The major problem in evaluating an NLG system is that of assessing the
output. Since there is no objective criterion for comparing the appropriateness of the
text, it was decided to assess the output with and without the two variables.
Thereupon, it was critical to choose appropriate text output for the experiment and to
expose all the subjects to the same conditions [unlike the Cox et al. (1999) experiment].
Furthermore, the knowledge that the human subjects would gain from the texts was
measured, since the human subjects are the most valuable resource. Finally, the last
problem to be solved before the main experiment was how to handle disagreement
among human judges, since humans will not always agree about subjective qualities
like good style, coherence and readability (Mellish & Dale, 1998: 363). For this
reason, it is preferable to avoid relying on explicit human judgements. After finding
answers to all these questions and choosing the best parameters, the only way to test
them was with a pilot experiment, which would show how adequate the data, texts and
design were.
14 Comparison is the main factor on which the user's history and the interaction with other texts are
built, and aggregation, apart from making a text more fluent and natural, improves the interaction within
the text, since the pieces of information are not like bricks and tiles placed without order.
4.2. Method

4.2.1. Designing and choosing the exhibit texts

The M-PIRO Authoring Tool15 was used to choose the exhibits, to design and run
the pilot experiment, and to preview the texts that were generated. The authoring tool is
very useful not only for introducing new entities, sub-entities, information and exhibits,
but also for inspecting the knowledge database of each exhibit and for choosing which
exhibit order would be optimal for the experiment.
The core of the experimental procedure was the two text variables that help us
evaluate the system: aggregation and comparison. The first decision was how these
variables were going to be used in the experiment and whether they would be tested
together or separately. There could be a text with both comparison and aggregation, a
text with comparison only, a text with aggregation only, or a text with neither. The
ILEX evaluation showed that comparison alone (comparing with what the user had
already visited) did not support the user-history hypothesis and failed to help users
perform better and learn more than users whose history had been turned off (Cox et al.,
1999). Moreover, the aggregation option combines facts within an exhibit's text and
could not by itself help the user remember more details. Therefore, it was decided to use
one text with comparison and aggregation and another without either. Consequently,
for the first text the option "Disable the user's history" was not selected, and in the
user's profile four facts per sentence (max) were set for aggregation; for the second text
the user's history was disabled and only one fact per sentence was selected.
Continuing with the design of the experiment, it was decided to use the user model
profile for adults. This user profile has as default values four facts per sentence (two
more than the child user profile) and two repetitions for assimilation (one more than the
expert user profile). Because of the time limitations of the experiments (both pilot and
main), only one profile could be kept and, therefore, the participants were only

15 For more details see section 2.2.2, The M-PIRO Authoring Tool.
adults. Moreover, none of the adult subjects should be an expert in, or have at some
point been acquainted with, the subjects of numismatics16, angiology17 or archeology.
The second core part of the experiment was the exhibit texts and the decision
whether the variables were going to be tested within or between subjects. Although in
the evaluation of ILEX some subjects used the dynamic version and others the static
one, here it was determined to give each subject two text sets, one with comparison and
aggregation and the other without them. However, these sets could not contain exactly
the same exhibits, as that would make it impossible to evaluate performance.
Thereupon, two completely different text sets were chosen, each containing six exhibits,
with all the exhibits in a set belonging to the same main entity. Importing exhibits from
different entities into one set (e.g. statues, portraits and jewels) was avoided, since that
would make the text much more difficult for the user, and M-PIRO could not produce
comparison pairs between exhibits that are not related at some level. The first text set
had only coin exhibits18, namely a drachma, a classical tetradrachm, a tetradrachm, an
archaic stater, a Hellenistic stater and a coin of the reign of Commodus. The second
text set had only vessel exhibits, namely a Hadra ware hydria, a black kantharos, a
cylix, a classical cylix painted by Eucharides, a lekythos painted by Amasis and a
lekythos.
Lekythos (with comparison and aggregation)
This exhibit is another lekythos. Like the black kantharos, it was created during
the classical period. It dates from between 470 and 460 B.C. It shows an athlete
preparing to throw his javelin. This lekythos was painted with the red figure
technique. In antiquity, javelin throwing was intimately bound up with the Greek
way of life. Before it became a feature of sporting life, the javelin was one of the
weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden
spear about the height of a tall man. This exhibit is a lekythos. The lekythos was
originally from Attica. Like this previous lekythos, it was originally from Attica.
Lekythos (without comparison and aggregation)
This exhibit is a lekythos. It was created during the classical period. It dates from
between 475 and 470 B.C. It depicts an athlete preparing to perform a jump. This
lekythos was painted with the red figure technique. The origin of the long jump lies
in the challenge presented by traversing the cliffs, ravines and rough terrain of the
Greek countryside, and, accordingly, by the challenge of waging war on this terrain.
It was a complicated sport in which the athletes used special weights, the halteres, to
16 Numismatics is the science whose field is the history and study of (ancient) coins and medals.
17 Angiology is the science whose field is the history and study of ancient vessels.
18 The whole texts for the Coins and the Vessels in both versions are in Appendix I (English and Greek texts).
increase their momentum and the distance of the jump. On this lekythos, the athlete
holds the weights in his hands and is about to jump away from the springboard. In
order to win, he needs not only great speed and strong legs but also precise
coordination between his hands and feet as they make contact with the springboard.
This is why the long jump was occasionally accompanied by music, which helped
the jumper pace his rhythm. This lekythos was originally from Attica.
Table 4.1.: Part of the lekythos text generated by M-PIRO with and without comparison and
aggregation
Moreover, we considered that the exhibits of each text set must belong to a different
main entity; if both text sets contained, for example, different exhibits that were all
vessels, the design would be biased, since the users would no longer be naïve when
they read the second text set. In addition, it was interesting to see how the subjects
would perform on easy and difficult texts, in particular the easy text set of coins and
the difficult text set of vessels. Hence, half of the participants read the coins text
sequence with comparison and aggregation and the vessels text sequence without them,
and the other participants read the coins text sequence without comparison and
aggregation and the vessels text sequence with them.
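This counterbalancing can be sketched as follows. The snippet is an illustrative reconstruction of the assignment logic, not the procedure actually used in the study; the subject labels, condition names and the seeded shuffle are all assumptions:

```python
import random

# Each subject reads two text sets: one genre with comparison and
# aggregation and the other genre without, in one of four orders, so
# that genre, text type and reading order are all counterbalanced.
CONDITIONS = [
    ("coins with comp/aggr", "vessels without comp/aggr"),
    ("vessels without comp/aggr", "coins with comp/aggr"),
    ("coins without comp/aggr", "vessels with comp/aggr"),
    ("vessels with comp/aggr", "coins without comp/aggr"),
]

def assign(subjects, seed=0):
    """Shuffle the subjects, then rotate them through the four orders."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: CONDITIONS[i % len(CONDITIONS)] for i, s in enumerate(shuffled)}

plan = assign(["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"])
for subject, (first, second) in sorted(plan.items()):
    print(subject, "->", first, "then", second)
```

With eight subjects and four order conditions, each condition is used by exactly two subjects, which is the balance the paragraph above describes.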
Finally, two instruments were devised for use in the evaluation. They consisted of a
recall test of factual knowledge about the coins and vessels in the exhibition and a
usability questionnaire. The test was administered to the subjects on paper. The factual
recall test, which was introduced to the subjects under the title "What did you learn
from the virtual exhibition?", was a multiple-choice test. Almost half of the questions
were about combined and contrasted facts between the exhibits, with varying difficulty.
Four examples (two from each question set) are shown below:
8. During which period were the two cylixes created? (archaic, classical, Hellenistic,
Roman)
12. Which color is the background of a red-painting technique decorated exhibit?
(red, black, white, clay, blue)
3. The tetradrachms are made of ______. (gold, silver, bronze, marble, there is no
info in the texts, different material for each one)
12. Whose picture is on the Hellenistic stater? (King Perseus, Alexander the Great,
Athena, Apollo)
Table 4.2.: Some questions from the factual recall test (all the questions are in Appendix II). The
correct answers are in bold type.
4.2.2. Subjects
The subjects that took part in the pilot experiment were 8 native speakers of
Greek and English aged 23 to 31, four for each language version of M-PIRO. Four of
them were male and the other four were female. At the time of the experiment they were
all MSc or PhD students at the University of Edinburgh; the Greek subjects had spent
1-5 years studying in Britain, which did not affect their understanding of Greek
scientific text. Although all the Greek participants had taken some elementary courses
in Ancient Greek history and archaeology in Gymnasium and Lyceum classes, they
were naïve in the sense that they had no previous experience with either the subject of
vessels and coins or the text output of a natural language system. Similarly, the
English participants were also naïve, as they had no previous experience with
archaeological texts about vessels and coins or with natural language system text output.
Additionally, only one of the English participants had taken a course in the Greek
language and was therefore familiar with the Greek words. Furthermore, none of
the subjects had any history of reading problems (such as dyslexia) or comprehension
problems. Finally, all the participants were naïve as far as the goal of the experiment
was concerned.
4.2.3. Procedure
The experiment took place in the Computer Micro Lab room of the Department of
Theoretical and Applied Linguistics of the University of Edinburgh and in my flat. The
subjects were usually alone in very quiet conditions, so that no one could disturb them
while they were reading the texts. The M-PIRO text output was printed on A4 pages.
Before the experiment started, the subjects were given a short introduction about
its nature. They were told that they were going to read two different texts, in particular
two different text sets of six museum exhibits each, generated by an NLG
system. Moreover, they were informed about what kind of information was in the texts;
nevertheless, nothing was explained to them about the text structure or the text
difficulty. Additionally, they were told that they should try to learn and remember the
descriptions and references related to the exhibits, since they were going to answer a
set of 13 multiple-choice questions for each text set without having any text in front of them. They
had fifteen minutes to read each exhibit text sequence and they could ask anything
before starting to read. When it was specified that their task would be to answer some
questions, all the participants wanted to know what kind of questions they were going to
respond to and whether they had to learn the texts by heart or memorize them. Upon these
questions they were given some examples and told that they should read the texts like any
other text or document. When the time ran out, or when they felt that they could answer the
questions, they were given further instructions. They were told that they had to choose
only one answer unless it was written in the question that there were two possible
right answers. In particular, they were asked not to answer any question about which they
had no clue or remembered nothing, since it was possible to choose the
right answer by chance and that would distort the statistical analysis. It was
pointed out that it was natural not to remember everything and that they should not feel
uncomfortable about questions they could not answer. The subjects were
encouraged to ask any questions before the beginning of the experiment and to make any
comments after the end of the experiment.
First, they read the first text set and answered the corresponding questions;
afterwards, they read the second text set and answered its questions. I chose
randomly which text they would read first; therefore, some participants first read the
text set with comparison and aggregation and the others the text set without comparison
and aggregation; likewise, some participants first read the coins texts and the others
the vessels texts. This procedure covered all possible combinations of text order,
since it was necessary to examine the ordering effect and the possible flaw it
introduced into the experimental design.19 Testing the order of the two text groups in
the pilot experiment would give some important information for the design of the main
experiment. Finally, the participants were asked to fill in a questionnaire for both text
sets; for this session, they had the opportunity to check the texts again and answer the
questions without time pressure. At the end they were interviewed: they were informed
about the purpose of the experiment and discussed their own critical comments and
ideas about the experiment.
19 More details and discussion of this problem appear in a later section of this dissertation; see 6.1.2,
Ordering effects: a possible flaw in experimental design.
4.3. Results and Discussion
The 8 double-session data results, one for each participant, were marked by the
author, saved in Word (Microsoft Office 2000) and then exported to Excel and SPSS
11.5 for further analysis, processing and the creation of tables and graphs.
                  Coins   Vessels   Difference
Group A (Eng)     12      11.5      -1.5
Std. Deviation    2.82    2.12      +0.70
Group B (Eng)     11      6         +5
Std. Deviation    1.41    0         +1.41
Group A (Gr)      9       11        +2
Std. Deviation    2.82    1.41      +1.41
Group B (Gr)      7       5.5       +1.5
Std. Deviation    2.82    2.12      +0.70

Table 4.3.: The results of the participants in both versions of the pilot experiment
There were 13 multiple-choice questions and the highest possible score was 15. The
results of the experiment hinted not only that the participants scored better on the
questions for the text with comparison and aggregation, but also that they preferred the
text with these options as more fluent and natural. However, there was an exception
among the participants: one participant scored much better on the text without comparison
and aggregation, and she claimed in the questionnaire that "this funny comparison
thing" between the vessels texts made her tired; she did not like the text, as it did not
help her remember details because of the repetitions. Nevertheless, this case was a very
rare exception. Furthermore, despite the grouping effect among the Greek participants, it
seems that comparison and aggregation had a greater effect on the vessels texts,
since the differences between the groups were 6 and 5.5. This fact supports the
hypothesis that comparison and aggregation help the user learn more and remember
details better, as users do not usually have many difficulties with easy texts.
Furthermore, the standard deviation figures did not allow us to make clearer comments:
it was expected that the standard deviation of the mean for the texts with comparison and
aggregation would be smaller than that for the texts without them; unfortunately, this
was not supported by the data of the pilot.
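As a rough cross-check of the figures above, per-condition means and standard deviations can be recomputed from the per-subject scores shown in Graph 4.1. This is an illustrative sketch, not the SPSS analysis actually used in the study, and it assumes that the first eight values extracted from the graph belong to the with-comparison-and-aggregation series and the last eight to the series without:

```python
from statistics import mean, stdev

# Per-subject scores read off Graph 4.1 (assumed series assignment).
with_comp_aggr = [5, 9, 13, 9, 10, 12, 13, 10]
without_comp_aggr = [4, 7, 10, 8, 6, 6, 11, 15]

mean_with, sd_with = mean(with_comp_aggr), stdev(with_comp_aggr)
mean_without, sd_without = mean(without_comp_aggr), stdev(without_comp_aggr)

print(f"with comp/aggr:    mean={mean_with:.2f}, sd={sd_with:.2f}")
print(f"without comp/aggr: mean={mean_without:.2f}, sd={sd_without:.2f}")
```

Note that these are pooled across both subject groups, whereas Table 4.3 reports per-group figures, so the two need not agree.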
[Graph: "Pilot Experiment Performance" — bar chart of each subject's score (subjects 1-8,
score axis 0-16), one bar per text type. Extracted bar values: 5, 9, 13, 9, 10, 12, 13, 10,
4, 7, 10, 8, 6, 6, 11, 15. Legend: "Text with comparison and aggregation" / "Texts without
comparison and aggregation".]

Graph 4.1.: The performance of the participants based on the option of comparison and aggregation
(the first four are the Greek participants and the other four the English)
[Graph: "Pilot Experiment Performance" — bar chart of each subject's score (subjects 1-8,
score axis 0-16), grouped by condition. Legend: "Coins with comp.-aggr.", "Vessels without
comp.-aggr.", "Coins without comp.-aggr.", "Vessels with comp.-aggr.". Bar values are not
recoverable from the extraction.]

Graph 4.2.: The performance of the participants based on the group factor.
Based on the results posted and described above, and despite the fact that they were
not statistically significant (F(1, 6) = 2.748, p = .148) for the text type factors (comparison
and aggregation), I decided to work with both text sets per participant in the main
experiment. The main effect for groups (p = .031 < .05) was the main cause of this
statistical insignificance and, therefore, it was expected that it would not appear in the
main experiment, especially since the within-subjects performance supported one of
my hypotheses. As illustrated in the graphs, there was obviously an effect of the text
type factors on the participants' performance on the questions.
Moreover, the participants characterized the coins texts as easy and the vessels texts as
difficult or very difficult. Finally, the text set with comparison and aggregation was
almost always preferred (with the one exception noted above).
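For illustration, a simple paired comparison of the per-subject scores visible in Graph 4.1 can be computed by hand. This sketch does not reproduce the mixed-design analysis reported above, and it assumes that the first eight values extracted from the graph are the with-comparison scores and the last eight the without-comparison scores:

```python
from math import sqrt
from statistics import mean, stdev

# Per-subject scores read off Graph 4.1 (assumed series assignment).
with_scores = [5, 9, 13, 9, 10, 12, 13, 10]
without_scores = [4, 7, 10, 8, 6, 6, 11, 15]

# Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)), with df = n - 1.
diffs = [w - wo for w, wo in zip(with_scores, without_scores)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"paired t({n - 1}) = {t_stat:.3f}")
```

With only eight paired observations the test is underpowered, which is consistent with the pilot's non-significant text-type result.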
The results of the pilot did not allow for many useful assumptions about what to expect
in the main experiment, as there did not seem to be consistency among the groups. For
instance, when looking at the scores of each group in both versions, it is quite surprising
that the English participants performed better than the Greeks. Although a
comparison between these two versions is unfair because of the subjects' different
background knowledge and the way it interfered in the pilot, the results of the pilot raised
some questions which may be answered in the main experiment. In addition, the
comments of the participants gave helpful guidelines for the main experiment. They
found the vessels text questions too difficult and they suggested adding some dummy
answers to the multiple-choice questions; incidentally, they thought that more questions
would not be a problem. Furthermore, it was suggested that each exhibit's text should be
split over two or three pages. Additionally, they considered that the reading time should
be around twenty minutes, because less would not be enough and more would become
boring. Finally, the participants who read the vessels texts first (with or without
comparison and aggregation) complained that the difficulty of the text exhausted them,
since it was tough and almost all the information was completely new to them, and it was
therefore harder to read the second text set with the coins; so they would have preferred
to read the coins text set first. This ordering effect may be a possible flaw in the main
experiment and is discussed in detail in section 6.1.2. However, it was a major problem
which had to be solved somehow.
                                          Coins text     Vessels text
How interesting did you find the text?    Interesting    Neutral
How difficult were the questions?         Easy           Difficult
Did you enjoy the texts?                  Yes            Yes
Which text is more fluent and natural?    7 subjects chose the text with comparison and
                                          aggregation; 1 subject chose the text without
                                          comparison

Table 4.6.: The questionnaire results of the pilot experiment
To sum up, the results of the pilot study made it fairly clear that a text with
comparison and aggregation supports subjects in reading to remember, learn and
perform better than a text without these factors. This outcome was considered
encouraging for the main experiment. It would be interesting to find out whether there
would be a difference in performance within and between subjects and, if there were a
difference, how big it would be depending on the difficulty and stiffness of the text
itself. The data we got from the pilot experiment were not enough to make any
assumptions; nevertheless, they supported some of our theories/hypotheses and gave us
ideas for new hypotheses, such as whether comparison and aggregation make the text
easier, whether the feeling of learning depends on the text type, and what leads
participants to choose a text as more fluent and natural.
Chapter 5
The Main Experiment
5.1. Introduction
The main experiment set out to evaluate two language versions (English and Greek)
of the M-PIRO NLG system by testing the text structure factors of comparison and
aggregation; it would support or fail our four hypotheses. As the pilot experiment
results revealed, not only the participants not only scored better in the texts with
comparison and aggregation, but also showed a bigger score difference in the difficult
text sets.
What is now anticipated is to observe a difference in performance depending
on the text type and not on any other factors such as the group, the genre and
the language. This would hopefully show that the participants' scores were related to the
text factors (comparison and aggregation) and that there would be no statistical significance
for the group, genre and language factors. Moreover, it was anticipated that the
participants would characterize as natural and fluent the text with the text struc