D5.10 Evaluation Document Version 1.0
DEEPTHOUGHT
Hybrid Deep and Shallow Methods
for Knowledge-Intensive
Information Extraction
Deliverable 5.10
Evaluation Report on Efficiency,
Accuracy and Usability of the New
Approach
The DeepThought Consortium
PROJECT REF. NO. IST-2001-37836
Project acronym DeepThought
Project full title DeepThought - Hybrid Deep and Shallow Methods for
Knowledge-Intensive Information Extraction
Security (distribution level) Rest.
Contractual date of delivery August 2004
Actual date of delivery September 2004
Deliverable number D5.10
Deliverable name Evaluation report on efficiency, accuracy and usability of
the new approach
Type Report
Status & version Final version
Number of pages
WP contributing to the deliverable WP5
WP / Task responsible Xtramind
Other contributors USAAR, Celi, NTNU
Author(s) Dorothee Beermann, Berthold Crysmann, Petter Haugereid, Lars Hellan, Dario Gonella, Daniela Kurz, Giampaolo Mazzini, Oliver Plaehn, Melanie Siegel
EC Project Officer Evangelia Markidou
Keywords Evaluation, Matrix, core linguistic machinery, applications,
accuracy, coverage
Abstract In this deliverable, an evaluation of the Italian and Norwegian grammars is presented. Moreover, the core linguistic machinery and both applications have been evaluated.
Table of Contents
1 Grammar Evaluation
1.1 Basic Considerations on the Evaluation
1.2 Evaluation Results: Norwegian Grammar
1.2.1 Phenomena Covered by the Grammar
1.2.2 Phenomena not Covered by the Grammar
1.2.3 Size of the Lexicon and the Grammar
1.2.4 Correlation between the Matrix and the Norwegian Grammar
1.3 Evaluation Results: Italian Grammar
1.3.1 Phenomena Covered by the Grammar
1.3.2 Phenomena not Covered by the Grammar
1.3.3 Size of the Lexicon and the Grammar
1.3.4 Correlation between the Matrix Grammar and the Italian Grammar
1.4 Reusability
1.5 Standardized Output Format
2 Evaluation of the Heart of Gold
2.1.1 PASCAL Data
2.1.2 Mobile Phone Corpus
2.1.3 Newspaper Corpus
2.1.4 All corpora
2.1.5 Conclusions
2.2 HoG and German
3 Evaluation of the Business Intelligence Application
4 Evaluation of the Auto-Response Application
4.1 Email corpus
4.2 Template Examples
4.3 Evaluation types
4.4 Accuracy of the German Prototype
4.4.1 First Experiment
4.4.2 Second Experiment
4.5 Accuracy of the Prototype for English
4.5.1 First Experiment
4.5.2 Second Experiment
4.5.3 Conclusion
5 Travel Information Application
6 Concertation Plan
1 Grammar Evaluation
1.1 Basic Considerations on the Evaluation
The Matrix grammar, as a language-independent core grammar, has been designed to facilitate
the rapid initial development of grammars for natural languages. Since the Matrix is a
collection of generalizations across grammars, it cannot be evaluated by itself. Evaluation of the
Matrix is therefore carried out as a case study of its benefit for the development of an actual
grammar. The current version of the Matrix has been used for the development of a
Norwegian and an Italian grammar since the beginning of the project.
The evaluation at hand is based on the following questions:
• How many person months were spent for the grammar development in DeepThought?
• What phenomena are covered by the Norwegian and the Italian grammar?
• What phenomena are not covered, but should be covered? Are these phenomena
language-specific, or do they occur in other languages as well?
• What is the size of the lexicon, and what is the number of types and rules?
• Which types of the Matrix are used, and which are not?
• What would be the effort of defining the used types without the Matrix?
1.2 Evaluation Results: Norwegian Grammar
The Norwegian grammar makes use of the Matrix v0.6. For the development of the grammar
in DeepThought from M0 until M22, 20 person months were spent in all. Ten person months
were funded by DeepThought, four person months were paid from other sources, and six were
covered by permanent positions.
1.2.1 Phenomena Covered by the Grammar
The phenomena that can be covered by the Norwegian grammar are the following:
• Lexicon
The lexicon has 84,240 lexical entries: 56,966 nouns, 13,744 adjectives, 13,185 verbs (some
of them valence variants of the same lexeme), 284 prepositions/adverbs and 61 other
entries. Verbs, nouns and adjectives are entered into the lexicon as lexemes. Inflectional
rules turn them into words. All other lexical entries are words (or phrases).
• Word types
The grammar has 11 main word classes: verbs, nouns, pronouns, adjectives, adverbs,
sentential adverbs, determiners, complementizers, the infinitive marker, prepositions and
conjunctions.
• Valency
The grammar has 102 different argument frames for verbs. These argument frames are
cross-classifications of the factors presentational/non-presentational, arity, category of the
complement (noun, preposition, adverb, adjective, subordinate clause or infinitival clause)
and thematic roles. The grammar also treats auxiliaries and modal verbs.
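As an illustration of such a cross-classification, the sketch below simply enumerates the product of a few factor inventories. The factor values are invented for the example and do not reproduce the grammar's actual inventory, which yields 102 frames.

```python
from itertools import product

# Hypothetical factor values, for illustration only; the real grammar's
# inventory is richer and yields 102 frames.
presentational = ["pres", "non-pres"]
arity = [1, 2, 3]
complement = ["noun", "prep", "adv", "adj", "subord-clause", "inf-clause"]

# Each argument frame is one combination of factor values.
frames = [f"{p}/{a}/{c}" for p, a, c in product(presentational, arity, complement)]
print(len(frames))  # 2 * 3 * 6 combinations in this toy inventory
```

The point of the cross-classification is exactly this multiplicative effect: a small number of factor distinctions suffices to generate a large, systematic frame inventory.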
• Lexical rules
The grammar has lexical rules in order to handle the position of light pronouns with regard
to sentence adverbials, and also for particle promotion, passive and inversion.
• Syntax
The grammar parses declarative clauses and yes-no questions. It treats topicalization,
wh-questions and relative clauses. It parses sentences with auxiliary verbs. Both the
morphological and the periphrastic passive are covered. The grammar treats agreement in
NPs.
1.2.2 Phenomena not Covered by the Grammar
The phenomena that are not covered by the grammar are listed below. We distinguish
between bugs in the grammar (rules that currently do not work), which will be
fixed, and rules that are missing and have to be added.
• Rules that do not work:
The imperative inflectional rule works, but imperative sentences do not parse.
Extraction from subordinate clauses does not work at the moment. There is a
vague/unclear distinction between adverbs and prepositions in the big lexicon.
• Rules that are missing:
The grammar has almost no treatment of numbers, compounds,
abbreviations or interjections. It does not treat procedural markers like "vel", "altså",
"da" or "nok". The comparative and superlative forms of adjectives are missing.
Coordination and comparative constructions are not treated in the grammar. The
grammar does not have more than one valence frame for nouns and adjectives.
1.2.3 Size of the Lexicon and the Grammar
The lexicon contains 84,240 lemmas. The number of types is 2,853, of which 202 are
Matrix types, 768 are language-specific types and 1,883 are inflectional pattern types.
The grammar contains 9 l-rules, 409 i-rules and 32 syntactic rules.
1.2.4 Correlation between the Matrix and the Norwegian Grammar
Comparing the Norwegian grammar with the Matrix grammar, we find that the following Matrix
types are used in the Norwegian grammar. We distinguish between the types that are
explicitly used in the grammar, those that are indirectly used, which means they are not referred
to explicitly, and those that are not used at all.
The matrix.tdl file has 203 types. 91 types are explicitly referred to in the Norwegian type file
(norsk.tdl), either as supertypes or as values of features. 61 types are indirectly used by the
grammar. These types are typically supertypes of other Matrix types that are used in the
Norwegian type file, or types that introduce features that the Norwegian type file uses. The
remaining 51 types are not used by the grammar.
1.2.4.1 Explicitly used Types
• Basic SIGN types
sign
word-or-lexrule
word
norm-lex-item
lexeme
phrase
• Syntactic types
synsem-min
synsem
expressed-synsem
canonical-synsem
lex-synsem
phr-synsem
non-canonical
gap
unexpressed-reg
anti-synsem
mod-local
local
cat
head
valence
• Semantic types
semsort
message
message_m_rel
command_m_rel
proposition_m_rel
question_m_rel
handle
index
event-or-ref-index
expl-ind
ref-ind
png
tense
event
conj-index
relation
arg0-relation
arg1-relation
arg12-relation
arg123-relation
noun-relation
named-relation
prep-mod-relation
conjunction-relation
unspec-compound-relation
quant-relation
• Technical types
Bool: + -
xmod
notmod-or-rmod
notmod-or-lmod
notmod
hasmod
lmod
rmod
sort
predsort
avm
list
null
olist
• Lexical types
lex-rule
lexeme-to-word-rule
inflecting-lex-rule
constant-lex-rule
const-ltol-rule
const-ltow-rule
infl-ltow-rule
• Phrasal types
phrasal
head-valence-phrase
basic-unary-phrase
unary-phrase
basic-binary-phrase
binary-phrase
binary-headed-phrase
head-only
head-initial
head-final
head-compositional
basic-head-filler-phrase
basic-head-subj-phrase
basic-head-spec-phrase
basic-head-comp-phrase
basic-extracted-subj-phrase
extracted-adj-phrase
adj-head-phrase
head-adj-phrase
adj-head-int-phrase
head-adj-int-phrase
1.2.4.2 Indirectly Used Types
Some of these types are '-min' types, like sign-min, valence-min and mrs-min, that typically do
not introduce any features but have subtypes that do. Their function is to make parsing more
efficient. They are not referred to explicitly in the Norwegian grammar, but they have Matrix
subtypes that are referred to.
Another group of types that are not referred to directly are types that introduce features that
are used in the Norwegian type file (keys, mrs, hook, qeq).
Moreover, there is a group of intermediate Matrix types with structure that are supertypes of
other Matrix types (headed-phrase, basic-extracted-arg-phrase, head-mod-phrase). The
language-specific types inherit from subtypes of these types.
• Basic SIGN types
sign-min
basic-sign
phrase-or-lexrule
word-or-lexrule-min
• Syntactic types
lex-or-phrase-synsem
expressed-non-canonical
unexpressed
local-min
non-local-min
non-local
cat-min
head-min
valence-min
keys_min
keys
• Content types
mrs-min
mrs
hook
lexkeys
basic_message
prop-or-ques_m_rel
abstr-ques_m_rel
qeq
semarg
individual
tam
subord-or-conj-relation
norm_rel
named_rel
• Technical types
luk
na-or-+
na-or--
+-or--
na
atom
cons
0-1-list
1-list
diff-list
0-1-dlist
0-dlist
1-dlist
string
alts-min
alts
label
• Lexical types
lex-item
lexeme-to-lexeme-rule
infl-ltol-rule
• Phrasal types
headed-phrase
head-nexus-rel-phrase
head-nexus-que-phrase
head-nexus-phrase
basic-binary-headed-phrase
non-clause
basic-extracted-arg-phrase
basic-extracted-adj-phrase
head-mod-phrase
basic-head-mod-phrase-simple
head-mod-phrase-simple
isect-mod-phrase
1.2.4.3 Types Ignored by the Grammar
Some of the types that are not used are 'shortcut' types like conj-event, conj-ref-ind and
event-relation (and its subtypes). These types are specified to have a specific value for a
feature; for example, the type event-relation is specified to have the type event as its ARG0.
Two types (no-alts, no-msg) are 'negation' types, i.e. types that are introduced in order not to be
compatible with other types.
The semantic types of aspect and mood are not used.
The type clause and its subtypes (relative-clause, non-rel-clause, interrogative-clause,
declarative-clause and imperative-clause) are not used.
The OPT mechanism is not used, so the types basic-head-opt-comp-phrase,
basic-head-opt-one-comp-phrase and basic-head-opt-two-comp-phrase are not used.
The scopal-mod-phrase types are not used.
Some relation types, like verb-ellipsis-relation, noun-arg1-relation, adv-relation and
subord-relation, are not used.
• Syntactic types
no-alts
non-affix-bearing
rule
tree-node-label
meta
scopal-mod
intersective-mod
non-local-none
• Semantic types
psoa
nom-obj
ctxt-min
ctxt
no-msg
ne_m_rel
instloc
aspect
mood
conj-event
conj-ref-ind
arg1234-relation
event-relation
arg1-ev-relation
arg12-ev-relation
arg123-ev-relation
arg1234-ev-relation
verb-ellipsis-relation
noun-arg1-relation
adv-relation
subord-relation
• Technical types
1-plus-list
dl-append
integer
ocons
onull
• Phrasal types
non-headed-phrase
binary-rule-left-to-right
binary-rule-right-to-left
basic-head-final
clause
relative-clause
non-rel-clause
interrogative-clause
declarative-clause
imperative-clause
basic-head-opt-comp-phrase
basic-head-opt-one-comp-phrase
basic-head-opt-two-comp-phrase
basic-extracted-comp-phrase
scopal-mod-phrase
adj-head-scop-phrase
head-adj-scop-phrase
Given the number of types in the lists above, the effort of defining a grammar with the
same coverage as the actual one without using the Matrix grammar would have been roughly
twice as large.
With the aim of developing more comprehensive and systematic evaluation tools for grammar
coverage, work has also been started, for Norwegian and German, on defining test suites for
verb constructions indexed according to construction type properties, and likewise for
derivational morphology for verbs, adjectives and nouns. The scope of this work, enclosed as a
separate document with this deliverable, is rather large, and it has not been possible to finalize it
to serve as a benchmark for the current round of comparison; rather, it will form part of the
documentation package to be provided by the end of the project.
Presentation of ‘satellite grammars’, based on the Matrix but exploring different phenomena
than those addressed in NorSource, will also be made then.
The following documentation is available at present:
Hellan, L., and P. Haugereid (2003) 'The NorSource Grammar - an exercise in the Matrix
Grammar building design'. In: Proceedings of the Workshop on Multilingual Grammar
Engineering, ESSLLI 2003.
Hellan, L. (2003) ‘Documentation of NorSource’. Ms, NTNU.
1.3 Evaluation Results: Italian Grammar
The Italian grammar makes use of the Matrix v0.6. For the development of the grammar in
DeepThought from M0 until M22, 12.5 person months were spent.
1.3.1 Phenomena Covered by the Grammar
A list of the linguistic phenomena covered by the Italian grammar (version 0.6, May 2004) is
given below.
• Agreement:
A type hierarchy for agreement values has been created. Agreement between
determiner and noun, adjective and noun, and subject and verb is covered.
• Argument structure (optionality and free order):
The feature OPT bool (introduced in the Matrix at the synsem-min level) is used by lexical
entry types in the COMPS list in order to manage the optionality of arguments. As far
as the free order of arguments is concerned (namely subject inversion and the NP-PP
and NP-AP inversions in verbal argument structure), we adopt the strategy of using
lexical rules, which simply invert the elements of the COMPS list.
• Passivization:
A lexical rule, passive-lex-to-word, following the traditional approach (Sag & Wasow,
"Syntactic Theory"), rearranges the elements of the COMPS list. The rule applies to
transitive, non-ergative verbs [TRANS +, ERG -]. Further information concerns the
semantic role ARG1, which is coindexed with the ARG2 of the preposition "da" (by). In
correspondence with passive-lex-to-word, another lexical rule deals with the past
participle, so that past participles have two alternative interpretations, one as a "passive"
and the other as an "active" past participle.
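The effect of such a rearrangement on a verb's argument list can be sketched as follows. The list-of-strings encoding and the function body are invented for illustration and do not reproduce the grammar's TDL.

```python
def passive_lex_rule(arg_st):
    """Sketch of a passive lexical rule: the first complement is promoted
    to subject, and the former subject is demoted to an optional "da"-PP
    (coindexed with ARG2 of "da", as described in the text)."""
    subj, first_comp, *rest = arg_st
    return [first_comp] + rest + [f'PP[da]({subj})']

# "Giovanni uccide Marco" -> "Marco e` ucciso (da Giovanni)":
# the object is promoted, the subject reappears inside the "da"-PP.
print(passive_lex_rule(["NP(Giovanni)", "NP(Marco)"]))
```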
• Raising and control verbs:
All auxiliaries, modals, pure motion verbs and copulatives are treated as raising verbs (with
structure sharing between the subject of the governed infinitive and the subject of the
governor). All others (governing at least one more complement besides the infinitival one)
are treated as equi (control) verbs, as described in Pollard & Sag (1994).
• Restructuring verbs:
Partially following the approach of Monachesi (1999), which in turn follows Rizzi (1982), both
auxiliaries and restructuring verbs (modals, temporal aspectuals and pure motion verbs) are
subject to argument composition, constructing a verbal complex.
• Perception verbs:
The relatively high frequency of perception (and causative) verbs in the Italian corpus
requires an elaborate treatment of this complex phenomenon. All the perception verbs
(henceforth PDS) have been grouped into 4 verb types:
- PDS control verbs
- PDS verbs with argument composition
- PDS monotransitive verbs (governing a “che” finite clause)
- PDS predicative verbs
Regarding the infinitive complementation of PDS verbs, a “deep” passivization of the infinitive
verb is allowed, without a corresponding morphological realization. E.g. the sentence
Giovanni lo ha visto uccidere (John saw him to-kill) can have a double interpretation,
depending on the diathesis of the infinitive "uccidere" (kill). A new feature DEACT bool
(introduced in pred-st) is used for this “deactivation” phenomenon. The value “+” is
assigned by a lexical rule for infinitive verbs.
• Cliticization:
According to the recent literature on Italian clitics and clitic climbing, the most convincing
approach seems to be the one suggested by Paola Monachesi in several papers. The
combined action of the "cliticizing" lexical rules and the "argument composition" lexical
rules seems to be adequate in several cases, but it is inefficient in the case of
"multiple restructuring" (since the auxiliary verb is itself a restructuring verb, "double” or “triple”
restructuring is quite frequent). We have tried to overcome the problems connected with
this multiple restructuring by adopting a hybrid approach to cliticization and clitic
climbing. We use the argument composition mechanism for restructuring verbs, but we
delay the attachment of the clitic until the (possible) restructuring chain has been
completed.
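This hybrid strategy can be sketched as follows. The encoding is invented for illustration, but the order of operations (compose the argument lists first, attach the clitic only once the chain is complete) mirrors the approach just described.

```python
def compose(governor_comps, governed_comps):
    """Argument composition: a restructuring verb replaces its infinitival
    complement with the complements of the infinitive it governs,
    forming a verbal complex."""
    return [c for c in governor_comps if c != "V(inf)"] + governed_comps

def attach_clitic(comps, clitic):
    """Clitic attachment, delayed until the restructuring chain is complete:
    the clitic saturates one matching complement of the verbal complex."""
    if clitic not in comps:
        raise ValueError(f"no complement for clitic {clitic!r}")
    rest = list(comps)
    rest.remove(clitic)
    return rest

# A double-restructuring chain ("lo vuole poter vedere"): compose twice,
# then attach the accusative clitic to the finished verbal complex.
comps = compose(["V(inf)"], compose(["V(inf)"], ["NP(acc)"]))
print(attach_clitic(comps, "NP(acc)"))
```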
• Clausal complementation:
Four kinds of clausal complements are taken into account, namely infinitive clauses
introduced by the complementizer “a”, infinitive clauses introduced by “di”, finite clauses
introduced by “che”, and bare infinitive clauses.
• Modifier phrases:
Adjectival phrases, adverbial phrases, some kinds of absolute phrases (participial and
gerundive), relative clauses (partially) and subordinate clauses are covered by the
grammar.
1.3.2 Phenomena not Covered by the Grammar
The phenomena that are not covered by the grammar are listed below.
From an application-driven point of view, the following phenomena are highly prioritised.
• Coordination (a first treatment has been introduced in a test-grammar)
• Comparative structures
Moreover, an adequate treatment of valence frames for nouns and adjectives is missing. From
a more language-specific perspective, the “clitic-doubling” phenomena should be dealt with,
given their frequency in the corpus.
Furthermore, a treatment of unknown words is essential for the robustness of the application.
1.3.3 Size of the Lexicon and the Grammar
The grammar contains about 90 lexical types (about 30 for adverbs and 60 for verbs), 9
lexical rules (for different orders of arguments) and 33 construction rules.
The lexicon has 4,345 lexical entries: 467 adverbs, 589 transitive verbs, 318
ditransitive verbs, 210 intransitive verbs, 72 other verbs, 2,548 nouns and 141 closed-class
entries.
1.3.4 Correlation between the Matrix Grammar and the Italian Grammar
The Italian grammar uses, directly or indirectly, 173 (85%) of the 203 types given by the
Matrix. The remaining 30 Matrix types are not used by the Italian grammar.
1.3.4.1 Types ignored by the Grammar
The Matrix types ignored by the Italian grammar are listed below:
no-alts
non-affix-bearing
rule
tree-node-label
meta
non-local-none
psoa
nom-obj
ctxt-min
ctxt
no-msg
command_m_rel
prop-or-ques_m_rel
abstr-ques_m_rel
question_m_rel
ne_m_rel
instloc
conj-event
conj-ref-ind
verb-ellipsis-relation
1-plus-list
dl-append
integer
binary-rule-left-to-right
binary-rule-right-to-left
relative-clause
non-rel-clause
interrogative-clause
imperative-clause
non-headed-phrase
Defining all the Matrix types used in the Italian grammar from scratch would be much
more time-consuming than deriving them from the Matrix. As with the development of the
Norwegian grammar, the effort of defining the grammar would have been roughly twice as large.
1.4 Reusability
Reusability is shown by the fact that both the Norwegian and the Italian grammar make use of
the Matrix, exploiting the crosslinguistic similarities between Norwegian and Italian.
For Norwegian as well as for Italian, a set of rules from the Matrix grammar can be used.
These cover head-specifier phrases, head-subject phrases, head-complement phrases,
modification, extraction, filling and inflection.
1.5 Standardized Output Format
Based on a selection of English, Norwegian and Italian sentences exhibiting some parallel
phenomena, semantic output data has been constructed. The semantic analyses of all three
languages for the sentence “Abrams barked” are given in Figures 1, 2 and 3.
The standardized semantic output is due to the Matrix-based approach. Assuming that
semantic representations should be less language-dependent than syntactic representations,
generality in the output format is easily obtainable. Moreover, interoperability is obtained
through the general interface to multilingual backend applications (see also chapters 2 and 3).
TEXT Abrams barked.
TOP h1
RELS {
  prpstn_m_rel
    LBL h1
    ARG0 h5
  proper_q_rel
    LBL h6
    ARG0 x7 pers=3 num=sg
    RSTR h9
    BODY h8
  named_rel
    LBL h10
    ARG0 x7 pers=3 num=sg
    CARG abrams
  _bark_v
    LBL h11
    ARG0 e2 tense=past
    ARG1 x7 pers=3 num=sg
}
HCONS {h5 qeq h11, h9 qeq h10}
ING {}
Figure 1 Abrams barked
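For a backend application, the analysis in Figure 1 amounts to a small data structure. The dictionary encoding below is our own illustrative sketch, not the HoG output format; it shows how the qeq constraints in HCONS connect the proposition and the quantifier to the predications they outscope.

```python
# Our own sketch of the analysis in Figure 1 as a plain data structure:
# handles and variables are strings, each relation a dict of its features.
mrs = {
    "TOP": "h1",
    "RELS": [
        {"PRED": "prpstn_m_rel", "LBL": "h1", "ARG0": "h5"},
        {"PRED": "proper_q_rel", "LBL": "h6", "ARG0": "x7",
         "RSTR": "h9", "BODY": "h8"},
        {"PRED": "named_rel", "LBL": "h10", "ARG0": "x7", "CARG": "abrams"},
        {"PRED": "_bark_v", "LBL": "h11", "ARG0": "e2", "ARG1": "x7"},
    ],
    "HCONS": [("h5", "qeq", "h11"), ("h9", "qeq", "h10")],
}

def qeq_targets(mrs):
    """Resolve each qeq constraint to the predication carrying the low
    handle as its label."""
    by_label = {r["LBL"]: r["PRED"] for r in mrs["RELS"]}
    return {hi: by_label[lo] for hi, op, lo in mrs["HCONS"] if op == "qeq"}
```

Here h5 (the proposition's argument) is qeq-linked to the _bark_v predication, and h9 (the quantifier's restriction) to named_rel, exactly as in the figure.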
TEXT Ask bjeffet.
TOP h1
RELS {
  named_rel
    LBL h3
    ARG0 x4
    CARG ASK
  def_q_rel
    LBL h7
    ARG0 x4
    RSTR h10
    BODY h11
  bjeffe-rel
    LBL h12
    ARG0 e2
    ARG1 x4
  proposition_m_rel
    LBL h1
    ARG0 h17
}
HCONS {h10 qeq h3, h17 qeq h18}
ING {}
Figure 2 Ask bjeffet
TEXT Argo abbaiava
TOP h1
RELS {
  named_rel
    LBL h3
    ARG0 x4
    CARG Argo
  _abbaiare_v
    LBL h1
    ARG0 e2
    ARG1 x4
  proposition_m_rel
    LBL h5
    MARG h6
}
HCONS { h6 qeq h1 }
ING {}
Figure 3 Argo abbaiava
2 Evaluation of the Heart of Gold
The Heart of Gold (HoG) is the core architecture combining different approaches to
multilingual language processing. It integrates natural language processing modules
that provide analyses of varying depth for multiple languages. There are
several types of module combination:
• The analysis results of NLP tools at lower processing levels can be used by components
at higher levels. For example, the deep linguistic analysis module PET uses default
lexicon entries for the part-of-speech tags that the POS tagger TnT delivers and for the
named entities that Sprout delivers.
• It is possible to configure the HoG so that the deepest possible result is always delivered.
• Furthermore, one can configure the HoG to deliver partial results whenever a complete
analysis is not available. Partial results are taken from the deepest module that delivers
results.
• One can combine modules and grammars for different languages. Each language (we
currently work with English, German, Japanese, Norwegian, Italian and Greek) has its own
configuration of valid modules and grammars.
• The different modules use a compatible output formalism, RMRS. For shallower
modules, this robust semantic structure allows for underspecification of, e.g., argument
structure.
We evaluate what kind of annotation one can get from the HoG using different configurations
and data from different domains. The evaluation described in this document is restricted to
English resources, but other languages will be evaluated in the near future as well.
We used three different test configurations for each data set:
1) HoG configured to provide the deepest result possible for each sentence. PET uses both
Part-of-Speech tags delivered by TnT and Named Entities detected by Sprout as input and
delivers partial parses in case no spanning analysis is available.
2) HoG configured to provide the deepest result possible for each sentence. PET uses both
Part-of-Speech tags delivered by TnT and Named Entities detected by Sprout as input, but
does not deliver partial parses.
3) HoG configured to provide only complete analyses from PET and RASP. No information
from shallower modules (TnT, Sprout) is used.
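The three configurations differ mainly in which results are admissible. A minimal sketch of the selection logic, with invented module names, depth ranking and result encoding (the real HoG is configured declaratively and is considerably more involved):

```python
from dataclasses import dataclass

# Hypothetical depth ranking for illustration: deeper modules rank higher.
DEPTH = {"PET": 3, "RASP": 2, "TnT": 1}

@dataclass
class Result:
    module: str
    spanning: bool  # does the analysis span the whole sentence?
    rmrs: str       # the (R)MRS analysis, abbreviated to a string here

def pick_result(results, allow_partial=True):
    """Return the result from the deepest module. Configuration 1 admits
    partial parses (allow_partial=True); configuration 2 admits only
    spanning analyses from each module (allow_partial=False)."""
    candidates = [r for r in results if r.spanning or allow_partial]
    if not candidates:
        return None
    # Deeper module wins; at equal depth, prefer a spanning analysis.
    return max(candidates, key=lambda r: (DEPTH[r.module], r.spanning))
```

Configuration 3 additionally withholds the TnT and Sprout annotations from PET; that concerns the pipeline's input wiring rather than the result selection sketched here.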
2.1.1 PASCAL Data
The training data of the PASCAL task contains declarative sentences from various domains.
581 test sentences of the PASCAL training corpus were sent to the HoG using configuration 1
as described above. The following table shows the coverage of PET and Rasp.
# sentences    # PET results    # Rasp results    # results
581            442              139               581
100%           76.06%           23.92%            100%
The same 581 test sentences of the PASCAL training corpus sent to HoG using
configuration 2 delivered these results:
# sentences    # spanning PET results    # Rasp results    # results
581            134                       447               581
100%           23.06%                    76.94%            100%
For configuration 3, the results are as follows:
# sentences    # PET results    # spanning PET results    # Rasp results    # results
581            37               14                        544               581
100%           6.37%            2.41%                     93.63%            100%
We get annotations for all sentences, from either PET or Rasp. Rasp is very robust and able to
give analyses for all sentences for which PET does not give an analysis. In this domain, which
is quite diverse in its lexical choices, it can be clearly shown that the usage of default lexicon
entries for recognized parts of speech and named entities heavily influences the
performance of the deep linguistic processing. Without these, PET delivers results for only
6.37% of the sentences and spanning results for only 2.41%, while with this input, the coverage
of PET rises to 76.06% for partial analyses and 23.06% for spanning results.
2.1.2 Mobile Phone Corpus
692 sentences (many of them fragments and lists) of mobile phone descriptions from the
internet sent to HoG using configuration 1.
# sentences    # PET results    # Rasp results    # results
692            631              61                692
100%           91.18%           8.82%             100%
The same 692 sentences of mobile phone descriptions were sent to the HoG using configuration 2.
# sentences    # spanning PET results    # Rasp results    # results
692            140                       552               692
100%           20.23%                    79.77%            100%
The same 692 sentences of mobile phone descriptions were sent to the HoG using configuration 3.
# sentences    # PET results    # spanning PET results    # Rasp results    # results
692            296              65                        396               692
100%           42.77%           9.39%                     57.23%            100%
The lexicons were tuned to this domain, so that we obtain more PET results than in the previous
domain, both with and without default lexicon entries. Still, it can be shown that using POS and
NER information from the shallower modules increases the performance of PET enormously, raising
spanning coverage from 9.39% to 20.23%. The data contain many lists and tables that the deep
HPSG grammar is not well prepared for. This shows how the overall processing gains from
being able to fall back on partial parses or (underspecified) Rasp results.
2.1.3 Newspaper Corpus
48 sentences of a business news article in the San Francisco Chronicle from 2004-07-27 (EN)
were sent to HoG using configuration 1.
# sentences    # PET results    # Rasp results    # results
48             31               17                48
100%           64.58%           35.42%            100%
The same 48 sentences were sent to HoG using configuration 2.
# sentences    # spanning PET results    # Rasp results    # results
48             6                         42                48
100%           12.50%                    87.50%            100%
The same 48 sentences were sent to HoG using configuration 3.
# sentences    # PET results    # spanning PET results    # Rasp results    # results
48             5                1                         43                48
100%           10.42%           2.08%                     89.58%            100%
This text is completely outside the training domain and therefore clearly shows the effect
of default lexicon entries.
2.1.4 All corpora
Tables 1, 2 and 3 show, for each configuration, the coverage on the three corpora and on all of them in sum.
Table 1 shows the results using configuration 1.
                # sentences    # PET results    # Rasp results
Pascal          581            442              139
Mobile Phone    692            631              61
Newspaper       48             31               17
All             1321           1104             217
                100%           83.57%           16.43%
Table 2 shows the results using configuration 2.
                # sentences    # spanning PET results    # Rasp results
Pascal          581            134                       447
Mobile Phone    692            140                       552
Newspaper       48             6                         42
All             1321           280                       1041
                100%           21.20%                    78.80%
Table 3 shows the results using configuration 3.
                # sentences    # PET results    # spanning PET results    # Rasp results
Pascal          581            37               14                        544
Mobile Phone    692            296              65                        396
Newspaper       48             5                1                         43
All             1321           338              80                        983
                100%           25.59%           6.06%                     74.41%
2.1.5 Conclusions
First of all, the strategy of using the deepest available result delivered by the Heart of Gold
core architecture guarantees results for all sentences in different domains. These results are
comparable and compatible with one another, since they are all formulated in the same
framework, RMRS. It therefore seems very useful to combine very robust modules like Rasp
with deeper modules like HPSG processing.
Across domains both closer to and farther from the development domain, in lexicon as well as in
syntactic structure, it could be shown that the depth of the results increases enormously when
the results of POS tagging and named-entity recognition are used in deep linguistic processing.
Over all domains, spanning HPSG (PET) coverage increased from 6.06% to 21.20%.
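The selection strategy can be pictured with a small sketch (illustrative only: the result records and module labels below are assumptions, not the actual Heart of Gold API):

```python
# Minimal sketch of the "deepest available result" strategy: prefer a
# spanning deep (PET) analysis, fall back to a partial PET analysis,
# and finally to the robust shallow (Rasp) result. The result records
# are hypothetical, not the real HoG data structures.

def deepest_available(results):
    """Pick the deepest analysis among the module outputs."""
    by_depth = [
        ("pet-spanning", lambda r: r.get("module") == "PET" and r.get("spanning")),
        ("pet-partial",  lambda r: r.get("module") == "PET"),
        ("rasp",         lambda r: r.get("module") == "RASP"),
    ]
    for label, matches in by_depth:
        for r in results:
            if matches(r):
                return label, r["rmrs"]
    return None, None

results = [
    {"module": "RASP", "rmrs": "rmrs-shallow"},
    {"module": "PET", "spanning": False, "rmrs": "rmrs-partial"},
]
label, rmrs = deepest_available(results)  # falls back to the partial PET result
```

Since every sentence receives at least a Rasp analysis, the function always returns a result, which mirrors the 100% coverage reported above.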
2.2 HoG and German
The large-scale German HPSG grammar (Müller & Kasper 2000; Crysmann 2003, to appear)
developed at DFKI was integrated into the DeepThought architecture HoG during the 2nd
quarter of 2004. The main task of this integration effort was the adaptation of the semantic
output to current (R)MRS standards. Furthermore, interface types and mappings were
provided to integrate shallow NLP analyses into the deep parser, thereby ensuring robustness
(for further integration scenarios see Frank et al. 2003). Currently, NER output from Sprout
and POS information from TnT are used to address the unknown-word problem.
In order to assess the gains in robustness offered by the integrated deep-shallow processing
adopted by HoG, we ran an experiment on unseen data, measuring the coverage obtained
with and without deep-shallow integration.
As test data, we used the 200 questions from the German section of the CLEF 2003 multilingual
QA competition. The corpus was parsed both by the stand-alone cheap parser and by the
version integrated into HoG. Additionally, we provide figures derived from a mock-up
experiment performed in the context of the DFKI project QUETAL, where NEs had been
manually replaced with dummy strings directly corresponding to special lexical entries in the
German HPSG.
The standalone system (baseline) was able to deliver a full parse for only 34 sentences
(17%). Inspection of the error log revealed that the most common source of parse failure was
lexical in nature: in 77.5% of the input sentences, at least one lexical item was unknown.
Abstracting away from the problems of lexical coverage, syntactic coverage was around 80%
(34/42), although these figures are certainly not reliable, owing to the size of the data set.
Deep-shallow integration drastically improves on these figures: by feeding NER and POS-tag
information into the deep parser, coverage goes up to 73% (146/200), a figure comparable to
those achieved on corpora for which the grammar had been optimised (e.g. Verbmobil data:
VM-CD01: 74.1%; VM-CD15: 78.4%). We conjecture that even better results could be
obtained by improving the NER component: given that 80% of the 200 CLEF questions
contain at least one NE, it is somewhat surprising that NER provided only partial results for
another 8 test items, which amounts to 4% of the entire test suite.
The results obtained by the German HoG also compare well to the aforementioned mock-up
experiment. Manual substitution of NEs resulted in an overall coverage of 56.5% (113/200).
Since substitution was restricted to NEs, lexical coverage was still an issue, affecting
30% (60/200) of the sentences. Relative to the 140 sentences without lexical
errors, we measured a syntactic coverage of around 80%.
To conclude, the integrated shallow-deep approach embodied by HoG, and, most notably, the
combination of NER and POS mappings, proves to be highly successful in improving the
robustness of the deep parser.
3 Evaluation of the Business Intelligence Application
In the Business Intelligence Application, deep analysis plays the role of "refining" the data
provided by Sophia 2.1 shallow parsing. More precisely, the Sophia 2.1 shallow parser extracts
Opinion Templates from the corpus texts, to which the relevant text segments are
attached as opinion snippets; those text portions are then analysed by the deep grammar, in
order to allow each template to be validated, refined, or filtered out.
As a Web source for selecting and collecting relevant texts, an Italian forum in the domain of
mobile phones and telecommunications was chosen (namely www.cellulare.it). As the evaluation
had to be performed manually, the data set (the forum messages) used for the evaluation was
rather small, as the following figure illustrates:
[Figure: opinion snippets extracted by Sophia are passed to the HPSG grammar, which either validates or filters each template.]
All 182 Opinion Templates (143 with negative polarity, 39 with positive polarity) were
taken into account for the evaluation.
Here we give an example of a forum message, the corresponding opinion template, and the
opinion snippet (in XML format) extracted by Sophia 2.1:
(Subject)
java su accompli008
(java on accompli008)
(Text Corpus)
le applicazioni j2me sul motorola accompli non funzionano... forse devo configurare qualcosa? se
qualcuno mi può aiutare ringrazio anticipatamente.
(the j2me applications on the motorola accompli don't work... maybe I have to configure something? if
anyone can help me, thanks in advance.)
[Figure: shallow parsing of the 788 forum messages produced 306 Sophia templates: 182 opinion templates and 124 question templates.]
Opinion template:
<NLFDoc>
  <Info id="5" sourceID=" 70852.txt" />
  <Maps>
    <NLF>
      <Pred attr="model">
        <Val>accompli008</Val>
      </Pred>
      <Pred attr="type">
        <Val>phone_name</Val>
      </Pred>
    </NLF>
    <NLF>
      <Pred attr="brString">
        <Val>l motorola</Val>
      </Pred>
      <Pred attr="brValue">
        <Val>BR_MOTOROLA</Val>
      </Pred>
      <Pred attr="opValue">
        <Val>NEGATIVE</Val>
      </Pred>
      <Pred attr="predString">
        <Val>non funzionano</Val>
      </Pred>
      <Pred attr="type">
        <Val>ENT_OP</Val>
      </Pred>
    </NLF>
  </Maps>
</NLFDoc>
Opinion snippet:
<opinionSt end="90" hasSnippet="true" start="56">
  <opinion end="90" pred="-1" start="76">non funzionano</opinion>
  <entity end="" hasSnippet="false" start="" term="" />
  <model brand="BR_MOTOROLA" model="accompli" />
  le applicazioni j2me sul motorola accompli non funzionano
</opinionSt>
In our evaluation we compare two different result sets, obtained with two different levels of
processing:
• Shallow analysis only, using Sophia 2.1
• Deep analysis of the textual part of the Sophia 2.1 output (the opinion snippets) with the
Italian HPSG grammar
Some preliminary remarks about the results of the shallow parsing are needed. We decided
to use a "relaxed" configuration of Sophia 2.1, which shows a good recall of 0.82 and,
almost as a consequence, a low precision of 0.37:

              Precision    Recall    F-measure
Sophia 2.1    0.37         0.82      0.51
As a consequence of the interaction between shallow and deep parsing as we designed it,
the recall value cannot be affected by the deep analysis results, whereas the precision value
should be, through the filtering out of (some of) the incorrect opinion templates.
Consequently, the F-measure should also increase.
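The expected effect can be illustrated with a few lines of arithmetic (the counts below are made up for illustration; only the qualitative behaviour matters):

```python
# Illustration, with invented counts, of why filtering out incorrect
# opinion templates raises precision and F-measure while leaving
# recall untouched, provided only incorrect templates are removed.

def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# Hypothetical counts: 60 correct templates found, 100 spurious, 13 missed.
before = prf(tp=60, fp=100, fn=13)
# Deep filtering discards 50 of the spurious templates; tp and fn are untouched.
after = prf(tp=60, fp=50, fn=13)
# Precision and F-measure rise; recall is identical by construction.
```

If, however, the filter also rejected some correct templates (lowering tp), recall would drop as well, which is why the design deliberately restricts deep analysis to rejecting or refining templates already found by Sophia.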
Deep Analysis Results
The deep analysis of the 182 opinion snippets substantially confirmed the results of the
shallow parser in 79 cases; in 15 of these 79 cases the templates were enriched
with additional, more detailed information (e.g., a generic problem concerning a given phone
model was specified as a net-connection problem by the deep analysis).
54 templates, on the other hand, were rejected as incorrect, because no connection between
the predicate (i.e. the word(s) expressing the opinion) and the entity (a keyword of the mobile
phone domain) in the template produced by Sophia was found in the RMRS
structure resulting from the deep analysis.
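The rejection criterion can be sketched roughly as follows. Here the RMRS is reduced to a mapping from predicate names to their argument variables, which is a strong simplification of the real representation, and the graph search shown is only one assumed reading of "no connection was found":

```python
# Rough sketch of the filtering criterion: a Sophia template survives only
# if the opinion predicate and the domain entity are connected through
# shared argument variables in the (heavily simplified) RMRS. The data
# below loosely mirrors the "problema ... infrarossi" example; the exact
# variable bookkeeping is an assumption, not the real RMRS format.

def connected(rmrs, pred, entity):
    """True if pred reaches entity via chains of shared variables."""
    seen, frontier = set(), {pred}
    while frontier:
        name = frontier.pop()
        seen.add(name)
        vars_ = rmrs.get(name, set())
        for other, other_vars in rmrs.items():
            if other not in seen and vars_ & other_vars:
                frontier.add(other)
    return entity in seen

rmrs = {
    "_problema_n": {"x5", "e11"},   # the opinion predicate
    "_con_p": {"e11", "x14"},       # preposition linking them
    "infrarossi": {"x14"},          # the domain entity
}
connected(rmrs, "_problema_n", "infrarossi")   # predicate linked to entity: keep
connected(rmrs, "_problema_n", "_modem_n")     # no link: template filtered out
```

A template whose predicate and entity end up in disjoint parts of the RMRS (as in the nokia/modem filtering example below) would be rejected under this criterion.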
Finally, the deep analysis of 49 input texts did not succeed, either because the original text
was ill-formed (mainly due to bad punctuation) or because it included one or more linguistic
phenomena not covered by the current Italian grammar.
In the following paragraphs we show examples of the different possible results produced
by deep processing: two examples of validation (a simple one, and one producing an
enriched template), and two snippets that the grammar was not able to parse. For each
example, we show the original text snippet, the template produced by Sophia, and the RMRS
structure resulting from the deep grammatical analysis.
[Figure: breakdown of the 182 templates: 79 validated (43.4%), of which 64 simple validations (35.2%) and 15 enriched validations (8.2%); 54 filtered (29.7%); 49 not parsed (26.9%), of which 24 due to lack of punctuation (13.2%) and 25 out of coverage (13.7%).]
(simple) validation example
ho riscontrato un problema con gli infrarossi del sharp gx20.
(I encountered a problem with the infrared of the Sharp gx20)

h1
_avere_v_aux1(h1,e2:) ARG1(h1,h3:)
_riscontrare_v(h3,e4:) ARG1(h3,x6:) ARG2(h3,x5:)
_undef_q_article(h7,x5:) BODY(h7,h9:) RSTR(h7,h8:) qeq(h8:,h26)
_problema_n(h10,x5:) ARG1(h10,e11:)
_con_p(h12,e11:) ARG2(h12,x14:)
_def_q_article(h15,x14:) BODY(h15,h17:) RSTR(h15,h16:) qeq(h16:,h18)
noun_name_rel(h18,x14:) CARG(h18,infrarossi_nm) ING(h18:,h10001:)
_di_p(h10001,e20:) ARG1(h10001,x14:) ARG2(h10001,x19:)
_def_q_article(h21,x19:) BODY(h21,h23:) RSTR(h21,h22:) qeq(h22:,h24)
brand_rel(h24,x19:) CARG(h24,sharp_n) ING(h24:,h10002:)
model_rel(h10002,x25:) CARG(h10002,gx20_n)
<opinionSt end="90" hasSnippet="true" start="50">
  <opinion confidence="1" end="61" hasSnippet="true" pred="-1" start="50">un problema</opinion>
  <entity end="" hasSnippet="false " start="" term="" />
  <model brand="BR_SHARP" model="gx20" />
  un problema con gli infrarossi del sharp
</opinionSt>

<NLFDoc>
  <Info id="1" sourceID=" 70824.txt" />
  <Maps>
    <NLF>
      <Pred attr="brString">
        <Val>l sharp</Val>
      </Pred>
      <Pred attr="brValue">
        <Val>BR_SHARP</Val>
      </Pred>
      <Pred attr="model">
        <Val>gx20</Val>
      </Pred>
      <Pred attr="opValue">
        <Val>NEGATIVE</Val>
      </Pred>
      <Pred attr="predString">
        <Val>un problema</Val>
      </Pred>
      <Pred attr="type">
        <Val>ENT_OP</Val>
      </Pred>
    </NLF>
  </Maps>
</NLFDoc>
(enriched) validation example
il mio v200 ha un problema di connessione col pc
(my v200 has a connection problem with the pc)

h1
_def_q_article(h3,x4:) BODY(h3,h6:) RSTR(h3,h5:) qeq(h5:,h7)
_mio_j(h7,x4:) ARG1(h7,x4:) ING(h7:,h10001:)
model_rel(h10001,x4:) CARG(h10001,v200_n)
_avere_v(h1,e2:) ARG1(h1,x4:) ARG2(h1,x8:)
_undef_q_article(h9,x8:) BODY(h9,h11:) RSTR(h9,h10:) qeq(h10:,h24)
_problema_n(h12,x8:) ARG1(h12,e13:)
_di_p(h14,e13:) ARG2(h14,x16:)
_connessione_n(h17,x16:) ING(h17:,h10002:)
_con_p(h10002,e19:) ARG1(h10002,x16:) ARG2(h10002,x18:)
_def_q_article(h20,x18:) BODY(h20,h22:) RSTR(h20,h21:) qeq(h21:,h23)
_pc_n(h23,x18:)
<opinionSt end="68" hasSnippet="true" start="42">
  <opinion confidence="1" end="68" hasSnippet="true" pred="-1" start="57">un problema</opinion>
  <entity end="" hasSnippet="false " start="" term="" />
  <model brand="" model="v200" />
  v200 ha un problema
</opinionSt>

<NLFDoc>
  <Info id="3" sourceID=" 71308.txt" />
  <Maps>
    <NLF>
      <Pred attr="brString">
        <Val>samsung</Val>
      </Pred>
      <Pred attr="brValue">
        <Val>BR_SAMSUNG</Val>
      </Pred>
      <Pred attr="model">
        <Val>v200</Val>
      </Pred>
      <Pred attr="type">
        <Val>phone_name</Val>
      </Pred>
    </NLF>
    <NLF>
      <Pred attr="model">
        <Val>v200</Val>
      </Pred>
      <Pred attr="opValue">
        <Val>NEGATIVE</Val>
      </Pred>
      <Pred attr="predString">
        <Val>un problema</Val>
      </Pred>
      <Pred attr="type">
        <Val>ENT_OP</Val>
      </Pred>
    </NLF>
  </Maps>
</NLFDoc>
Filtering example
probabilmente hai configurato male il nokia come modem. (probably you have configured the nokia badly as a modem)
h1
_probabilmente_r(h1,e2:) ARG1(h1,h3:) qeq(h3:,h1)
_avere_v_aux1(h10001,e2:) ARG1(h10001,h4:) ING(h10001:,h10002:) ING(h10001,h10003:) ING(h10001,h10004:) ING(h10001,h1:)
_configurare_v(h4,e5:) ARG1(h4,x7:) ARG2(h4,x6:)
_male_r(h10002,e2:) ARG1(h10002,h8:) qeq(h8:,h1)
_def_q_article(h9,x6:) BODY(h9,h10:) RSTR(h9,h8:) qeq(h8,h1)
brand_rel(h10003,x6:) CARG(h10003,nokia_n)
_come_p(h10004,e12:) ARG1(h10004,x6:) ARG2(h10004,x11:)
_modem_n(h13,x11:)
<opinionSt end="380" hasSnippet="true" start="356">
  <opinion confidence="1" end="360" hasSnippet="true" pred="-1" start="356">male</opinion>
  <entity end="380" hasSnippet="true" start="375" term="CONNECTIVITY">modem</entity>
  <model brand="BR_NOKIA" model="" />
  male il nokia come modem
</opinionSt>

<NLFDoc>
  <Info id="2" sourceID=" 70964.txt" />
  <Maps>
    <NLF>
      <Pred attr="brString">
        <Val>il nokia</Val>
      </Pred>
      <Pred attr="brValue">
        <Val>BR_NOKIA</Val>
      </Pred>
      <Pred attr="entString">
        <Val>modem</Val>
      </Pred>
      <Pred attr="entValue">
        <Val>CONNECTIVITY</Val>
      </Pred>
      <Pred attr="opValue">
        <Val>NEGATIVE</Val>
      </Pred>
      <Pred attr="predString">
        <Val>male</Val>
      </Pred>
      <Pred attr="type">
        <Val>ENT_OP</Val>
      </Pred>
    </NLF>
  </Maps>
</NLFDoc>
NotParsed example
(lack of punctuation)
ho due problemi che spero voi possiate risolvermi il primo ho sul mio nokia 7650 il programma di
registrazione video ho inviato un filmato su di un altro nokia 7650 con lo stesso programma ma il file si
chiama video3gp e se provo ad aprirlo mi dice che il formato del file è sconosciuto.
(I have two problems that I hope you can solve for me first I have on my Nokia 7650 the video
recording program I sent a movie to another nokia 7650 with the same program but the file is called
video3gp and if I try to open it it tells me that the file format is unknown)
(out of linguistic coverage [e.g. parenthetical clauses] and/or exhausted-number-of-edges)
ieri ho acquistato un nokia n-gage, ho acceso il mio portatile toshiba satellite 5100 503 (già predisposto
bluetooth), installo il software in corredo (pc suite for nokia n-gage) e qui mi fermo, nel senso che poi
non si fa più nulla e una volta attivato il bluetooth sul cell e sul portatile si vedono senza problemi, ma
non con i software nokia.
(Yesterday I bought a nokia n-gage, I turned on my toshiba satellite 5100 503 laptop (already equipped
with bluetooth), I install the bundled software (pc suite for nokia n-gage) and here I stop, in the sense
that nothing more happens; once bluetooth is activated on the cell phone and on the laptop they see
each other without any problem, but not with the nokia software)
4 Evaluation of the Auto-Response Application
The component provides information extraction functionalities for the following scenarios in
the content domain of a mobile phone provider: product ordering, mix-ups in deliveries of
products, and replacement of defective products. It takes as input one or more e-mails
(German and/or English) and delivers filled scenario templates as output. These templates
are the result of several processing steps, as described in Deliverable 4.9: named-entity
recognition, shallow and deep analysis, coreference resolution, mapping of the results of the
preceding analyses onto domain-specific templates, and merging operations on the partially filled
templates, resulting in filled scenario templates. The final scenario templates are of the
following types:
• Exchange
• Ordering
• Mix-up
In cases where no merging operations can take place, the partially filled templates are
presented to the user.
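The merging step can be pictured with a toy sketch (the slot names follow the template examples in section 4.2; the conflict rule shown is an assumption for illustration, not the component's documented logic):

```python
# Toy sketch of a merge step over partially filled templates: two partial
# templates of the same type are merged when their filled slots do not
# conflict, with empty slots taken from either side. Illustration only.

def merge(t1, t2):
    """Merge two partial templates, or return None on a slot conflict."""
    if t1["Type"] != t2["Type"]:
        return None
    merged = dict(t1)
    for slot, value in t2.items():
        if not merged.get(slot):
            merged[slot] = value          # fill an empty slot
        elif value and merged[slot] != value:
            return None                   # conflicting fillers: do not merge
    return merged

a = {"Type": "Exchange", "Products": "C35", "Customer": ""}
b = {"Type": "Exchange", "Products": "", "Customer": "Terry Severson"}
merge(a, b)   # one filled Exchange template combining both slot sets
```

When `merge` returns None for every pair, no merging takes place and the partial templates are passed through to the user, matching the behaviour described above.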
4.1 Email corpus
An email corpus was constructed using relevant anonymized customer emails.
Since the evaluation of the component was performed manually, the data set used for the
evaluation was rather limited: 87 emails for German and 84 for English. On average, each
email contains 4 sentences. Hence, the German data set consists of 348 sentences and the
English data set of 336 sentences.
4.2 Template Examples
The following examples show two input e-mails processed by the system and the scenario
templates the system delivered as output. The system identifies the customer using the
predicate argument structure from the deep analysis and by doing a domain-specific
coreference resolution between certain pronouns and potential antecedents. The assumption
here is the following: a person who refers to herself in an e-mail by "I", "me", etc.
presumably mentions her name either in the complimentary close or in the address part of the
e-mail. Products are identified via predicate-argument relations and named-entity
recognition. The predicates trigger the choice of the correct scenario template.
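The pronoun-to-signature assumption can be sketched as follows (the closing markers and the name pattern are illustrative assumptions, not the actual component):

```python
# Sketch of the domain-specific coreference heuristic described above:
# first-person pronouns are resolved to a name found after the
# complimentary close of the e-mail. The closing phrases and the name
# regular expression are invented for illustration.
import re

CLOSINGS = ("thank you", "thanks", "regards")

def customer_name(email_lines):
    """Guess the sender's name: first name-like line after a closing."""
    for i, line in enumerate(email_lines):
        if line.strip().lower().rstrip(",!") in CLOSINGS:
            for candidate in email_lines[i + 1:]:
                if re.fullmatch(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", candidate.strip()):
                    return candidate.strip()
    return None

mail = ["I would like to replace this defective phone.", "thank you",
        "Terry Severson", "7110 Martina Rd"]
customer_name(mail)   # resolves "I" to the signature name
```

In the real component this resolution is constrained by the predicate-argument structure from deep analysis; the sketch only covers the signature lookup.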
Below the output of the system is provided for an email asking for replacement of a defective
product and for an email describing the mix-up in a delivery of a product.
Input to the system:
Dear Support-Team!
I just received the Siemens C35 I ordered. But it seems to be broken because it doesn't want to switch
off. I would like to replace this defective phone.
thank you
Terry Severson
7110 Martina Rd
Winona, MN 55987
Output of the system:
Template Id = 967 Weight = 1.0
[Type: Exchange
Products: [Product_0: [Name: C35
Features: defective]]
OrderDate: [ ]
DeliveryDate: [ ]
Provider: [ ]
Customer: Terry Severson]
Input to the system:
Dear team!
I hope you'll help me!
You sent me a 8210 yesterday but it's not what I expected. I ordered a headset for my Siemens S55 .
I don't want this phone and I hope to will deliver my headset soon! Please correct this asap.
Thanks,
Shaun Port
Output of the system:
Template Id = 10512 Weight = 1.0
[Type: MixUp
OrderTemplate: [Type: Order
Products: [Product_0: [
Name: headset
Features: for_S55]]
Customer: Shaun Port
]
DeliveryTemplate: [Type: Delivery
Products: [Product_0: [
Name: 8210]]
Provider: [ ]
Customer: ...]]
4.3 Evaluation types
In our evaluation we compare, for German and English, two different result sets using two
different preprocessing levels configured in the HoG. The deep processing used and evaluated
in our application corresponds to configuration 2 described in chapter 2. Since the application
at hand is meant to be applied in real-world contexts, robustness is a necessary precondition.
Using the part-of-speech tags delivered by TnT and the named entities detected by Sprout as
input guarantees this: part-of-speech tags allow unknown words to be handled, and named-entity
recognition allows potentially important unknown named entities to be recognized. The
application has therefore been evaluated with PET using part-of-speech tags and named
entities. For German, a chunk tagger is used as the shallow component; for English we use
Rasp, a robust statistical parser. Both have been integrated into the HoG as described in
chapter 2. The configurations used for the evaluation are the following:
1) Only deep analysis as preprocessing
2) Deep and shallow analysis as preprocessing
We measured precision, recall, and F-Score for the scenario templates delivered by the
system by manually comparing them against ‘gold standard’ template annotations in the email
corpus mentioned above.
We performed two different types of evaluation: a template-based evaluation and a feature-based
evaluation. In the template-based evaluation, a template was judged correct if and only if all
required template features were correctly filled and the type of the template was correct.
In the feature-based evaluation, all feature values were evaluated separately, and each single
slot was judged either correct or incorrect. The relevant features for all three template types
are the following:
• Template type
• Product list (each product counting singly if more than one)
• Product feature list (each feature counting singly if more than one)
• Customer
• Provider
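The difference between the two modes can be made concrete with a toy example (flat slot names and invented values, far simpler than the nested templates actually scored):

```python
# Contrast of the two evaluation modes on one system template vs. its
# gold annotation: template-based scoring is all-or-nothing, while
# feature-based scoring credits each correctly filled slot separately.

FEATURES = ["Type", "Products", "Features", "Customer", "Provider"]

def template_correct(system, gold):
    """All-or-nothing: every feature must match the gold template."""
    return all(system.get(f) == gold.get(f) for f in FEATURES)

def feature_score(system, gold):
    """Fraction of individually correct slots."""
    hits = sum(system.get(f) == gold.get(f) for f in FEATURES)
    return hits / len(FEATURES)

gold   = {"Type": "Exchange", "Products": "C35", "Features": "defective",
          "Customer": "Terry Severson", "Provider": ""}
system = {"Type": "Exchange", "Products": "C35", "Features": "defective",
          "Customer": "", "Provider": ""}

template_correct(system, gold)   # False: one wrong slot sinks the template
feature_score(system, gold)      # 0.8: four of five slots are correct
```

This is exactly why the feature-based figures in the tables below are consistently higher than the template-based ones: a template with a single wrong slot counts as fully incorrect in the first mode but still contributes most of its slots in the second.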
4.4 Accuracy of the German Prototype
4.4.1 First experiment
Our first experiment shows figures using mainly deep analysis, with shallow processing as a
fallback (configuration 3 of chapter 2).
                            Precision    Recall     F-Score
Template-based evaluation   50.35 %      45.74 %    47.93 %
Feature-based evaluation    60.46 %      56.38 %    58.34 %
Table 1: Precision and recall values for results using configuration 2.
4.4.2 Second Experiment
The second experiment uses only deep analysis as preprocessing (configuration 2 of
chapter 2).
                            Precision    Recall     F-Score
Template-based evaluation   62.25 %      36.95 %    46.30 %
Feature-based evaluation    68.43 %      46.65 %    54.76 %
Table 2: Precision and recall values for results using configuration 1.
4.5 Accuracy of the Prototype for English
4.5.1 First Experiment
The first experiment shows figures using mainly deep analysis, with shallow processing as a
fallback (configuration 3 of chapter 2).
                            Precision    Recall     F-Score
Template-based evaluation   57.25 %      30.58 %    39.86 %
Feature-based evaluation    83.19 %      47.13 %    60.17 %
Table 3: Precision and recall values for results using configuration 2.
4.5.2 Second Experiment
The second experiment uses only deep analysis as preprocessing (configuration 2 of
chapter 2).
                            Precision    Recall     F-Score
Template-based evaluation   48.13 %      38.52 %    42.70 %
Feature-based evaluation    75.45 %      61.17 %    67.56 %
Table 4: Precision and recall values for results using configuration 1.
4.5.3 Conclusion
Naturally, precision and recall values for feature-based evaluation are always higher than
those for template-based evaluation. This is due to the fact that in many cases templates
contain only one or two incorrect feature values. During template-based evaluation these
templates were regarded as completely incorrect. This fact explains the difference between
accuracy values considering whole templates and feature values, respectively.
As expected, the precision value when using only deep analysis is higher than the precision
value when combining deep and shallow analysis. Whereas the higher F-Score in the first
experiment for both evaluation types indicates that the combined approach delivers better
results altogether.
The use of shallow preprocessing mainly supports the identification of ordering templates.
This is chiefly due to the difficulty of recognizing templates of type "Exchange" or "Mix-up"
when using only shallow processing. In these cases, accurate recognition of predicate-argument
structure is a necessary precondition for making the following decisions:
• What are the features of the product?
• Which product has been ordered and which product must be replaced?
Moreover, with shallow processing alone, the agreement features needed for the domain-specific
coreference resolution between pronouns and customer names, or between nouns and product
named entities, as potential antecedents are not available.
On the other hand, shallow processing does deliver correct templates in some cases (mainly
“Order” templates) for which the deep analysis does not provide a template at all. The
example below illustrates one such case for an email in German.
Input to the system:
Ich suche einen neuen Handy-Vertrag, können Sie mir bitte Ihre Angebote zuschicken? Danke. Peter
Janke Email: [email protected] Adresse: Halbergstrasse 57 66111 Saarbrücken
(I am looking for a new mobile phone contract, could you please send me your offers? Thank you. Peter
Janke Email: [email protected] Address: Halbergstrasse 57, 66111 Saarbrücken)
Output of the system:
Template Id = 25577 Weight = 1.0
[Type: Order
Products: [Product_0: [
Name: Handy-Vertrag
Features: neuen]]
OrderDate: [ ]
Customer: Ich ]
Template Id = 25563 Weight = 1.0
[Type: Order
Products: [Product_0: [
Name: Angebote]]
OrderDate: [ ]
Customer: mir ]
The benefits of combining both approaches are therefore evident, as the figures in
Tables 1 to 4 suggest.
5 Travel Information Application
An application using the Norwegian grammar is also under development, aimed at extracting
information from hiking route descriptions and supplying it for a web portal. The HoG
machinery produces RMRSes which are mapped onto standardized information matrices.
The input grammar has a specially developed semantics coping with aspects of paths and
movement (extending the core grammar originally produced).
Documentation of the system will be available by the end of the project, but an evaluation
would be premature at this point.
6 Concertation Plan
The core linguistic machinery (HoG) developed in DeepThought will be employed in the
following two projects, for which applications have been submitted; the outcome will be
known in early December.
GERONIMO: development of the Norwegian HPSG grammar in the domains of
e-learning, translation and information extraction, making crucial use of HoG.
WebSEMIOTICS: development of tools for web information extraction with a natural-language
interface, using HoG as one component.
Both projects will run at NTNU.