
DEEPTHOUGHT

Hybrid Deep and Shallow Methods

for Knowledge-Intensive

Information Extraction

Deliverable 5.10

Evaluation Report on Efficiency,

Accuracy and Usability of the New

Approach

The DeepThought Consortium


August 2004


PROJECT REF. NO.              IST-2001-37836

Project acronym               DeepThought

Project full title            DeepThought - Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction

Security (distribution level) Rest.

Contractual date of delivery  August 2004

Actual date of delivery       September 2004

Deliverable number            D5.10

Deliverable name              Evaluation report on efficiency, accuracy and usability of the new approach

Type                          Report

Status & version              Final version

Number of pages

WP contributing to the deliverable  WP5

WP / Task responsible         Xtramind

Other contributors            USAAR, Celi, NTNU

Author(s)                     Dorothee Beermann, Berthold Crysmann, Petter Haugereid, Lars Hellan, Dario Gonella, Daniela Kurz, Giampaolo Mazzini, Oliver Plaehn, Melanie Siegel

EC Project Officer            Evangelia Markidou

Keywords                      Evaluation, Matrix, core linguistic machinery, applications, accuracy, coverage

Abstract                      In this deliverable, an evaluation of the Italian and Norwegian grammars is presented. Moreover, the core linguistic machinery and both applications have been evaluated.


Table of Contents

1 Grammar Evaluation
1.1 Basic Considerations on the Evaluation
1.2 Evaluation Results: Norwegian Grammar
1.2.1 Phenomena Covered by the Grammar
1.2.2 Phenomena not Covered by the Grammar
1.2.3 Size of the Lexicon and the Grammar
1.2.4 Correlation between the Matrix and the Norwegian Grammar
1.3 Evaluation Results: Italian Grammar
1.3.1 Phenomena Covered by the Grammar
1.3.2 Phenomena not Covered by the Grammar
1.3.3 Size of the Lexicon and the Grammar
1.3.4 Correlation between the Matrix Grammar and the Italian Grammar
1.4 Reusability
1.5 Standardized Output Format
2 Evaluation of the Heart of Gold
2.1.1 PASCAL Data
2.1.2 Mobile Phone Corpus
2.1.3 Newspaper Corpus
2.1.4 All Corpora
2.1.5 Conclusions
2.2 HoG and German
3 Evaluation of the Business Intelligence Application
4 Evaluation of the Auto-Response Application
4.1 Email Corpus
4.2 Template Examples
4.3 Evaluation Types
4.4 Accuracy of the German Prototype
4.4.1 First Experiment
4.4.2 Second Experiment
4.5 Accuracy of the Prototype for English
4.5.1 First Experiment
4.5.2 Second Experiment
4.5.3 Conclusion
5 Travel Information Application
6 Concertation Plan


1 Grammar Evaluation

1.1 Basic Considerations on the Evaluation

The Matrix grammar, a language-independent core grammar, has been designed to facilitate the rapid initial development of grammars for natural languages. Since the Matrix is a collection of generalizations across grammars, it cannot be evaluated in isolation. Evaluation of the Matrix is therefore carried out as a case study of its benefit for the development of an actual grammar. The current version of the Matrix has been used for the development of a Norwegian and an Italian grammar since the beginning of the project.

The evaluation at hand is based on the following questions:

• How many person months were spent on grammar development in DeepThought?

• What phenomena are covered by the Norwegian and the Italian grammar?

• What phenomena are not covered, but should be covered? Are these phenomena language-specific, or do they occur in other languages as well?

• What is the size of the lexicon, and how many types and rules are there?

• Which types of the Matrix are used, and which are not?

• What would the effort of defining the used types have been without the Matrix?

1.2 Evaluation Results: Norwegian Grammar

The Norwegian grammar makes use of the Matrix v0.6. For the development of the grammar in DeepThought from M0 until M22, a total of 20 person months were spent. Ten person months were funded by DeepThought, four were paid from other sources, and six were covered by permanent positions.

1.2.1 Phenomena Covered by the Grammar

The phenomena covered by the Norwegian grammar are the following:

• Lexicon

The lexicon has 84,240 lexical entries: 56,966 nouns, 13,744 adjectives, 13,185 verbs (some of them valence variants of the same lexeme), 284 prepositions/adverbs and 61 other entries. Verbs, nouns and adjectives are entered into the lexicon as lexemes; inflectional rules turn them into words. All other lexical entries are words (or phrases).


• Word types

The grammar distinguishes 11 main word classes: verbs, nouns, pronouns, adjectives, adverbs, sentential adverbs, determiners, complementizers, the infinitive marker, prepositions and conjunctions.

• Valency

The grammar has 102 different argument frames for verbs. These argument frames are cross-classifications of the factors presentational/non-presentational, arity, category of the complement (noun, preposition, adverb, adjective, subordinate clause or infinitival clause) and thematic roles. The grammar also treats auxiliaries and modal verbs. A schematic illustration of such a cross-classification is given below.
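To illustrate how such a cross-classification multiplies out, the sketch below enumerates frame labels from a few invented factor values (these values are purely illustrative and are not the actual inventory of the Norwegian grammar):

from itertools import product

# Purely illustrative factor values; the real grammar cross-classifies more factors,
# including thematic roles, to arrive at its 102 verb frames.
presentational = ["pres", "non-pres"]
arity = ["1", "2", "3"]
complement = ["NP", "PP", "AdvP", "AP", "sub-clause", "inf-clause"]

frames = ["-".join(combo) for combo in product(presentational, arity, complement)]
print(len(frames))        # 2 * 3 * 6 = 36 combinations before thematic roles are added
print(frames[0])          # pres-1-NP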

• Lexical rules

The grammar has lexical rules that handle the position of light pronouns relative to sentence adverbials, as well as particle promotion, passive and inversion.

• Syntax

The grammar parses declarative clauses and yes-no questions. It treats topicalization, wh-questions and relative clauses, and it parses sentences with auxiliary verbs. Both the morphological and the periphrastic passive are covered. The grammar also treats agreement in NPs.

1.2.2 Phenomena not Covered by the Grammar

The phenomena not covered by the grammar are listed below. We distinguish between bugs in the grammar (rules that currently do not work), which will be fixed, and rules that are missing and have to be added.

• Rules that do not work:

The imperative inflectional rule works, but imperative sentences do not parse. Extraction from subordinate clauses does not work at the moment. In the big lexicon, the distinction between adverbs and prepositions is vague and unclear.

• Rules that are missing:


The grammar has almost no treatment of numbers, compounds, abbreviations or interjections. It does not treat procedural markers like "vel", "altså", "da" or "nok". The comparative and superlative forms of adjectives are missing. Coordination and comparative constructions are not treated in the grammar. The grammar does not have more than one valence frame for nouns and adjectives.

1.2.3 Size of the Lexicon and the Grammar

The lexicon contains 84,240 lemmas. The number of types is 2,853, of which 202 are Matrix types, 768 are language-specific types and 1,883 are inflectional pattern types. The grammar contains 9 lexical rules (l-rules), 409 inflectional rules (i-rules) and 32 syntactic rules.

1.2.4 Correlation between the Matrix and the Norwegian Grammar

Comparing the Norwegian grammar with the Matrix grammar, we find that the following Matrix types are used in the Norwegian grammar. We distinguish between the types that are explicitly used in the grammar, those that are indirectly used, which means they are not referred to explicitly, and those that are not used at all.

The matrix.tdl file has 203 types. 91 types are explicitly referred to in the Norwegian type file (norsk.tdl), either as supertypes or as values of features. 61 types are indirectly used by the grammar. These types are typically supertypes of other Matrix types that are used in the Norwegian type file, or types that introduce features that the Norwegian type file uses. The remaining 51 types are not used by the grammar.

1.2.4.1 Explicitly used Types

• Basic SIGN types

sign

word-or-lexrule

word

norm-lex-item

lexeme

phrase


• Syntactic types

synsem-min

synsem

expressed-synsem

canonical-synsem

lex-synsem

phr-synsem

non-canonical

gap

unexpressed-reg

anti-synsem

mod-local

local

cat

head

valence

• Semantic types

semsort

message

message_m_rel

command_m_rel

proposition_m_rel

question_m_rel

handle

index

event-or-ref-index

expl-ind

ref-ind

png

tense

event

conj-index

relation


arg0-relation

arg1-relation

arg12-relation

arg123-relation

noun-relation

named-relation

prep-mod-relation

conjunction-relation

unspec-compound-relation

quant-relation

• Technical types

Bool: + -

xmod

notmod-or-rmod

notmod-or-lmod

notmod

hasmod

lmod

rmod

sort

predsort

avm

list

null

olist

• Lexical types

lex-rule

lexeme-to-word-rule

inflecting-lex-rule

constant-lex-rule

const-ltol-rule

const-ltow-rule

infl-ltow-rule


• Phrasal types

phrasal

head-valence-phrase

basic-unary-phrase

unary-phrase

basic-binary-phrase

binary-phrase

binary-headed-phrase

head-only

head-initial

head-final

head-compositional

basic-head-filler-phrase

basic-head-subj-phrase

basic-head-spec-phrase

basic-head-comp-phrase

basic-extracted-subj-phrase

extracted-adj-phrase

adj-head-phrase

head-adj-phrase

adj-head-int-phrase

head-adj-int-phrase

1.2.4.2 Indirectly Used Types

Some of these types are '-min' types like sign-min, valence-min and mrs-min, which typically do not introduce any features but have subtypes that do. Their function is to make parsing more efficient. They are not referred to explicitly in the Norwegian grammar, but they have Matrix subtypes that are referred to.

Another group of types that are not referred to directly are types that introduce features used in the Norwegian type file (keys, mrs, hook, qeq).


Moreover, there is a group of intermediate Matrix types with structure that are supertypes of other Matrix types (headed-phrase, basic-extracted-arg-phrase, head-mod-phrase). The language-specific types inherit from subtypes of these types.

• Basic SIGN types

sign-min

basic-sign

phrase-or-lexrule

word-or-lexrule-min

• Syntactic types

lex-or-phrase-synsem

expressed-non-canonical

unexpressed

local-min

non-local-min

non-local

cat-min

head-min

valence-min

keys_min

keys

• Content types

mrs-min

mrs

hook

lexkeys

basic_message

prop-or-ques_m_rel

abstr-ques_m_rel

qeq

semarg

individual

tam

subord-or-conj-relation


norm_rel

named_rel

• Technical types

luk

na-or-+

na-or--

+-or--

na

atom

cons

0-1-list

1-list

diff-list

0-1-dlist

0-dlist

1-dlist

string

alts-min

alts

label

• Lexical types

lex-item

lexeme-to-lexeme-rule

infl-ltol-rule

• Phrasal types

headed-phrase

head-nexus-rel-phrase

head-nexus-que-phrase

head-nexus-phrase

basic-binary-headed-phrase

non-clause

basic-extracted-arg-phrase


basic-extracted-adj-phrase

head-mod-phrase

basic-head-mod-phrase-simple

head-mod-phrase-simple

isect-mod-phrase

1.2.4.3 Types Ignored by the Grammar

Some of the types that are not used are 'shortcut' types like conj-event, conj-ref-ind and event-relation (and its subtypes). These types are specified to have a specific value for a feature; for example, the type event-relation is specified to have the type event as its ARG0.

Two types (no-alts, no-msg) are 'negation' types, i.e. types that are introduced in order not to be compatible with other types.

The semantic types of aspect and mood are not used.

The type clause and its subtypes are not used (relative-clause, non-rel-clause, interrogative-clause, declarative-clause and imperative-clause).

The OPT mechanism is not used, so the types basic-head-opt-comp-phrase, basic-head-opt-one-comp-phrase and basic-head-opt-two-comp-phrase are not used. The scopal-mod-phrase types are not used.

Some relation types like verb-ellipsis-relation, noun-arg1-relation, adv-relation and subord-relation are not used.

• Syntactic types

no-alts

non-affix-bearing

rule

tree-node-label

meta

scopal-mod

intersective-mod

non-local-none

• Semantic types

psoa

nom-obj


ctxt-min

ctxt

no-msg

ne_m_rel

instloc

aspect

mood

conj-event

conj-ref-ind

arg1234-relation

event-relation

arg1-ev-relation

arg12-ev-relation

arg123-ev-relation

arg1234-ev-relation

verb-ellipsis-relation

noun-arg1-relation

adv-relation

subord-relation

• Technical types

1-plus-list

dl-append

integer

ocons

onull

• Phrasal types

non-headed-phrase

binary-rule-left-to-right

binary-rule-right-to-left

basic-head-final

clause

relative-clause

non-rel-clause

interrogative-clause


declarative-clause

imperative-clause

basic-head-opt-comp-phrase

basic-head-opt-one-comp-phrase

basic-head-opt-two-comp-phrase

basic-extracted-comp-phrase

scopal-mod-phrase

adj-head-scop-phrase

head-adj-scop-phrase

Given the number of types in the lists above, the effort of defining a grammar with the same coverage as the actual one without using the Matrix grammar would have been roughly twice as high.

With the aim of developing more comprehensive and systematic evaluation tools for grammar coverage, work has also been started, for Norwegian and German, to define test suites for verb constructions indexed according to construction type properties, and likewise for derivational morphology for verbs, adjectives and nouns. The scope of this work, enclosed as a separate document with this deliverable, is rather large, and it has not been possible to finalize it to serve as a benchmark for the current round of comparison; rather, it will form part of the documentation package to be provided by the end of the project.

A presentation of 'satellite grammars', based on the Matrix but exploring different phenomena than those addressed in NorSource, will also be made then.

The following documentation is available at present:

Hellan, L., and P. Haugereid (2003) 'The NorSource Grammar - an exercise in the Matrix Grammar building design'. In: Proceedings of the Workshop on Multilingual Grammar Engineering, ESSLLI 2003.

Hellan, L. (2003) 'Documentation of NorSource'. Ms, NTNU.

1.3 Evaluation results: Italian Grammar


The Italian grammar makes use of the Matrix v0.6. For the development of the grammar in DeepThought from M0 until M22, 12.5 person months were spent.

1.3.1 Phenomena Covered by the Grammar

A list of the linguistic phenomena covered by the Italian grammar (version 0.6, May 2004) is

given below.

• Agreement:

A type hierarchy for agreement values has been created. Agreement between determiner and noun, adjective and noun, and subject and verb is covered.

• Argument structure (optionality and free order):

The feature OPT bool (introduced in the Matrix at the synsem-min level) is used by lexical entry types in the COMPS list in order to manage the optionality of arguments. As far as the free order of arguments is concerned (namely subject inversion and the NP-PP and NP-AP inversions in verbal argument structure), we adopt the strategy of using lexical rules, which simply invert the elements of the COMPS list.

• Passivization:

A lexical rule, passive-lex-to-word, following the traditional approach (Sag & Wasow, "Syntactic Theory"), rearranges the elements of the COMPS list. The rule applies to transitive, non-ergative verbs [TRANS +, ERG -]. Further information concerns the semantic role ARG1, which is coindexed with the ARG2 of the preposition "da" (by). In parallel with passive-lex-to-word, another lexical rule deals with the past participle, so that past participles have two alternative interpretations, one as a "passive" and the other as an "active" past participle.

• Raising and control verbs:

All auxiliaries, modals, pure motion verbs and copulatives are treated as raising verbs (with structure sharing between the subject of the governed infinitive and the subject of the governor). All others (governing at least one more complement besides the infinitival one) are treated as equi (control) verbs (as described in Pollard & Sag, 1994).

• Restructuring verbs:


Partially following the approach of Monachesi (1999), which builds on Rizzi (1982), both auxiliaries and restructuring verbs (modals, temporal aspectuals and pure motion verbs) undergo argument composition, constructing a verbal complex.

• Perception verbs:

The relatively high frequency of perception (and causative) verbs in the Italian corpus requires an elaborate treatment of this complex phenomenon. All the perception verbs (henceforth PDS) have been grouped into 4 verb types:

- PDS control verbs

- PDS verbs with argument composition

- PDS monotransitive verbs (governing a “che” finite clause)

- PDS predicative verbs

Regarding the infinitival complementation of PDS verbs, a "deep" passivization of the infinitival verb is allowed, without a corresponding morphological realization. E.g. the sentence Giovanni lo ha visto uccidere (John saw him to-kill) can have a double interpretation, depending on the diathesis of the infinitive "uccidere" (kill). A new feature DEACT bool (introduced in pred-st) is used for the "deactivation" phenomenon. The value "+" is assigned by a lexical rule for infinitival verbs.

• Cliticization:

According to the recent literature on Italian clitics and clitic climbing, the most convincing approach seems to be the one suggested by Paola Monachesi in several papers. The combined action of the "cliticizing" lexical rules and the "argument composition" lexical rules seems to be adequate in several cases, but it turns out to be inefficient in the case of "multiple restructuring" (since the auxiliary verb is itself a restructuring verb, "double" or "triple" restructuring is quite frequent). We have tried to overcome the problems connected with this multiple restructuring by adopting a hybrid approach to cliticization and clitic climbing. We use the argument composition mechanism for restructuring verbs, but we delay the attachment of the clitic until the (possible) restructuring chain has been completed.

• Clausal complementation:


Four kinds of clausal complements are taken into account, namely infinitival clauses introduced by the complementizers "a" or "di", finite clauses introduced by "che", and bare infinitival clauses.

• Modifier phrases:

Adjectival phrases, adverbial phrases, some kinds of absolute phrases (participial and gerundive), relative clauses (partially) and subordinate clauses are covered by the grammar.

1.3.2 Phenomena not Covered by the Grammar

The phenomena not covered by the grammar are listed below.

From an application-driven point of view, the following phenomena are highly prioritised:

• Coordination (a first treatment has been introduced in a test-grammar)

• Comparative structures

Moreover, an adequate treatment of valence frames for nouns and adjectives is missing. From a more language-specific perspective, the "clitic-doubling" phenomena should be dealt with, given their frequency in the corpus.

Furthermore, a treatment of unknown words is essential for the robustness of the application.

1.3.3 Size of the Lexicon and the Grammar

The grammar contains about 90 lexical types (about 30 for adverbs and 60 for verbs), 9 lexical rules (for different orders of arguments) and 33 construction rules.

The lexicon has 4,345 lexical entries: 467 adverbs, 589 transitive verbs, 318 ditransitive verbs, 210 intransitive verbs, 72 other verbs, 2,548 nouns and 141 closed-class entries.


1.3.4 Correlation between the Matrix Grammar and the Italian Grammar

The Italian grammar uses, directly or indirectly, 173 (about 85%) of the 203 types given by the Matrix. The remaining 30 Matrix types are not used by the Italian grammar.

1.3.4.1 Types ignored by the Grammar

Types which get ignored are listed below:

no-alts

non-affix-bearing

rule

tree-node-label

meta

non-local-none

psoa

nom-obj

ctxt-min

ctxt

no-msg

command_m_rel

prop-or-ques_m_rel

abstr-ques_m_rel

question_m_rel

ne_m_rel

instloc

conj-event

conj-ref-ind

verb-ellipsis-relation

1-plus-list

dl-append

integer

binary-rule-left-to-right

binary-rule-right-to-left

relative-clause

non-rel-clause

interrogative-clause


imperative-clause

non-headed-phrase

Defining all the Matrix types used in the Italian grammar from scratch would be much more time-consuming than deriving them from the Matrix. As with the Norwegian grammar, the effort of defining the grammar would have been roughly twice as high.

1.4 Reusability

Reusability is shown by the fact that both the Norwegian and the Italian grammar make use of the Matrix, exploiting the cross-linguistic similarity between Norwegian and Italian.

For Norwegian as well as for Italian, a set of rules from the Matrix grammar can be used. These are head-specifier phrases, head-subject phrases, head-complement phrases, modification, extraction, filling and inflection.

1.5 Standardized Output Format

Based on a selection of English, Norwegian and Italian sentences exhibiting some parallel phenomena, semantic output data has been constructed. The semantic analyses of all three languages for the sentence "Abrams barked" are given in figures 1, 2 and 3.

The standardized semantic output is due to the Matrix-based approach. Assuming that semantic representations should be less language-dependent than syntactic representations, generality in the output format is easily obtainable. Moreover, interoperability is obtained by the general interface to multilingual backend applications (see also chapters 2 and 3).

TEXT Abrams barked.
TOP h1
RELS {
prpstn_m_rel: LBL h1, ARG0 h5
proper_q_rel: LBL h6, ARG0 x7 pers=3 num=sg, RSTR h9, BODY h8
named_rel: LBL h10, ARG0 x7 pers=3 num=sg, CARG abrams
_bark_v: LBL h11, ARG0 e2 tense=past, ARG1 x7 pers=3 num=sg
}
HCONS {h5 qeq h11, h9 qeq h10}
ING {}

Figure 1: Abrams barked

TEXT Ask bjeffet.
TOP h1
RELS {
named_rel: LBL h3, ARG0 x4, CARG ASK
def_q_rel: LBL h7, ARG0 x4, RSTR h10, BODY h11
bjeffe-rel: LBL h12, ARG0 e2, ARG1 x4
proposition_m_rel: LBL h1, ARG0 h17
}
HCONS {h10 qeq h3, h17 qeq h18}
ING {}

Figure 2: Ask bjeffet

TEXT Argo abbaiava
TOP h1
RELS {
named_rel: LBL h3, ARG0 x4, CARG Argo
_abbaiare_v: LBL h1, ARG0 e2, ARG1 x4
proposition_m_rel: LBL h5, MARG h6
}
HCONS {h6 qeq h1}
ING {}

Figure 3: Argo abbaiava

2 Evaluation of the Heart of Gold

The Heart of Gold (HoG) is the core architecture combining different approaches to multilingual language processing. It combines natural language processing modules that provide analyses of varying depth for multiple languages. There are different types of module combination:

• The analysis results of NLP tools at lower processing levels can be used by components at higher levels. For example, the deep linguistic analysis module PET uses default lexicon entries for part-of-speech tags delivered by the POS tagger TnT and for named entities delivered by Sprout.

• It is possible to configure the HoG so that the deepest possible result is always delivered.

• Furthermore, one can configure the HoG to deliver partial results whenever a complete analysis is not available. Partial results are taken from the deepest module that delivers results (a small sketch of this fall-back strategy is given after this list).

• One can combine modules and grammars for different languages. Each language (we

currently work with English, German, Japanese, Norwegian, Italian and Greek) has its own

configuration of valid modules and grammars.

• The different modules use a compatible output formalism, RMRS. In the case of shallower modules, this robust semantic structure allows for underspecification of, e.g., argument structure.
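As a rough illustration of this behaviour, the sketch below selects the deepest available analysis from an ordered list of modules; the module interface and names are assumptions for illustration only, not the actual HoG API:

# Hypothetical illustration of the "deepest available result" strategy, not the HoG API.
def deepest_result(sentence, modules):
    """modules: list of (name, analyse) pairs, ordered from deepest (e.g. PET)
    to shallowest (e.g. RASP, TnT); analyse() returns None if it has no result."""
    for name, analyse in modules:
        result = analyse(sentence)
        if result is not None:
            return name, result        # the first (deepest) module with a result wins
    return None, None                  # no module could analyse the sentence

# usage with dummy analysers standing in for PET and RASP
pet = lambda s: None                   # pretend the deep parser fails on this input
rasp = lambda s: "underspecified RMRS from RASP"
print(deepest_result("Abrams barked.", [("PET", pet), ("RASP", rasp)]))
# ('RASP', 'underspecified RMRS from RASP')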

We evaluate what kind of annotation one can get from the HoG using different configurations and data from different domains. The evaluation described in this document is restricted to English resources, but other languages will be evaluated in the near future as well.

We used three different test configurations for each data set:

1) HoG configured to provide the deepest result possible for each sentence. PET uses both

Part-of-Speech tags delivered by TnT and Named Entities detected by Sprout as input and

delivers partial parses in case no spanning analysis is available.

2) HoG configured to provide the deepest result possible for each sentence. PET uses both

Part-of-Speech tags delivered by TnT and Named Entities detected by Sprout as input, but

does not deliver partial parses.

3) HoG configured to provide only complete analyses from PET and RASP. No information

from shallower modules (TnT, Sprout) is used.

2.1.1 PASCAL Data

The training data of the PASCAL task contains declarative sentences of various domains.

581 test sentences of the PASCAL training corpus were sent to HoG using configuration 1

as described above. The following table shows the coverage of PET and Rasp.

# sentences    # PET results    # Rasp results    # results
581            442              139               581
100%           76.06%           23.92%            100%

The same 581 test sentences of the PASCAL training corpus sent to HoG using

configuration 2 delivered these results:

# sentences    # spanning PET results    # Rasp results    # results
581            134                       447               581
100%           23.06%                    76.94%            100%

For configuration 3, the results are as follows:

# sentences    # PET results    # spanning PET results    # Rasp results    # results
581            37               14                        544               581
100%           6.37%            2.41%                     93.63%            100%


We get annotations for all sentences, from either PET or Rasp. Rasp is very robust and able to give analyses for all sentences for which PET does not give an analysis. In this domain, which is quite diverse in lexical choices, it can be clearly shown that the usage of default lexicon entries for recognized parts of speech and named entities heavily influences the performance of the deep linguistic processing. Without these, PET delivers results for only 6.37% of the sentences and spanning results for only 2.41%, while with this input the coverage of PET rises to 76.06% for partial analyses and 23.06% for spanning results.

2.1.2 Mobile Phone Corpus

692 sentences (many of them fragments and lists) of mobile phone descriptions from the internet were sent to HoG using configuration 1.

# sentences    # PET results    # Rasp results    # results
692            631              61                692
100%           91.18%           8.82%             100%

The same 692 sentences of mobile phone descriptions were sent to HoG using configuration 2.

# sentences    # spanning PET results    # Rasp results    # results
692            140                       552               692
100%           20.23%                    79.77%            100%

The same 692 sentences of mobile phone descriptions were sent to HoG using configuration 3.

# sentences    # PET results    # spanning PET results    # Rasp results    # results
692            296              65                        396               692
100%           42.77%           9.39%                     57.23%            100%

The lexicons were tuned to this domain, so we get more PET results than in the former domain, both with and without default lexicon entries. Still, it can be shown that the usage of POS and NER information from shallower modules increases the performance of PET considerably, from 9.39% to 20.23% spanning results. The data contains many lists and tables that the deep HPSG grammar is not well prepared for. This shows how the overall processing gains from being able to fall back to partial parses or (underspecified) Rasp results.


2.1.3 Newspaper Corpus

48 sentences of a (business news) article in the San Francisco Chronicle from 2004-07-27 (EN) were sent to HoG using configuration 1.

# sentences    # PET results    # Rasp results    # results
48             31               17                48
100%           64.58%           35.42%            100%

The same 48 sentences were sent to HoG using configuration 2.

# sentences    # spanning PET results    # Rasp results    # results
48             6                         42                48
100%           12.50%                    87.50%            100%

The same 48 sentences were sent to HoG using configuration 3.

# sentences    # PET results    # spanning PET results    # Rasp results    # results
48             5                1                         43                48
100%           10.42%           2.08%                     89.58%            100%

This text is completely outside the training domain and therefore clearly shows the effect of default lexicon entries.

2.1.4 All corpora

Tables 1, 2 and 3 show the three corpora and the overall coverage summed over all three.

Table 1 shows the results using configuration 1.

              # sentences    # PET results    # RASP results
Pascal        581            442              139
Mobile Phone  692            631              61
Newspaper     48             31               17
All           1321           1104             217
              100%           83.57%           16.43%

Table 2 shows the results using configuration 2.


              # sentences    # PET results    # Rasp results
Pascal        581            134              447
Mobile Phone  692            140              552
Newspaper     48             6                42
All           1321           280              1041
              100%           21.20%           78.80%

Table 3 shows the results using configuration 3.

              # sentences    # PET results    # spanning PET results    # RASP results
Pascal        581            37               14                        544
Mobile Phone  692            296              65                        396
Newspaper     48             5                1                         43
All           1321           338              80                        983
              100%           25.59%           6.06%                     74.41%
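For reference, the figures in the "All" rows are simply the per-corpus counts summed and divided by the total number of sentences; the short sketch below reproduces the overall percentages of Table 1 (configuration 1):

# Reproducing the overall coverage figures of Table 1 (configuration 1).
corpora = {
    "Pascal": (581, 442, 139),        # (sentences, PET results, RASP results)
    "Mobile Phone": (692, 631, 61),
    "Newspaper": (48, 31, 17),
}
sents = sum(s for s, _, _ in corpora.values())
pet = sum(p for _, p, _ in corpora.values())
rasp = sum(r for _, _, r in corpora.values())
print(sents, pet, rasp)                                        # 1321 1104 217
print(f"{100 * pet / sents:.2f}% {100 * rasp / sents:.2f}%")   # 83.57% 16.43%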

2.1.5 Conclusions

First of all, the strategy of using the deepest available result delivered by the Heart of Gold core architecture guarantees results for all sentences in different domains. These results are comparable and compatible with each other, because they are formulated in the same framework, RMRS. It therefore seems very useful to combine very robust modules like Rasp with deeper modules like HPSG processing.

In different domains, closer to and farther away from the development domain in lexicon as well as syntactic structure, it could be shown that the depth of results increases enormously when the results of POS tagging and named-entity recognition are used in deep linguistic processing. Over all domains, spanning HPSG (PET) coverage increased from 6.06% to 21.20%.

2.2 HoG and German

The large-scale German HPSG grammar (Müller & Kasper 2000; Crysmann 2003, to appear) developed at DFKI has been integrated into the DeepThought architecture HoG during the 2nd quarter of 2004. The main task of this integration effort was the adaptation of the semantic output to current (R)MRS standards. Furthermore, interface types and mappings have been provided to integrate shallow NLP analyses into the deep parser, thereby ensuring robustness (for further integration scenarios see Frank et al. 2003). Currently, NER output from Sprout and POS information from TnT are used to address the unknown-word problem.
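The following sketch illustrates the general idea behind this unknown-word handling; the tag-to-type mapping, the entry names and the function are invented for illustration and do not reflect the actual interface types defined in the grammar or the HoG configuration:

# Illustrative only: mapping TnT POS tags and Sprout NE classes to generic
# lexical entry types for tokens that are missing from the deep HPSG lexicon.
POS_DEFAULTS = {"NN": "generic_noun_le", "NE": "generic_name_le",
                "ADJA": "generic_adj_le", "VVFIN": "generic_verb_le"}
NER_DEFAULTS = {"PERSON": "generic_person_name_le", "LOCATION": "generic_location_name_le"}

def default_entries(tokens, pos_tags, ne_labels, lexicon):
    """Return (token, lexical_type) pairs for tokens unknown to the deep lexicon."""
    entries = []
    for token, pos, ne in zip(tokens, pos_tags, ne_labels):
        if token in lexicon:
            continue                            # known words keep their hand-crafted entries
        if ne in NER_DEFAULTS:                  # named-entity information takes precedence
            entries.append((token, NER_DEFAULTS[ne]))
        elif pos in POS_DEFAULTS:
            entries.append((token, POS_DEFAULTS[pos]))
    return entries

print(default_entries(["Das", "Blixofon", "klingelt"],
                      ["ART", "NN", "VVFIN"],
                      [None, None, None],
                      lexicon={"Das", "klingelt"}))
# [('Blixofon', 'generic_noun_le')]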

In order to assess the gains in robustness offered by the integrated deep-shallow processing

adopted by HoG, we ran an experiment on unseen data, measuring the coverage obtained

with and without deep-shallow integration.


As test data, we used the 200 questions from the German section of the CLEF 2003 multilingual QA competition. The corpus was parsed both by a stand-alone cheap parser and by the version integrated into HoG. Additionally, we provide figures derived from a mock-up experiment performed in the context of the DFKI project QUETAL, where NEs had been manually replaced with dummy strings directly corresponding to special lexical entries in the German HPSG.

The stand-alone system (baseline) was able to deliver a full parse for only 34 sentences (17%). Inspection of the error log revealed that the most common source of parse failure was lexical in nature: in 77.5% of the input sentences, at least one lexical item was unknown. Abstracting away from the problems of lexical coverage, syntactic coverage was around 80% (34/42), although these figures are certainly not reliable, owing to the size of the data set.

Deep-shallow integration drastically improves on these figures: by feeding NER and POS tag

information into the deep parser, coverage goes up to 73% (146/200), a figure comparable to

those achieved on corpora for which the grammar had been optimised (e.g. Verbmobil data:

VM-CD01: 74.1%; VM-CD15: 78.4%). We conjecture that even better results may be

obtained by improving the NER component: given that 80% of the 200 CLEF questions

contain at least one NE, it is somewhat surprising that NER only provided partial results for

another 8 test items, which amounts to 4% of the entire test suite.

The results obtained by the German HoG also compare well to the aforementioned mock-up experiment. Manual substitution of NEs resulted in an overall coverage of 56.5% (113/200). Owing to the fact that substitution was restricted to NEs, lexical coverage was still an issue: 60 of the 200 sentences (30%) failed to parse for lexical reasons. Relative to the 140 sentences without lexical errors, we measured a syntactic coverage of around 80%.

To conclude, the integrated shallow-deep approach embodied by HoG, and, most notably, the

combination of NER and POS mappings, proves to be highly successful in improving the

robustness of the deep parser.


3 Evaluation of the Business Intelligence Application

In the Business Intelligence Application, the deep analysis plays the role of "refining" the data provided by the Sophia 2.1 shallow parser. More precisely, the Sophia 2.1 shallow parser extracts Opinion Templates from the corpus texts, to which the relevant text segments are attached as opinion snippets; those text portions are then analysed by the deep grammar, in order to allow the template to be either validated, refined or filtered out.

As a Web source for selecting and collecting relevant texts, an Italian forum in the domain of mobile phones and telecommunications has been chosen (namely www.cellulare.it). As the evaluation had to be performed manually, the data set (the forum messages) used for the evaluation was rather small, as the following figure illustrates:

Figure: From 788 forum messages, shallow parsing produced 306 Sophia templates (182 Opinion Templates and 124 Question Templates); each opinion snippet is passed from Sophia to the HPSG grammar and is either validated or filtered out.

The 182 Opinion Templates (143 with a negative polarity, 39 with a positive one) have been taken into account for the evaluation.

Here we give an example of a forum message, the corresponding opinion template and the opinion snippet (XML format) extracted by Sophia 2.1:

(Subject)

java su accompli008

(Text Corpus)

le applicazioni j2me sul motorola accompli non funzionano... forse devo configurare qualcosa? se qualcuno mi può aiutare ringrazio anticipatamente.

(java on accompli008 - the j2me applications on the motorola accompli do not work... maybe I have to configure something? if someone can help me, I thank you in advance.)



Opinion template:

<NLFDoc> <Info id="5" sourceID=" 70852.txt" /> <Maps>

<NLF> <Pred attr="model">

<Val>accompli008</Val> </Pred> <Pred attr="type">

<Val>phone_name</Val> </Pred>

</NLF> <NLF>

<Pred attr="brString"> <Val>l motorola</Val>

</Pred> <Pred attr="brValue">

<Val>BR_MOTOROLA</Val> </Pred> <Pred attr="opValue">

<Val>NEGATIVE</Val> </Pred> <Pred attr="predString">

<Val>non funzionano</Val> </Pred> <Pred attr="type">

<Val>ENT_OP</Val> </Pred>

</NLF> </Maps>

</NLFDoc>

Opinion snippet:

<opinionSt end="90" hasSnippet="true" start="56"> <opinion end="90" pred="-1" start="76">non funzionano</opinion> <entity end="" hasSnippet="false" start="" term="" /> <model brand="BR_MOTOROLA" model="accompli" /> le applicazioni j2me sul motorola accompli non funzionano

</opinionSt>

In our evaluation we can compare two different result sets using two different levels of

processing:

• Shallow analysis only, using Sophia 2.1

• Deep analysis of the textual part of Sophia 2.1 output (the opinion snippets) with the HPSG

Italian grammar


Some preliminary remarks about the results of the shallow parsing are needed. We decided to use a "relaxed" configuration of Sophia 2.1, which shows a good recall value of 0.82 and, almost as a consequence, a rather low precision of 0.37:

              Precision    Recall    F-measure
Sophia 2.1    0.37         0.82      0.51

As a consequence of the way we designed the interaction between shallow and deep parsing, the recall value cannot be affected by the deep analysis results, whereas the precision value should improve, since (some of) the incorrect opinion templates are filtered out. Consequently, the F-measure should also increase.
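For reference, the F-measure above is the harmonic mean of precision and recall; the sketch below reproduces the Sophia 2.1 value and shows the expected effect of filtering out false positives (the "after filtering" precision of 0.50 is a purely hypothetical value, not a measured result):

# F-measure as the harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.37, 0.82), 2))   # 0.51 -- the Sophia 2.1 baseline above
# If deep filtering removes only false positives, recall stays at 0.82 while
# precision rises; a hypothetical precision of 0.50 would then give:
print(round(f_measure(0.50, 0.82), 2))   # 0.62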


Deep Analysis Results (182 templates)

The deep analysis of the 182 opinion snippets substantially confirmed the results of the shallow parser in 79 cases; in 15 of these 79 cases the templates were enriched with additional (more detailed) information (e.g., a generic problem concerning a given phone model was specified as a net connection problem by the deep analysis, and so on).

54 templates, on the other hand, were rejected as incorrect, because no connection between the predicate (i.e. the word(s) indicating the opinion) and the entity (a keyword of the mobile phone domain) in the template produced by Sophia was found in the RMRS structure resulting from the deep analysis.
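Schematically, this validation test can be thought of as a connectivity check over the RMRS, as in the sketch below (the RMRS is reduced here to a mapping from predications to the variables they mention; the function and the toy structure are illustrative assumptions, not the application's actual code):

# Illustrative sketch: is the opinion predicate connected to the domain entity
# through shared argument variables in an RMRS-like structure?
from collections import defaultdict, deque

def connected(rmrs, pred_a, pred_b):
    """rmrs: dict mapping a predication to the set of variables it mentions."""
    var_to_preds = defaultdict(set)
    for pred, variables in rmrs.items():
        for v in variables:
            var_to_preds[v].add(pred)
    seen, queue = {pred_a}, deque([pred_a])
    while queue:
        current = queue.popleft()
        if current == pred_b:
            return True
        for v in rmrs[current]:
            for neighbour in var_to_preds[v] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

# toy structure loosely based on the "un problema con gli infrarossi" example below
rmrs = {"_problema_n": {"x5", "e11"},
        "_con_p": {"e11", "x14"},
        "noun_name_rel": {"x14"},
        "_riscontrare_v": {"e4", "x6", "x5"}}
print(connected(rmrs, "_problema_n", "noun_name_rel"))   # True -> template validated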

Finally, the deep analysis of 49 input texts did not succeed, either because the original text was ill-formed (mainly due to bad punctuation) or because it included one or more linguistic phenomena not covered by the current Italian grammar.

In the following paragraphs we show examples of the different possible results produced by deep processing: two examples of validation (a simple one and one producing an enriched template), one example of a filtered template, and two snippets that the grammar was not able to parse. For each example, we show the original text snippet, the template produced by Sophia, and the RMRS structure resulting from the deep grammatical analysis.

Figure: Breakdown of the deep analysis results for the 182 templates: 79 validated (43.4%), of which 64 simple validations (35.2%) and 15 enriched validations (8.2%); 54 filtered (29.7%); 49 not parsed (26.9%), of which 24 due to lack of punctuation (13.2%) and 25 outside grammatical coverage (13.7%).


(simple) validation example

ho riscontrato un problema con gli infrarossi del sharp gx20.
(I registered a problem with the infrareds of the Sharp gx20)

h1
_avere_v_aux1(h1,e2:) ARG1(h1,h3:)
_riscontrare_v(h3,e4:) ARG1(h3,x6:) ARG2(h3,x5:)
_undef_q_article(h7,x5:) BODY(h7,h9:) RSTR(h7,h8:) qeq(h8:,h26)
_problema_n(h10,x5:) ARG1(h10,e11:)
_con_p(h12,e11:) ARG2(h12,x14:)
_def_q_article(h15,x14:) BODY(h15,h17:) RSTR(h15,h16:) qeq(h16:,h18)
noun_name_rel(h18,x14:) CARG(h18,infrarossi_nm) ING(h18:,h10001:)
_di_p(h10001,e20:) ARG1(h10001,x14:) ARG2(h10001,x19:)
_def_q_article(h21,x19:) BODY(h21,h23:) RSTR(h21,h22:) qeq(h22:,h24)
brand_rel(h24,x19:) CARG(h24,sharp_n) ING(h24:,h10002:)
model_rel(h10002,x25:) CARG(h10002,gx20_n)


<opinionSt end="90" hasSnippet="true" start="50"> <opinion confidence="1" end="61" hasSnippet="true" pred="-1" start="50">un

problema</opinion> <entity end="" hasSnippet="false " start="" term="" /> <model brand="BR_SHARP" model="gx20" /> un problema con gli infrarossi del sharp

</opinionSt>

- <NLFDoc> <Info id="1" sourceID=" 70824.txt" /> - <Maps>

- <NLF> - <Pred attr="brString">

<Val>l sharp</Val> </Pred> - <Pred attr="brValue">

<Val>BR_SHARP</Val> </Pred> - <Pred attr="model">

<Val>gx20</Val> </Pred> - <Pred attr="opValue">

<Val>NEGATIVE</Val> </Pred> - <Pred attr="predString">

<Val>un problema</Val> </Pred> - <Pred attr="type">

<Val>ENT_OP</Val> </Pred>

</NLF> </Maps>

</NLFDoc>


(enriched) validation example

il mio v200 ha un problema di connessione col pc (my v200 has a connection problem with the pc)

h1
_def_q_article(h3,x4:) BODY(h3,h6:) RSTR(h3,h5:) qeq(h5:,h7)
_mio_j(h7,x4:) ARG1(h7,x4:) ING(h7:,h10001:)
model_rel(h10001,x4:) CARG(h10001,v200_n)
_avere_v(h1,e2:) ARG1(h1,x4:) ARG2(h1,x8:)
_undef_q_article(h9,x8:) BODY(h9,h11:) RSTR(h9,h10:) qeq(h10:,h24)
_problema_n(h12,x8:) ARG1(h12,e13:)
_di_p(h14,e13:) ARG2(h14,x16:)
_connessione_n(h17,x16:) ING(h17:,h10002:)
_con_p(h10002,e19:) ARG1(h10002,x16:) ARG2(h10002,x18:)
_def_q_article(h20,x18:) BODY(h20,h22:) RSTR(h20,h21:) qeq(h21:,h23)
_pc_n(h23,x18:)


<opinionSt end="68" hasSnippet="true" start="42"> <opinion confidence="1" end="68" hasSnippet="true" pred="-1" start="57">un problema</opinion> <entity end="" hasSnippet="false " start="" term="" /> <model brand="" model="v200" /> v200 ha un problema </opinionSt>

- <NLFDoc> <Info id="3" sourceID=" 71308.txt" /> - <Maps>

- <NLF> - <Pred attr="brString">

<Val>samsung</Val> </Pred> - <Pred attr="brValue">

<Val>BR_SAMSUNG</Val> </Pred> - <Pred attr="model">

<Val>v200</Val> </Pred> - <Pred attr="type">

<Val>phone_name</Val> </Pred>

</NLF> - <NLF>

- <Pred attr="model"> <Val>v200</Val>

</Pred> - <Pred attr="opValue">

<Val>NEGATIVE</Val> </Pred> - <Pred attr="predString">

<Val>un problema</Val> </Pred> - <Pred attr="type">

<Val>ENT_OP</Val> </Pred>

</NLF> </Maps>

</NLFDoc>


Filtering example

probabilmente hai configurato male il nokia come modem. (probably you have configured the nokia badly as a modem)

h1
_probabilmente_r(h1,e2:) ARG1(h1,h3:) qeq(h3:,h1)
_avere_v_aux1(h10001,e2:) ARG1(h10001,h4:) ING(h10001:,h10002:) ING(h10001,h10003:) ING(h10001,h10004:) ING(h10001,h1:)
_configurare_v(h4,e5:) ARG1(h4,x7:) ARG2(h4,x6:)
_male_r(h10002,e2:) ARG1(h10002,h8:) qeq(h8:,h1)
_def_q_article(h9,x6:) BODY(h9,h10:) RSTR(h9,h8:) qeq(h8,h1)
brand_rel(h10003,x6:) CARG(h10003,nokia_n)
_come_p(h10004,e12:) ARG1(h10004,x6:) ARG2(h10004,x11:)
_modem_n(h13,x11:)


<opinionSt end="380" hasSnippet="true" start="356"> <opinion confidence="1" end="360" hasSnippet="true" pred="-1"

start="356">male</opinion> <entity end="380" hasSnippet="true" start="375"

term="CONNECTIVITY">modem</entity> <model brand="BR_NOKIA" model="" /> male il nokia come modem

</opinionSt>

- <NLFDoc> <Info id="2" sourceID=" 70964.txt" /> - <Maps>

- <NLF> - <Pred attr="brString">

<Val>il nokia</Val> </Pred> - <Pred attr="brValue">

<Val>BR_NOKIA</Val> </Pred> - <Pred attr="entString">

<Val>modem</Val> </Pred> - <Pred attr="entValue">

<Val>CONNECTIVITY</Val> </Pred> - <Pred attr="opValue">

<Val>NEGATIVE</Val> </Pred> - <Pred attr="predString">

<Val>male</Val> </Pred> - <Pred attr="type">

<Val>ENT_OP</Val> </Pred>

</NLF> </Maps>

</NLFDoc>


NotParsed example

(lack of punctuation)

ho due problemi che spero voi possiate risolvermi il primo ho sul mio nokia 7650 il programma di

registrazione video ho inviato un filmato su di un altro nokia 7650 con lo stesso programma ma il file si

chiama video3gp e se provo ad aprirlo mi dice che il formato del file è sconosciuto.

(I have two problems that I hope you can help me to solve first I have on my Nokia 7650 the program for

recording video I sent a movie to another nokia 7650 with the same program but the file is called video3gp

and if I try to open it it says to me that the format file is unknown)

(out of linguistic coverage [e.g. parenthetical clauses] and/or exhausted-number-of-edges)

ieri ho acquistato un nokia n-gage, ho acceso il mio portatile toshiba satellite 5100 503 (già predisposto

bluetooth), installo il software in corredo (pc suite for nokia n-gage) e qui mi fermo, nel senso che poi

non si fa più nulla e una volta attivato il bluetooth sul cell e sul portatile si vedono senza problemi, ma

non con i software nokia.

(Yesterday I’ve bought a nokia n-gage, I turned-on my toshiba satellite 5100 503 (already prearranged for

the bluetooth), I install the equipped software (pc suite for nokia n-gage) and here I stay, in the sense

that you don’t do anything else and once you have activated the bluetooth on the cell phone and on the

pc they are seen without any problem but not with the nokia software)


4 Evaluation of the Auto-Response Application

The component provides information extraction functionality for the following scenarios in the content domain of a mobile phone provider: product ordering, mix-ups in deliveries of products, and replacement of defective products. It takes one or more e-mails (German and/or English) as input and delivers filled scenario templates as output. These templates are the result of several processing steps as described in Deliverable 4.9: named entity recognition, shallow and deep analysis, coreference resolution, mapping of the results of the preceding analysis onto domain-specific templates, and merging operations on the partially filled templates, resulting in filled scenario templates. The final scenario templates are of the following types:

following types:

• Exchange

• Ordering

• Mix-up

In cases where no merging operations can take place, the partially filled templates are presented to the user. A simplified sketch of the merging step is given below.
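The flat dictionary representation and the merge policy in the following sketch are assumptions for illustration and do not reflect the component's actual data structures:

# Illustrative sketch of merging two partially filled scenario templates.
def merge_templates(a, b):
    """Merge two partial templates (dicts); return None if they are incompatible."""
    if a.get("Type") != b.get("Type"):
        return None                     # different scenario types cannot be merged
    merged = dict(a)
    for slot, value in b.items():
        if not value:
            continue                    # empty slot in b: keep whatever a has
        if merged.get(slot) and merged[slot] != value:
            return None                 # conflicting fillers: refuse to merge
        merged[slot] = value
    return merged

partial_1 = {"Type": "Exchange", "Products": ["C35"], "Customer": ""}
partial_2 = {"Type": "Exchange", "Products": [], "Customer": "Terry Severson"}
print(merge_templates(partial_1, partial_2))
# {'Type': 'Exchange', 'Products': ['C35'], 'Customer': 'Terry Severson'}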

4.1 Email corpus

An email corpus was constructed using relevant anonymized customer emails.

Since the evaluation of the component was performed manually, the data set used for the

evaluation was rather limited: 87 emails for German and 84 for English. On average, each

email contains 4 sentences. Hence, the German data set consists of 348 sentences and the

English data set of 336 sentences.

4.2 Template Examples

The following examples show two input e-mails processed by the system and the scenario templates the system delivered as output. The system identifies the customer using the predicate-argument structure from the deep analysis and by doing a domain-specific coreference resolution between certain pronouns and potential antecedents. The assumption here is the following: a person writing an e-mail and referring to herself by "I" or "me" etc. presumably mentions her name either in the complimentary close or in the address part of the e-mail. Products are identified by predicate-argument relations and named-entity recognition. The predicates trigger the process of choosing the correct scenario template.

Below, the output of the system is given for an email asking for the replacement of a defective product and for an email describing a mix-up in the delivery of a product.

Input to the system:

Dear Support-Team!

I just received the Siemens C35 I ordered. But it seems to be broken because it doesn't want to switch

off. I would like to replace this defective phone.

thank you

Terry Severson

7110 Martina Rd

Winona, MN 55987

[email protected]

Output of the system:

Template Id = 967 Weight = 1.0

[Type: Exchange

Products: [Product_0: [Name: C35

Features: defective]]

OrderDate: [ ]

DeliveryDate: [ ]

Provider: [ ]

Customer: Terry Severson]


Input to the system:

Dear team!

I hope you'll help me!

You sent me a 8210 yesterday but it's not what I expected. I ordered a headset for my Siemens S55 .

I don't want this phone and I hope to will deliver my headset soon! Please correct this asap.

Thanks,

Shaun Port

Output of the system:

Template Id = 10512 Weight = 1.0

[Type: MixUp

OrderTemplate: [Type: Order

Products: [Product_0: [

Name: headset

Features: for_S55]]

Customer: Shaun Port

]

DeliveryTemplate: [Type: Delivery

Products: [Product_0: [

Name: 8210]]

Provider: [ ]

Customer: ...]]

4.3 Evaluation types


In our evaluation we compare, for German and English, two different result sets obtained with two different preprocessing levels configured in the HoG. The deep processing used and evaluated in our application corresponds to configuration 2 described in chapter 2. Since the application at hand is meant to be applied in real-world contexts, robustness is a necessary precondition. Using Part-of-Speech tags delivered by TnT and Named Entities detected by Sprout as input guarantees this requirement: Part-of-Speech tags allow for the recognition of unknown words, and Named Entities allow for the recognition of unknown, potentially important named entities. Therefore the application has been evaluated with PET using Part-of-Speech tags and named entities. For German, a chunk tagger is used as the shallow component. For English we use Rasp, a robust statistical parser. Both have been integrated into the HoG described in chapter 2. The configurations used for the evaluation are the following:

1) Only deep analysis as preprocessing

2) Deep and shallow analysis as preprocessing

We measured precision, recall and F-score for the scenario templates delivered by the system by manually comparing them against 'gold standard' template annotations in the email corpus mentioned above.

We did two different types of evaluation: a template-based evaluation and a feature-based evaluation. In the template-based evaluation, a template was judged correct if and only if all required template features were correctly filled and the type of the template was correct. In the feature-based evaluation, all feature values were evaluated separately: each single slot was judged as either correct or false (a small scoring sketch is given after the feature list below). The relevant features for all three template types are the following:

• Template type

• Product list (each product counting singly if more than one)

• Product feature list (each feature counting singly if more than one)

• Customer

• Provider
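The difference between the two evaluation modes can be sketched as follows, assuming (purely for illustration) that templates are represented as flat slot/value dictionaries; this is not the evaluation code actually used:

# Illustrative sketch of template-based vs. feature-based scoring against a gold template.
def template_correct(system, gold):
    """Template-based: correct only if the type and every required slot match."""
    return all(system.get(slot) == value for slot, value in gold.items())

def feature_scores(system, gold):
    """Feature-based: each slot is counted separately as correct or false."""
    correct = sum(1 for slot, value in gold.items() if system.get(slot) == value)
    return correct, len(gold)

gold = {"Type": "Exchange", "Product": "C35", "Features": "defective",
        "Customer": "Terry Severson"}
system = {"Type": "Exchange", "Product": "C35", "Features": "defective",
          "Customer": ""}               # the customer slot was missed
print(template_correct(system, gold))   # False -- the whole template counts as wrong
print(feature_scores(system, gold))     # (3, 4) -- three of four slots correct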


4.4 Accuracy of the German Prototype

4.4.1 First experiment

Our first experiment shows figures for the combined setting, using mainly deep analysis and, as a fall-back solution, shallow processing (configuration 2 of section 4.3).

                             Precision    Recall     F-Score
Template-based evaluation    50.35 %      45.74 %    47.93 %
Feature-based evaluation     60.46 %      56.38 %    58.34 %

Table 1: Precision, recall and F-score for the first experiment (deep and shallow analysis).

4.4.2 Second Experiment

The second experiment uses only deep analysis as preprocessing (configuration 1 of section 4.3).

                             Precision    Recall     F-Score
Template-based evaluation    62.25 %      36.95 %    46.30 %
Feature-based evaluation     68.43 %      46.65 %    54.76 %

Table 2: Precision, recall and F-score for the second experiment (only deep analysis).

4.5 Accuracy of the Prototype for English

4.5.1 First Experiment

The first experiment shows figures for the combined setting, using mainly deep analysis and, as a fall-back solution, shallow processing (configuration 2 of section 4.3).

                             Precision    Recall     F-Score
Template-based evaluation    57.25 %      30.58 %    39.86 %
Feature-based evaluation     83.19 %      47.13 %    60.17 %


Table 3: Precision, recall and F-score for the first experiment (deep and shallow analysis).

4.5.2 Second Experiment

The second experiment uses only deep analysis as preprocessing (configuration 1 of section 4.3).

                             Precision    Recall     F-Score
Template-based evaluation    48.13 %      38.52 %    42.70 %
Feature-based evaluation     75.45 %      61.17 %    67.56 %

Table 4: Precision, recall and F-score for the second experiment (only deep analysis).

4.5.3 Conclusion

Naturally, precision and recall values for feature-based evaluation are always higher than those for template-based evaluation. This is due to the fact that in many cases templates contain only one or two incorrect feature values; in the template-based evaluation these templates are regarded as completely incorrect. This explains the difference between accuracy values for whole templates and for individual feature values, respectively.

As expected, the precision when using only deep analysis is higher than the precision when combining deep and shallow analysis, whereas the higher F-score in the first experiment for both evaluation types indicates that the combined approach delivers better results overall.

The usage of shallow preprocessing mainly supports the identification of ordering templates. This is due to the difficulty of recognizing templates of type "Exchange" or "Mix-up" when using only shallow processing. In these cases, accurate recognition of predicate-argument structure is a necessary precondition for making the following decisions:

• What are the features of the product?

• Which product has been ordered and which product must be replaced?

Moreover, without deep analysis the relevant agreement features are not available for the domain-specific coreference resolution between pronouns and customer names, or between nouns and product named entities, as potential antecedents.


On the other hand, shallow processing does deliver correct templates in some cases (mainly "Order" templates) for which the deep analysis does not provide a template at all. The example below illustrates one such case for a German email.

Input to the system:

Ich suche einen neuen Handy-Vertrag, können Sie mir bitte Ihre Angebote zuschicken? Danke. Peter Janke Email: [email protected] Adresse: Halbergstrasse 57 66111 Saarbrücken

(I am looking for a new mobile phone contract, could you please send me your offers? Thank you. Peter Janke, Email: [email protected], Address: Halbergstrasse 57, 66111 Saarbrücken)

Output of the system:

Template Id = 25577 Weight = 1.0

[Type: Order

Products: [Product_0: [

Name: Handy-Vertrag

Features: neuen]]

OrderDate: [ ]

Customer: Ich ]

Template Id = 25563 Weight = 1.0

[Type: Order

Products: [Product_0: [

Name: Angebote]]

OrderDate: [ ]

Customer: mir ]


Therefore, the benefits of combining both approaches are evident, as the figures in Tables 1-4 suggest.

5 Travel Information Application

An application using the Norwegian grammar is also under development, aimed at extracting information from hiking route descriptions and supplying it to a web portal. The HoG machinery produces RMRSes, which are mapped onto standardized information matrices. The input grammar has a specially developed semantics coping with aspects of paths and movement (extending the core grammar originally produced).

Documentation of the system will be available by the end of the project, but an evaluation would be premature at the present point.

6 Concertation Plan

The core linguistic machinery (HoG) developed in DeepThought will be employed in the following two projects that have been applied for, with the outcome to be known in early December:

GERONIMO: development of the Norwegian HPSG grammar in the domains of e-learning, translation and information extraction, making crucial use of HoG.

WebSEMIOTICS: development of tools for web information extraction with a natural language interface, using HoG as one component.

Both projects will run at NTNU.
