Overview of the Multilingual Question Answering Track


Page 1: Overview of the  Multilingual Question Answering Track

QA@CLEF 2006 Workshop, Alicante, September 22, 2006

Overview of the Multilingual Question Answering Track

Danilo Giampiccolo

Page 2: Overview of the  Multilingual Question Answering Track


Outline

Tasks
Test set preparation
Participants
Evaluation
Results
Final considerations
Future perspectives

Page 3: Overview of the  Multilingual Question Answering Track


QA 2006: Organizing Committee

ITC-irst (Bernardo Magnini): main coordinator
CELCT (D. Giampiccolo, P. Forner): general coordination, Italian
DFKI (B. Sacaleanu): German
ELDA/ELRA (C. Ayache): French
Linguateca (P. Rocha): Portuguese
UNED (A. Peñas): Spanish
U. Amsterdam (Valentin Jijkoun): Dutch
U. Limerick (R. Sutcliffe): English
Bulgarian Academy of Sciences (P. Osenova): Bulgarian

Only Source Languages:
♦ Depok University of Indonesia (M. Adriani): Indonesian
♦ IASI, Romania (D. Cristea): Romanian
♦ Wrocław University of Technology (J. Pietraszko): Polish

Page 4: Overview of the  Multilingual Question Answering Track


QA@CLEF-06: Tasks

Main task:
♦ Monolingual: the language of the question (Source language) and the language of the news collection (Target language) are the same
♦ Cross-lingual: the questions were formulated in a language different from that of the news collection

One pilot task:
♦ WiQA: coordinated by Maarten de Rijke

Two exercises:
♦ Answer Validation Exercise (AVE): coordinated by Anselmo Peñas
♦ Real Time: a "time-constrained" QA exercise coordinated by the University of Alicante (Fernando Llopis)

Page 5: Overview of the  Multilingual Question Answering Track


Data set: Question format

200 questions of three kinds:
FACTOID (loc, mea, org, oth, per, tim; ca. 150): What party did Hitler belong to?
DEFINITION (ca. 40): Who is Josef Paul Kleihues?
♦ reduced in number (-25%)
♦ two new categories added:
– Object: What is a router?
– Other: What is a tsunami?
LIST (ca. 10): Name works by Tolstoy

♦ Temporally restricted (ca. 40): by date, by period, by event
♦ NIL (ca. 20): questions that do not have any known answer in the target document collection

Input format: question type (F, D, L) not indicated
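To make the composition concrete, here is a minimal sketch of a test-set entry in Python (the field names are illustrative only; the official CLEF input format delivers the question text without the F/D/L label):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    qid: str    # question identifier
    text: str   # e.g. "What party did Hitler belong to?"
    # Known to the organizers but NOT given in the input format:
    qtype: str  # "F" (factoid), "D" (definition), "L" (list)
    answer_type: Optional[str] = None  # factoid category: loc, mea, org, oth, per, tim
    temporal_restriction: Optional[str] = None  # by date, by period, or by event
    nil: bool = False  # True if the collection contains no known answer

# Rough composition of the 200-question set described above:
# ca. 150 factoid, ca. 40 definition, ca. 10 list questions;
# ca. 40 temporally restricted and ca. 20 NIL questions among them.
```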


Page 6: Overview of the  Multilingual Question Answering Track


Data set: run format

Multiple answers: from one to ten exact answers per question
♦ exact = neither more nor less than the information required
♦ each answer has to be supported by:
– docid
– one to ten text snippets justifying the answer (substrings of the specified document giving the actual context)
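The run-format constraints above fit in a few lines of code. A minimal sketch in Python (the record structure and function name are assumptions, not the official run syntax):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Answer:
    text: str      # the exact answer string
    docid: str     # document supporting the answer
    snippets: List[str] = field(default_factory=list)  # substrings of docid

def check_answers(answers: List[Answer]) -> None:
    """Enforce the 2006 run-format constraints described above."""
    if not 1 <= len(answers) <= 10:
        raise ValueError("each question takes 1 to 10 answers")
    for a in answers:
        if not a.docid:
            raise ValueError("every answer must cite a supporting docid")
        if not 1 <= len(a.snippets) <= 10:
            raise ValueError("each answer needs 1 to 10 justifying snippets")
```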

Page 7: Overview of the  Multilingual Question Answering Track


Activated Tasks (at least one registered participant)

[Matrix of activated source-target pairs: source languages (BG, DE, EN, ES, FR, IN, IT, NL, PT, PL, RO) against target languages (BG, DE, EN, ES, FR, IT, NL, PT); the individual cells are not recoverable from the transcript.]

11 Source languages (10 in 2005)
8 Target languages (9 in 2005)
No Finnish task / New languages: Polish and Romanian

Page 8: Overview of the  Multilingual Question Answering Track


Activated Tasks

            MONOLINGUAL   CROSS-LINGUAL   TOTAL
CLEF 2003        3              5            8
CLEF 2004        6             13           19
CLEF 2005        8             15           23
CLEF 2006        7             17           24

Questions were not translated into all the languages. Gold Standard: questions in multiple languages only for tasks where there was at least one registered participant.


More interest in cross-linguality

Page 9: Overview of the  Multilingual Question Answering Track


Participants

            America  Europe  Asia  TOTAL        Registered  Newcomers  Veterans  Absent veterans
CLEF 2003      3        5     -      8               -          -          -           -
CLEF 2004      1       17     -     18 (+125%)      22         13          5           3
CLEF 2005      1       22     1     24 (+33%)       27          9         15           4
CLEF 2006      4       24     2     30 (+25%)       36         10         20           4

Page 10: Overview of the  Multilingual Question Answering Track


List of participants

ACRONYM NAME COUNTRY

SYNAPSE SYNAPSE Développement France

Ling-Comp U.Rome-La Sapienza Italy

Alicante U.Alicante- Informatica Spain

Hagen U.Hagen-Informatics Germany

Daedalus Daedalus Consortium Spain

Jaen U.Jaen-Intell.Systems Spain

ISLA U.Amsterdam Netherlands

INAOE Inst.Astrophysics,Optics&Electronics Mexico

DEPOK U.Indonesia-Comp.Sci. Indonesia

DFKI DFKI-Lang.Tech. Germany

FURUI Lab. Tokyo Inst Technology Japan

Linguateca Linguateca-Sintef Norway

LIC2M-CEA Centre CEA Saclay France

LINA U.Nantes-LINA France

Priberam Priberam Informatica Portugal

U.Porto U.Porto- AI Portugal

U.Groningen U.Groningen-Letters Netherlands


Lab.Inf.D'Avignon Lab.Inf. D'Avignon France

U.Sao Paulo U.Sao Paulo – Math Brazil

Vanguard Vanguard Engineering Mexico

LCC Language Comp. Corp. USA

UAIC U. "Al.I. Cuza" Iasi Romania

Wroclaw U. Wroclaw U. of Technology Poland

RFIA-UPV Univ.Politècnica de Valencia Spain

LIMSI CNRS Lab-Orsay Cedex France

U.Stuttgart U.Stuttgart-NLP Germany

ITC ITC-irst Italy

JRC-ISPRA Institute for the Protection and the Security of the Citizen Italy

BTB BulTreeBank Project Bulgaria

dltg University of Limerick Ireland


Page 11: Overview of the  Multilingual Question Answering Track


Submitted runs

            Submitted runs   Monolingual   Cross-lingual
CLEF 2003        17                6             11
CLEF 2004    48 (+182%)           20             28
CLEF 2005    67 (+39.5%)          43             24
CLEF 2006    77 (+13%)            42             35

Page 12: Overview of the  Multilingual Question Answering Track


Number of answers and snippets per question

Number of RUNS with respect to number of answers:
– 1 answer: 44%
– more than 5 answers: 25%
– between 2 and 5 answers: 31%

Number of SNIPPETS for each answer:
– 1 snippet: 74%
– 2 snippets: 21%
– 3 snippets: 4%
– > 4 snippets: 1%

Page 13: Overview of the  Multilingual Question Answering Track


Evaluation

As in previous campaigns:
♦ runs manually judged by native speakers
♦ each answer: Right, Wrong, ineXact, Unsupported
♦ up to two runs for each participating group

Evaluation measures:
♦ Accuracy (for F, D): main evaluation score, calculated for the FIRST ANSWER only
• excessive workload: for some languages the assessors could manually judge only a limited number of answers per question:
– 1 answer: Spanish and English
– 3 answers: French
– 5 answers: Dutch
– all answers: Italian, German, Portuguese
♦ P@N for List questions

Additional evaluation measures:
♦ K1 measure
♦ Confidence Weighted Score (CWS)
♦ Mean Reciprocal Rank (MRR)
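For reference, the listed measures can be sketched as follows. Accuracy and MRR are standard; the CWS and K1 formulas follow the definitions used in earlier QA campaigns, so this is an approximation rather than the official scorer:

```python
from typing import List

def accuracy(first_answer_right: List[bool]) -> float:
    """Fraction of questions whose FIRST answer was judged Right."""
    return sum(first_answer_right) / len(first_answer_right)

def mrr(first_right_rank: List[int]) -> float:
    """Mean Reciprocal Rank: per question, the rank of the first
    Right answer (0 if no Right answer was returned)."""
    return sum(1.0 / r for r in first_right_rank if r > 0) / len(first_right_rank)

def cws(right_by_confidence: List[bool]) -> float:
    """Confidence Weighted Score: questions sorted by decreasing
    system confidence; Right answers ranked early weigh more."""
    n = len(right_by_confidence)
    correct_so_far, total = 0, 0.0
    for i, right in enumerate(right_by_confidence, start=1):
        correct_so_far += right
        total += correct_so_far / i
    return total / n

def k1(confidences: List[float], right: List[bool]) -> float:
    """K1 measure: self-reported confidence in [0, 1] scores +conf
    for a Right answer and -conf for a Wrong one, averaged."""
    return sum(c if r else -c for c, r in zip(confidences, right)) / len(right)
```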


Page 14: Overview of the  Multilingual Question Answering Track


Question Overlapping among Languages 2005-2006

[Bar chart: number of questions shared by 1 to 9 languages, 2005 vs. 2006; y-axis 0-450 questions.]

Page 15: Overview of the  Multilingual Question Answering Track


Results: Best and Average scores

[Bar chart: best and average scores (accuracy, %) for monolingual and cross-lingual tasks, CLEF 2003 to CLEF 2006; y-axis 0-100. One 2006 result is marked: * This result is still under validation.]

Page 16: Overview of the  Multilingual Question Answering Track


Best results in 2004-2005-2006

[Bar chart: best results per target language (Bulgarian, German, English, Spanish, French, Italian, Dutch, Portuguese) for 2004, 2005 and 2006; y-axis 0-100. One result is marked: * This result is still under validation.]

Page 17: Overview of the  Multilingual Question Answering Track


Participants in 2004-2005-2006: compared best results

[Bar chart: compared best results in 2004, 2005 and 2006 for returning participants (DFKI, HAGEN, ALICANTE, INAOE, DAEDALUS, TALP, U.VALENCIA, ITC-irst, U.LIMERICK, GRONINGEN, LIMSI, LINGUATECA, PRIBERAM, LIC2M-CEA, LINA, SYNAPSE, U.INDONESIA, BTB); y-axis 0-80.]

Page 18: Overview of the  Multilingual Question Answering Track


List questions

Best: 0.8333 (Priberam, Monolingual PT) Average: 0.138

Problems

Wrong classification of List questions in the Gold Standard:
♦ "Mention a Chinese writer" is not a List question!

Definition of List questions:
♦ "closed" List questions asking for a finite number of answers
Q: What are the names of the two lovers from Verona separated by family issues in one of Shakespeare's plays?
A: Romeo and Juliet.
♦ "open" List questions requiring a list of items as answer
Q: Name books by Jules Verne.
A: Around the World in 80 Days.
A: Twenty Thousand Leagues Under the Sea.
A: Journey to the Centre of the Earth.
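List questions were scored with P@N. A minimal sketch of the computation, assuming each returned item has already been judged Right or Wrong (function name is illustrative):

```python
from typing import List

def p_at_n(judgements: List[bool], n: int) -> float:
    """Precision at N: fraction of the first n returned items judged Right."""
    return sum(judgements[:n]) / n

# For the open question "Name books by Jules Verne", a run returning
# three titles, all judged Right, scores p_at_n([True, True, True], 3) == 1.0.
```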

Page 19: Overview of the  Multilingual Question Answering Track


Final considerations

– Increasing interest in multilingual QA:
• More participants (30, +25%)
• Two new source languages (Romanian and Polish)
• More activated tasks (24, vs. 23 in 2005)
• More submitted runs (77, +13%)
• More cross-lingual runs (35, +31.5%)

– Gold Standard: questions not translated into all languages
• No possibility of activating tasks at the last minute
• Useful as a reusable resource: available in the near future

Page 20: Overview of the  Multilingual Question Answering Track


Final considerations: 2006 main task innovations

– Multiple answers:
• good response
• limited capacity of assessing large numbers of answers
• feedback welcome from participants

– Supporting snippets:
• faster evaluation
• feedback welcome from participants

– "F/D/L" labels not given in the input format:
• positive, as apparently this had no real impact on the results

– List questions

Page 21: Overview of the  Multilingual Question Answering Track


Future perspective: main task

For discussion:

Romanian as target

Very hard questions (implying reasoning and multiple document answers)

Allow collaboration among different systems

Partial automated evaluation (right answers)