Yejun Wu & Douglas W. Oard University of Maryland, College Park Ian Soboroff

An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments

Yejun Wu & Douglas W. Oard
University of Maryland, College Park

Ian Soboroff
National Institute of Standards and Technology

July 27-28, 2006 CEAS, Mountain View, CA

Outline

• Build the test collection
• Evaluate the test collection (intrinsic evaluation)
• Use the test collection (extrinsic evaluation)
• Next steps to improve the test collection


W3C Mailing List Corpus

[Diagram: w3c.org; NIST (6/2004); mailing lists ([email protected], …, [email protected]); Webpages; Parsing; 174,311 emails, 515 MB; Unique DocIDs (lists-000-9978864, lists-001-0094883, …, lists-003-9630221)]

IR Test Collection Design

[Diagram labels: Docs, Query Formulation, Automatic Search, Interactive Selection]

Measure system:
2 variations: system, user

Information seeking:
• Documents
• Information needs
• Interactive process

IR Test Collection Design

[Diagram labels: Docs, Topic Statement (by Assessors), Query Formulation, Automatic Search, Ranked Lists, Relevance Judgments (by Assessors), Evaluation, Evaluation Metric (Mean Average Precision)]

Test Collection:
• Documents
• Topic statements
• Relevance judgments
• Metric

Freeze user.

DOCNO="lists-000-9978864"
RECEIVED="Sat Mar 18 08:56:28 2000"
ISORECEIVED="20000318135628"
SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)"
ISOSENT="20000310182629"
NAME="Kerri Golden"
EMAIL="[email protected]"
SUBJECT="RTF Word 2000 spec?"
ID="[email protected]"
EXPIRES="-1"
TO="[email protected]"

We are trying to convert Word 2000 docs to XML.
Our converter worked fine for W97 documents, but W2000 has a much different RTF format (tables especially).
Does anyone know where I can get a hold of a spec for this version of RTF?
thanks
Kerri [email protected]
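A minimal sketch of how the KEY="value" header fields of such a document could be read into a dictionary. The regular expression, the function name `parse_header`, and the sample string (which leaves out the address fields) are illustrative assumptions, not the parser used to build the corpus.

```python
import re

# Hypothetical sample in the KEY="value" layout of the document above
# (address fields left out).
SAMPLE = (
    'DOCNO="lists-000-9978864" '
    'RECEIVED="Sat Mar 18 08:56:28 2000" '
    'ISORECEIVED="20000318135628" '
    'NAME="Kerri Golden" '
    'SUBJECT="RTF Word 2000 spec?" '
    'EXPIRES="-1"'
)

# An upper-case field name, an equals sign, then a double-quoted value.
FIELD = re.compile(r'([A-Z]+)="([^"]*)"')


def parse_header(text):
    """Return a dict of all header fields written as KEY="value"."""
    return {key: value for key, value in FIELD.findall(text)}


header = parse_header(SAMPLE)
print(header["DOCNO"], "|", header["SUBJECT"])
```

The sketch assumes straight double quotes around every value and no quote characters inside a value.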


Topic Statement

TopicID: DS8

Query: html vs. xhtml

Narrative: A relevant message will compare the advantages/disadvantages of the two standards.


Pool Top 50 Docs/Run for Relevance Judgments

• 12 teams × 3 runs = 36 runs
• Top 50 documents (ranks 1, 2, 3, 4, …, 50) pooled from each run
• Average 529 emails/topic
• Researchers as assessors

Team1 Run1: lists-000-9978864, lists-000-7643767, lists-011-6087388, lists-012-1019722, …, lists-008-2365001
Team2 Run3: lists-009-8065221, lists-006-2570023, lists-000-9978864, …, lists-012-2365001, lists-005-5500248
…
Team12 Run2: …

Relevance Judgments
lists-000-9978864   Topic: ,  Pro/Con:
lists-000-7643767   Topic: ,  Pro/Con:
…
lists-008-2365001   Topic: ,  Pro/Con:
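A minimal sketch of depth-50 pooling as described on this slide: the judgment pool for a topic is the union of the top-ranked documents from every submitted run. The function name `build_pool` and the run names are hypothetical; the document IDs are borrowed from the slide for illustration only.

```python
def build_pool(runs, depth=50):
    """Union of the top-`depth` document IDs over all runs for one topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy runs for a single topic (hypothetical run names).
runs = {
    "team1_run1": ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    "team2_run3": ["lists-009-8065221", "lists-006-2570023", "lists-000-9978864"],
}

# A small depth is used here so that the cut-off is visible on toy data.
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Because documents retrieved by several runs are counted once, a pool from 36 runs at depth 50 can be much smaller than 36 × 50 = 1,800 documents, which is consistent with the reported average of 529 emails per topic.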

Use of Test Collection

Measure systems of ranked retrieval

Topic DS1:

Rank  Doc#               Score  Rel?  Prec.
1     lists-000-9743321  0.95   R     1.00
2     lists-000-7456300  0.91   R     1.00
3     lists-001-3400432  0.88
4     lists-002-6590811  0.82
5     lists-004-5566320  0.80   R     0.60
6     lists-009-1349620  0.77
7     lists-011-0383209  0.63   R     0.57
8     lists-005-5201023  0.62
9     lists-007-5610095  0.55
10    lists-002-3204102  0.51   R     0.50
--------------------------------------------
Avg. Prec. (AP): 0.73

Topic  AP (System A)  AP (System B)  A-B
DS1    0.73           0.50           +
DS2    0.45           0.38           +
DS3    0.56           0.36           +
DS4    0.00           0.09           -
DS5    0.13           0.10           +
DS6    1.00           0.83           +
DS7    0.24           0.28           -
DS8    0.47           0.20           +
DS9    0.53           0.41           +
DS10   0.23           0.30           -
--------------------------------------------
MAP    0.43           0.35           N+ = 7, N- = 3

Difference is not significant at the 0.05 level (two-tailed)
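The arithmetic on this slide can be checked with a short sketch. It assumes that the relevant documents for DS1 sit at ranks 1, 2, 5, 7, and 10 (the only placement consistent with the listed precision values 1.00, 1.00, 0.60, 0.57, 0.50), that all relevant documents appear in the list, and that the significance test is a two-tailed sign test, which the N+ and N- counts suggest but the slide does not name. The function names are illustrative.

```python
from math import comb


def average_precision(relevant_ranks):
    """AP from the 1-based ranks of the relevant documents.

    Assumes every relevant document was retrieved, so the mean is
    taken over the relevant documents found in the list.
    """
    ranks = sorted(relevant_ranks)
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / len(precisions)


def sign_test_p(n_plus, n_minus):
    """Two-tailed exact sign test p-value under a fair-coin null."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


ap_ds1 = average_precision([1, 2, 5, 7, 10])

# Per-topic AP values for DS1..DS10 as listed on the slide.
ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]

map_a = sum(ap_a) / len(ap_a)
n_plus = sum(a > b for a, b in zip(ap_a, ap_b))
n_minus = sum(a < b for a, b in zip(ap_a, ap_b))
p_value = sign_test_p(n_plus, n_minus)

print(round(ap_ds1, 2), round(map_a, 2), n_plus, n_minus, round(p_value, 3))
```

Under these assumptions the sign test gives p of about 0.34 for 7 wins against 3 losses, well above 0.05, which agrees with the slide's statement that the difference is not significant.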


Emerging Topic Types

Type/Category: Method, tip, solution

• Example 1:
  Query: Annotea installation
  Narrative: A relevant message will provide at least a tip on Annotea installation.

• Example 2:
  Query: file upload http
  Narrative: A relevant message will discuss methods of doing file uploads using http.


Topic Type Analysis

Find categories amenable to pro/con classification

[Bar chart: Number of Topics in Categories; horizontal axis: number of topics (0 to 30); vertical axis: Category]

A: Comparisons, usefulness, relationships
B: Methods, tips, solutions
C: Discuss an issue
D: Problems, impacts
E: Definitions, functionality
F: Reasons, design rationales

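Measuring Agreement

Chance corrected overlap

              Judge1: R   Judge1: NR   Total
Judge2: R     a           b            a+b
Judge2: NR    c           d            c+d
Total         a+c         b+d          a+b+c+d = N

Cohen’s Kappa:

\kappa = \frac{(a+d) - \frac{(a+b)(a+c)}{N} - \frac{(c+d)(b+d)}{N}}{N - \frac{(a+b)(a+c)}{N} - \frac{(c+d)(b+d)}{N}}

Overlap:

\text{Overlap} = \frac{a}{a+b+c}

Overlap scale: 0 (none) to 1 (perfect)
Kappa scale: -1 (inverse), 0 (chance), 1 (perfect)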
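A minimal sketch of the two agreement measures, using the cell names of the 2×2 table on this slide: a = both judges say relevant, b and c = the judges disagree, d = both say not relevant. Kappa is written here in its standard form (observed minus chance agreement, divided by one minus chance agreement), and overlap is read as a/(a+b+c). The counts in the example are hypothetical.

```python
def overlap(a, b, c):
    """Documents both judges marked relevant over documents either judge marked relevant."""
    return a / (a + b + c)


def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 relevant / not-relevant contingency table.

    Undefined (division by zero) when chance agreement equals 1,
    for example when both judges put every document in the same class.
    """
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts for one topic.
kappa = cohens_kappa(a=20, b=5, c=10, d=65)
shared = overlap(a=20, b=5, c=10)
print(round(kappa, 3), round(shared, 3))
```

Overlap ignores the d cell entirely, so a topic with very few relevant documents can show low overlap even when the judges agree on almost every document; kappa corrects for the agreement expected by chance.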

Assessor Agreement by Category

Correlation between overlap and kappa > 0.9, significant at p<0.01

[Figure: Kappa by category; x-axis: Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; y-axis: Kappa (0 to 0.8); series: pro/con, topical, 3 categ]

[Figure: Overlap by category; x-axis: Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; y-axis: Overlap (0.0 to 1.0); series: pro/con, topical]

Effect of Disagreement on Ranking

\text{Kendall's Tau} = 1 - \frac{\min(\text{pairwise\_adj\_swaps})}{N}

[Diagram: ranking 1 to 5 under the Primary Judge against ranking 1 to 5 under the Secondary Judge; Tau = 1 - 3/5 = 0.4]

W3C                  Tau
Topical relevance    0.763 (significant)
Pro/con relevance    0.776 (significant)

Typical text retrieval: “Identical” if >= 0.9

Important difference in relevance judgment
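A minimal sketch of the rank-correlation computation. It uses the standard Kendall's tau, 1 - 2 × (discordant pairs) / (n(n-1)/2), where the number of discordant pairs equals the minimum number of adjacent swaps needed to turn one ranking into the other. Reading the slide's N as n(n-1)/4 (which is 5 for five items) is an assumption that makes the two forms agree on the 1 - 3/5 = 0.4 example. The secondary ranking below is hypothetical, chosen to need three swaps.

```python
from itertools import combinations


def discordant_pairs(rank_a, rank_b):
    """Count item pairs that the two rankings order differently."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # combinations() yields (x, y) with x ranked above y in rank_a.
    return sum(1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y])


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items (no ties)."""
    n = len(rank_a)
    total_pairs = n * (n - 1) // 2
    return 1 - 2 * discordant_pairs(rank_a, rank_b) / total_pairs


primary = [1, 2, 3, 4, 5]      # system order under the primary judge
secondary = [2, 3, 1, 5, 4]    # hypothetical order under the secondary judge

swaps = discordant_pairs(primary, secondary)
tau = kendall_tau(primary, secondary)
print(swaps, round(tau, 2))
```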

Outline

• Intrinsic evaluation
  -- topic type analysis
  -- inter-assessor agreement analysis
• Extrinsic evaluation:
  -- Use W3C to evaluate a topic & pro/con system


Experiment Design: Round Robin

[Diagram: 48 Training Topics, each with Pro/Con and Non-Pro/Con documents → Pro/Con feature → Top N terms (N=100) → INQUERY Query1 for the Evaluation Topic → Search → Ranked List → Evaluation against the Query relevance set (Relevance Judgments) → MAP; 48-fold Cross-Validation]


Compare Two Systems

Topic Retrieval (Baseline):

Query: 100% topic terms: Browser technology support incompatibility

MAP= 0.2743

Topic + Pro/Con Retrieval (Rocchio):

Query: 30% topic terms: Browser technology support incompatibility

70% pro/con terms: advantage, strength, weakness …

MAP= 0.2857

4.2% relative improvement.

Significant (p<0.05, Wilcoxon signed-rank test)
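A minimal sketch of how a 30% / 70% query could be assembled and how the relative MAP gain follows from the two reported MAP values. Splitting each share evenly across its terms is an assumption made only for illustration; the slides do not say how weight is distributed within each group, and `mix_query` is a hypothetical helper, not an INQUERY function.

```python
def mix_query(topic_terms, procon_terms, topic_share=0.3):
    """Weighted query: `topic_share` of the mass on topic terms, the rest on pro/con terms.

    Assumption: each group's share is divided evenly among its terms.
    """
    weights = {}
    for term in topic_terms:
        weights[term] = weights.get(term, 0.0) + topic_share / len(topic_terms)
    for term in procon_terms:
        weights[term] = weights.get(term, 0.0) + (1 - topic_share) / len(procon_terms)
    return weights


topic_terms = ["browser", "technology", "support", "incompatibility"]
procon_terms = ["advantage", "strength", "weakness"]  # first few pro/con terms only

query = mix_query(topic_terms, procon_terms)

# Relative gain computed from the two rounded MAP values on the slide.
relative_gain = (0.2857 - 0.2743) / 0.2743
print(round(100 * relative_gain, 1))
```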

[Figures: per-topic Difference in Average Precision (y-axis ticks -0.1, 0, 0.1, 0.2), with regions labeled "Topic + Pro/Con System Better" and "Baseline Better". Panels and x-axis topic labels:
All Topics
Topic Type A: 8 48 42 37 43 2 54 20 41 59 49 15 24
Topic Type B: 16 19 21 30 7 47 59 17 10
Topic Type C: 14 27 31 50 26 34 45 25
Topic Type D: 33 44 56
Topic Type E: 4 3 5]

Effects of Topic and Topic Types

• Two-way ANOVA
• Topic difficulty levels: 27 improved, 16 hurt, 7 unused
• Topic types: A, B, C, D, E, F

[Two tables, one for Overlap and one for Kappa; columns: Topic, Topic Type; rows: Pro/Con Relevance Agreement, Topical Relevance Agreement; one cell in each table is marked Sig. (p<0.05)]

Conclusion – Test Collection Evaluation

• Test collection generally useful
• Important differences in judgments
• Relevance judgments could be improved
• Topic type: a factor in agreement on pro/con relevance
• Categories less of a pro/con nature:
  -- B (method, tip, solution): does not lead to pro/con
  -- C (discuss an issue): vague
• Rocchio-style system: 4.2% improvement in MAP
• Major improvements in A and E
• Pro/con relevance judgments useful.


Future Work – Better Test Collection Design

• Balance topic types:
  -- half in A.
  -- F (reason, design rationale): 1 topic.
• Study information needs and search process
• Improve the process
  -- e.g., better defining topics for pro/con
• Use within-category topics for training
  -- examine the quality of training data by category
• Other classification methods: SVM, Naïve Bayes
• Separate models for detecting pros and cons.

THANKS!


Pro/Con Feature Selection

Example term: “advantage”

                Topic1         Topic2        …   Topic48
Pro/Con docs    20 (TF=38+1)   8 (TF=30+1)   …   15 (TF=40+1)
Non Pro/Con     30 (TF=10+1)   5 (TF=28+1)   …   18 (TF=10+1)
Topic Weight    log(20+1)      log(5+1)      …   log(15+1)

Weight(“advantage”) = log21 * log[(39/20) / (11/30)] + log6 * log[(31/8) / (29/5)] + … + log16 * log[(41/15) / (11/18)]

Computed the same way for “strength”, “Microsoft”, “Html”, “opinion”, …

Ranked terms: 1. advantage, 2. strength, 3. weakness, 4. hate, 5. opinion, …, 100. wow

Feature Selection

• Pro/con feature vector term weight (log odds ratio):

\text{Weight} = \sum_{i=1}^{48} \text{TopicWeight}_i \cdot \log\frac{(TF_i^{pos} + 1) / N_i^{pos}}{(TF_i^{neg} + 1) / N_i^{neg}}

\text{TopicWeight}_i = \log[\min(N_i^{pos} + 1,\; N_i^{neg} + 1)]

Pos: Pro/Con relevant documents
Neg: Non Pro/Con relevant documents
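A minimal sketch of the term-weight computation, reading the topic weight as log[min(Npos + 1, Nneg + 1)] and the term weight as the topic-weighted sum of log[((TFpos + 1)/Npos) / ((TFneg + 1)/Nneg)] over training topics. The natural logarithm is an assumption (the slides do not state the base), and only the three topics shown for the term "advantage" are summed here, not all 48.

```python
from math import log


def topic_weight(n_pos, n_neg):
    """log[min(Npos + 1, Nneg + 1)]: topics with few documents in either class count less."""
    return log(min(n_pos + 1, n_neg + 1))


def term_weight(per_topic):
    """Sum over topics of topic weight times the smoothed log ratio of per-document TF.

    `per_topic` holds one (tf_pos, n_pos, tf_neg, n_neg) tuple per training topic.
    """
    total = 0.0
    for tf_pos, n_pos, tf_neg, n_neg in per_topic:
        ratio = ((tf_pos + 1) / n_pos) / ((tf_neg + 1) / n_neg)
        total += topic_weight(n_pos, n_neg) * log(ratio)
    return total


# Counts for "advantage" in Topic1, Topic2, and Topic48 from the worked example.
advantage = [(38, 20, 10, 30), (30, 8, 28, 5), (40, 15, 10, 18)]
weight = term_weight(advantage)
print(weight > 0)
```

In this example Topic2 contributes a negative amount, because "advantage" is relatively more frequent in its non-pro/con documents, while the other two topics contribute positive amounts.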

Rocchio-style Implementation

• Appropriate for topic and pro/con retrieval.
• Baseline classifier to test the utility of test collection
• Expanded query:

Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}

• Q0: initial query; Q1: expanded query.
• Ri: vectors from positive docs
• Si: vectors from negative docs
• \beta, \gamma: parameters
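A minimal sketch of a Rocchio-style query expansion over sparse term vectors: the expanded query is the initial query plus a weighted mean of the positive document vectors minus a weighted mean of the negative ones. The parameter names `beta` and `gamma`, their values, and the toy vectors are assumptions for illustration; the slides do not give the settings used in the experiments.

```python
def rocchio(q0, positives, negatives, beta, gamma):
    """Q1 = Q0 + beta * mean(positive vectors) - gamma * mean(negative vectors).

    Vectors are dicts mapping a term to its weight; missing terms count as 0.
    """
    terms = set(q0)
    for vec in positives + negatives:
        terms.update(vec)

    q1 = {}
    for term in terms:
        pos_mean = (
            sum(v.get(term, 0.0) for v in positives) / len(positives)
            if positives else 0.0
        )
        neg_mean = (
            sum(v.get(term, 0.0) for v in negatives) / len(negatives)
            if negatives else 0.0
        )
        q1[term] = q0.get(term, 0.0) + beta * pos_mean - gamma * neg_mean
    return q1


# Toy vectors (hypothetical weights).
q0 = {"browser": 1.0}
positives = [{"advantage": 2.0}, {"advantage": 1.0, "browser": 1.0}]
negatives = [{"install": 4.0}]

q1 = rocchio(q0, positives, negatives, beta=0.5, gamma=0.25)
print(sorted(q1.items()))
```

Terms that end up with negative weight, such as "install" in this toy example, are often dropped before the expanded query is run; whether that was done here is not stated in the slides.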
