Yejun Wu & Douglas W. Oard University of Maryland, College Park Ian Soboroff

An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments. Yejun Wu & Douglas W. Oard, University of Maryland, College Park; Ian Soboroff, National Institute of Standards and Technology. July 27-28, 2006. CEAS, Mountain View, CA.


Page 1

An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments

Yejun Wu & Douglas W. Oard, University of Maryland, College Park

Ian Soboroff, National Institute of Standards and Technology

July 27-28, 2006, CEAS, Mountain View, CA

Page 2

Outline

• Build the test collection
• Evaluate the test collection (intrinsic evaluation)
• Use the test collection (extrinsic evaluation)
• Next steps to improve the test collection

Page 3

W3C Mailing List Corpus

[Diagram labels: w3c.org; NIST (6/2004); mailing-list addresses (shown as "[email protected]" in this transcript); Webpages; Parsing; 174,311 emails, 515 MB; Unique DocIDs (lists-000-9978864, lists-001-0094883, …, lists-003-9630221)]

Page 4

IR Test Collection Design

[Diagram labels: Docs, Query Formulation, Automatic Search, Interactive Selection]

Information seeking:

• Documents
• Information needs
• Interactive process

Measure system:

2 variations: system, user

Page 5

IR Test Collection Design

[Diagram labels: Docs, Topic Statement (by Assessors), Query Formulation, Automatic Search, Ranked Lists, Evaluation, Relevance Judgments (by Assessors), Evaluation Metric (Mean Average Precision)]

Freeze user.

Test Collection:

• Documents
• Topic statements
• Relevance judgments
• Metric

Page 6

DOCNO="lists-000-9978864"
RECEIVED="Sat Mar 18 08:56:28 2000"
ISORECEIVED="20000318135628"
SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)"
ISOSENT="20000310182629"
NAME="Kerri Golden"
EMAIL="[email protected]"
SUBJECT="RTF Word 2000 spec?"
ID="[email protected]"
EXPIRES="-1"
TO="[email protected]"

We are trying to convert Word 2000 docs to XML.
Our converter worked fine for W97 documents, but W2000 has a much different RTF format (tables especially).
Does anyone know where I can get a hold of a spec for this version of RTF?
thanks
Kerri [email protected]
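The sample record above stores its metadata as KEY="value" pairs. A minimal sketch of pulling those fields into a dictionary might look like the following; the function name `parse_header` and the regular expression are illustrative assumptions, not the parser used to build the collection.

```python
import re

# Matches KEY="value" pairs. Curly quotes are accepted too, because the
# slide text mixes straight and curly quotation marks.
FIELD_RE = re.compile(r'([A-Z]+)=["“”]([^"“”]*)["“”]')

def parse_header(header: str) -> dict:
    """Return a dict mapping field names (DOCNO, SUBJECT, ...) to values."""
    return {key: value for key, value in FIELD_RE.findall(header)}

# Three fields copied from the sample record on this slide.
sample = (
    'DOCNO="lists-000-9978864" '
    'SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)" '
    'SUBJECT="RTF Word 2000 spec?"'
)
fields = parse_header(sample)
print(fields["DOCNO"])
print(fields["SUBJECT"])
```

The sketch assumes field values never contain a quotation mark; a real parser would need to handle escaping.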

Page 7

Topic Statement

TopicID: DS8

Query: html vs. xhtml

Narrative: A relevant message will compare the advantages/disadvantages of the two standards.

Page 8

Pool Top 50 Docs/Run for Relevance Judgments

12 teams * 3 runs = 36 runs

Example runs (ranks 1, 2, 3, 4, …, 50): Team1 Run1, Team2 Run3, ..., Team 12 Run2

lists-000-9978864, lists-000-7643767, lists-011-6087388, lists-012-1019722, …, lists-008-2365001

lists-009-8065221, lists-006-2570023, lists-000-9978864, …, lists-012-2365001, lists-005-5500248

lists-000-7643767, lists-012-2365001, lists-004-0205442, lists-003-6603021, …, lists-009-8065221

Average 529 emails/topic

Relevance Judgments (researchers as assessors):

lists-000-9978864 Topic: , Pro/Con:
lists-000-7643767 Topic: , Pro/Con:
…
lists-008-2365001 Topic: , Pro/Con:
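Pooling as described on this slide amounts to taking the union of the top-ranked documents of every run for a topic. The sketch below is one way to express that; the function name `pool_top_k`, the run labels, and the assignment of document IDs to runs are assumptions made for illustration.

```python
def pool_top_k(runs, k=50):
    """Union of the top-k document IDs over all runs for one topic.

    runs: dict mapping a run label to its ranked list of document IDs.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

# Illustrative runs only: the IDs come from the slide, but which run
# returned which document is assumed here.
runs = {
    "Team1-Run1": ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    "Team2-Run3": ["lists-009-8065221", "lists-006-2570023", "lists-000-9978864"],
}
shallow_pool = pool_top_k(runs, k=2)
full_pool = pool_top_k(runs)
print(len(shallow_pool), len(full_pool))
```

Because documents returned by several runs are judged only once, the pool is usually much smaller than 36 runs times 50 documents, which fits the reported average of 529 emails per topic.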

Page 9

Use of Test Collection

Measure systems of ranked retrieval

Topic DS1:

Rank  Doc#               Score  Rel?  Prec.
1     lists-000-9743321  0.95   R     1.00
2     lists-000-7456300  0.91   R     1.00
3     lists-001-3400432  0.88
4     lists-002-6590811  0.82
5     lists-004-5566320  0.80   R     0.60
6     lists-009-1349620  0.77
7     lists-011-0383209  0.63   R     0.57
8     lists-005-5201023  0.62
9     lists-007-5610095  0.55
10    lists-002-3204102  0.51   R     0.50
--------------------------------------------
Avg. Prec. (AP): 0.73

Comparing System A and System B:

Topic  AP (System A)  AP (B)  A-B
DS1    0.73           0.50    +
DS2    0.45           0.38    +
DS3    0.56           0.36    +
DS4    0.00           0.09    -
DS5    0.13           0.10    +
DS6    1.00           0.83    +
DS7    0.24           0.28    -
DS8    0.47           0.20    +
DS9    0.53           0.41    +
DS10   0.23           0.30    -
------------------------------------
MAP    0.43           0.35    N+=7, N-=3

Difference is not significant (two-tailed, p<0.05)
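The numbers on this slide can be reproduced with a short sketch. It assumes the five relevant documents for DS1 sit at ranks 1, 2, 5, 7, and 10 (the ranks implied by the listed precision values) and that no other relevant documents exist for the topic. The slide does not name its significance test; a two-tailed sign test on N+ and N- is used here as an assumption.

```python
from math import comb

def average_precision(relevant_ranks, num_relevant=None):
    """Mean of precision values at the ranks where relevant docs appear."""
    ranks = sorted(relevant_ranks)
    total = num_relevant if num_relevant is not None else len(ranks)
    if total == 0:
        return 0.0
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / total

def sign_test_two_tailed(n_plus, n_minus):
    """Two-tailed binomial sign test p-value (ties already excluded)."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

ap_ds1 = average_precision([1, 2, 5, 7, 10])

# Per-topic AP values for DS1..DS10, copied from the slide.
ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]

map_a = sum(ap_a) / len(ap_a)
n_plus = sum(a > b for a, b in zip(ap_a, ap_b))
n_minus = sum(a < b for a, b in zip(ap_a, ap_b))
p_value = sign_test_two_tailed(n_plus, n_minus)

print(round(ap_ds1, 2), round(map_a, 2), n_plus, n_minus, p_value)
```

Under these assumptions the p-value is well above 0.05, which agrees with the slide's statement that the difference is not significant.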

Page 10

Emerging Topic Types

Type/Category: Method, tip, solution

• Example1:

Query: Annotea installation

Narrative: A relevant message will provide at least a tip on Annotea installation.

• Example2:

Query: file upload http

Narrative: A relevant message will discuss methods of doing file uploads using http.

Page 11

Topic Type Analysis

Find categories amenable to pro/con classification

[Bar chart: Number of Topics in Categories, axis from 0 to 30, one bar per category]

Category:

A: Comparisons, usefulness, relationships
B: Methods, tips, solutions
C: Discuss an issue
D: Problems, impacts
E: Definitions, functionality
F: Reasons, design rationales

Page 12

Measuring Agreement

Judge1 vs. Judge2 (R = relevant, NR = not relevant):

        R     NR
R       a     b     a+b
NR      c     d     c+d
        a+c   b+d   a+b+c+d=N

[Diagram: the two judges' lists of relevant documents, e.g. lists-000-9874732, lists-001-0683001, lists-003-0000221, lists-004-8436200, …, lists-002-8833514]

Overlap:

$$\text{Overlap} = \frac{a}{a+b+c}$$

Chance corrected overlap (Cohen's Kappa):

$$\kappa = \frac{\dfrac{a+d}{N} - \dfrac{(a+b)(a+c) + (c+d)(b+d)}{N^2}}{1 - \dfrac{(a+b)(a+c) + (c+d)(b+d)}{N^2}}$$

Scales:

Overlap: 0 (none) to 1 (perfect)
Kappa: -1 (inverse), 0 (chance), 1 (perfect)
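Both agreement measures follow directly from the four cells a, b, c, d of the contingency table. The sketch below uses the standard definitions of overlap and Cohen's kappa; the function names and the example counts are hypothetical and are not taken from the W3C judgments.

```python
def overlap(a, b, c):
    """Docs both judges marked relevant, divided by docs either marked relevant."""
    union = a + b + c
    return a / union if union else 0.0

def cohens_kappa(a, b, c, d):
    """Chance-corrected agreement for a 2x2 relevant / not-relevant table."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if p_expected == 1.0:
        return 1.0  # both judges put every document in the same single class
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for one topic, not data from the collection.
a, b, c, d = 20, 5, 10, 65
print(round(overlap(a, b, c), 3))
print(round(cohens_kappa(a, b, c, d), 3))
```

Overlap ignores cell d, so it does not reward agreement on the many non-relevant documents, while kappa uses all four cells and subtracts the agreement expected by chance.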

Page 13

Assessor Agreement by Category

[Bar chart: Overlap (axis 0.0 to 1.0) by Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; series: pro/con, topical]

[Bar chart: Kappa (axis 0 to 0.8) by Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; series: pro/con, topical, 3 categ]

Correlation between Overlap and Kappa >0.9, significant at p<0.01

Page 14

Effect of Disagreement on Ranking

[Diagram: items ranked 1 to 5 under the Primary Judge and 1 to 5 under the Secondary Judge]

$$\text{Kendall's Tau} = 1 - \frac{\min(\text{pairwise\_adj\_swaps})}{N}$$

Example: = 1 - 3/5 = 0.4

W3C                  Tau
Topical relevance    0.763 (significant)
Pro/con relevance    0.776 (significant)

Typical text retrieval: "identical" if >= 0.9

Important difference in relevance judgment
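The slide's example (3 swaps, tau = 0.4) can be checked with the standard Kendall's tau, which is 1 minus twice the number of discordant pairs divided by the total number of pairs. For five items this is 1 - 2*3/10, matching the slide's 1 - 3/5. The slide does not define N, so treating the formula this way is an assumption, and the second ranking below is a hypothetical order chosen to need exactly three adjacent swaps.

```python
from itertools import combinations

def count_swaps(rank_a, rank_b):
    """Number of item pairs ordered differently in the two rankings.

    This equals the minimum number of adjacent swaps needed to turn
    one ranking into the other.
    """
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y])

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items, without ties."""
    n = len(rank_a)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0
    return 1 - 2 * count_swaps(rank_a, rank_b) / pairs

primary = [1, 2, 3, 4, 5]    # order under the primary judge
secondary = [2, 3, 1, 5, 4]  # hypothetical order under the secondary judge

print(count_swaps(primary, secondary))
print(kendall_tau(primary, secondary))
```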

Page 15

Outline

• Intrinsic evaluation
  -- topic type analysis
  -- inter-assessor agreement analysis
• Extrinsic evaluation:
  -- Use W3C to evaluate a topic & pro/con system

Page 16

Experiment Design: Round Robin

[Diagram labels: Training Topics (Topic, Topic, …, 48), each with Pro/Con and Non-Pro/Con documents and a Pro/Con feature; Top N terms (N=100); INQUERY; Query1; Evaluation Topic; Search; Ranked List; Evaluation; Query relevance set (Relevance Judgments); MAP; 48-fold Cross-Validation]
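One reading of the round-robin design is leave-one-topic-out cross-validation: each of the 48 topics serves once as the evaluation topic while the remaining topics supply the pro/con training data. The sketch below shows only that loop structure; the function name and topic IDs are hypothetical, and the interpretation itself is an assumption about the diagram.

```python
def round_robin_folds(topic_ids):
    """Yield (evaluation_topic, training_topics), holding out one topic at a time."""
    for held_out in topic_ids:
        training = [t for t in topic_ids if t != held_out]
        yield held_out, training

# Hypothetical IDs standing in for the 48 topics.
topics = [f"T{i}" for i in range(1, 49)]
folds = list(round_robin_folds(topics))

eval_topic, training_topics = folds[0]
print(len(folds), eval_topic, len(training_topics))
```

In each fold, the training topics would be used to select the top N pro/con terms, the resulting query would be run for the evaluation topic, and the per-topic average precision values would be averaged into MAP.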

Page 17

Compare Two Systems

Topic Retrieval (Baseline):

Query: 100% topic terms: Browser technology support incompatibility

MAP = 0.2743

Topic + Pro/Con Retrieval (Rocchio):

Query: 30% topic terms: Browser technology support incompatibility

70% pro/con terms: advantage, strength, weakness …

MAP = 0.2857

4.2% relative improvement ((0.2857 - 0.2743) / 0.2743).

Significant (p<0.05, Wilcoxon signed-rank test)

Page 18

All Topics

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"]

Page 19

Topic Type A

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; topics on the axis: 8, 48, 42, 37, 43, 2, 54, 20, 41, 59, 49, 15, 24]

Page 20

Topic Type B

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; topics on the axis: 16, 19, 21, 30, 7, 47, 59, 17, 10]

Page 21

Topic Type C

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; topics on the axis: 14, 27, 31, 50, 26, 34, 45, 25]

Page 22

Topic Type D

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; topics on the axis: 33, 44, 56]

Page 23

Topic Type E

[Bar chart: Difference in Average Precision per topic, axis from -0.1 to 0.2, with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; topics on the axis: 4, 3, 5]

Page 24

Effects of Topic and Topic Types

• Two-way ANOVA
• Topic difficulty levels: 27 improved, 16 hurt, 7 unused
• Topic types: A, B, C, D, E, F

[Two diagrams, one for Overlap and one for Kappa, each relating the factors Topic and Topic Type to Pro/Con Relevance Agreement and Topical Relevance Agreement, with the label "Sig. (p<0.05)"]

Page 25

Conclusion – Test Collection Evaluation

• Test collection generally useful
• Important differences in judgments
• Relevance judgments could be improved
• Topic type: factor of agreement of pro/con relevance
• Categories less of a pro/con nature:
  -- B (method, tip, solution): does not lead to pro/con
  -- C (discuss an issue): vague
• Rocchio style system: 4.2% improvement in MAP
• Major improvements in A and E
• Pro/con relevance judgments useful.

Page 26

Future Work – Better Test Collection Design

• Balance topic types:
  -- half in A.
  -- F (reason, design rationale): 1 topic.
• Study information needs and search process
• Improve the process
  -- e.g., better defining topics for pro/con
• Use within-category topics for training
  -- examine the quality of training data by category
• Other classification methods: SVM, Naïve Bayes
• Separate models for detecting pros and cons.

THANKS!

Page 27

Pro/Con Feature Selection

Example for the term "advantage":

                         Topic1      Topic2      …   Topic48
Pro/Con docs             20          8               15
Non Pro/Con docs         30          5               18
Topic Weight             log(20+1)   log(5+1)        log(15+1)
TF in Pro/Con docs       38+1        30+1            40+1
TF in Non Pro/Con docs   10+1        28+1            10+1

$$\log 21 \cdot \log\frac{39/20}{11/30} \;+\; \log 6 \cdot \log\frac{31/8}{29/5} \;+\; \dots \;+\; \log 16 \cdot \log\frac{41/15}{11/18}$$

Other candidate terms: "strength" …, "Microsoft" …, "Html" …, "opinion" …

Ranked terms (1, 2, 3, 4, 5, …, 100): advantage, strength, weakness, hate, opinion, …, wow

Page 28

Feature Selection

• Pro/con feature vector term weight (log odds ratio):

$$\text{TopicWeight}_i = \log\left[\min\left(N_i^{pos} + 1,\; N_i^{neg} + 1\right)\right]$$

$$\text{Weight} = \sum_{i=1}^{48} \text{TopicWeight}_i \cdot \log\left(\frac{(TF_i^{pos} + 1)/N_i^{pos}}{(TF_i^{neg} + 1)/N_i^{neg}}\right)$$

Pos: Pro/Con relevant documents

Neg: Non Pro/Con relevant documents
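The term weight can be computed with a few lines of code. The sketch below applies the formula to the three topics shown for "advantage" in the feature selection example (Topic1, Topic2, Topic48), so the result is only a partial sum: the topics elided on the slide are not filled in. The function names are hypothetical, and natural logarithms are an assumption because the slides do not state the base.

```python
import math

def topic_weight(n_pos, n_neg):
    """log[min(Npos + 1, Nneg + 1)] for one topic."""
    return math.log(min(n_pos + 1, n_neg + 1))

def term_weight(per_topic_stats):
    """Sum over topics of topic weight times the log odds ratio.

    per_topic_stats: iterable of (tf_pos, n_pos, tf_neg, n_neg) tuples,
    with n_pos and n_neg both greater than zero.
    """
    total = 0.0
    for tf_pos, n_pos, tf_neg, n_neg in per_topic_stats:
        odds_ratio = ((tf_pos + 1) / n_pos) / ((tf_neg + 1) / n_neg)
        total += topic_weight(n_pos, n_neg) * math.log(odds_ratio)
    return total

# (TF in pro/con docs, pro/con docs, TF in non pro/con docs, non pro/con docs)
# for Topic1, Topic2, and Topic48 as shown on the slide.
advantage_stats = [(38, 20, 10, 30), (30, 8, 28, 5), (40, 15, 10, 18)]

partial_weight = term_weight(advantage_stats)
topic2_only = term_weight([advantage_stats[1]])
print(round(partial_weight, 2), round(topic2_only, 2))
```

Topic2 contributes a negative value because "advantage" is relatively more frequent in its non pro/con documents, which shows how a topic can pull a term's weight down.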

Page 29

Rocchio-style Implementation

• Appropriate for topic and pro/con retrieval.
• Baseline classifier to test the utility of test collection
• Expanded query:

$$Q_1 = Q_0 + \beta \frac{1}{n_1}\sum_{i=1}^{n_1} R_i - \gamma \frac{1}{n_2}\sum_{i=1}^{n_2} S_i$$

• Q0: initial query; Q1: expanded query.
• Ri: vectors from positive docs
• Si: vectors from negative docs
• β, γ: parameters
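A minimal sketch of a Rocchio-style query expansion follows, using sparse term-weight dictionaries as vectors. It implements the textbook Rocchio update (initial query plus a weighted centroid of positive vectors minus a weighted centroid of negative vectors); the function names, the example vectors, and the values of beta and gamma are hypothetical and are not the settings used in the study.

```python
from collections import defaultdict

def centroid(vectors):
    """Average of sparse term-weight vectors (dicts); empty input gives {}."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def rocchio(q0, positives, negatives, beta, gamma):
    """Q1 = Q0 + beta * centroid(positives) - gamma * centroid(negatives)."""
    q1 = defaultdict(float, q0)
    for term, weight in centroid(positives).items():
        q1[term] += beta * weight
    for term, weight in centroid(negatives).items():
        q1[term] -= gamma * weight
    return dict(q1)

# Hypothetical vectors and parameter values, for illustration only.
q0 = {"browser": 1.0}
positives = [{"advantage": 2.0}, {"advantage": 1.0, "weakness": 1.0}]
negatives = [{"install": 2.0}]

q1 = rocchio(q0, positives, negatives, beta=0.5, gamma=0.25)
print(q1)
```

Terms that are frequent in pro/con documents gain weight in the expanded query, while terms typical of non pro/con documents are pushed down.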