Yejun Wu & Douglas W. Oard University of Maryland, College Park Ian Soboroff
An Exploratory Study of the W3C Mailing List Test Collection for
Retrieval of Emails with Pro/Con Arguments
Yejun Wu & Douglas W. Oard, University of Maryland, College Park
Ian Soboroff, National Institute of Standards and Technology
July 27-28, 2006 CEAS, Mountain View, CA
Outline
• Build the test collection
• Evaluate the test collection (intrinsic evaluation)
• Use the test collection (extrinsic evaluation)
• Next steps to improve the test collection
W3C Mailing List Corpus
• Source: w3c.org; NIST (6/2004)
• Mailing lists: [email protected] … [email protected]
• Parsing: 174,311 emails, 515 MB
• Unique DocIDs: lists-000-9978864, lists-001-0094883, …, lists-003-9630221
• Webpages
IR Test Collection Design
Information seeking:
• Documents
• Information needs
• Interactive process
[Diagram labels: Docs, Query Formulation, Automatic Search, Interactive Selection]
Measure system:
2 variations: system, user
IR Test Collection Design
Freeze user.
[Diagram labels: Docs, Query Formulation, Automatic Search, Ranked Lists, Evaluation]
Topic Statement (by Assessors)
Relevance Judgments (by Assessors)
Evaluation Metric (Mean Average Precision)
Test Collection:
• Documents
• Topic statements
• Relevance judgments
• Metric
DOCNO="lists-000-9978864"
RECEIVED="Sat Mar 18 08:56:28 2000"
ISORECEIVED="20000318135628"
SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)"
ISOSENT="20000310182629"
NAME="Kerri Golden"
EMAIL="[email protected]"
SUBJECT="RTF Word 2000 spec?"
ID="[email protected]"
EXPIRES="-1"
TO="[email protected]"

We are trying to convert Word 2000 docs to XML.
Our converter worked fine for W97 documents, but W2000 has a much different RTF format (tables especially).
Does anyone know where I can get a hold of a spec for this version of RTF?
thanks
Kerri
[email protected]
Topic Statement
TopicID: DS8
Query: html vs. xhtml
Narrative: A relevant message will compare the advantages/disadvantages of the two standards.
Pool Top 50 Docs/Run for Relevance Judgments
• 12 teams * 3 runs = 36 runs
• Runs shown: Team 1 Run 1, Team 2 Run 3, …, Team 12 Run 2
• Example ranked lists (ranks 1 to 50):
  lists-000-9978864, lists-000-7643767, lists-011-6087388, lists-012-1019722, …, lists-008-2365001
  lists-009-8065221, lists-006-2570023, lists-000-9978864, …, lists-012-2365001, lists-005-5500248
  lists-000-7643767, lists-012-2365001, lists-004-0205442, lists-003-6603021, …, lists-009-8065221
• Average 529 emails/topic
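The pooling step can be sketched as below. This is a minimal illustration, assuming each run is simply a ranked list of DocIDs; the function name `build_pool` and the toy runs are hypothetical and are not taken from the evaluation's actual tooling.

```python
def build_pool(runs, depth=50):
    """Union of the top-`depth` DocIDs of every run, in order of first appearance."""
    pool = []
    seen = set()
    for ranked_list in runs:
        for doc_id in ranked_list[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool


# Toy runs (hypothetical rankings that reuse DocIDs shown on the slide).
runs = [
    ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    ["lists-009-8065221", "lists-006-2570023", "lists-000-9978864"],
    ["lists-000-7643767", "lists-012-2365001", "lists-004-0205442"],
]
pool = build_pool(runs, depth=2)
print(len(pool), pool)
```

Documents retrieved by several runs enter the pool once, which is why a pool can average 529 emails/topic even though 36 runs * 50 documents gives an upper bound of 1,800.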
Relevance Judgments
lists-000-9978864  Topic: , Pro/Con:
lists-000-7643767  Topic: , Pro/Con:
…
lists-008-2365001  Topic: , Pro/Con:
Researchers as assessors
Use of Test Collection
Measure systems of ranked retrieval

System A, Topic DS1:
Rank  Doc#               Score  Rel?  Prec.
1     lists-000-9743321  0.95   yes   1.00
2     lists-000-7456300  0.91   yes   1.00
3     lists-001-3400432  0.88
4     lists-002-6590811  0.82
5     lists-004-5566320  0.80   yes   0.60
6     lists-009-1349620  0.77
7     lists-011-0383209  0.63   yes   0.57
8     lists-005-5201023  0.62
9     lists-007-5610095  0.55
10    lists-002-3204102  0.51   yes   0.50
Avg. Prec. (AP): 0.73

Topic  AP (System A)  AP (System B)  A-B
DS1    0.73           0.50           +
DS2    0.45           0.38           +
DS3    0.56           0.36           +
DS4    0.00           0.09           -
DS5    0.13           0.10           +
DS6    1.00           0.83           +
DS7    0.24           0.28           -
DS8    0.47           0.20           +
DS9    0.53           0.41           +
DS10   0.23           0.30           -
MAP    0.43           0.35           N+=7, N-=3

Difference is not significant (two-tailed, p<0.05)
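The numbers on this slide can be reproduced with a short sketch. The relevant ranks for Topic DS1 (1, 2, 5, 7, 10) are inferred from the precision column, and the slide does not name the significance test it used; an exact two-tailed sign test is assumed here because the slide reports N+ and N-. The function names are illustrative.

```python
from math import comb


def average_precision(rel_flags):
    """AP of one ranked list; rel_flags[i] is True if the doc at rank i+1 is relevant.

    Assumes every relevant document appears somewhere in the list.
    """
    hits = 0
    precisions = []
    for rank, is_rel in enumerate(rel_flags, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def sign_test_two_tailed(n_plus, n_minus):
    """Exact two-tailed sign test p-value (ties are assumed to be dropped beforehand)."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Topic DS1: relevant documents at ranks 1, 2, 5, 7, 10 (inferred, see above).
ds1 = [rank in {1, 2, 5, 7, 10} for rank in range(1, 11)]
ap_ds1 = average_precision(ds1)

# Per-topic AP values for Systems A and B, copied from the slide.
ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]
map_a = sum(ap_a) / len(ap_a)
map_b = sum(ap_b) / len(ap_b)

n_plus = sum(a > b for a, b in zip(ap_a, ap_b))
n_minus = sum(a < b for a, b in zip(ap_a, ap_b))
p_value = sign_test_two_tailed(n_plus, n_minus)
print(ap_ds1, map_a, map_b, n_plus, n_minus, p_value)
```

Under these assumptions the p-value is well above 0.05, which agrees with the slide's statement that the difference between the two systems is not significant.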
Emerging Topic Types
Type/Category: Method, tip, solution
• Example 1:
Query: Annotea installation
Narrative: A relevant message will provide at least a tip on Annotea installation.
• Example 2:
Query: file upload http
Narrative: A relevant message will discuss methods of doing file uploads using http.
Topic Type Analysis
Find categories amenable to pro/con classification
[Bar chart: Number of Topics in Categories (x-axis 0 to 30), one bar per Category]
A: Comparisons, usefulness, relationships
B: Methods, tips, solutions
C: Discuss an issue
D: Problems, impacts
E: Definitions, functionality
F: Reasons, design rationales
Measuring Agreement
Chance corrected overlap

                 Judge1
                 R      NR
Judge2   R       a      b      a+b
         NR      c      d      c+d
                 a+c    b+d    a+b+c+d=N

Cohen's Kappa:
\[ \kappa = \frac{(a+d) - \frac{(a+b)(a+c) + (c+d)(b+d)}{N}}{N - \frac{(a+b)(a+c) + (c+d)(b+d)}{N}} \]

Overlap:
\[ \text{Overlap} = \frac{a}{a+b+c} \]

[Diagram: each judge's list of relevant emails (lists-000-9874732, lists-001-0683001, lists-003-0000221, lists-004-8436200, …, lists-002-8833514)]
Overlap scale: 0 (none) to 1 (perfect)
Kappa scale: -1 (inverse), 0 (chance), 1 (perfect)
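Both agreement measures follow directly from the four cells of the contingency table. The sketch below uses the standard definition of Cohen's kappa, (observed - expected) / (1 - expected), and Overlap = a / (a + b + c); the example counts are invented for illustration and are not results from the study.

```python
def overlap(a, b, c):
    """Size of the intersection of the two relevant sets divided by their union."""
    union = a + b + c
    return a / union if union else 0.0


def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 table.

    a: both judges say R, d: both say NR, b and c: the judges disagree.
    """
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical counts for one topic.
print(overlap(20, 5, 10), cohens_kappa(20, 5, 10, 65))
# The endpoints of the kappa scale:
print(cohens_kappa(30, 0, 0, 70), cohens_kappa(0, 50, 50, 0))
```

Overlap ignores the documents both judges rejected (cell d), while kappa uses all four cells and subtracts the agreement expected by chance.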
Assessor Agreement by Category
Correlation between Overlap and Kappa > 0.9, significant at p<0.01
[Bar chart: Kappa (y-axis 0 to 0.8) by Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; series: pro/con, topical, 3 categ]
[Bar chart: Overlap (y-axis 0.0 to 1.0) by Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All; series: pro/con, topical]
Effect of Disagreement on Ranking
Primary Judge vs. Secondary Judge

\[ \text{Kendall's Tau} = 1 - \frac{\min(\text{pairwise\_adj\_swaps})}{N} \]

[Diagram: two rankings of items 1 to 5] = 1 - 3/5 = 0.4

                        W3C Tau
Topical relevance       0.763 (significant)
Pro/con relevance       0.776 (significant)
Typical text retrieval  "Identical" if >= 0.9

Important difference in relevance judgment
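A minimal sketch of the tau computation follows. It uses the standard definition, 1 - 2 * swaps / (number of pairs), which is an assumption about what the slide's N stands for; for five items and three swaps it gives 1 - 6/10 = 0.4, the same value as the slide's 1 - 3/5. The secondary ranking is hypothetical, since the slide's own example rankings did not survive extraction.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items (no ties).

    The number of discordant pairs equals the minimum number of pairwise
    adjacent swaps needed to turn one ranking into the other.
    """
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    discordant = sum(1 for x, y in pairs if pos_b[x] > pos_b[y])
    return 1 - 2 * discordant / len(pairs)


primary = [1, 2, 3, 4, 5]
secondary = [3, 2, 1, 4, 5]  # hypothetical: three adjacent swaps away from primary
tau = kendall_tau(primary, secondary)
print(tau)
```

In the study the items being ranked are systems, ordered by their scores under each judge's relevance judgments.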
Outline
• Intrinsic evaluation
-- topic type analysis
-- inter-assessor agreement analysis
• Extrinsic evaluation:
-- Use W3C to evaluate a topic & pro/con system
Experiment Design: Round Robin
48-fold Cross-Validation
[Diagram: Training Topics (Topic … Topic 48), each with Pro/Con and Non-Pro/Con documents, yield a Pro/Con feature; the Top N terms (N=100) and the Evaluation Topic form INQUERY Query1; Search produces a Ranked List; Evaluation against the Query relevance set (Relevance Judgments) gives MAP]
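The round-robin loop can be sketched as follows, assuming that 48-fold cross-validation here means holding out one topic at a time and training on the rest. The function name and the placeholder topic numbers are hypothetical.

```python
def round_robin_folds(topic_ids):
    """Yield (evaluation_topic, training_topics) pairs, one fold per topic."""
    for held_out in topic_ids:
        training = [t for t in topic_ids if t != held_out]
        yield held_out, training


topic_ids = list(range(1, 49))  # 48 placeholder numbers, not the real topic IDs
folds = list(round_robin_folds(topic_ids))
print(len(folds), len(folds[0][1]))
```

In each fold, the pro/con terms would be selected from the training topics only, so the evaluation topic's own judgments never influence its query.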
Compare Two Systems
Topic Retrieval (Baseline):
Query: 100% topic terms: Browser technology support incompatibility
MAP= 0.2743
Topic + Pro/Con Retrieval (Rocchio):
Query: 30% topic terms: Browser technology support incompatibility
70% pro/con terms: advantage, strength, weakness …
MAP= 0.2857
(0.2857 - 0.2743) / 0.2743 ≈ 4.2% relative improvement.
Significant (p<0.05, Wilcoxon signed-rank test)
[Figures: per-topic Difference in Average Precision (y-axis -0.1 to 0.2), with regions labeled "Topic + Pro/Con System Better" and "Baseline Better"; one panel each for All Topics and Topic Types A to E]
X-axis topic numbers by panel:
Topic Type A: 8 48 42 37 43 2 54 20 41 59 49 15 24
Topic Type B: 16 19 21 30 7 47 59 17 10
Topic Type C: 14 27 31 50 26 34 45 25
Topic Type D: 33 44 56
Topic Type E: 4 3 5
Effects of Topic and Topic Types
• Two-way ANOVA
• Topic difficulty levels: 27 improved, 16 hurt, 7 unused
• Topic types: A, B, C, D, E, F
[Table: significance (p<0.05) of the Topic and Topic Type factors for Pro/Con Relevance Agreement and Topical Relevance Agreement, measured by Overlap and by Kappa]
Conclusion – Test Collection Evaluation
• Test collection generally useful
• Important differences in judgments
• Relevance judgments could be improved
• Topic type: a factor in agreement on pro/con relevance
• Categories less of a pro/con nature:
-- B (method, tip, solution): does not lead to pro/con
-- C (discuss an issue): vague
• Rocchio style system: 4.2% improvement in MAP
• Major improvements in A and E
• Pro/con relevance judgments useful.
Future Work – Better Test Collection Design
• Balance topic types:
-- half in A.
-- F (reason, design rationale): 1 topic.
• Study information needs and search process
• Improve the process
-- e.g., better defining topics for pro/con
• Use within-category topics for training
-- examine the quality of training data by category
• Other classification methods: SVM, Naïve Bayes
• Separate models for detecting pros and cons.
• THANKS!
Pro/Con Feature Selection
Example term: "advantage"

                       Topic1      Topic2      …   Topic48
Pro/Con docs           20          8               15
Non Pro/Con docs       30          5               18
TF in Pro/Con docs     38+1        30+1            40+1
TF in Non Pro/Con docs 10+1        28+1            10+1
Topic Weight           log(20+1)   log(5+1)        log(15+1)

Weight("advantage") = log21 * log[(39/20)/(11/30)] + log6 * log[(31/8)/(29/5)] + … + log16 * log[(41/15)/(11/18)]

Other terms are weighted the same way: "strength", "Microsoft", "Html", "opinion", …
Ranked terms, 1 to 100: advantage, strength, weakness, hate, opinion, …, wow
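Feature Selection
• Pro/con feature vector term weight (log odds ratio):

\[ \text{Weight} = \sum_{i=1}^{48} \text{TopicWeight}_i \cdot \log\left( \frac{(TF_{pos_i} + 1) / N_{pos_i}}{(TF_{neg_i} + 1) / N_{neg_i}} \right) \]

\[ \text{TopicWeight}_i = \log[\min(N_{pos_i} + 1,\ N_{neg_i} + 1)] \]

Pos: Pro/Con relevant documents
Neg: Non Pro/Con relevant documents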
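The term weight can be sketched in a few lines. The reading assumed here is: for each topic, the topic weight is log(min(Npos, Nneg) + 1), and it multiplies the log of the ratio between the smoothed per-document term frequency in Pro/Con documents, (TF + 1) / N, and the same quantity in Non Pro/Con documents. The slides do not state the log base, so the natural log is an assumption, and only the three topics shown for "advantage" are used.

```python
from math import log


def topic_weight(n_pos, n_neg):
    """TopicWeight_i = log[min(Npos_i + 1, Nneg_i + 1)]."""
    return log(min(n_pos + 1, n_neg + 1))


def term_weight(per_topic_stats):
    """Sum over topics of TopicWeight_i times the log odds ratio of the term.

    per_topic_stats: iterable of (tf_pos, n_pos, tf_neg, n_neg) tuples.
    Assumes every topic has at least one document on each side.
    """
    total = 0.0
    for tf_pos, n_pos, tf_neg, n_neg in per_topic_stats:
        odds = ((tf_pos + 1) / n_pos) / ((tf_neg + 1) / n_neg)
        total += topic_weight(n_pos, n_neg) * log(odds)
    return total


# Counts for "advantage" in Topic1, Topic2 and Topic48, as shown on the slides.
advantage = [
    (38, 20, 10, 30),
    (30, 8, 28, 5),
    (40, 15, 10, 18),
]
print(term_weight(advantage))
```

A topic where the term is relatively more frequent in Non Pro/Con documents (Topic2 here) contributes a negative amount, and topics with few judged documents on either side count for less.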
Rocchio-style Implementation
• Appropriate for topic and pro/con retrieval.
• Baseline classifier to test the utility of test collection
• Expanded query:

\[ Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2} \]

• Q0: initial query; Q1: expanded query.
• Ri: vectors from positive docs
• Si: vectors from negative docs
• β, γ: parameters
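A minimal sketch of Rocchio-style query expansion, assuming the standard form in which the expanded query is the initial query plus a weighted mean of the positive vectors minus a weighted mean of the negative vectors. The names `beta` and `gamma`, their values, and the toy vectors are illustrative assumptions, not the authors' settings.

```python
from collections import defaultdict


def rocchio_expand(q0, positives, negatives, beta=1.0, gamma=1.0):
    """Q1 = Q0 + beta * mean(positive vectors) - gamma * mean(negative vectors).

    Vectors are sparse dicts mapping term -> weight.
    """
    q1 = defaultdict(float)
    for term, weight in q0.items():
        q1[term] += weight
    for vectors, scale in ((positives, beta), (negatives, -gamma)):
        if not vectors:
            continue  # nothing to add for an empty document set
        for vec in vectors:
            for term, weight in vec.items():
                q1[term] += scale * weight / len(vectors)
    return dict(q1)


# Toy example with made-up weights.
q0 = {"browser": 1.0, "incompatibility": 1.0}
pos = [{"advantage": 2.0, "browser": 1.0}, {"weakness": 1.0}]
neg = [{"install": 1.0}]
q1 = rocchio_expand(q0, pos, neg, beta=0.5, gamma=0.25)
print(q1)
```

Negative weights are kept as they are in this sketch; whether to drop or clip them is a separate design choice.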