Post on 19-Dec-2015
Agenda
• Track goals
• Deciding on a document collection
• “Beating Boolean”
• Handling nasty OCR
• Making the best use of the metadata
• Ad hoc task design
• Interactive task design
• Relevance feedback task design
• Other issues
Track Goals
• Develop a reusable test collection
  – Documents, topics, evaluation measures
• Foster formation of a research community
• Establish baseline results
Choosing a Collection
• FERC Enron (w/attachments, full headers)
  – Somewhat larger than CMU
  – Email is the real killer app for E-discovery
• IIT CDIP version 1.0 (same as 2006/07)
  – We have 83 topics. Do we need more?
• State Department Cables
  – Task model would be FOIA, not E-Discovery
TREC Topic Number: 1
Title: Marketers or Traders of Electricity on the Financial Market
Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.
Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.
Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.
Query Possibilities:
• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)
• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
  o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.
• (marketer or marketers or EPMI) and (short or long)
  o As in have a long or short position in sales/purchases.
• (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
  o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
  o EOL was the forward market trading place. (36, p. 3)
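The query possibilities above all have the shape "AND of OR-groups". A minimal sketch of how such a query could be matched against a message, assuming simple lowercase word matching only: the function names are hypothetical, and quoted phrases and wildcards such as watt* are deliberately not handled here.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(text, or_groups):
    """True if every OR-group contributes at least one term found in the text.

    or_groups is a list of lists of single-word terms, e.g.
    [["marketer", "marketers", "EPMI"], ["short", "long"]]
    for (marketer or marketers or EPMI) and (short or long).
    """
    toks = tokens(text)
    return all(any(term.lower() in toks for term in group) for group in or_groups)


# Made-up example sentences, not taken from the Enron collection.
query = [["marketer", "marketers", "EPMI"], ["short", "long"]]
print(matches("EPMI held a long position", query))
print(matches("EPMI sold natural gas", query))
```

The first made-up sentence satisfies both groups; the second satisfies only the first group and is rejected.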
Identity Modeling in Enron
[Figure: example identity models linking the addresses m..scott@enron.com, scott.susan@enron.com, and sscott5@enron.com to extracted strings including “susan m scott”, “susan scott”, “scott susan”, “m scott”, “susan”, “sue”, “suebob”, “sscott5”, “sscott”, “ciao”, “again”, “friday”, and “com members”]
66,715 models
82,084 addr-name
3,151 addr-nickname
19,708 addr-addr
Enron Identity Test Collections
Collection      Emails    Identities   Mention Queries   Candidates (Min. / Avg. / Max.)
Sager           1,628     627          51                1 / 4 / 11
Shapiro         974       855          49                1 / 8 / 21
Enron-subset    54,018    27,340       78                1 / 152 / 489
Enron-all       248,451   123,783      78                3 / 518 / 1785
Example Document
Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY
Organization Authors: PMUSA, PHILIP MORRIS USA
Person Authors: HALLE, L
Document Date: 19970530
Document Type: MEMO, MEMORANDUM
Bates Number: 2078039376/9377
Page Count: 2
Collection: Philip Morris
Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aaBenffrts Departmext Rieh>pwna, Yfe&iaTa: Dishlbutfon Data aday 90,1997.From: Lisa FisllaSabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsUDuring our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ngartieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was amsiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with mySadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you onwhetlne you concur with my reeommendatioa…
Scanned OCR Metadata
State Department Cables
[Chart: Number of Records per year (1973, 1974, 1975), y-axis from 0 to 400,000; series: Withdrawn, Metadata, Full Text]
791,857 records – 550,983 of which are full text
Handling Nasty OCR
• Index pruning
• Error estimation
• Character n-grams
• Duplicate detection
• Expansion using a cleaner collection
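As one illustration of the character n-gram idea listed above, here is a minimal sketch. The n-gram length, the Jaccard overlap measure, and the function names are assumptions for illustration; the slides do not specify any of them.

```python
def char_ngrams(text, n=4):
    """Return the set of overlapping character n-grams of a normalised string."""
    s = " ".join(text.lower().split())  # lowercase and collapse whitespace
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a, b, n=4):
    """Jaccard overlap between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


# "Newsbttsr" is how the OCR rendered "Newsletter" in the example document.
print(ngram_overlap("newsletter", "newsbttsr", n=3))
```

A whole-word match between "newsletter" and "newsbttsr" fails, but the two strings still share the trigrams "new" and "ews", so an n-gram index retains some evidence that the garbled token refers to the same word.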
How to “Beat Boolean”
• Work from reference Boolean?
  – Swap out low-ranked-in for high-ranked-out
• Relax Boolean somehow?
  – Cover density, proximity perturbation, …
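One possible reading of "swap out low-ranked-in for high-ranked-out" is sketched below: keep the result set the same size as the reference Boolean set, but replace the Boolean hits that a ranked system scores lowest with the non-hits it scores highest. This interpretation, the function name, and the swap count k are assumptions, not something the slides define.

```python
def swap_boolean(boolean_set, ranking, k):
    """Replace the k lowest-ranked Boolean hits with the k highest-ranked non-hits.

    boolean_set: ids matched by the reference Boolean query
    ranking:     candidate ids ordered best first by some ranked retrieval system
    k:           number of documents to swap (hypothetical tuning parameter)
    """
    ranked_ids = set(ranking)
    inside = [d for d in ranking if d in boolean_set]       # Boolean hits, best first
    outside = [d for d in ranking if d not in boolean_set]  # non-hits, best first
    k = max(0, min(k, len(inside), len(outside)))
    kept = inside[:len(inside) - k]   # drop the k lowest-ranked hits
    added = outside[:k]               # add the k highest-ranked non-hits
    # Boolean hits that the ranked system never scored are kept unchanged.
    unranked = [d for d in boolean_set if d not in ranked_ids]
    return set(kept) | set(added) | set(unranked)


result = swap_boolean({"d1", "d2", "d3"}, ["d4", "d1", "d5", "d2", "d3"], k=1)
print(sorted(result))
```

In this toy example, d3 (the lowest-ranked Boolean hit) is swapped for d4 (the highest-ranked document outside the Boolean set), so the result set stays at three documents.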
Ad Hoc Task Design
• Evaluation measures
  – R@B?, P@R?, Index size?
  – Error bars / Statistical significance testing
  – Limits on post-hoc use of the collection?
  – What are “meaningful” differences?
• Topic design
  – Negotiation transcript?
• Inter-annotator agreement
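The candidate measures above can be sketched as follows, assuming R@B means recall at cutoff B (the number of documents returned by the reference Boolean query) and P@R means precision at cutoff R (the number of known relevant documents). The slides do not define either term, so these readings are assumptions. The sketch also assumes complete relevance judgments, which a large collection would not have in practice.

```python
def recall_at(ranking, relevant, cutoff):
    """Fraction of all relevant documents found in the top `cutoff` results."""
    if not relevant:
        return 0.0
    top = ranking[:max(cutoff, 0)]
    return sum(1 for d in top if d in relevant) / len(relevant)


def precision_at(ranking, relevant, cutoff):
    """Fraction of the top `cutoff` results that are relevant."""
    if cutoff <= 0:
        return 0.0
    top = ranking[:cutoff]
    return sum(1 for d in top if d in relevant) / cutoff


# Toy data: B is the size of a hypothetical Boolean result set.
ranking = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e", "z"}
B = 3
print(recall_at(ranking, relevant, B))                 # R@B under the assumption above
print(precision_at(ranking, relevant, len(relevant)))  # P@R under the assumption above
```

Fixing the cutoff at B lets a ranked run be compared with the Boolean result set at equal size.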
Interactive Track Design
• Evaluation measure
  – Precision-oriented?
  – Recall-oriented?
  – Effect of assessor disagreement
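One common way to quantify assessor disagreement is Cohen's kappa, which corrects raw agreement for chance. The slides do not name a statistic, so kappa is a swapped-in choice here, and the sketch assumes two assessors judging the same documents.

```python
def cohens_kappa(judgments_a, judgments_b):
    """Cohen's kappa for two assessors' labels over the same documents."""
    if len(judgments_a) != len(judgments_b) or not judgments_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judgments_a)
    # Observed agreement: share of documents given the same label by both.
    observed = sum(1 for x, y in zip(judgments_a, judgments_b) if x == y) / n
    # Chance agreement: product of each assessor's label rates, summed over labels.
    labels = set(judgments_a) | set(judgments_b)
    expected = sum(
        (judgments_a.count(label) / n) * (judgments_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both assessors used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up judgments (1 = relevant, 0 = not relevant) for four documents.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the toy example the assessors agree on three of four documents (0.75), while chance agreement is 0.5, which gives a kappa of 0.5.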
Some Open Questions
• Test collection reusability
  – Unbiased estimates? Tight error bars?
• Why can’t we beat Boolean???
  – Different strategies? Detailed failure analysis?
• Can we improve topic formulation?
  – Structured relevance feedback?
• Is OCR masking effects we need to see?
  – Is it time for a new collection?
  – Must it be de-duped? Is metadata needed?
• Does Δscope invalidate the interactive task?