
Planning for the TREC 2008 Legal Track

Douglas Oard

Stephen Tomlinson

Jason Baron

Agenda

• Track goals
• Deciding on a document collection
• “Beating Boolean”
• Handling nasty OCR
• Making the best use of the metadata
• Ad hoc task design
• Interactive task design
• Relevance feedback task design
• Other issues

Track Goals

• Develop a reusable test collection
  – Documents, topics, evaluation measures

• Foster formation of a research community

• Establish baseline results

Choosing a Collection

• FERC Enron (w/attachments, full headers)
  – Somewhat larger than CMU
  – Email is the real killer app for E-discovery

• IIT CDIP version 1.0 (same as 2006/07)
  – We have 83 topics. Do we need more?

• State Department Cables
  – Task model would be FOIA, not E-Discovery

TREC Topic Number: 1

Title: Marketers or Traders of Electricity on the Financial Market

Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.

Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.

Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.

Query Possibilities:

• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)

• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)

o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.

• (marketer or marketers or EPMI) and (short or long)

o As in having a long or short position in sales/purchases.

• (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)

o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)

o EOL was the forward market trading place. (36, p. 3)
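To make the Boolean syntax above concrete, here is a minimal sketch (our illustration, not track infrastructure) of the second query possibility as a Python matcher; the tokenization and wildcard handling are simplifying assumptions.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenization; a real engine would also stem and normalize."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_phrase(tokens: list[str], phrase: str) -> bool:
    """True if the quoted phrase occurs as a contiguous token sequence."""
    words = tokenize(phrase)
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))

def matches_query_2(text: str) -> bool:
    """Second query possibility: a marketing term AND an electricity-unit term."""
    tokens = tokenize(text)
    marketing = bool(
        {"marketer", "marketers", "epmi"} & set(tokens)
        or has_phrase(tokens, "Enron Power Marketing")
        or has_phrase(tokens, "Enron Energy Services")
        or has_phrase(tokens, "Enron Energy Marketing Corporation")
    )
    units = bool({"mw", "kw", "mwh", "kwh"} & set(tokens)
                 or any(t.startswith("watt") for t in tokens))  # watt* wildcard
    return marketing and units

print(matches_query_2("EPMI sold 500 MW on the long-term market"))  # True
```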

Identity Modeling in Enron

[Figure: an identity model graph linking one email address (redacted in this transcript as “[email protected]”) to the name and nickname mentions observed with it, e.g., “susan m scott”, “susan scott”, “scott susan”, “susan”, “sue”, “suebob”, and the username “sscott5”. Model statistics: 66,715 models; 82,084 addr-name links; 3,151 addr-nickname links; 19,708 addr-addr links.]
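The addr-name links in the figure could, for example, be harvested from message headers. The following is a hypothetical sketch (the slides do not specify the actual pipeline) using Python's standard email utilities.

```python
from collections import Counter
from email.utils import getaddresses

def addr_name_links(messages: list[dict[str, str]]) -> Counter:
    """Count (address, display-name) co-occurrences in From/To/Cc headers."""
    links: Counter = Counter()
    for msg in messages:
        for field in ("From", "To", "Cc"):
            for name, addr in getaddresses([msg.get(field, "")]):
                if name and addr:
                    links[(addr.lower(), name.lower())] += 1
    return links

msgs = [{"From": '"Susan M Scott" <sscott5@example.com>'}]
print(addr_name_links(msgs))
# Counter({('sscott5@example.com', 'susan m scott'): 1})
```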

Enron Identity Test Collections

Collection      Emails   Identities  Queries  Mention Candidates (Min. / Avg. / Max.)
Sager            1,628          627       51            1 /   4 /    11
Shapiro            974          855       49            1 /   8 /    21
Enron-subset    54,018       27,340       78            1 / 152 /   489
Enron-all      248,451      123,783       78            3 / 518 / 1,785


Example Document

Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY

Organization Authors: PMUSA, PHILIP MORRIS USA

Person Authors: HALLE, L

Document Date: 19970530

Document Type: MEMO, MEMORANDUM

Bates Number: 2078039376/9377

Page Count: 2

Collection: Philip Morris

Scanned OCR text (verbatim, illustrating the recognition errors this collection poses):

Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aaBenffrts Departmext Rieh>pwna, Yfe&iaTa: Dishlbutfon Data aday 90,1997.From: Lisa FisllaSabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsUDuring our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ngartieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was amsiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with mySadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you onwhetlne you concur with my reeommendatioa…

State Department Cables

[Figure: bar chart of the number of records per year, 1973–1975 (y-axis: Number of Records, 0–400,000), with series for Withdrawn, Metadata-only, and Full Text records.]

791,857 records – 550,983 of which are full text

Handling Nasty OCR

• Index pruning

• Error estimation

• Character n-grams (see the sketch after this list)

• Duplicate detection

• Expansion using a cleaner collection
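As one example of these techniques, here is a minimal character n-gram sketch (our illustration, not the track's prescribed method): overlapping n-grams let a query term match its OCR-corrupted variants in the example document above.

```python
def char_ngrams(text: str, n: int = 4) -> set[str]:
    """Overlapping character n-grams of each whitespace-delimited token."""
    grams: set[str] = set()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def ngram_overlap(query: str, document: str, n: int = 4) -> float:
    """Dice coefficient between query and document n-gram sets."""
    q, d = char_ngrams(query, n), char_ngrams(document, n)
    return 2 * len(q & d) / (len(q) + len(d)) if q and d else 0.0

# Nonzero despite the OCR damage ("Newsbttsr" still shares grams with "newsletter").
print(ngram_overlap("newsletter", "CIGNA WeWedng Newsbttsr"))
```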

How to “Beat Boolean”

• Work from reference Boolean?
  – Swap out low-ranked-in for high-ranked-out (sketched below)

• Relax Boolean somehow?
  – Cover density, proximity perturbation, …
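A hypothetical rendering of the "swap" idea: keep the size of the reference Boolean set fixed, but trade its lowest-ranked members for the best documents a statistical ranker placed outside it. Function and variable names here are ours.

```python
def swap_boolean(boolean_set: set[str], ranking: list[str], k: int) -> set[str]:
    """Swap the k worst-ranked docs inside B for the k best-ranked docs outside B."""
    rank = {doc: i for i, doc in enumerate(ranking)}  # lower index = better
    in_b = sorted(boolean_set, key=lambda d: rank.get(d, len(ranking)))
    out_b = [d for d in ranking if d not in boolean_set]
    return set(in_b[: len(in_b) - k]) | set(out_b[:k])

B = {"d1", "d4", "d7"}
ranked = ["d2", "d1", "d4", "d9", "d7"]
print(sorted(swap_boolean(B, ranked, k=1)))  # ['d1', 'd2', 'd4']: d7 out, d2 in
```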

Using Metadata

• Title (term match)

• Author (social network)

• Bates number (sequence)
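For instance, the Bates number's sequence property could be exploited as below. The range-parsing convention (START/END-SUFFIX, as in the example document's 2078039376/9377) is our reading of the CDIP metadata, not something the slides specify.

```python
def bates_range(bates: str) -> tuple[int, int]:
    """Parse 'START/ENDSUFFIX', e.g. '2078039376/9377' -> (2078039376, 2078039377)."""
    start_s, end_suffix = bates.split("/")
    end = int(start_s[: len(start_s) - len(end_suffix)] + end_suffix)
    return int(start_s), end

def adjacent(b1: str, b2: str) -> bool:
    """True if two Bates ranges abut in the scanning sequence (likely filed together)."""
    (s1, e1), (s2, e2) = bates_range(b1), bates_range(b2)
    return e1 + 1 == s2 or e2 + 1 == s1

print(bates_range("2078039376/9377"))                   # (2078039376, 2078039377)
print(adjacent("2078039376/9377", "2078039378/9380"))   # True
```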

Ad Hoc Task Design

• Evaluation measures
  – R@B? P@R? Index size? (R@B and P@R are sketched after this list)
  – Error bars / statistical significance testing
  – Limits on post-hoc use of the collection?
  – What are “meaningful” differences?

• Topic design
  – Negotiation transcript?

• Inter-annotator agreement
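For reference, sketch implementations of the two measures named above, under one plausible reading: B is the size of the reference Boolean result set for the topic, and R is the number of known relevant documents.

```python
def recall_at_b(ranked: list[str], relevant: set[str], b: int) -> float:
    """R@B: fraction of all relevant documents retrieved in the top B."""
    return len(set(ranked[:b]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_r(ranked: list[str], relevant: set[str]) -> float:
    """P@R: precision at depth R = |relevant| (a.k.a. R-precision)."""
    r = len(relevant)
    return len(set(ranked[:r]) & relevant) / r if r else 0.0

ranked = ["d3", "d1", "d8", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
print(recall_at_b(ranked, relevant, b=4))  # top 4 holds d1, d2 -> 2/3
print(precision_at_r(ranked, relevant))    # top 3 holds d1 only -> 1/3
```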

Interactive Track Design

• Evaluation measure
  – Precision-oriented?
  – Recall-oriented?
  – Effect of assessor disagreement

Relevance Feedback Task

• Evaluation measure
  – Residual recall at B_Residual? (sketched below)

• Two-stage feedback?
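A sketch of one plausible definition of residual recall at B_Residual (an assumption, since the slide only poses the question): documents judged in the feedback round are removed from both the ranking and the relevant set before measuring.

```python
def residual_recall(ranked: list[str], relevant: set[str],
                    judged: set[str], b_residual: int) -> float:
    """Recall at B_Residual over the collection minus the feedback documents."""
    residual_ranked = [d for d in ranked if d not in judged]
    residual_relevant = relevant - judged
    if not residual_relevant:
        return 0.0
    return (len(set(residual_ranked[:b_residual]) & residual_relevant)
            / len(residual_relevant))

ranked = ["d1", "d4", "d2", "d7", "d5"]
print(residual_recall(ranked, relevant={"d2", "d4", "d5"},
                      judged={"d4"}, b_residual=2))  # residual top-2 = d1, d2 -> 1/2
```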

Some Open Questions

• Test collection reusability

– Unbiased estimates? Tight error bars?

• Why can’t we beat Boolean???
  – Different strategies? Detailed failure analysis?

• Can we improve topic formulation?
  – Structured relevance feedback?

• Is OCR masking effects we need to see?
  – Is it time for a new collection?
  – Must it be de-duped? Is metadata needed?

• Does Δscope invalidate the interactive task?