SF Women in eDiscovery Sept 2011

12/5/2011 1

Getting to a Manageable Review Set

Intake Data: 100%

Duplicates: 25%

Non-Responsive: 20%

Produced: 12.25%

NR/Priv: 20%

Responsive & Priv: 15%

Junk/Spam/Porn: 20%

These figures vary based upon the data set received

Focus on finding, reviewing & using the “right” data, not just filtering data

Review risks

Failure to collect the right data

Failure to find responsive documents

Failure to recognize responsive documents

Failure to recognize privileged documents

Inconsistent treatment of documents (e.g., duplicates)

Failure to complete project in a timely manner

Sophisticated Tools

– Understand What They Do and Don’t Do Well

– Inform Yourself, Speak to References, Consultants

Search Methodologies

Keyword: specific exact words, proximity searches, stemming

Ontology: generalized words or phrases

Clustering: similarity of salient features

Relationship Analysis: documents with causal or sequential relationship

Social Network Analysis: relationships among relevant people

Content

Concept

Context

Visualization

Measurement

Myth

Keyword Searching is the Way to Go

If I agree to keyword terms, I am OK

Missing in Action (Under-inclusive)

Unwanted Extras (Over-inclusive)

Multiple subject/persons (Disambiguate)

Reality: Keyword Search is one tool among many!

Keyword culling

"simple keyword searches end up being both over- and under-inclusive."

Judge Paul Grimm, Victor Stanley, Inc. v. Creative Pipe, Inc., No. MJG-06-2662, 2008 U.S. Dist. LEXIS 42025 (D. Md. May 29, 2008).

Keyword Accuracy Example

8,553 responsive documents missed by keyword search

(Almost 8% of responsive documents missed by keyword search - Under-inclusive)

Keyword search reduced the document set by only 47%

And 88% of the documents returned by keyword search were not responsive (Over-inclusive)
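The two failure modes on this slide correspond to the standard retrieval measures of recall (under-inclusive searches have low recall) and precision (over-inclusive searches have low precision). The sketch below shows the arithmetic only. The document counts are hypothetical, chosen to give roughly the 12% precision and 92% recall implied by the percentages above; they are not the counts from the case.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of returned documents that are actually responsive."""
    returned = true_pos + false_pos
    return true_pos / returned if returned else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of all responsive documents that the search returned."""
    responsive = true_pos + false_neg
    return true_pos / responsive if responsive else 0.0


# Hypothetical counts for illustration only (not taken from the slide).
returned_responsive = 120      # responsive documents the search found
returned_not_responsive = 880  # non-responsive documents the search returned
missed_responsive = 10         # responsive documents the search missed

p = precision(returned_responsive, returned_not_responsive)
r = recall(returned_responsive, missed_responsive)
print(f"precision = {p:.2f} (over-inclusive when low)")
print(f"recall    = {r:.2f} (under-inclusive when low)")
```

A search can look acceptable on one measure and poor on the other, which is why both are worth reporting.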

Under-Inclusive - Missing in Action

Missing abbreviations / acronyms / clippings:

– incentive stock option but not ISO

Missing inflectional variants:

– grant but not grants, granted, granting

Missing spellings or common misspellings:

– gray but not grey

– privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …

Missing syntactic variants:

– board of directors meeting but not meeting of the board of directors, BOD meeting, board meeting, BOD mtg…

Missing Synonyms/Paraphrases:

– Hire date but not start date
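A minimal sketch of the under-inclusion problem, assuming an exact whole-word keyword match on one side and a hand-written expansion pattern on the other. The sample documents, the `EXPANSIONS` table, and the function names are hypothetical illustrations, not part of any review tool.

```python
import re

# Hypothetical sample documents.
docs = [
    "The board granted her an ISO last quarter.",
    "Incentive stock option paperwork attached.",
    "This memo is priviledged and confidential.",
    "Minutes from the BOD mtg are attached.",
]

# Hypothetical expansions covering acronyms, inflections, and misspellings.
EXPANSIONS = {
    "incentive stock option": r"\b(incentive stock options?|ISOs?)\b",
    "grant": r"\bgrant(s|ed|ing)?\b",
    "privileged": r"\bpriv[ie]l[ie]d?ged\b",
}


def exact_hits(term, texts):
    """Documents containing the term as an exact word or phrase."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


def expanded_hits(term, texts):
    """Documents matching the broader expansion pattern for the term."""
    pattern = re.compile(EXPANSIONS[term], re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


for term in EXPANSIONS:
    print(term, len(exact_hits(term, docs)), len(expanded_hits(term, docs)))
```

Each expansion has to be anticipated in advance, which is the practical limit of this approach: synonyms and paraphrases such as "start date" for "hire date" are not recoverable by pattern tweaks alone.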

Over-Inclusive - Unwanted Extras (a)

Options

Target: Sheila was granted 100,000 options at $10

Match: What are our options for lunch?

Match in a signature line:

Amanda Wacz

Acme Stock Options Administrator

Destroy

Target: destroy evidence

Match in a disclaimer: The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.

Over-Inclusive - Unwanted Extras (b)

alter*

Target: alter, alters, altered, altering

Matches: alternate, alternative, alternation, altercate, altercation, alterably, …

grant

Target: stock option grant

Matches names: Grant Woods, Howard Grant
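The wildcard problem can be reproduced in a few lines. This is a sketch using Python regular expressions as a stand-in for a review tool's wildcard syntax, which may behave differently. The word list comes from the slide; the pattern names are hypothetical.

```python
import re

words = ["alter", "alters", "altered", "altering",
         "alternate", "alternative", "altercation"]

# Behaves like the wildcard alter*: any word starting with "alter".
wildcard = re.compile(r"^alter\w*$")

# Lists only the intended inflections explicitly.
inflections = re.compile(r"^alter(s|ed|ing)?$")

wildcard_hits = [w for w in words if wildcard.match(w)]
targeted_hits = [w for w in words if inflections.match(w)]

print("alter* matches:  ", wildcard_hits)
print("targeted matches:", targeted_hits)
```

Enumerating inflections trades recall for precision: it drops "alternate" and "altercation" but will also miss any form that was not listed.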

Failure to Disambiguate

Words that Relate to Multiple Subjects

Example: refund is used to refer to:

– FERC-ordered refunds owed by Enron for overcharging

– Tax refunds (both corporate and personal)

– Mundane business matters

In a given matter, one might be of interest while the others are not

Technology Enhanced Review: Speed, Predictable Costs, and Accuracy

Automate any portion of the review

Example from a real case

[Figure: review funnel. Stages: Source Data; Eliminate Duplicates & System Files; Non-Responsive Isolation (ontologies); NR by Technology Enhanced Review (removed another 18%); Responsive by Technology Enhanced Review (removed another 7%); Priv by High-Speed Manual Review. Percentage labels: 30%, 30%, 15%, 22%, 100%, 3%.]

Example: “priv” ontology

Valuable, re-usable work product

Combines classifiers into concepts, into bigger concepts
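One way to picture "classifiers into concepts, into bigger concepts" is a nested structure whose leaves are simple pattern classifiers. The sketch below is an assumption about the general shape only: the concept names, patterns, and the `matched_concepts` function are hypothetical and are not the ontology shown in the presentation.

```python
import re

# Hypothetical "priv" ontology: leaves are lists of regex classifiers,
# inner nodes are concepts built from smaller concepts.
PRIV_ONTOLOGY = {
    "privilege": {
        "legal_advice": [r"\blegal advice\b", r"\badvice of counsel\b"],
        "attorney_terms": [r"\battorney[- ]client\b", r"\bwork product\b"],
    }
}


def matched_concepts(text, ontology):
    """Return every concept whose own or nested classifiers fire on text."""
    hits = set()
    for concept, children in ontology.items():
        if isinstance(children, dict):
            # A bigger concept matches when any of its sub-concepts match.
            sub_hits = matched_concepts(text, children)
            if sub_hits:
                hits |= sub_hits | {concept}
        elif any(re.search(p, text, re.IGNORECASE) for p in children):
            hits.add(concept)
    return hits


print(matched_concepts("Per advice of counsel, hold the release.", PRIV_ONTOLOGY))
```

Because the structure is plain data, it can be saved and reused on the next matter, which is the sense in which an ontology is re-usable work product.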

Disclaimer Detection

Disclaimers can throw off attempts to detect privileged communications

Prevalent throughout many companies, even on trivial communications

Detect them automatically, and exclude them from searches
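A minimal sketch of the idea, assuming disclaimers can be recognized by a few cue phrases and removed paragraph by paragraph before searching. The cue list and function names are hypothetical; a production detector would need to be considerably more robust.

```python
import re

# Hypothetical cue phrases that tend to appear in boilerplate disclaimers.
DISCLAIMER_CUES = re.compile(
    r"intended solely for"
    r"|received this (transmission|message|email) in error"
    r"|strictly prohibited",
    re.IGNORECASE,
)


def strip_disclaimers(body: str) -> str:
    """Drop any paragraph that contains a disclaimer cue phrase."""
    paragraphs = re.split(r"\n\s*\n", body)
    kept = [p for p in paragraphs if not DISCLAIMER_CUES.search(p)]
    return "\n\n".join(kept)


def mentions(term: str, body: str) -> bool:
    """Whole-word search for term, ignoring disclaimer paragraphs."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.search(strip_disclaimers(body)) is not None


trivial_email = (
    "Lunch at noon?\n\n"
    "The information in this email may contain confidential and/or "
    "privileged information and is intended solely for the use of the "
    "named recipient(s). If you have received this transmission in error, "
    "please contact the sender and destroy this message."
)
real_email = "Please treat this as privileged: counsel advises we settle."

print(mentions("privileged", trivial_email))  # disclaimer text is ignored
print(mentions("privileged", real_email))
```

The same filter also removes the "destroy this message" false positive from the over-inclusive example.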

Privileged by Actor and Term

Privileged by Actor Only

Privileged by Term Only

Privileged by Disclaimer Only

Domain of Disclaimer Detection

Responsive

Priv Logs

Expensive - But Do NOT Have to Be

In re Vioxx Products Liability Litigation (E.D. La. 2007)

Merck’s Priv Log had 30,000 items on it

– How to Make a Judge Angry

– How to Waste Client Money

– How to Attract Sanctions

Transparency of Process

Discussing Review Protocols

– Provide transparent, defensible, sophisticated search based on document content

– Clustering, Ontologies, Analytics, and yes, sometimes Keywords too

Develop search methodologies for each case

– Use technology experts in consultation with case / legal experts

Results verifiable by Quality Control

– Defensible sampling

Sophisticated Tools

– Understand What They Do and Don’t Do Well

– Inform Yourself, Speak to References, Consultants
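Quality control by sampling can be sketched as follows: draw a reproducible random sample of coded documents, have a reviewer check them, and report the observed error rate with a margin of error. The function names and the numbers are hypothetical, and the normal-approximation interval used here is one common choice among several, not a statement of what any court requires.

```python
import math
import random


def sample_for_qc(doc_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of document IDs."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)


def error_rate_interval(errors, sample_size, z=1.96):
    """Approximate 95% interval for the error rate (normal approximation)."""
    p = errors / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical review: 100,000 coded documents, 400 sampled, 20 found miscoded.
sample = sample_for_qc(range(100_000), 400)
low, high = error_rate_interval(errors=20, sample_size=400)
print(f"sampled {len(sample)} documents")
print(f"estimated error rate between {low:.3f} and {high:.3f}")
```

Fixing the seed makes the sample reproducible, which helps when the selection has to be explained to the other side or to the court.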

Blair & Maron: Keyword search is incomplete

[Figure: bar chart of responsive documents (0% to 100%), predicted vs. obtained: what the lawyers thought they were finding vs. what they actually found.]

Blair and Maron, Communications of the ACM, 28, 1985, 289-299

Blair & Maron Study: 20% recall

Lawyers picked 3 key terms, B & M found 26 more

Defense: “Unfortunate incident” Plaintiff: “Disaster”

Blair and Maron, Communications of the ACM, 28, 1985, 289-299

Blair and Maron:

“It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents.”

Predictive Coding

Document categorization in Legal Discovery: Computer Classification vs. Manual Review

Herbert L. Roitblat, Anne Kershaw, & Patrick Oot

[Figure: bar chart of agreement with the original review (axis 0.5 to 1) for Team A and Team B (manual review) and System C and System D (computer classification).]

Roitblat, Kershaw, & Oot, 2010, JASIST

Gold Standard

Turing test

Alan Turing, 1912-1954

Substantial disagreement between Team A & Team B

[Figure: responsive documents identified (axis 0 to 2000). Team A only: 629; Both: 580 (28%); Team B only: 858.]

Roitblat, Kershaw, & Oot, 2010, JASIST
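The 28% figure is consistent with reading the three counts as Team A only (629), both teams (580), and Team B only (858), and dividing the shared documents by all documents either team marked responsive. A short sketch of that arithmetic, with a hypothetical function name:

```python
def overlap(a_only: int, both: int, b_only: int) -> float:
    """Share of documents marked responsive by either team that both marked."""
    return both / (a_only + both + b_only)


# Counts from the chart: Team A only, both teams, Team B only.
agreement = overlap(629, 580, 858)
print(f"{agreement:.3f}")  # about 0.281, i.e. the 28% on the slide
```

This ratio is the Jaccard index of the two teams' responsive sets.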

Conclusion

The computer systems yielded a comparable level of performance relative to manual review

Fewer people, less time, less cost

Measure performance to evaluate

Will lawyers lose control?

Computer system amplifies the intelligence of the Expert

Will lawyers lose their jobs?

Tap into the mind of an expert

Technology-Enhanced or Automated Review

Setup

Sample

Expert judges sample: Responsive / Non-responsive

Model learns

Model predicts

Model categorizes all remaining documents: Responsive / Non-responsive

Repeat as needed
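The workflow above can be sketched end to end with a toy classifier. This is only an illustration of the sample, learn, predict cycle: the tiny naive Bayes model, the class name `TinyNaiveBayes`, and the example documents are all hypothetical stand-ins, and the slides do not say which model any particular product uses.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyNaiveBayes:
    """Toy multinomial naive Bayes, standing in for a real predictive model."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                # Laplace smoothing so unseen words do not zero out a class.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: the expert judges a sample (hypothetical documents and labels).
sample = [
    ("stock option grant approved by the board", "responsive"),
    ("option grant dates were changed", "responsive"),
    ("lunch options for friday", "non-responsive"),
    ("fantasy football picks this week", "non-responsive"),
]

# Step 2: the model learns from the expert's decisions.
model = TinyNaiveBayes().fit([t for t, _ in sample], [l for _, l in sample])

# Step 3: the model categorizes the remaining documents.
remaining = ["board approved the option grant", "friday lunch picks"]
predictions = [model.predict(doc) for doc in remaining]
print(list(zip(remaining, predictions)))
```

"Repeat as needed" would correspond to having the expert review some of the predictions, adding those judgments to `sample`, and refitting.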

Predictive coding achieves much higher accuracy (Jaccard)

[Figure: two stacked bars of responsive documents.]

Team A Only: 0.304; Team A and Team B: 0.281; Team B: 0.415

Humans: 0.186; Humans and Predictive Coding: 0.688; Predictive Coding: 0.126

Data from Roitblat, et al. and an Internal OrcaTec Case Study

Why doesn’t everyone use it?

• Attorneys don’t understand the technology

• May not be aware of the accuracy data

• May not understand how to fit it into their work flow

• Not in everyone’s economic interest

• Acceptable to judges?

Defensible?

Measure     TREC 2008   Roitblat et al. Team A   Roitblat et al. Team B   Predictive Coding*

Precision   0.210       0.197                    0.183                    0.899

Recall      0.555       0.488                    0.539                    0.873

*OrcaTec internal result
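Precision and recall are sometimes combined into a single score for easier comparison. The sketch below computes F1, the harmonic mean of the two, from the table values. F1 is not reported in the presentation; it is added here only as one possible summary, and the dictionary keys are descriptive labels rather than official names.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
results = {
    "TREC 2008": (0.210, 0.555),
    "Roitblat et al. Team A": (0.197, 0.488),
    "Roitblat et al. Team B": (0.183, 0.539),
    "Predictive Coding (OrcaTec internal result)": (0.899, 0.873),
}

scores = {name: f1(p, r) for name, (p, r) in results.items()}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

The harmonic mean is pulled toward the lower of the two inputs, so a method with reasonable recall but low precision still scores poorly.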

Thank you!

Sonya Sigler

650-281-8325

[email protected]

Herb Roitblat

770-650-7706x229

[email protected]
