SF Women in eDiscovery Sept 2011
12/5/2011 1
Getting to a Manageable Review Set
Intake Data: 100%
Duplicates: 25%
Non-Responsive: 20%
Produced: 12.25%
NR/Priv: 20%
Responsive & Priv: 15%
Junk/Spam/Porn: 20%
These figures vary based upon the data set received
Focus on finding, reviewing & using the “right” data, not just filtering data
Review risks
Failure to collect the right data
Failure to find responsive documents
Failure to recognize responsive documents
Failure to recognize privileged documents
Inconsistent treatment of documents (e.g., duplicates)
Failure to complete project in a timely manner
Sophisticated Tools
– Understand What They Do and Don’t Do Well
– Inform Yourself, Speak to References, Consultants
Search Methodologies
Keyword: specific exact words, proximity searches, stemming
Ontology: generalized words or phrases
Clustering: similarity of salient features
Social Network Analysis: relationships among relevant people
Relationship Analysis: documents with causal or sequential relationship
Content
Concept
Context
Visualization
Measurement
Myth
Keyword Searching is the Way to Go
If I agree to keyword terms, I am OK
Missing in Action (Under-inclusive)
Unwanted Extras (Over-inclusive)
Multiple subject/persons (Disambiguate)
Reality: Keyword Search is one tool among many!
Keyword culling
"simple keyword searches end up being both over- and under-inclusive."
Judge Paul Grimm, Victor Stanley, Inc. v. Creative Pipe, Inc., No. MJG-06-2662, 2008 U.S. Dist. LEXIS 42025 (D. Md. May 29, 2008).
Keyword Accuracy Example
8,553 responsive documents missed by keyword search (almost 8% of responsive documents missed by keyword search: under-inclusive)
Keyword search reduced the document set by only 47%
And 88% of the documents returned by keyword search were not responsive (over-inclusive)
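The two failure modes on this slide correspond to the standard retrieval measures: recall (how many responsive documents were found) and precision (how many returned documents were responsive). A minimal sketch follows; the function name and the counts are hypothetical, chosen only to mirror the slide's percentages (88% of hits not responsive, almost 8% of responsive documents missed), and are not the actual case data.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for one search.

    true_pos:  responsive documents the search returned
    false_pos: non-responsive documents the search returned (over-inclusive)
    false_neg: responsive documents the search missed (under-inclusive)
    """
    returned = true_pos + false_pos
    responsive = true_pos + false_neg
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / responsive if responsive else 0.0
    return precision, recall


# Hypothetical counts, not the real matter: 1,000 hits of which 120 are
# responsive, plus 10 responsive documents that the keywords never matched.
precision, recall = precision_recall(true_pos=120, false_pos=880, false_neg=10)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts, 88% of the returned documents are not responsive and 10 of 130 responsive documents (just under 8%) are missed.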
Under-Inclusive - Missing in Action
Missing abbreviations / acronyms / clippings:
– incentive stock option but not ISO
Missing inflectional variants:
– grant but not grants, granted, granting
Missing spellings or common misspellings:
– gray but not grey
– privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …
Missing syntactic variants:
– board of directors meeting but not meeting of the board of directors, BOD meeting, board meeting, BOD mtg…
Missing Synonyms/Paraphrases:
– Hire date but not start date
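One way to see the under-inclusion problem is to compare an exact keyword with a pattern built from a list of its variants. The sketch below is illustrative only: `VARIANTS` and `build_pattern` are hypothetical names, and the variant lists are hand-written from the examples on this slide rather than produced by any real tool.

```python
import re

# Hypothetical variant lists, written by hand for illustration only.
VARIANTS = {
    "grant": ["grant", "grants", "granted", "granting"],
    "incentive stock option": ["incentive stock option", "ISO"],
    "gray": ["gray", "grey"],
}


def build_pattern(term: str) -> re.Pattern:
    """Compile a whole-word, case-insensitive pattern for a term and its variants."""
    alternatives = VARIANTS.get(term, [term])
    # Longest first so multi-word variants are tried before shorter ones.
    ordered = sorted(alternatives, key=len, reverse=True)
    body = "|".join(re.escape(v) for v in ordered)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


text = "The board granted her an ISO in a grey area of the plan."
exact = re.compile(r"\bgrant\b", re.IGNORECASE)
print(bool(exact.search(text)))                   # exact keyword "grant" only
print(bool(build_pattern("grant").search(text)))  # keyword plus inflectional variants
```

The exact keyword finds nothing in this sentence, while the expanded pattern matches "granted". A hand-maintained list like this still misses any variant nobody thought to add.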
Over-Inclusive - Unwanted Extras (a)
Options
Target: Sheila was granted 100,000 options at $10
Match: What are our options for lunch?
Match in a signature line:
Amanda Wacz
Acme Stock Options Administrator
Destroy
Target: destroy evidence
Match in a disclaimer: The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.
Over-Inclusive - Unwanted Extras (b)
alter*
Target: alter, alters, altered, altering
Matches: alternate, alternative, alternation, altercate, altercation, alterably, …
grant
Target: stock option grant
Matches names: Grant Woods, Howard Grant
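The `alter*` example can be reproduced with two regular expressions: one that behaves roughly like the wildcard, and one that lists only the intended inflections. This is a sketch under the assumption that a trailing `*` means "any further word characters"; actual review platforms may define wildcards differently.

```python
import re

words = ["alter", "alters", "altered", "altering",
         "alternate", "alternative", "altercation"]

# Rough equivalent of the wildcard query alter*: "alter" plus any word characters.
wildcard = re.compile(r"^alter\w*$", re.IGNORECASE)
# Tighter pattern that lists only the inflections of the verb.
inflections = re.compile(r"^alter(?:s|ed|ing)?$", re.IGNORECASE)

wildcard_hits = [w for w in words if wildcard.match(w)]
inflection_hits = [w for w in words if inflections.match(w)]
print(wildcard_hits)
print(inflection_hits)
```

The wildcard matches all seven words; the tighter pattern matches only the four targets. Neither pattern helps with the "grant" case, where the unwanted matches are personal names spelled exactly like the keyword.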
Failure to Disambiguate
Words that Relate to Multiple Subjects
Example: refund is used to refer to:
– FERC-ordered refunds owed by Enron for overcharging
– Tax refunds (both corporate and personal)
– Mundane business matters
In a given matter, one might be of interest while the others are not
Technology Enhanced Review: Speed, Predictable Costs, and Accuracy
Automate any portion of the review
Example from a real case
[Figure: review funnel. Stages: Source Data (100%); Eliminate Duplicates & System Files; Non-Responsive Isolation (ontologies); NR by Technology Enhanced Review (removed another 18%); Responsive by Technology Enhanced Review (removed another 7%); Priv by High-Speed Manual Review. Other values shown in the figure: 30%, 30%, 15%, 22%, 3%.]
Example: “priv” ontology
Valuable, re-usable work product
Combines classifiers into concepts, into bigger concepts
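The idea of building concepts out of smaller pieces can be pictured as a nested mapping that is expanded into leaf terms. The sketch below is a simplification: it treats each lowest-level classifier as a plain search term, and the concept names and terms in `ONTOLOGY` are hypothetical, not the contents of any real "priv" ontology.

```python
# Hypothetical mini-ontology: each concept maps to sub-concepts or leaf terms.
# Names and terms are illustrative only. The structure is assumed to be acyclic.
ONTOLOGY = {
    "priv": ["attorney_names", "legal_advice_terms"],
    "attorney_names": ["jane doe esq", "outside counsel"],
    "legal_advice_terms": ["attorney-client", "legal advice", "work product"],
}


def expand(concept: str, ontology: dict[str, list[str]]) -> set[str]:
    """Return every leaf term reachable from a concept."""
    if concept not in ontology:
        return {concept}  # a leaf term expands to itself
    terms: set[str] = set()
    for child in ontology[concept]:
        terms |= expand(child, ontology)
    return terms


print(sorted(expand("priv", ONTOLOGY)))
```

Because the bigger concept is defined in terms of smaller ones, adding a term to `legal_advice_terms` automatically extends `priv`, which is one reason such a structure can be reused across matters.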
Disclaimer Detection
Disclaimers can throw off attempts to detect privileged communications
Prevalent throughout many companies, even on trivial communications
Detect them automatically, and exclude them from searches
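A very simple way to exclude disclaimers from a search is to cut the message body at the first recognizable disclaimer phrase before running the keyword. The sketch below assumes disclaimers sit at the end of a message and uses a short hand-written list of marker phrases (`DISCLAIMER_MARKERS` is hypothetical); it is not a description of how any particular product detects disclaimers.

```python
import re

# Hypothetical marker phrases; a production detector would need far more
# than this short hand-written list.
DISCLAIMER_MARKERS = [
    r"the information in this email",
    r"this (?:e-?mail|message) (?:and any attachments )?(?:is|are) confidential",
    r"intended solely for the use of",
]
DISCLAIMER_RE = re.compile("|".join(DISCLAIMER_MARKERS), re.IGNORECASE)


def strip_disclaimer(body: str) -> str:
    """Drop everything from the first disclaimer marker to the end of the body."""
    match = DISCLAIMER_RE.search(body)
    return body[:match.start()] if match else body


def keyword_hit(body: str, keyword: str) -> bool:
    """Whole-word keyword search over the body with the disclaimer removed."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return bool(pattern.search(strip_disclaimer(body)))


email = (
    "Lunch at noon?\n\n"
    "The information in this email, and any attachments, may contain "
    "confidential and/or privileged information. If you have received this "
    "transmission in error, please contact the sender and destroy this message."
)
print(keyword_hit(email, "destroy"))
print(keyword_hit("Please destroy the draft.", "destroy"))
```

The first search finds nothing because "destroy" and "privileged" occur only in the disclaimer; the second still matches a genuine use of the word. Cutting at the first marker would also drop quoted replies that follow a disclaimer, so a real detector would need to find where the disclaimer ends as well.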
[Figure: diagram of responsive documents. Labels: Responsive; Privileged by Actor and Term; Privileged by Actor Only; Privileged by Term Only; Privileged by Disclaimer Only; Domain of Disclaimer Detection.]
Priv Logs
Expensive - But Do NOT Have to Be
In re Vioxx Products Liability Litigation (E.D. La. 2007)
Merck’s Priv Log had 30,000 items on it
– How to Make a Judge Angry
– How to Waste Client Money
– How to Attract Sanctions
Transparency of Process
Discussing Review Protocols
– Provide transparent, defensible, sophisticated search based on document content
– Clustering, Ontologies, Analytics, and yes, sometimes Keywords too
Develop search methodologies for each case
– Use technology experts in consultation with case / legal experts
Results verifiable by Quality Control
– Defensible sampling
Sophisticated Tools
– Understand What They Do and Don’t Do Well
– Inform Yourself, Speak to References, Consultants
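The slide does not say how the defensible sampling is carried out. One common approach, sketched here as an assumption, is to draw a random sample from the documents coded non-responsive, have reviewers check it, and report the estimated miss rate with a margin of error. The function names and the counts below are hypothetical, and the normal approximation used is only rough when the observed proportion is very small.

```python
import math
import random


def sample_ids(doc_ids: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Draw a simple random sample without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def proportion_with_margin(hits: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Estimate a proportion and its margin of error (normal approximation).

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# Hypothetical QC check: 400 documents sampled from the set coded
# non-responsive, of which reviewers found 8 that were actually responsive.
rate, margin = proportion_with_margin(hits=8, sample_size=400)
print(f"estimated miss rate {rate:.3f} +/- {margin:.3f}")
```

Recording the seed alongside the sample means the same documents can be drawn again if the process is later challenged.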
Blair & Maron: Keyword search is incomplete
[Figure: bar chart of responsive documents found (axis 0% to 100%), Predicted vs. Obtained. Predicted: what the lawyers thought they were finding. Obtained: what they actually found.]
Blair and Maron, Communications of the ACM, 28, 1985, 289-299
Blair & Maron Study: 20% recall
Lawyers picked 3 key terms, B & M found 26 more
Defense: “Unfortunate incident” Plaintiff: “Disaster”
Blair and Maron, Communications of the ACM, 28, 1985, 289-299
Blair and Maron: “It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents.”
Predictive Coding
Document categorization in Legal Discovery: Computer Classification vs. Manual Review
Herbert L. Roitblat, Anne Kershaw, & Patrick Oot
[Figure: bar chart of agreement with original (axis 0.5 to 1) for Team A and Team B (manual review) and System C and System D (computer classification).]
Roitblat, Kershaw, & Oot, 2010, JASIST
Gold Standard
Turing test
Alan Turing, 1912-1954
Substantial disagreement between Team A & Team B
Responsive Documents: A 629, Both 580 (28%), B 858
Roitblat, Kershaw, & Oot, 2010, JASIST
Conclusion
The computer systems yielded a comparable level of performance relative to manual review
Fewer people, less time, less cost
Measure performance to evaluate
Will lawyers lose control?
Computer system amplifies the intelligence of the Expert
Will lawyers lose their jobs?
Tap into the mind of an expert
Technology-Enhanced or Automated Review
Setup
Sample
Expert judges sample: Responsive / Non-responsive
Model learns
Model predicts
Model categorizes all remaining documents: Responsive / Non-responsive
Repeat as needed
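The workflow above can be sketched end to end with a toy classifier. The deck does not say which learning algorithm is used, so the naive Bayes model below is a swapped-in stand-in chosen because it fits in a few lines; the class name, documents, and labels are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs: list[str], labels: list[str]) -> "TinyNaiveBayes":
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Sample judged by the expert (hypothetical documents and labels).
sample = [
    ("stock option grant approved by the board", "responsive"),
    ("backdated option grant date memo", "responsive"),
    ("lunch options for friday", "non-responsive"),
    ("fantasy football league standings", "non-responsive"),
]
# The model learns from the expert's judgments.
model = TinyNaiveBayes().fit([d for d, _ in sample], [l for _, l in sample])

# The model predicts a category for each remaining document.
remaining = ["board memo on option grant", "football lunch friday"]
predictions = {doc: model.predict(doc) for doc in remaining}
print(predictions)
```

"Repeat as needed" would correspond to having the expert judge a further sample, adding those judgments to `sample`, and calling `fit` again.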
Predictive coding achieves much higher accuracy (Jaccard)
Responsive Documents, humans vs. humans: Team A Only 0.304, Team A and Team B 0.281, Team B 0.415
Responsive Documents, humans vs. predictive coding: Humans 0.186, Humans and Predictive Coding 0.688, Predictive Coding 0.126
Data from Roitblat, et al. and an Internal OrcaTec Case Study
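The Jaccard index is the size of the overlap between two sets divided by the size of their union. A short sketch, using the Team A / Team B responsive-document counts shown earlier in the deck (629 for A, 580 for both, 858 for B); the function name is hypothetical.

```python
def jaccard_from_counts(only_a: int, both: int, only_b: int) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| from the three disjoint region counts."""
    union = only_a + both + only_b
    return both / union if union else 0.0


# Counts from the Team A / Team B comparison in this deck.
team_overlap = jaccard_from_counts(only_a=629, both=580, only_b=858)
print(round(team_overlap, 3))  # 0.281
```

The result matches the 0.281 (28%) figure reported for the two human teams.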
Why doesn’t everyone use it?
• Attorneys don’t understand the technology
• May not be aware of the accuracy data
• May not understand how to fit into their work flow
• Not in everyone’s economic interest
• Acceptable to judges? Defensible?
Measure     TREC 2008   Roitblat et al. Team A   Roitblat et al. Team B   Predictive Coding*
Precision   0.210       0.197                    0.183                    0.899
Recall      0.555       0.488                    0.539                    0.873
*OrcaTec internal result
Thank you!
Sonya Sigler
650-281-8325
Herb Roitblat
770-650-7706x229