When to stop reviewing
documents in eDiscovery cases
The Lit i View Quality Monitor and Endpoint Detector
ⒸUBIC, Inc. 2013 All Rights Reserved.
Jakob Halskov, Hideki Takeda UBIC Inc., Technology Dept.
MEDES/ACM 2013
Luxembourg, October 30th 2013
Tokyo | Osaka | Nagoya | Seoul | Taipei | Hong Kong | Silicon Valley | Washington DC | New York | London
Outline of talk
• Introduction: Redefining Big Data!
• The Discovery system
• UBIC’s Legal Cloud & Lit i View SaaS
• Outline of Predictive Coding technology
• Impact of Predictive Coding: case study
• Estimating sample size & HOT ratio
• Demo of UBIC’s Quality Monitor and Endpoint Detector
Behavior Informatics
We need new approaches for analyzing
Human Thought and Behavior
UBIC redefines Big Data
Big Data is a Universe of
Human Thought and Behavior
What is Behavior Informatics?
• Informatics
– Statistics – Mathematics
– Data Mining – Text Mining
– Speech Technology
• Behavioral Science
– Criminology
– Sociology
– Psychology
• Discover Risk and Discover Knowledge More Effectively and Efficiently
Applications of Behavior Informatics
• Legal Intelligence (eDiscovery): Discover Risk for Company
• Digital Forensics
• Business Intelligence: Discover Knowledge
• Medicine
• Intelligence (Security Support): Discover Risk for Community
• M&A
The Discovery system
• Data protection and privacy laws in the US are lax
– Categorical document requests (virtually all types of ESI are discoverable)
• Being forced to give information to a competitor/government is RISKY
– Narrow down the amount of information released
– UBIC makes this process as painless as possible while ensuring defensibility
• At the “Meet and Confer” the opposing parties will agree to a Discovery Plan (aka “protocol”)
– Defining the scope of responsive (relevant) data
– Scope of Accessibility & cost shifting (who is paying?)
– Defining privileged data (exempt from production)
– Setting performance goals and deadlines for production
• Recall and defensibility are key under this system
• Famous cases
– ENRON (TREC Legal track)
– Global Aerospace (Dec 2012, Predictive Coding becomes mainstream in eDiscovery, judge sets minimum recall rate at 75%)
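Recall, as used in this system, is the share of all responsive documents that the review actually finds. A minimal Python sketch of checking a recall figure against an agreed floor follows. The function names and the document counts are hypothetical, and the 75% default only mirrors the minimum rate mentioned for Global Aerospace; it is not a description of how Lit i View computes recall.

```python
def recall(found_responsive: int, total_responsive: int) -> float:
    """Fraction of all responsive documents that the review has found."""
    if total_responsive <= 0:
        raise ValueError("total_responsive must be positive")
    return found_responsive / total_responsive


def meets_recall_floor(found_responsive: int, total_responsive: int,
                       floor: float = 0.75) -> bool:
    """True when recall reaches the agreed minimum (0.75 is an assumed default)."""
    return recall(found_responsive, total_responsive) >= floor


# Hypothetical counts: 8,000 responsive documents found out of 10,000 in total.
print(recall(8_000, 10_000))              # 0.8
print(meets_recall_floor(8_000, 10_000))  # True
```

In practice the total number of responsive documents is not known in advance and has to be estimated from a sample.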
The nine phases of eDiscovery
Review typically accounts for 70% of the total cost of an eDiscovery case.
UBIC’s Lit i View software
• Lit i View (Cloud-based SaaS)
– Used in more than 275 cross-border litigation cases, including
• Plaintiff = private
– Intellectual Property (patent infringement)
– Product Liability, …
• Plaintiff = government
– Anti-trust regulations (cartels)
– Covers virtually all phases of the EDRM
• Custodian identification and management (“Central Linkage”)
• Legal hold management (“Easy hold”)
• Collection & preservation
• Processing (+CJK support, encoding/segmentation etc.)
• Analysis & Review (Predictive Coding)
UBIC Legal Cloud overview
Customer benefit: Most data can stay local (in Asia or the US) for the duration of the case
Outline of Predictive Coding
What:
Supervised machine learning algorithm assigning Relevance Scores to documents
Why:
– Improve quality/consistency of Review
– Save time
– Optimize sampling strategy (control costs)
Flexible & iterative design
– Random sample extraction
– Feature selection
• Morphological analysis (+CJK)
• Statistical analysis
– Feature (re)weighting
• Association measures
– Document (re)scoring
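The four steps of this design can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: whitespace tokenization stands in for morphological analysis, a smoothed log-odds ratio stands in for the unspecified association measure, and the corpus, tags, and function names are invented. It is not UBIC's algorithm.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Stand-in for morphological analysis: lowercase whitespace split.
    return text.lower().split()


def feature_weights(tagged_sample):
    """Weight each term by a smoothed log-odds ratio (HOT vs. not HOT)."""
    hot, cold = Counter(), Counter()
    for text, is_hot in tagged_sample:
        (hot if is_hot else cold).update(set(tokenize(text)))
    n_hot = sum(1 for _, is_hot in tagged_sample if is_hot)
    n_cold = len(tagged_sample) - n_hot
    weights = {}
    for term in set(hot) | set(cold):
        p_hot = (hot[term] + 0.5) / (n_hot + 1.0)    # smoothed document frequency
        p_cold = (cold[term] + 0.5) / (n_cold + 1.0)
        weights[term] = math.log(p_hot / p_cold)
    return weights


def relevance_score(text, weights):
    """Sum the weights of the distinct known terms in a document."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(text)))


# Toy corpus; the HOT tags stand in for a human reviewer's decisions.
corpus = {
    "price fixing meeting with competitor": True,
    "agree on price before the bid": True,
    "competitor bid price discussed at meeting": True,
    "lunch menu for friday": False,
    "team lunch and parking update": False,
    "parking permit renewal": False,
}

rng = random.Random(0)
sample_texts = rng.sample(sorted(corpus), k=4)      # 1. random sample extraction
tagged = [(text, corpus[text]) for text in sample_texts]
weights = feature_weights(tagged)                   # 2-3. feature selection and weighting
ranked = sorted(corpus, key=lambda text: relevance_score(text, weights),
                reverse=True)                       # 4. document scoring
for text in ranked:
    print(f"{relevance_score(text, weights):+.2f}  {text}")
```

Re-running the weighting and scoring steps after more documents have been tagged corresponds to the "(re)weighting" and "(re)scoring" in the outline.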
Impact of Predictive Coding: case study
• Japanese maker, US law firm
• Over 1 million documents
• Predictive Coding (PC) carried out twice
• Review costs were reduced by 40% vs. conventional human review
Estimating minimum sample size, ns
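The error level, $\Delta p$, for the predictor $p = N_{HOT}/N$ is given by:
\[ \Delta p = \gamma \sqrt{\frac{N - n_s}{N - 1} \cdot \frac{p(1 - p)}{n_s}} \]
\[ \Longleftrightarrow \quad n_s = \frac{\gamma^2}{\Delta p^2} \cdot \frac{1}{\frac{N - 1}{N} \cdot \frac{1}{p(1 - p)} + \frac{\gamma^2}{N \Delta p^2}} \]
When N is much greater than $n_s$, $\frac{N - n_s}{N - 1} \to 1$, and thus:
\[ n_s \approx \frac{\gamma^2}{\Delta p^2} \, p(1 - p) \]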
Unfortunately, we do not know p (as NHOT is unknown)
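The two expressions for the minimum sample size can be evaluated directly once a value for p is supplied. The Python sketch below does this; the function names are hypothetical, and p = 0.5 in the usage lines is only an illustrative input, since the true p is unknown.

```python
def min_sample_size(N, delta_p, gamma, p):
    """Minimum sample size n_s including the finite population correction.

    N: population size, delta_p: error level,
    gamma: confidence coefficient, p: assumed HOT ratio (0 < p < 1).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ratio = gamma ** 2 / delta_p ** 2
    return ratio / ((N - 1) / N / (p * (1 - p)) + ratio / N)


def approx_sample_size(delta_p, gamma, p):
    """Approximation of n_s for N much greater than n_s."""
    return gamma ** 2 / delta_p ** 2 * p * (1 - p)


# Illustrative inputs: error level 0.01, gamma = 1.96, p = 0.5.
print(round(min_sample_size(10_000, 0.01, 1.96, 0.5)))     # 4899
print(round(min_sample_size(1_000_000, 0.01, 1.96, 0.5)))  # 9513
print(round(approx_sample_size(0.01, 1.96, 0.5)))          # 9604
```

The results are rounded to the nearest integer here; rounding up instead would be the more conservative choice for a minimum.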
Estimating NHOT and minimum ns
N            Conf. Level = 95%           Conf. Level = 99%
             ns        ns (ns << N)      ns        ns (ns << N)
10,000       4,899     9,604             6,247     16,641
100,000      8,057     9,604             14,267    16,641
1,000,000    9,513     9,604             16,369    16,641
5,000,000    9,586     9,604             16,586    16,641
[Figure: plot of f(p) = p(1-p) against p, for p from 0 to 1]
With p = 0.5 as the worst case (giving the highest sample size), we get
\[ n_s \approx \frac{1}{4} \frac{\gamma^2}{\Delta p^2} \]
Using a confidence level of 95%, the confidence coefficient ($\gamma$) is 1.96, and we can now compute the minimum sample sizes for various N, for example setting the error level at 0.01.
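As a worked check of the arithmetic, with the stated values $\gamma = 1.96$, $\Delta p = 0.01$ and $p = 0.5$, the approximation gives the N-independent value first; the second line applies the finite-population form of $n_s$ to N = 10,000, chosen here only as an example:

```latex
n_s \approx \frac{1}{4} \cdot \frac{1.96^2}{0.01^2}
    = \frac{1}{4} \cdot 38416 = 9604

n_s = \frac{38416}{\frac{9999}{10000} \cdot 4 + \frac{38416}{10000}}
    = \frac{38416}{7.8412} \approx 4899
```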
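\[ N_{HOT}^{est} = N \left( p_{TAG} \pm \Delta p \right) \]
\[ \Delta p = \gamma \sqrt{\frac{N - n_s}{N - 1} \cdot \frac{p_{TAG} (1 - p_{TAG})}{n_s}} \]
\[ p_{TAG} = \frac{n_{TAG}}{n_s} \]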
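The estimate of the number of HOT documents can be sketched in Python as follows. The function name and the sample counts are hypothetical; the calculation follows the three formulas above, with n_tag taken to be the number of sampled documents tagged HOT.

```python
import math


def estimate_n_hot(N, n_s, n_tag, gamma=1.96):
    """Return (low, point, high) estimates of N_HOT from a tagged random sample.

    N: population size, n_s: sample size,
    n_tag: sampled documents tagged HOT, gamma: confidence coefficient.
    """
    if not 0 < n_s <= N:
        raise ValueError("n_s must satisfy 0 < n_s <= N")
    if not 0 <= n_tag <= n_s:
        raise ValueError("n_tag must satisfy 0 <= n_tag <= n_s")
    p_tag = n_tag / n_s
    fpc = (N - n_s) / (N - 1)  # finite population correction
    delta_p = gamma * math.sqrt(fpc * p_tag * (1 - p_tag) / n_s)
    return N * (p_tag - delta_p), N * p_tag, N * (p_tag + delta_p)


# Hypothetical review: 951 of 9,513 sampled documents were tagged HOT.
low, point, high = estimate_n_hot(1_000_000, 9_513, 951)
print(f"N_HOT is roughly {point:,.0f} (between {low:,.0f} and {high:,.0f})")
```

Because p_tag(1 - p_tag) is largest at 0.5, a sample whose HOT ratio is far from 0.5 yields a narrower interval than the worst-case error level used to size the sample.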
Quality Monitor demo
Conclusion & new/future features
• UBIC’s QM and EPD provide a user-friendly UI to secure a high
quality and defensible outcome of the review process
• Leveraging theory from Social Network Analysis, UBIC released “Central Linkage” on October 1st 2013