When to stop reviewing documents in eDiscovery...

Outline of talk

• Introduction: Redefining Big Data!

• The Discovery system

• UBIC’s Legal Cloud & Lit i View SaaS

• Outline of Predictive Coding technology

• Impact of Predictive Coding: case study

• Estimating sample size & HOT ratio

• Demo of UBIC’s Quality Monitor and Endpoint Detector

ⒸUBIC, Inc. 2013 All Rights Reserved.


Behavior Informatics

We need new approaches for analyzing

Human Thought and Behavior

UBIC redefines Big Data

Big Data is a Universe of

Human Thought and Behavior


Informatics

Statistics – Mathematics

Data Mining - Text Mining

Speech Technology

Behavioral Science

Criminology

Sociology

Psychology

Discover Risk

Discover Knowledge

More Effectively and Efficiently

What is Behavior Informatics?

Legal Intelligence (eDiscovery)

Discover Risk for Company

Digital Forensics

Business Intelligence

Discover Knowledge

Medicine

Intelligence (Security Support)

Discover Risk for Community

M&A


Applications of Behavior Informatics

The Discovery system

• Data protection and privacy laws in the US are lax

– Categorical document requests (virtually all types of ESI are discoverable)

• Being forced to give information to a competitor/government is RISKY

– Narrow down the amount of information released

– UBIC makes this process as painless as possible while ensuring defensibility

• At the “Meet and Confer”, the opposing parties agree on a Discovery Plan (aka “protocol”)

– Defining the scope of responsive (relevant) data

– Scope of Accessibility & cost shifting (who is paying?)

– Defining privileged data (exempt from production)

– Setting performance goals and deadlines for production

• Recall and defensibility are key under this system

• Famous cases

– ENRON (TREC Legal track)

– Global Aerospace (Dec 2012, Predictive Coding becomes mainstream in eDiscovery, judge sets minimum recall rate at 75%)


The nine phases of eDiscovery


Review typically accounts for 70% of the total cost of an eDiscovery case.

UBIC’s Lit i View software

• Lit i View (Cloud-based SaaS)

– Used in more than 275 cross-border litigation cases, including

• Plaintiff = private

– Intellectual Property (patent infringement)

– Product Liability, …

• Plaintiff = government

– Anti-trust regulations (cartels)

– Covers virtually all phases of the EDRM

• Custodian identification and management (“Central Linkage”)

• Legal hold management (“Easy hold”)

• Collection & preservation

• Processing (+CJK support, encoding/segmentation etc.)

• Analysis & Review (Predictive Coding)


UBIC Legal Cloud overview


Customer benefit: Most data can remain local (in Asia or the US) for the duration of the case

Outline of Predictive Coding

What:

Supervised machine learning algorithm assigning Relevance Scores to documents

Why:

– Improve quality/consistency of Review

– Save time

– Optimize sampling strategy (control costs)

Flexible & iterative design

– Random sample extraction

– Feature selection

• Morphological analysis (+CJK)

• Statistical analysis

– Feature (re)weighting

• Association measures

– Document (re)scoring
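
The iterative design above can be illustrated with a minimal Python sketch. It is not UBIC's algorithm, which these slides do not specify. It assumes whitespace tokenization as a stand-in for morphological analysis, a smoothed log ratio of document frequencies as the association measure, and a Relevance Score equal to the mean weight of a document's tokens. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Stand-in for morphological analysis: lowercase whitespace split.
    return text.lower().split()


def draw_sample(doc_ids, size, seed=0):
    # Random sample extraction: choose documents for human tagging.
    return random.Random(seed).sample(list(doc_ids), size)


def learn_weights(tagged_docs):
    # tagged_docs: list of (text, is_hot) pairs produced by human review.
    hot, cold = Counter(), Counter()
    n_hot = n_cold = 0
    for text, is_hot in tagged_docs:
        terms = set(tokenize(text))  # document frequency, not term frequency
        if is_hot:
            hot.update(terms)
            n_hot += 1
        else:
            cold.update(terms)
            n_cold += 1
    weights = {}
    for term in set(hot) | set(cold):
        # Assumed association measure: smoothed log ratio of the share of
        # HOT documents containing the term to the share of non-HOT ones.
        p_hot = (hot[term] + 0.5) / (n_hot + 1.0)
        p_cold = (cold[term] + 0.5) / (n_cold + 1.0)
        weights[term] = math.log(p_hot / p_cold)
    return weights


def score(text, weights):
    # Document scoring: mean feature weight; unseen terms contribute 0.
    terms = tokenize(text)
    if not terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in terms) / len(terms)


tagged = [
    ("price fixing meeting with competitor", True),
    ("lunch menu for friday", False),
    ("competitor price agreement draft", True),
    ("friday team lunch", False),
]
weights = learn_weights(tagged)
untagged = ["friday lunch plans", "competitor price list"]
ranked = sorted(untagged, key=lambda d: score(d, weights), reverse=True)
```

Re-running `learn_weights` after each new batch of tagged documents corresponds to the feature (re)weighting and document (re)scoring steps.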


Impact of Predictive Coding: case study


• Japanese maker, US law firm

• Over 1 million documents

• Predictive Coding (PC) carried out twice

• Review costs were reduced by 40% vs. conventional human review

Estimating minimum sample size, ns

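The error level, $\Delta p$, for the predictor $p = \frac{N_{HOT}}{N}$ is given by:

$$\Delta p = \gamma \sqrt{\frac{N - n_s}{N - 1} \cdot \frac{p(1 - p)}{n_s}}$$

$$\Leftrightarrow \quad n_s = \frac{\gamma^2}{\Delta p^2} \cdot \frac{1}{\frac{N - 1}{N} \cdot \frac{1}{p(1 - p)} + \frac{\gamma^2}{N \Delta p^2}}$$

When $N$ is much greater than $n_s$: $\frac{N - n_s}{N - 1} \to 1$, and thus:

$$n_s \approx \frac{\gamma^2}{\Delta p^2} \, p(1 - p)$$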
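
For readers who want to check the exact (finite-population) expression for $n_s$, the rearrangement can be spelled out step by step. It uses only the symbols defined on this slide. Squaring the expression for $\Delta p$ and collecting the terms in $n_s$ gives:

$$\Delta p^2 = \gamma^2 \, \frac{N - n_s}{N - 1} \cdot \frac{p(1 - p)}{n_s}$$

$$n_s \, \Delta p^2 (N - 1) = \gamma^2 p(1 - p)(N - n_s)$$

$$n_s \left[ \Delta p^2 (N - 1) + \gamma^2 p(1 - p) \right] = \gamma^2 p(1 - p) \, N$$

$$n_s = \frac{\gamma^2 p(1 - p) \, N}{\Delta p^2 (N - 1) + \gamma^2 p(1 - p)}$$

Dividing the numerator and the denominator by $N \, p(1 - p) \, \Delta p^2$ yields the exact form of $n_s$ stated on this slide.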

Unfortunately, we do not know $p$ (as $N_{HOT}$ is unknown)


Estimating NHOT and minimum ns

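Minimum sample size ns (error level Δp = 0.01, worst case p = 0.5)

N            Conf. Level = 95%       Conf. Level = 99%
             ns       ns << N        ns        ns << N
10,000       4,899    9,604          6,247     16,641
100,000      8,763    9,604          14,267    16,641
1,000,000    9,513    9,604          16,369    16,641
5,000,000    9,586    9,604          16,586    16,641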


[Figure: plot of f(p) = p(1-p) for p from 0 to 1; vertical axis from 0 to 0.3]

With $p = 0.5$ as the worst case (giving the largest sample size), we get

$$n_s \approx \frac{1}{4} \cdot \frac{\gamma^2}{\Delta p^2}$$

Using a confidence level of 95%, the confidence coefficient ($\gamma$) is 1.96, and we can now compute the minimum sample sizes for various $N$, for example setting the error level at 0.01.
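
The computation described here can be sketched in a few lines of Python. The function name `min_sample_size` and its parameter names are illustrative and not part of any UBIC product. The value gamma = 2.58 for the 99% level is an assumption, inferred from the slide's figure of 16,641 for ns << N.

```python
def min_sample_size(N, delta_p, gamma=1.96, p=0.5):
    """Minimum sample size n_s for population N, with finite population correction.

    N       : total number of documents
    delta_p : tolerated error level on the HOT ratio p
    gamma   : confidence coefficient (1.96 for a 95% confidence level)
    p       : assumed HOT ratio; 0.5 is the worst case
    """
    # Limit for n_s << N: gamma^2 * p(1 - p) / delta_p^2
    n0 = gamma ** 2 * p * (1 - p) / delta_p ** 2
    # Exact form, algebraically equal to
    # gamma^2 p(1-p) N / (delta_p^2 (N-1) + gamma^2 p(1-p))
    return n0 / ((N - 1) / N + n0 / N)


for N in (10_000, 100_000, 1_000_000, 5_000_000):
    n95 = round(min_sample_size(N, 0.01, gamma=1.96))
    n99 = round(min_sample_size(N, 0.01, gamma=2.58))  # assumed coefficient
    print(f"{N:>9,}  {n95:>6,}  {n99:>6,}")
```

For N = 10,000 at the 95% level the function returns about 4,899.25. As N grows, the result approaches the ns << N limit of 9,604.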

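$$N_{HOT}^{est} = N \, (p_{TAG} \pm \Delta p)$$

$$\Delta p = \gamma \sqrt{\frac{N - n_s}{N - 1} \cdot \frac{p_{TAG}(1 - p_{TAG})}{n_s}}$$

$$p_{TAG} = \frac{n_{TAG}}{n_s}$$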
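
As a rough illustration of the estimate of the number of HOT documents, the following Python sketch computes the point estimate and its interval from a reviewed random sample. The function name `estimate_n_hot` is hypothetical. Clipping the interval to the range 0 to N is an added assumption that the slide does not state.

```python
import math


def estimate_n_hot(N, n_s, n_tag, gamma=1.96):
    """Estimate the number of HOT documents in a population of N.

    n_s   : number of randomly sampled documents that were reviewed
    n_tag : number of those sampled documents tagged HOT
    gamma : confidence coefficient (1.96 for a 95% confidence level)
    Returns (point estimate, lower bound, upper bound).
    """
    p_tag = n_tag / n_s
    fpc = (N - n_s) / (N - 1)  # finite population correction
    delta_p = gamma * math.sqrt(fpc * p_tag * (1 - p_tag) / n_s)
    low = max(0.0, N * (p_tag - delta_p))        # clipping is an assumption
    high = min(float(N), N * (p_tag + delta_p))  # clipping is an assumption
    return N * p_tag, low, high


# Example: 1,000 HOT tags in a sample of 10,000 out of 1,000,000 documents.
estimate, low, high = estimate_n_hot(1_000_000, 10_000, 1_000)
```

One plausible use is as the denominator when tracking recall during review: the number of HOT documents found so far, divided by the estimate, indicates how close the review is to a recall target agreed in the protocol.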

Quality Monitor demo


Conclusion & new/future features

• UBIC’s Quality Monitor (QM) and Endpoint Detector (EPD) provide a user-friendly UI to secure a high-quality and defensible outcome of the review process

• Leveraging theory from Social Network Analysis, UBIC released “Central Linkage” on October 1st, 2013

References

Diesner, Jana; Terrill L. Frantz; Kathleen M. Carley. 2005. Communication Networks from the Enron Email Corpus. “It’s Always About the People. Enron is no Different”. In Computational & Mathematical Organization Theory. 11 (3), 201-228. Kluwer, MA.

Halskov, Jakob. 2013. Augmenting Predictive Analytics for eDiscovery with Richer Linguistic Features. Poster presentation at Asian Summer School in Information Access, ASSIA (Research Center for Knowledge Communities, Tsukuba University, Japan, June 22-24). http://www.kc.tsukuba.ac.jp/assia2013/poster_presentation

Oard, Douglas W.; Jason R. Baron; Bruce Hedin; David D. Lewis; Stephen Tomlinson. 2010. Evaluation of information retrieval for E-discovery. In Artificial Intelligence and Law, 18 (4), 347-386. Springer, Amsterdam.

Takeda, Hideki. 2013. Trend on Digital Forensic Technologies and Business in Japan. Keynote Speech. In Proceedings of the 5th IEEE International Workshop on Computer Forensics in Software Engineering (Kyoto, Japan, July 22-26). IEEE Computer Society Press, CA.

Webber, William. 2011. Re-examining the Effectiveness of Manual Review. In Proceedings of SIGIR 2011 Information Retrieval for E-Discovery Workshop (Beijing, China, July 28). http://www.umiacs.umd.edu/~oard/sire11/

