Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL


description

I, Robot, Esquire: Information Extraction and Summarization in Legal Documents

Pundits constantly predict the demise of many types of knowledge workers at the hands of intelligent machines, and few professionals perform more textual document review than lawyers. In this session, I’ll share work that eBrevia has been doing to apply research from the fields of ML and NLP to summarize and extract information from legal contracts to help accelerate corporate mergers and acquisitions. I will look at the unique characteristics of the legal industry, examine some supervised and semi-supervised training strategies and classification models, and discuss the limitations of these techniques and the essential role lawyers will continue to play.

Transcript of Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Page 1: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

[email protected] | (203) 870-3000 Proprietary & Confidential

I, Robot, Esquire: Information Extraction and Summarization in Legal Documents

Jacob Mundt – MLConf ATL

Page 2: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Who we are

One of four national winners in Startup America DEMO Competition

One of CIO.com’s top ten enterprise products at DEMO Fall 2012

Most Promising Software Product of the Year award from Connecticut Technology Council

Completed Connecticut Innovations’ TechStart Fund Program

Commercializing machine learning technology developed at Columbia University to make legal document review more efficient, accurate and cost effective.

Page 3: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Management Team

Ned Gannon, CEO
– Large law firm experience; tech startup experience; sales & business development experience
– Harvard Law

Adam Nguyen, COO
– Founder of Ivy Link (20+ staff); Chief of Staff of 350-person real estate private equity firm
– Harvard Law; law firm & in-house experience

Jake Mundt, CTO
– Led R&D team at tech company extracting data in medical industry
– Columbia Masters; NLP researcher

Page 4: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

The Future of Law

“In contrast, in looking 25 years ahead from now, I argue that it would be absurd to expect lawyers and courts to carry on operating as they do now.”

– Richard Susskind, Tomorrow’s Lawyers

“Well, if droids could think, there'd be none of us here, would there?”

– Obi-Wan Kenobi

Page 5: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire - Overview

Motivation

Can we use ML and NLP?

eBrevia Solution – Deep Dive

Challenges and Lessons Learned

Future directions

Page 6: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Corporate Mergers and Due Diligence

Business due diligence

Legal due diligence

Closing

Page 7: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Corporate mergers and due diligence

Page 8: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Legal Due Diligence Process

Extract → Summarize → Analyze → Advise

Junior attorneys → Senior attorneys

Teams of junior attorneys billed out at $300-$500/hour poring over hundreds of contracts in virtual data rooms to summarize their content and identify red flags.

Page 9: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Legal Due Diligence Summary

Here come the spreadsheets – summarize ALL the contracts:
– leases
– executive employment agreements
– supplier agreements
– loan/credit agreements

Extract key data points

Also extract any clauses that discuss particular provisions

Page 10: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

The Stone Age

On-site data room with reams of documents, organized by seller

Buyer’s agents travel to evaluate the target, under constant supervision

Page 11: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

State of the Art – Virtual Data Rooms

Digitized, but not machine readable

Some simple OCR and searching capability

Commercial systems like IntraLinks have advanced capabilities, but mostly focused on security and auditability.

Page 12: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

The Future is Here

Misses stems, synonyms, plural forms

False positives—some common words also have special meanings in context.

Impossible to find dates, parties, dollar amounts, or any other generic quantities

We can do better
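The bullets above read as shortcomings of literal keyword search. A minimal sketch of the first one follows, assuming a made-up list of sentences and a deliberately crude suffix-stripping normalizer (`crude_stem` and its suffix list are illustrative only, not anything the talk describes):

```python
import re

# Illustrative suffix list only; a real system would use a proper stemmer or lemmatizer.
SUFFIXES = ("ication", "ying", "ies", "ied", "ing", "es", "ed", "y", "s")

def crude_stem(word):
    """Lowercase a word and strip at most one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def exact_search(keyword, sentences):
    """Literal, case-insensitive substring search (what Ctrl+F does)."""
    needle = keyword.lower()
    return [s for s in sentences if needle in s.lower()]

def stemmed_search(keyword, sentences):
    """Match any token whose crude stem equals the keyword's crude stem."""
    target = crude_stem(keyword)
    hits = []
    for s in sentences:
        tokens = re.findall(r"[A-Za-z]+", s)
        if any(crude_stem(t) == target for t in tokens):
            hits.append(s)
    return hits

# Hypothetical sentences, written for this example.
SENTENCES = [
    "The Client shall indemnify the Vendor.",
    "Vendor indemnifies Client against all claims.",
    "Indemnification obligations survive termination.",
    "The lease term is five years.",
]

literal_hits = exact_search("indemnify", SENTENCES)
stemmed_hits = stemmed_search("indemnify", SENTENCES)
```

On this toy input the literal search finds only the first sentence, while the normalized search also finds the "indemnifies" and "Indemnification" sentences. Normalization alone does nothing for the false-positive or generic-quantity problems listed above.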

Page 13: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire - Overview

Motivation

Can we use ML and NLP?

eBrevia Solution – Deep Dive

Challenges and Lessons Learned

Future directions

Page 14: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Can we use ML and NLP?

Actually many sub-problems:

Classify entire document type: discover contracts within a heterogeneous corpus

Duplicate detection

Group documents that were based on a common form agreement

Automatically flagging questionable docs for further review

Automatic provision extraction
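Duplicate detection and grouping by common form agreement are both similarity problems. One common way to approach them, shown here only as a sketch and not as the method used in the talk, is to compare sets of word shingles with Jaccard similarity. The function names and the shingle size `k=3` are assumptions made for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(doc_a, doc_b, k=3):
    """Shingle-based similarity between two documents, in [0, 1]."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))

# Hypothetical clauses, written for this example.
clause_ny = "this agreement shall be governed by the laws of new york"
clause_de = "this agreement shall be governed by the laws of delaware"
clause_rent = "the tenant shall pay rent monthly in advance"

same_form_score = similarity(clause_ny, clause_de)
unrelated_score = similarity(clause_ny, clause_rent)
```

A score near 1.0 suggests a duplicate, a high but not perfect score suggests documents derived from the same form, and a score near 0 suggests unrelated text. Any thresholds would have to be tuned on real data.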

Page 15: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Why this is Easy

Precise, formal writing

Extremely structured

Lots of clause reuse

Page 16: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Why this is Hard

Precise, formal writing

Extremely structured

Lots of clause reuse

Obfuscation

High demands on recall

Deep chains of defined term references

Page 17: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Detecting “Evil” Clauses?

Lawyers actually prefer to make the calls on exactly what to include, and how to advise the client

Just find the source material, and let the lawyer decide. Determine relevance, don’t make value judgments

“Learning to detect spyware using end user license agreements”, Lavesson, et al. (2009)

Illustration of Saint Wolfgang and the Devil with the Devil's Contract, by Michael Pacher.

Page 18: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire - Overview

Motivation

Can we use ML and NLP?

eBrevia Solution – Deep Dive

Challenges and Lessons Learned

Future directions

Page 19: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

eBrevia’s Approach

Not all provisions are the same!

Topic modeling
• Find sentences discussing “change of control”
• Find restrictions concerning confidential information

Information Extraction (IE)
• The contract runs from TIMEX to TIMEX.
• The monthly rent will start at $X, and increase by no more than Y% annually.

Rule-based approach
• Find every borrower’s FICO score
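To make the slot-filling idea concrete, here is a minimal sketch that fills two slots from a sentence shaped like the rent example above. The pattern, the slot names, and the sample sentence are assumptions made for illustration. A hand-written regular expression like this is closer to the rule-based end of the spectrum than to a trained extractor.

```python
import re

# Hypothetical pattern for one specific phrasing of a rent clause.
RENT_PATTERN = re.compile(
    r"monthly rent will start at \$(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)"
    r".*?increase by no more than (?P<cap>\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)

def extract_rent_terms(sentence):
    """Return the filled slots as a dict, or None if the sentence does not match."""
    match = RENT_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "starting_rent": float(match.group("amount").replace(",", "")),
        "max_annual_increase_pct": float(match.group("cap")),
    }

# Hypothetical sentence, written for this example.
sample = "The monthly rent will start at $2,500, and increase by no more than 3 % annually."
slots = extract_rent_terms(sample)
```

The pattern only covers this one wording, which is the usual limitation of hand-written rules: good precision on the phrasing they anticipate, no coverage elsewhere.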

Page 20: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Text analysis pipeline

OCR

Sentence Segmentation

NLP Processing (POS, NER, Parsing)

Document Structure tagging

General Candidate detection

Rule Based detection

Topic classifier

Candidate detection for IE

Information Extraction and slot filling
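The stages above can be thought of as functions that each add annotations to a shared list of sentences. The sketch below wires up three toy stages (segmentation, structure tagging, candidate detection) to show the shape of such a pipeline. Every name in it is hypothetical, and the naive regular expressions stand in for real OCR, NLP, and classifier components.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Sentence:
    text: str
    annotations: dict = field(default_factory=dict)

def segment(text):
    """Naive segmentation: split after ., ! or ? followed by whitespace.
    Abbreviations such as "Inc." would break this; it is only a placeholder."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [Sentence(p) for p in parts if p]

def tag_structure(sentences):
    """Mark sentences that look like section headings."""
    for s in sentences:
        s.annotations["is_heading"] = bool(re.match(r"^(Section|Article)\s+\w+", s.text))
    return sentences

def detect_candidates(sentences, cue_words):
    """Flag sentences containing any cue word as candidates for later stages."""
    for s in sentences:
        lowered = s.text.lower()
        s.annotations["candidate"] = any(cue in lowered for cue in cue_words)
    return sentences

def run_pipeline(text, cue_words):
    """Run the toy stages in order and return only the candidate sentences."""
    sentences = segment(text)
    sentences = tag_structure(sentences)
    sentences = detect_candidates(sentences, cue_words)
    return [s for s in sentences if s.annotations["candidate"]]

# Hypothetical contract text, written for this example.
text = "Section 5 Assignment. Tenant may not assign this lease without consent. Rent is due monthly."
candidates = run_pipeline(text, cue_words=["assign"])
```

Keeping each stage as a separate function makes it possible to swap a rule-based detector for a trained classifier without touching the stages around it.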

Page 21: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Classifier Features

Basic textual analysis features
– words
– n-grams
– positional and morphological features
– named entities

Syntactic features
– parts of speech
– parse tree and heads

Structural features
– first-level classifier pass for determining document structure
– especially important on scanned documents where these features aren’t readily available

Examples from the slide:
– Parts of speech: Client/N shall/V indemnify/V
– Word and head: indemnify
– Document structure: “Section III: Miscellaneous”, “1. Lorem ipsum dolor”, “a. sit amet, consectetur”
– Named entities: The/O buyer/O Acme/ORG Inc./ORG
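As a rough illustration of the basic textual and positional features listed above, the sketch below turns a tokenized sentence into a sparse feature dictionary. The feature names and the specific choices (bigrams only, fraction of all-caps tokens, relative position in the document) are assumptions made for this example. Syntactic and named-entity features would need a tagger or parser and are left out.

```python
def sentence_features(tokens, position, total_sentences):
    """Build a sparse feature dict for one sentence.

    tokens: list of non-empty word strings
    position: zero-based index of the sentence in the document
    total_sentences: number of sentences in the document
    """
    feats = {}
    lowered = [t.lower() for t in tokens]

    # Word (unigram) features
    for tok in lowered:
        feats[f"word={tok}"] = 1

    # Bigram features
    for first, second in zip(lowered, lowered[1:]):
        feats[f"bigram={first}_{second}"] = 1

    # Positional feature: 0.0 at the start of the document, 1.0 at the end
    feats["relative_position"] = position / max(total_sentences - 1, 1)

    # Simple morphological features
    feats["frac_all_caps"] = sum(t.isupper() for t in tokens) / max(len(tokens), 1)
    feats["starts_with_number"] = int(bool(tokens) and tokens[0][0].isdigit())
    return feats

# Hypothetical tokenized sentence, the third of five in its document.
features = sentence_features(["Client", "shall", "indemnify", "VENDOR"], position=2, total_sentences=5)
```

A dictionary like this can be fed to most linear classifiers after vectorization.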

Page 22: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Hunting for Training Data

All your customers’ data is confidential
– Redacted contracts
– Mine SEC filings

Expense of lawyer-labeled training data
– Bootstrapping
– Co-training with different feature sets
– Active learning
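Of the strategies listed, active learning is the easiest to sketch in a few lines. The version below is plain uncertainty sampling: ask the lawyer to label the examples the current model is least sure about. The function name, the input format, and the scores are assumptions for illustration; the talk does not say which selection strategy was used.

```python
def select_for_labeling(scored_examples, budget):
    """Pick the examples whose predicted probability is closest to 0.5.

    scored_examples: list of (example_id, probability_relevant) pairs
    budget: how many examples the annotator has time to label
    """
    ranked = sorted(scored_examples, key=lambda pair: abs(pair[1] - 0.5))
    return [example_id for example_id, _ in ranked[:budget]]

# Hypothetical model scores for four unlabeled sentences.
scores = [("a", 0.95), ("b", 0.52), ("c", 0.10), ("d", 0.40)]
to_label = select_for_labeling(scores, budget=2)
```

With lawyer time as the expensive resource, spending it on borderline examples usually improves a model faster than labeling examples it already classifies confidently.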

Page 23: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Hacks and Special Cases

Very useful, but boring

Formatting fixes specific to legal documents
– ALL CAPS
– Handling of amendments
– Handwritten signature blocks

Hand-crafted rules are very good for high-precision heuristics: customers expect the software not to miss “easy” provisions.

Page 24: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire - Overview

Motivation

Can we use ML and NLP?

eBrevia Solution – Deep Dive

Challenges and Lessons Learned

Future directions

Page 25: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

The Audacity of Keywords

Seemingly reliable keywords aren’t

Likelihood that a candidate phrase is relevant or irrelevant:

Phrase                      Relevant    Irrelevant
“Change [of|in] Control”    48.4%       51.6%
“13(d) and 14(d)”           98.7%       1.3%

A simple keyword-based search with an obvious keyword wouldn’t even get us to 50% precision! Conversely, a human would never have discovered this reliable trigram heuristic.
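Figures like these can be computed directly from labeled candidates: for each phrase, count how often a sentence containing it was marked relevant. The sketch below shows that calculation on a tiny hypothetical sample. The data is invented for the example and is not the data behind the percentages above.

```python
from collections import Counter

def phrase_precision(labeled_hits):
    """Estimate, per phrase, the fraction of hits that were labeled relevant.

    labeled_hits: iterable of (phrase, is_relevant) pairs, one per sentence
    in which the phrase was found.
    """
    total = Counter()
    relevant = Counter()
    for phrase, is_relevant in labeled_hits:
        total[phrase] += 1
        if is_relevant:
            relevant[phrase] += 1
    return {phrase: relevant[phrase] / total[phrase] for phrase in total}

# Hypothetical labeled hits, invented for this example.
hits = [
    ("change of control", True),
    ("change of control", False),
    ("13(d) and 14(d)", True),
]
precision = phrase_precision(hits)
```

Running this over every n-gram in a labeled corpus, rather than over a hand-picked keyword list, is one way a non-obvious phrase could surface as a strong indicator.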

Page 26: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

The Tyranny of Paper

Lawyers still have a lot of paper – over 50% of the documents uploaded to our system are scans.

OCR on poor-quality scans works poorly for keyword searching, but decently with ML given properly constructed features.
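The talk does not say which features make ML robust to OCR noise. One plausible candidate, offered here purely as an assumption, is character n-grams: a single misread character breaks an exact keyword match but leaves most of a word's character trigrams intact.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# "indemnlfy" simulates an OCR engine misreading "i" as "l" (hypothetical example).
clean, noisy = "indemnify", "indemnlfy"
exact_match = clean == noisy
overlap = ngram_overlap(clean, noisy)
unrelated = ngram_overlap(clean, "termination")
```

Here the exact comparison fails outright, while half of the trigrams still agree, so a classifier using such features still receives a usable signal from the damaged word.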

Page 27: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Welcoming our Robot Lawyer Overlords

“[eBrevia’s software] cuts down significantly on time by performing 50-60% of the work up front and then you work from there.”

– NY law firm partner

“Your product is a great fit for our firm’s approach to practicing law.”

– Partner, national law firm

Page 28: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

User Interface Notes

Page 29: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

User Interface Notes

Highlight in original, formatted document

Cross-referencing, editing, and corrections

Page 30: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

User Interface Notes

Additional critical features

Quick Correction

Level of confidence indications (similar to Google Voice transcription)

Good generic text search features to make human review easy

Page 31: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire - Overview

Motivation

Can we use ML and NLP?

eBrevia Solution – Deep Dive

Challenges and Lessons Learned

Future directions

Page 32: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Current Research and Future Directions

Coreference resolution: intra- and inter-document. Useful for doc references and entity references.

Machine learning for document cross-referencing and definition resolution.

Automatic summarization of longer provisions to provide quick overviews.

Understanding the lineage of a document – where its various pieces came from, and how they were changed.
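As a small illustration of the lineage idea, the sketch below uses Python's standard `difflib` to list word-level differences between two versions of a clause. This only shows how a single clause changed between two known versions; finding where a clause came from across a corpus is the harder research problem the slide points at. The function name and the sample clauses are assumptions made for this example.

```python
import difflib

def clause_changes(original, revised):
    """Return (operation, old_words, new_words) for each non-matching span."""
    old_tokens, new_tokens = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return changes

# Hypothetical clause versions, written for this example.
form_clause = "Tenant shall pay rent on the first day of each month"
negotiated_clause = "Tenant shall pay rent on the fifth day of each month"
changes = clause_changes(form_clause, negotiated_clause)
```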

Page 33: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Feedback Learning from Lawyers

Some lawyers are just bad

Noise is NOT random
– They fall for the same “trap”
– They’re often bad in the same way

So we can’t rely on noise-tolerant learning algorithms, which generally assume random label noise, to deal with this.

Consensus models, model user reputation/ability
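One simple form a reputation-aware consensus model could take is sketched below: estimate each reviewer's reliability on items with known answers, then weight their votes by it. This is an assumed illustration, not eBrevia's model. The function names, the default weight, and the data are hypothetical.

```python
def estimate_reliability(answers, gold):
    """Fraction of gold-labeled items each annotator got right.

    answers: {annotator: {item_id: bool}}
    gold: {item_id: bool} for items with a trusted label
    """
    reliability = {}
    for annotator, labels in answers.items():
        scored = [item for item in labels if item in gold]
        if scored:
            correct = sum(labels[item] == gold[item] for item in scored)
            reliability[annotator] = correct / len(scored)
    return reliability

def weighted_vote(labels, reliability, default=0.5):
    """Combine boolean labels for one item, weighting each vote by reliability."""
    score = 0.0
    for annotator, label in labels.items():
        weight = reliability.get(annotator, default)
        score += weight if label else -weight
    return score > 0

# Hypothetical reliabilities and votes on a single clause.
reliability = {"a": 0.9, "b": 0.3, "c": 0.4}
votes = {"a": True, "b": False, "c": False}
consensus = weighted_vote(votes, reliability)
```

In this toy case a simple majority would say "not relevant", while the weighted vote sides with the one reviewer who has a strong track record. Note that weighting does not by itself solve the correlated-error problem described above when the reliable reviewers also fall for the same trap.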

Page 34: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Current Research and Future Directions

Other upcoming applications for eBrevia’s technology:

– Contract management
– Document drafting
– Lease abstraction
– Financial/Compliance
– Consumer applications

Page 35: Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

Thank You – Contact Info

JACOB MUNDT, CTO

[email protected]

www.ebrevia.com

(203) 870-3000