[email protected] | (203) 870-3000 Proprietary & Confidential
I, Robot, Esquire
Information Extraction and Summarization in Legal Documents
Jacob Mundt – MLConf ATL
Who we are
One of four national winners in Startup America DEMO Competition
One of CIO.com’s top ten enterprise products at DEMO Fall 2012
Most Promising Software Product of the Year award from Connecticut Technology Council
Completed Connecticut Innovations’ TechStart Fund Program
Commercializing machine learning technology developed at Columbia University to make legal document review more efficient, accurate, and cost-effective.
Management Team
Ned Gannon, CEO
Adam Nguyen, COO
Jake Mundt, CTO
Large law firm experience; tech startup experience; sales & business development experience
Harvard Law
Led R&D team at tech company extracting data in medical industry
Columbia Masters; NLP researcher
Founder of Ivy Link (20+ staff); Chief of Staff of 350-person real estate private equity firm
Harvard Law; law firm & in-house experience
The Future of Law
“In contrast, in looking 25 years ahead from now, I argue that it would be absurd to expect lawyers and courts to carry on operating as they do now.”
– Richard Susskind, Tomorrow’s Lawyers
“Well, if droids could think, there'd be none of us here, would there?”
– Obi-Wan Kenobi
I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
Corporate Mergers and Due Diligence
Business due diligence
Legal due diligence
Closing
Legal Due Diligence Process
Extract → Summarize → Analyze → Advise
Junior Attorneys → Senior Attorneys
Teams of junior attorneys billed out at $300-$500/hour poring over hundreds of contracts in virtual data rooms to summarize their content and identify red flags.
Legal Due Diligence Summary
Here come the spreadsheets: summarize ALL the contracts
– Leases
– Executive employment agreements
– Supplier agreements
– Loan/credit agreements
Extract key data points
Also extract any clauses that discuss particular provisions
The Stone Age
On site data room with reams of documents, organized by seller
Buyer’s agents travel to evaluate the target, under constant supervision
State of the Art – Virtual Data Rooms
Digitized, but not machine readable
Some simple OCR and searching capability
Commercial systems like IntraLinks have advanced capabilities, but mostly focused on security and auditability.
The Future is Here
Keyword search misses stems, synonyms, and plural forms
False positives: some common words also have special meanings in context
Impossible to find dates, parties, dollar amounts, or any other generic quantities
We can do better
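A minimal sketch of the gap described on this slide, assuming a made-up sentence and illustrative patterns (none of these come from eBrevia's system). A literal keyword search finds nothing, while a slightly more tolerant pattern catches the case and plural variants, and separate patterns pick up generic quantities such as dollar amounts and dates:

```python
import re

# Made-up example text, for illustration only.
text = ("Upon any Change in Control, the Lessee shall pay $12,500.00 "
        "no later than January 15, 2014. Changes of control are defined in Section 2.")

# Literal keyword search: only the exact string counts.
literal_hits = text.count("change of control")

# Illustrative pattern tolerating case, "of"/"in", and the plural form.
variant_re = re.compile(r"\bchanges?\s+(?:of|in)\s+control\b", re.IGNORECASE)
variant_hits = variant_re.findall(text)

# Generic quantities that no single keyword can express.
money_re = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
date_re = re.compile(rf"\b(?:{months})\s+\d{{1,2}},\s+\d{{4}}\b")

print(literal_hits)            # literal search finds no match
print(variant_hits)            # both variants are found
print(money_re.findall(text))
print(date_re.findall(text))
```

Patterns like these still do not handle synonyms or context-dependent meanings, which is where the learned approaches in the rest of the deck come in.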
I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
Can we use ML and NLP?
Actually many sub-problems:
Classify each document by type: discover contracts within a heterogeneous corpus
Duplicate detection
Group documents that were based on a common form agreement
Automatically flag questionable documents for further review
Automatic provision extraction
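Two of the sub-problems above, duplicate detection and grouping documents derived from a common form agreement, can be approached with word shingles and Jaccard similarity. The sketch below is one standard technique, not necessarily what eBrevia uses; the function names and sample sentences are hypothetical:

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up clauses: two variants of one form clause, plus an unrelated clause.
form = "the tenant shall pay rent on the first day of each month"
variant = "the tenant shall pay rent on the fifth day of each month"
other = "employee agrees not to disclose confidential information to any third party"

sim_variant = jaccard(shingles(form), shingles(variant))
sim_other = jaccard(shingles(form), shingles(other))
print(round(sim_variant, 3), sim_other)
```

A one-word edit leaves most shingles shared, so documents built from the same form score far above unrelated ones. A threshold on this score (an assumption to tune per corpus) separates exact duplicates, form variants, and unrelated documents.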
Why this is Easy
Precise, formal writing
Extremely structured
Lots of clause reuse
Why this is Hard
Precise, formal writing
Extremely structured
Lots of clause reuse
Obfuscation
High demands on recall
Deep chains of defined term references
Detecting “Evil” Clauses?
Lawyers actually prefer to make the calls on exactly what to include, and how to advise the client
Just find the source material, and let the lawyer decide. Determine relevance, don’t make value judgments
“Learning to detect spyware using end user license agreements”, Lavesson, et al. (2009)
Illustration of Saint Wolfgang and the Devil with the Devil's Contract, by Michael Pacher.
I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
eBrevia’s Approach
Not all provisions are the same!
• Find sentences discussing “change of control”
• Find restrictions concerning confidential information
Topic modeling
• The contract runs from TIMEX to TIMEX.
• The monthly rent will start at $X, and increase by no more than Y% annually.
Information Extraction (IE)
• Find every borrower’s FICO score
Rule-based approach
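To make the last two categories concrete, here is a small sketch of slot filling for the rent template and a rule for FICO scores. The regular expressions, function names, and sample sentences are assumptions for illustration; a production system would need many more surface forms:

```python
import re

# Slot-filling template for the rent sentence: $X and Y% are the slots.
RENT_RE = re.compile(
    r"monthly rent will start at \$(?P<rent>\d{1,3}(?:,\d{3})*(?:\.\d{2})?),? "
    r"and increase by no more than (?P<cap>\d+(?:\.\d+)?)\s*% annually",
    re.IGNORECASE,
)

# Rule for FICO scores; [3-8]\d{2} only approximates the 300-850 range.
FICO_RE = re.compile(
    r"\bFICO\s+score\b\W*(?:of|is)?\W*(?P<score>[3-8]\d{2})\b", re.IGNORECASE
)

def extract_rent(sentence):
    """Fill the rent and annual-cap slots, or return None if no match."""
    m = RENT_RE.search(sentence)
    if m is None:
        return None
    return {
        "rent": float(m.group("rent").replace(",", "")),
        "max_increase_pct": float(m.group("cap")),
    }

def extract_fico(text):
    """Return every FICO score mentioned in text, in order."""
    return [int(m.group("score")) for m in FICO_RE.finditer(text)]

rent = extract_rent(
    "The monthly rent will start at $2,500, and increase by no more than 3% annually."
)
scores = extract_fico(
    "Borrower A has a FICO score of 712; Borrower B's FICO score: 655."
)
print(rent, scores)
```

Rules like these are precise but brittle, which is why they suit narrow, regular targets such as a FICO score and not open-ended topics such as "change of control".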
Text analysis pipeline
OCR
Sentence Segmentation
NLP Processing (POS, NER, Parsing)
Document Structure tagging
General Candidate detection
Rule Based detection
Topic classifier
Candidate detection for IE
Information Extraction and slot filling
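The pipeline can be pictured as a chain of stages that each annotate a shared list of sentences. The skeleton below is a sketch under that assumption; the stage names echo the slide, but the `Sentence` class, the regexes, and the naive segmenter are hypothetical stand-ins for the real OCR, NLP, and classifier components:

```python
import re
from dataclasses import dataclass, field

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b")
INDEMNIFY_RE = re.compile(r"\bindemnif(?:y|ies|ied|ication)\b", re.IGNORECASE)

@dataclass
class Sentence:
    text: str
    labels: set = field(default_factory=set)

def segment(text):
    # Naive splitter: break after "." or ";" when a capital letter follows.
    # Abbreviations such as "Inc." would need real handling.
    parts = re.split(r"(?<=[.;])\s+(?=[A-Z])", text.strip())
    return [Sentence(p) for p in parts if p]

def rule_based_detection(sentences):
    # Stand-in for the rule-based stage: one hand-written rule.
    for s in sentences:
        if INDEMNIFY_RE.search(s.text):
            s.labels.add("indemnification")
    return sentences

def ie_candidate_detection(sentences):
    # Stand-in for IE candidate detection: sentences that contain a date.
    for s in sentences:
        if DATE_RE.search(s.text):
            s.labels.add("ie_candidate")
    return sentences

def run_pipeline(text, stages):
    sentences = segment(text)
    for stage in stages:
        sentences = stage(sentences)
    return sentences

doc = ("Client shall indemnify Vendor against all claims. "
       "This Agreement runs from January 1, 2014 to December 31, 2015.")
result = run_pipeline(doc, [rule_based_detection, ie_candidate_detection])
for s in result:
    print(sorted(s.labels), s.text)
```

Keeping each stage as a function over the same sentence list makes it easy to insert or reorder stages, for example adding a topic classifier between candidate detection and extraction.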
Classifier Features
Basic textual analysis features
– Words
– N-grams
– Positional and morphological features
– Named entities
Syntactic features
– Parts of speech
– Parse tree and heads
Structural features
– First-level classifier pass for determining document structure
– Especially important on scanned documents where these features aren’t readily available
Examples:
– Parts of speech: Client/N shall/V indemnify/V
– Parse tree head: indemnify
– Document structure: “Section III: Miscellaneous”, “1. Lorem ipsum dolor”, “a. sit amet, consectetur”
– Named entities: The/O buyer/O Acme/ORG Inc./ORG
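As a rough illustration of the basic textual features, the sketch below turns one tokenized sentence into a sparse feature dictionary. The feature names and the specific positional and morphological choices are assumptions, not eBrevia's actual feature set; syntactic and structural features would come from a tagger, a parser, and the first-level structure classifier:

```python
def sentence_features(tokens, position, total_sentences):
    """Build a sparse feature dict for one sentence.

    tokens: list of word strings
    position: 0-based index of the sentence in the document
    total_sentences: number of sentences in the document
    """
    feats = {}
    lowered = [t.lower() for t in tokens]

    # Words and bigrams (n-grams with n = 2).
    for w in lowered:
        feats["word=" + w] = 1
    for a, b in zip(lowered, lowered[1:]):
        feats["bigram=" + a + "_" + b] = 1

    # Simple morphological cue: three-letter suffix of each word.
    for w in lowered:
        if len(w) > 3:
            feats["suffix3=" + w[-3:]] = 1

    # Positional cue: where in the document the sentence sits (0.0 to 1.0).
    feats["relative_position"] = position / max(total_sentences - 1, 1)

    # Formatting cue: share of tokens written in ALL CAPS.
    feats["all_caps_ratio"] = sum(t.isupper() for t in tokens) / max(len(tokens), 1)
    return feats

feats = sentence_features(["Client", "shall", "indemnify"], position=0, total_sentences=10)
print(sorted(feats))
```

A dictionary of this shape can be fed to most linear classifiers after vectorization.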
Hunting for Training Data
All your customers’ data is confidential
– Redacted contracts
– Mine SEC filings
Expense of lawyer-labeled training data
– Bootstrapping
– Co-training with different feature sets
– Active learning
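Active learning reduces labeling cost by sending lawyers only the examples the current model is least sure about. The sketch below shows uncertainty sampling, one common active learning strategy; the slides do not say which strategy eBrevia uses, and the function name and toy probabilities are hypothetical:

```python
def most_uncertain(probabilities, k):
    """Pick the k examples whose predicted P(relevant) is closest to 0.5.

    probabilities: dict mapping example id -> model probability of "relevant"
    """
    ranked = sorted(probabilities, key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Toy model outputs for four unlabeled sentences.
probs = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45}
to_label = most_uncertain(probs, k=2)
print(to_label)
```

Sentences "a" and "c" are already confidently classified, so a lawyer's time is better spent on "b" and "d". After they are labeled, the model is retrained and the selection repeats.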
Hacks and Special Cases
Very useful, but boring
Formatting fixes specific to legal documents
– ALL CAPS
– Handling of amendments
– Handwritten signature blocks
Hand-crafted rules are very good for high-precision heuristics: customers expect the software not to miss “easy” provisions.
I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
The Audacity of Keywords
Seemingly reliable keywords aren’t
Phrase                      Likelihood relevant    Likelihood irrelevant
“Change [of|in] Control”    48.4%                  51.6%
“13(d) and 14(d)”           98.7%                  1.3%
A simple keyword-based search with an obvious keyword wouldn’t even get us to 50% precision! Conversely, a human would never have discovered the reliable “13(d) and 14(d)” trigram heuristic.
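Figures like those above can be estimated from labeled candidate sentences: for each phrase, count how often a sentence containing it was marked relevant. A minimal sketch, with a hypothetical function name and toy data that is not the data behind the slide's numbers:

```python
def phrase_precision(labeled_sentences, phrase):
    """Share of sentences containing phrase that are labeled relevant.

    labeled_sentences: list of (text, is_relevant) pairs
    Returns None when the phrase never occurs.
    """
    hits = [relevant for text, relevant in labeled_sentences
            if phrase in text.lower()]
    if not hits:
        return None
    return sum(hits) / len(hits)

# Toy labeled data, for illustration only.
data = [
    ("a change of control occurs upon a merger", True),
    ("no change of control payment is due", False),
    ("as used in sections 13(d) and 14(d) of the exchange act", True),
    ("rent is due monthly", False),
]
print(phrase_precision(data, "change of control"))
print(phrase_precision(data, "13(d) and 14(d)"))
```

Running this over every n-gram in the training set is one way unexpected but reliable indicators could surface without a human proposing them.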
The Tyranny of Paper
Lawyers still have a lot of paper – over 50% of the documents uploaded to our system are scans.
OCR output from poor-quality scans works poorly for keyword searching but decently with ML, given properly constructed features.
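The slides do not say which features make ML robust to OCR noise. One plausible candidate, shown here purely as an assumption, is character n-grams: a word that OCR has mangled no longer matches a keyword exactly, but it still shares most of its character trigrams with the clean form.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with ^ and $ marking the word boundaries."""
    padded = "^" + word.lower() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

clean = "indemnify"
ocr = "indernnify"  # a typical OCR confusion: "m" read as "rn"

print(clean == ocr)                         # exact keyword match fails
print(round(ngram_overlap(clean, ocr), 2))  # a sizeable overlap remains
```

A classifier that sees trigram features still receives most of the signal from the damaged word, whereas an exact keyword search receives none.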
Welcoming our Robot Lawyer Overlords
“[eBrevia’s software] cuts down significantly on time by performing 50-60% of the work up front and then you work from there.”
– NY law firm partner
“Your product is a great fit for our firm’s approach to practicing law.”
– Partner, national law firm
User Interface Notes
Highlight in original, formatted document
Cross-referencing, editing, and corrections
Additional critical features
Quick Correction
Level of confidence indications (similar to Google voice transcription)
Good generic text search features to make human review easy
I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
Current Research and Future Directions
Coreference resolution: intra- and inter-document. Useful for document references and entity references.
Machine learning for document cross-referencing and definition resolution.
Automatic summarization of longer provisions to provide quick overviews.
Understanding the lineage of a document – where its various pieces came from, and how they were changed.
Feedback Learning from Lawyers
Some lawyers are just bad
Noise is NOT random
– They fall for the same “trap”
– They’re often bad in the same way
So we can’t rely on noise-tolerant learning algorithms, which typically assume random label noise, to deal with this.
Consensus models, model user reputation/ability
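One simple form of a consensus model weights each reviewer's vote by an estimate of that reviewer's reliability, so that several reviewers making the same mistake do not automatically outvote a more reliable one. The sketch below is an assumption about how such a model could look; the names and weights are hypothetical:

```python
def weighted_vote(votes, reliability, default=0.5):
    """Combine relevance votes, weighting each annotator by reliability.

    votes: dict annotator -> bool (True means "relevant")
    reliability: dict annotator -> weight, e.g. past agreement with gold labels
    Returns True when the weighted evidence favors "relevant".
    """
    score = sum(
        (1 if vote else -1) * reliability.get(annotator, default)
        for annotator, vote in votes.items()
    )
    return score > 0

votes = {"senior": True, "junior1": False, "junior2": False}
reliability = {"senior": 0.95, "junior1": 0.4, "junior2": 0.4}

print(weighted_vote(votes, reliability))  # the reliable reviewer prevails
print(weighted_vote(votes, {}))           # equal weights: plain majority vote
```

The reliability weights themselves would have to be estimated, for example from agreement with a trusted gold set, since the slide's point is that reviewer errors are correlated and cannot be averaged away.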
Current Research and Future Directions
Other upcoming applications for eBrevia’s technology:
Contract management
Document drafting
Lease abstraction
Financial/Compliance
Consumer applications
Thank You – Contact Info
JACOB MUNDT, CTO
www.ebrevia.com
(203) 870-3000