NLP: A Primer and Research Portfolio

Empowering the future of legal decision- making LexPredict 2012-2016 @lexpredict www.lexpredict.com [email protected] NLP: A Primer and Portfolio prepared for: MSU Law Review Symposium prepared on: Mar 2017 Michael J Bommarito II IIT Chicago-Kent / MSU / Michigan / Stanford

Transcript of NLP: A Primer and Research Portfolio

Page 1: NLP: A Primer and Research Portfolio

Empowering the future of legal decision-making

© LexPredict 2012-2016

@lexpredict

[email protected]

NLP: A Primer and Portfolio

prepared for: MSU Law Review Symposium prepared on: Mar 2017

Michael J Bommarito II

IIT Chicago-Kent / MSU / Michigan / Stanford

Page 2: NLP: A Primer and Research Portfolio

A look at our presentation agenda

Presentation Section

What is NLP?

How does ML fit?

Page 3: NLP: A Primer and Research Portfolio


Sources

Example Software

Example Research

Questions

End

Page 4: NLP: A Primer and Research Portfolio

What is NLP?

A Brief Primer

Page 5: NLP: A Primer and Research Portfolio


Let’s start with some text.

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

(Bloomberg article on Sandy)

What is NLP?

Page 6: NLP: A Primer and Research Portfolio


Real Data

When we work with real data, we often need to pre-process and clean data before we can segment and tokenize.

Consider, for example:
• Hand-written documents: OCR
• Digital formats: PDF, Word, WordPerfect, HTML
• Typesetting remnants, e.g., page breaks, line-break hyphens

Pre-processing is very important! The quality of all subsequent work depends on it.
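
A minimal sketch of one cleaning step, assuming the raw text came out of a PDF or OCR tool with form-feed page breaks and words hyphenated across line breaks. The function name is hypothetical, and the hyphen rule is naive: it would also wrongly join a real compound such as "life-threatening" if it happened to break at the hyphen.

```python
import re

def clean_extracted_text(raw):
    # Treat a form feed (a common page-break character) as a line break.
    text = raw.replace("\f", "\n")
    # Rejoin words hyphenated across a line break: "evacu-\nation" -> "evacuation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs left behind by typesetting.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

cleaned = clean_extracted_text("evacu-\nation of the  shore\f")
```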

What is NLP?

Page 7: NLP: A Primer and Research Portfolio


What kind of questions can we ask?

Basic
• What is the structure of the text?
  • Paragraphs
  • Sentences
  • Tokens/words
• What are the “words” that appear in this text?
  • Nouns
    • Subjects
    • Direct objects
    • …
  • Verbs

Advanced
• What are the concepts that appear in this text?
• How does this text compare to other text?

What is NLP?

Page 8: NLP: A Primer and Research Portfolio


Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

Segment Types
• Paragraphs
• Sentences
• Tokens

What is NLP?

Page 9: NLP: A Primer and Research Portfolio


Segmentation and Tokenization

But how does it work?

Paragraphs
• Two consecutive line breaks
• A hard line break followed by an indent

Sentences
• Period, except abbreviation, ellipsis within quotation, etc.

Tokens and Words
• Whitespace
• Punctuation

Remember what real-world text looks like – think text and email.
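
The rules above can be sketched with a few regular expressions. This is an illustration under simplifying assumptions, not a production segmenter: the function names are hypothetical, only the "two consecutive line breaks" paragraph rule is implemented, and the sentence rule ignores the exceptions listed on the slide (abbreviations, ellipses, quotations).

```python
import re

def split_paragraphs(text):
    # Paragraph rule: two consecutive line breaks.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Naive sentence rule: ., ! or ? followed by whitespace and a capital letter.
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph) if s]

def tokenize(sentence):
    # Split on whitespace, then strip punctuation from the token edges only,
    # so "3,200" and "life-threatening" stay whole.
    stripped = (t.strip(".,;:!?\"'()") for t in sentence.split())
    return [t for t in stripped if t]

text = ("Sandy grounded 3,200 flights. New York suspended subway service.\n\n"
        "The system may knock out power.")
paragraphs = split_paragraphs(text)
sentences = [s for p in paragraphs for s in split_sentences(p)]
tokens = [t for s in sentences for t in tokenize(s)]
```

Text messages and email break these rules quickly (missing capitals, no blank lines), which is why real tools use trained models or long exception lists.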

What is NLP?

Page 10: NLP: A Primer and Research Portfolio

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

Paragraphs: 2
Sentences: 2
Words: 561.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

What is NLP?

Page 11: NLP: A Primer and Research Portfolio


What kind of questions can we ask?

We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

• Does the phrase “quote stuffing” occur in the text?
• How many times does “Sandy” occur?
• How often does “outage” occur after “power”?
• What percentage of tokens are numbers?
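
Each of these questions is a few lines of code once the tokens are an ordered list. The sketch below uses only the ten tokens shown on the slide; the helper names are hypothetical, and "number" is assumed to mean digits with optional commas and one decimal point.

```python
tokens = ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled',
          'for', 'today', 'and', 'tomorrow']

def contains_phrase(tokens, phrase):
    # Slide a window of len(phrase) over the token list.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def count_following(tokens, first, second):
    # Count how often `second` appears immediately after `first`.
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def is_number(token):
    # Assumption: numbers look like "65", "3,200", or "1.5".
    return token.replace(",", "").replace(".", "", 1).isdigit()

has_quote_stuffing = contains_phrase(tokens, ["quote", "stuffing"])
sandy_count = tokens.count("Sandy")
outage_after_power = count_following(tokens, "power", "outage")
percent_numbers = 100 * sum(is_number(t) for t in tokens) / len(tokens)
```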

What is NLP?

Page 12: NLP: A Primer and Research Portfolio


An Aside on Storage

Data: The word ‘the’ ten times and the word ‘a’ ten times.

Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]

Representation 2 – Term Frequency: [(‘the’, 10), (‘a’, 10)]

What is NLP?

Page 13: NLP: A Primer and Research Portfolio


An Aside on Storage

Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]

Representation 2 - Frequency Map: [(‘the’, 10), (‘a’, 10)]

Tradeoffs
• Total space
• Ease of answering certain questions
• Information about context

Not all software makes the same choice!
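
A short sketch of the tradeoff, using Python's standard `collections.Counter` as the frequency map. The variable names are illustrative.

```python
from collections import Counter

# Representation 1: ordered list (20 entries, keeps word order).
ordered = ['the', 'a'] * 10

# Representation 2: frequency map (2 entries, word order is lost).
frequency = Counter(ordered)

# Counting is a direct lookup in the frequency map.
the_count = frequency['the']

# Context questions ("what follows 'the'?") need the ordered list.
follows_the = Counter(b for a, b in zip(ordered, ordered[1:]) if a == 'the')
```

The frequency map is smaller and answers "how many?" directly, but it cannot be converted back into the ordered list.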

What is NLP?

Page 14: NLP: A Primer and Research Portfolio


Stopwording, Stemming, Parsing, and Tagging

Stopwording
• Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.

Stemming
• Matching declined nouns like dog/dogs or child/children.
• Matching conjugated verbs like run/ran.

Parsing
• Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).

Tagging
• Identifying the part of speech of each token in a sentence.

What is NLP?

Page 15: NLP: A Primer and Research Portfolio


Stopwording

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.

System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
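
Stopwording is a filter against a word list. The sketch below uses a tiny, hypothetical stopword set chosen for this example; real tools ship lists with hundreds of entries, and the list used to produce the slide's output is not known.

```python
# Assumption: an illustrative stopword list, not the one behind the slide.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so "The" and "the" are both removed.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
filtered = remove_stopwords(tokens)
```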

What is NLP?

Page 16: NLP: A Primer and Research Portfolio


Stopwording + Stemming

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.

System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
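
To show the idea behind stemming, here is a deliberately crude suffix stripper. It is not the stemmer that produced the output above (which looks like the result of a fuller rule-based stemmer); the suffix list and the minimum stem length are assumptions made for illustration, so some results differ from the slide, e.g. it keeps "bus" where the slide shows "bu".

```python
# Assumption: a three-rule suffix list, far smaller than any real stemmer's.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word, min_stem=3):
    # Strip the first matching suffix, keeping at least `min_stem` characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["grounded", "flights", "scheduled", "inflicting", "experts", "bus"]
stems = [crude_stem(w) for w in words]
```

Irregular pairs such as child/children or run/ran cannot be matched by suffix rules alone; they need a dictionary lookup.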

What is NLP?

Page 17: NLP: A Primer and Research Portfolio

What is ML?

And how does it fit with NLP?

Page 18: NLP: A Primer and Research Portfolio


Definition: Automated classification and prediction on data.

Examples:
• Product recommenders, à la Amazon
• Computer vision – is it a cat?
• Sentiment analysis
• Topic classification
• Document clustering

At least two stages to a classification problem:
• Training
• Classification

What is Machine Learning?

Page 19: NLP: A Primer and Research Portfolio


Learning

Machine learning requires “learning” or “training.”

There are two types of training:
• Supervised
• Unsupervised

The goal of training is to determine a mapping from input features to a set of target classes.

What is Machine Learning?

Page 20: NLP: A Primer and Research Portfolio


Learning

Imagine a student given a small list of organisms and descriptions. The student is tasked with assigning the organisms to groups based on these descriptions. Where do the groups come from?
• Supervised: The teacher provides the answers while learning.
• Unsupervised: The teacher provides nothing while learning.

In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.

What is Machine Learning?

Page 21: NLP: A Primer and Research Portfolio


Learning

What if the teacher gave the student some of the answers?

This is semi-supervised learning.
• Supervised: The teacher provides the answers while learning.
• Semi-supervised: The teacher provides some answers while learning.
• Unsupervised: The teacher provides nothing while learning.

What is Machine Learning?

Page 22: NLP: A Primer and Research Portfolio


Classification

The student has now learned to map from an organism’s description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
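
The two stages can be sketched with a nearest-centroid classifier, one of the simplest supervised methods. It is swapped in here purely for illustration; the organisms, features, and function names are hypothetical. Training averages the feature vectors of each labeled group, and classification picks the group whose average is closest.

```python
from collections import defaultdict
from math import dist

def train(examples):
    # Training: the "teacher" supplies (features, label) pairs.
    # The learned model is one average feature vector (centroid) per label.
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in grouped.items()}

def classify(model, features):
    # Classification: assign the label whose centroid is nearest.
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical descriptions: (number of legs, has wings as 0 or 1).
training = [((6, 1), "insect"), ((6, 0), "insect"),
            ((8, 0), "arachnid"), ((8, 0), "arachnid")]
model = train(training)
prediction = classify(model, (8, 0))
```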

What is Machine Learning?

Page 23: NLP: A Primer and Research Portfolio


Replace the student with an algorithm and we have machine learning.

Sentiment Analysis Example
• Organisms: Restaurant reviews
• Descriptions:
  • Number of positive phrases
  • Number of negative phrases
  • Number of times visited
  • Number of restaurants reviewed
  • Recency of review
• Target: 1-5 stars for restaurant sentiment
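
Before any algorithm runs, each review has to be turned into these descriptions. A minimal sketch, assuming hypothetical phrase lists and that the visit and review counts come from account metadata rather than the text itself:

```python
# Assumption: tiny illustrative phrase lists; real systems use large lexicons.
POSITIVE = ("great", "delicious", "friendly")
NEGATIVE = ("slow", "cold", "rude")

def extract_features(review_text, times_visited, restaurants_reviewed, days_since_review):
    text = review_text.lower()
    return {
        # Substring counts are crude: "cold" would also match inside "scolded".
        "positive_phrases": sum(text.count(p) for p in POSITIVE),
        "negative_phrases": sum(text.count(p) for p in NEGATIVE),
        "times_visited": times_visited,
        "restaurants_reviewed": restaurants_reviewed,
        "recency_days": days_since_review,
    }

features = extract_features("Great food, friendly staff, but slow service.", 3, 12, 7)
```

A supervised algorithm would then learn a mapping from dictionaries like this one to the 1-5 star target.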

What is Machine Learning?

Page 24: NLP: A Primer and Research Portfolio


Some Machine Learning Algorithms

Supervised
• Statistical models
  • Bayesian, e.g., Naïve Bayes Classification
  • Frequentist, e.g., Ordinary Least Squares
• Neural Networks (NN)
• Support Vector Machines (SVM)
• Random Forests (RF)
• Genetic Algorithms (GA)

Semi/unsupervised
• Neural Networks (NN)
• Clustering
  • K-means
  • Hierarchical
  • Radial Basis (RBF)
  • Graph

What is Machine Learning?

Page 25: NLP: A Primer and Research Portfolio


Notes on Algorithm Diversity

Not all algorithms return scores/probabilities; some are binary.
• True, True, False
• 0.9, 0.7, 0.1

Not all algorithms support more than two classes.
• Cat, Dog, Mouse
• Cat, Not Cat

Not all algorithms scale similarly.
• 1M documents = 1 day
• 10M documents = {10 days, 100 days, 1000 days}
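
Two of these notes reduce to simple arithmetic. Scores become binary answers once a threshold is chosen (0.5 is an assumption here), and the three run times for 10M documents are what you would get if the cost grew linearly, quadratically, or cubically with the number of documents. That growth model is one plausible reading of the slide, not something it states.

```python
def to_labels(scores, threshold=0.5):
    # Turn scores/probabilities into binary answers.
    return [score >= threshold for score in scores]

def projected_days(base_days, base_docs, new_docs, exponent):
    # If run time grows like n**exponent, scaling n by a factor f
    # scales the run time by f**exponent.
    return base_days * (new_docs / base_docs) ** exponent

labels = to_labels([0.9, 0.7, 0.1])
projections = {k: projected_days(1, 1_000_000, 10_000_000, k) for k in (1, 2, 3)}
```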

What is Machine Learning?

Page 26: NLP: A Primer and Research Portfolio


eDiscovery – a brief aside

[Diagram labels: Inputs, Parameters, Outputs; 3, English, medium]

What is Machine Learning?

Page 27: NLP: A Primer and Research Portfolio

Secret: Most black boxes are very similar inside.

You just saw all of the building blocks.

eDiscovery – a brief aside

What is Machine Learning?

Page 28: NLP: A Primer and Research Portfolio

eDiscovery Terminology Translation:
• Predictive Coding = Classification Problem
• “Relevant”, “Privileged” – Class or Label
• Review: Training a model
• Production: Running a model

What is Machine Learning?

Page 29: NLP: A Primer and Research Portfolio

Sources

Where do we see it in the wild?

Page 30: NLP: A Primer and Research Portfolio

• Statutory Material
  • Statutes at Large
  • US Code
  • Michigan Compiled Laws
• Regulatory Material
  • Federal Register
  • Code of Federal Regulations
  • SEC Filings
  • FCC Orders
• Judicial Material
  • Briefs
  • Opinions
  • (Evidence: eDiscovery)
• Other Examples
  • Executive dialog (State of the Union, Twitter)
  • Federal Reserve governors

Sources of Natural Language

Page 31: NLP: A Primer and Research Portfolio


Enough of government data.

Is there any useful data inside of organizations like businesses?

What is Machine Learning?

Page 32: NLP: A Primer and Research Portfolio

Software Example

A thinly-veiled product pitch: ContraxSuite for M&A

Page 33: NLP: A Primer and Research Portfolio

The Hope (the holy document grail)

[Diagram: Acme, Inc. File Server, with document groups labeled Financials, Legal, Operations, Marketing & Sales, Supply Chain]

Real-World Example

Page 34: NLP: A Primer and Research Portfolio

The Reality (the document swamp)

Acme, Inc.

Real-World Example

Page 35: NLP: A Primer and Research Portfolio

Document Funnel: improving diligence in the real world

Five stages:
• Search: Search complete file stores, document management systems, and mail servers
• Organize: Utilize both guided and automatic document organization
• Identify: Identify important factors in policies, procedures, plans, and legal documents
• Train: Train new assistants on any document, no development required.
• Visualize: Generate visualizations for both one-time and ongoing use

Real-World Example

Page 36: NLP: A Primer and Research Portfolio

Unlocking the value in your documents

ContraxSuite

Search

Organize

Identify

Page 37: NLP: A Primer and Research Portfolio


Visualize

Train

Page 38: NLP: A Primer and Research Portfolio

Unlocking the value in your documents

ContraxSuite

Search

Organize

Identify

Steps
i. Identify policies, procedures, and plans
ii. Identify material pre-sales and sales discussions
iii. Identify traditional legal agreements

Points
i. Find all important written communication, not just some
ii. Sales teams and management teams frequently execute agreements without awareness of legal implications

Page 39: NLP: A Primer and Research Portfolio


Search

Organize

Identify

Documents can be organized using guided and automatic methods

Page 40: NLP: A Primer and Research Portfolio


Search

Organize

Identify

For policies, plans, and legal documents
i. Identify common clauses
ii. Identify common regulatory and statutory entities
iii. Identify common geopolitical and business entities
iv. Customize on your own.

Page 41: NLP: A Primer and Research Portfolio


Visualize

Train

For any type of document:
• Train new clause-tagging models
• Train new clause classifiers

Page 42: NLP: A Primer and Research Portfolio


Visualize

Train

• Reports for one-time/ad-hoc analysis

• Dashboards for ongoing usage

Page 43: NLP: A Primer and Research Portfolio

Research Examples

(all of the examples are mine)

Page 44: NLP: A Primer and Research Portfolio

Research Examples

Page 45: NLP: A Primer and Research Portfolio

Research Examples

Page 46: NLP: A Primer and Research Portfolio

Research Examples

Page 47: NLP: A Primer and Research Portfolio

Research Examples

Page 48: NLP: A Primer and Research Portfolio

Research Examples

Page 49: NLP: A Primer and Research Portfolio

Research Examples

Page 50: NLP: A Primer and Research Portfolio

Research Examples

Page 51: NLP: A Primer and Research Portfolio

Research Examples

Page 52: NLP: A Primer and Research Portfolio

NLP: A Primer and Portfolio

https://www.lexpredict.com

@lexpredict

Thank you!