Jan25 - Ottawa Machine Learning Meetup

CLASSIFYING OMNIBUS BILLS OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016 SAMUEL WITHERSPOON, MATHEW SONKE

Transcript of Jan25 - Ottawa Machine Learning Meetup

Page 1: Jan25 - Ottawa Machine Learning Meetup

CLASSIFYING OMNIBUS BILLS

OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016

SAMUEL WITHERSPOON, MATHEW SONKE

Page 2: Jan25 - Ottawa Machine Learning Meetup

DISCLAIMER

THIS IS OUR FIRST ITERATION AND IS A WORK IN PROGRESS.

Page 3: Jan25 - Ottawa Machine Learning Meetup

PURPOSE

WE WANT TO SHOW HOW WE MOVE FROM START TO FIRST SET OF RESULTS IN AN ML PROBLEM

Page 4: Jan25 - Ottawa Machine Learning Meetup

SUMMARY OF EFFORT

≈ 50 HOURS SPENT

≈ 120 BILLS MANUALLY CLASSIFIED

SOURCE CODE: https://github.com/switherspoon/MachineLearningMeetup

Page 5: Jan25 - Ottawa Machine Learning Meetup

WHAT IS AN OMNIBUS BILL?

TYPICALLY VERY LONG

TYPICALLY LOTS OF OTHER BILLS MODIFIED

FOR EXAMPLE: BILL C-51

A BILL THAT HAS A WIDE VARIETY OF TOPICS

Page 6: Jan25 - Ottawa Machine Learning Meetup

THAT DEFINITION INFORMED OUR FEATURES

FEATURES:

LENGTH OF BILL

DIVERSITY OF TOPICS IN THE BILL

NUMBER OF OTHER BILLS MODIFIED/REFERENCED

Page 7: Jan25 - Ottawa Machine Learning Meetup

WHAT DOES AN OMNIBUS LOOK LIKE?

Page 8: Jan25 - Ottawa Machine Learning Meetup

BILL C-51 - 41st PARLIAMENT 2nd SESSION (4/19 PAGES)

BILL C-54 - 41st PARLIAMENT 2nd SESSION (1/1 PAGE)

Page 9: Jan25 - Ottawa Machine Learning Meetup

GETTING STARTED

WE USED PYTHON3 WITH:

1. NLTK (http://www.nltk.org/) - FOR NLP

2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER

3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL

4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT

ALL INSTALLED WITH PIP3

Page 10: Jan25 - Ottawa Machine Learning Meetup

GETTING STARTED (CONT…)

WE SOURCED OUR DATA FROM:

https://openparliament.ca/

http://parl.gc.ca

Page 11: Jan25 - Ottawa Machine Learning Meetup

DATA ANALYSIS

MANUALLY SKIMMED AND EXTRACTED FEATURES FROM ≈120 BILLS AND BUILT A SPREADSHEET

link: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing

Page 12: Jan25 - Ottawa Machine Learning Meetup

MODEL FEATURES

LENGTH OF BILL

NUMBER OF BILLS REFERENCED

AVERAGE SEMANTIC DISTANCE OF TOPICS IN EACH BILL

Page 13: Jan25 - Ottawa Machine Learning Meetup

THE MODEL

Page 14: Jan25 - Ottawa Machine Learning Meetup

THE CLASSIFIER

NAIVE BAYES

EASY

FAST

UNDERSTANDABLE

WORKS WELL WITH SMALL TRAINING SET (MAYBE NOT THIS SMALL)
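To make "understandable" concrete, here is a minimal from-scratch sketch of what a Gaussian naive Bayes classifier computes on the three features. It is not the talk's code (which used scikit-learn); the function names, the variance floor and the toy rows are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, var_floor=1e-9):
    """Estimate per-class priors, feature means and feature variances."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Population variance plus a small floor (an assumption) so that a
        # constant feature never causes a division by zero.
        variances = [
            sum((x - m) ** 2 for x in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the label with the highest log posterior for one row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the normal density, one independent term per feature.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score

    return max(model, key=log_posterior)


# Hypothetical rows: (length in characters, acts referenced, avg. similarity).
rows = [
    (250000, 40, 0.20), (310000, 55, 0.15), (280000, 35, 0.25),
    (12000, 2, 0.60), (8000, 1, 0.70), (15000, 3, 0.55),
]
labels = ["omnibus"] * 3 + ["not omnibus"] * 3
model = fit_gaussian_nb(rows, labels)
prediction = predict_gaussian_nb(model, (200000, 30, 0.30))
```

The "naive" part is the loop that adds one independent term per feature; nothing models how bill length and number of referenced acts vary together.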

Page 15: Jan25 - Ottawa Machine Learning Meetup

LENGTH OF BILL

LENGTH OF RAW STRING READ IN FROM FILES

AS EASY AS: len(raw)

Page 16: Jan25 - Ottawa Machine Learning Meetup

NUMBER OF BILLS REFERENCED

(1) DATA RETRIEVAL

(2) PREPROCESSING

(3) NAMED ENTITY RECOGNITION (NER)

Page 17: Jan25 - Ottawa Machine Learning Meetup

(1) DATA RETRIEVAL

2 DATA SETS TO COLLECT

• CONSOLIDATED LIST OF ACTS

• FULL TEXT OF BILLS

Page 18: Jan25 - Ottawa Machine Learning Meetup

DATA RETRIEVAL CONT…

LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA (http://laws-lois.justice.gc.ca/eng/acts/)

WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE

• SCRAPY IS POWERFUL BUT NO PYTHON3 SUPPORT

• IMPORT.IO WORKED WELL FOR OUR NEEDS

Page 19: Jan25 - Ottawa Machine Learning Meetup

DATA RETRIEVAL CONT…

Page 20: Jan25 - Ottawa Machine Learning Meetup

DATA RETRIEVAL CONT…

TEXT OF BILLS RETRIEVED FROM OPENPARLIAMENT DATABASE USING SQL

Page 21: Jan25 - Ottawa Machine Learning Meetup

(2) PREPROCESSING

OPENPARLIAMENT DATABASE ISN’T PERFECT

• REMOVED DUPLICATES

• VERIFIED SESSION NUMBER WAS CORRECT

• CONVERTED EVERYTHING TO LOWERCASE
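A minimal sketch of the de-duplication and lowercasing steps, assuming each bill arrives as a (session, number, text) tuple. That record shape and the function name are hypothetical; the real OpenParliament schema may differ, and the session check is left out because the slides do not say how it was done.

```python
def clean_bills(records):
    """Drop repeated bills and lowercase the text of the ones we keep.

    `records` is an iterable of hypothetical (session, number, text) tuples.
    """
    seen = set()
    cleaned = []
    for session, number, text in records:
        key = (session, number)
        if key in seen:
            continue  # keep only the first copy of each bill
        seen.add(key)
        cleaned.append((session, number, text.lower()))
    return cleaned


# Toy input: the second C-51 row is a duplicate.
records = [
    ("41-2", "C-51", "An Act to enact the Security of Canada ..."),
    ("41-2", "C-51", "An Act to enact the Security of Canada ..."),
    ("41-2", "C-54", "An Act for granting to Her Majesty ..."),
]
cleaned = clean_bills(records)
```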

Page 22: Jan25 - Ottawa Machine Learning Meetup

(3) NAMED ENTITY RECOGNITION

MANY APPROACHES TO THIS

• HAND-CRAFTED GRAMMAR BASED

• STATISTICAL MODELS

• MATCHING AGAINST A LIBRARY

Page 23: Jan25 - Ottawa Machine Learning Meetup

NAMED ENTITY RECOGNITION CONT…

WE NOTICED COMMON PHRASES LIKE “AMENDS”, “RELATED AMENDMENTS”, “REPLACED BY” WHEN REFERENCING ACTS

ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY

• THIS GAVE US GOOD RESULTS WITH LITTLE CODE

• WON’T ALWAYS WORK
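Matching against a library can be sketched in a few lines: lowercase the bill, then look for each known act title as a whole phrase. The function name and the sample titles are assumptions for illustration; the actual list came from the consolidated list of acts, and the talk's matching code may differ.

```python
import re


def find_acts_referenced(bill_text, act_titles):
    """Return the set of act titles that appear in the bill text."""
    # Collapse whitespace so titles that wrap across lines still match.
    text = " ".join(bill_text.lower().split())
    found = set()
    for title in act_titles:
        needle = " ".join(title.lower().split())
        # Word boundaries avoid matching a title inside a longer word.
        if re.search(r"\b" + re.escape(needle) + r"\b", text):
            found.add(title)
    return found


# Hypothetical library and bill text.
act_titles = ["Criminal Code", "Income Tax Act", "Canada Labour Code"]
bill_text = "This enactment amends the Criminal Code\nand the Income  Tax Act."
found = find_acts_referenced(bill_text, act_titles)
number_of_acts_referenced = len(found)
```

This is also why the approach won't always work: an act cited by a short form or a misspelled title is simply missed.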

Page 24: Jan25 - Ottawa Machine Learning Meetup

SEMANTIC DISTANCE OF TOPICS

HYPOTHESIS:

SINCE AN OMNIBUS BILL HAS MANY DIFFERENT TOPICS THE AVERAGE DISTANCE BETWEEN TOPICS IN AN OMNIBUS BILL WILL BE GREATER THAN A NON-OMNIBUS BILL.

Page 25: Jan25 - Ottawa Machine Learning Meetup

SEMANTIC DISTANCE OF TOPICS PROCEDURE

(1) PREPROCESS A BILL

(2) LDA TOPIC MODELLING ON THE BILL

(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)

(4) AVERAGE TOPIC DISTANCE OF THE BILL

Page 26: Jan25 - Ottawa Machine Learning Meetup

(1) PREPROCESSING

• READ IN FILES

• TOKENIZE WORDS

• REMOVE STOP WORDS

• IGNORE WORD ORDER (BAG OF WORDS)
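The preprocessing steps above can be sketched with the standard library alone. The talk used NLTK for this; the regex tokenizer and the tiny stop-word list below are stand-in assumptions, not NLTK's behaviour.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list; NLTK ships a fuller one.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "for", "by"}


def bag_of_words(raw):
    """Tokenize, drop stop words, and count what is left (order is ignored)."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


bag = bag_of_words("The Minister of Finance amends the Act.")
```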

Page 27: Jan25 - Ottawa Machine Learning Meetup

(2) LDA TOPIC MODELING

• PROBABILISTIC TOPIC MODEL

• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION

• PROBABILISTICALLY PRESUMES DOCUMENTS CONTAIN A HIDDEN STRUCTURE BUILT AROUND TOPICS

• IGNORES WORD ORDER

Page 28: Jan25 - Ottawa Machine Learning Meetup

LDA CONT…

• MANY BILLS TOO SHORT FOR MEANINGFUL ANALYSIS W/ LDA

• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY SCORE OF ‘1’

• THIS IS A REALLY BAD WORKAROUND

• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES

• THIS IS AN OPTIMIZATION PROBLEM

MORE READING: https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Page 29: Jan25 - Ottawa Machine Learning Meetup

(3) SEMANTIC SIMILARITY

LIN SIMILARITY

BUT WHAT DOES THIS MEAN???

Page 30: Jan25 - Ottawa Machine Learning Meetup

WORDNET

A HIERARCHICAL TREE OF WORDS WITH MORE GENERAL WORDS AT THE ROOT AND MORE SPECIFIC WORDS AT THE LEAF

Page 31: Jan25 - Ottawa Machine Learning Meetup

SIMILARITY CONT…

LIN SIMILARITY

*OVERSIMPLIFICATION* WORDNET IS A GRAPH/NETWORK OF RELATED WORD SENSES - LIN SIMILARITY FINDS THE FIRST COMMON ANCESTOR (LOWEST COMMON ANCESTOR) OF TWO WORDS AND SCORES HOW SPECIFIC IT IS RELATIVE TO THE WORDS THEMSELVES
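For reference, the standard definition of Lin similarity behind the oversimplification is given below. Here lcs(c1, c2) is the lowest common ancestor of the two word senses in WordNet, and P(c) is the probability of meeting concept c in a reference corpus; the symbol names are notation chosen for this note, not taken from the slides.

```latex
\[
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
  = \frac{2 \cdot \mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
         {\mathrm{IC}(c_1) + \mathrm{IC}(c_2)},
\qquad
\mathrm{IC}(c) = -\log P(c)
\]
```

A very general common ancestor has information content near 0, so the score approaches 0; when the two senses are the same, the score is 1.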

Page 32: Jan25 - Ottawa Machine Learning Meetup

SIMILARITY CONT…

SCORES ARE BETWEEN 0 AND 1

>0.8 MEANS VERY SIMILAR

<0.2 MEANS NOT VERY SIMILAR

E.G.

CAT & DOG = 0.88 (BROWN IC) OR 0.89 (SEMCOR IC)

HOUND & DOG = 0.88 (BROWN IC) OR 0.87 (SEMCOR IC)

CHAIR & DOG = 0.16 (BROWN IC) OR 0.18 (SEMCOR IC)

Page 33: Jan25 - Ottawa Machine Learning Meetup

(4) AVG. TOPIC DISTANCE IN A BILL

WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH BILL:

SUM OF ALL COMPARED SCORES/TOTAL NUMBER OF COMPARISONS

THERE ARE FLAWS IN THIS APPROACH

• NOUN ONLY

• NO WEIGHTING
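A minimal sketch of the averaging step: compare every pair of topic words and divide the summed scores by the number of comparisons. The `similarity` argument stands in for a real Lin similarity call; the lookup table here reuses the Brown IC scores from the earlier slide plus one made-up value, and the fallback of 1 mirrors the short-bill workaround. All names are assumptions.

```python
from itertools import combinations


def average_similarity(words, similarity, too_short_score=1.0):
    """Sum of all pairwise scores divided by the number of comparisons."""
    pairs = list(combinations(words, 2))
    if not pairs:
        # Fewer than two words: nothing to compare, so use the workaround.
        return too_short_score
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Toy scores: cat/dog and chair/dog come from the slides (Brown IC);
# cat/chair is a hypothetical value added so every pair has a score.
TOY_SCORES = {
    frozenset(("cat", "dog")): 0.88,
    frozenset(("chair", "dog")): 0.16,
    frozenset(("cat", "chair")): 0.16,
}


def toy_similarity(a, b):
    return TOY_SCORES[frozenset((a, b))]


score = average_similarity(["cat", "dog", "chair"], toy_similarity)
```

Every pair counts equally here, which is the "no weighting" flaw: a word that barely belongs to a topic pulls the average as hard as the topic's strongest word.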

Page 34: Jan25 - Ottawa Machine Learning Meetup

CLASSIFICATION!

WE WERE RUNNING OUT OF TIME…

WE WANTED TO COMPARE:

• NAIVE BAYES

• RANDOM FOREST DECISION TREE

• SVM

WE COMPARED:

• NAIVE BAYES!

Page 35: Jan25 - Ottawa Machine Learning Meetup

CLASSIFIER COMPARISON

:(

NAIVE BAYES

• GAUSSIAN

• MULTINOMIAL

Page 36: Jan25 - Ottawa Machine Learning Meetup

MODEL EVALUATION

WE WON’T SHOW YOU ACCURACY BECAUSE…

Page 37: Jan25 - Ottawa Machine Learning Meetup

CLASS IMBALANCE!

•9 OMNIBUS BILLS IN 120 BILLS

•7.5% CHANCE A BILL IS AN OMNIBUS BILL

•A CLASSIFIER COULD HAVE 92.5% ACCURACY BY PICKING ‘NOT OMNIBUS’ EVERY TIME!

Page 38: Jan25 - Ottawa Machine Learning Meetup

PRECISION = True Positives / (True Positives + False Positives)

Page 39: Jan25 - Ottawa Machine Learning Meetup

RECALL = True Positives / (True Positives + False Negatives)
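To see why accuracy misleads here, this sketch scores a classifier that always answers "not omnibus" on the 9-in-120 split from the class imbalance slide. The function name is an assumption, and returning 0.0 when a ratio is undefined (no positive predictions at all) is a convention chosen for the sketch.

```python
def precision_recall(actual, predicted, positive="omnibus"):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    # Treat an undefined ratio as 0.0 (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 9 omnibus bills out of 120, as on the class imbalance slide.
actual = ["omnibus"] * 9 + ["not omnibus"] * 111
always_no = ["not omnibus"] * 120

accuracy = sum(a == p for a, p in zip(actual, always_no)) / len(actual)
precision, recall = precision_recall(actual, always_no)
```

Accuracy comes out at 92.5% while recall is 0: the classifier never finds a single omnibus bill.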

Page 40: Jan25 - Ottawa Machine Learning Meetup

BUT WE HAVE A CLASS IMBALANCE PROBLEM

Page 41: Jan25 - Ottawa Machine Learning Meetup

PRETENDING WE DON’T HAVE A PROBLEM

Page 42: Jan25 - Ottawa Machine Learning Meetup

CLASS IMBALANCE SOLUTION

REMOVE THE IMBALANCE!!!!

WE WENT FROM 65 TRAINING EXAMPLES TO 25 TO 11 BY REMOVING NEGATIVE EXAMPLES
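Removing negative examples can be sketched as random undersampling: keep every omnibus bill and a random subset of the others. The function name, the fixed seed and the placeholder bill IDs are assumptions; the slides do not say how the negatives to drop were chosen.

```python
import random


def undersample(examples, labels, positive, n_negative, seed=0):
    """Keep all positive examples and at most `n_negative` negatives."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    labelled = list(zip(examples, labels))
    positives = [pair for pair in labelled if pair[1] == positive]
    negatives = [pair for pair in labelled if pair[1] != positive]
    kept = rng.sample(negatives, min(n_negative, len(negatives)))
    combined = positives + kept
    rng.shuffle(combined)
    return combined


# Placeholder training set with the 5:60 ratio from the slides.
examples = [f"bill-{i}" for i in range(65)]
labels = ["omnibus"] * 5 + ["not omnibus"] * 60

reduced_25 = undersample(examples, labels, "omnibus", n_negative=20)
reduced_11 = undersample(examples, labels, "omnibus", n_negative=6)
```

The cost is that every dropped negative is real information thrown away, which matters a great deal with only 65 examples to begin with.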

Page 43: Jan25 - Ottawa Machine Learning Meetup

RESULTS

TRUE CLASS IMBALANCE (5:60)

NEW (5:20)

RATIOS ARE (#OMNIBUS:#NOTOMNIBUS)

Page 44: Jan25 - Ottawa Machine Learning Meetup

REMOVING EVEN MORE

NEW (5:20)

NEWEST (5:6)

Page 45: Jan25 - Ottawa Machine Learning Meetup

FINAL TRAINING SET


Page 46: Jan25 - Ottawa Machine Learning Meetup

CONCLUSIONS

EITHER NEED:

(1) SUBSTANTIALLY MORE DATA OR;

(2) BETTER ACCURACY ON TOPIC EXTRACTION AND NAMED ENTITY RECOGNITION

LOTS OF ROOM FOR IMPROVEMENT

WE STILL THINK THREE FEATURES IS ENOUGH

NEED TO DO MORE WORK CLEANING/VALIDATING OUR INPUT DATA

Page 47: Jan25 - Ottawa Machine Learning Meetup

CONCLUSIONS CONT…

WE ARE PERFORMING BETTER THAN RANDOM GUESSING!

WE WOULD LOVE HELP IMPROVING OUR APPROACH

Page 48: Jan25 - Ottawa Machine Learning Meetup

WAYS TO IMPROVE

USE MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY

LINKED TOPIC MODELLING

IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS

EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS

USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)

TRY TF/IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A DOCUMENT

EXPERIMENT WITH OTHER CLASSIFIERS

EXPERIMENT WITH MORE FEATURES

Page 49: Jan25 - Ottawa Machine Learning Meetup

QUESTIONS?

Page 50: Jan25 - Ottawa Machine Learning Meetup

Machine learning is no cakewalk.

Can we form a group to help Ottawa companies achieve greater success with ML?

What would this group do? Who would be in it?

How would it be funded? Do we have the local talent? What about protecting IP?

Who would make the decisions? Why bother?

We want your feedback! If you'd like to participate in ongoing discussions, please leave us your contact info.

Page 51: Jan25 - Ottawa Machine Learning Meetup

RELATIVE OPERATING CHARACTERISTICS (ROC)

[FIGURE: ROC CURVES FOR RANDOM GUESS, GAUSSIAN AND MULTINOMIAL; X-AXIS: FALSE POSITIVE RATE (0 TO 1); Y-AXIS: TRUE POSITIVE RATE (0 TO 1)]
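An ROC curve like this one is built by sweeping a threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below shows that calculation with the standard library; the labels and scores are made up for illustration and are not the talk's results.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    `labels` holds 1 for omnibus and 0 otherwise; a higher score means
    "more likely omnibus". Both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before adding a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (0.5 matches random guessing)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
auc = area_under_curve(points)
```

A curve that stays above the diagonal, with an area above 0.5, is what "better than random guessing" means on this plot.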