Using Machine Learning to aid Journalism at the New York Times

Aiding journalism with machine learning @ NYT
Dae Il Kim - [email protected]

Description

This talk was presented to the NYC Open Data Meetup Group on Nov 11, 2014.

Speaker: Daeil Kim is a data scientist at the Times who is finishing his Ph.D. at Brown University on developing scalable inference algorithms for Bayesian nonparametric models. His work at the Times spans a variety of problems related to the company's business interests and audience development, as well as developing tools to aid journalism.

Topic: This talk focuses mostly on how machine learning can help with problems that crop up in journalism. We begin with how popular supervised learning algorithms such as regularized logistic regression helped a journalist uncover insights for a story on the recall of Takata airbags in cars. Afterwards, we turn to using topic modeling to deal with the large document dumps generated by FOIA (Freedom of Information Act) requests, and to Refinery, a simple web-based tool that eases such tasks. Finally, time permitting, we cover how topic models have been extended to the problem of designing an efficient recommendation engine for text-based content.

Transcript of Using Machine Learning to aid Journalism at the New York Times

Page 1: Using Machine Learning to aid Journalism at the New York Times

Aiding journalism with machine learning @ NYT

Dae Il Kim - [email protected]

Page 2

Overview

● The Story of Faulty Takata Airbags
  ○ Using Logistic Regression to predict suspicious comments
● Dealing with large document corpora: The FOIA problem
  ○ What are Topic Models?
    ■ What are topics and why are they useful?
    ■ Latent Dirichlet Allocation - A Graphical Model Perspective
    ■ Scalable Topic Models
  ○ Refinery: A Locally Deployable Web Platform for Large Document Analysis
    ■ The Technology Stack for Refinery
    ■ How does Refinery work?
● Future Directions

Page 3

The Story of Faulty Takata Airbags

Page 4

Complaints data from the NHTSA complaints database

The Data
The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi).

A Machine Learning Approach
Develop an algorithm that predicts whether a comment is suspicious or not. The algorithm then learns from the dataset which features are representative of a suspicious comment.

Page 5

The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm:

- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB

TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → break the comment into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → break the comment into bigrams (every two-word combination)

FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → remove bigrams that appear in fewer than 5 comments

DATA IS READY FOR TRAINING!

The data now consists of 33,204 examples with 56,191 features

Page 6

Cross-Validation

[Figure: a comment-by-feature matrix (rows are comment IDs; columns are features, i.e. word frequencies such as "0 0 0 3 1 0 2 0 ...") alongside a column of labels (S = Suspicious, NS = Not Suspicious). A subset of the rows forms the training set; after training, the held-out rows serve as the test set to obtain accuracy measures.]

Page 7

How did we do?

Experiment Setup
We hold out 25% of both the suspicious and non-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split.

Performance!
We obtain a very high AUC (~0.97) on our test sets.

Check what we missed
These comments are potentially worth checking twice.
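The experiment setup above can be sketched in scikit-learn: five stratified 75/25 splits, an L2-regularized logistic regression per split, and AUC on each held-out set. This is a sketch on synthetic data (the real NHTSA features and labels are not public in this form); `make_classification` merely mimics the heavy class imbalance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the labeled comments (~7% positives, roughly
# matching 2,219 suspicious labels out of 33,204 comments).
X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.93, 0.07], random_state=0)

# Hold out 25% of each class for testing, 5 times, with fresh random splits.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = LogisticRegression(penalty="l2", max_iter=1000)  # regularized LR
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]  # P(suspicious)
    aucs.append(roc_auc_score(y[test_idx], scores))

print(np.mean(aucs))  # mean AUC over the 5 splits
```

The most predictive features (next slide) then correspond to the largest positive and negative entries of `clf.coef_`.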

Page 8

The most predictive words / features

Predictive of a suspicious comment

Predictive of a normal comment.

After training the model, we then applied this on the full dataset.

We looked for comments that Hiroko didn't label as suspicious but the algorithm did, and followed up on them (374 of 33K total).

Result: 7 new cases in which a passenger was injured were discovered among the comments she had missed.

Page 9

Dealing with large document corpora (e.g. FOIA dumps)

We’ll use Topic Models for making sense of these large document collections!

Page 10

What are Topic Models?

There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…

Decompose documents as a probability distribution over “topic” indices

[Figure: a bar chart with y-axis from 0 to 1 showing the example document's probability distribution over the topics "Politics", "Climate Change", and "Genetics".]

Topics in turn represent probability distributions over the unique words in your vocabulary.

Page 11

Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

[Figure: the LDA graphical model, shown with an example document's word counts (dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3) and its mixture over the topics "Politics", "Climate Change", and "Genetics".]

Page 12

Bayes' Theorem

● Prior: our belief about the world before seeing data. In terms of LDA, our modeling assumptions / priors.
● Likelihood: given our model, how likely is this data?
● Normalization constant: needed for valid probabilities; it makes the problem a lot harder.
● Posterior distribution: the probability of our model given the data.
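In symbols (the slide's equation was an image; this is the standard statement of Bayes' rule, with θ standing for the model parameters and x for the observed data):

```latex
\underbrace{p(\theta \mid x)}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\;
        \overbrace{p(\theta)}^{\text{prior}}}
       {\underbrace{p(x)}_{\text{normalization constant}}},
\qquad
p(x) \;=\; \int p(x \mid \theta)\, p(\theta)\, d\theta .
```

The integral in the denominator is the term that makes the problem a lot harder.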

Page 13

Posterior Inference in LDA

GOAL: Obtain this posterior

which means that we need to calculate this intractable term:

For LDA, this represents the posterior over latent variables representing how much a document contains of topic k (θ) and topic word assignments z.
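The equations on this slide were images; per the standard LDA formulation (Blei et al.), the target posterior over the topic proportions θ and topic assignments z given the words w of a document, with hyperparameters α and topics β, and its intractable denominator, are presumably:

```latex
p(\theta, z \mid w, \alpha, \beta)
  \;=\;
  \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  \;=\;
  \int p(\theta \mid \alpha)
  \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta .
```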

LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Page 14

Scalable Learning & Inference in Topic Models

LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Analyze a subset of your total documents before updating.

Update θ, z, and β after analyzing each mini-batch of documents.
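This mini-batch scheme is what scikit-learn's online variational LDA implements; a sketch on a toy corpus (the talk does not name a specific implementation, and the documents below are invented):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus with two rough themes (invented for illustration).
docs = [
    "climate pollution emissions warming",
    "gene dna genome mutation",
    "pollution climate warming policy",
    "dna gene protein genome",
    "emissions warming climate pollution",
    "genome mutation dna gene",
    "policy climate emissions warming",
    "protein gene dna mutation",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=0)

# Analyze one mini-batch of documents at a time, updating the topics
# (β) after each batch rather than after a full pass over the corpus.
batch_size = 4
for start in range(0, X.shape[0], batch_size):
    lda.partial_fit(X[start:start + batch_size])

theta = lda.transform(X)  # per-document topic proportions (rows sum to 1)
```

Because each update touches only a mini-batch, this scales to corpora far too large to fit a batch algorithm over, which is exactly the FOIA-dump setting.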

Page 15

Refinery: An open source web-app for large document analyses

Daeil Kim @ New York Times
Founder of Refinery
[email protected]

Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
[email protected]

Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org

Page 16

Installing Refinery

3 Simple Steps to get Refinery running. Install these first!

1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080

Page 17

A Typical Refinery Pipeline

Step 1: Upload documents

Step 2: Extract Topics from a Topic Model

Step 3: Find a subset of documents with topics of interest.

Step 4: Discover Interesting Phrases

Page 18

A Quick Refinery Demo

Extracting NYT articles from keyword “obama” in 2013.

What themes / topics defined the Obama administration during 2013?

Page 19

Future Directions: Better tools for Investigative Reporting

Pipeline: (1) Collecting & Scraping Data → (2) Filtering & Cleaning Data → (3) Extracting Insights

Great tools like DocumentCloud take care of steps 1 & 2. Refinery focuses on extracting insights from relatively clean data. Enterprise stories might be completed in a fraction of the time.

Page 20

Interesting Extensions to Topic Models

Combining topic models with recommendation systems.

LDA / Topic Modeling + Matrix Factorization Model

[Figure: the generative processes of LDA / topic modeling and of a matrix factorization model, combined into a single recommendation model.]

Benefits

● The model thinks of users as mixtures of topics. We are what we read and rate.
● The ratings in turn help shape the topics that are discovered.
● It can do both in-matrix and out-of-matrix predictions.