Using Machine Learning to aid Journalism at the New York Times

Aiding journalism with machine learning @ NYT
Dae Il Kim - [email protected]

Description

This talk was presented to the NYC Open Data Meetup Group on Nov 11, 2014.

Speaker: Daeil Kim is a data scientist at the Times who is finishing his Ph.D. at Brown University on developing scalable inference algorithms for Bayesian nonparametric models. His work at the Times spans a variety of problems related to the company's business interests and audience development, as well as developing tools to aid journalism.

Topic: This talk focuses mostly on how machine learning can help with problems that crop up in journalism. We begin with how popular supervised learning algorithms such as regularized logistic regression helped a journalist uncover insights for a story on the recall of Takata airbags in cars. Afterwards, we turn to using topic modeling to deal with the large document dumps generated by FOIA (Freedom of Information Act) requests, and to Refinery, a simple web-based tool that eases such tasks. Finally, time permitting, we cover how topic models have been extended to the problem of designing an efficient recommendation engine for text-based content.

Transcript of Using Machine Learning to aid Journalism at the New York Times

Page 1: Using Machine Learning to aid Journalism at the New York Times

Aiding journalism with machine learning @ NYT

Dae Il Kim - [email protected]

Page 2

Overview

● The Story of Faulty Takata Airbags
  ○ Using Logistic Regression to predict suspicious comments
● Dealing with large document corpora: The FOIA problem
  ○ What are Topic Models?
    ■ What are topics and why are they useful?
    ■ Latent Dirichlet Allocation - A Graphical Model Perspective
    ■ Scalable Topic Models
  ○ Refinery: A Locally Deployable Web Platform for Large Document Analysis
    ■ The Technology Stack for Refinery
    ■ How does Refinery work?
● Future Directions

Page 3

The Story of Faulty Takata Airbags

Page 4

Complaints data from the NHTSA complaints database

The Data
The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi).

A Machine Learning Approach
Develop an algorithm that predicts whether a comment is suspicious or not. The algorithm then learns from the dataset which features are representative of a suspicious comment.

Page 5

The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm:

- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB

TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → break the comment into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → break the comment into bigrams (every two-word combination)

FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → remove bigrams that appear in fewer than 5 comments

DATA IS READY FOR TRAINING!

The data now consists of 33,204 examples with 56,191 features

Page 6

Cross-Validation

[Figure: a comment-by-feature matrix (rows are comment IDs; columns are features, i.e. word frequencies such as "0 0 0 3 1 0 2 0 ...") alongside a column of labels (S = Suspicious, NS = Not Suspicious). A subset of the rows forms the training set; after training, the held-out rows serve as the test set to obtain accuracy measures.]

Page 7

How did we do?

Experiment Setup
We hold out 25% of both the suspicious and non-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split.

Performance!
We obtain a very high AUC (~0.97) on our test sets.

Check what we missed
These comments are potentially worth checking twice.
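The experiment setup above can be sketched in scikit-learn: five stratified 75/25 splits, an L2-regularized logistic regression per split, and AUC on each held-out set. This is a sketch on synthetic data (the real NHTSA features and labels are not public in this form); `make_classification` merely mimics the heavy class imbalance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the labeled comments (~7% positives, roughly
# matching 2,219 suspicious labels out of 33,204 comments).
X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.93, 0.07], random_state=0)

# Hold out 25% of each class for testing, 5 times, with fresh random splits.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = LogisticRegression(penalty="l2", max_iter=1000)  # regularized LR
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]  # P(suspicious)
    aucs.append(roc_auc_score(y[test_idx], scores))

print(np.mean(aucs))  # mean AUC over the 5 splits
```

The most predictive features (next slide) then correspond to the largest positive and negative entries of `clf.coef_`.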

Page 8

The most predictive words / features

Predictive of a suspicious comment

Predictive of a normal comment.

After training the model, we then applied this on the full dataset.

We looked for comments that Hiroko didn't label as suspicious but the algorithm did, and followed up on them (374 of 33K total).

Result: 7 new cases in which a passenger was injured were discovered among the comments she had missed.

Page 9

Dealing with large document corpora (e.g. FOIA dumps)

We’ll use Topic Models for making sense of these large document collections!

Page 10

What are Topic Models?

There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…

Decompose documents as a probability distribution over “topic” indices

[Figure: a bar chart with y-axis from 0 to 1 showing the example document's probability distribution over the topics "Politics", "Climate Change", and "Genetics".]

Topics in turn represent probability distributions over the unique words in your vocabulary.

Page 11

Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

[Figure: the LDA graphical model, shown with an example document's word counts (dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3) and its mixture over the topics "Politics", "Climate Change", and "Genetics".]

Page 12

Bayes' Theorem

● Prior: our belief about the world before seeing data. In terms of LDA, our modeling assumptions / priors.
● Likelihood: given our model, how likely is this data?
● Normalization constant: needed for valid probabilities; it makes the problem a lot harder.
● Posterior distribution: the probability of our model given the data.
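In symbols (the slide's equation was an image; this is the standard statement of Bayes' rule, with θ standing for the model parameters and x for the observed data):

```latex
\underbrace{p(\theta \mid x)}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\;
        \overbrace{p(\theta)}^{\text{prior}}}
       {\underbrace{p(x)}_{\text{normalization constant}}},
\qquad
p(x) \;=\; \int p(x \mid \theta)\, p(\theta)\, d\theta .
```

The integral in the denominator is the term that makes the problem a lot harder.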

Page 13

Posterior Inference in LDA

GOAL: Obtain this posterior

which means that we need to calculate this intractable term:

For LDA, this represents the posterior over latent variables representing how much a document contains of topic k (θ) and topic word assignments z.
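The equations on this slide were images; per the standard LDA formulation (Blei et al.), the target posterior over the topic proportions θ and topic assignments z given the words w of a document, with hyperparameters α and topics β, and its intractable denominator, are presumably:

```latex
p(\theta, z \mid w, \alpha, \beta)
  \;=\;
  \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  \;=\;
  \int p(\theta \mid \alpha)
  \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta .
```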

LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Page 14

Scalable Learning & Inference in Topic Models

LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Analyze a subset of your total documents before updating.

Update θ, z, and β after analyzing each mini-batch of documents.
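This mini-batch scheme is what scikit-learn's online variational LDA implements; a sketch on a toy corpus (the talk does not name a specific implementation, and the documents below are invented):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus with two rough themes (invented for illustration).
docs = [
    "climate pollution emissions warming",
    "gene dna genome mutation",
    "pollution climate warming policy",
    "dna gene protein genome",
    "emissions warming climate pollution",
    "genome mutation dna gene",
    "policy climate emissions warming",
    "protein gene dna mutation",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=0)

# Analyze one mini-batch of documents at a time, updating the topics
# (β) after each batch rather than after a full pass over the corpus.
batch_size = 4
for start in range(0, X.shape[0], batch_size):
    lda.partial_fit(X[start:start + batch_size])

theta = lda.transform(X)  # per-document topic proportions (rows sum to 1)
```

Because each update touches only a mini-batch, this scales to corpora far too large to fit a batch algorithm over, which is exactly the FOIA-dump setting.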

Page 15

Refinery: An open source web-app for large document analyses

Daeil Kim @ New York Times
Founder of Refinery
[email protected]

Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
[email protected]

Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org

Page 16

Installing Refinery

3 Simple Steps to get Refinery running. Install these first!

1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080

Page 17

A Typical Refinery Pipeline

Step 1: Upload documents

Step 2: Extract Topics from a Topic Model

Step 3: Find a subset of documents with topics of interest.

Step 4: Discover Interesting Phrases

Page 18

A Quick Refinery Demo

Extracting NYT articles from keyword “obama” in 2013.

What themes / topics defined the Obama administration during 2013?

Page 19

Future Directions: Better tools for Investigative Reporting

Pipeline: (1) Collecting & Scraping Data → (2) Filtering & Cleaning Data → (3) Extracting Insights

Great tools like DocumentCloud take care of steps 1 & 2. Refinery focuses on extracting insights from relatively clean data. Enterprise stories might be completed in a fraction of the time.

Page 20

Interesting Extensions to Topic Models

Combining topic models with recommendation systems.

LDA / Topic Modeling + Matrix Factorization Model

[Figure: the generative processes of LDA / topic modeling and of a matrix factorization model, combined into a single recommendation model.]

Benefits

● The model thinks of users as mixtures of topics. We are what we read and rate.
● The ratings in turn help shape the topics that are discovered.
● It can do both in-matrix and out-of-matrix predictions.