Source: University of California, Los Angeles (IPAM), helper.ipam.ucla.edu/publications/caws3/caws3_13063.pdf


Matching Methods for High-Dimensional Data with Applications to Text

Molly Roberts, Brandon Stewart and Rich Nielsen

UCSD, Princeton and MIT

May 11, 2016

Roberts (UCSD) Text Matching May 11, 2016 1 / 32

Introduction

How do people react to online repression?

- Lots of governments try to control online information
- Censoring the whole internet is hard (# of bloggers ≫ # of censors)
- With limited external enforcement, governments scare people into self-policing
- Governments might jail some bloggers to scare people
- Then encourage self-censorship by signaling off-limits topics

Roberts (UCSD) Text Matching 28 April 2016 2 / 32


Introduction

How do people react to online repression?

- Lots of governments try to control online information
- Censoring the whole internet is hard (# of bloggers ≫ # of censors)
- With limited external enforcement, governments scare people into self-policing
- Governments might jail some bloggers to scare people
- Then encourage self-censorship by signaling off-limits topics
- But this could turn out one of two ways:
  - Bloggers might take cues to self-censor, OR
  - Bloggers might hate censorship and rebel
- How censorship works is important for understanding self-censorship
- BUT self-censorship is very hard to measure

Roberts (UCSD) Text Matching 28 April 2016 3 / 32


Introduction

The perfect experiment

1. Be the Chinese government
2. Randomly assign censorship
3. See what bloggers write after censorship

Problem 1: unethical
Problem 2: we aren't the Chinese government

Roberts (UCSD) Text Matching 28 April 2016 4 / 32


Introduction

How Can We Measure Deterrence?

The best approximation: find two bloggers with

✓ similar users,
✓ similar censorship histories,
✓ similar numbers of posts,
✓ similar previous post sensitivity,
...

who wrote very similar posts on the same day, where only one post was censored: a censorship 'mistake'.

Does the censored blogger's behavior change? Does the censored blogger stay away from the topic, or pursue it?

Roberts (UCSD) Text Matching 28 April 2016 5 / 32
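The matched-pair design above can be sketched with off-the-shelf text similarity: represent posts as TF-IDF vectors and, for each censored post, pick the most similar uncensored post. The toy corpus and variable names below are invented for illustration; this is not the authors' actual matching procedure.

```python
# Toy sketch of pairing a censored post with its closest uncensored twin.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "protest in the city square last night",          # censored
    "protest gathered in the city square yesterday",  # uncensored, similar
    "my favorite dumpling recipe",                    # uncensored, dissimilar
]
censored = [True, False, False]

# TF-IDF vectors and pairwise cosine similarities between all posts.
X = TfidfVectorizer().fit_transform(posts)
sim = cosine_similarity(X)

# For each censored post, match the most similar uncensored post.
matches = {}
for i, is_c in enumerate(censored):
    if is_c:
        candidates = [j for j, c in enumerate(censored) if not c]
        matches[i] = max(candidates, key=lambda j: sim[i, j])
print(matches)  # → {0: 1}
```

In the real design one would additionally restrict candidates to posts written on the same day by bloggers with similar histories, as the slide's checklist requires.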


Introduction

Text Matching

- Text as a pre-treatment confounder: a surprisingly frequent problem
- Applications:
  - Does censorship change a blogger's behavior?
  - Do targeted killings of Islamic extremists create interest in their work?
  - In International Relations, are women cited less frequently than men?
  - Control for letters of recommendation, trade treaties, Congressional bills, etc.
- BUT existing matching methods are impossible to apply to high-dimensional data
  - You can't possibly match on every word! (and you wouldn't want to)
  - We care about controlling for covariates predictive of treatment
  - But with text, we don't know what predicts treatment

Very little work on this.

Roberts (UCSD) Text Matching 28 April 2016 6 / 32


Introduction

Our Approach to Text Matching

1. Construct analogs to current methods:
   - Propensity score matching → Multinomial Inverse Regression
   - Coarsened exact matching → Topically Coarsened Matching
2. Identify benefits and drawbacks of each
3. Create a new method, Topical Inverse Regression Matching (TIRM), by combining the two

Roberts (UCSD) Text Matching 28 April 2016 7 / 32
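As a rough illustration of what "topically coarsened" matching might look like (a sketch under my own assumptions, not the authors' implementation): estimate each document's topic proportions, coarsen them into a few bins, and then require exact agreement on the binned topic profile, keeping only strata that contain both treated and control documents. The topic proportions below are simulated stand-ins.

```python
# Hypothetical coarsened-exact-matching sketch on simulated topic shares.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=6)     # 6 docs x 3 topic proportions
treated = np.array([1, 0, 1, 0, 0, 1])        # simulated treatment labels

def coarsen(row, cuts=(0.33, 0.66)):
    # Map each topic proportion to a coarse bin: 0=low, 1=mid, 2=high.
    return tuple(np.digitize(row, cuts))

# Group documents into strata by their coarsened topic profile.
strata = {}
for i, row in enumerate(theta):
    strata.setdefault(coarsen(row), []).append(i)

# Exact matching keeps only strata with both treated and control docs.
matched = {k: v for k, v in strata.items()
           if {treated[i] for i in v} == {0, 1}}
```

Coarsening trades off bias (coarser bins admit worse matches) against the number of units retained, just as in ordinary coarsened exact matching.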


Introduction

Outline of Talk

- A Quick Review of Matching
- Text Analogs to Current Matching Methods
- Topical Inverse Regression Matching
- Applications

Roberts (UCSD) Text Matching 28 April 2016 8 / 32


Matching Methods for Text

Previous Approaches to Matching

- Goal: t_i ⊥⊥ y_i(1), y_i(0) | x_i (treatment independent of the potential outcomes, conditional on the covariate vector x_i)
- Many approaches: propensity score matching, coarsened exact matching, genetic matching, covariate-balanced propensity scores, entropy balancing, synthetic matching, Mahalanobis matching, exact matching, subclassification matching, nearest neighbor matching, full matching, ...
- Today, two of these strategies:
  1. model p(t_i | x_i) → propensity scores
  2. match on all x_i → coarsened exact matching
- Both strategies scale poorly with high-dimensional covariates.

Roberts (UCSD) Text Matching 28 April 2016 9 / 32
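Strategy 1 can be sketched in a few lines: fit a model of p(t_i | x_i), then nearest-neighbor match each treated unit to the control unit with the closest estimated score. The data here are simulated, and this is generic propensity-score matching, not the TIRM method the talk builds toward.

```python
# Minimal propensity-score matching sketch on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # covariates x_i
p_true = 1 / (1 + np.exp(-X[:, 0]))      # treatment depends on first covariate
t = rng.binomial(1, p_true)              # treatment indicator t_i

# Step 1: model p(t_i | x_i) to get each unit's propensity score.
score = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the closest score.
controls = np.where(t == 0)[0]
pairs = {i: controls[np.argmin(np.abs(score[controls] - score[i]))]
         for i in np.where(t == 1)[0]}
```

With text, the dimensionality problem on this slide is exactly that X would have one column per vocabulary word, so the propensity model is badly underdetermined without dimension reduction.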


Matching Methods for Text

Previous Approaches to Matching

I Goal: ti ⊥⊥ yi (1), yi (0)|~xiI Many approaches: propensity score matching, coarsened exact matching, genetic

matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,

mahalanobis matching, exact matching, subclassification matching, nearest neighbor

matching, full matching . . .

I Today two of these strategies:

1. model p(ti |~xi ) propensity scores

2. match on all ~xi coarsened exact matching

I Both strategies scale poorly with high-dimensional covariates.

Roberts (UCSD) Text Matching 28 April 2016 9 / 32

Matching Methods for Text

Previous Approaches to Matching

I Goal: ti ⊥⊥ yi (1), yi (0)|~xiI Many approaches: propensity score matching, coarsened exact matching, genetic

matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,

mahalanobis matching, exact matching, subclassification matching, nearest neighbor

matching, full matching . . .

I Today two of these strategies:

1. model p(ti |~xi ) propensity scores2. match on all ~xi coarsened exact matching

I Both strategies scale poorly with high-dimensional covariates.

Roberts (UCSD) Text Matching 28 April 2016 9 / 32

Matching Methods for Text

Previous Approaches to Matching

I Goal: ti ⊥⊥ yi (1), yi (0)|~xiI Many approaches: propensity score matching, coarsened exact matching, genetic

matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,

mahalanobis matching, exact matching, subclassification matching, nearest neighbor

matching, full matching . . .

I Today two of these strategies:

1. model p(ti |~xi ) propensity scores2. match on all ~xi coarsened exact matching

I Both strategies scale poorly with high-dimensional covariates.

Roberts (UCSD) Text Matching 28 April 2016 9 / 32

Matching Methods for Text

Propensity Scores: An Analog for Text

I Classical approach
  I fit logistic regression π̂i = p(ti |~xi )
  I match units with similar probability of treatment
  I pros: units matched by scalar (π̂i ) instead of long vector (~xi )
  I cons: only produces balance in expectation

I Problem: high-dimensional confounders
  I X is N × V (# of documents by # of words in vocab)
  I can only estimate π̂i well when N ≫ V , which isn’t the case!
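The classical approach above can be sketched in Python; everything here is illustrative (toy covariates, a hand-rolled gradient-ascent logistic fit standing in for a proper GLM):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, t, iters=500, lr=0.5):
    """Fit p(t_i | x_i) by gradient ascent on the logistic log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)          # intercept + one weight per covariate
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, ti in zip(X, t):
            err = ti - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

random.seed(0)
# toy data: treatment more likely when the first covariate is large
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
t = [1 if random.random() < sigmoid(2 * x[0]) else 0 for x in X]

w = fit_logit(X, t)
scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# match each treated unit to the control with the closest scalar score
treated = [i for i, ti in enumerate(t) if ti == 1]
control = [i for i, ti in enumerate(t) if ti == 0]
pairs = [(i, min(control, key=lambda j: abs(scores[i] - scores[j]))) for i in treated]
```

Matching stays cheap because only the scalar π̂i is compared, which is exactly what breaks when π̂i itself cannot be estimated (N ≪ V).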

Roberts (UCSD) Text Matching 28 April 2016 10 / 32


Matching Methods for Text

Propensity Scores: An Analog for Text

I Solution: Multinomial Inverse Regression (Cook 2007, Taddy 2013)

  I assume xi ∼ Multinomial(~qi , mi = Σv xi ,v )
  I where qi ,v ∝ exp(αv + tiφv )
  I φv measures relationship between treatment and word
  I projection zi = Φ′(~xi/mi ) is a sufficient reduction: X ⊥⊥ T |Z → estimate π̂i with the projection

I Match on zi or π̂i
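A toy sketch of the projection step. The φ below is a crude smoothed log-ratio stand-in, not an actual MNIR fit (a real fit would maximize the multinomial likelihood with qi,v ∝ exp(αv + tiφv)); the vocabulary and counts are invented:

```python
import math

# toy corpus: each row is (word-count vector x_i over a 4-word vocab, treatment t_i)
docs = [
    ([5, 1, 0, 2], 1),
    ([4, 2, 1, 1], 1),
    ([0, 3, 5, 2], 0),
    ([1, 2, 4, 3], 0),
    ([3, 1, 1, 1], 1),
    ([0, 4, 3, 3], 0),
]
V = 4

# crude stand-in for the MNIR loadings phi_v: smoothed log-ratio of each
# word's rate under treatment vs. control
rate = {1: [1.0] * V, 0: [1.0] * V}   # add-one smoothing
for x, t in docs:
    for v in range(V):
        rate[t][v] += x[v]
tot = {t: sum(rate[t]) for t in (0, 1)}
phi = [math.log((rate[1][v] / tot[1]) / (rate[0][v] / tot[0])) for v in range(V)]

def project(x):
    """Sufficient-reduction projection z_i = phi' (x_i / m_i)."""
    m = sum(x)
    return sum(p * xv / m for p, xv in zip(phi, x))

z = [project(x) for x, _ in docs]
```

Each document collapses to a single scalar zi, on which units can then be matched.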

Roberts (UCSD) Text Matching 28 April 2016 11 / 32


Matching Methods for Text

Problems with MNIR Matching

Posts equally likely to be treated are not always semantically similar:

I this wouldn’t be a problem in expectation, BUT

I balance is hard to assess in the text case

I estimates could be more efficient if matches were more similar

Roberts (UCSD) Text Matching 28 April 2016 12 / 32


Matching Methods for Text

Coarsened Exact Matching: An Analog for Text

I Classical approach

  I coarsen each variable into natural categories
    e.g. years of education → {elementary school, high school, college}
  I exactly match on the coarsened variables
  I pros: bounds imbalance on each variable

I Problem: high-dimensional confounder
  I thousands of variables; even if we coarsen, no exact match
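The classical CEM recipe can be sketched as follows (the units, cutpoints, and covariates are hypothetical):

```python
from collections import defaultdict

def coarsen(years):
    """Coarsen years of education into natural categories."""
    if years <= 8:
        return "elementary"
    elif years <= 12:
        return "high school"
    return "college"

units = [
    # (id, treatment, years of education, age)
    (0, 1, 11, 34), (1, 0, 12, 37), (2, 1, 16, 52),
    (3, 0, 15, 49), (4, 1, 7, 61),  (5, 0, 18, 44),
]

# exact-match on the signature of coarsened covariates
strata = defaultdict(lambda: {"treated": [], "control": []})
for uid, t, edu, age in units:
    key = (coarsen(edu), age // 10)        # coarsen both covariates
    strata[key]["treated" if t else "control"].append(uid)

# keep only strata containing both treated and control units
matched = {k: v for k, v in strata.items() if v["treated"] and v["control"]}
```

The coarsening bounds within-stratum imbalance on each variable, but with thousands of variables the signature space explodes and most strata end up empty on one side.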

Roberts (UCSD) Text Matching 28 April 2016 13 / 32


Matching Methods for Text

Coarsened Exact Matching: An Analog for Text

I Solution: topically coarsened matching

  I innovation: coarsen across variables
    simple example: “tax”, “income”, “tariff” → “economics”
  I topics must be equivalent across documents instead of words
  I bounds imbalance across groups of stochastically equivalent words

I Estimate a topic model

I Match on the topic density rather than raw word counts
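A toy version of coarsening across variables. The word→topic map and documents here are invented; a real analysis would estimate the topics with a topic model rather than hand-assign them:

```python
from collections import defaultdict

# hypothetical coarsening across variables: "tax", "income", "tariff" -> "economics"
topic_of = {"tax": "economics", "income": "economics", "tariff": "economics",
            "protest": "politics", "election": "politics", "censor": "politics"}

def topic_profile(words, bins=4):
    """Coarsened topic proportions: each topic's share, rounded to 1/bins."""
    counts = defaultdict(int)
    for w in words:
        counts[topic_of.get(w, "other")] += 1
    n = len(words)
    return tuple(sorted((k, round(bins * v / n) / bins) for k, v in counts.items()))

docs = [
    (["tax", "income", "protest", "tariff"], 1),
    (["tariff", "tax", "election", "income"], 0),
    (["protest", "election", "censor", "censor"], 1),
]

# exact match on the coarsened topic profile instead of raw word counts
strata = defaultdict(lambda: {1: [], 0: []})
for i, (words, t) in enumerate(docs):
    strata[topic_profile(words)][t].append(i)

matched = {k: v for k, v in strata.items() if v[1] and v[0]}
```

Matching on the low-dimensional topic profile makes exact matches feasible where matching on the full vocabulary would find none.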

Roberts (UCSD) Text Matching 28 April 2016 14 / 32


Matching Methods for Text

Problems with Topical CEM

Topics aren’t always the most important predictor of treatment:

Roberts (UCSD) Text Matching 28 April 2016 15 / 32


Matching Methods for Text

Topical Inverse Regression Matching (TIRM)

I We need something that:

1. Bounds imbalance between documents
2. Doesn’t leave out important words

I TIRM: jointly estimate the probability of treatment and the topic density

I Match on topic proportions & topic-specific probability of treatment
  I topical bounding properties
  I estimates which words are associated with treatment

I Ingredients:
  I Structural Topic Model (Roberts, Stewart, Tingley et al. 2014)
  I with treatment as a content covariate
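Given the fitted outputs, the matching step might look like this; the θ and proj values below are hypothetical placeholders for what an STM fit would produce:

```python
# hypothetical fitted quantities per document: topic proportions theta_i
# (two topics) and the treatment projection proj_i
docs = [
    {"t": 1, "theta": (0.8, 0.2),  "proj": 0.9},
    {"t": 1, "theta": (0.3, 0.7),  "proj": 0.6},
    {"t": 0, "theta": (0.75, 0.25), "proj": 0.85},
    {"t": 0, "theta": (0.2, 0.8),  "proj": 0.1},
    {"t": 0, "theta": (0.35, 0.65), "proj": 0.55},
]

def dist(a, b):
    """Euclidean distance over the stacked matching variables (theta, proj)."""
    va = a["theta"] + (a["proj"],)
    vb = b["theta"] + (b["proj"],)
    return sum((x - y) ** 2 for x, y in zip(va, vb)) ** 0.5

treated = [i for i, d in enumerate(docs) if d["t"] == 1]
control = [i for i, d in enumerate(docs) if d["t"] == 0]
pairs = {i: min(control, key=lambda j: dist(docs[i], docs[j])) for i in treated}
```

Matching on (θ, proj) jointly finds controls that are both topically similar and equally "treatment-like" in their word choice.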

Roberts (UCSD) Text Matching 28 April 2016 16 / 32


Matching Methods for Text

Structural Topic Model

I STM adds “structure” to Latent Dirichlet Allocation (Blei, Ng and Jordan 2003) via a prior

I Replace the topic prevalence prior → (heuristically) a GLM with arbitrary covariates (Blei and Lafferty 2006; Mimno and McCallum 2008)

I Replace the distribution over words → multinomial logit (Eisenstein, Ahmed and Xing 2011)

I Documents have different expected topic proportions based on observed covariates.

I Topics are now deviations from a baseline distribution.

P(word | topic, doc) ∝ exp(κ(m) + topic · κ(k) + covariate_doc · κ(c) + topic · covariate_doc · κ(int))

κ(c) and κ(int) capture how words are related to treatment.
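The word distribution above just normalizes exponentiated sums of κ terms. A small numeric sketch, with all κ values invented and a single topic indicator standing in for K topics:

```python
import math

V = 3  # toy vocabulary size
kappa_m   = [0.0, 0.5, -0.5]   # baseline log-frequencies
kappa_k   = [1.0, 0.0, -1.0]   # topic deviation for one topic
kappa_c   = [0.5, -0.5, 0.0]   # treatment (content covariate) deviation
kappa_int = [0.2, 0.0, -0.2]   # topic-treatment interaction

def word_dist(topic_on, treated):
    """P(word | topic, doc) via the multinomial-logit form, softmax-normalized."""
    logits = [kappa_m[v]
              + (kappa_k[v] if topic_on else 0.0)
              + (kappa_c[v] if treated else 0.0)
              + (kappa_int[v] if topic_on and treated else 0.0)
              for v in range(V)]
    zmax = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - zmax) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

p_treated = word_dist(topic_on=True, treated=True)
p_control = word_dist(topic_on=True, treated=False)
```

With κ(c) and κ(int) positive for word 0, the treated version of the topic puts more mass on that word, which is how the model learns which words mark treatment.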

Roberts (UCSD) Text Matching 28 April 2016 17 / 32


Matching Methods for Text

TIRM

Match on:

1. θ: Estimated topic proportions (K covariates)

2. proj:

I Let x_i/m_i be the fraction of document i accounted for by each word.

I (κ(c))′(x_i/m_i): the covariate-only projection

I (κ(c))′(x_i/m_i) + (1/m_i) Σ_v x_{i,v} ((κ(int)_v)′ θ_i): the topic-covariate projection

3. Any other covariates you think are important

We generally use CEM to match, but other methods could be used.

Limitations of TIRM

I New: relies on a parametric method to reduce dimensions

I Old: requires SUTVA, relevant covariates

Roberts (UCSD) Text Matching 28 April 2016 18 / 32
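The two projections above can be sketched numerically under assumed shapes; every variable name here is a hypothetical stand-in for the quantities on this slide.

```python
import numpy as np

# Sketch of the two TIRM projections for a single document i.
rng = np.random.default_rng(1)
V, K = 40, 4                         # vocabulary size, number of topics

x_i = rng.integers(0, 5, size=V).astype(float)  # word counts for document i
m_i = x_i.sum()                      # document length
theta_i = rng.dirichlet(np.ones(K))  # estimated topic proportions
kappa_c = rng.normal(size=V)         # covariate (treatment) coefficients
kappa_int = rng.normal(size=(V, K))  # topic-covariate interaction coefficients

# covariate-only projection: (kappa_c)' (x_i / m_i)
proj_cov = kappa_c @ (x_i / m_i)

# topic-covariate projection adds (1/m_i) * sum_v x_{i,v} (kappa_int_v' theta_i)
proj_full = proj_cov + (x_i @ (kappa_int @ theta_i)) / m_i

print(proj_cov, proj_full)
```

Each projection reduces a V-dimensional document to a single scalar, which is what makes it feasible to match on alongside the K topic proportions.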


Matching Methods for Text

Simulations

Set up:

1. Simulate 200 outcomes and treatments with confounding topics and words

2. Estimate STM

3. Condition on topics and projection

[Figure: density of the estimated effect for the Naive Estimator, Topics Only, and Topics and Projection, compared against the true effect]

Roberts (UCSD) Text Matching 28 April 2016 19 / 32
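A toy analogue of this setup: a single confounder stands in for the confounding topics and words, and regression adjustment stands in for matching. This illustrates the bias the simulation targets; it is not the authors' simulation code.

```python
import numpy as np

# One confounder z drives both treatment and outcome, so the naive
# difference in means is biased; adjusting for z recovers the truth.
rng = np.random.default_rng(2)
n, true_effect = 5000, -1.0

z = rng.normal(size=n)                              # confounder
t = (z + rng.normal(size=n) > 0).astype(float)      # treatment depends on z
y = true_effect * t + 2.0 * z + rng.normal(size=n)  # outcome depends on both

naive = y[t == 1].mean() - y[t == 0].mean()         # biased: ignores z

X = np.column_stack([np.ones(n), t, z])             # adjust for z via OLS
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"naive: {naive:.2f}, adjusted: {beta[1]:.2f}, truth: {true_effect}")
```

The naive estimate here even has the wrong sign, which mirrors how badly the unadjusted estimator can perform when treatment is driven by document content.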


Estimating Censorship Effects

Example 1: How do bloggers react to censorship?

I Data: 593 bloggers over 6 months spanning 2011–2012

I 150,000 posts

I Return to blogs to measure censorship

I Find censors’ mistakes: two similar blogs, different censorship

I Also match on date, previous censorship, previous sensitivity.

I How do ’treated’ bloggers react to censorship?

I Outcome: Bloggers’ writings after censorship:

I censorship rate after

I sensitivity of blog text after (estimated by TIRM)

I topical content of blogs after

Roberts (UCSD) Text Matching 28 April 2016 20 / 32
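The matching step on covariates such as date, previous censorship, and previous sensitivity uses CEM, which can be sketched as follows. This is an illustrative reimplementation with made-up data, not the cem package.

```python
import numpy as np
from collections import defaultdict

# Minimal coarsened-exact-matching sketch: bin each covariate, then keep
# only strata that contain both a treated and a control unit.
rng = np.random.default_rng(3)
n = 200
treated = rng.integers(0, 2, size=n)    # censorship "treatment" indicator
covs = rng.normal(size=(n, 2))          # e.g. prior censorship, sensitivity

bins = np.floor(covs / 0.5).astype(int) # coarsen each covariate into bins
strata = defaultdict(list)
for i, key in enumerate(map(tuple, bins)):
    strata[key].append(i)

matched = [i for members in strata.values()
           if {int(treated[m]) for m in members} == {0, 1}
           for i in members]
print(f"{len(matched)} of {n} units matched")
```

Units outside any mixed stratum are pruned, which is how CEM trades sample size for balance on the coarsened covariates.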


Estimating Censorship Effects

TIRM Finds Almost Identical Posts

[Figure: histograms of string kernel similarity (0.0–1.0) between matched posts, one panel each for the Original Data, TIRM, Topic Match, and MNIR Match]

Roberts (UCSD) Text Matching 28 April 2016 21 / 32
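One common way to score similarity between matched posts is a normalized k-spectrum string kernel over character n-grams. The slides do not specify which string kernel was used, so this is one plausible sketch rather than the authors' implementation.

```python
from collections import Counter
import math

def spectrum(s, k=3):
    """Count the character k-grams of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def string_kernel(a, b, k=3):
    """Cosine-normalized k-spectrum similarity in [0, 1]."""
    ca, cb = spectrum(a, k), spectrum(b, k)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) *
                     sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# identical posts score exactly 1.0; unrelated posts score near 0
print(string_kernel("the censors missed this post",
                    "the censors missed this blog post"))
```

A histogram of this score over matched pairs, as in the figure above, shows how close each matching method gets to near-duplicate text.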

Estimating Censorship Effects

Results

I We find 46 matched blogs (censors’ mistakes)

I Nearly perfect matches

I Most matched posts are about Bo Xilai incident, Maoist protests

I 5 posts before treatment:

I No statistical difference in actual censorship

I No statistical difference in TIRM-predicted censorship

I (Not surprising, since we are matching on these!)

I 5 posts after treatment:

I Treated group: 20% censorship; Control group: 7% censorship

I TIRM estimates treated text significantly more sensitive than control

I Treated group talks significantly more about the Bo Xilai incident after censorship than control

I Treated group talks significantly more about CCP History/Mao after censorship than control

Roberts (UCSD) Text Matching 28 April 2016 22 / 32
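A back-of-envelope check that the 20% vs. 7% gap is distinguishable from noise, using a pooled two-proportion z-test. The sample sizes are assumptions built from the figures on this slide (46 matched blogs, 5 posts each per group); the authors' actual inference may differ.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

n = 46 * 5                      # posts per group (assumption)
z = two_prop_z(round(0.20 * n), n, round(0.07 * n), n)
print(f"z = {z:.2f}")           # well above 1.96, i.e. significant at 5%
```

Under these assumed counts the gap is several standard errors wide, consistent with the significant differences reported above.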


Estimating Effects of Gender on Citations in IR

Example 2: Does gender affect citations in Political Science?

I Maliniak, Powers, Walter (2013): women get cited less than men in IR

I Problem: women write about different topics than men

I Maliniak et al solution: Code articles into (many) categories

I Our solution: Text matching!

I Data: 3,201 journal articles from top 12 IR journals, 1980-2006.

I Code lots of variables, including gender, article age, tenure, etc.

I Treatment: all-female; Control: co-ed/all-male

I Our motive: Find similar articles, see how they are cited differently.

Roberts (UCSD) Text Matching 28 April 2016 23 / 32


Estimating Effects of Gender Citations in IR

Words men and women use differently in IR

Original Data:

Roberts (UCSD) Text Matching 28 April 2016 24 / 32

Estimating Effects of Gender Citations in IR

Words men and women use differently in IR

Topic Matching:

Roberts (UCSD) Text Matching 28 April 2016 24 / 32

Estimating Effects of Gender Citations in IR

Words men and women use differently in IR

TIRM:

Roberts (UCSD) Text Matching 28 April 2016 24 / 32

Estimating Effects of Gender Citations in IR

TIRM Reduces Topical Differences

[Figure: mean topic difference (Women − Men), x-axis from −0.10 to 0.10, comparing Full Data Set (Unmatched), TIRM, MNIR, Topic Matching, and Human Coding Matched across 15 topics:
Topic 2: model, variabl, data, effect, measur
Topic 9: nuclear, weapon, arm, forc, defens
Topic 6: game, will, cooper, can, strategi
Topic 14: war, conflict, state, disput, democraci
Topic 1: state, power, intern, system, polit
Topic 7: polici, foreign, public, polit, decis
Topic 12: soviet, militari, war, forc, defens
Topic 13: trade, econom, polici, bank, intern
Topic 8: polit, parti, polici, govern, vote
Topic 15: war, israel, peac, conflict, arab
Topic 3: polit, conflict, group, ethnic, state
Topic 4: econom, develop, industri, countri, world
Topic 10: state, china, unit, foreign, polici
Topic 11: intern, state, organ, institut, law
Topic 5: polit, social, one, theori, world]

Roberts (UCSD) Text Matching 28 April 2016 25 / 32

Estimating Effects of Gender Citations in IR

Results

I Maliniak et al: Women receive 80% the citations of men

I In our data: women receive fewer citations; the result is robust across matches

I Final match: Women receive 40-60% the citations of men

I Still looking into why we are getting more extreme results

I The difference may be driven by very high citation counts

Roberts (UCSD) Text Matching 28 April 2016 26 / 32

Estimating Effects of Gender Citations in IR

Ex. 3: Did killing Bin Laden make his ideas less popular?

“His death will serve as a global clarion call for another generation of jihadists.” – Ed Husain (CFR)

“al-Qaida may emerge even more radical, and more closely united under the banner of an iconic martyr.” – Abdel Bari Atwan (The Guardian)

Roberts (UCSD) Text Matching 28 April 2016 27 / 32

Estimating Effects of Gender Citations in IR

Ex. 3: Did killing Bin Laden make his ideas less popular?

“The idea that Obama made a strategic misstep by killing a man responsible for the death of thousands of U.S. citizens and committed to killing thousands more is absurd. Rather than making him a martyr, Bin Laden’s killing demonstrated that he was, like the rest of us, mortal.” – Robert Simcox (LA Times)

Roberts (UCSD) Text Matching 28 April 2016 28 / 32

Estimating Effects of Gender Citations in IR

Ex. 3: Did killing Bin Laden make his ideas less popular?

We don’t really know.

Usama Bin Laden (5/2/2011), Anwar al-Awlaki (9/30/2011), Abu Yahya al-Libi (6/5/2012)

Roberts (UCSD) Text Matching 28 April 2016 29 / 32

Estimating Effects of Gender Citations in IR

Empirical Strategy

I View-count data from a Jihadist website, scraped over time

I Does targeted killing of Bin Laden increase views of his work?

I TIRM matching + match on pre-treatment page views.

I QOI is the ATT: nearest-neighbor matching instead of CEM

I Validation: Matches accord with sub-pages on website

Roberts (UCSD) Text Matching 28 April 2016 30 / 32
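A minimal sketch of the nearest-neighbor step described above, under assumed covariates (topic proportions plus scaled pre-treatment page views); the function name and toy numbers are illustrative, not the authors' implementation. Each treated document is paired with its closest control document, matching with replacement, which targets the ATT.

```python
import numpy as np

def nn_match_att(X, treated):
    """Pair each treated unit with the nearest control unit in covariate
    space (Euclidean distance), matching with replacement for the ATT."""
    X = np.asarray(X, dtype=float)
    is_t = np.asarray(treated, dtype=bool)
    t_idx, c_idx = np.flatnonzero(is_t), np.flatnonzero(~is_t)
    pairs = []
    for i in t_idx:
        d = np.linalg.norm(X[c_idx] - X[i], axis=1)
        pairs.append((int(i), int(c_idx[np.argmin(d)])))
    return pairs

# toy covariates: [topic proportion, scaled pre-treatment views]
X = np.array([[0.10, 0.20], [0.90, 0.80], [0.12, 0.22], [0.88, 0.78]])
treated = np.array([1, 1, 0, 0])
views_post = np.array([120.0, 300.0, 40.0, 90.0])  # post-treatment views

pairs = nn_match_att(X, treated)
# ATT estimate: mean treated-minus-control difference over matched pairs
att = float(np.mean([views_post[i] - views_post[j] for i, j in pairs]))
```

In practice the covariates should be standardized (or a Mahalanobis distance used) so that page views do not dominate the topic proportions.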

Estimating Effects of Gender Citations in IR

Martyr Effect: Clear short-term increase in page views

[Figure: estimated increase in page views per document (y-axis, −3000 to 3000) by date (x-axis, 2011-05-02 through 2011-11-02, with an inset extending into 2012–2014); matching on topics and page views.]

Figure: Estimated effects of Usama Bin Laden’s death (on May 2, 2011) on subsequent page views of his documents on a large jihadist web-library.

Roberts (UCSD) Text Matching 28 April 2016 31 / 32

Estimating Effects of Gender Citations in IR

Conclusion

I Lots of applications measure pre-treatment confounders with text

I No methods developed yet to do this

I We develop a new method, Topical Inverse Regression Matching (TIRM)

I Matching on topical density estimate bounds differences between topics

I Matching on probability of treatment balances on words related to treatment

I Future work:

I Develop theoretical properties of TIRM
I Extend to high-dimensional cases other than text
I Create an R package

Roberts (UCSD) Text Matching 28 April 2016 32 / 32
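To make the two matching quantities concrete, here is a schematic, simplified version of the TIRM idea: combine per-document topic proportions with an estimated probability of treatment given the words, then match on the stacked covariates. The plain logistic regression below is a stand-in for the multinomial inverse regression step, and all names and data are toy values; this is a sketch of the idea, not the forthcoming R package.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Bare-bones logistic regression by gradient descent; a stand-in for
    the inverse-regression projection from word counts to treatment."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def tirm_covariates(topics, word_counts, treated):
    """Stack the two TIRM matching quantities: topic proportions (the
    topical density estimate) and P(treatment | words)."""
    w, b = fit_logistic(word_counts, treated)
    p_treat = 1.0 / (1.0 + np.exp(-(word_counts @ w + b)))
    return np.column_stack([topics, p_treat])

# toy corpus: 4 documents, 2 topics, 2 words; word 0 signals treatment
topics = np.array([[0.7, 0.3], [0.6, 0.4], [0.65, 0.35], [0.55, 0.45]])
word_counts = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]])
treated = np.array([1.0, 1.0, 0.0, 0.0])

covs = tirm_covariates(topics, word_counts, treated)
# covs can now be fed to any matcher (CEM, nearest neighbor, ...)
```

Matching on the last column prunes documents whose vocabulary strongly predicts treatment, which is how words related to treatment get balanced.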
