Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis II: Unsupervised Learning via Cluster Analysis¹

Gary King
http://GKing.Harvard.Edu

December 23, 2011

¹ © Copyright 2010 Gary King, All Rights Reserved.

Reading

Justin Grimmer and Gary King. 2010. “Quantitative Discovery of Qualitative Information: A General Purpose Document Clustering Methodology.” http://gking.harvard.edu/files/abs/discov-abs.shtml

Gary King (Harvard, IQSS) Quantitative Discovery from Text 2 / 23

The Problem: Discovery from Unstructured Text

Examples: scholarly literature, news stories, medical information, blog posts, comments, product reviews, emails, social media updates, audio-to-text summaries, speeches, press releases, legal decisions, etc.

10 minutes of worldwide email = 1 Library of Congress (LOC) equivalent

An essential part of discovery is classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research.” (Bailey, 1994).

We focus on cluster analysis: discovery through (1) classification and (2) simultaneously inventing a classification scheme

(We analyze text; our methods apply more generally)

Why Johnny Can’t Classify (Optimally)

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ $10^{28}$ × number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

That we think of all this as astonishing . . . is astonishing
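The counts above can be checked directly. The following is a minimal sketch, not part of the original slides, that computes Bell numbers with the Bell triangle recurrence; the function name `bell_numbers` is made up for illustration.

```python
def bell_numbers(n_max):
    """Return [Bell(0), ..., Bell(n_max)] using the Bell triangle."""
    bells = [1]   # Bell(0) = 1: the empty set has exactly one partition
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every later entry is the entry to its left plus the entry
        # above that left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        bells.append(new_row[0])
        row = new_row
    return bells


bells = bell_numbers(100)
print(bells[2], bells[3], bells[5])   # the values quoted on the slide: 2 5 52
print(len(str(bells[100])))           # number of decimal digits in Bell(100)
```

Python integers have arbitrary precision, so Bell(100) is computed exactly. It has 116 decimal digits, i.e. it is on the order of 10^115, which is why exhaustive search over partitions is out of the question even for 100 documents.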

Why HAL Can’t Classify Either

The Goal (an optimal application-independent cluster analysis method) is mathematically impossible:

No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .

Well-defined statistical, data analytic, or machine learning foundations

How to add substantive knowledge: With few exceptions, who knows?!

The literature: little guidance on when methods apply

Deep problem in cluster analysis literature: no way to know which method will work ex ante

If Ex Ante doesn’t work, try Ex Post

Methods and substance must be connected (no free lunch theorem)

The usual approach fails: hard to do it by understanding the model

We do it ex post (by qualitative choice). For example:

Create long list of clusterings; choose the best

Too hard for mere humans!

An organized list will make the search possible

E.g., consider two clusterings that differ only because one document (of many) moves from category 5 to 6
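The one-document example can be made concrete. This is a minimal sketch under stated assumptions: it uses a simple pair-counting measure (the number of document pairs that one clustering groups together and the other separates), which is not necessarily the measure used in the lecture, and the 60-document data set and all names are invented for illustration.

```python
from itertools import combinations


def pair_disagreements(labels_a, labels_b):
    """Count document pairs grouped together by one clustering but not the other."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    count = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a != same_b:
            count += 1
    return count


# Hypothetical data: 60 documents, 10 in each of categories 1..6.
base = [1 + d // 10 for d in range(60)]

# Same clustering, except document 40 moves from category 5 to 6.
moved = list(base)
moved[40] = 6

# A very different clustering of the same documents, for contrast.
interleaved = [1 + d % 6 for d in range(60)]

total_pairs = len(base) * (len(base) - 1) // 2
print(pair_disagreements(base, moved), "of", total_pairs)   # 19 of 1770
print(pair_disagreements(base, interleaved), "of", total_pairs)
```

Moving one document changes only the pairs that involve it (9 pairs lost in category 5, 10 gained in category 6), so the two clusterings sit very close together. Placing such near-duplicates next to each other is what makes an organized list searchable.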

Our Idea: Meaning Through Geography

We develop a (conceptual) geography of clusterings

A New Strategy

Make it easy to choose best clustering from millions of choices

1. Code text as numbers (in one or more of several ways)

2. Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (<15 mins)

3. (Too much for a person to understand, but organization will help)

4. Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection

5. “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings

6. A new animated visualization to explore the space of clusterings (smoothly morphing from one into others)

7. Millions of clusterings, easily comprehended (takes about 10-15 minutes to choose a clustering with insight)
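Step 4 calls for a distance metric between clusterings, but this section does not say which one. As an assumption, the sketch below uses the variation of information, a standard information-theoretic distance between partitions that satisfies the metric axioms and depends only on the cluster assignments, not on the application. The function name and the toy clusterings are hypothetical.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A), in nats, for two labelings of the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("clusterings must be non-empty and label the same documents")
    count_a = Counter(labels_a)               # cluster sizes in A
    count_b = Counter(labels_b)               # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))  # sizes of the overlaps
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -p_ab * log(p_ab / p_b) contributes to H(A|B),
        # -p_ab * log(p_ab / p_a) contributes to H(B|A).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


# Three toy clusterings of eight documents (illustrative data only).
c1 = [0, 0, 0, 0, 1, 1, 1, 1]
c2 = [0, 0, 0, 1, 1, 1, 1, 1]   # c1 with one document moved
c3 = [0, 1, 0, 1, 0, 1, 0, 1]   # unrelated to c1

clusterings = [c1, c2, c3]
# Pairwise distance matrix: the kind of input a 2-D projection would start from.
distances = [[variation_of_information(x, y) for y in clusterings] for x in clusterings]
for row in distances:
    print(["%.3f" % d for d in row])
```

Because the distance uses only the labels, clusterings produced by entirely different methods can be compared on the same footing, and the resulting distance matrix can be handed to any standard 2-D projection technique.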

Gary King (Harvard, IQSS) Quantitative Discovery from Text 8 / 23

Page 39: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended (takes about 10-15minutes to choose a clustering with insight)

Many Thousands of Clusterings, Sorted & Organized

You choose one (or more), based on insight, discovery, useful information, ...

[Figure: “Space of Cluster Solutions”, a 2-D map in which each point is one clustering, labeled by the method that produced it. Labels include biclust_spectral, clust_convex, mult_dirproc, dismea, rock, som, mixvmf, mixvmfVA; spec_* and mspec_* (cos, euc, man, mink, max, canb); affprop (cosine, euclidean, manhattan, info.costs, maximum); divisive and kmedoids (stand.euc, euclidean, manhattan); kmeans and hclust (euclidean, maximum, manhattan, canberra, binary, pearson, correlation, spearman, kendall), with hclust crossed with the ward, single, complete, average, mcquitty, median, and centroid linkages.]

Cluster Solution 1

“Other Presidents”: Carter, Clinton, Eisenhower, Ford, Johnson, Kennedy, Nixon, Obama, Roosevelt, Truman

“Reagan Republicans”: Bush, HWBush, Reagan

Cluster Solution 2

“Roosevelt To Carter”: Carter, Eisenhower, Ford, Johnson, Kennedy, Nixon, Roosevelt, Truman

“Reagan To Obama”: Bush, Clinton, HWBush, Obama, Reagan

Application-Independent Distance Metric: Axioms

Metric based on 3 assumptions

1 Distance between clusterings: a function of the pairwise document agreements (pairwise agreements ⇒ triples, quadruples, etc.)

2 Invariance: Distance is invariant to the number of documents (for any fixed number of clusters)

3 Scale: the maximum distance is set to log(num clusters)

Only one measure satisfies all three (the “variation of information”)

Meila (2007): derives same metric using different axioms (lattice theory)
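
For concreteness, the variation of information between two clusterings can be computed from the cluster label counts. The sketch below uses the standard definition VI(a, b) = H(a) + H(b) - 2 I(a; b) with natural logs; the function names are illustrative, and any rescaling implied by the scale axiom is not implemented.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """VI(a, b) = H(a) + H(b) - 2 I(a; b) for two labelings of the same documents."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same documents")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Mutual information between the two cluster assignments.
    mutual_info = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    return entropy(a) + entropy(b) - 2 * mutual_info


# Same partition with the labels swapped: distance is (numerically) zero.
relabeled = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
# One big cluster versus two equal clusters: distance equals log(2).
split = variation_of_information([0, 0, 0, 0], [0, 0, 1, 1])
```

The measure depends only on which documents are grouped together, not on the label names, so relabeling a clustering leaves the distance unchanged.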

Available March 2009: 304pp

Pb: 978-0-415-99701-0: $24.95

www.routledge.com/politics

“The list of authors in The Future of Political Science is a 'who’s who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d’oeuvres. It hooked me thoroughly.”

—Peter Kingstone, University of Connecticut

“In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate.”

—Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

“King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read.”

—Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

The Future of Political Science: 100 Perspectives

edited by Gary King, Harvard University, Kay Lehman Schlozman, Boston College, and Norman H. Nie, Stanford University

Evaluators Rate Machine Choices Better Than Their Own

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from            Overall Mean   Evaluator 1   Evaluator 2
Random Selection          1.38           1.16          1.60
Hand-Coded Clusters       1.58           1.48          1.68
Hand-Coding               2.06           1.88          2.24
Machine                   2.24           2.08          2.40

p.s. The hand-coders did the evaluation!
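
A quick arithmetic check of the table, as a sketch: the reported numbers are consistent with the overall mean being the simple average of the two evaluators, although the slide does not say how it was computed. The variable names are illustrative.

```python
# (overall mean, evaluator 1, evaluator 2), copied from the table above
table = {
    "Random Selection": (1.38, 1.16, 1.60),
    "Hand-Coded Clusters": (1.58, 1.48, 1.68),
    "Hand-Coding": (2.06, 1.88, 2.24),
    "Machine": (2.24, 2.08, 2.40),
}


def overall_matches(row, tol=1e-9):
    """True when the overall mean equals the average of the two evaluators."""
    overall, first, second = row
    return abs(overall - (first + second) / 2) < tol


consistent = all(overall_matches(row) for row in table.values())
# Gap between machine-chosen and hand-coded pairs on the 1-3 scale.
machine_gain = table["Machine"][0] - table["Hand-Coding"][0]
```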

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperforms human conceptualization

Demonstrate: new experimental designs for cluster evaluation

Inject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA coders

Informative discoveries ⇒ Experienced scholars analyzing texts

Discovery ⇒ You’re the judge

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their head

They can: compare two documents at a time

⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clustering

many pairs of documents

for coders: (1) unrelated, (2) loosely related, (3) closely related

Quality = mean(within cluster) - mean(between clusters)

Bias results against ourselves by not letting evaluators choose clustering
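
The quality score in the design above can be sketched in a few lines. This is a minimal illustration with made-up ratings; the data layout (a dict of rated document pairs) and the function name `cluster_quality` are assumptions, not part of the original design.

```python
from statistics import mean


def cluster_quality(ratings, labels):
    """Mean rating of within-cluster pairs minus mean rating of between-cluster pairs.

    ratings -- dict mapping (i, j) document-index pairs to a 1-3 score
    labels  -- cluster label for each document index
    """
    within = [s for (i, j), s in ratings.items() if labels[i] == labels[j]]
    between = [s for (i, j), s in ratings.items() if labels[i] != labels[j]]
    if not within or not between:
        raise ValueError("need rated pairs both within and between clusters")
    return mean(within) - mean(between)


# Made-up ratings for four documents in two clusters.
labels = [0, 0, 1, 1]
ratings = {(0, 1): 3, (2, 3): 2, (0, 2): 1, (1, 3): 1, (0, 3): 2}
quality = cluster_quality(ratings, labels)
```

A larger positive value means coders judged same-cluster pairs to be more closely related than pairs drawn from different clusters.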

Page 82: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)

Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard, IQSS) Quantitative Discovery from Text 14 / 23

Page 83: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard, IQSS) Quantitative Discovery from Text 14 / 23

Page 84: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Evaluation 1: Cluster Quality

What Are Humans Good For?

- They can’t: keep many documents & clusters in their head
- They can: compare two documents at a time
- ⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

- automated visualization to choose one clustering
- many pairs of documents
- for coders: (1) unrelated, (2) loosely related, (3) closely related
- Quality = mean(within cluster) - mean(between clusters)
- Bias results against ourselves by not letting evaluators choose clustering
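The quality measure can be illustrated with a minimal sketch. It assumes each sampled document pair has one coder score on the 1 to 3 scale; the function name `cluster_quality`, the document ids, and the toy ratings are hypothetical and are not taken from the lecture.

```python
from statistics import mean


def cluster_quality(ratings, labels):
    """Mean coder rating of same-cluster pairs minus mean rating of cross-cluster pairs.

    ratings: dict mapping a (doc_a, doc_b) pair to a coder score
             (1 = unrelated, 2 = loosely related, 3 = closely related).
    labels:  dict mapping each document id to its cluster id.
    """
    within = [s for (a, b), s in ratings.items() if labels[a] == labels[b]]
    between = [s for (a, b), s in ratings.items() if labels[a] != labels[b]]
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return mean(within) - mean(between)


# Hypothetical toy data: four documents in two clusters, five rated pairs.
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
ratings = {
    ("d1", "d2"): 3,  # same cluster
    ("d3", "d4"): 2,  # same cluster
    ("d1", "d3"): 1,  # different clusters
    ("d2", "d4"): 1,  # different clusters
    ("d1", "d4"): 2,  # different clusters
}
quality = cluster_quality(ratings, labels)
print(round(quality, 3))
```

A larger positive value means coders found same-cluster pairs more closely related than cross-cluster pairs; a value near zero means the clustering carries little information about relatedness as the coders saw it.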

Evaluation 1: Cluster Quality

[Figure: difference in cluster quality, (Our Method) − (Human Coders), on an axis running from −0.3 to 0.3, shown for three data sets: Lautenberg Press Releases, Policy Agendas Project, Reuters Gold Standard.]

- Lautenberg: 200 Senate press releases (appropriations, economy, education, tax, veterans, . . . )
- Policy Agendas: 213 quasi-sentences from Bush’s State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . )
- Reuters: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold standard” for supervised learning studies


Evaluation 2: More Informative Discoveries

- Found 2 scholars analyzing lots of textual data for their work
- Created 6 clusterings:
  - 2 clusterings selected with our method (biased against us)
  - 2 clusterings from each of 2 other methods (varying tuning parameters)
- Created info packet on each clustering (for each cluster: exemplar document, automated content summary)
- Asked for $\binom{6}{2} = 15$ pairwise comparisons
- User chooses ⇒ only care about the one clustering that wins
- Both cases a Condorcet winner:
  - “Immigration”: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
  - “Genetic testing”: Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
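A Condorcet winner is the option that wins every one of its pairwise comparisons. The sketch below shows one way to check for it. The helper `condorcet_winner` is hypothetical, and the pairwise choices are reconstructed from the “Immigration” ordering reported on the slide under the assumption that the scholar's choices were fully transitive; they are not the raw responses.

```python
from itertools import combinations
from math import comb


def condorcet_winner(candidates, prefers):
    """Return the candidate that wins all of its pairwise comparisons, or None.

    prefers: dict mapping frozenset({a, b}) to whichever of a, b was chosen.
    """
    for c in candidates:
        others = [o for o in candidates if o != c]
        if all(prefers[frozenset((c, o))] == c for o in others):
            return c
    return None


# Ordering reported for the "Immigration" case (best first).
order = ["Our Method 1", "vMF 1", "vMF 2", "Our Method 2", "K-Means 1", "K-Means 2"]

# Assumption: every pairwise choice agrees with that ordering.
# combinations() keeps list order, so `a` always precedes `b` in `order`.
pairs = list(combinations(order, 2))
prefers = {frozenset((a, b)): a for a, b in pairs}

print(len(pairs) == comb(6, 2))           # 6 clusterings give 15 pairs
print(condorcet_winner(order, prefers))
```

Because the user only adopts one clustering, checking for a single option that beats all others is enough; a full ranking of the remaining five is not needed, and the function returns `None` when preferences cycle and no such option exists.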


Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology
  - Advertising
  - Credit Claiming
  - Position Taking
- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)
- Apply our method

Gary King (Harvard, IQSS) Quantitative Discovery from Text 17 / 23

Page 107: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23

Page 108: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

affprop cosine

Red point: a clustering byAffinity Propagation-Cosine(Dueck and Frey 2007)

Close to:Mixture of von Mises-Fisherdistributions (Banerjee et. al.2005)

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23

Page 109: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

affprop cosine

mixvmf

Red point: a clustering byAffinity Propagation-Cosine(Dueck and Frey 2007)Close to:Mixture of von Mises-Fisherdistributions (Banerjee et. al.2005)

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23

Page 110: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Space between methods:

local cluster ensemble

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23

Page 111: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Space between methods: local cluster ensemble
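Placing methods in a common space presupposes some distance between two clusterings of the same documents. As a minimal sketch, assume the pair-counting disagreement distance (one minus the Rand index). This is an illustrative choice, not necessarily the metric used in the lecture, and the function name and toy labels below are hypothetical.

```python
from itertools import combinations


def pair_disagreement(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings disagree.

    A pair disagrees when one clustering puts the two documents in the
    same cluster and the other separates them. Cluster names themselves
    do not matter, only who is grouped with whom.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


# Toy cluster labels for five documents (hypothetical data).
kmeans_labels = [0, 0, 1, 1, 2]
hclust_labels = [1, 1, 0, 0, 0]
distance = pair_disagreement(kmeans_labels, hclust_labels)
print(distance)
```

Computing this distance for every pair of methods would give a distance matrix that could then be projected to two dimensions to draw a map like the one in the figure.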



Found a region with particularly insightful clusterings

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clustering, Random Walk (Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmedoids-Cosine

0.04 Spectral clustering, Symmetric (Metrics 1-6)
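The mixture weights above sum to one, which suggests a weighted combination of the six methods' clusterings. As a rough sketch, assume the combination is a weighted average of each method's document-by-document co-membership matrix. That assumption, the function names, and the toy labels for four documents are all hypothetical; only the six weights come from the slide.

```python
def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def ensemble_similarity(clusterings, weights):
    """Weighted average of co-membership matrices, one per clustering."""
    total = sum(weights)  # renormalize in case reported weights are rounded
    n = len(clusterings[0])
    sim = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(clusterings, weights):
        if len(labels) != n:
            raise ValueError("every clustering must label the same documents")
        co = comembership(labels)
        for i in range(n):
            for j in range(n):
                sim[i][j] += (weight / total) * co[i][j]
    return sim


# Weights from the slide, in the order listed there.
weights = [0.39, 0.30, 0.13, 0.09, 0.05, 0.04]

# Hypothetical cluster labels for four documents, one list per method.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 2],
]

similarity = ensemble_similarity(clusterings, weights)
for row in similarity:
    print([round(value, 2) for value in row])
```

Each entry of the result is the total weight of the methods that place the two documents together. Any clustering algorithm that accepts a similarity matrix could then be run on it to produce a single clustering for that point in the space.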








Clusters in this Clustering

Mayhew

Credit Claiming, Pork: “Sens. Frank R. Lautenberg (D-NJ) and Robert Menendez (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District”

Page 124: Advanced Quantitative Research Methodology, Lecture …gking.harvard.edu/files/discov.pdf · Advanced Quantitative Research Methodology, Lecture ... statistics, and data analysis

Example Discovery: Partisan Taunting

[Figure: space of clusterings; points labeled by clustering method (e.g., "affprop cosine", "kmeans euclidean", "hclust manhattan complete", "spec_cos"). Annotations: "Clusters in this Clustering"; "Mayhew". Highlighted clusters: Credit Claiming, Pork; Credit Claiming, Legislation.]

Credit Claiming, Legislation: “As the Senate begins its recess, Senator Frank Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period”


Advertising: “Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey”


Partisan Taunting: “Republicans Selling Out Nation on Chemical Plant Security”


Partisan Taunting: “Senator Lautenberg’s amendment would change the name of ... the Republican bill ... to ‘More Tax Breaks for the Rich and More Debt for Our Grandchildren Deficit Expansion Reconciliation Act of 2006’”


Partisan Taunting

Definition: Explicit, public, and negative attacks on another political party or its members

Taunting ruins deliberation
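The method labels in the clustering figures above follow the pattern "hclust <distance> <linkage>" (for example, "hclust manhattan complete"): one hierarchical clustering per combination of distance metric and linkage rule. As a rough illustration of how such a grid yields many different clusterings of the same documents, here is a minimal pure-Python sketch. The toy term-count vectors, the function names, the choice of k, and the restriction to three distances and three linkages are all assumptions made for illustration; this is not the authors' implementation.

```python
from itertools import product

# Distance functions, named after labels that appear in the figure.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

DISTANCES = {"euclidean": euclidean, "manhattan": manhattan, "maximum": maximum}

# Linkage rules: how a list of point-to-point distances between two
# clusters is reduced to a single cluster-to-cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def hclust(points, dist, link, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = link(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    # Canonical form: sorted tuples of document indices.
    return sorted(tuple(sorted(c)) for c in clusters)

# Hypothetical toy "documents" as term-count vectors (illustrative only).
docs = [(5, 0, 1), (4, 1, 0), (0, 6, 1), (1, 5, 0), (0, 1, 7), (3, 3, 3)]

# One clustering per (distance, linkage) pair, keyed by a figure-style label.
clusterings = {
    f"hclust {dname} {lname}": hclust(docs, dfn, lfn, k=3)
    for (dname, dfn), (lname, lfn) in product(DISTANCES.items(), LINKAGES.items())
}

for label, result in clusterings.items():
    print(label, result)
```

Each entry of `clusterings` is one partition of the same six toy documents; comparing these partitions with one another is the kind of input a map of clusterings would be built from.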


In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

Sen. Lautenberg on Senate Floor, 4/29/04

- “Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’” [Government Oversight]

- “The scopes trial took place in 1925. Sadly, President Bush’s veto today shows that we haven’t progressed much since then” [Healthcare]

- “Every day the House Republicans dragged this out was a day that made our communities less safe.” [Homeland Security]


Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.

- Confirmed using 64,033 press releases; 301 senator-years.

- Apply supervised learning method: measure the proportion of press releases in which a senator taunts the other party

[Figure: histogram. x-axis: Prop. of Press Releases Taunting (ticks 0.1 to 0.5); y-axis: Frequency (ticks 10, 20, 30).]


On average, senators taunt in 27% of press releases.
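The confirmation step above amounts to labeling each press release as taunting or not with a supervised classifier and then computing, for each senator-year, the share of releases labeled as taunting. A minimal sketch of that aggregation follows. The record layout, the senator names, and the toy labels are hypothetical; the classifier is assumed to have been run already; and taking an unweighted mean across senator-years is an assumption about how the reported average was formed.

```python
from collections import defaultdict

# Hypothetical classifier output: (senator, year, is_taunting) per press release.
labeled_releases = [
    ("Senator A", 2005, True),
    ("Senator A", 2005, False),
    ("Senator A", 2005, False),
    ("Senator A", 2005, False),
    ("Senator B", 2005, True),
    ("Senator B", 2005, True),
    ("Senator B", 2005, False),
    ("Senator B", 2005, False),
    ("Senator B", 2006, False),
    ("Senator B", 2006, False),
]

def taunting_proportions(records):
    """Return {(senator, year): share of that senator-year's releases labeled taunting}."""
    totals = defaultdict(int)
    taunts = defaultdict(int)
    for senator, year, is_taunting in records:
        totals[(senator, year)] += 1
        taunts[(senator, year)] += int(is_taunting)
    return {key: taunts[key] / totals[key] for key in totals}

def histogram(values, bins=10):
    """Count proportions in [0, 1] per equal-width bin; 1.0 falls in the last bin."""
    counts = defaultdict(int)
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return dict(counts)

proportions = taunting_proportions(labeled_releases)

# Unweighted mean over senator-years (an assumption; see the note above).
average = sum(proportions.values()) / len(proportions)

print(proportions)
print("average proportion taunting:", average)
print("histogram bin counts:", histogram(proportions.values()))
```

The bin counts from `histogram` correspond to bar heights in a frequency plot of the per-senator-year proportions.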


Advancing the Objective of Discovery

1) Conceptualization

2) Measurement

3) Validation

Quantitative Methods

Qualitative Methods (reading!)

Quantitative methods for conceptualization: aiding discovery

- Few formal methods designed explicitly for conceptualization

- Belittled: “Tom Swift and His Electric Factor Analysis Machine” (Armstrong 1967)

- Evaluation methods measure progress in discovery


For more information:

http://GKing.Harvard.edu
