Why LIWC sucks (or: saner options for social media content analysis)

Cornelius Puschmann
Alexander von Humboldt Institute for Internet and Society (HIIG)

21 October 2015
#FAIL-Workshop @ AoIR 2015
Phoenix, AZ

George Miquilena/Flickr

Aims of this talk

• Discuss theory + methodology together

• Present and compare:

✦ (Manual) Content Analysis (CA)

✦ Linguistic Inquiry and Word Count (LIWC)

The most intuitive procedures are not necessarily the best ones….

word cloud of Barack Obama’s 2009 inaugural address

…and not all tools are suitable for all contexts

(sentiment140.com)

sentiment analysis

Twitter …all too often

Bart/Flickr

Content Analysis (CA)

"(C)ontent analysis is a research technique for the objective, systematic, and quantitative description of the manifest content of communication" (Berelson, 1952, p. 18)

"Content analysis is a research technique for making replicable and valid inferences from texts to the contexts of their use" (Krippendorff, 1980/2004, p. 18)

"Content analysis is any technique for making inferences by objectively and systematically identifying specified characteristics of messages" (Holsti, 1969, p. 14)

Applications of CA

• Text:

✦ newspapers (media research)

✦ official records (history)

✦ personal letters, diaries (psychology)

✦ semi-structured interviews (sociology)

✦ party programs, political speeches (political science)

• Audiovisual:

✦ violence in films (television studies)

✦ gender roles in advertising (gender studies)

it’s about categorization!

The CA workflow

• assess literature

• formulate questions

• collect data

• propose categories

• develop coding instructions

• pretest & refine

• conduct analysis

• interpret results

The CA workflow

A. Conceptualization ("What is the phenomenon or event to be studied?")

1. Identify the problem

2. Review theory and research

3. Pose specific research questions and hypotheses

B. Design ("What will be needed to answer the specific research question or test the hypothesis?")

1. Define relevant content and study units

2. Define information units (syntactic, referential, thematic/argumentative)

3. Create dummy tables

4. Operationalize (develop coding protocol and sheets)

5. Specify population and sampling plans

6. Pretest and establish reliability procedures

C. Analysis ("What are the results?")

1. Process data (establish reliability and code content)

2. Apply statistical procedures

D. Interpretation ("What does it all mean?") (Riffe et al., 2005, p. 56)
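
Steps B.6 and C.1 both hinge on reliability. As a rough illustration, here is a minimal sketch of Cohen's kappa, one common chance-corrected agreement measure for two coders. The function name and the example codes are assumptions made up for this sketch, not taken from Riffe et al.; Krippendorff's alpha is another widely used measure and is not shown here.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders on the same units."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of units")
    n = len(codes_a)
    # Observed agreement: share of units that received the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement: chance overlap given each coder's category use.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same category throughout;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pretest: two coders assign one of two categories to four units.
coder_a = ["policy", "policy", "other", "other"]
coder_b = ["policy", "policy", "other", "policy"]
kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

The two coders agree on three of four units (75%), but half of that agreement would be expected by chance given how often each uses the categories, so kappa is noticeably lower than the raw agreement rate.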

Sample coding categories

(Riffe et al., 2005, p. 138)

CA + computational workflow

(Grimmer & Stewart, 2013, p. 268)

Lessons (per Krippendorff, 1980/2004)

• texts have no qualities that are "reader-independent"

• texts have no single irreducible meaning

• texts have meanings relative to particular contexts, discourses or purposes

• texts demand of analysts to make inferences about particular contexts, discourses or purposes

Observations

• tension between "what you see is what you get" and "reading between the lines"

• "precision at the cost of problem significance" (Krippendorff, 2004)

subjective <————> objective

interesting <————> trivial

• overt categories are easy to code (for a computer, too!)

• latent categories are hard to code reliably because they require:

➡ discourse knowledge (cotext)

➡ situational knowledge (physical context)

➡ world knowledge (social/cultural context)

Linguistic Inquiry and Word Count (LIWC)

• developed by James Pennebaker (U Texas) and colleagues in the context of clinical psychology

• dictionary approach: words represent psych. categories

• precursors: General Inquirer (Stone et al., 1966), DICTION (Hart, 1985)

• creators claim that LIWC2015 is "the gold standard in computerized text analysis" (LIWC website)
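
To make the dictionary approach concrete, here is a minimal sketch of what such a tool does in principle: match each word against category word lists and report the share of words per category. The category names, word lists, and function names below are hypothetical toy values invented for illustration; they are not the LIWC dictionary, and the trailing-asterisk stem matching is an assumption of this sketch.

```python
import re
from collections import Counter

# Hypothetical toy dictionary, NOT the LIWC dictionary: category -> patterns.
# A trailing "*" marks a stem that matches any word starting with it.
TOY_DICTIONARY = {
    "posemo": ["happy", "good", "love*"],
    "negemo": ["sad", "bad", "hate*"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches(token, pattern):
    """Exact match, or prefix match when the pattern ends in '*'."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern


def category_shares(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the percentage of tokens that match it."""
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        for category, patterns in dictionary.items():
            if any(matches(token, p) for p in patterns):
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {cat: 100.0 * counts[cat] / total for cat in dictionary}


shares = category_shares("I love this phone but the battery is bad")
for category, share in shares.items():
    print(f"{category}: {share:.1f}% of words")
```

Note that nothing in this procedure looks at word order, context, or speaker intent: the output is a count, and any psychological reading of it is an inference added by the analyst.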

Applications of LIWC

• authorship attribution (Argamon, Koppel, Pennebaker, & Schler, 2007)

• gender differences in interpersonal communication (Ireland et al., 2011)

• communication by terrorist organizations and authoritarian regimes (Hancock et al., 2010)

• change in expressed emotion over time (Yardi & Boyd, 2010)

LIWC categories

(Pennebaker et al., 2015, p. 4)

the (delicious) layer cake of meaning

• ambiguity

• irony

• explicitness

• compositionality
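
Two of these layers, compositionality and irony, are easy to demonstrate with a word-counting scorer. The sketch below is illustrative only: the mini-lexicon, the function name, and the example sentences are all made up for this purpose and do not come from any real tool.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def lexicon_score(text):
    """Count positive words minus negative words, ignoring word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


examples = [
    "This is good",
    "This is not good",         # negation: meaning depends on composition
    "Oh great, another delay",  # irony: meaning depends on context
]
scores = {text: lexicon_score(text) for text in examples}
for text, score in scores.items():
    print(f"{score:+d}  {text}")
```

All three sentences receive the same score, because a bag-of-words count cannot see that "not" reverses "good" or that "great" is meant ironically.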

Words as a proxy for…

• individual well-being

• personality traits

• group performance

• collective dynamics

➡ inferencing about social actors/processes

But is it valid?

• Sandra González-Bailón and Georgios Paltoglou: Signals of public opinion in online communication: a comparison of methods and data sources (AAPSS, 659, 2015)

• comparison of lexical content analysis tools with CA + machine learning

➡ results suggest that lexical approaches vary considerably in their validity and reliability!

Like rolling a die

(González-Bailón & Paltoglou, 2015, p. 105)

Progress?

"Learn how the words we use in everyday language reveal our thoughts, feelings, personality, and motivations." (LIWC website)

"It is easy to construct indices by counting unambiguously recognizable verbal events. It is a different matter to decide whether the indices thus derived represent anything significantly relevant to the subject’s mental states." (Rapoport, 1969, per Riffe et al., 2005)

Principles (per Grimmer & Stewart, 2013)

1. All quantitative models of language are wrong — but some are useful

2. Quantitative methods for text amplify resources and augment humans.

3. There is no globally best method for automated text analysis

4. Validate, Validate, Validate.
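
In practice, principle 4 usually means comparing automated labels against a human-coded sample. The following is a minimal sketch of such a check, assuming a small hand-coded validation set; the function name, label names, and data are hypothetical. It reports accuracy next to a majority-class baseline, so that a tool performing no better than guessing is easy to spot, plus precision and recall per label.

```python
from collections import Counter


def validate(human, automated):
    """Compare automated labels with human codes for the same units."""
    if len(human) != len(automated) or not human:
        raise ValueError("need the same non-empty set of units from both sources")
    n = len(human)
    pairs = list(zip(human, automated))
    accuracy = sum(h == a for h, a in pairs) / n
    # Baseline: accuracy of always predicting the most frequent human code.
    baseline = Counter(human).most_common(1)[0][1] / n
    per_label = {}
    for label in sorted(set(human) | set(automated)):
        true_pos = sum(h == label and a == label for h, a in pairs)
        predicted = sum(a == label for a in automated)
        actual = sum(h == label for h in human)
        per_label[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return accuracy, baseline, per_label


# Hypothetical validation sample: six units coded by a human and by a tool.
human_codes = ["neg", "neg", "neg", "pos", "pos", "neu"]
tool_labels = ["neg", "pos", "neg", "pos", "neu", "neu"]
accuracy, baseline, per_label = validate(human_codes, tool_labels)
print(f"accuracy = {accuracy:.2f}, majority baseline = {baseline:.2f}")
for label, metrics in per_label.items():
    print(label, metrics)
```

Which metric matters most depends on the research question; the point of the sketch is only that the comparison against human coding has to be made explicitly for each new corpus.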

Thank you for your interest!

DGustafson/Wikipedia

References

Argamon, S., Koppel, M., Pennebaker, J., & Schler, J. (2007). Mining the blogosphere: age, gender, and the varieties of self-expression. First Monday, 12(9). Retrieved from http://eprints.pascal-network.org/archive/00003406/

Berelson, B. (1952). Content analysis in communication research. New York: Free Press.

González-Bailón, S., & Paltoglou, G. (2015). Signals of Public Opinion in Online Communication: A Comparison of Methods and Data Sources. Annals of the American Academy of Political and Social Science, 659(1), 95–107. doi:10.1177/0002716215569192

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. doi:10.1093/pan/mps028

Hancock, J. T., Beaver, D. I., Chung, C. K., Frazee, J., Pennebaker, J. W., Graesser, A., & Cai, Z. (2010). Social language processing: A framework for analyzing the communication of terrorists and authoritarian regimes. Behavioral Sciences of Terrorism and Political Aggression, 2(2), 108–132. doi:10.1080/19434471003597415

Hart, R. P. (1985). Systematic Analysis of Political Discourse: The Development of DICTION. In Political Communication Yearbook: 1984 (pp. 97–134). Carbondale, IL: Southern Illinois University Press.

Holsti, O. (1969). Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison Wesley.

Ireland, M. E., Slatcher, R. B., Eastwick, P. W., Scissors, L. E., Finkel, E. J., & Pennebaker, J. W. (2011). Language Style Matching Predicts Relationship Initiation and Stability. Psychological Science, 22(1), 39–44. doi:10.1177/0956797610392928

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology. Thousand Oaks: Sage.

Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Austin, TX. Retrieved from http://liwc.wpengine.com/wp-content/uploads/2015/11/LIWC2015_LanguageManual.pdf

Riffe, D., Lacy, S., & Fico, F. (2005). Analyzing Media Messages: Using Quantitative Content Analysis in Research. Mahwah, NJ: Lawrence Erlbaum Associates.

Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The General Inquirer: A computer approach to content analysis. Cambridge, MA: MIT Press.

Yardi, S., & Boyd, D. (2010). Dynamic Debates: An Analysis of Group Polarization Over Time on Twitter. Bulletin of Science, Technology & Society, 30(5), 316–327. doi:10.1177/0270467610380011