Quantitative Individuated Corpus Linguistics

1. Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation. Cornelius Puschmann, Universität Osnabrück, 5 June 2007

2. Preliminaries

3. A totalizing view of language

competence, performance, production, investigation

but... whose competence? whose performance?

social function, learning, cognitive basis, cultural transmission

4. Variation or overlap?

Speaker A: competence, performance

Speaker B: competence, performance

Speaker C: competence, performance

Observations:

1. A contrastive comparison of performance should give us some insight into shared competence.

2. Speaker-level granularity is preferable to higher levels of segmentation (by gender, social class, etc.).

3. Instead of generalizing from the outset, we can reach general conclusions after observing the degree of variation or overlap in language production.

So how do we do this?

5. How corpora treat language data

any sentence is as good as any other sentence (the data is flat)

a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)

textual and compositional coherence cannot be taken into account

contextual information (who said what, when, where, why, how and to whom) is largely unavailable

6. Corpora and traditions of text production

corpora largely consist of well-established genres

the material they contain is produced by language professionals (journalists, writers, politicians)

texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness

compositional integrity (text structure) is very important but largely ignored

the text (= collection of words) takes precedence over the speaker

7. A different view of language data

language data and sources of variation...

... vs. speakers and their natural attributes

8. Blogs as data sources

9. A new kind of resource

estimated 100 million active bloggers in 2007

split evenly among genders

all age groups are represented

many bloggers provide personal information (age, gender, location)

use web feeds (Atom and RSS formats) to syndicate blog entries in XML (ideal for building modern corpora)

clean data with minimal interference
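As a rough illustration of how a feed can be turned into speaker-attributed corpus records, the sketch below reads an RSS 2.0 document with Python's standard `xml.etree.ElementTree`. The sample feed, the `posts_from_rss` helper and the record layout are invented for illustration; real feeds vary (Atom uses different element names) and would need more robust handling.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; a real one would be downloaded from the blog's feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Jane's blog</title>
    <item>
      <title>What I had for breakfast this morning</title>
      <author>Jane Smith</author>
      <pubDate>Mon, 01 Jan 2007 08:00:00 GMT</pubDate>
      <description>Toast and coffee, as usual.</description>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Return one record per <item>, keeping the speaker and date with the text."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "author": item.findtext("author", default=""),
            "title": item.findtext("title", default=""),
            "date": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


posts = posts_from_rss(SAMPLE_FEED)
for post in posts:
    print(post["author"], "|", post["date"], "|", post["text"])
```

Keeping author and date on every record is what makes the later segmentation by speaker possible without any manual annotation.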

10. Blogs as corpus data: Pros

very large bodies of data can be automatically assembled

data is naturally segmented by

speaker (+gender, +age, +location, ...)

length and time of writing

often include additional meta-data

produced by a large and growing variety of individuals using it for a wide spectrum of purposes

11. Blogs as corpus data: Cons

only one genre (?)

CMC as a singular mode (?)

sampling of speakers not representative (?)

12. Granularity and natural segmentation of data in a blog-based corpus

Modes of investigation:

1. Degree of internal variation among all posts by the same blogger

2. Variation between bloggers

3. Variation between groups (gender, age, etc.)

[Figure: schematic blog post titled "What I had for breakfast this morning", posted 01/01/2007 by Jane Smith, shown as one of a series of posts (post 1, post 2, post 3, post 4, ...)]

13. An example of a blog-based corpus

self-built corpus for my research project on corporate blogging

web feeds (RSS and Atom formats) used to retrieve, store and analyze language data

implemented TreeTagger for automated part-of-speech tagging

156 sources

25,769 posts

6.6 million words
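One way a tagged corpus of this kind might be summarized per speaker, sketched under assumptions: TreeTagger writes one token per line as tab-separated token, tag and lemma. The `tag_distribution` helper, the blogger names and the toy posts below are hypothetical and not taken from the project.

```python
from collections import Counter


def tag_distribution(tagged_text):
    """Return the relative frequency of each POS tag in TreeTagger-style output."""
    counts = Counter()
    for line in tagged_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        counts[fields[1]] += 1
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


# Hypothetical corpus: each blogger maps to a list of already-tagged posts.
corpus = {
    "blogger_a": ["I\tPP\tI\nlike\tVVP\tlike\ncoffee\tNN\tcoffee"],
    "blogger_b": ["the\tDT\tthe\nserver\tNN\tserver\nis\tVBZ\tbe\ndown\tRB\tdown"],
}

# Speaker-level granularity: one distribution per blogger, pooled over posts.
by_speaker = {
    name: tag_distribution("\n".join(posts)) for name, posts in corpus.items()
}
for name, dist in by_speaker.items():
    print(name, sorted(dist.items()))
```

Calling `tag_distribution` on a single post instead of the pooled text gives the per-post view needed for looking at variation within one blogger.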

14. Application

15. Individual variation: word class distribution

Heather Hamilton (Microsoft)

16. Individual variation: word class distribution

Irving Wladawsky-Berger (IBM)

17. Individual variation: pronoun use

Heather Hamilton (Microsoft)

rank  token  tag  count
1     the    DT   2787
2     I      PP   2723
3     to     TO   2088
4     a      DT   1440
5     of     IN   1324
6     and    CC   1254
7     It     PP   1097
8     you    PP   854
9     in     IN   818
10    that   IN   776
11    my     PP$  757
12    is     VBZ  739
13    For    IN   580
14    n't    RB   540
15    's     VBZ  530
16    on     IN   498
17    are    VBP  475
18    me     PP   450
19    with   IN   431
20    this   DT   424

Irving Wladawsky-Berger (IBM)

rank  token  tag  count
1     the    DT   2788
2     and    CC   1931
3     of     IN   1571
4     to     TO   1562
5     in     IN   1291
6     a      DT   1047
7     is     VBZ  695
8     I      PP   560
9     that   IN   439
10    For    IN   434
11    It     PP   417
12    with   IN   401
13    as     IN   390
14    are    VBP  380
15    we     PP   359
16    on     IN   331
17    our    PP$  259
18    have   VHP  253
19    that   WDT  248
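The two lists can be compared directly even though the slides give no total token counts. As a hedged sketch, the snippet below normalizes a pronoun's count by the count of "the" for the same blogger; using "the" as a yardstick is an assumption made here for illustration, not a method stated in the talk. On these figures, "I" is almost as frequent as "the" for Hamilton (roughly 0.98) but only about a fifth as frequent for Wladawsky-Berger (roughly 0.20).

```python
# Counts copied from the two frequency lists on slide 17.
hamilton = {"the": 2787, "I": 2723, "you": 854, "my": 757, "me": 450}
wladawsky_berger = {"the": 2788, "I": 560, "we": 359, "our": 259}


def relative_to_the(counts, token):
    """Frequency of `token` divided by the frequency of 'the' (an assumed yardstick)."""
    return counts[token] / counts["the"]


for name, counts in [("Hamilton", hamilton), ("Wladawsky-Berger", wladawsky_berger)]:
    print(name, round(relative_to_the(counts, "I"), 3))
```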

18. Individual variation: collocates preceding instances of "believe"

19. Gender, age and variation: Schler et al.

article: "Effects of Age and Gender on Blogging" (AAAI 2006)

all blogs accessible from blogger.com one day in August 2004

downloaded each blog that included author-provided indication of gender and at least 200 appearances of common English words

the full corpus thus obtained included over 71,000 blogs and over 300 million tokens

used to predict age and gender of bloggers

20. Gender, age and variation: common words males

token        male       female
linux        0.53±0.04  0.03±0.01
microsoft    0.63±0.05  0.08±0.01
gaming       0.25±0.02  0.04±0.00
server       0.76±0.05  0.13±0.01
software     0.99±0.05  0.17±0.02
gb           0.27±0.02  0.05±0.01
programming  0.36±0.02  0.08±0.01
google       0.90±0.04  0.19±0.02
data         0.62±0.03  0.14±0.01
graphics     0.27±0.02  0.06±0.01
india        0.62±0.04  0.15±0.01
nations      0.25±0.01  0.06±0.01
democracy    0.23±0.01  0.06±0.01
users        0.45±0.02  0.11±0.01
economic     0.26±0.01  0.07±0.01

21. Gender, age and variation: common words females

token      male       female
shopping   0.66±0.02  1.48±0.03
mom        2.07±0.05  4.69±0.08
cried      0.31±0.01  0.72±0.02
freaked    0.08±0.01  0.21±0.01
pink       0.33±0.02  0.85±0.03
cute       0.83±0.03  2.32±0.04
gosh       0.17±0.01  0.47±0.02
kisses     0.08±0.01  0.28±0.01
yummy      0.10±0.01  0.36±0.01
mommy      0.08±0.01  0.31±0.02
boyfriend  0.41±0.02  1.73±0.04
skirt      0.06±0.01  0.26±0.01
adorable   0.05±0.00  0.23±0.01
husband    0.28±0.01  1.38±0.04
hubby      0.01±0.00  0.30±0.02

22. Gender, age and variation: common words by age

token      teens      twenties   thirties
maths      1.05±0.06  0.03±0.00  0.02±0.01
homework   1.37±0.06  0.18±0.01  0.15±0.02
bored      3.84±0.27  1.11±0.14  0.47±0.04
sis        0.74±0.04  0.26±0.03  0.10±0.02
boring     3.69±0.10  1.02±0.04  0.63±0.05
awesome    2.92±0.08  1.28±0.04  0.57±0.04
mum        1.25±0.06  0.41±0.04  0.23±0.04
mad        2.16±0.07  0.80±0.03  0.53±0.04
dumb       0.89±0.04  0.45±0.03  0.22±0.03
semester   0.22±0.02  0.44±0.03  0.18±0.04
apartment  0.18±0.02  1.23±0.05  0.55±0.05
drunk      0.77±0.04  0.88±0.03  0.41±0.05
beer       0.32±0.02  1.15±0.05  0.70±0.05
student    0.65±0.04  0.98±0.05  0.61±0.06
album      0.64±0.05  0.84±0.06  0.56±0.08
college    1.51±0.07  1.92±0.07  1.31±0.09
someday    0.35±0.02  0.40±0.02  0.28±0.03
dating     0.31±0.02  0.52±0.03  0.37±0.04

23. Gender, age and variation: common words by age (ii)

token        teens      twenties   thirties
marriage     0.27±0.03  0.83±0.05  1.41±0.13
development  0.16±0.02  0.50±0.03  0.82±0.10
campaign     0.14±0.02  0.38±0.03  0.70±0.07
tax          0.14±0.02  0.38±0.03  0.72±0.11
local        0.38±0.02  1.18±0.04  1.85±0.10
democratic   0.13±0.02  0.29±0.02  0.59±0.05
son          0.51±0.03  0.92±0.05  2.37±0.16
systems      0.12±0.01  0.36±0.03  0.55±0.06
provide      0.15±0.01  0.54±0.03  0.69±0.05
workers      0.10±0.01  0.35±0.02  0.46±0.04
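Reading each cell in the Schler et al. tables as a mean frequency paired with its standard error (an assumption about the figures; the slides do not state the unit or the estimator), the size of a group difference can be gauged with a simple z-score for two independent estimates. This is a generic sketch, not the analysis used in the cited article, and `z_score` is a name introduced here.

```python
import math


def z_score(mean_a, se_a, mean_b, se_b):
    """Difference of two independent means in units of its combined standard error."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Figures taken from the slides: "linux" (male vs. female)
# and "marriage" (teens vs. thirties).
z_linux = z_score(0.53, 0.04, 0.03, 0.01)
z_marriage = z_score(0.27, 0.03, 1.41, 0.13)
print(round(z_linux, 2), round(z_marriage, 2))
```

A large absolute value suggests the gap between groups is well outside sampling noise, while the sign shows which group uses the word more.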

24. Observations

25. How an individuated approach to corpus linguistics can benefit the field

allows us to take individual stylistic preference into account as a source of variation when making generalizations (syntax, semantics, ...)

allows us to observe the specifics of individual production before making blanket statements about groups (based on gender, social standing, etc.)

inverts the idea of system and variation (how much overlap is there in language use? vs. how much variation can our theories account for?)
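The question of how much overlap there is can be made concrete in many ways. One minimal sketch, assuming a Jaccard index over each speaker's most frequent tokens (a measure chosen here for illustration, not one named in the talk), uses the two top-frequency lists from slide 17:

```python
def jaccard(a, b):
    """Share of distinct items common to both collections (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Most frequent tokens per blogger, copied from slide 17 (tags dropped).
hamilton_top = ["the", "I", "to", "a", "of", "and", "It", "you", "in", "that",
                "my", "is", "For", "n't", "'s", "on", "are", "me", "with", "this"]
wladawsky_berger_top = ["the", "and", "of", "to", "in", "a", "is", "I", "that",
                        "For", "It", "with", "as", "are", "we", "on", "our",
                        "have", "that"]

overlap = jaccard(hamilton_top, wladawsky_berger_top)
print(round(overlap, 2))
```

Because function words dominate any top-frequency list, a measure like this mostly reflects shared grammar; comparing lower-frequency vocabulary or tag sequences would say more about individual style.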

26. Research possibilities?

personal grammar?, personal semantics?

Construction Grammar (to what degree are constructions individual?)

variation over the lifetime

weighing genre, mode and individual variation

practical applications for forensic linguistics / language profiling

27. Thank you for listening!
