1. Quantitative Individuated Corpus Linguistics: A
Speaker-Centric Approach to Variation
Cornelius Puschmann, Universität Osnabrück, 5 June 2007
2. Preliminaries
3. A totalizing view of language
[Diagram labels: competence, performance, production, investigation,
social function, learning, cognitive basis, cultural transmission]
but... whose competence? whose performance?
4. Variation or overlap?
[Diagram: Speaker A, Speaker B and Speaker C, each with their own
competence and performance]
Observations:
  1. A contrastive comparison of performance should give us some
  insight into shared competence.
  2. Speaker-level granularity is preferable to higher levels of
  segmentation (by gender, social class etc.).
  3. Instead of generalizing from the outset, we can reach general
  conclusions after observing the degree of variation or overlap in
  language production.
So how do we do this?
5. How corpora treat language data
- any sentence is as good as any other sentence (the data is
flat)
- a corpus should be a well-balanced mix of different genres,
modes and sources (representativeness)
- textual and compositional coherence cannot be taken into
account
- contextual information (who said what, when, where, why, how
and to whom) is largely unavailable
6. Corpora and traditions of text production
- corpora largely consist of well-established genres
- the material they contain is produced by language professionals
(journalists, writers, politicians)
- texts are long and stylistically distant from everyday
communication in their level of formality, complexity and
elaborateness
- compositional integrity (text structure) is very important but
largely ignored
- the text (= collection of words) takes precedence over the
speaker
7. A different view of language data
- language data and sources of
- ... vs. speakers and their natural attributes
8. Blogs as data sources
9. A new kind of resource
- estimated 100 million active bloggers in 2007
- split evenly among genders
- all age groups are represented
- many bloggers provide personal information (age, gender,
location)
- use web feeds (Atom and RSS formats) to syndicate blog entries
in XML (ideal for building modern corpora)
- clean data with minimal interference
10. Blogs as corpus data: Pros
- very large bodies of data can be automatically assembled
- data is naturally segmented by
  - speaker (+gender, +age, +location, ...)
  - length and time of writing
- often include additional meta-data
- produced by a large and growing variety of individuals using it
for a wide spectrum of purposes
11. Blogs as corpus data: Cons
- CMC as a singular mode (?)
- sampling of speakers not representative (?)
12. Granularity and natural segmentation of data in a blog-based
corpus
Modes of investigation:
  1. Degree of internal variation among all posts by the same blogger
  2. Variation between bloggers
  3. Variation between groups (gender, age etc.)
[Diagram: a sample blog post titled "What I had for breakfast this
morning", posted 01/01/2007 by Jane Smith, shown as one item in a
series: post 1, post 2, post 3, post 4, ...]
13. An example of a blog-based corpus
- self-built corpus for my research project on corporate
blogging
- web feeds (RSS and Atom formats) used to retrieve, store and
analyze language data
- TreeTagger integrated for automated part-of-speech
tagging
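The feed-based retrieval step can be illustrated with a minimal sketch. It is not the pipeline used in the project: the sample feed, the function name `posts_by_speaker` and the choice of the `author`, `pubDate` and `description` elements are assumptions made for illustration, and real feeds (especially Atom) use other element names and namespaces.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified RSS 2.0 feed used only for this sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>What I had for breakfast this morning</title>
      <author>Jane Smith</author>
      <pubDate>Mon, 01 Jan 2007 08:00:00 GMT</pubDate>
      <description>Toast and coffee, as usual.</description>
    </item>
    <item>
      <title>Second post</title>
      <author>Jane Smith</author>
      <pubDate>Tue, 02 Jan 2007 08:00:00 GMT</pubDate>
      <description>Nothing much happened today.</description>
    </item>
  </channel>
</rss>"""


def posts_by_speaker(rss_text):
    """Group the items of an RSS 2.0 feed by their author element."""
    root = ET.fromstring(rss_text)
    corpus = defaultdict(list)
    for item in root.iter("item"):
        author = item.findtext("author", default="unknown")
        corpus[author].append({
            "title": item.findtext("title", default=""),
            "date": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return dict(corpus)


corpus = posts_by_speaker(SAMPLE_RSS)
for speaker, posts in corpus.items():
    print(speaker, len(posts))
```

Keying the corpus on the speaker rather than on the text keeps the natural segmentation of the data: every post stays attached to its author and its date.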
14. Application
15. Individual variation: word class distribution
- Heather Hamilton (Microsoft)
16. Individual variation: word class distribution
- Irving Wladawsky-Berger (IBM)
17. Individual variation: pronoun use
- Heather Hamilton (Microsoft)
- Irving Wladawsky-Berger (IBM)
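A per-speaker comparison of pronoun use could be computed roughly as follows. This is a sketch under stated assumptions, not the analysis behind the slides: the pronoun set, the regular-expression tokenizer, the function name `pronoun_profile` and the two sample texts are all invented for illustration.

```python
import re
from collections import Counter

# Small closed set of subject pronouns; an assumption of this sketch.
PRONOUNS = {"i", "we", "you", "he", "she", "it", "they"}


def pronoun_profile(text):
    """Return each pronoun's share of all tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in PRONOUNS)
    return {p: counts[p] / len(tokens) for p in sorted(PRONOUNS)}


# Hypothetical sample texts standing in for all posts by one blogger each.
posts = {
    "blogger_a": "I think we should ship it. I like it.",
    "blogger_b": "They said you can do it if they agree.",
}
profiles = {name: pronoun_profile(text) for name, text in posts.items()}

for name, profile in profiles.items():
    print(name, {p: round(share, 3) for p, share in profile.items()})
```

Relative rather than absolute frequencies are used so that bloggers who write different amounts remain comparable.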
18. Individual variation: collocates preceding instances
of "believe"
19. Gender, age and variation: Schler et al.
- article "Effects of Age and Gender on Blogging" (AAAI 2006)
- all blogs accessible from blogger.com one day in August
2004
- downloaded each blog that included author-provided indication
of gender and at least 200 appearances of common English words
- the full corpus thus obtained included over 71,000 blogs and
over 300 million tokens
- used to predict age and gender of bloggers
20. Gender, age and variation: common words males
- token males females
- microsoft 0.63±0.05 0.08±0.01
- software 0.99±0.05 0.17±0.02
- programming 0.36±0.02 0.08±0.01
- graphics 0.27±0.02 0.06±0.01
- democracy 0.23±0.01 0.06±0.01
- economic 0.26±0.01 0.07±0.01
21. Gender, age and variation: common words females
- token males females
- shopping 0.66±0.02 1.48±0.03
- freaked 0.08±0.01 0.21±0.01
- boyfriend 0.41±0.02 1.73±0.04
- adorable 0.05±0.00 0.23±0.01
- husband 0.28±0.01 1.38±0.04
22. Gender, age and variation: common words by age
- token teens twenties thirties
- maths 1.05±0.06 0.03±0.00 0.02±0.01
- homework 1.37±0.06 0.18±0.01 0.15±0.02
- bored 3.84±0.27 1.11±0.14 0.47±0.04
- sis 0.74±0.04 0.26±0.03 0.10±0.02
- boring 3.69±0.10 1.02±0.04 0.63±0.05
- awesome 2.92±0.08 1.28±0.04 0.57±0.04
- mum 1.25±0.06 0.41±0.04 0.23±0.04
- mad 2.16±0.07 0.80±0.03 0.53±0.04
- dumb 0.89±0.04 0.45±0.03 0.22±0.03
- semester 0.22±0.02 0.44±0.03 0.18±0.04
- apartment 0.18±0.02 1.23±0.05 0.55±0.05
- drunk 0.77±0.04 0.88±0.03 0.41±0.05
- beer 0.32±0.02 1.15±0.05 0.70±0.05
- student 0.65±0.04 0.98±0.05 0.61±0.06
- album 0.64±0.05 0.84±0.06 0.56±0.08
- college 1.51±0.07 1.92±0.07 1.31±0.09
- someday 0.35±0.02 0.40±0.02 0.28±0.03
- dating 0.31±0.02 0.52±0.03 0.37±0.04
23. Gender, age and variation: common words by age (ii)
- token teens twenties thirties
- marriage 0.27±0.03 0.83±0.05 1.41±0.13
- development 0.16±0.02 0.50±0.03 0.82±0.10
- campaign 0.14±0.02 0.38±0.03 0.70±0.07
- tax 0.14±0.02 0.38±0.03 0.72±0.11
- local 0.38±0.02 1.18±0.04 1.85±0.10
- democratic 0.13±0.02 0.29±0.02 0.59±0.05
- son 0.51±0.03 0.92±0.05 2.37±0.16
- systems 0.12±0.01 0.36±0.03 0.55±0.06
- provide 0.15±0.01 0.54±0.03 0.69±0.05
- workers 0.10±0.01 0.35±0.02 0.46±0.04
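Figures of this kind (a value plus a smaller uncertainty term per group) can be produced along the following lines. The slides do not state the unit or what the second number in each cell is, so this sketch assumes a rate per 10,000 tokens per blog and a standard error of the group mean; both are assumptions, as are the function names and the toy texts.

```python
import math
import re


def rate_per_10k(text, word):
    """Occurrences of `word` per 10,000 tokens of `text` (assumed unit)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 10000 * tokens.count(word) / len(tokens)


def mean_and_stderr(values):
    """Mean and standard error of the mean (sample variance, n - 1)."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical mini-corpus: one string per blog, grouped by age bracket.
groups = {
    "teens": ["homework is boring", "so much homework today", "went to the mall"],
    "thirties": ["the tax campaign", "my son has homework", "local news again"],
}

for group, blogs in groups.items():
    rates = [rate_per_10k(blog, "homework") for blog in blogs]
    mean, stderr = mean_and_stderr(rates)
    print(f"{group}: {mean:.2f} ± {stderr:.2f}")
```

Computing the rate per blog first and averaging afterwards keeps the individual speaker as the unit of observation, so one very prolific blogger does not dominate the group figure.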
24. Observations
25. How an individuated approach to corpus
linguistics can benefit the field
- allow us to take into account individual stylistic preference
as a source of variation when making generalizations (syntax,
semantics, ...)
- allow us to observe specificities of individual production
before making blanket statements about groups (based on
gender, social standing etc.)
- invert the idea of system and variation (how much overlap is
there in language use? vs. how much variation can our theories
account for?)
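One simple way to put a number on "how much overlap is there in language use?" is the Jaccard index of two speakers' vocabularies. This is only one of many possible measures and is not proposed in the talk; the tokenizer, the function names and the sample sentences are assumptions for illustration.

```python
import re


def vocabulary(text):
    """Set of lower-cased word types in `text`."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_overlap(text_a, text_b):
    """Shared word types divided by all word types used by either speaker."""
    a, b = vocabulary(text_a), vocabulary(text_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: no data, no overlap
    return len(a & b) / len(a | b)


# Hypothetical samples standing in for two speakers' collected posts.
speaker_a = "the cat sat"
speaker_b = "the dog sat"
print(jaccard_overlap(speaker_a, speaker_b))
```

A value near 1 means two speakers draw on almost the same word types, a value near 0 means almost none are shared. The index is sensitive to text length, so speakers should be compared on samples of similar size.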
26. Research possibilities?
- personal grammar?, personal semantics?
- Construction Grammar (to what degree are constructions
individual?)
- variation over the lifetime
- weighing genre, mode and individual variation
- practical applications for forensic linguistics / language
profiling
27. Thank you for listening!