Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf ·...

24
Periodization of constructional productivity in diachronic corpora Florent Perek University of Birmingham

Transcript of Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf ·...

Page 1: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Periodization of constructional productivity in diachronic corpora Florent Perek University of Birmingham

Page 2: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Overview

o  New method for diachronic studies

o  Aim: identify stages of language change in the productivity of grammatical constructions

o  Two case studies

Page 3: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Corpus-based studies of language change o  Typical corpus-based studies of language change

–  Extract tokens from a diachronic corpus

–  Classify these tokens according to some criterion

–  Compare the state of the language at different points in time

o  Assess stages of language change –  When was it relatively stable, and for how long?

–  When did it change (and how)?

Page 4: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Manual periodization

o  Normalised frequency of the hell-construction in the COHA “Verb the hell out of”, e.g., You scared the hell out of me!

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Decades

Nor

mal

ised

freq

uenc

y (p

er M

W)

1930 1940 1950 1960 1970 1980 1990 2000

Page 5: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Problems with manual periodization

o  Stages are not always clear to discern

o  Potentially subjective: what are the criteria for splitting periods? –  Different possible groupings for the same data

–  Comparison between studies

o  More complex when multiple variables are considered e.g., token frequency + type frequency

Page 6: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Periodization

o  This problem was first exposed by Gries & Hilpert (2008)

o  They introduce “variability-based neighbour clustering” (VNC) as a method for automatic periodization

o  Variant of agglomerative clustering algorithm –  Periods are grouped according to their similarity, following

some pre-defined criteria

–  Only time-adjacent periods can be merged

Gries, S., & Hilpert, M. (2008). The Identification of Stages in Diachronic Data: Variability-based Neighbor Clustering. Corpora, 3, 59–81.

Page 7: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

The VNC algorithm

o  Starting point: data partitioned into “natural” time periods (years, decades, etc.)

1.  Look at all pairs of adjacent periods (e.g., 1930s-1940s, 1940s-1950s, etc.). Measure their similarity according to some quantifiable property/ies.

2.  Merge the two periods that are the most similar.

3.  Calculate the properties of the merger as the mean values of its constituent periods.

o  Repeat until all periods have been merged.

Page 8: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

VNC: an example

o  VNC with one variable: frequency of the hell-construction

Token frequency (per million words)

Decades

Sum

med

dis

tanc

e (S

D)

1930 1940 1950 1960 1970 1980 1990 2000

0.0

1.0

2.0

0.0

1.0

2.0

3.0

Page 9: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

VNC

o  Two kinds of uses of VNC in the literature –  To partition data in a principled way for further analysis

–  To uncover patterns of change and/or compare changes

o  So far mostly based on quantitative variables –  Frequencies: tokens, types, hapax legomena, etc.

–  Frequency distributions of lexical items, collexeme analysis

o  Lines up with usage-based linguistics: grammatical representations are shaped by frequency

o  Frequency = good starting point for looking at the history of constructions, but do not tell the whole story

Page 10: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Productivity

o  Especially true for the study of productivity –  The property of a construction to attract new lexical fillers

–  E.g., verbs in the way-construction (Israel 1996)

They hacked their way through the jungle. (16th century)

She talked her way into the club. (19th century)

o  Type frequency often taken as an indicator of productivity –  Number of different items, but not how different they are

–  Need to consider the semantic diversity of the distribution

Israel, M. (1996). The way constructions grow. In A. Goldberg (ed.), Conceptual structure, discourse and language. Stanford, CA: CSLI Publications, 217-230.

Page 11: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Operationalizing word meaning

o  Distributional semantics (Lenci 2008) –  “You shall know a word by the company it keeps.”

(Firth 1957: 11)

–  Words that occur in similar contexts tend to have related meanings (Miller & Charles 1991)

o  Captures the meaning of words through their distribution in a large corpus

o  Proposal: use distributional semantics to build representations of the semantic range of a construction

Firth, J.R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society.

Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Rivista di Linguistica, 20(1), 1–31. Miller, G. & W. Charles (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1-28.

Page 12: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

“Bag of words” approach

o  Distributional data extracted from COHA (Davies 2010); 400 MW from 1810 to 2009

o  Collocates of all verbs in a 2-word window

o  Restricted to the 10,000 most frequent nouns, verbs, adjectives and adverbs

the upper crust; cut a lip in it ; and ornament growing season. “I spend a lot of my garden time and disdainful port; looked intrepidly and indignantly mocking me? What! I marry a woman sixty-four years old that they no longer fight against it ; it is embalmed

Davies, M. (2010). The Corpus of Historical American English: 400 million words, 1810-2009. Available online at http://corpus.byu.edu/coha/

Page 13: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Distributional semantic model

o  Co-occurrence frequencies turned into PPMI scores

o  10,000 columns of the co-occurrence matrix reduced to 300 distributional-semantic features with SVD

o  In the distributional semantic model, each verb corresponds to an array of 300 values, i.e., a vector

o  Semantically similar words tend to have similar values in the same features

(column1) (column2) (column3) (column300)

find 15.59443 -2.022215 0.561186 ... -0.5778517 carry 21.82777 4.714768 -11.974389 ... -0.5226300 answer 11.66246 2.008967 8.810539 ... -0.2389049 push 22.09577 13.130336 -6.027978 ... 0.8539545 ... ... ... ... ... ...

Page 14: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Period vectors

o  For each period, extract the semantic vector of each verb in the distribution of the construction

o  Add all vectors and divide by the number of verbs: this is the period vector

o  “Semantic average” of the distribution; reflects semantic properties of the verbs attested in the period

(column1) (column2) (column3) (column300)

make 14.09814 -4.231832 -1.844898 ... 0.06963598 find 15.59443 -2.022215 0.561186 ... -0.5778517 push 22.09577 13.130336 -6.027978 ... 0.8539545 Sum 51.78834 6.876289 -7.311691 ... 0.3457388 /3 17.26278 2.292096 -2.43723 ... 0.1152463 period vector

Page 15: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Distributional period clustering

o  The VNC algorithm is run on the period vectors

o  Similarity is measured by cosines between vectors

o  The output dendrogram shows the semantic history of the construction: –  Early mergers correspond to periods of semantic stability.

–  Late mergers of large clusters indicate semantic shifts.

Page 16: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Two case studies

o  Both using COHA, focusing on verbs in two constructions

o  The hell-construction V the hell out of NP You scared the hell out of me!

I enjoyed the hell out of that show.

They beat the hell out of him.

o  The way-construction V one’s way PP They hacked their way through the jungle.

She talked her way into the club.

Restricted to the “path-creation” interpretation: the verb describes an action that enables motion

(vs. manner: They trudged their way through the snow)

Page 17: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

The hell-construction

Gradual expansion and “plateauing” rather than brutal shifts

VNC dendrogram

Decades

Sum

med

cos

ine

dist

ance

1930

1940

1950

1960

1970

1980

1990

2000

0.0

0.4

0.8

1.2

Scree plot

ClustersC

osin

e di

stan

ce

1 2 3 4 5 6 7

0.10

0.15

0.20

0.25

0.23

0.2

0.150.13

0.160.14 0.14

Token frequency (per million words)

Decades

Sum

med

dis

tanc

e (S

D)

1930 1950 1970 1990

0.0

1.0

2.0

● ●

●●

0.0

1.0

2.0

3.0

Scree plot

Clusters

Dis

tanc

e (S

D)

1 2 3 4 5 6 7 8

0.0

0.2

0.4

0.6

0.8

0.9

0.40.4 0.4

0.2

0.1 0.1

Type frequency

Decades

Sum

med

dis

tanc

e (S

D)

1930 1950 1970 1990

05

1020

30

●●

010

2030

40

Scree plot

Clusters

Dis

tanc

e (S

D)

1 2 3 4 5 6 7 8

02

46

810

10.6

64.9

4.2

2.6

1 0.7

Hapax legomena

Decades

Sum

med

dis

tanc

e (S

D)

1930 1950 1970 1990

05

1015

20●

● ●

05

1015

2025

30

Scree plot

Clusters

Dis

tanc

e (S

D)

1 2 3 4 5 6 7 8

02

46

7.5

4

2.92.1

1.7 1.7

0

Page 18: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

The hell-construction

o  The shape of the dendrogram reflects gradual expansion rather than brutal shifts (cf. Perek 2014, 2016)

o  Construction centered on the same semantic classes, with new members joining the periphery

o  Vs. two-way split obtained with quantitative measures

o  Questions the practice of using quantitative data for the initial partitioning

Perek, F. (2014). Vector spaces for historical linguistics: Using distributional semantics to study syntactic productivity in diachrony. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland USA, June 23-25 2014 (pp. 309-314).

Perek, F. (2016). Using distributional semantics to study syntactic productivity in diachrony: A case study. Linguistics, 54(1), 149–188.

Page 19: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

The way-construction VNC dendrogram

Decades

Sum

med

cos

ine

dist

ance

1830

1840

1850

1860

1870

1880

1890

1900

1910

1920

1930

1940

1950

1960

1970

1980

1990

2000

0.0

0.5

1.0

1.5

Scree plot

Clusters

Cos

ine

dist

ance

1 3 5 7 9 11 13 15 17

0.06

0.07

0.08

0.09

0.10

0.11

0.094

0.104

0.0980.1 0.1

0.102

0.089

0.0830.083

0.072

0.0780.076

0.071

0.0790.0770.074

0.065

1830s – 1870s Concrete, physical actions, literal creation of a path: hew, shape, explore, carve, track, enforce, shoulder, etc.

1890s – 2000s More abstract: communication, social interaction, etc.: joke, bellow, chatter, snarl, spit, laugh, talk, bully, etc.

1880s: transition period More abstract verbs than the previous period: buy, smell, stammer, beg, think, pay, etc. More concrete verbs than the next period: bore, pierce, feel, wear, melt, trace, burn, etc.

Page 20: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

The way-construction

o  Change from mostly concrete to more abstract verbs (in line with Israel 1996, Perek aop)

o  How does distributional semantics compare to collostructional analysis for periodization? –  Which verbs occur more distinctively frequently in each

decade than in the others? (Hilpert 2006)

–  Each verb receives an association score in each decade

–  The distribution of collexemes can be used as input for VNC (Hilpert 2012): change in lexico-grammatical associations

Hilpert, M. 2006. Distinctive collexeme analysis and diachrony. Corpus Linguistics and Linguistic Theory 2(2). 243–57. Hilpert, M. 2012. Diachronic collostructional analysis. How to use it, and how to deal with confounding factors. In K. Allan &

J. Robynson (eds.), Current Methods in Historical Semantics, 133–160. Berlin: Mouton de Gruyter. Perek, F. (ahead-of-print). Recent change in the productivity and schematicity of the way-construction: a distributional

semantic analysis. Corpus Linguistics and Linguistic Theory.

Page 21: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

VNC with collostructional analysis VNC dendrogram

Decades

Cumu

lated

cosin

e dist

ance

s

1830

1840

1850

1860

1870

1880

1890

1900

1910

1920

1930

1940

1950

1960

1970

1980

1990

2000

02

46

8

Scree plot

Clusters

Cosin

e dist

ance

1 2 3 4 5 6 7 8 9 11 13 15 17

0.40.6

0.8

0.966

0.6710.6210.612

0.5860.533

0.470.497

0.470.4610.4670.4650.4180.413

0.356

0.2410.241

Physical change of state: cut, hew, tear, cleave, break, pierce, burst, etc. Semantically neutral verbs: take, find, win, make

talk, buy, negotiate, lie (1930s-2000s)

Haphazard list of more abstract verbs:

earn, sing, advertise, brew, declaim, experiment (1910s-1920s)

work, pick (1930s-2000s)

Page 22: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

VNC with collostructional analysis

o  Some evidence of a shift from concrete to abstract verbs

o  But it is attested later than in the distributional VNC

o  Semantic classes are less clearly identifiable

o  With collostructional analysis, the detection of changes is highly dependent on token frequency –  Frequency associations are not always semantically relevant

–  “Real” change is only exemplified by high-frequency types

–  The timing of these changes is delayed, until sufficient frequency is reached

Page 23: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Conclusion

o  Distributional period clustering captures semantic changes in the productivity of constructions

o  Represents a step forward from regular VNC

o  Results confirm previous studies

o  Two advantages –  Semantic changes are inferred mathematically rather than

assessed impressionistically

–  Changes can be dated more precisely

… paper (with Martin Hilpert) under review, downloadable at www.fperek.net

Page 24: Periodization of constructional productivity in diachronic ...fperek.net/pdfs/slides-icame38.pdf · Manual periodization o Normalised frequency of the hell-construction in the COHA

Thanks for your attention! [email protected] www.fperek.net