Collostructional Analysis - Pure



Collostructional Analysis

Casper Woldersgaard

Aarhus University

Is ’translation’ a special variant of language?

• Xiao and Dai state that “[t]ranslational language as a type of mediated discourse has distinctive features that make it perceptibly different from comparable target language.” (Xiao and Dai, 2014, p. 12)

• Frawley (1984) argues that “translation is essentially a third code which arises out of the bilateral consideration of the matrix and target codes: it is, in a sense, a subcode of each of the codes involved.”

• How are translations different? In word length? In sentence length? In lexicon? Or perhaps in syntactic constructions?

Basic idea i

• My idea is to isolate and investigate certain linguistic phenomena in a general monolingual corpus and in one or more parallel corpora with translations.

• My main question is simple: Are the investigated linguistic phenomena stable across the different corpora? Or are the parallel corpora, in Xiao and Dai’s words, perceptibly different?

• Here, I am interested in describing the Danish verb-particle construction; specifically, the particle ned/down, which typically denotes the semantic component of Path. I wish to identify the (Motion) verb lexemes that the particle attracts and those that it repels.

Basic idea ii

• Danish, being a satellite-framed language (cf. Talmy), should show a preference for verb lexemes that describe the Manner of Motion, e.g. verbs such as gå, løbe, and hoppe.

• This is in contrast with a verb-framed language such as Spanish, which should be lexically biased towards verbs expressing the Path of the Motion, e.g. verbs such as bajar, subir, and atravesar, sometimes combined with an optional adjunct that describes the Manner of the Motion.

Collostructional Analysis as a method

• Collostructional analysis, and here collexeme analysis more precisely, is an extension of collocational analysis that takes into account grammatical structure. The method is specifically geared to investigating the interaction of lexemes and grammatical constructions (cf. Stefanowitsch and Gries, 2003).

• Collostructional analysis typically employs the $-\log_{10}(p_{\mathrm{FYE}})$ value of the Fisher-Yates exact test to measure the degree of a lexeme’s attraction to or repulsion from a slot in a given construction.

                     Verb: bøje   ¬bøje        Totals
Construction: ned    327          55871        56198
Construction: ¬ned   3509         107125505    107129014
Totals               3836         107181376    107185212
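The measure described above can be sketched in a few lines of Python. This is a minimal illustration and not the coll.analysis script used for the poster. It assumes a one-tailed test in the direction of attraction, and the function names (`log_choose`, `expected_frequency`, `collostruction_strength`) are hypothetical. The sum is carried out in log space because the p-values involved are far too small for ordinary floating-point numbers.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def expected_frequency(a, b, c, d):
    """Expected count of the verb in the construction under independence."""
    n = a + b + c + d
    return (a + b) * (a + c) / n


def collostruction_strength(a, b, c, d):
    """-log10 of a one-tailed Fisher-Yates exact p-value (attraction only).

    a: verb in construction        b: other verbs in construction
    c: verb outside construction   d: everything else
    Assumption: only the upper tail (observed count or more) is summed.
    """
    row1 = a + b          # construction total
    col1 = a + c          # verb total
    n = a + b + c + d     # corpus size
    log_denom = log_choose(n, row1)
    k_max = min(row1, col1)
    # Log hypergeometric probabilities for every count from a up to k_max.
    log_terms = [
        log_choose(col1, k) + log_choose(n - col1, row1 - k) - log_denom
        for k in range(a, k_max + 1)
    ]
    # Log-sum-exp keeps the tiny probabilities from underflowing to zero.
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return -log_p / math.log(10)


# Cell counts for bøje taken from the contingency table above.
bøje = (327, 55871, 3509, 107125505)
print(round(expected_frequency(*bøje), 3))
print(collostruction_strength(*bøje) > 300)
```

For the bøje table, the expected frequency is the construction total times the verb total divided by the corpus size, which agrees with the 2.011 reported in the results.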

Data

• KorpusDK: N > 107 million tokens, collected in the period 1990-2010.

Script Summary

• 00 load corpus;

• 01 retrieve examples (using grep with regular expressions);

• 02 tabulate frequency (c = count): c(ned ∩ verb_i), c(ned_total ∩ verbs_total) and sort examples according to verb lexeme;

• 03 obtain remaining frequencies to be used for calculations: c(verb_i, total) and c(N), and merge all frequencies in one file;

• 04 calculate statistics by using Stefan Gries’ script coll.analysis.mpfr.r.
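Steps 01 to 03 can be illustrated with a small Python sketch. The corpus format used here (one sentence per line, tokens written as form/lemma/TAG, verb tags starting with V) is an assumption made for the example and is not the actual KorpusDK format. The function names are likewise hypothetical, and treating only a verb immediately followed by ned as an instance of the construction is a deliberate simplification.

```python
import re
from collections import Counter

# Hypothetical token format: form/lemma/TAG, separated by whitespace.
VERB_RE = re.compile(r"(?<!\S)[^/\s]+/([^/\s]+)/V\S*")
# Simplification: a verb counts as "in the construction" only when the
# very next token is the particle ned.
VERB_NED_RE = re.compile(r"(?<!\S)[^/\s]+/([^/\s]+)/V\S*\s+ned/[^/\s]+/\S+")


def count_frequencies(lines):
    """Return (verb+ned counts, overall verb counts, corpus size in tokens)."""
    verb_ned = Counter()
    verb_total = Counter()
    n_tokens = 0
    for line in lines:
        n_tokens += len(line.split())
        verb_total.update(VERB_RE.findall(line))
        verb_ned.update(VERB_NED_RE.findall(line))
    return verb_ned, verb_total, n_tokens


def contingency_table(verb, verb_ned, verb_total, n_tokens):
    """Build the four cells (a, b, c, d) of the 2x2 table for one verb lemma."""
    a = verb_ned[verb]                    # verb in the ned construction
    b = sum(verb_ned.values()) - a        # other verbs in the construction
    c = verb_total[verb] - a              # verb outside the construction
    d = n_tokens - a - b - c              # everything else in the corpus
    return a, b, c, d


# Toy data, invented for illustration only.
SAMPLE = [
    "han/han/PRON gik/gå/V ned/ned/ADV ad/ad/PREP trappen/trappe/N",
    "hun/hun/PRON gik/gå/V hjem/hjem/ADV",
    "de/de/PRON løb/løbe/V ned/ned/ADV",
]
verb_ned, verb_total, n_tokens = count_frequencies(SAMPLE)
print(contingency_table("gå", verb_ned, verb_total, n_tokens))
```

The four cells returned for each verb correspond to the layout of the bøje table in the method section, with the corpus size in tokens serving as the grand total.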

Results

• Which Motion verbs, or collexemes, are most strongly attracted to the V-slot in the Danish verb-particle ned construction?

Collexeme   Gloss        Coll. strength   Exp. frequency   Obs. frequency
bøje        bend         >300             2.011            327
dumpe       drop/fall    >300             0.572            194
dykke       dive         >300             0.675            430
falde       fall         >300             21.668           1678
glide       slide/slip   >300             3.238            940
gå          walk         >300             117.928          4318
hoppe       jump         >300             1.77             3376
komme       come         >300             150.678          1801
kravle      crawl        >300             1.21             198
køre        drive        >300             21.358           454
lægge       put/lay      >300             30.51            776
løbe        run          >300             10.875           783

• The motion verb lexemes most strongly attracted to the ned-construction all express Manner, except the deictic verb lexeme komme. Verbs such as falde fuse Path and Manner into one lexeme.

Challenges and Improvements

• Noise: mixed distributions of syntactic and semantic constructions → developing a context-free grammar able to identify different syntactic structures.

• The current POS-tagging in the Danish corpus contains many tagging errors, especially for verbs, which is a problem for the test statistic.

• At the moment, no parallel corpora comparable to the general Danish corpus are available.

8th EST Congress 2016 at Aarhus University