The stumbling blocks in corpus-based research of interlanguage phraseology

Przemysław Kaszubski

School of English

Adam Mickiewicz University

Poznań, Poland

PLM 2001, Bukowy Dworek, 27 April 2001

Corpus linguistics: central problems

Representativeness (corpus design, compilation criteria, etc.)

Annotation (& disambiguation) of data

Some basic questions:

• How much to annotate? (whole corpus? 1 part of speech, 1 lemma etc.)

• How deep an analysis?

• How large is the corpus?

• What and whom are the results for?

• Corpus-based or corpus-driven procedures?

Methodological premises of my research (1)

EFL learners’ overuse of high-frequency words: what does it mean?

• Intensive collocability of core lexical items

• Multi-word extensions (compounds, coinages, idioms, expressions, phrasals)

Confrontation: available corpus-driven extraction methods vs. pedagogical usefulness: L1 perspective (the role of transfer)

Methodological premises (2)

multi-corpus scheme with Polish advanced EFL learner data as hub data

variables: a) genre / text-type; b) L1; c) proficiency level; d) age / maturity level

Lemma-based approach (as opposed to wordform- or family-oriented approaches)

The corpus base: full specification

ENGLISH CORPORA

non-native English / native English

‘apprentice’ corpora / ‘expert’ corpora

Levels: 1. Intermediate; 2. Upper-intermediate; 3. Advanced; 4. College; 5. Professional

Corpus          Content                                         Tokens
PLLC            Polish intermediate EFL                         92,712
SPAN            Spanish (upper-)intermediate EFL                94,965
FREN            Belgian-French advanced EFL                     101,442
IFA-P (ICLE)    Polish advanced                                 107,990
LOCN (ARG)      British and American college learner English    106,255
MCONC           British academic writing                        97,914
LOB&BROWN       British and American quality                    94,421

POLISH CORPORA

Corpus      Type                   Level                    Content                                            Tokens
POL-STUD    ‘apprentice’ corpus    4. College level         Polish college compositions                        103,382
POL-EXP     ‘expert’ corpus        5. Professional level    Polish academic papers + quality-press articles    101,348

The ‘extended’ tripartite idiomaticity model: the criteria

lexical fixedness

syntactic fixedness and / or anomaly

semantic opacity

lexicalisation / institutionalisation / specialisation / conventionality = frequency + distribution

implementation of fourth criterion via external sources: BBI2 & LDOCE3

The ‘extended’ tripartite idiomaticity model: the levels (1)

frozen expressions:

• phrasals: ‘TAKE after sb’; ‘TAKE to (doing) sth’; ‘be taken aback’; ‘GIVE (sth) up’; ‘GIVE sb/o.s. away’

• MWUs: ‘GIVE rise to’; ‘GIVE way to sb/sth’; ‘GIVE sb a hand’; ‘TAKE care’; ‘TAKE place’; ‘TAKE for granted’; ‘TAKE advantage’; ‘TAKE root’; ‘TAKE effect’

• lexicalised compounds: ‘God-given’; ‘risk-taking’; ‘leave-taking’

restricted uses (1):

• restricted collocations & delexical uses: ‘TAKE drugs/ steps/ the form of/ advice/ decision/ initiative/ a bath/ a breath/ sleep’; ‘GIVE an account/ a lesson/ explanation/ sb/sth a name/ a concert/ permission/ a speech/ sb a warm welcome’

The ‘extended’ tripartite idiomaticity model: the levels (2)

restricted uses (2):

• special senses or uses: ‘GIVE results/ details/ data’; ‘TAKE

• discourse formulae: ‘let's take X/an example of X/ X as an example etc.’

free combinations:

• regular (incl. transparent phrasals): ‘TAKE <sb> away/ to <a place> etc.’; ‘GIVE <sth> back’; ‘GIVE money’

• curious interlanguage usage: ‘?GIVE generalisation/ stabilisation to <sth>’; ‘?TAKE help/ behaviour’

The research hypotheses

negative correlation between proficiency level and frequencies of non-idiomatic uses

positive correlation between proficiency level and frequencies of idiomatic expressions, except for EFL learners’ ‘favourite expressions’

traceability of (at least) some ‘favourite expressions’ to L1

Automatic extraction precision & recall problems

POS (part-of-speech) taggers’ error margin

Word-sense disambiguation and / or syntactic parsing

Collocation statistics

Nature of learner language

Inter-corpus comparability

Problem 1: error margin of POS taggers

Standard error margin: 5%

Affected: extraction of lemmas meeting POS criteria

Precision (noise in data): non-verbs tagged as verbs

• Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)

• agressive VB(lex,intr,infin) ?agressive? ...(3)

• well-behaved VB(lex,montr,edp) ?well-behave? ...(2)

Recall (data ignored): verbs tagged as non-verbs / lexical verbs as auxiliaries:

• ... who in sharing their lives with a retarded sibiling [sic!] and taking <ADJ(ge,pos,ingp)> {taking} part in every-day care problems, may decide never to have ...
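The two error types above can be expressed as precision and recall over sets of tokens. The sketch below is illustrative only: the token ids are hypothetical, and `precision_recall` is a helper name introduced here, not part of any tagger.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of an extracted set against a manually checked set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision drops when non-verbs are tagged as verbs (noise in data).
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall drops when verbs are tagged as non-verbs (data ignored).
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical token ids: what the tagger labelled as a verb
# vs. what a human checker accepts as a verb.
tagged_as_verb = {1, 2, 3, 4, 5}
true_verbs = {2, 3, 4, 5, 6, 7}

precision, recall = precision_recall(tagged_as_verb, true_verbs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Token 1 stands for a case like ‘agressive’ tagged as a verb (a precision error); tokens 6 and 7 stand for cases like ‘taking’ tagged as an adjective (recall errors).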

Tracking & rectifying the POS errors

tagger built-in tag editor (TOSCA-ICLE): on-line targeting of precision & recall errors (UNTAGged and doubtful cases)

• Problem: insufficient query language: word OR lemma OR tag pattern

no tagger built-in editor: concordancer or editor needed to test for precision and recall

• Problem: either comprehensive or intuitive check

remaining difficulty: tagsets vs. research assumptions (gerunds & participles tagged as non-verbs)

Problem 2: semantic disambiguation and associations

sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998:4)

automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project

University of Lancaster disambiguation tool

Tools unavailable or not at implementable stage

Problem 3: corpus-driven collocation extraction (1)

lemmas or wordforms?

collocation vs co-occurrence (vs adjacency)

word clusters

• precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)

• recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs ‘TAKE good care of’; ‘the chance which were not eager to take’)

• stop-listing not quite possible with high-frequency items (BUT: Ted Pedersen’s ‘Bigram Statistics Package’: http://www.d.umn.edu/~tpederse/code.html)
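The window problem can be illustrated with a minimal co-occurrence counter. This is a sketch under simple assumptions (whitespace tokenisation, exact string match on the node); `window_collocates` is a hypothetical helper, not a function of any of the tools named here.

```python
from collections import Counter


def window_collocates(tokens, node, span=4):
    """Count tokens occurring within `span` positions left and right of each node hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


# A non-contiguous collocation that still fits inside a 4:4 window.
near = window_collocates("TAKE good care of".split(), "TAKE")

# The learner example from the slide: 'chance' is six tokens away from 'take'.
far = window_collocates("the chance which were not eager to take".split(), "take")

print(near["care"], far["chance"])
```

In the second example the noun lies outside the 4:4 span, so a purely window-based count never pairs it with the verb. Widening `span` recovers it, but at the price of more noise.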

co-occurrence statistics (WordSmith)

• precision: not all co-occurrence patterns testify to meaningful collocations

• recall: collocations may extend beyond typical 4:4 word spans

• MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):

GIVE 172 2458

Collocate      MI
birth          4.65
vote           4.24
opening        4.24
antibiotic     4.01
vaccination    3.91
ingenuity      3.91
isolate        3.43
habit          3.43
happiness      3.24
away           2.91

• WordSmith: only 10 collocate output

• Oliver Mason’s QWICK: MI with weighting factors for frequent words and unlimited display of collocates
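Why MI favours rare, ‘idiosyncratic’ collocates can be seen from the commonly cited formulation MI = log2(f(node, collocate) × N / (f(node) × f(collocate))). The sketch below uses that textbook formula with invented frequencies; it is an assumption that this matches what WordSmith or QWICK compute internally, and no span correction is applied.

```python
import math


def mutual_information(f_node, f_collocate, f_joint, corpus_size):
    """Pointwise MI (log base 2) from raw frequencies; no window-size correction."""
    return math.log2((f_joint * corpus_size) / (f_node * f_collocate))


CORPUS_SIZE = 100_000  # hypothetical corpus size in tokens
F_NODE = 100           # hypothetical frequency of the node lemma

# A rare collocate that co-occurs with the node 5 times out of its 10 occurrences.
mi_rare = mutual_information(F_NODE, 10, 5, CORPUS_SIZE)

# A frequent collocate that co-occurs 20 times out of 5,000 occurrences.
mi_frequent = mutual_information(F_NODE, 5_000, 20, CORPUS_SIZE)

print(round(mi_rare, 2), round(mi_frequent, 2))
```

The rare collocate scores far higher although it co-occurs with the node only a quarter as often, which is the behaviour the slide attributes to MI.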

co-occurrence statistics (z-score: TACT)

z-score & t-score: better suited for frequent collocates but also mutual and imprecise on their own: z-score ordered collocate list for BE:

• there; it; that; able; not; which; should; considered; by; likely; to; said; very; enough; why; important; concerned; what; always; worth; if; proved; afraid; used;

Mason’s QWICK: multi-test package: incl. also log-likelihood; modified log likelihood; expected/observed ratio

Remaining problems

• stop-listing not quite possible with high-frequency item tests

• collocations outside a heuristic window

• lexical associations between collocates (synsets)

• semi-manual grouping of data essential (limitations)
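For comparison with MI, the z-score and t-score can be sketched in their simplified textbook forms, where O is the observed joint frequency and E the frequency expected by chance: t = (O − E) / √O and z = (O − E) / √E. The figures are invented, and it is an assumption that TACT or QWICK use exactly these variants.

```python
import math


def expected_joint(f_node, f_collocate, corpus_size, span=1):
    """Joint frequency expected by chance; `span` scales for window size."""
    return f_node * f_collocate * span / corpus_size


def t_score(observed, expected):
    return (observed - expected) / math.sqrt(observed)


def z_score(observed, expected):
    return (observed - expected) / math.sqrt(expected)


CORPUS_SIZE = 100_000  # hypothetical
F_NODE = 100           # hypothetical

# Frequent collocate: 5,000 occurrences, 20 of them near the node.
e_frequent = expected_joint(F_NODE, 5_000, CORPUS_SIZE)
t_frequent, z_frequent = t_score(20, e_frequent), z_score(20, e_frequent)

# Rare collocate: 10 occurrences, 5 of them near the node.
e_rare = expected_joint(F_NODE, 10, CORPUS_SIZE)
t_rare, z_rare = t_score(5, e_rare), z_score(5, e_rare)

print(round(t_frequent, 2), round(t_rare, 2), round(z_frequent, 2), round(z_rare, 2))
```

With these numbers the t-score ranks the frequent collocate higher, while the z-score still inflates the rare one, so neither list can be read without manual grouping.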

Problem 4: the nature of learner data

Difference in proficiency levels essential in cross-corpus comparisons

Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand

Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’

Unrecognised words vs. tagger default option tag
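Inconsistent hyphenation can be flagged before tagging with a simple heuristic: group solid, hyphenated and open spellings under one key. The following is a rough sketch, not a description of any existing tool; `hyphenation_variants` is a hypothetical helper, and its open-compound check will also raise false alarms (e.g. ‘a long’ next to ‘along’), so the output still needs a manual look.

```python
import re
from collections import defaultdict


def hyphenation_variants(text):
    """Group solid, hyphenated and two-word spellings that share one joined form."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower())
    groups = defaultdict(set)
    # Solid and hyphenated spellings share a key once hyphens are removed.
    for token in tokens:
        groups[token.replace("-", "")].add(token)
    # Open spellings: two adjacent tokens whose joined form is an attested key.
    for first, second in zip(tokens, tokens[1:]):
        key = first + second
        if key in groups:
            groups[key].add(f"{first} {second}")
    # Keep only keys with more than one attested spelling.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


sample = "a money making scheme, a moneymaking scheme and a money-making scheme"
variants = hyphenation_variants(sample)
print(variants)
```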

Problem 5: cross-corpus comparability

genre homogeneity

topic-skewed distribution: heuristic method of isolation: sort by standard deviation

TAKE <sb/sth>    LOB&BR    MCONC    LOCN    IFA-P    FREN    SPAN    PLLC    SD
drugs*           0         0        2       43       6       0       0       15.90
steps            4         4        1       13       2       7       0       4.43
overdose         0         6        0       0        0       0       0       2.27
exercise         0         6        0       1        0       0       1       2.19
life**           0         0        6       1        1       0       0       2.19

* incl. marijuana, opium, chemical substances
** = kill
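The sort-by-standard-deviation heuristic is easy to reproduce. The sketch below uses the frequencies from the table and assumes the SD column is a sample standard deviation (n − 1 in the denominator), which gives the same figures after rounding.

```python
from statistics import stdev

# Column order: LOB&BR, MCONC, LOCN, IFA-P, FREN, SPAN, PLLC
counts = {
    "drugs": [0, 0, 2, 43, 6, 0, 0],
    "steps": [4, 4, 1, 13, 2, 7, 0],
    "overdose": [0, 6, 0, 0, 0, 0, 0],
    "exercise": [0, 6, 0, 1, 0, 0, 1],
    "life": [0, 0, 6, 1, 1, 0, 0],
}

# Collocates with the most uneven spread across corpora come first;
# these are the candidates for topic-skewed distribution.
ranked = sorted(counts, key=lambda collocate: stdev(counts[collocate]), reverse=True)

for collocate in ranked:
    print(f"{collocate:<10}{stdev(counts[collocate]):6.2f}")
```

A high value only flags a candidate: ‘TAKE drugs’ tops the list because of one corpus, and it still takes a look at the texts to confirm that essay topic, not proficiency, explains the peak.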

Summary

Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity

With small corpora, mere automated methods of processing and analysis display insufficient precision and recall

Loss of data may prove too costly when pedagogical conclusions are sought

Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)

Dedicated new type of hybrid concordancer-editor needed

SOLUTION: dedicated concordancer-annotator

Feature 1: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors

Feature 2: allow adding custom information to concordance lines (specialised annotation / grouping of data)

Feature 3: allow saving concordances as text BACK into the corpus (pasting)

Feature 4: collocation annotation / statistics enhanced by links with phraseological dictionary

Feature 5: ???
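How Features 1 to 3 might fit together can be shown with a toy data structure. This is only a rough sketch of the idea, not a design of the proposed tool: `ConcLine`, `concordance` and `write_back` are hypothetical names, and the corpus is reduced to a flat list of tokens without tags or lemmas.

```python
from dataclasses import dataclass


@dataclass
class ConcLine:
    index: int       # position of the node token in the corpus
    left: list       # left context
    node: str        # node token; editable (Feature 1)
    right: list      # right context
    note: str = ""   # custom annotation (Feature 2)


def concordance(tokens, node, span=4):
    """Build concordance lines for every case-insensitive match of `node`."""
    return [
        ConcLine(i, tokens[max(0, i - span):i], token, tokens[i + 1:i + 1 + span])
        for i, token in enumerate(tokens)
        if token.lower() == node.lower()
    ]


def write_back(tokens, lines):
    """Return a copy of the corpus with edited node tokens pasted back (Feature 3)."""
    updated = list(tokens)
    for line in lines:
        updated[line.index] = line.node
    return updated


tokens = "an agressive man may take drugs".split()
hits = concordance(tokens, "agressive")
hits[0].node = "aggressive"           # Feature 1: edit the text of a line
hits[0].note = "spelling corrected"   # Feature 2: add custom information
corrected = write_back(tokens, hits)
print(" ".join(corrected))
```

Keeping the corpus position on each line is what makes the write-back possible; a real tool would also have to carry tags and lemmas along with the text.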

This slide show will shortly be available from:

http://main.amu.edu.pl/~przemka/rsearch.html
