The stumbling blocks in corpus-based research of interlanguage phraseology
Przemysław Kaszubski
School of English
Adam Mickiewicz University
Poznań, Poland
PLM 2001, Bukowy Dworek, 27 April 2001
Corpus linguistics: central problems
Representativeness (corpus design, compilation criteria, etc.)
Annotation (& disambiguation) of data
Some basic questions:
• How much to annotate? (whole corpus? 1 part of speech, 1 lemma etc.)
• How deep an analysis?
• How large is the corpus?
• What and whom are the results for?
• Corpus-based or corpus-driven procedures?
Methodological premises of my research (1)
EFL learners’ overuse of high-frequency words: what does it mean?
• Intensive collocability of core lexical items
• Multi-word extensions (compounds, coinages, idioms, expressions, phrasals)
Confrontation:
• Available corpus-driven extraction methods vs. pedagogical usefulness: L1-perspective (the role of transfer)
Methodological premises (2)
Multi-corpus scheme with Polish advanced EFL learner data as hub data
Variables: a) genre / text-type; b) L1; c) proficiency level; d) age / maturity level
Lemma-based approach (as opposed to wordform- or family-oriented approaches)
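A minimal sketch of what counting by lemma rather than by wordform amounts to. The lookup table and the sample sentence are hypothetical and hand-made for illustration; they are not the lemmatiser or the data used in this research.

```python
from collections import Counter

# Hypothetical, hand-made lookup from wordform to lemma (illustration only).
LEMMA_OF = {
    "take": "TAKE", "takes": "TAKE", "took": "TAKE",
    "taken": "TAKE", "taking": "TAKE",
    "give": "GIVE", "gives": "GIVE", "gave": "GIVE",
    "given": "GIVE", "giving": "GIVE",
}

def lemma_counts(tokens):
    """Count tokens by lemma; wordforms missing from the lookup are skipped."""
    return Counter(LEMMA_OF[t.lower()] for t in tokens if t.lower() in LEMMA_OF)

def wordform_counts(tokens):
    """Count tokens by lower-cased wordform, for comparison."""
    return Counter(t.lower() for t in tokens)

# Made-up sample sentence.
tokens = "She took care and he takes care while they gave way".split()

print(wordform_counts(tokens))  # 'took' and 'takes' are counted separately
print(lemma_counts(tokens))     # both forms are pooled under TAKE
```

Under the wordform view each inflected form has its own, lower frequency; under the lemma view the forms are pooled, which is what makes cross-corpus frequency comparison of TAKE and GIVE possible.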
The corpus base: full specification
ENGLISH CORPORA
Dimensions: non-native English vs. native English; ‘apprentice’ corpora vs. ‘expert’ corpora
Levels: 1. Intermediate; 2. Upper-intermediate; 3. Advanced; 4. College; 5. Professional
Corpus        Content                                       Tokens
PLLC          Polish intermediate EFL                       92,712
SPAN          Spanish (upper-)intermediate EFL              94,965
FREN          Belgian-French advanced EFL                   101,442
IFA-P (ICLE)  Polish advanced EFL                           107,990
LOCN (ARG)    British and American college learner English  106,255
MCONC         British academic writing                      97,914
LOB&BROWN     British and American quality press            94,421
POLISH CORPORA
Corpus    Type                 Level                  Content                                           Tokens
POL-STUD  ‘apprentice’ corpus  4. College level       Polish college compositions                       103,382
POL-EXP   ‘expert’ corpus      5. Professional level  Polish academic papers + quality-press articles   101,348
The ‘extended’ tripartite idiomaticity model: the criteria
lexical fixedness
syntactic fixedness and / or anomaly
semantic opacity
lexicalisation / institutionalisation / specialisation / conventionality = frequency + distribution
implementation of fourth criterion via external sources: BBI2 & LDOCE3
The ‘extended’ tripartite idiomaticity model: the levels (1)
frozen expressions:
• phrasals: ‘TAKE after sb’; ‘TAKE to (doing) sth’; ‘be taken aback’; ‘GIVE (sth) up’; ‘GIVE sb/o.s. away’
• MWUs: ‘GIVE rise to’; ‘GIVE way to sb/sth’; ‘GIVE sb a hand’; ‘TAKE care’; ‘TAKE place’; ‘TAKE for granted’; ‘TAKE advantage’; ‘TAKE root’; ‘TAKE effect’
• lexicalised compounds: ‘God-given’; ‘risk-taking’; ‘leave-taking’
restricted uses (1):
• restricted collocations & delexical uses: ‘TAKE drugs/ steps/ the form of/ advice/ decision/ initiative/ a bath/ a breath/ sleep’; ‘GIVE an account/ a lesson/ explanation/ sb/sth a name/ a concert/ permission/ a speech/ sb a warm welcome’
The ‘extended’ tripartite idiomaticity model: the levels (2)
restricted uses (2):
• special senses or uses: ‘GIVE results/ details/ data’; ‘TAKE <X minutes, year(s), months, hours, generations, life, etc.>’; ‘TAKE <sth> to mean <sth>’
• discourse formulae: ‘let's take X/an example of X/ X as an example etc.’
free combinations:
• regular (incl. transparent phrasals): ‘TAKE <sb> away/ to <a place> etc.’; ‘GIVE <sth> back’; ‘GIVE money’
• curious interlanguage usage: ‘?GIVE generalisation/ stabilisation to <sth>’; ‘?TAKE help/ behaviour’
The research hypotheses
negative correlation between proficiency level and frequencies of non-idiomatic uses
positive correlation between proficiency level and frequencies of idiomatic expressions except EFL learners’ ‘favourite expressions’
traceability of (at least) some ‘favourite expressions’ to L1
Automatic extraction precision & recall problems
POS (part-of-speech) taggers’ error margin
Word-sense disambiguation and / or syntactic parsing
Collocation statistics
Nature of learner language
Inter-corpus comparability
Problem 1: error margin of POS taggers
Standard error margin: 5%
Affected: extraction of lemmas meeting POS criteria
Precision (noise in data): non-verbs tagged as verbs
• Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)
• agressive VB(lex,intr,infin) ?agressive? ...(3)
• well-behaved VB(lex,montr,edp) ?well-behave? ...(2)
Recall (data ignored): verbs tagged as non-verbs / lexical verbs as auxiliaries:
• ... who in sharing their lives with a retarded sibiling [sic!] and taking <ADJ(ge,pos,ingp)> {taking} part in every-day care problems, may decide never to have ...
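Precision and recall as used above can be made concrete with a short sketch. The token positions below are hypothetical toy data, not figures from the corpora; `precision_recall` is an illustrative helper name.

```python
def precision_recall(gold, extracted):
    """Precision and recall of an extraction, given two sets of token positions."""
    true_positives = len(gold & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: positions that really are lexical verbs (manual check)
# versus positions the tagger labelled as lexical verbs.
gold_verbs = {1, 4, 7, 9}
tagged_verbs = {1, 4, 7, 12}  # 12 is noise (a non-verb tagged as verb); 9 was missed

p, r = precision_recall(gold_verbs, tagged_verbs)
print(f"precision={p:.2f} recall={r:.2f}")
```

A non-verb tagged as a verb (like ‘agressive’ above) lowers precision; a verb tagged as an adjective (like ‘taking’ in the last example) lowers recall.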
Tracking & rectifying the POS errors
tagger built-in tag editor (TOSCA-ICLE): on-line targeting of precision & recall errors (UNTAGged and doubtful cases)
• Problem: insufficient query language: word OR lemma OR tag pattern
no tagger built-in editor: concordancer or editor needed to test for precision and recall
• Problem: either comprehensive or intuitive check
remaining difficulty: tagsets vs. research assumptions (gerunds & participles tagged as non-verbs)
Problem 2: semantic disambiguation and associations
sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998:4)
automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project
University of Lancaster disambiguation tool
Tools unavailable or not at implementable stage
Problem 3: corpus-driven collocation extraction (1)
lemmas or wordforms?
collocation vs co-occurrence (vs adjacency)
word clusters
• precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
• recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs ‘TAKE good care of’; ‘the chance which were not eager to take’)
• stop-listing not quite possible with high-frequency items (BUT: Ted Pedersen’s ‘Bigram Statistics Package’: http://www.d.umn.edu/~tpederse/code.html)
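The recall problem with a fixed 4:4 window can be shown in a few lines. This is a minimal sketch, not the procedure of any particular concordancer: it matches the node as a lower-cased wordform (lemma matching is left out), and the sentence is made up.

```python
def collocates_in_window(tokens, node, left=4, right=4):
    """Return (collocate, offset) pairs found within the span around each node hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                hits.append((tokens[j].lower(), j - i))
    return hits

# Made-up sentence in which 'care' stands seven words to the left of 'take'.
sentence = "the care which they did not want to take".split()

print(collocates_in_window(sentence, "take"))          # 4:4 window: 'care' is missed
print(collocates_in_window(sentence, "take", left=8))  # wider window: 'care' is found
```

Widening the window recovers the distant collocate but also admits more noise, which is the precision side of the same trade-off.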
Problem 3: corpus-driven collocation extraction (2)
co-occurrence statistics (WordSmith)
• precision: not all co-occurrence patterns testify to meaningful collocations
• recall: collocations may extend beyond typical 4:4 word spans
• MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):
GIVE 172 2458
birth 4.65; vote 4.24; opening 4.24; antibiotic 4.01; vaccination 3.91; ingenuity 3.91; isolate 3.43; habit 3.43; happiness 3.24; away 2.91
• WordSmith: only 10 collocate output
• Oliver Mason’s QWICK: MI with weighting factors for frequent words and unlimited display of collocates
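Why MI favours rare, ‘idiosyncratic’ collocates can be seen from the formula itself. The sketch below uses one common textbook formulation, MI = log2(observed / expected); it is an assumption that this matches what WordSmith or QWICK compute internally, and all frequencies are hypothetical.

```python
import math

def mutual_information(f_node, f_collocate, f_joint, corpus_size, span=1):
    """MI = log2(observed / expected) co-occurrence frequency.

    expected = f_node * f_collocate * span / corpus_size
    """
    expected = f_node * f_collocate * span / corpus_size
    return math.log2(f_joint / expected)

# Hypothetical frequencies, chosen only to contrast a rare and a frequent collocate.
rare = mutual_information(f_node=2000, f_collocate=10, f_joint=5, corpus_size=100_000)
frequent = mutual_information(f_node=2000, f_collocate=5000, f_joint=300, corpus_size=100_000)

print(round(rare, 2), round(frequent, 2))
```

The rare collocate co-occurs only 5 times yet scores far higher than the frequent one with 300 co-occurrences, because its expected frequency is so small.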
Problem 3: corpus-driven collocation extraction (3)
co-occurrence statistics (z-score: TACT)
z-score & t-score - better suited for frequent collocates but also mutual and imprecise on their own: z-score ordered collocate list for BE:
• there; it; that; able; not; which; should; considered; by; likely; to; said; very; enough; why; important; concerned; what; always; worth; if; proved; afraid; used;
Mason’s QWICK: multi-test package: incl. also log-likelihood; modified log likelihood; expected/observed ratio
Remaining problems
• stop-listing not quite possible with high-frequency item tests
• collocations outside a heuristic window
• lexical associations between collocates (synsets)
• semi-manual grouping of data essential (limitations)
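For comparison, a minimal sketch of z-score and t-score in their common simplified forms, z = (O - E) / sqrt(E) and t = (O - E) / sqrt(O). It is an assumption that TACT and QWICK use exactly these variants, and the frequencies are hypothetical.

```python
import math

def expected_cooccurrence(f_node, f_collocate, corpus_size, span=1):
    """Expected co-occurrence frequency under independence."""
    return f_node * f_collocate * span / corpus_size

def z_score(f_node, f_collocate, f_joint, corpus_size, span=1):
    """Simplified z-score: (observed - expected) / sqrt(expected)."""
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size, span)
    return (f_joint - expected) / math.sqrt(expected)

def t_score(f_node, f_collocate, f_joint, corpus_size, span=1):
    """Simplified t-score: (observed - expected) / sqrt(observed)."""
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size, span)
    return (f_joint - expected) / math.sqrt(f_joint)

# Hypothetical (f_node, f_collocate, f_joint, corpus_size) for two collocates.
rare = (2000, 10, 5, 100_000)
frequent = (2000, 5000, 300, 100_000)

for label, freqs in (("rare", rare), ("frequent", frequent)):
    print(label, round(z_score(*freqs), 2), round(t_score(*freqs), 2))
```

With these numbers both scores rank the frequent collocate above the rare one, which is the opposite of what a log-ratio measure such as MI would do and is why function words crowd the top of the BE list above.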
Problem 4: the nature of learner data
Difference in proficiency levels essential in cross-corpus comparisons
Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand
Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’
Unrecognised words vs. tagger default option tag
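One way to catch inconsistent hyphenation before tagging is to group candidate compounds under a canonical key. This is a minimal sketch with hypothetical helper names; it is only meant for a list of expressions already suspected to be compounds, since applied blindly it would merge any adjacent word pair.

```python
import re
from collections import defaultdict

def compound_key(expression):
    """Canonical key: lower-cased, with spaces and hyphens removed."""
    return re.sub(r"[\s\-]+", "", expression.lower())

def group_variants(expressions):
    """Group spelling variants of candidate compounds under their canonical key."""
    groups = defaultdict(set)
    for expr in expressions:
        groups[compound_key(expr)].add(expr)
    return dict(groups)

variants = ["money making", "moneymaking", "money-making"]
print(group_variants(variants))
```

All three spellings end up under the single key `moneymaking`, so they can be reviewed and normalised together.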
Problem 5: cross-corpus comparability
genre homogeneity
topic-skewed distribution: heuristic method of isolation: sort by standard deviation
TAKE <sb/sth>  LOB&BR  MCONC  LOCN  IFA-P  FREN  SPAN  PLLC  SD
drugs*         0       0      2     43     6     0     0     15.90
steps          4       4      1     13     2     7     0     4.43
overdose       0       6      0     0      0     0     0     2.27
exercise       0       6      0     1      0     0     1     2.19
life**         0       0      6     1      1     0     0     2.19
* incl. marijuana, opium, chemical substances
** = kill
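The sort-by-standard-deviation heuristic can be sketched as follows, using the frequencies from the TAKE table. The sample standard deviation (n - 1 in the denominator) reproduces the SD column; `rank_by_dispersion` is an illustrative helper name.

```python
from statistics import stdev

CORPORA = ["LOB&BR", "MCONC", "LOCN", "IFA-P", "FREN", "SPAN", "PLLC"]

# Frequencies of TAKE + object per corpus, copied from the TAKE table.
counts = {
    "drugs":    [0, 0, 2, 43, 6, 0, 0],
    "steps":    [4, 4, 1, 13, 2, 7, 0],
    "overdose": [0, 6, 0, 0, 0, 0, 0],
    "exercise": [0, 6, 0, 1, 0, 0, 1],
    "life":     [0, 0, 6, 1, 1, 0, 0],
}

def rank_by_dispersion(table):
    """Sort collocates by sample standard deviation across corpora, highest first."""
    scored = [(collocate, stdev(freqs)) for collocate, freqs in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_by_dispersion(counts)
for collocate, sd in ranked:
    peak = CORPORA[counts[collocate].index(max(counts[collocate]))]
    print(f"{collocate:<9} SD={sd:.2f} peak in {peak}")
```

Items at the top of the ranking, such as ‘drugs’ with its peak in IFA-P, are the first candidates to inspect for topic skew before any cross-corpus conclusion is drawn.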
Summary
Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity
With small corpora, mere automated methods of processing and analysis display insufficient precision and recall
Loss of data may prove too costly when pedagogical conclusions are sought
Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)
Dedicated new type of hybrid concordancer-editor needed
SOLUTION: dedicated concordancer-annotator
Feature 1: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors
Feature 2: allow adding custom information to concordance lines (specialised annotation / grouping of data)
Feature 3: allow saving concordances as text BACK into the corpus (pasting)
Feature 4: collocation annotation / statistics enhanced by links with phraseological dictionary
Feature 5: ???
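A minimal sketch of how Features 1 to 3 might fit together: concordance lines that remember their corpus position, carry a free-text note, and can be written back. All names (`ConcLine`, `concordance`, `write_back`) and the sample text are hypothetical; this is not an existing tool.

```python
from dataclasses import dataclass

@dataclass
class ConcLine:
    position: int   # index of the node token in the corpus token list
    left: str       # left context
    node: str       # the node word as it appears in the corpus
    right: str      # right context
    note: str = ""  # custom annotation added by the analyst (Feature 2)

def concordance(tokens, node, width=4):
    """Build KWIC lines for every occurrence of the node wordform."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            lines.append(ConcLine(
                position=i,
                left=" ".join(tokens[max(0, i - width):i]),
                node=tok,
                right=" ".join(tokens[i + 1:i + 1 + width]),
            ))
    return lines

def write_back(tokens, line, new_node):
    """Return a copy of the corpus with the edited node pasted back (Features 1 and 3)."""
    edited = list(tokens)
    edited[line.position] = new_node
    return edited

# Made-up learner text with a misspelled node word.
corpus = "they taek care of the children and take part in games".split()

hit = concordance(corpus, "taek")[0]
hit.note = "spelling corrected; TAKE care = MWU"
fixed = write_back(corpus, hit, "take")

for line in concordance(fixed, "take"):
    print(f"{line.left:>25} [{line.node}] {line.right}")
```

Keeping the token position on each line is the design choice that makes pasting back possible: the edit is applied to the corpus itself, so later searches also find the corrected form.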
This show will shortly be available from:
http://main.amu.edu.pl/~przemka/rsearch.html