Definitional and human constraints on parsing performance
Geoffrey Sampson, Sussex University
Anna Babarczy, Budapest University of Technology and Economics
A number of authors (Voutilainen 1999; Brants 2000) have explored the ceiling on consistency of human
grammatical annotation of natural-language samples. It is not always appreciated that this issue covers two
rather separate sub-issues:
(i) how refined can a well-defined scheme of annotation be?
(ii) how accurately can human analysts learn to apply a well-defined but highly-refined scheme?
The first issue relates to the inherent nature of a language, or of whichever aspect of its structure an annotation
scheme represents. The second relates to the human ability to be explicit about the properties of a language.
To give an analogy: if we aimed to measure the size (volume) of individual clouds in the sky, one limitation
we would face is that the fuzziness of a cloud makes its size ill-defined beyond some threshold of precision;
another limit is that our technology may not enable us to surpass some other, perhaps far lower threshold of
measurement precision.
The analogy is not perfect. Clouds exist independently of human beings, whereas the properties of a language
sample are aspects of the behaviour of people, including linguistic analysts. Nevertheless, the two issues are
logically distinct, though the distinction between them has not always been drawn in past discussions. (The
distinction we are drawing is not the same, for instance, as Dickinson and Meurers' (2003) distinction between
“ambiguity” and “error”: by “ambiguity” Dickinson and Meurers are referring to cases where a linguistic
form (their example is English can), taken out of context, is compatible with alternative annotations but the
correct choice is determined once the context is given. We are interested in cases where full information
about linguistic context and annotation scheme may not uniquely determine the annotation of a given form.
On the other hand, Blaheta's (2002) distinction between “Type A” and “Type B” errors, on one side, and
“Type C” errors, on the other side, does seem to match our distinction.)
In earlier work (Babarczy, Carroll, and Sampson 2006) we began to explore the quantitative and qualitative
differences between these two limits on annotation consistency experimentally, by examining the specific
domain of wordtagging. We found that, even for analysts who are very well versed in a part-of-speech
tagging scheme, human ability to conform to the scheme constrains the achievable degree of annotation
consistency more severely than the precision of the scheme's definition does.
The present paper will report results of an experiment which extends the enquiry to the domain of higher-level
(phrase and clause) annotation.
Note that neither in our earlier investigation nor in that to be reported here are we concerned with the separate
question of what levels of accuracy are achievable by automatic annotation systems (wordtaggers or parsers)
— an issue which has frequently been examined by others. But our work is highly relevant to that issue,
because it implies the existence of upper bounds, lower than 100%, to the degree of accuracy theoretically
achievable by automatic systems. In the physical world it makes straightforwardly good sense to say that
some instrument can measure the size, or the mass, of objects more accurately than a human being can
estimate these properties unaided. In the domain of language, since this is an aspect of human behaviour, it
sounds contradictory or meaningless to suggest that a machine might be able to annotate grammatical
structure more accurately than the best-trained human expert: human performance appears to define the
standard. Nevertheless, the findings already referred to imply that it is logically possible for an automatic
wordtagger to outperform a human expert, even though no standard of perfect accuracy exists. Clouds are
inherently somewhat fuzzy, but the limits on people's ability to measure them are more severe than that
inherent fuzziness. The present paper aims to
examine whether the same holds true for structure above the word level. These are considerations which
developers of automatic language-analysis systems need to be aware of.
Our experimental data consist of independent annotations by two suitable human analysts of ten extracts from
diverse files of the written-language section of the British National Corpus, each extract containing 2000+
words beginning and ending at a natural break (or about 2300 parsable items, counting punctuation
marks, parts of hyphenated words, and the like). (Although, ideally, it would certainly be better to use more analysts
for the investigation, the realities of academic research and the need for the analysts to be extremely well-
trained on the annotation scheme mean that in practice it is reasonable to settle for two.) The annotation
scheme applied was the SUSANNE scheme (Sampson 1995), the development of which was guided by the
aim of producing a maximally refined and rigorously-defined set of categories and guidelines for their use
(rather than the aim of generating large quantities of analysed language samples). To quote an independent
observer, “Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus
puts more emphasis on precision and consistency” (Lin 2003: 321).
These data are currently being analysed in two ways: (i) the leaf-ancestor metric (Sampson 2000) is being
applied to measure the degree of discrepancy between the independent parses of the same passages, and to
ascertain what proportions of the overall discrepancy level are attributable to particular aspects of language
structure (e.g. how far discrepancy arises from formtagging as opposed to functiontagging, from phrase
classification as opposed to clause classification, etc.); and (ii) for a sample of specific discrepancies in each
category, the alternative analysts' annotations are compared with the published annotation guidelines to
discover what proportion arise from previously-unnoticed vagueness in the guidelines as opposed to human
error on the analysts' part.
The leaf-ancestor metric is used for this purpose both because it is the best operationalization known to us of
linguists' intuitive concept of relative parse accuracy (Sampson and Babarczy 2003 give experimental
evidence that it is considerably superior in this respect to the best-known alternative metric, the GEIG system
used in the PARSEVAL programme), and because its overall assessment of a pair of labelled trees over a
string is derived from individual scores for the successive elements of the string, making it easy to locate
specific discrepancies and identify structures repeatedly associated with discrepancy.
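The computation behind the metric can be sketched briefly. In a minimal form (omitting the boundary-marking refinements of the full published metric), each leaf is assigned its lineage, the sequence of node labels from the root down to that leaf; the two trees' lineages for the same leaf are compared by normalized edit distance, and the overall score is the mean per-leaf similarity. The tree representation and function names below are illustrative assumptions, not the authors' actual software:

```python
def lineages(tree):
    """Return (leaf, lineage) pairs for each leaf, left to right.
    A tree is a tuple (label, child, ...); a leaf is a plain string.
    The lineage is the sequence of node labels from the root down."""
    result = []

    def walk(node, path):
        if isinstance(node, str):
            result.append((node, path))
        else:
            label, *children = node
            for child in children:
                walk(child, path + [label])

    walk(tree, [])
    return result


def edit_distance(a, b):
    """Standard Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def leaf_ancestor_score(tree_a, tree_b):
    """Mean per-leaf lineage similarity between two parses of one string.
    Per-leaf similarity is 1 - distance / (len(lineage_a) + len(lineage_b))."""
    la, lb = lineages(tree_a), lineages(tree_b)
    assert [w for w, _ in la] == [w for w, _ in lb], "same word string required"
    sims = [1 - edit_distance(p, q) / (len(p) + len(q))
            for (_, p), (_, q) in zip(la, lb)]
    return sum(sims) / len(sims)
```

Because the overall score is an average of per-leaf similarities, any leaf scoring below 1.0 directly locates a discrepancy; for example, two parses of "dogs bark" agreeing on the NP but labelling the second constituent VP versus NP score (1 + 0.75) / 2 = 0.875.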
At the time of writing this abstract, although the annotations have been produced and the software to apply the
leaf-ancestor metric to them has been written, the process of using the software to extract quantitative results
has only just begun. We find that the overall correspondence between the two analysts' annotations of the
20,000-word sample is about 0.94, but this figure in isolation is not very meaningful. More interesting will be
data on how the approx. 0.06 incidence of discrepancy divides between different aspects of language
structure, and between scheme vagueness and human error. Detailed findings on those issues will be
presented at the Osnabrück conference.
References
Babarczy, Anna, J. Carroll, and G.R. Sampson (2006) “Definitional, personal, and mechanical constraints
on part of speech annotation performance”, J. of Natural Language Engineering 11.1-14.
Blaheta, D. (2002) “Handling noisy training and testing data”, Proc. 7th EMNLP, Philadelphia.
Brants, T. (2000) “Inter-annotator agreement for a German newspaper corpus”, Proc. LREC-2000, Athens.
Dickinson, M. and W.D. Meurers (2003) “Detecting errors in part-of-speech annotation”, Proc. 11th
EACL, Budapest.
Lin, D. (2003) “Dependency-based evaluation of Minipar”. In A. Abeillé, ed., Treebanks, Kluwer, pp.
317-29.
Sampson, G.R. (1995) English for the Computer. Clarendon Press (Oxford University Press).
Sampson, G.R. (2000) “A proposal for improving the measurement of parse accuracy”, International J. of
Corpus Linguistics 5.53-68.
Sampson, G.R. and Anna Babarczy (2003) “A test of the leaf-ancestor metric for parse accuracy”, J. of
Natural Language Engineering 9.365-80.
Voutilainen, A. (1999) “An experiment on the upper bound of interjudge agreement: the case of tagging”,
Proc. 9th Conference of EACL, Bergen, pp. 204-8.