Definitional and human constraints on parsing performance
Geoffrey Sampson, Sussex University
Anna Babarczy, Budapest University of Technology and Economics
A number of authors (Voutilainen 1999; Brants 2000) have explored the ceiling on consistency of human
grammatical annotation of natural-language samples. It is not always appreciated that this issue covers two
rather separate sub-issues:
(i) how refined can a well-defined scheme of annotation be?
(ii) how accurately can human analysts learn to apply a well-defined but highly-refined scheme?
The first issue relates to the inherent nature of a language, or of whichever aspect of its structure an annotation
scheme represents. The second relates to the human ability to be explicit about the properties of a language.
To give an analogy: if we aimed to measure the size (volume) of individual clouds in the sky, one limitation
we would face is that the fuzziness of a cloud makes its size ill-defined beyond some threshold of precision;
another limit is that our technology may not enable us to surpass some other, perhaps far lower threshold of
measurement precision.
The analogy is not perfect. Clouds exist independently of human beings, whereas the properties of a language
sample are aspects of the behaviour of people, including linguistic analysts. Nevertheless, the two issues are
logically distinct, though the distinction between them has not always been drawn in past discussions. (The
distinction we are drawing is not the same, for instance, as Dickinson and Meurers' (2003) distinction between
“ambiguity” and “error”: by “ambiguity” Dickinson and Meurers are referring to cases where a linguistic
form (their example is English can), taken out of context, is compatible with alternative annotations but the
correct choice is determined once the context is given. We are interested in cases where full information
about linguistic context and annotation scheme may not uniquely determine the annotation of a given form.
On the other hand, Blaheta's (2002) distinction between “Type A” and “Type B” errors, on one side, and
“Type C” errors, on the other side, does seem to match our distinction.)
In earlier work (Babarczy, Carroll, and Sampson 2006) we began to explore the quantitative and qualitative
differences between these two limits on annotation consistency experimentally, by examining the specific
domain of wordtagging. We found that, even for analysts who are very well versed in a part-of-speech
tagging scheme, human ability to conform to the scheme constrains the achievable degree of annotation
consistency more severely than the precision of the scheme's definition does.
The present paper will report results of an experiment which extends the enquiry to the domain of higher-level
(phrase and clause) annotation.
Note that neither in our earlier investigation nor in that to be reported here are we concerned with the separate
question of what levels of accuracy are achievable by automatic annotation systems (wordtaggers or parsers)
— an issue which has frequently been examined by others. But our work is highly relevant to that issue,
because it implies the existence of upper bounds, lower than 100%, to the degree of accuracy theoretically
achievable by automatic systems. In the physical world it makes straightforwardly good sense to say that
some instrument can measure the size, or the mass, of objects more accurately than a human being can
estimate these properties unaided. In the domain of language, since this is an aspect of human behaviour, it
sounds contradictory or meaningless to suggest that a machine might be able to annotate grammatical
structure more accurately than the best-trained human expert: human performance appears to define the
standard. Nevertheless, the findings already referred to imply that it is logically possible for an automatic
wordtagger to outperform a human expert, even though no standard of perfect accuracy exists. Clouds are
inherently somewhat fuzzy, but the limits on people's ability to measure them are more severe than that
inherent fuzziness. The present paper aims to
examine whether the same holds true for structure above the word level. These are considerations which
developers of automatic language-analysis systems need to be aware of.
Our experimental data consist of independent annotations by two suitable human analysts of ten extracts from
diverse files of the written-language section of the British National Corpus, each extract containing 2000+
words beginning and ending at a natural break (or about 2300 parsable items, counting punctuation
marks, parts of hyphenated words, and the like). (Although, ideally, it would certainly be better to use more analysts
for the investigation, the realities of academic research and the need for the analysts to be extremely well-
trained on the annotation scheme mean that in practice it is reasonable to settle for two.) The annotation
scheme applied was the SUSANNE scheme (Sampson 1995), the development of which was guided by the
aim of producing a maximally refined and rigorously-defined set of categories and guidelines for their use
(rather than the aim of generating large quantities of analysed language samples). To quote an independent
observer, “Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus
puts more emphasis on precision and consistency” (Lin 2003: 321).
These data are currently being analysed in two ways: (i) the leaf-ancestor metric (Sampson 2000) is being
applied to measure the degree of discrepancy between the independent parses of the same passages, and to
ascertain what proportions of the overall discrepancy level are attributable to particular aspects of language
structure (e.g. how far discrepancy arises from formtagging as opposed to functiontagging, from phrase
classification as opposed to clause classification, etc.); and (ii) for a sample of specific discrepancies in each
category, the alternative analysts' annotations are compared with the published annotation guidelines to
discover what proportion arise from previously-unnoticed vagueness in the guidelines as opposed to human
error on the analysts' part.
The leaf-ancestor metric is used for this purpose both because it is the best operationalization known to us of
linguists' intuitive concept of relative parse accuracy (Sampson and Babarczy 2003 give experimental
evidence that it is considerably superior in this respect to the best-known alternative metric, the GEIG system
used in the PARSEVAL programme), and because its overall assessment of a pair of labelled trees over a
string is derived from individual scores for the successive elements of the string, making it easy to locate
specific discrepancies and identify structures repeatedly associated with discrepancy.
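The computation behind the metric can be sketched briefly. In a minimal form (omitting the boundary-marking refinements of the full published metric), each leaf is assigned its lineage, the sequence of node labels from the root down to that leaf; the two trees' lineages for the same leaf are compared by normalized edit distance, and the overall score is the mean per-leaf similarity. The tree representation and function names below are illustrative assumptions, not the authors' actual software:

```python
def lineages(tree):
    """Return (leaf, lineage) pairs for each leaf, left to right.
    A tree is a tuple (label, child, ...); a leaf is a plain string.
    The lineage is the sequence of node labels from the root down."""
    result = []

    def walk(node, path):
        if isinstance(node, str):
            result.append((node, path))
        else:
            label, *children = node
            for child in children:
                walk(child, path + [label])

    walk(tree, [])
    return result


def edit_distance(a, b):
    """Standard Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def leaf_ancestor_score(tree_a, tree_b):
    """Mean per-leaf lineage similarity between two parses of one string.
    Per-leaf similarity is 1 - distance / (len(lineage_a) + len(lineage_b))."""
    la, lb = lineages(tree_a), lineages(tree_b)
    assert [w for w, _ in la] == [w for w, _ in lb], "same word string required"
    sims = [1 - edit_distance(p, q) / (len(p) + len(q))
            for (_, p), (_, q) in zip(la, lb)]
    return sum(sims) / len(sims)
```

Because the overall score is an average of per-leaf similarities, any leaf scoring below 1.0 directly locates a discrepancy; for example, two parses of "dogs bark" agreeing on the NP but labelling the second constituent VP versus NP score (1 + 0.75) / 2 = 0.875.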
At the time of writing this abstract, although the annotations have been produced and the software to apply the
leaf-ancestor metric to them has been written, the process of using the software to extract quantitative results
has only just begun. We find that the overall correspondence between the two analysts' annotations of the
20,000-word sample is about 0.94, but this figure in isolation is not very meaningful. More interesting will be
data on how the approx. 0.06 incidence of discrepancy divides between different aspects of language
structure, and between scheme vagueness and human error. Detailed findings on those issues will be
presented at the Osnabrück conference.
References
Babarczy, Anna, J. Carroll, and G.R. Sampson (2006) “Definitional, personal, and mechanical constraints
on part of speech annotation performance”, J. of Natural Language Engineering 11.1-14.
Blaheta, D. (2002) “Handling noisy training and testing data”, Proc. 7th EMNLP, Philadelphia.
Brants, T. (2000) “Inter-annotator agreement for a German newspaper corpus”, Proc. LREC-2000, Athens.
Dickinson, M. and W.D. Meurers (2003) “Detecting errors in part-of-speech annotation”, Proc. 11th
EACL, Budapest.
Lin, D. (2003) “Dependency-based evaluation of Minipar”. In A. Abeillé, ed., Treebanks, Kluwer, pp.
317-29.
Sampson, G.R. (1995) English for the Computer. Clarendon Press (Oxford University Press).
Sampson, G.R. (2000) “A proposal for improving the measurement of parse accuracy”, International J. of
Corpus Linguistics 5.53-68.
Sampson, G.R. and Anna Babarczy (2003) “A test of the leaf-ancestor metric for parse accuracy”, J. of
Natural Language Engineering 9.365-80.
Voutilainen, A. (1999) “An experiment on the upper bound of interjudge agreement: the case of tagging”,
Proc. 9th Conference of EACL, Bergen, pp. 204-8.