Post on 03-Jan-2016
Annotation of corpora
• A. Part-of-speech tagging
• B. Syntactic annotation
• C. Semantic annotation
• D. Discourse annotation
• E. Pragmatic annotation
Annotation of corpora
• perfectly plain: produced by scanning; no information about text (usually, not even edition)
• marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics, etc.
• annotated with identifying information, e.g. edition date, author, genre, register, etc.
• annotated for part of speech, syntactic structure, discourse information, etc.
A. Part-of-speech tagging
LOB sample with POS tagging
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
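The word_TAG layout of the sample above can be read back into (word, tag) pairs. The following is a minimal sketch, assuming each line starts with a text id and a line number and that `^` marks a sentence start; the function name `parse_lob_line` is made up for illustration and is not part of any LOB tool.

```python
def parse_lob_line(line):
    """Split one LOB-style line into (reference, [(word, tag), ...])."""
    fields = line.split()
    # Assumption: the first two fields are text id and line number, e.g. "A01 3".
    ref = (fields[0], int(fields[1]))
    pairs = []
    for token in fields[2:]:
        if token == "^":  # sentence-start marker, carries no tag
            continue
        # Split on the LAST underscore so the tag is always the final part.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return ref, pairs


ref, pairs = parse_lob_line("A01 3 ^ by_IN Trevor_NP Williams_NP ._.")
print(ref, pairs)
# ('A01', 3) [('by', 'IN'), ('Trevor', 'NP'), ('Williams', 'NP'), ('.', '.')]
```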
A. Part-of-speech tagging
• Main steps:
– Divide the text into word tokens (tokenization)
– Select a set of tags
– Apply tag set to tokens
• Tokenization:
– orthographic word vs. morpho-syntactic unit?
– multiwords, e.g., in spite of: label as in_PREP31 spite_PREP32 of_PREP33
– mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo: label as vendete_VERBlo_PRON
– compounds, e.g., tag set: label as tagset_NOUN or tag_NOUN set_NOUN?
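The multiword labels above (PREP31, PREP32, PREP33) follow a pattern: base tag, number of words in the unit, position within the unit. A minimal sketch of that labelling, assuming a tiny hand-made multiword list; `MULTIWORDS` and `tag_multiwords` are hypothetical names, and all other tokens are simply left untagged here.

```python
# Hypothetical mini-list of multiword units and their base tag.
MULTIWORDS = {("in", "spite", "of"): "PREP"}


def tag_multiwords(tokens, multiwords=MULTIWORDS):
    """Attach tags of the form TAG<length><position> to known multiword units."""
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        for unit, tag in multiwords.items():
            n = len(unit)
            if tuple(t.lower() for t in tokens[i:i + n]) == unit:
                # e.g. in_PREP31: first word of a three-word preposition
                for k, tok in enumerate(tokens[i:i + n], start=1):
                    out.append(f"{tok}_{tag}{n}{k}")
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])  # left untagged in this sketch
            i += 1
    return out


print(tag_multiwords("He left in spite of the rain".split()))
# ['He', 'left', 'in_PREP31', 'spite_PREP32', 'of_PREP33', 'the', 'rain']
```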
A. Part-of-speech tagging
• Choice of tag set
• sophisticated, linguistically well grounded set of tags…
• BUT: not automatically applicable without loss of accuracy
• example: come can be present plural indicative, imperative or subjunctive; the Lancaster corpus distinguishes these from the to-infinitive, whereas LOB and the Brown corpus don’t
A. Part-of-speech tagging
• tag = word class
• label = alphanumeric characters
• examples:
– preposition: preposition, prep, IN
– singular proper noun: NOUN:prop:sing, N-p-sg, NP1
• logically organized (taxonomy), e.g., in Lancaster, BNC, C7
• presentation: horizontal or vertical
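The two presentations carry the same information and can be converted into each other. A rough sketch, assuming the horizontal format is word_TAG separated by spaces and the vertical format is one word per line with a tab before the tag; both function names are invented for this example.

```python
def to_vertical(horizontal):
    """word_TAG word_TAG ...  ->  one 'word<TAB>TAG' pair per line."""
    lines = []
    for token in horizontal.split():
        word, _, tag = token.rpartition("_")
        lines.append(f"{word}\t{tag}")
    return "\n".join(lines)


def to_horizontal(vertical):
    """One 'word<TAB>TAG' pair per line  ->  word_TAG word_TAG ..."""
    return " ".join("_".join(line.split("\t")) for line in vertical.splitlines())


sample = "by_IN Trevor_NP Williams_NP ._."
print(to_vertical(sample))
print(to_horizontal(to_vertical(sample)) == sample)  # True
```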
A. Part-of-speech tagging
• encoding of tags
• TEI (SGML), e.g., BNC:
<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>, <w AV0>just <w CJS>as <w PNP>they <w VVB>’re <w VVG>passing <w PNP>you <c PUN>. (Garside et al., 1997)
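Tags encoded this way can be pulled out with a regular expression. The sketch below assumes the simple `<w TAG>word` / `<c TAG>punctuation` shape shown in the example, with no closing tags; `parse_bnc` is a made-up helper, not a BNC tool.

```python
import re

# <w TAG> introduces a word, <c TAG> a punctuation mark; the text runs up to the next "<".
TOKEN = re.compile(r"<([wc]) ([A-Z0-9]+)>([^<]*)")


def parse_bnc(text):
    """Return (form, tag, element) triples, where element is 'w' or 'c'."""
    return [(m.group(3).strip(), m.group(2), m.group(1)) for m in TOKEN.finditer(text)]


print(parse_bnc("<w AT0>the <w NN2>women <c PUN>."))
# [('the', 'AT0', 'w'), ('women', 'NN2', 'w'), ('.', 'PUN', 'c')]
```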
A. Part-of-speech tagging
• Applying tags to words
• tagging scheme should include a procedure of how to assign tags to words (both for humans and machines)
• need a lexicon: it will say which tags are assignable to which words
• again: ambiguity is a problem
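Why ambiguity is a problem becomes visible as soon as the lexicon is consulted: many words have more than one assignable tag. A minimal sketch, assuming a toy lexicon with LOB-style tags; the entries in `LEXICON` are invented for illustration, and choosing between the candidates (disambiguation) is not attempted.

```python
# Hypothetical toy lexicon: word -> set of assignable tags.
LEXICON = {
    "a": {"AT"},
    "move": {"NN", "VB"},
    "to": {"TO", "IN"},
    "stop": {"NN", "VB"},
}


def candidate_tags(tokens, lexicon=LEXICON):
    """Look up every token; unknown words get the placeholder tag UNKNOWN."""
    return [(t, sorted(lexicon.get(t.lower(), {"UNKNOWN"}))) for t in tokens]


def ambiguous(tokens, lexicon=LEXICON):
    """Tokens for which the lexicon alone cannot decide on a single tag."""
    return [t for t, tags in candidate_tags(tokens, lexicon) if len(tags) > 1]


print(candidate_tags("a move to stop".split()))
print(ambiguous("a move to stop".split()))  # ['move', 'to', 'stop']
```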
B. Syntactic annotation
• syntactic annotation = parsed corpora
• purposes:
– training automatic parsers (computational linguistics, e.g. probabilistic parsers - inductive training through extraction of frequency counts)
– extracting information (linguistics, e.g., building a lexicon, investigating subcategorization frames, collocations or other linguistic phenomena, describing sublanguages)
B. Syntactic annotation
• a parsing scheme needs (cf. POS tagging):
– a list of symbols
– definitions of symbols
– description of how to apply symbols to text
• syntactically annotated corpora: treebanks
• examples of treebanks: Penn Treebank, Nijmegen Treebank, Susanne Corpus, Helsinki Constraint Grammar (ENGCG), Lancaster/IBM SEC treebank
B. Syntactic annotation
• Parsing
• the (automatic) analysis of texts (sentences) in terms of syntactic categories
[Figure: phrase-structure tree with the nodes S, NP, VP, ADJP and PP for the sentence “Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29”]
B. Syntactic annotation
• Penn Treebank
• skeleton parsing: partial parse, leaving out the “hard” things (such as PP-attachment)
• phrase structure model (Garside et al., 1997, p.42)
((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will (VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29))) .)
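A bracketed parse like this one can be read into a tree and used for the kind of frequency counts mentioned under the purposes of syntactic annotation. The sketch below is an illustration only: it assumes every bracket is either unlabelled or starts with a category label, and it replaces every word or punctuation token by the placeholder "word" when counting rules. The names `parse`, `leaves` and `rule_counts` are made up.

```python
import re
from collections import Counter

SAMPLE = ("((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will "
          "(VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29))) .)")


def parse(text):
    """Turn a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = None
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]  # category label, e.g. NP
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word or punctuation mark
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """The words of the sentence, left to right."""
    out = []
    for child in tree[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def rule_counts(tree, counts=None):
    """Count rules such as NP -> word word; unlabelled brackets are skipped."""
    counts = Counter() if counts is None else counts
    label, children = tree
    if label is not None:
        rhs = tuple(c[0] if isinstance(c, tuple) else "word" for c in children)
        counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            rule_counts(child, counts)
    return counts


tree = parse(SAMPLE)
print(" ".join(leaves(tree)))
for (lhs, rhs), n in rule_counts(tree).most_common():
    print(n, lhs, "->", " ".join(rhs))
```

In a real treebank the counts would be summed over all sentences; a probabilistic parser would then turn them into rule probabilities.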
B. Syntactic annotation
• Penn Treebank
• available through LDC
• size: 3,300,000 words (Feb 97)
• Brown corpus, Wall Street Journal
• in the current phase:
– add function labels (Subj, Obj etc.)
– add null constituents or traces (e.g., It’s easy [t] to eat)
– add indices for coreferences (e.g., Mary[i] saw herself[i] in the mirror)
– discontinuous constituents
– add semantic roles (Agent, Goal etc.)
• may get too complex for large-scale reliable analysis…
B. Syntactic annotation
• Susanne Corpus
• part of the Brown corpus, 128,000 words
• result of manual analysis
• parsing scheme specified in great detail
• available from Oxford Text Archive:
– sable.ox.ac.uk/ota (http)
– ota.ox.ac.uk/pub/ota/public (ftp)
A./B. Demo
• TIGER
• NEGRA
C. Semantic annotation
• problem (1): more than one way of referring to a concept, e.g.,
– text analysis: choice of expression may reflect ideologies in the text or relationships between participants in conversation, for example, in doctor-patient interaction: abdomen --- tummy
– information retrieval: a fashion historian seeks information about trousers: trousers --- slacks, shorts, leggings, breeches
--> cf. RECALL in IR
C. Semantic annotation
• problem (2): one single word can refer to different concepts, e.g.,
– information retrieval: a fashion historian wants to know about boots: boot --- may refer to shoe, computer, kick, car
--> cf. PRECISION in IR
• so:
– need to identify related words (problem 1)
– need to identify the different senses of a word (problem 2)
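The two IR measures referred to above can be made concrete with a small calculation: recall is the share of relevant documents that a query finds, precision is the share of found documents that are relevant. The document ids below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Problem 1: four texts are about trousers, but only d1 uses the word itself;
# the others say "slacks", "breeches" etc., so a search for "trousers" misses them.
print(precision_recall(retrieved={"d1"}, relevant={"d1", "d2", "d3", "d4"}))
# (1.0, 0.25) -> low recall

# Problem 2: four texts contain "boot", but only d5 is about footwear.
print(precision_recall(retrieved={"d5", "d6", "d7", "d8"}, relevant={"d5"}))
# (0.25, 1.0) -> low precision
```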
C. Semantic annotation
• labeling words according to semantic field (word senses) so that you can
• … extract all the related words by querying on the semantic field
• … extract only those instances of ambiguous words with the specific senses you want by querying on the combination of word and semantic field
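Both kinds of query can be sketched on a toy corpus in which every token carries a semantic-field label. The field names (CLOTHING, COMPUTING, VEHICLE) and the corpus itself are assumptions made up for this example, not labels from an actual tagging scheme.

```python
# Hypothetical semantically tagged tokens: (word, semantic field).
CORPUS = [
    ("boot", "CLOTHING"),
    ("boot", "COMPUTING"),
    ("slacks", "CLOTHING"),
    ("trousers", "CLOTHING"),
    ("boot", "VEHICLE"),
]


def by_field(corpus, field):
    """All words labelled with the given semantic field (related words)."""
    return [word for word, f in corpus if f == field]


def by_word_and_field(corpus, word, field):
    """Positions of one ambiguous word in one specific sense."""
    return [i for i, (w, f) in enumerate(corpus) if w == word and f == field]


print(by_field(CORPUS, "CLOTHING"))                   # ['boot', 'slacks', 'trousers']
print(by_word_and_field(CORPUS, "boot", "CLOTHING"))  # [0]
```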
C. Semantic annotation
• semantic fields: sense relations and other kinds of relations (e.g., part-of, related-to etc.)
• annotation (cf. PoS tagging):
– definition of the tagging scheme (labels and their meanings)
– guidelines for applying the tagging scheme
– in semantics: this is not as easy and straightforward as for PoS tagging!
– requirements:
  • should make linguistic/psycholinguistic sense
  • should be able to account for the vocabulary in the corpus exhaustively
  • should be suitable for texts from different periods and registers (comprehensiveness)
  • should preferably have a hierarchical structure
C. Semantic annotation
• multiple membership, e.g., deepened: color and change/remain
• multiword units, e.g., stubbed out: encoded as two separate words, but belonging together
• one recent ambitious attempt at a taxonomy of such semantic relations (sense relations, thesaurus-type relations, semantic fields etc.): WordNet at www.cogsci.princeton.edu/~wn/
• you can try it online: www.cogsci.princeton.edu/~wn/online/
C. Semantic annotation
• How to do it?
– manually
– computer-assisted (need at least a computer-readable lexicon and a disambiguation process - similar to PoS tagging)
– fully automatic (not really feasible):
  • semantic analysis is even harder than syntactic parsing
  • no integrated ‘parse’ of meaning possible at the present time
D. Discourse annotation
• discourse features: what are they?
• Typically: cohesion and coherence
• coherence: what makes a text hang together in terms of content
• cohesion: the means of making a text hang together
• reference, substitution, ellipsis, conjunctive relations (cause, result, effect etc.), thematic development
• Halliday & Hasan, 1976
D. Discourse annotation
• example: anaphoric relations in the IBM/Lancaster corpus (UCREL)
• try to build up something like an ‘anaphoric treebank’
• what are anaphoric relations?
– links between a proform and an antecedent
– example: The married couple said that they were happy with their lot. (they and their refer back to The married couple)
D. Discourse annotation
• anaphoric annotation in UCREL: categories used are based on Halliday & Hasan, 1976
• example of annotation: (1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his American citizenship. (2 The Hartford Courant 2) said…
• symbols:
– (1), (2)… = antecedent
– < = anaphoric (> = cataphoric)
– REF = central pronoun
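Annotation of this kind is regular enough to be read by a program. The sketch below assumes only the three symbols explained above: numbered brackets around antecedents, < or > for the direction, and REF=n pointing to antecedent n. The helper `anaphoric_links` and its regular expressions are illustrative and do not reproduce the UCREL software.

```python
import re

ANTECEDENT = re.compile(r"\((\d+) (.+?) \1\)")   # (1 Feodor Baumenk 1)
ANAPHOR = re.compile(r"([<>])REF=(\d+) (\S+)")   # <REF=1 him

TEXT = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the "
        "U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his American "
        "citizenship. (2 The Hartford Courant 2) said")


def anaphoric_links(text):
    """Return (proform, direction, antecedent) triples."""
    antecedents = {m.group(1): m.group(2) for m in ANTECEDENT.finditer(text)}
    links = []
    for m in ANAPHOR.finditer(text):
        direction = "anaphoric" if m.group(1) == "<" else "cataphoric"
        links.append((m.group(3), direction, antecedents.get(m.group(2))))
    return links


print(anaphoric_links(TEXT))
# [('him', 'anaphoric', 'Feodor Baumenk'), ('his', 'anaphoric', 'Feodor Baumenk')]
```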
D. Discourse annotation
• few corpora annotated for discourse features…
• how to do it?
– manually
– computer-assisted: either interactive hand annotation, using some kind of specialized editor, or automatic annotation with the possibility of hand correction or disambiguation
– a tool supporting annotation of anaphora: XANADU in Lancaster
E. Pragmatic annotation
• anything beyond sentences and discourse: contexts of situation and culture
• examples of things people look at in pragmatics
– carry-on signals in conversation (e.g., Stenstroem 87): which functions do carry-on signals such as “well”, “you know” etc. have in conversation?
– speech acts (e.g., Stiles 92): speech act types in conversation, e.g., in doctor-patient interactions
PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)
D = Disclosure, K = Acknowledgment, E = Edification
E. Pragmatic annotation
• how to do it?
– manually
– computer-assisted: ?
– fully-automatic: -
• You have to use your imagination!
• Stenstroem example: Can be done with a concordance program because it’s essentially word-based
• Stiles example: would probably have to be done manually (then use a concordance program on the annotated texts?)
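What a concordance program does for the word-based case can be sketched in a few lines: find every occurrence of a carry-on signal and show it with some context on either side (keyword in context). The utterance and the function name `kwic` are invented for this example; real concordancers offer far more than this.

```python
def kwic(tokens, phrases, width=3):
    """Return (left context, hit, right context) for every phrase occurrence."""
    # Compare on lower-cased tokens without attached punctuation.
    norm = [t.lower().strip(",.?!") for t in tokens]
    rows = []
    for i in range(len(tokens)):
        for phrase in phrases:
            n = len(phrase)
            if tuple(norm[i:i + n]) == phrase:
                left = " ".join(tokens[max(0, i - width):i])
                hit = " ".join(tokens[i:i + n])
                right = " ".join(tokens[i + n:i + n + width])
                rows.append((left, hit, right))
    return rows


# Invented utterance; carry-on signals are given as tuples of words.
utterance = "Well, I was there, you know, and it rained".split()
for left, hit, right in kwic(utterance, [("well",), ("you", "know")]):
    print(f"{left:>15} | {hit} | {right}")
```

The same function could be run over texts that already carry manual speech-act codes, searching for the codes instead of words.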
Higher-level annotation: tools
• Tools that support specialized analysis, such as
– specialized editors, e.g., Xanadu for anaphoric relations
– specialized in terms of linguistic models,
  • e.g., Sys-Tools for Systemic Functional Grammar (minerva.ling.mq.edu.au/) (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html)
  • e.g., RSTTools for rhetorical relations analysis (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html)
• Tools that support various kinds of analysis (but not quite everything you might want to do):
– TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)
References
• Garside R., G. Leech & A. McEnery (eds.), 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London, Longman.
• Fellbaum C. (ed.), 1998. WordNet: An Electronic Lexical Database. MIT Press.
• Halliday M.A.K. & R. Hasan, 1976. Cohesion in English. London, Longman.
• Mindt, 1991. Syntactic evidence for semantic distinctions in English. In Aijmer & Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London, Longman.
• Stenstroem, 1987. Carry-on signals in English conversation. In Meijs (ed.), Corpus Linguistics and Beyond. Amsterdam, Rodopi.
• Stiles, 1992. Describing talk: a taxonomy of verbal response modes. Beverly Hills, Sage.