Annotation of corpora

• A. Part-of-speech tagging

• B. Syntactic annotation

• C. Semantic annotation

• D. Discourse annotation

• E. Pragmatic annotation

Annotation of corpora

• perfectly plain: produced by scanning; no information about text (usually, not even edition)

• marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics, etc.

• annotated with identifying information, e.g. edition date, author, genre, register, etc.

• annotated for part of speech, syntactic structure, discourse information, etc.

A. Part-of-speech tagging

LOB sample with POS tagging

A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.

A01 3 ^ by_IN Trevor_NP Williams_NP ._.

A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN

A01 4 nominating_VBG any_DTI more_AP labour_NN

A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN

A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
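The sample above stores each token as word_TAG after a text reference. A minimal Python sketch of reading such a line into (word, tag) pairs follows; the function name and the decision to skip items without an underscore (such as ^) are assumptions made for illustration, not part of the LOB documentation.

```python
def parse_lob_line(line):
    """Split one LOB-style line into (reference, [(word, tag), ...])."""
    fields = line.split()
    ref = " ".join(fields[:2])            # e.g. "A01 3": text id and line number
    pairs = []
    for item in fields[2:]:
        if "_" not in item:               # skip untagged markers such as "^"
            continue
        word, tag = item.rsplit("_", 1)   # split at the last underscore
        pairs.append((word, tag))
    return ref, pairs


ref, pairs = parse_lob_line("A01 3 ^ by_IN Trevor_NP Williams_NP ._.")
print(ref, pairs)
```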

• Main steps:
– Divide the text into word tokens (tokenization)
– Select a set of tags
– Apply the tag set to the tokens

• Tokenization:
– orthographic word or morpho-syntactic unit?
– multiwords, e.g., in spite of: label as in_PREP31 spite_PREP32 of_PREP33
– mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo: label as vendete_VERBlo_PRON
– compounds, e.g., tag set: label as tagset_NOUN or tag_NOUN set_NOUN?
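The multiword and clitic cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the multiword lexicon, the choice to split n't off as its own token, and the function names are hypothetical; only the PREP31/PREP32/PREP33 labelling pattern comes from the slide.

```python
# Hypothetical multiword lexicon: word sequence -> tag for the whole unit.
MULTIWORDS = {("in", "spite", "of"): "PREP"}


def split_clitics(token):
    """Split a negative clitic off its host, e.g. hasn't -> has + n't (an assumption)."""
    for suffix in ("n't", "n’t"):
        if token.lower().endswith(suffix) and len(token) > len(suffix):
            return [token[:-len(suffix)], token[-len(suffix):]]
    return [token]


def tokenize(text):
    """Whitespace tokenization followed by clitic splitting."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_clitics(raw))
    return tokens


def ditto_tag(tokens):
    """Label each part of a known multiword as TAG + length + position."""
    out, i = [], 0
    while i < len(tokens):
        for words, tag in MULTIWORDS.items():
            n = len(words)
            if tuple(t.lower() for t in tokens[i:i + n]) == words:
                out.extend(f"{tok}_{tag}{n}{k}"
                           for k, tok in enumerate(tokens[i:i + n], 1))
                i += n
                break
        else:                              # no multiword starts here
            out.append(tokens[i])
            i += 1
    return out


result = ditto_tag(tokenize("He hasn't left in spite of it"))
print(result)
```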

• Choice of tag set

• sophisticated, linguistically well grounded set of tags…

• BUT: not automatically applicable without loss of accuracy

• example: come can be present plural indicative, imperative, or subjunctive; the Lancaster corpus distinguishes these from the to-infinitive, while the LOB and Brown corpora do not

• tag = word class

• label = alphanumeric characters

• examples:
– preposition: preposition, prep, IN
– singular proper noun: NOUN:prop:sing, N-p-sg

• logically organized (taxonomy), e.g., in Lancaster, BNC, C7

• presentation: horizontal or vertical

• encoding of tags

• TEI (SGML), e.g., BNC:

<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>, <w AV0>just <w CJS>as <w PNP>they <w VVB>’re <w VVG>passing <w PNP>you <c PUN>. (Garside et al., 1997)
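A short Python sketch of reading this horizontal encoding back into (tag, word) pairs. The regular expression is an assumption that fits the sample shown (a w or c element, one tag, then the text up to the next tag); it is not a general SGML or TEI parser.

```python
import re

# Assumed shape: "<w TAG>text" for words, "<c TAG>text" for punctuation.
TOKEN = re.compile(r"<([wc]) (\w+)>([^<]*)")


def parse_bnc(fragment):
    """Return (tag, text) pairs from a BNC-style SGML fragment."""
    return [(m.group(2), m.group(3).strip()) for m in TOKEN.finditer(fragment)]


sample = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,"
parsed = parse_bnc(sample)
print(parsed)
```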

• Applying tags to words

• tagging scheme should include a procedure of how to assign tags to words (both for humans and machines)

• need a lexicon: it will say which tags are assignable to which words

• again: ambiguity is a problem
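The role of the lexicon, and why ambiguity remains a problem, can be illustrated with a small Python sketch. The lexicon entries below are hypothetical and chosen only to show words with more than one assignable tag; a real tagger would follow the lookup with a disambiguation step.

```python
# Hypothetical lexicon: word -> tags that may be assigned to it.
LEXICON = {
    "a": ["AT"],
    "move": ["NN", "VB"],
    "to": ["TO", "IN"],
    "stop": ["VB", "NN"],
}


def candidate_tags(tokens, lexicon):
    """Look up every token; unknown words get a placeholder tag."""
    return [(t, lexicon.get(t.lower(), ["UNKNOWN"])) for t in tokens]


def ambiguous(tagged):
    """Tokens for which the lexicon alone cannot decide on one tag."""
    return [t for t, tags in tagged if len(tags) > 1]


tagged = candidate_tags(["a", "move", "to", "stop"], LEXICON)
print(ambiguous(tagged))
```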

B. Syntactic annotation

• syntactic annotation = parsed corpora

• purposes:

– training automatic parsers (computational linguistics, e.g. probabilistic parsers - inductive training through extraction of frequency counts)

– extracting information (linguistics, e.g., building a lexicon, investigating subcategorization frames, collocations or other linguistic phenomena, describing sublanguages)
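The "extraction of frequency counts" used to train probabilistic parsers can be sketched in Python as relative-frequency estimation of rewrite rules. The toy treebank, the nested-tuple tree format, and the function names are assumptions for illustration; words are treated as right-hand-side symbols to keep the sketch short.

```python
from collections import Counter


def rules(tree):
    """Yield (parent, child symbols) for every node of a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from rules(child)


def estimate(treebank):
    """Relative frequency of each rule among rules with the same parent."""
    counts = Counter(r for tree in treebank for r in rules(tree))
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


# Hypothetical two-sentence treebank.
TOY = [
    ("S", ("NP", "Mary"), ("VP", "sleeps")),
    ("S", ("NP", "the", "board"), ("VP", "meets")),
]
probs = estimate(TOY)
print(probs[("S", ("NP", "VP"))], probs[("NP", ("Mary",))])
```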

• a parsing scheme needs (cf. POS tagging):

– a list of symbols

– definitions of symbols

– description of how to apply symbols to text

• syntactically annotated corpora: tree banks

• examples of tree banks: Penn Treebank, Nijmegen Treebank, Susanne Corpus, Helsinki Constraint Grammar (ENGCG), Lancaster/IBM SEC treebank

• Parsing

• the (automatic) analysis of texts (sentences) in terms of syntactic categories

[Figure: phrase-structure tree for the sentence "Pierre Vinken, 61 years old, will join the board as an executive director Nov 29", with constituents labelled NP, ADJP and PP]

• Penn Treebank

• skeleton parsing: partial parse, leaving out the “hard” things (such as PP-attachment)

• phrase structure model (Garside et al., 1997, p.42)

((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will (VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29))) .)
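A bracketed parse like the one above can be read into a nested structure with a few lines of Python. This is a minimal sketch, not the Penn Treebank tooling: the function names are hypothetical, and a node is assumed to be a list whose first string (if any) is the category label.

```python
def tokenize_brackets(text):
    """Separate parentheses from words so they become their own tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens, pos=0):
    """Read the constituent opening at tokens[pos]; return (node, next position)."""
    node = []
    pos += 1                                # skip "("
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = read(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1                    # skip ")"


def leaves(node):
    """Collect the words, skipping the category label at the start of a node."""
    items = node[1:] if isinstance(node[0], str) else node
    words = []
    for item in items:
        words.extend(leaves(item) if isinstance(item, list) else [item])
    return words


PARSE = ("((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will "
         "(VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29))) .)")
tree, _ = read(tokenize_brackets(PARSE))
print(" ".join(leaves(tree)))
```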

• Penn Treebank

• available through LDC

• size: 3,300,000 words (Feb 97)

• Brown corpus, Wall Street Journal

• in the current phase:

– add function labels (Subj, Obj etc.)

– add null constituents or traces (e.g., It’s easy [t] to eat)

– add indices for coreferences (e.g., Mary[i] saw herself[i] in the mirror)

– discontinuous constituents

– add semantic roles (Agent, Goal etc)

• may get too complex for large-scale reliable analysis…

• Susanne Corpus

• part of the Brown corpus, 128,000 words

• result of manual analysis

• parsing scheme specified in great detail

• available from Oxford Text Archive:

– sable.ox.ac.uk/ota (http)

– ota.ox.ac.uk/pub/ota/public (ftp)

A./B. Demo

• TIGER

• NEGRA

C. Semantic annotation

• problem (1): more than one way of referring to a concept, e.g.,
– text analysis: choice of expression may reflect ideologies in the text or relationships between participants in conversation, for example, in doctor-patient interaction: abdomen --- tummy
– information retrieval: a fashion historian seeks information about trousers: trousers --- slacks, shorts, leggings, breeches
--> cf. RECALL in IR

• problem (2): one single word can refer to different concepts, e.g.,
– information retrieval: a fashion historian wants to know about boots: boot --- may refer to shoe, computer, kick, car
--> cf. PRECISION in IR

• so:
– need to identify related words (problem 1)
– need to identify the different senses of a word (problem 2)

• labeling words according to semantic field (word senses) so that you can

• … extract all the related words by querying on the semantic field

• … extract only those instances of ambiguous words with the specific senses you want by querying on the combination of word and semantic field
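The two query types can be sketched in Python over a toy list of (word, semantic field) pairs. The field names and the data are hypothetical; the sketch only shows that querying on the field alone retrieves all related words, while querying on word plus field isolates one sense.

```python
# Hypothetical tagged tokens: (word, semantic field).
TAGGED = [
    ("trousers", "CLOTHING"),
    ("slacks", "CLOTHING"),
    ("boot", "CLOTHING"),
    ("boot", "COMPUTING"),
    ("boot", "VEHICLE"),
    ("breeches", "CLOTHING"),
]


def by_field(tagged, field):
    """Positions of all tokens in a semantic field (helps recall)."""
    return [i for i, (_, f) in enumerate(tagged) if f == field]


def by_word_and_field(tagged, word, field):
    """Positions of one word in one sense only (helps precision)."""
    return [i for i, (w, f) in enumerate(tagged) if w == word and f == field]


print(by_field(TAGGED, "CLOTHING"))
print(by_word_and_field(TAGGED, "boot", "CLOTHING"))
```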

• semantic fields: sense relations and other kinds of relations (e.g., part-of, related-to etc.)

• annotation (cf. PoS tagging):
– definition of the tagging scheme (labels and their meanings)
– guidelines for applying the tagging scheme
– in semantics: this is not as easy and straightforward as for PoS tagging!
– requirements:
• should make linguistic/psycholinguistic sense
• should be able to account for the vocabulary in the corpus exhaustively
• should be suitable for texts from different periods and registers (comprehensiveness)
• should preferably have a hierarchical structure

• multiple membership, e.g., deepened: color and change/remain

• multiword units, e.g., stubbed out: encoded as two separate words, but belonging together

• one recent ambitious attempt at a taxonomy of such semantic relations (sense relations, thesaurus-type relations, semantic fields etc.): WORDNET at www.cogsci.princeton.edu/~wn/

• you can try it online: www.cogsci.princeton.edu/~wn/online/

• How to do it?
– manually
– computer-assisted (need at least a computer-readable lexicon and a disambiguation process - similar to PoS tagging)
– fully automatic (not really feasible):
• semantic analysis is even harder than syntactic parsing
• no integrated ‘parse’ of meaning possible at the present time

D. Discourse annotation

• discourse features: what are they?

• Typically: cohesion and coherence

• coherence: what makes a text hang together in terms of content

• cohesion: the means of making a text hang together

• reference, substitution, ellipsis, conjunctive relations (cause, result, effect etc.), thematic development

• Halliday & Hasan, 76

• example: anaphoric relations in the IBM/Lancaster corpus (UCREL)

• try to build up something like an ‘anaphoric treebank’

• what are anaphoric relations?
– links between a proform and an antecedent
– example: The married couple said that they were happy with their lot. (they and their refer back to The married couple)

• anaphoric annotation in UCREL: categories used are based on Halliday & Hasan, 76

• example of annotation: (1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the U.S Supreme Court to allow <REF=1 him to retain <REF=1 his American citizenship. (2 The Hartford Courant 2) said…

• symbols:
– (1), (2)… = antecedent
– < = anaphoric (> = cataphoric)
– REF = central pronoun
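The notation can be processed mechanically. The Python sketch below extracts antecedents and the pronouns linked to them from the example sentence; the regular expressions are assumptions fitted to that one example, and the handling of ">" (placed like "<" before REF) is a guess, since the slide shows no cataphoric example.

```python
import re

# "(1 ... 1)" marks antecedent number 1; "<REF=1 him" links "him" back to it.
ANTECEDENT = re.compile(r"\((\d+) (.+?) \1\)")
PROFORM = re.compile(r"([<>])REF=(\d+) (\w+)")   # ">" placement is an assumption


def links(text):
    """Return (proform, antecedent text, direction) triples."""
    antecedents = {m.group(1): m.group(2) for m in ANTECEDENT.finditer(text)}
    return [
        (m.group(3),
         antecedents.get(m.group(2)),
         "anaphoric" if m.group(1) == "<" else "cataphoric")
        for m in PROFORM.finditer(text)
    ]


TEXT = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the "
        "U.S Supreme Court to allow <REF=1 him to retain <REF=1 his American "
        "citizenship. (2 The Hartford Courant 2) said")
found = links(TEXT)
print(found)
```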

• few corpora annotated for discourse features…

• how to do it?

– manually

– computer-assisted: either interactive hand annotation, using some kind of specialized editor or automatic annotation with the possibility of hand correction or disambiguation

– a tool supporting annotation of anaphora: XANADU in Lancaster

E. Pragmatic annotation

• anything beyond sentences and discourse: contexts of situation and culture

• examples of things people look at in pragmatics
– carry-on signals in conversation (e.g., Stenstroem 87): which functions do carry-on signals such as “well”, “you know” etc. have in conversation?
– speech acts (e.g., Stiles 92): speech act types in conversation, e.g., in doctor-patient interactions

PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)

D = Disclosure
K = Acknowledgment
E = Edification
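Once utterances carry codes like these, simple counts per speaker become possible. The Python sketch below assumes one utterance per line in the form "SPEAKER: text (CODE)"; that line format and the function name are assumptions for illustration, not part of Stiles' scheme.

```python
import re
from collections import Counter

CODES = {"D": "Disclosure", "K": "Acknowledgment", "E": "Edification"}

# Assumed line format: "SPEAKER: utterance (CODE)".
LINE = re.compile(r"^(\w+): (.*) \((\w)\)$")

TRANSCRIPT = """\
PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)"""


def code_counts(transcript):
    """Count (speaker, speech act type) pairs in a coded transcript."""
    counts = Counter()
    for line in transcript.splitlines():
        match = LINE.match(line.strip())
        if match:
            speaker, _, code = match.groups()
            counts[(speaker, CODES.get(code, code))] += 1
    return counts


counts = code_counts(TRANSCRIPT)
print(counts)
```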

• how to do it?

– manually

– computer-assisted: ?

– fully-automatic: -

• You have to use your imagination!

• Stenstroem example: Can be done with a concordance program because it’s essentially word-based

• Stiles example: would probably have to be done manually (then use a concordance program on the annotated texts?)
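What "word-based" means for the Stenstroem case can be shown with a bare-bones keyword-in-context (KWIC) search in Python. The utterance is invented for illustration, and the sketch matches single words only; a multiword signal such as "you know" would need a small extension.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical utterance, not taken from any corpus.
tokens = "Well I think you know it was fine well sort of".split()
lines = kwic(tokens, "well", width=2)
for left, hit, right in lines:
    print(f"{left:>10} [{hit}] {right}")
```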

Higher-level annotation: tools

• Tools that support specialized analysis, such as
– specialized editors, e.g., Xanadu for anaphoric relations
– specialized in terms of linguistic models,
• e.g., Sys-Tools for Systemic Functional Grammar (minerva.ling.mq.edu.au/) (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html)
• e.g., RSTTools for rhetorical relations analysis (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html)

• Tools that support various kinds of analysis (but not quite everything you might want to do):
– TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)

References

• Fellbaum C. (ed.), 1998. WordNet. An Electronic Lexical Database. MIT Press.

• Garside R., G. Leech & A. McEnery (eds.), 1997. Corpus Annotation. Linguistic Information from Computer Text Corpora. London, Longman.

• Halliday M.A.K. & R. Hasan, 1976. Cohesion in English. London, Longman.

• Mindt, 1991. Syntactic evidence for semantic distinctions in English. In Aijmer & Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London, Longman.

• Stenstroem, 1987. Carry-on signals in English conversation. In Meijs (ed.), Corpus Linguistics and Beyond. Amsterdam, Rodopi.

• Stiles, 1992. Describing talk: a taxonomy of verbal response modes. Beverly Hills, Sage.
