Post on 26-Dec-2015
LIN 3098 Corpus Linguistics – Lecture 4
Albert Gatt
LIN 3098 -- Corpus Linguistics
In this lecture
Levels of annotation
Corpus typology
classification based on type and levels of annotation
multilingual corpora
Part 1
Levels of corpus annotation (cont'd)
Levels of linguistic annotation
part-of-speech (word-level)
lemmatisation (word-level)
parsing (phrase & sentence-level)
semantics (multi-level): semantic relationships between words and phrases; semantic features of words
discourse features (supra-sentence level)
phonetic transcription
prosody
Lemmatisation
Groups morphological variants of a word under the head word: mexa’ (walk)
imxejt (I walked) imxejna (we walked) nimxu (we walk) ...
Increasingly common these days.
Together, these form a lemma.
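As a rough illustration of the grouping idea, lemmatisation can be pictured as a lookup from word forms to a headword. The table below is a hand-built toy that contains only the Maltese forms from this slide; the names `FORM_TO_LEMMA` and `group_by_lemma` are made up for the example, and real lemmatisers rely on morphological rules or corpus statistics rather than a fixed list.

```python
from collections import defaultdict

# Toy form-to-lemma table (an illustrative assumption, built only from the
# forms shown on the slide; a real lemmatiser would not enumerate forms).
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}


def group_by_lemma(tokens):
    """Group tokens under their headword; unknown forms act as their own lemma."""
    groups = defaultdict(list)
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower(), token.lower())
        groups[lemma].append(token)
    return dict(groups)


print(group_by_lemma(["imxejt", "nimxu", "imxejna", "dar"]))
```

Counting by lemma rather than by surface form is what makes such a table useful for frequency lists and lexicography.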
Lemmatisation example: the SUSANNE corpus
Format: word + tag + lemma
A05:0030.33 - VVDv said say
Labels: corpus file:sentence.word (A05:0030.33), POS tag (VVDv), actual word (said), headword/lemma (say)
Every word in the corpus is on a separate line. Extremely useful for lexicography.
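Because each word sits on its own whitespace-separated line, a line like the one above can be read with a simple split. This is a minimal sketch, not an official SUSANNE reader: the class and function names are hypothetical, and the second field (the bare hyphen in the example) is not glossed on the slide, so calling it `status` is an assumption.

```python
from typing import NamedTuple


class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. A05:0030.33
    status: str     # the "-" field; this name is an assumption
    tag: str        # POS tag
    word: str       # the word as it appears in the text
    lemma: str      # headword


def parse_susanne_line(line: str) -> SusanneToken:
    """Split one corpus line into its first five whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    return SusanneToken(*fields[:5])


token = parse_susanne_line("A05:0030.33 - VVDv said say")
print(token.word, token.tag, token.lemma)
```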
Automatic morphological analysis
For some languages, there are reasonably good lemmatisers/morphological analysers.
Examples for English:
morpha: built at the University of Sussex
EngTwol: commercial, by LingSoft.
EngTwol output
undeniable: "undeniable" <DER:ble> A ABS
<DER:ble>: derived with -ble suffix; A: adjective; ABS: absolute form
This is a rule-based analyser. There are others which use corpus-derived statistical patterns.
Semantic annotation I: Two types
markup of semantic relations (e.g. predicate-argument structure)
currently used in parsed corpora, to mark up function-argument structures etc.
markup of features of word meaning (mainly, word senses)
has its origins in content analysis, where it is used to arrive at conclusions about how prominent particular concepts are
now used in a lot of work on word sense disambiguation
Example of type 1 semantic markup (Penn Treebank)
(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw (NP the ball))))))
Predicate Argument Structure: wants(Chris, throw(Chris, ball))
Empty embedded subject linked to NP subject no. 1
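To make the link between the empty subject and its antecedent concrete, the sketch below parses the bracketed tree into nested lists and replaces the trace with the words of the co-indexed NP. It assumes the usual Penn Treebank notation, with the index attached by a hyphen (NP-SBJ-1) and the trace written *-1; the function names are hypothetical, and this is an illustration, not a full treebank reader.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])  # a leaf (word or trace)
                pos += 1
        pos += 1                      # skip ")"
        return node

    return parse_node()


def words(node):
    """Return the leaves below a node, left to right."""
    out = []
    for child in node[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out


def antecedents(node, table=None):
    """Map each index on a label such as NP-SBJ-1 to the leaves of that node."""
    table = {} if table is None else table
    match = re.search(r"-(\d+)$", node[0])
    if match:
        table[match.group(1)] = words(node)
    for child in node[1:]:
        if isinstance(child, list):
            antecedents(child, table)
    return table


def resolve(node, table):
    """Return the leaves with each trace such as *-1 replaced by its antecedent."""
    out = []
    for leaf in words(node):
        match = re.fullmatch(r"\*-(\d+)", leaf)
        out.extend(table[match.group(1)] if match else [leaf])
    return out


TREE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"
tree = parse_brackets(TREE)
table = antecedents(tree)
print(table)
print(resolve(tree, table))
```

Once the trace is resolved, both verbs have Chris as their subject, which is the information the predicate-argument structure wants(Chris, throw(Chris, ball)) records.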
Semantic markup type 2: lexical features
Most common type: word-sense tagged corpora
Main idea: disambiguate a word in context by tagging its sense
Often uses WordNet (Miller et al 1993)
WordNet is a lexical taxonomy which represents lexical relations among a large number of words, including hyponymy (IS-A) relations etc.
For each entry, all the (supposed) senses of the word are given.
Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.
WordNet senses: Move (noun)
(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
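One simple way to pick a sense automatically is to compare the words around the target with the words in each gloss and choose the sense with the largest overlap (a simplified, Lesk-style heuristic, brought in here for illustration rather than taken from the lecture). The glosses below are abbreviated from three of the noun senses listed above; the sense labels and the stopword list are hypothetical, not real WordNet identifiers.

```python
import re

# Abbreviated glosses for three noun senses of "move" (from the listing above).
# The keys are made-up labels, not WordNet sense identifiers.
SENSES = {
    "move.decision": "the act of deciding to do something; his first move was to hire a lawyer",
    "move.relocation": "the act of changing your residence or place of business",
    "move.game": "a player's turn to take some action permitted by the rules of the game",
}

# A tiny ad hoc stopword list, assumed for this example only.
STOPWORDS = {"a", "an", "the", "of", "to", "or", "was", "his", "your", "by", "do", "in", "it", "is"}


def content_words(text):
    """Lower-case the text and keep the alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(context, senses=SENSES):
    """Return the sense label whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    return max(senses, key=lambda label: len(content_words(senses[label]) & context_words))


print(simplified_lesk("After the move, the new residence felt closer to her place of business"))
```

A sense-tagged corpus stores the outcome of this kind of decision (usually made or checked by human annotators) as a pointer from each word to one sense entry.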
WordNet senses: Move (verb)
(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
Check it out!
WordNet is freely available for download:
http://wordnet.princeton.edu/
Word sense annotation: other uses
tagging words with their semantic field (Wilson 1996)
plant life
men’s clothing
…
tagging words with their “emotional” content (Campbell & Pennebaker 2002)
based on a dictionary of categories, e.g. social processes, negative emotions
This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which:
analyses a text and comes up with a profile of its personal/emotional content
relates this to some features of its author (gender, age…)
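The dictionary-based approach can be sketched as counting how many tokens of a text fall into each category. The category names below follow the slide, but the word lists are invented for illustration and are not taken from the actual LIWC dictionary; `emotional_profile` is a hypothetical name.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: the word lists are invented for this example.
CATEGORY_WORDS = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "worried"},
}


def emotional_profile(text):
    """Return, per category, the proportion of tokens in the text that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORY_WORDS.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {category: counts[category] / total for category in CATEGORY_WORDS}


print(emotional_profile("We talk to a friend when we are sad or worried"))
```

Proportions rather than raw counts make profiles comparable across texts of different lengths, which matters when they are related to author features.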
Discourse annotation
Most common: text-level things such as paragraphs
Less common: anaphoric NPs and reference (cf. example from lecture 3)
Even less common:
annotation of words which function as discourse cues (Stenstrom 1984): apology (sorry), hedges (sort of), etc
annotation of rhetorical structure
Discourse: Annotating rhetorical structure (I)
Rhetorical Structure Theory (Mann and Thompson 1988):
views text as made up of “discourse units”
units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc
CONCESSION/CONTRAST relation:
[Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
Second unit is the main one (nucleus)
First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.
Recent corpus (Marcu et al 2003) is annotated with this information.
Phonetic transcription
Not many phonetically transcribed corpora. MARSEC corpus is one of the best known.
This is a version of the Lancaster/IBM Spoken English Corpus.
Several databases of transcribed speech, however. Mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).
Annotating suprasegmentals
Aims: capture suprasegmental features such as stress, intonation and pauses in speech.
Some transcription systems exist:
ToBI (American)
Tonic Stress Marker (TSM; British)
These define ways of annotating suprasegmentals such as start/end of tone group, simultaneous speech, rise-fall tone, falling tone, etc.
Problem-oriented tagging
If you’re interested in a particular problem, and no corpus exists, build your own!
Many corpora define problem-specific annotation schemes.
Example: the TUNA Corpus
Problem: How do people refer to objects using definite NPs?
Main interest: visual properties (colour, size etc)
Focus: semantics of definite NPs, i.e. what people choose to include in their description.
Method:
experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
annotation of descriptions based on semantics
TUNA Corpus: description
<DESCRIPTION NUM="SINGULAR">
<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
Red sofa, bigger version.
Features of the corpus:
1. represents the “target” referent
2. also represents the “distractors” (from which the target must be distinguished)
3. semantically transparent: annotation goes beyond language
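Because the annotation is plain XML, the semantic content of a description can be separated from its wording with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on the description shown earlier; the function name `read_description` is hypothetical, and the target and distractor markup of the full corpus is not modelled here.

```python
import xml.etree.ElementTree as ET

DESCRIPTION_XML = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""


def read_description(xml_text):
    """Return (attribute-value dict, list of surface phrases) for one description."""
    root = ET.fromstring(xml_text.strip())
    semantics = {}
    phrases = []
    for attribute in root.iter("ATTRIBUTE"):
        semantics[attribute.get("NAME")] = attribute.get("VALUE")
        phrases.append((attribute.text or "").strip())
    return semantics, phrases


semantics, phrases = read_description(DESCRIPTION_XML)
print(semantics)
print(phrases)
```

Note how "bigger version" maps to size=large: the attribute value abstracts away from the particular wording, which is what makes the annotation semantically transparent.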
Part 2
Multilingual corpora
Why multilingual corpora?
comparative studies
syntax
morphology
…
the cornerstone of most research in machine translation
nowadays, most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.
Parallel corpora
Represent a text in its original language (L1), with a translation in another language (L2)
long history: Medieval polyglot bibles were among the first “parallel” corpora
Alignment:
Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
Sentence-level alignment can be achieved automatically with very high accuracy!
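One reason sentence alignment works well automatically is that a sentence and its translation tend to have similar lengths. The sketch below is a much-simplified length-based aligner using dynamic programming; the penalty values are arbitrary assumptions, the function name is hypothetical, and real aligners use a probabilistic length model rather than a raw character difference. The German sentences are invented toy data.

```python
def align_sentences(source, target, skip_penalty=20, merge_penalty=5):
    """Align two sentence lists by character length using dynamic programming.

    Allowed groupings: 1-1, 1-0, 0-1, 2-1 and 1-2 sentences.
    Returns a list of (source_indices, target_indices) pairs.
    """
    def length(sentences):
        return sum(len(s) for s in sentences)

    # (sentences consumed on the source side, on the target side, fixed penalty)
    beads = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                mismatch = abs(length(source[i:ni]) - length(target[j:nj]))
                candidate = cost[i][j] + mismatch + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the alignment.
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return alignment[::-1]


english = ["Sophie was on her way home.", "She was tired.", "It was raining heavily."]
german = ["Sofie war auf dem Heimweg und sie war müde.", "Es regnete stark."]  # invented toy data
for src, tgt in align_sentences(english, german):
    print(src, "<->", tgt)
```

Allowing 2-1 and 1-2 groupings matters because translators often merge or split sentences, so a strict one-to-one pairing would drift out of step.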
Example: SMULTRON corpus
Developed and released in 2007-8
Relatively small
Aligned texts in English, Swedish and German
E.g. Sophie’s World is one of the texts
Annotated with syntax, POS, morphology
Comes with a tool to view parallel syntactic trees.
SMULTRON example: English (Sophie’s World)
<s id="s3">
<terminals>
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_4" word="on" pos="IN" morph="--"/>
<t id="s3_5" word="her" pos="PRP$" morph="--"/>
<t id="s3_6" word="way" pos="NN" morph="--"/>
<t id="s3_7" word="home" pos="RB" morph="--"/>
<t id="s3_8" word="from" pos="IN" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
<t id="s3_10" word="." pos="." morph="--"/>
</terminals>
</s>
This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).
SMULTRON: Same sentence in German
<s id="3">
<terminals>
<t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
<t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
<t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
<t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
<t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
<t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
<t id="s3_10" word="." pos="$." morph="--" lemma="--" />
</terminals>
</s>
Note: richer morphology, representation of lemmas, …
Translation corpora
Not parallel. Have different texts in two or more different languages, of the same genre.
Examples: PAROLE corpus is a translation corpus for EU languages
Why translation corpora?
Parallel corpora, by definition, contain translations
translation (L2) can give rise to errors
artificiality and translation quality can be an issue
e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”
Problem can be overcome if the texts used are professionally translated.
Translation corpora have texts in two or more languages, “in the original”. Data is more natural.
Summary
We have now concluded our initial incursion into:
corpus construction
corpus annotation
corpus typology
Next up: using corpora for linguistic research