LIN 3098 Corpus Linguistics – Lecture 4

Albert Gatt

In this lecture

Levels of annotation
Corpus typology
  classification based on type and levels of annotation
  multilingual corpora

Part 1

Levels of corpus annotation (cont'd)

Levels of linguistic annotation
part-of-speech (word-level)
lemmatisation (word-level)
parsing (phrase & sentence-level)
semantics (multi-level)
  semantic relationships between words and phrases
  semantic features of words
discourse features (supra-sentence level)
phonetic transcription
prosody

Lemmatisation

Groups morphological variants of a word under the head word: mexa (walk)
  imxejt (I walked)
  imxejna (we walked)
  nimxu (we walk)
  ...
Together, these form a lemma.

Increasingly common these days.
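The grouping idea can be sketched in a few lines of Python. This is only a toy illustration: the lookup table contains just the three forms listed above, and the function name `group_by_lemma` is made up for this sketch. A real lemmatiser would need to cover the full paradigm.

```python
from collections import defaultdict

# Toy lookup table: inflected form -> head word (lemma).
# Only the forms from the slide are listed; this is not a real lexicon.
FORM_TO_LEMMA = {
    "imxejt": "mexa",   # I walked
    "imxejna": "mexa",  # we walked
    "nimxu": "mexa",    # we walk
}

def group_by_lemma(tokens):
    """Group tokens under their lemma; unknown forms act as their own lemma."""
    groups = defaultdict(list)
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower(), token.lower())
        groups[lemma].append(token)
    return dict(groups)

print(group_by_lemma(["imxejt", "nimxu", "imxejna"]))
```

Searching a lemmatised corpus for the head word then retrieves every variant in one go.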

Lemmatisation example: the SUSANNE corpus
Format: word + tag + lemma

A05:0030.33 - VVDv said say

  A05:0030.33  corpus file:sentence.word
  VVDv         POS tag
  said         actual word
  say          headword (lemma)

Every word in the corpus is on a separate line.
Extremely useful for lexicography.
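A line in this format is easy to read programmatically. The sketch below assumes exactly the five whitespace-separated fields shown in the example. The field name `status` for the "-" column is an assumption, since the slide does not label it, and `SusanneToken` and `parse_line` are names invented for this illustration.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. "A05:0030.33"
    status: str     # unlabelled on the slide ("-" in the example); name is an assumption
    tag: str        # POS tag
    word: str       # actual word
    lemma: str      # headword (lemma)

def parse_line(line):
    """Split one corpus line into its fields (assumes five fields, as on the slide)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return SusanneToken(*fields)

tok = parse_line("A05:0030.33 - VVDv said say")
file_id, position = tok.reference.split(":")
print(file_id, position, tok.tag, tok.word, tok.lemma)
```

With one token per line, collecting all word forms of a given lemma is a matter of filtering on the last field.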

Automatic morphological analysis

For some languages, there are reasonably good lemmatisers/morphological analysers.

Examples for English:
  morpha: built at the University of Sussex
  EngTwol: commercial, by LingSoft.

EngTwol output

undeniable: "undeniable" <DER:ble> A ABS

  <DER:ble>  derived with -ble suffix
  A          adjective
  ABS        absolute form

This is a rule-based analyser. There are others which use corpus-derived statistical patterns.

Semantic annotation I: Two types
markup of semantic relations (e.g. predicate-argument structure)
  currently used in parsed corpora, to mark up function-argument structures etc.
markup of features of word meaning (mainly, word senses)
  has origins in content analysis, to arrive at conclusions about how prominent particular concepts are
  now used in a lot of work on word sense disambiguation

Example of type 1 semantic markup (Penn Treebank)

(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw (NP the ball))))))

Predicate Argument Structure: wants(Chris, throw(Chris, ball))

Empty embedded subject (*-1) linked to NP subject no. 1
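To see how the co-indexing can be used, here is a minimal Python sketch that reads a bracketed tree and fills the empty subject with its antecedent. It assumes Penn Treebank-style notation in which the index is written as a suffix on the label (NP-SBJ-1) and the trace as *-1. All function names are invented for this illustration.

```python
import re

def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token            # a leaf: a word or a trace
        node = [tokens[pos]]        # the item after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # consume ")"
        return node

    return read()

def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]

def find_antecedents(tree, table):
    """Record, for each index N on a label such as NP-SBJ-1, the words it covers."""
    if isinstance(tree, list):
        match = re.fullmatch(r".*-(\d+)", tree[0])
        if match:
            table[match.group(1)] = " ".join(leaves(tree))
        for child in tree[1:]:
            find_antecedents(child, table)
    return table

def resolve_traces(tree, table):
    """Replace every trace *-N with the words of the constituent indexed N."""
    if isinstance(tree, str):
        match = re.fullmatch(r"\*-(\d+)", tree)
        return table[match.group(1)] if match else tree
    return [tree[0]] + [resolve_traces(child, table) for child in tree[1:]]

SENTENCE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"
tree = parse_brackets(SENTENCE)
antecedents = find_antecedents(tree, {})
print(resolve_traces(tree, antecedents))
```

Once the trace is resolved, both verbs have an overt subject, which is what the predicate-argument structure wants(Chris, throw(Chris, ball)) expresses.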

Semantic markup type 2: lexical features
Most common type: word-sense tagged corpora
Main idea: disambiguate a word in context by tagging its sense
Often uses WordNet (Miller et al 1993)
  WordNet is a lexical taxonomy which represents lexical relations between a large number of words, including hyponymy (IS-A) relations etc.
  For each entry, all the (supposed) senses of the word are given.
  Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.

WordNet senses: Move (noun)

(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")

(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")

(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")

(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")

(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)

WordNet senses: Move (verb)

(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")

(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")

(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")

(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
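A very simple way to produce a sense tag is to always pick the most frequent sense. The Python sketch below does this for the noun senses of "move", assuming that the numbers in parentheses in the listing are frequency counts. The sense identifiers (`move-noun-1` etc.) and the `word/sense` tag format are invented for this illustration and are not WordNet's own pointers.

```python
# Noun senses of "move": (hypothetical sense id, count from the listing, shortened gloss).
NOUN_SENSES = [
    ("move-noun-1", 377, "the act of deciding to do something"),
    ("move-noun-2", 70, "the act of changing your residence or place of business"),
    ("move-noun-3", 57, "a change of position that does not entail a change of location"),
    ("move-noun-4", 30, "the act of changing location from one place to another"),
    ("move-noun-5", 5, "a player's turn to take some action permitted by the rules of the game"),
]

def most_frequent_sense(senses):
    """Return the (id, count, gloss) entry with the highest count."""
    return max(senses, key=lambda sense: sense[1])

def tag_token(word, senses):
    """Attach a pointer to the chosen sense, in an invented word/sense-id format."""
    sense_id, _count, _gloss = most_frequent_sense(senses)
    return f"{word}/{sense_id}"

print(tag_token("move", NOUN_SENSES))
```

This baseline ignores the context entirely. Real word sense disambiguation uses the surrounding words to choose among the listed senses.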

Check it out!

WordNet is freely available for download:

http://wordnet.princeton.edu/

Word sense annotation: other uses
tagging words with their semantic field (Wilson 1996)
  plant life
  men’s clothing
  …
tagging words with their “emotional” content (Campbell & Pennebaker 2002)
  based on a dictionary:
    social processes
    negative emotions
This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which
  analyses a text and comes up with a profile of its personal/emotional content
  relates this to some features of its author (gender, age…)
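The dictionary-based idea can be sketched as follows in Python. The two category names come from the slide, but the word lists are invented for this illustration and are not taken from the actual LIWC dictionary. The function name `profile` is also an assumption.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: category -> words that count towards it.
CATEGORY_WORDS = {
    "social processes": {"talk", "friend", "we", "share"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text):
    """Return, per category, the proportion of tokens in the text that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORY_WORDS.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {category: counts[category] / total for category in CATEGORY_WORDS}

print(profile("We talk when I feel sad"))
```

The resulting proportions form the kind of text profile that can then be related to properties of the author.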

Discourse annotation
Most common:
  text-level things such as paragraphs
Less common:
  anaphoric NPs and reference (cf. example from lecture 3)
Even less common:
  annotation of words which function as discourse cues (Stenstrom 1984): apology (sorry), hedges (sort of), etc.
  annotation of rhetorical structure

Discourse: Annotating rhetorical structure (I)
Rhetorical Structure Theory (Mann and Thompson 1988):
  views text as made up of “discourse units”
  units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc.
CONCESSION/CONTRAST relation:
  [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
  Second unit is the main one (nucleus)
  First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.
Recent corpus (Marcu et al 2003) is annotated with this information.

Phonetic transcription

Not many phonetically transcribed corpora. MARSEC corpus is one of the best known.

This is a version of the Lancaster/IBM Spoken English Corpus.

Several databases of transcribed speech, however. Mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).

Annotating suprasegmentals
Aims: capture suprasegmental features such as stress, intonation and pauses in speech.
Some transcription systems exist:
  ToBI (American)
  Tonic Stress Marker (TSM; British)
These define ways of annotating suprasegmentals such as start/end of tone group, simultaneous speech, rise-fall tone, falling tone, etc.

Problem-oriented tagging

If you’re interested in a particular problem, and no corpus exists, build your own!

Many corpora define problem-specific annotation schemes.

Example: the TUNA Corpus
Problem: How do people refer to objects using definite NPs?
  Main interest: visual properties (colour, size etc)
  Focus: semantics of definite NPs, i.e. what people choose to include in their description.
Method:
  experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
  annotation of descriptions based on semantics

TUNA Corpus: description

<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>

Red sofa, bigger version.
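Because the annotation is XML, the semantic content of a description can be pulled out with a standard parser. The following Python sketch uses the standard-library `xml.etree.ElementTree` module on the description above. The function name `attribute_values` is invented for this illustration, and the sketch assumes the element and attribute names exactly as shown in the example.

```python
import xml.etree.ElementTree as ET

TUNA_DESCRIPTION = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""

def attribute_values(xml_text):
    """Return (attribute name, semantic value, surface wording) for each ATTRIBUTE."""
    root = ET.fromstring(xml_text.strip())
    return [
        (attr.get("NAME"), attr.get("VALUE"), (attr.text or "").strip())
        for attr in root.iter("ATTRIBUTE")
    ]

for name, value, wording in attribute_values(TUNA_DESCRIPTION):
    print(f"{name} = {value}  (said as: {wording!r})")
```

Note how the wording "bigger version" is mapped to the value "large": the annotation records what was meant, not just what was said.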

Features of the corpus:

1. represents the “target” referent

2. also represents the “distractors” (from which the target must be distinguished)

3. semantically transparent: annotation goes beyond language

Part 2

Multilingual corpora

Why multilingual corpora?
comparative studies
  syntax
  morphology
  …
the cornerstone of most research in machine translation
  nowadays most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.

Parallel corpora
Represent a text in its original language (L1), with a translation in another language (L2)
  long history: Medieval polyglot bibles were among the first “parallel” corpora
Alignment:
  Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
  Sentence-level alignment can be achieved automatically with very high accuracy!
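One family of automatic methods relies on the observation that long sentences tend to be translated by long sentences. The Python sketch below is a much simplified length-based aligner using dynamic programming. It is not necessarily the method the lecture has in mind. The cost function, the penalties in `MOVES`, and the toy sentences are all assumptions made for this illustration.

```python
# Allowed alignment moves: (source sentences, target sentences, penalty).
# The penalty values are illustrative assumptions, not tuned parameters.
MOVES = [(1, 1, 0.0), (1, 0, 1.0), (0, 1, 1.0), (2, 1, 0.2), (1, 2, 0.2)]

def length_cost(src_chunk, tgt_chunk):
    """Relative difference in character length between two groups of sentences."""
    a = sum(len(s) for s in src_chunk)
    b = sum(len(t) for t in tgt_chunk)
    return abs(a - b) / max(a + b, 1)

def align(src, tgt):
    """Return the cheapest sequence of (source chunk, target chunk) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj, penalty in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

# Invented toy sentences, not taken from any corpus.
english = ["Hello.", "This is a much longer second sentence for testing."]
german = ["Hallo.", "Dies ist ein viel längerer zweiter Satz zum Testen."]
for src_chunk, tgt_chunk in align(english, german):
    print(src_chunk, "<->", tgt_chunk)
```

The 2-1 and 1-2 moves let the aligner cope with a translator merging or splitting sentences. The 1-0 and 0-1 moves cover sentences left untranslated or added.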

Example: SMULTRON corpus
Developed and released in 2007-8

Relatively small

Aligned texts in English, Swedish and German
  E.g. Sophie’s World is one of the texts

Annotated with syntax, POS, morphology

Comes with a tool to view parallel syntactic trees.

SMULTRON example: English (Sophie’s World)

<s id="s3">
  <terminals>
    <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
  </terminals>
</s>

This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).

SMULTRON: Same sentence in German

<s id="3">
  <terminals>
    <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
    <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
    <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
    <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
    <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
    <t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
    <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
    <t id="s3_10" word="." pos="$." morph="--" lemma="--" />
  </terminals>
</s>

Note: richer morphology, representation of lemmas, …
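As a small illustration of what the lemma attribute makes possible, the Python sketch below reads a few of the German terminals and lists the tokens whose surface form differs from the lemma. It uses the standard-library `xml.etree.ElementTree` module. Only three tokens from the example are included to keep it short, and the function names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Three terminals copied from the German example; the rest are omitted for brevity.
SENTENCE = """
<s id="3">
  <terminals>
    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
    <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der"/>
    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg"/>
  </terminals>
</s>
"""

def read_terminals(xml_text):
    """Return one dict of attributes (id, word, pos, morph, lemma) per <t> element."""
    root = ET.fromstring(xml_text.strip())
    return [dict(t.attrib) for t in root.iter("t")]

def inflected_tokens(terminals):
    """Keep the (word, lemma) pairs where the surface form is not the lemma itself."""
    return [(t["word"], t["lemma"]) for t in terminals if t["word"] != t["lemma"]]

print(inflected_tokens(read_terminals(SENTENCE)))
```

The same query could not be run on the English example as shown, since its terminal carries no lemma attribute.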

Translation corpora

Not parallel.
Have different texts in two or more different languages, of the same genre.

Examples:
  PAROLE corpus is a translation corpus for EU languages

Why translation corpora?
Parallel corpora, by definition, contain translations (L2)
  can give rise to errors
  artificiality and translation quality can be an issue
  e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”.

Problem can be overcome if the texts used are professionally translated.

Translation corpora have texts in two or more languages, “in the original”. Data is more natural.

Summary

We have now concluded our initial incursion into:
  corpus construction
  corpus annotation
  corpus typology

Next up: using corpora for linguistic research
