Syntactic annotation in CGN: lessons learned and to be learned
Ineke Schuurman
Centre for Computational Linguistics
Katholieke Universiteit Leuven
15-11-2011 Paris 2
This talk ...
• Why CGN: Spoken Dutch Corpus?
• At that time …
• Other layers
  – Orthographic transcription
  – PoS tagging
• Syntactic annotation
  – Dependencies and categories
• Spoken language
  – “standard” language, disfluencies
• LASSY/SoNaR: Written Dutch Corpus
• What to take into account when planning a ‘spoken treebank’
Why CGN?
Dutch Language Union
• Dutch/Flemish organization taking care of common language
• 1997-8: report state of the art wrt Language & Speech Technology
• 1998: Spoken Dutch Corpus, 5 years, 2/3 Netherlands - 1/3 Flanders, balanced
  – 1000 hours, +/- 10M words
  – 1M: syntactic annotation
• Both research purposes and services (EU) / industry
At that time
This talk: focus on textual aspects!
--------------------------------------------------------
• No taggers, parsers that could be reused
• Existing grammars cover(ed) the northern variant of Dutch
• No ‘formal’ grammar
►start from scratch
Other layers
• Relevant for syntax:
  – Orthographic transcription
  – PoS tagging
• All layers in parallel, but per fragment: layer A finished before start of layer B (except for errors)
• Reason: time
• But: gave us the opportunity to express wishes/needs wrt other layers
• Example: handling of specific types of words
Transcription and PoS
An example:
Specific types of words
*v words in another language (not 'adopted' in Dutch)
*a not fully realized words (gaan probe instead of gaan proberen)
*x words that could not be (fully) understood (also xxx, ggg)
*u mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d dialectal words

One or more words?
zo’n vs zo ‘n (such a): one token!
But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
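
The marker convention above can be processed mechanically. A minimal sketch, assuming a marker is always a single letter after the last `*` of a token; the names `MARKERS` and `split_marker` are illustrative and not part of any CGN tool:

```python
# Marker meanings as listed on the slide; the dictionary name is hypothetical.
MARKERS = {
    "v": "word in another language",
    "a": "not fully realized word",
    "x": "word not (fully) understood",
    "u": "mispronounced word",
    "d": "dialectal word",
}


def split_marker(token):
    """Split a transcribed token into (form, marker); marker is None if absent."""
    form, star, tail = token.rpartition("*")
    if star and form and tail in MARKERS:
        return form, tail
    return token, None


if __name__ == "__main__":
    for token in "hebt*d de*d ploberen*u omdat".split():
        form, marker = split_marker(token)
        print(form, MARKERS.get(marker, "no marker"))
```

Keeping the form and the marker apart in this way is one possible route to the marker-free orthographic field discussed later in the talk.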
Syntactic analysis: goal CGN
• Annotation in theory-neutral format in order to be useful for as many people as possible
• Categories: NP, PP, …
• Functions/dependencies: subject, object1, …
• As automatic as possible:
  – Tool from NEGRA-corpus: Annotate
    – for German
    – same desiderata as CGN (contrary to Dutch AMAZON-parser)
Annotate
• Developed for NEGRA-project (Saarbrücken)
  – Oliver Plaehn, Thorsten Brants
• Semi-automatic annotation
  – Works with tagger and parser
  – Suggests structures
• Combined with Cascaded Markov Models (Brants)
  – Bootstrapping approach possible
Annotate screen
Annotate ‘correction’ format
Annotate export format
Principles of syntactic annotation
• Structures as flat as possible
• Only new level when there is a new head
• No branching when just one node is involved
• No duplication of functions (1 SU, 1 OBJ1, …)
• In principle just non-branching heads
• Allowed:
  – multiple branching
  – crossing dependencies
• Input: simplified PoS
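
Two of these principles (no phrase with a single daughter, no duplicated functions) can be checked automatically. A minimal sketch, assuming a hypothetical `Node` structure and an illustrative set `UNIQUE_FUNCTIONS`; neither reflects the Annotate data model or the full CGN label inventory:

```python
from collections import Counter
from dataclasses import dataclass, field

# Functions assumed to occur at most once per node (illustrative subset only).
UNIQUE_FUNCTIONS = {"SU", "OBJ1", "OBJ2"}


@dataclass
class Node:
    label: str                                     # category (NP, PP, ...) or word form
    children: list = field(default_factory=list)   # list of (function, Node) pairs


def violations(node):
    """Collect violations of the two principles in this node and below."""
    found = []
    if len(node.children) == 1:
        found.append(f"{node.label}: phrase with a single daughter")
    counts = Counter(function for function, _ in node.children)
    for function in sorted(UNIQUE_FUNCTIONS):
        if counts[function] > 1:
            found.append(f"{node.label}: {function} occurs {counts[function]} times")
    for _, child in node.children:
        found.extend(violations(child))
    return found


if __name__ == "__main__":
    np = Node("NP", [("DET", Node("een")), ("HD", Node("boek"))])
    tree = Node("SMAIN", [("SU", Node("ik")), ("HD", Node("geef")),
                          ("OBJ2", Node("hem")), ("OBJ1", np)])
    print(violations(tree))
```

A check of this kind only flags candidates; whether a flagged structure is really wrong remains a decision for the corrector.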
Fewer PoS tags
Simplified PoS
• PoS: over 300 tags
  – Over 100 for pronouns
  – Not problematic at all, often unique token/tag combinations
• Not all details necessary for syntactic annotation (SA)
• Example full tagset
  – T501a VNW(pers,pron,nomin,vol,1,ev) ik (I)
  – T501o VNW(pers,pron,nomin,vol,3,ev,masc) hij (he)
• Example simplified tagset
  – VNW1 VNW(pers,pron) personal pronoun
  – In graph: both T501a and VNW1
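
The reduction from the full to the simplified tag can be illustrated in code. A minimal sketch, assuming that keeping the first two features is enough; this happens to reproduce the pronoun example above, but the real simplified tagset is a project-defined table, and the function name `simplify` is hypothetical:

```python
import re

# A tag looks like POS(feature,feature,...), e.g. VNW(pers,pron,nomin,vol,1,ev).
_TAG = re.compile(r"^(?P<pos>[A-Z]+)\((?P<features>[^)]*)\)$")


def simplify(tag, keep=2):
    """Keep only the first `keep` features of a tag (an assumed rule)."""
    match = _TAG.match(tag)
    if match is None:
        return tag  # leave anything that does not look like a tag untouched
    features = match.group("features").split(",")
    kept = ",".join(features[:keep])
    return match.group("pos") + "(" + kept + ")"


if __name__ == "__main__":
    for full in ("VNW(pers,pron,nomin,vol,1,ev)",
                 "VNW(pers,pron,nomin,vol,3,ev,masc)"):
        print(full, "->", simplify(full))
```

Both pronoun tags collapse to the same simplified tag, which is the point of the simplification: the parser sees fewer distinctions, while the full tag stays available in the graph.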
Syntactic simplifications
Other simplifications
• Obj2 – indirect object (dative)
  – meewerkend voorwerp (recipient)
    • Ik geef hem een boek / een boek aan hem (I give him a book / a book to him)
  – belanghebbend voorwerp (beneficiary)
    • Ik koop hem een boek / een boek voor hem (I buy him a book / a book for him)
• Bepaling van gesteldheid (~predicative complement)
  • Hij verft de deur blauw (he paints the door blue)
  • Hij vindt het boek leuk (he likes the book)
  • Hij nam het boek lachend aan (laughing, he accepted the book)
Results
Even then:
• Annotate did most NPs and PPs very well, but often failed for the more complex parts
• In some sense surprising, as the results for German were much better.
However:
• In that case written language was involved.
Training for spoken language is much harder!
Details CGN corpus
Balanced corpus:
• Types of documents (next slide)
• Speaker characteristics
  • Sex
  • Age
  • Geographic region
  • Socio-economic class
  • Level of education
• 2/3 Netherlands, 1/3 Belgium (Flanders)
• Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)
Details CGN corpus
►many types of documents
• Read-aloud written: literature read aloud (library for the blind)
• Written to be spoken:
  • News broadcasts
  • Lectures
• Spoken (spontaneous)
  • Interviews
  • Phone calls
  • Debates
  • Spontaneous conversations with x people (over lunch etc.)
Variation
To some extent differences in written language, much more in spoken variants, esp. in spontaneous speech
• Separable verbs
  • NL dat ze hem op wilde bellen (that she wanted to call him)
  • VL dat ze hem wilde opbellen
• Other choice of auxiliaries
  • NL Ze is het komen brengen (she came and brought it)
  • VL Ze heeft het komen brengen
• Other words for same concept, same words for different concepts
  • Pompbak-gootsteen (sink), namiddag (afternoon-late afternoon)
Grammars/dictionaries: mostly northern written variant
Disfluencies
Partially realized words
hilari*a instead of hilarisch (EN hilarious)
Analyzed as if realized
***
Ik doe West- en Oost-Vlaanderen
I’ll take care of West and East Flanders
Short for: West-Vlaanderen en Oost-Vlaanderen
Completely regularly analyzed as conjunction (CONJ)
Disfluencies
When too little of a token is realized, such a token is ignored
awel genen TV meer en genen boe*a gene voetbal meer .
EN: So no more tv and no more football
Example of disfluency (repetition)
Disfluencies
Mixed repetition/correction
Ze was bijna hileri*a hilari*a
She was almost hilarious
hileri*a is corrected as hilari*a, only the corrected form is included in the analysis
Die verd*a die vervl*a die krankzinnige hond
That damn*, that cursed*, that crazy dog
Only last 3 words (that crazy dog) included in graph
Disfluencies
Wrong pronunciation
Dat is een serieus plobleem*u
Dat is een serieus probleem
That’s a serious problem
Analysed as if the ‘correct’ word was involved
***
Words in foreign language
In spoken and written language:
Words in another language, and not found in a Dutch dictionary:
umbrella*v, plus*v de*v temps*v, à la carte
not: rendez-vous, cinema, cognac (in Dutch dictionaries)
• Single words: just like their Dutch counterpart
• Strings: only ‘top’ label presented
• Sentences: not analyzed
Pro and con markings
Markings (*a, etc) have proven to be useful for PoS and SA.
But:
They should have been removed afterwards: all information should have been contained in the tags, and the orthographic level should contain only orthography
Problem: other groups wanted them at orthographic level for speech recognition purposes
Solution: add a field without markings
Syntactic annotation
Lacking and superfluous words
There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!
• Lacking elements: just accept it
• Superfluous elements: just accept it
BUT there are some exceptions:
repetition
‘accidental’ sentences
Not analyzed parts
Sometimes parts of a ‘sentence’ are ‘ignored’:
• Repairs
  Ik zie hem morg*a overmorgen
  I’ll see him the day after tomorrow
• Repetitions
  Hij is in in vergadering
  He is in a meeting
Or not connected:
• ‘accidental’ sentences/units
  Ik heb nooit ik ben lerares
  I have never I am a teacher
• Uh-insertion (hesitation marker)
  Ze heeft uh zeven dochters
  She has seven daughters
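
The two simplest cases above, immediate repetitions and uh-insertion, can be pre-marked before a corrector sees the sentence. A minimal sketch under strong assumptions: only literal repetitions of the next token and the token `uh` are detected, repairs and ‘accidental’ units are left to human judgment, and the function name `classify` and the status labels are hypothetical:

```python
def classify(tokens):
    """Label each token as 'analyzed', 'ignored' or 'unconnected'."""
    labelled = []
    for index, token in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else None
        if token.lower() == "uh":
            status = "unconnected"   # hesitation marker: kept, but not attached
        elif following is not None and token.lower() == following.lower():
            status = "ignored"       # first copy of an immediate repetition
        else:
            status = "analyzed"
        labelled.append((token, status))
    return labelled


if __name__ == "__main__":
    print(classify("Hij is in in vergadering".split()))
    print(classify("Ze heeft uh zeven dochters".split()))
```

Such pre-marking is only a heuristic: a repeated word can also be intentional, so the labels would be suggestions to confirm, not decisions.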
Examples
More of the same
Asyndetic conjunction
Discourse phenomena
Some examples of ‘discourse’ within a sentence
Accidental unit
‘Accidental’ unit, discourse
parts not connected
Syntactic annotation
sentence
vs
discourse
Atypical ‘sentences’
Often: discourse
Complicating factors
No punctuation apart from full stop, question mark, ellipsis
• ‘wrong order’ of sentences when more people are talking at the same time!
►Tricky wrt coreference, temporal reasoning etc.
Spelling: incorrect (but correct with other meaning)
• U zij de glorie (Thine be the glory)
• U zei de glorie (‘zei’ meaning ‘said’)
• Ik zal haar eraan houden (houden aan: to keep a promise)
• Ik zal haar er aanhouden (aanhouden: to arrest)
►context, recordings
Written corpus: Lassy/SoNaR
STEVIN programme (Flemish/Dutch - 2004-2011)
D-Coi / LASSY / (SoNaR)
1M SA written text, manually corrected, plus
1.500M SA automatically
ALPINO parser (Groningen)
Largely inspired by CGN, based on HPSG
Some differences
• Mentioning of ‘hidden’ subjects, objects
  – Hij heeft een boek gekocht (he has bought a book)
Alpino
• Alpino grammar: HPSG-based
• ‘Constructional’ approach:
  – rich lexical representations
  – many detailed, construction specific lexical rules (+/- 600)
• Grammar based parsing very efficient, esp when combined with specific rules
• Large lexicon (100.000+ entries, 200.000+ NEs)
  – Stored as perfect hash finite automaton (Daciuk)
• Crucial: Integrated tagger (=/= CGN tagger!)
• Left corner parser
Alpino (as is) and CGN
Parsing the CGN-corpus with Alpino
• very bad results
• reason might be: it uses a ‘wrong’ grammar, inadequate lexicon etc.
As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy-format. There are, however, still differences in the way a few phenomena are handled.
Lassy vs CGN
• Subject/direct objects wrt infinitives and participles
• Partitives (one of them said …): in CGN separate label PART, in Lassy combination of HD and MOD
• LASSY: head always lexically anchored
• In LASSY SBAR-complement always VC-label, in CGN either OBJ1 or VC
• …
Analyses not fully identical, but 99% are!
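
One of the listed differences, the label of SBAR complements, can serve as an example of what ‘translating’ CGN into the Lassy format involves. A minimal sketch, assuming daughters are given as (function, category) pairs and that every SBAR daughter labelled OBJ1 should become VC; the function name `cgn_to_lassy` and the data layout are hypothetical, and the actual conversion covers more phenomena than this one rule:

```python
def cgn_to_lassy(edges):
    """Relabel the daughters of one node; edges is a list of (function, category)."""
    converted = []
    for function, category in edges:
        # CGN allows OBJ1 or VC for SBAR complements; Lassy always uses VC.
        if category == "SBAR" and function == "OBJ1":
            function = "VC"
        converted.append((function, category))
    return converted


if __name__ == "__main__":
    print(cgn_to_lassy([("SU", "NP"), ("OBJ1", "SBAR")]))
```

Rules of this shape are cheap to apply, which fits the observation that the vast majority of analyses end up identical after conversion.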
Syntactic annotation: Lassy
Syntactic annotation: CGN
To be taken into account
In general:
• Take care of IPR
• Be prepared to consult other layers
• Use a flexible bug reporting system
• “Spoken language”: grammar/system should be very flexible
• Alignment may be very time consuming
Be aware that, as far as consistency is concerned, the really hard cases are not the most important ones. The cases that matter most are those the correctors do not recognize as problematic (because in those cases they do not consult others)
GOOD LUCK!