Annotation of the video data in the Corpus NGT · 2008. 11. 20. · The conventions used in...



Annotation of the video data in the Corpus NGT

Onno Crasborn & Inge Zwitserlood, November 2008

Department of Linguistics & Centre for Language Studies Radboud University Nijmegen

[email protected] http://www.corpusngt.nl

1. Introduction

All of the video files in the Corpus NGT have been provided with ELAN annotation files in the first release of the corpus in December 2008. Only a small number of these files (160 out of 2375) actually contain annotations. The other ELAN files are empty, yet contain the same tiers and linguistic type definitions as the annotated documents. Thus, they will facilitate use of the corpus in a later phase when annotations are added.

The conventions used in creating the files as well as the glossing conventions are documented here. In addition, we describe some Perl scripts that have been created to complement the functionality in version 3.6 of ELAN. All of the annotation files (as well as the video and audio files) are subject to the Creative Commons License ‘BY-NC-SA’. For more information, see the corpus website.

2. Specifications for Linguistic Types and Tiers

The EAF files in the Corpus NGT contain tiers that are relevant for a large user group. Only the gloss tiers contain annotations, and only for a restricted set of movies. The specifications of the files are available as an ELAN template (CorpusNGT.etf), which can be used for creating comparable EAF files for new media files. There are also several scripts that can be used for generating large numbers of annotation files; these scripts are described in the section ‘Perl scripts’ below. An advantage of using the scripts is that the labels for ‘annotator’ and ‘participant’ can be specified for the various tiers in a large number of files by means of a source file. The scripts were tailor-made for the Corpus NGT and may need to be adapted for your own purposes.

The following specifications for Linguistic Types are the basis for the tiers in the files and template:

gloss


translation
remarks
research

These Types have the same specifications; the difference in the names facilitates searching in a specific group of tiers (e.g. only gloss tiers). The specifications for all of these Linguistic Types are as follows:

Stereotype: none
Use controlled vocabulary: none
ISO data category: not used
Reference to graphics allowed: no

Tiers with these Linguistic Types have been created for every video file in the Corpus NGT. The labels “S1” and “S2” have been systematically used for the signer on the left and on the right, respectively. Although the user can change the order of the signers in the ELAN file or choose to show only one signer, in the Corpus NGT special care was taken to set the order in such a way that, in the upper body views, the two signers appear to be turned towards each other. In reality the two signers were always opposite each other, but the camera was always oriented at a small angle from the signer. This order is achieved by ordering the linked media files in the EAF file.

In addition, the tier characteristics contain a reference to the code of the signer in the Corpus NGT (e.g. S031). The tier characteristics contain the following categories:

Participant: the corpus code for a particular signer (S001, S002, etc.);
Annotator: the initials of the annotator; this category is as yet empty for tiers other than the gloss tiers;
Tier label colour: two values have been used, in order to visually distinguish the tiers for the signer to the left and the signer to the right in the timeline viewer:
Left: RGB 0,0,51 (dark blue)
Right: RGB 0,153,0 (bright green)

Most properties of each annotation file are stored in the EAF file; some visual properties, such as the tier order and the colour of tier labels in the timeline viewer, are stored in the preferences file (extension .pfsx).

Every EAF file contains the following tiers:

GlosL S1
GlosR S1
GlosL S2
GlosR S2

These four tiers contain the glosses for the activities of the left hand (GlosL) and the right hand (GlosR), respectively, of the signer to the left (S1) and the signer to the right (S2). The conventions used in these glosses are described below in the section ‘Gloss conventions’.
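As an illustration of how the gloss tiers and their tier characteristics could be read from a file, here is a minimal Python sketch. It is not one of the Corpus NGT scripts. It assumes the usual EAF layout in which each tier is a TIER element carrying TIER_ID, LINGUISTIC_TYPE_REF, PARTICIPANT and ANNOTATOR attributes. The sample document, the signer codes, the annotator initials "XX" and the linguistic type assigned to the Opmerkingen tier are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document (real files contain a header,
# time slots, linguistic type definitions and more tiers).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="GlosL S1" LINGUISTIC_TYPE_REF="gloss" PARTICIPANT="S031" ANNOTATOR="XX"/>
  <TIER TIER_ID="GlosR S1" LINGUISTIC_TYPE_REF="gloss" PARTICIPANT="S031" ANNOTATOR="XX"/>
  <TIER TIER_ID="GlosL S2" LINGUISTIC_TYPE_REF="gloss" PARTICIPANT="S032" ANNOTATOR="XX"/>
  <TIER TIER_ID="GlosR S2" LINGUISTIC_TYPE_REF="gloss" PARTICIPANT="S032" ANNOTATOR="XX"/>
  <TIER TIER_ID="Opmerkingen S1" LINGUISTIC_TYPE_REF="remarks" PARTICIPANT="S031"/>
</ANNOTATION_DOCUMENT>"""


def gloss_tiers(eaf_xml):
    """Return (tier id, participant, annotator) for every tier of type 'gloss'."""
    root = ET.fromstring(eaf_xml)
    result = []
    for tier in root.iter("TIER"):
        if tier.get("LINGUISTIC_TYPE_REF") == "gloss":
            result.append((
                tier.get("TIER_ID"),
                tier.get("PARTICIPANT", ""),
                tier.get("ANNOTATOR", ""),  # empty when no annotator is recorded
            ))
    return result


if __name__ == "__main__":
    for tier_id, participant, annotator in gloss_tiers(SAMPLE_EAF):
        print(tier_id, participant, annotator)
```

Selecting on the linguistic type rather than on the tier name mirrors the reason given above for having separately named types: it makes it easy to address one group of tiers at a time.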

Tolk S1


Tolk S2
Interpreter S1
Interpreter S2

These four tiers are intended to contain a transcript of the interpreter’s Dutch voice-over, in Dutch (Tolk) and in English (Interpreter). Currently, only the Dutch voice-over is available, for a limited number of movie files.

Note: if a translation of the video materials is made, new tiers ‘Vertaling’ and ‘Translation’ should preferably be used, to highlight the fact that the voice-over by the interpreter(s) was done simultaneously with the signing during the movie, and is not a sentence-by-sentence translation.

Opmerkingen S1
Opmerkingen S2

These two tiers can be used for remarks on the signers to the left and right, respectively, or on their signs.

3. Perl scripts

A number of Perl scripts have been created for changing information in all ELAN and/or PFSX files in one folder. They are available at the corpus website. Be careful when using these scripts: they have not been tested on material other than that available for the Corpus NGT. Please read the manual carefully, and be sure to make a backup of all the ELAN/PFSX files in the folder before applying a script.

The scripts have been tested on the whole set of files in the Corpus NGT that existed in 2008. It is impossible to guarantee that they will work on all future ELAN files or with future versions of the ELAN/PFSX specifications. They presuppose the file naming conventions of the Corpus NGT project; see the manuals for details. Always test the resulting files. Knowledge of XML is a prerequisite, so that the input and output of the scripts can be inspected in relation to the script.

The following scripts are available:

• AddLinguisticType.pl
Adds a Linguistic Type in all files in a folder.
• AddTier.pl
Adds a tier in all files in a folder.
• EafCopy.pl
Creates EAF files for all media files in a folder.
• EafCopy2.pl
Creates EAF files based on a dummy file, each linked to two media files (body view of each signer).
• EafCopy3.pl


Creates EAF files based on a dummy file, each linked to four media files (body and face views of each signer).

• PfsxCopy.pl
Creates PFSX files based on a dummy file for all media files in a folder.
• CorrectAnnotation.pl
Changes annotation values in all EAF files in a folder on the basis of a text file, in which the first column contains the existing value of the annotation and the second column the new value. Attention: the script applies to all tiers in a file.
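To make the behaviour described for CorrectAnnotation.pl concrete, the following Python sketch shows one way such a replacement could work on a single EAF document. It is not the Perl script itself and makes several assumptions: the two columns of the text file are separated by a tab, an annotation is changed only when its whole value matches the first column, and the sample EAF fragment is hypothetical. The function names `parse_mapping` and `correct_annotations` are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document with two gloss tiers.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="GlosL S1" LINGUISTIC_TYPE_REF="gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>HOND</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="GlosR S1" LINGUISTIC_TYPE_REF="gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>HOND</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>BOEK</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def parse_mapping(text):
    """Parse 'old<TAB>new' lines into a dict (tab separation is an assumption)."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        old, new = line.split("\t", 1)
        mapping[old.strip()] = new.strip()
    return mapping


def correct_annotations(eaf_xml, mapping):
    """Replace whole annotation values on all tiers; return (new XML, count)."""
    root = ET.fromstring(eaf_xml)
    changed = 0
    for value in root.iter("ANNOTATION_VALUE"):
        if value.text in mapping:
            value.text = mapping[value.text]
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    new_xml, count = correct_annotations(SAMPLE_EAF, parse_mapping("HOND\tKAT\n"))
    print(count, "annotation(s) changed")
```

As with the Perl script, the replacement here applies to every tier in the document, so a value that should change on the gloss tiers only would need an additional check on the tier's linguistic type.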

Possibly, these and comparable functionalities for managing larger numbers of EAF files will be built into ELAN in the future.

4. Annotation conventions

4.1 Introduction

The glosses in the annotation files in the Corpus NGT are intended to indicate the exact start and end time of the signs, as well as to refer to a lexicon. Thus, the glosses are not actual translations; in the ideal case they are pointers to lemmas in a lexicon. Because there is no common orthography for sign languages, nor a practical, widely used phonetic notation system, Dutch words have been used as a reference. They approximate (one of) the meaning(s) of the signs; however, the real meanings of the sign forms are described in the lexicon, not by the gloss. Exceptions to this rule are non-lexicalised forms that, in the gloss, are preceded by an “@” character (see convention 4 below).

Although it was our intention to use glosses referring to a lexicon, for reasons of efficiency it was not always possible to consult the lexicons of the Dutch Sign Centre (NGc) on DVD or on the internet. Because of this, the glosses in the first release will contain many inconsistencies; users of the annotations should be aware of this. Typos and spelling mistakes have been corrected as much as possible in all of the files. However, only part of the material has been checked by a second annotator. It is, therefore, expected that many files contain a number of inconsistencies as well as interpretation differences and mistakes. When the files are used, these can be repaired as much as possible, although no standard procedure for this is available as yet. This will need to be developed in the near future. For further information, users can contact the corpus managers ([email protected]).

The glosses relate only to manual activity, not to body or facial activity, even though body and face often express meaning. E.g. when the signer makes a manual sign accompanied by a head shake, only the manual sign has been referred to in the gloss, not the negation.

4.2 Gloss conventions


1. All signs are provided with glosses. A gloss is usually one Dutch word. Glosses are in capitals.

o E.g.: HOND (DOG)

2. There is a separate tier for each hand. If a sign is made with the left hand, it is annotated in the GlosL tier; if a sign is made with the right hand, it is annotated in the GlosR tier. If a sign is made with both hands, it is annotated in both the GlosL tier and the GlosR tier.

o E.g.: [figure not reproduced: example glosses on the left hand and right hand tiers]

The latter holds irrespective of whether both hands move or only one hand moves (so also for the NGT sign for KOFFIE (COFFEE)).

3. Some signs have a fixed form and meaning but cannot be labelled with one Dutch word. In those cases, the signs are annotated by a (fixed) combination of Dutch words (where possible, the descriptions of the Dutch Sign Centre were used). These words are linked by underscores.

o E.g.: FLUITJE_VAN_EEN_CENT (PIECE_OF_CAKE); NOG_NIET (NOT_YET); HET_EENS_ZIJN_MET (AGREE_WITH)

4. For some signs it proved very difficult to find a good equivalent Dutch word or fixed word combination, mainly because these signs combine many meanings simultaneously. In these cases a description of the meaning of the sign was given, in lower case and preceded by the “@” character. In general, this concerns less lexicalized or morphologically complex productive forms.

o E.g.: @schapen lopen de heuvel op. (@sheep go up the hill.)

5. Glosses have been assigned as consistently as possible (same sign, same gloss). However, some signs differ only because of a different mouthing. In those cases different glosses were used, especially when these signs were separate items in the lexicons.

o E.g.: “BROER” vs. “ZUS” (“BROTHER” vs. “SISTER”): different glosses because of the different mouthings.

6. The start and end of each sign have been indicated carefully. The following criteria have been used to determine sign boundaries:

A sign starts:

• at the first frame in which the hand starts to move away from the initial location of the sign to the final location of the sign; or

• (in case the hand does not move through space): at the first frame in which the handshape starts to change, e.g. closing the hand in the sign for “MAN”; or


• (in case the hand does not move through space and the handshape does not change): at the first frame in which the orientation of the hand starts to change, e.g. turning the hand in the sign for “OVERLEDEN” (“PASS_AWAY”);

A sign ends:

• at the first frame in which the handshape starts to change after the sign was finished; or

• at the first frame in which the hand starts to move away from the final location of the sign.

7. In two-handed signs the hands do not always move in exactly the same way. Often one hand stays in a particular position after the sign has ended, while the other hand goes on to the next sign. Or one hand starts to move or change slightly before the other hand does. The exact duration of the sign is indicated for each hand on the GlosL and GlosR tiers, independent of the duration of the other hand.

8. Compound signs have been glossed “literally”, with a “^” character between the parts:

o E.g.: “ONDERWIJS^PERSOON” instead of “ONDERWIJZER” (“TEACH^PERSON” instead of “TEACHER”)

9. If a sign is glossed as a verb, the infinitive form is always used:

o E.g.: “LOPEN” instead of “LOOPT” or “LIEP” (“WALK” instead of “WALKS” or “WALKED”)

10. There is a sign in NGT that is used to draw the addressee’s attention. That sign is glossed as “HEE” (“HEY”).

11. There is a sign that is difficult to translate or describe because it has many meanings/functions. It has the following form: the palm(s) of the hand(s) are oriented upwards; sometimes there is also a (small) movement upwards or downwards. This sign is glossed as “PO” (Palm van de hand Omhoog; Palm of hand Up).


12. Pointing signs carry the gloss “INDEX”. If the signer points to him/herself, the gloss is: “INDEX-1”. If a signer consecutively points to different locations, separate glosses “INDEX” are used. However, if a signer uses an arc movement to point to several signs, there is one gloss “INDEX”; this gloss spans the duration of the arc movement in the sign. If the signer consecutively points several times to the same location, this is annotated with one gloss “INDEX”, that has the duration of the sequence of pointing signs to that same location.

13. There are signs made with index and middle finger, with a meaning of “togetherness”. These have been glossed as follows:

o WIJ_TWEEEN (THE_TWO_OF_US)
o JULLIE_TWEEEN (THE_TWO_OF_YOU)
o ZIJ_TWEEEN (THE_TWO_OF_THEM)

14. Signs for numbers have been glossed in digits. E.g.:

o 312, not: DRIEHONDERD_TWAALF (THREE_HUNDRED_TWELVE)
o 2e, not: TWEEDE (SECOND)

15. Counting while using a number on the non-dominant hand has been glossed as TEN_EERSTE, TEN_TWEEDE, etc. (FIRST, SECOND), not ‘TEN_2e’ (SECOND) or ‘=ten tweede’ (=second) or INDEX. These glosses have been used on both the left hand and right hand tiers.

16. If a signer uses fingerspelling (e.g. a name) all spelled letters are glossed, preceded by a “#” character:


o E.g.: “#INGE”

17. If a signer fingerspells only one letter but simultaneously mouths the word (e.g. a name), only the spelled letter has been glossed, preceded by the “#” character.

o E.g.: [figure not reproduced; surviving label: “Inge”]

18. In some cases a signer expresses two things simultaneously in one manual action (e.g. a combination of two signs). In those cases the glosses of both signs have been annotated, separated by a “+” character.

o E.g.: [figure not reproduced]


19. If a gloss is preceded by a question mark, this indicates that the annotator was not sure in his/her interpretation of the manual activity and that a second opinion is necessary.

o E.g.: ?BOEK (?BOOK)

20. An annotation containing only two question marks indicates that the annotator has recognized manual activity as a sign, but does not know the sign and has not been able to find it in any lexicon.

o E.g.: ??

21. If a gloss is preceded by a “~” character, this indicates that the annotator has clearly recognized the sign, but that the sign is not well-formed (e.g. signed sloppily). This character may also have been used when the sign is actually wrong (or even a different sign).

o E.g.: ~AMSTERDAM (while the signer actually articulates the sign as “MAAL” or “KEER” (“TIMES”), i.e. without repetition of the contacting movement while moving from high to low in space).

22. For a few frequent signs for well-known terms with long glosses, abbreviations are used:

• CI (Cochleair Implantaat; Cochlear Implant)
• NmG (Nederlands ondersteund met gebaren; Sign Supported Dutch)
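The special characters introduced in the conventions above can be summarised in a small Python sketch that inspects a single gloss. This is an illustration only, not part of the corpus tooling. The function name `describe_gloss`, the category names and the order in which the characters are checked are assumptions; in particular, the conventions do not state whether the “?” and “~” prefixes can be combined with each other or with “@” and “#”, and the sketch simply allows it.

```python
def describe_gloss(gloss):
    """Classify one gloss string according to the special characters it uses."""
    info = {"kind": "lexical", "parts": [gloss],
            "uncertain": False, "ill_formed": False}

    # Convention 20: '??' alone marks a sign the annotator could not identify.
    if gloss == "??":
        info["kind"] = "unknown"
        info["parts"] = []
        return info

    # Conventions 19 and 21: '?' (unsure interpretation) and '~' (not well-formed).
    text = gloss
    while text[:1] in ("?", "~"):
        if text[0] == "?":
            info["uncertain"] = True
        else:
            info["ill_formed"] = True
        text = text[1:]

    if text.startswith("@"):        # convention 4: description of the meaning
        info["kind"], info["parts"] = "description", [text[1:]]
    elif text.startswith("#"):      # conventions 16 and 17: fingerspelling
        info["kind"], info["parts"] = "fingerspelling", [text[1:]]
    elif "+" in text:               # convention 18: two signs in one manual action
        info["kind"], info["parts"] = "simultaneous", text.split("+")
    elif "^" in text:               # convention 8: compound sign
        info["kind"], info["parts"] = "compound", text.split("^")
    elif text.startswith("INDEX"):  # convention 12: pointing sign
        info["kind"], info["parts"] = "pointing", [text]
    elif text[:1].isdigit():        # convention 14: numbers glossed in digits
        info["kind"], info["parts"] = "number", [text]
    else:                           # plain gloss, including underscore combinations
        info["parts"] = [text]
    return info


if __name__ == "__main__":
    for example in ["HOND", "NOG_NIET", "?BOEK", "??", "~AMSTERDAM",
                    "ONDERWIJS^PERSOON", "#INGE", "INDEX-1", "2e"]:
        print(example, describe_gloss(example))
```

Glosses linked by underscores, such as NOG_NIET, are deliberately left as one part, since convention 3 treats them as a single fixed combination.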

Contact

For more information, contact the corpus managers at [email protected]. See also the corpus website for new information: http://www.corpusngt.nl.