Czech Malach Cross-lingual Speech Retrieval Test Collection

Petra Galuščáková
Institute of Formal and Applied Linguistics, Charles University in Prague
5. 3. 2016


2

USC Shoah Foundation's Visual History Archive

● Established to collect and preserve the testimonies of survivors and other witnesses of the Holocaust

● Founded in 1994 by Steven Spielberg
● Interviews with Jewish survivors, Roma and Sinti survivors, liberators, survivors of the eugenics policies, political prisoners, aid providers, homosexual survivors, war crimes trials participants, ...
● Almost 52,000 videotaped testimonies in 56 countries and 32 languages collected between 1994 and 2000
● One of the largest available audio-visual archives

● http://sfi.usc.edu/

3

Malach Centre for Visual History

● Provides local access to the digital archives of the USC Shoah Foundation

● Need to retrieve relevant segments of interviews
● Provide a test collection for the retrieval system created in the Malach project
● http://ufal.mff.cuni.cz/cvhm

4

Czech Malach Cross-lingual Speech Retrieval Test Collection

● 353 audio recordings (592 hours of audio) randomly selected from the set of Czech interviews

● Four automatic transcripts by different providers
● Manual topical annotations
● Manually entered metadata (PIQ, i.e. the pre-interview questionnaire, and Thesaurus)
● Planned to be published in April 2016
● http://ufal.mff.cuni.cz/malach-test-collection

5

Audience

● Historians, teachers, students
● Information Retrieval (IR)
● Cross-lingual IR
● CLEF 2006, 2007 Cross-Language Speech Retrieval Track
● Speech processing
● Sentiment analysis
● Machine translation
● Social studies

...

6

Collection

● Form of interviews
● Average length: 1 hour and 41 minutes
● Recorded on tapes (~ 30 minutes long), which were digitized
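
The average length can be cross-checked against the collection size quoted earlier (353 recordings, 592 hours of audio). The short sketch below does that arithmetic; the tapes-per-interview figure is only a rough estimate that assumes every tape holds about 30 minutes, and the variable names are illustrative.

```python
# Figures quoted in the collection overview.
recordings = 353
total_hours = 592

# Mean recording length in minutes, split into hours and minutes.
mean_minutes = total_hours * 60 / recordings
hours, minutes = divmod(round(mean_minutes), 60)

# Assumption: each tape holds about 30 minutes, so this is only a
# rough estimate of how many tapes an average interview spans.
tape_minutes = 30
tapes_per_interview = mean_minutes / tape_minutes

print(f"{hours} h {minutes} min, about {tapes_per_interview:.1f} tapes")
# prints "1 h 41 min, about 3.4 tapes"
```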

7

Transcripts

● Provided by IBM (2003), The Johns Hopkins University (2004, 2006) and University of West Bohemia (2013)

● In 1-best, MLF and XML format
● Lattices available for 2013 transcripts
● XML transcripts are morphologically tagged

8

Topics

● Annotators manually marked topically coherent segments and assigned a single topic to each detected segment.
● The set of topics created for the annotation of the VHA (Visual History Archive)
  ● Topics for the Czech collection were selected.
  ● Some of the topics were adapted to better reflect Czech realities.
● 5,375 annotations for 118 topics by 6 annotators (librarians and historians)
● Divided into training, test and excluded sets
● All topics are in Czech and English
  ● Some topics are also in French, German and Spanish

9

Topic Examples I

● Topic 1173: Children's art in Terezin
  ● Description: We are looking for the description of the art-related activities of children in Terezin such as music, plays, paintings, writings and poetry.
  ● Narrative: The relevant material should include discussions of such activities and how they influenced the survival and following life of the children. Any episodes where the interviewee demonstrates examples of such an art are highly relevant.
● Topic 1286: Music in the Holocaust
  ● Description: Tell us if music helped (spiritually or otherwise) or hindered the prisoners interned in concentration camps.
  ● Narrative: Descriptions of what role music played in the life of the prisoners.
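
Each topic in these examples has four fields: number, name, description and narrative. A minimal sketch of how such a record might be held in code follows. The class, the field names and the `query_text` helper are illustrative assumptions and not the collection's actual file format; the long texts are shortened here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """One retrieval topic; field names are illustrative, not the release format."""
    number: int
    name: str
    description: str
    narrative: str


# The two example topics, with long texts shortened for brevity.
topics = [
    Topic(1173, "Children's art in Terezin",
          "We are looking for the description of the art-related "
          "activities of children in Terezin",
          "The relevant material should include discussions of such activities"),
    Topic(1286, "Music in the Holocaust",
          "Tell us if music helped or hindered the prisoners",
          "Descriptions of what role music played in the life of the prisoners."),
]

# Index topics by number for quick lookup.
by_number = {t.number: t for t in topics}


def query_text(topic: Topic, fields=("name", "description")) -> str:
    """Join the chosen fields into one query string (hypothetical helper)."""
    return " ".join(getattr(topic, f) for f in fields)


print(query_text(by_number[1286]))
```

Building the query from the name and description only, and leaving the narrative for the assessors, is one possible design choice; other field combinations can be passed through `fields`.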

10

Topic Examples II

● Daily life in Terezin
● Jewish children in schools
● The liberation of Buchenwald and Dachau
● Jewish partisans in Italy
● Strengthening faith
● Hidden children and rescuers
● Bombing of Birkenau and Buchenwald
● Minsk ghetto underground

...

11

Annotations I

● Several topics were annotated by two annotators
  ● 2 topics were annotated by all annotators
● Search Guided Relevance Assessments
  ● The set of possibly relevant segments was automatically restricted by an IR system, Thesaurus keywords, and PIQ
  ● Annotators entered queries and watched the retrieved parts of recordings
  ● Each topic was processed in approximately 20 hours
● Highly-ranked Assessments
  ● Annotators manually evaluated runs submitted to the CLEF campaign.

12

Annotations II

● Average segment length is 167 seconds
● On average, 44 relevant segments were found for each topic.

13

Thesaurus

● English Thesaurus with 60,000 keywords
  ● Terms are hierarchically organized
  ● Label, definition and scope
  ● Alternative labels (synonyms)
● Czech Thesaurus
  ● Labels were translated manually
  ● Part of the definitions (e.g. complete categories Culture, Daily Life, Discrimination, Liberation) and scope translated manually
  ● The rest of the Thesaurus was translated automatically
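
The Thesaurus structure described above (hierarchical terms, each with a label, definition, scope and alternative labels) can be pictured with a small sketch. Everything below is an assumption made for illustration: the class, its field names and the sample child term are hypothetical, and only the category name "Culture" is taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """A thesaurus term; the layout is a guess, not the actual data format."""
    label: str
    definition: str = ""
    scope: str = ""
    alt_labels: List[str] = field(default_factory=list)
    parent: Optional["Term"] = None

    def path(self) -> List[str]:
        """Labels from the top-level category down to this term."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels[::-1]


def matches(term: Term, keyword: str) -> bool:
    """Case-insensitive match against the label or any alternative label."""
    k = keyword.casefold()
    return k == term.label.casefold() or any(
        k == alt.casefold() for alt in term.alt_labels
    )


# "Culture" is a category named on the slide; the child term is made up.
culture = Term("Culture")
music = Term("music", alt_labels=["songs"], parent=culture)

print(music.path())
print(matches(music, "Songs"))
```

Storing a parent link on each term keeps the hierarchy walkable from any keyword, which is what a keyword-based restriction of candidate segments would need.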

14

Conclusion

15

Conclusion

● Czech Malach Collection
● Cleared manual annotations of topics of segments in recordings
● Translations of topics
● Partially manually translated Thesaurus
● Cross-Language Speech Retrieval

16

Thank you

http://ufal.mff.cuni.cz/malach-test-collection