IMPACT Final Conference - Language Parallel Sessions - Erjavec

Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
Resources for historical Slovene
IMPACT Conference 2011, October 24-25, 2011, London



Background

• Pre-story: AHLib (2004–08) (German-Slovene/Croatian translation 1848–1918)
  • Corpus / digital library (DL) of ger→slv books
  • AAS: transcription correction and markup (TEI P4)
  • JSI: automatic annotation and editing environment

• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction

• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene

Tomaž Erjavec: Slovene language resources


Methodology

• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
• How to tag and lemmatise historical Slovene?
  Little chance of developing training data comparable to that for contemporary Slovene
• Basic idea:
  • modernise words, then use models for modern Slovene
  • transcription is via fixed lexicon + transcription patterns
  • patterns implemented via LMU Vaam
  • mostly OK for XIX and XVIII century language
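The "fixed lexicon + transcription patterns" idea can be illustrated with a minimal Python sketch. It is not the ToTrTaLe or LMU Vaam implementation: the lexicon entries, the three rewrite patterns, the modern word list, and the function names `candidates` and `modernise` are all assumptions made up for this example (it also assumes znamenje is the intended modern equivalent of zajhen). A historical form is first looked up in the fixed lexicon; otherwise the patterns generate candidate spellings, and a candidate is accepted only if it is a known modern word.

```python
# Illustrative data only; the real lexicon and patterns are not shown in the slides.
FIXED_LEXICON = {"lubeſn": "ljubezen", "zajhen": "znamenje"}  # historical -> modern
PATTERNS = [("lu", "lju"), ("s", "z"), ("ſ", "s")]            # hypothetical rewrites
MODERN_WORDS = {"ljubezen", "ljubezni", "znamenje"}           # stand-in modern lexicon


def candidates(word):
    """Return all forms reachable by applying any subset of PATTERNS, in order."""
    forms = {word}
    for old, new in PATTERNS:
        # Keep the unchanged forms and add the rewritten ones.
        forms |= {form.replace(old, new) for form in forms}
    return forms


def modernise(word):
    """Map a historical word-form to a modern one, or return it unchanged."""
    word = word.lower()
    if word in MODERN_WORDS:        # already a modern spelling
        return word
    if word in FIXED_LEXICON:       # the fixed lexicon takes priority
        return FIXED_LEXICON[word]
    hits = sorted(candidates(word) & MODERN_WORDS)
    return hits[0] if hits else word


for w in ["lubesen", "ljubesen", "lubesni", "lubeſn", "zajhen"]:
    print(w, "->", modernise(w))
```

Once a form has been modernised, it can be passed to a tagger and lemmatiser trained on contemporary Slovene, which is the point of the approach: no separate historical training data is needed.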


[Diagram; labels: Texts, Historical lexicon, Contemporary models, ToTrTaLe, Annotators, Corpus]


Issues

• Tokenisation: words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection: ljubezen ← ljubezni, ljubeznijo
  • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
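The tokenisation issue means that modernisation cannot always work word by word: one historical token may correspond to two modern ones, and two to one. A minimal sketch of such a retokenisation step follows, using only the two examples from the slide. The rule tables `SPLIT` and `MERGE`, the function `retokenise`, and the sample token sequence are hypothetical and are not taken from ToTrTaLe.

```python
# Hypothetical rule tables built from the two examples on the slide.
SPLIT = {"žnjo": ["z", "njo"]}           # one historical token -> several modern tokens
MERGE = {("po", "noči"): "ponoči"}       # two historical tokens -> one modern token


def retokenise(tokens):
    """Apply merge rules to adjacent token pairs, then split rules to single tokens."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE:
            out.append(MERGE[pair])      # consume both tokens
            i += 2
        elif tokens[i].lower() in SPLIT:
            out.extend(SPLIT[tokens[i].lower()])
            i += 1
        else:
            out.append(tokens[i])        # leave the token as it is
            i += 1
    return out


# Invented token sequence, for illustration only.
print(retokenise(["šel", "je", "žnjo", "po", "noči"]))
```

Merges are checked before splits so that a pair such as "po noči" is consumed as a unit before either token is considered on its own.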


Transcribed historical texts

• AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD: 5,000 pages, 1M words
• Google Books: 30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni): 200 books, 5M words (in progress)

Total: ~10M words

• most texts have associated facsimiles
• can be made freely available


Initial Lexicon

• Development of initial lexicon (2010), using the data and tools at hand
  • AHLib collection (70 books > 1850)
  • Transcription rules + FidaPLUS lexicon of contemporary slv
  • LMU LeXtractor editing tool
  • produced 3,000 entries (word-forms)


Reference corpus goo300k

• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
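As a rough illustration of the three annotation layers per word, the sketch below stores one annotated token and serialises it as a TEI-style `w` element. The actual goo300k encoding is not shown in the slides, so the attribute layout (`norm`, `lemma`, `pos`), the class `AnnotatedToken`, the helper `to_w_element`, the coarse tag "N", and the sample token are all assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class AnnotatedToken:
    form: str    # historical word-form as transcribed
    modern: str  # contemporary equivalent of that form
    lemma: str   # modern lemma
    pos: str     # part-of-speech tag (hypothetical coarse tag here)


def to_w_element(tok):
    """Build a TEI-style <w> element; this attribute layout is an assumption."""
    w = ET.Element("w", {"norm": tok.modern, "lemma": tok.lemma, "pos": tok.pos})
    w.text = tok.form
    return w


# Illustrative token: an archaic, inflected form of "ljubezen".
tok = AnnotatedToken(form="lubesni", modern="ljubezni", lemma="ljubezen", pos="N")
xml_text = ET.tostring(to_w_element(tok), encoding="unicode")
print(xml_text)
```

In a workflow like the one described, the automatic pass would fill in these fields and the annotators would then correct them, along with any transcription errors in the form itself.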


Period      Units  Pages  Tokens
1584            1      8    6000
1695            1     27   10000
1751-1800       8    155   27000
1801-1850      12    206   74000
1851-1875      36    380  126000
1876-1900      23    224   51000
∑              81   1000  296000


INL Cobalt lexicon building tool


TEI corpus dump


Final lexicon

Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection

Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples


goo300k         All  Historical
Lex. entries  56346       22849
Word-forms    53853       19627
Normalised    46996       15402
Modernised    37334       11396
Lemmas        19569        8605
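A lemma-oriented lexicon can in principle be derived from an annotated corpus by grouping the attested historical spellings under their modern form and lemma. The sketch below shows that grouping step only; the input rows reuse forms from the "Issues" slide, while the row layout and the function `build_lexicon` are assumptions and do not reflect the actual goo300k dump procedure or its TEI P5 output.

```python
from collections import defaultdict

# Illustrative rows: (historical form, modern form, modern lemma).
annotations = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("ljubesin", "ljubezen", "ljubezen"),
    ("lubesni", "ljubezni", "ljubezen"),
    ("ljubeznijo", "ljubeznijo", "ljubezen"),
]


def build_lexicon(rows):
    """Group historical spellings as lemma -> modern form -> set of spellings."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for historical, modern, lemma in rows:
        lexicon[lemma][modern].add(historical)
    return lexicon


lexicon = build_lexicon(annotations)
for lemma, forms in lexicon.items():
    print(lemma)
    for modern, spellings in sorted(forms.items()):
        print("  ", modern, "<-", ", ".join(sorted(spellings)))
```

Counting the keys at each level of such a structure gives figures of the same kind as the table above (lemmas, modernised forms, historical word-forms), although the exact definitions behind the table's rows are not given in the slides.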


Results

• Language resources for historical Slovene:
  • Text Collection hs5M: facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k: page-sampled, hand-annotated
  • Structured Lexicon imp20k: grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS: tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012


Further work

• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian
