New Slovene corpora within the »Communication in Slovene« project

Nataša Logar Berginc
University of Ljubljana, Faculty of Social Sciences
[email protected]

Simon Krek
Amebis, Kamnik; Jozef Stefan Institute
[email protected]

“Communication in Slovene”

• Web site: http://www.slovenscina.eu
• Leading partner: Amebis, d. o. o., Kamnik
• Duration: June 2008 – December 2013
• Total value: 3.2 million euros
• Project consortium:
  – Amebis, d. o. o., Kamnik
  – Jozef Stefan Institute
  – University of Ljubljana
  – Scientific Research Centre of the Slovenian Academy of Sciences and Arts
  – Trojina, Institute for Applied Slovene Studies

Language data

• Three corpora of Slovene:
  – GigaFIDA, a billion-word written corpus
  – KRES, a 100-million-word balanced subcorpus
  – GOS, a million-word corpus of spoken Slovene

Other activities

• NLP tools & resources
  – statistical tagger and parser
  – training corpus (500,000 words)
  – lexicon (100,000 lemmas)
• Language learning
  – integration of resources & tools in Slovene language teaching
  – pedagogical corpus interface
  – pedagogical corpus-based grammar
• Language description
  – lexical database (NLP & lexicography)
  – manual of style

Goals

GigaFIDA

• a billion-word written corpus
• linguistic annotation
  – lemmatized
  – morpho-syntactically annotated
  – partly syntactically annotated
• format
  – XML TEI P5 format
• purpose
  – data for the new Slovene lexical database, pedagogical grammar and manual of style
  – freely available on the web
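To make the annotation layers above more concrete, here is a minimal Python sketch of reading word forms, lemmas and morpho-syntactic tags from a TEI-style sentence. It is only an illustration: the slides do not show the actual GigaFIDA encoding, so the use of `<w>` elements with `lemma` and `ana` attributes, the sample sentence, and the tag values are all assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample sentence; the element layout and the tag values
# (#ADJ, #NOUN) are placeholders, not the real GigaFIDA tagset.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="nov" ana="#ADJ">Novi</w>
  <w lemma="korpus" ana="#NOUN">korpusi</w>
  <w lemma="jezik" ana="#NOUN">jezika</w>
  <pc>.</pc>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        tokens.append((w.text, w.get("lemma"), w.get("ana")))
    return tokens


if __name__ == "__main__":
    for form, lemma, tag in read_tokens(SAMPLE):
        print(form, lemma, tag)
```

Punctuation (`<pc>`) is skipped here on purpose, so only word tokens with a lemma and a tag are returned.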

A bit of FIDA history

• FIDA corpus
  – 1997–2000
  – 100 million words
  – available for project partners (academic & industrial)
• FidaPLUS corpus
  – 2005–2006
  – 620 million words
  – publicly available in the web concordancer
  – available for partners as a data set
  – text type: fiction 3.5%, non-fiction 96.5% (90% newspapers and magazines)

KRES

• a 100-million-word written subcorpus
• criteria
  – balanced (text types, production-reception etc.)
  – text quality (processing & annotation)
  – copyright issues: 10%
• purpose
  – downloadable as a data set
  – freely available for research (BNC style)
  – Creative Commons (Attribution, Non-Commercial)

New taxonomy

Text type                      KRES (%)   GigaFIDA (%)
Print                             80        50 <> 90
  Books                           35        15 <> 35
    Fiction                       17        20 <> 50
    Non-fiction                   18        30 <> 60
  Periodicals                     40        20 <> 40
    Newspapers                    20        30 <> 70
    Magazines                     20        30 <> 70
  Other                            5         5 <> 10
Internet                          20        10 <> 50
  News sites                       8        30 <> 70
  Corp. & govern. sites           12        30 <> 70
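The KRES column can be checked for internal consistency: each category's share should equal the sum of its subcategories, and the top level should add up to 100. The Python sketch below does this with the figures from the slide. Reading the numbers as percentages of the whole corpus, and the nesting of categories, are assumptions inferred from the sums rather than stated in the source.

```python
# KRES shares copied from the taxonomy slide. Each entry is
# (share, subcategories). The hierarchy is inferred, not given.
KRES = {
    "Print": (80, {
        "Books": (35, {"Fiction": (17, {}), "Non-fiction": (18, {})}),
        "Periodicals": (40, {"Newspapers": (20, {}), "Magazines": (20, {})}),
        "Other": (5, {}),
    }),
    "Internet": (20, {
        "News sites": (8, {}),
        "Corp. & govern. sites": (12, {}),
    }),
}


def check(children, expected, label="corpus"):
    """Return labels of levels whose children do not sum to `expected`."""
    problems = []
    total = sum(share for share, _ in children.values())
    if total != expected:
        problems.append(label)
    for name, (share, sub) in children.items():
        if sub:  # only descend into categories that have subcategories
            problems.extend(check(sub, share, name))
    return problems


if __name__ == "__main__":
    print(check(KRES, 100))
```

An empty result means every level adds up, which is the case for the numbers on the slide.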

GOS

• a million-word corpus of spoken Slovene
  – 120 hours of speech
• criteria
  – demographic
  – speech type/situation
  – additional (language learning, 15%)
• transcription
  – pronunciation-based
  – standardized

Demographic criteria

– sex: 50% male
– age: under 34: 40%
– education: primary/secondary school: 70%
– region:
  • SW: 35%
  • Ljubljana region: 25%
  • NE: 25%
  • Maribor region: 15%

Speech type/situation criteria

– public/non-public discourse: 60% : 40%
– media:
  • face-to-face c.: 50%
  • telephone: 10%
  • radio: 20%
  • TV: 20%
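As a rough illustration of what these shares mean in practice, the Python sketch below converts the media percentages into target recording hours, using the 120 hours of speech given for GOS. Applying the percentages to recording time (rather than to word counts) is an assumption; the slides do not say which unit the quotas refer to.

```python
TOTAL_HOURS = 120  # total size of GOS as given on the GOS slide

# Media shares in percent, copied from the slide.
MEDIA_SHARES = {
    "face to face": 50,
    "telephone": 10,
    "radio": 20,
    "TV": 20,
}


def target_hours(total_hours, shares):
    """Split `total_hours` according to percentage `shares`."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    return {name: total_hours * pct / 100 for name, pct in shares.items()}


if __name__ == "__main__":
    for medium, hours in target_hours(TOTAL_HOURS, MEDIA_SHARES).items():
        print(f"{medium}: {hours:.0f} h")
```

Under that assumption, half of the corpus (60 hours) would be face-to-face speech, with 12 hours of telephone speech and 24 hours each of radio and TV.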

Tools for linguistic annotation

• Tokenization & segmentation
  – new, more transparent rules
• Lemmatizer & tagger
  – rule-based (Amebis)
  – statistical (JSI)
  – metatagger (JSI)
• Parser
  – statistical (based on MSTParser)
• Online services (beta)
  – tagger: http://oznacevalnik.slovenscina.eu/
  – parser: http://razclenjevalnik.slovenscina.eu/

March 2011

• Three publicly and freely available annotated corpora of modern Slovene, all texts copyright (+ gathering of new texts still in progress)

• New user-friendly interface (see Iztok Kosem's presentation)

• Freely available tools for linguistic annotation of Slovene (tagger, parser)

… and not much further down the road: new, up-to-date language descriptions and manuals

See: www.slovenscina.eu