Language and Knowledge Technologies for News Collections in Croatia
Transcript of Language and Knowledge Technologies for News Collections in Croatia
-
Language and Knowledge Technologies for News Collections in Croatia
Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
[email protected], [email protected]
ITN2008, Dubrovnik, 2008-05-21
-
Talk overview
who are we?
what are we doing?
text collections used for research
applicable language technologies
applicable knowledge technologies
-
Who are we?
University of Zagreb, Croatia
two faculties in a joint mission
  build systems that develop language resources and tools for Croatian and enable their use
-
Who are we? (2)
Faculty of Humanities and Social Sciences
  Institute/Department of Linguistics
  Department of Information Sciences
basic computational linguistic tasks for Croatian
compiling and processing large language resources
  Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
digitalization of Croatian lexicographic heritage: 60+ dictionaries digitalized so far
tagger, lemmatizer
chunker, parser
NERC system, gazetteers (e.g. Croatian (sur)names)
-
Who are we? (3)
Faculty of Electrical Engineering and Computing
  Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
Knowledge Technologies Laboratory group deals with
  text preprocessing techniques for Croatian for machine learning procedures
  dimensionality reduction and document clustering in the vector space model + visualisation
  automatic indexing of documents
  intelligent, language-specific and non-specific information retrieval and extraction
-
What are we doing?
working jointly on several research projects
AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
  Institute of Linguistics/FFZG & ZEMRIS/FER, 2006-2008
Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011
  national research programme, prof. Marko Tadić
Sources for Croatian Heritage and Croatian European Identity, 2007-2011
  national research programme, prof. Damir Boras
CADIAL: Computer Aided Document Indexing for Accessing Legislation
  joint Flemish-Croatian project, 2007-2009
  prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić
-
What are we doing? (2)
Composition of the programme RMJT
P1: Croatian language resources and their annotation
  project leader: Marko Tadić
P2: Computational syntax of Croatian
  project leader: Zdravko Dovedan
P3: Lexical semantics in building Croatian WordNet
  project leader: Ida Raffaelli
P4: Information technology in translating Croatian and language e-learning
  project leader: Sanja Seljan
P5: Knowledge discovery in textual data
  project leader: Bojana Dalbelo Bašić
participation in the FP7 project CLARIN
  LR & LT as a research infrastructure for e-SSH
-
Text collections used for research
we have done research on different kinds of texts, but predominantly in the journalistic genre
Croatian National Corpus (hnk.ffzg.hr)
  101.2 million tokens in size
  newspaper articles: 37% (ca. 37 million tokens)
  magazine articles: 16% (ca. 16 million tokens)
Croatian-English Parallel Corpus
  3.5 million tokens from Croatia Weekly
  newspaper articles: 100%, bilingual
special text collections
  database of Vjesnik articles: 2000-2003, >90,000 articles
  Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
  Parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages
-
Applicable language technologies
morphological processing
important for inflectionally rich languages, e.g.
  Croatian noun in 14 word-forms (7 cases, 2 numbers):
        singular     plural
    N:  student      studenti
    G:  studenta     studenata
    D:  studentu     studentima
    A:  studenta     studente
    V:  studentu     studenti
    L:  studentu     studentima
    I:  studentom    studentima
  unlike the English noun in 2 (4?) word-forms (2 numbers + possessive?):
    Sg: student      Poss: (student's)
    Pl: students     Poss: (students')
present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
-
Applicable language technologies 2
recognizing which lexeme(s) a word-form (WF) belongs to
helps us avoid the problem of data sparseness in many text processing tasks:
  information retrieval
  text mining
  document classification
  document indexing
  query processing
search engines are not inflectionally sensitive
  speakers of inflectionally rich languages use the normal/base form = lemma
  e.g. www.google.hr input: noun in nominative singular
  did you know that accusative and genitive are more frequent in Croatian?
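
To illustrate the sparseness point: a minimal sketch, assuming a hypothetical hand-built lookup table (`FORM_TO_LEMMA`) that covers only the word-forms of "student" listed in the paradigm slide. A real system would consult a full morphological lexicon and a lemmatizer; the function names here are illustrative and do not come from the talk.

```python
from collections import Counter

# Hypothetical toy lexicon: distinct word-forms of "student" from the paradigm slide.
STUDENT_FORMS = [
    "student", "studenta", "studentu", "studentom",
    "studenti", "studenata", "studentima", "studente",
]
FORM_TO_LEMMA = {form: "student" for form in STUDENT_FORMS}


def lemmatize(token: str) -> str:
    """Return the lemma for a known word-form, else the lower-cased token."""
    token = token.lower()
    return FORM_TO_LEMMA.get(token, token)


def count_terms(tokens, use_lemmas):
    """Count tokens either as raw word-forms or as lemmas."""
    if use_lemmas:
        return Counter(lemmatize(t) for t in tokens)
    return Counter(t.lower() for t in tokens)


tokens = ["Studenta", "studentima", "studenti", "student"]
raw_counts = count_terms(tokens, use_lemmas=False)    # four separate entries
lemma_counts = count_terms(tokens, use_lemmas=True)   # one entry for the lemma
```

Counting by word-form spreads the evidence over four different keys, while counting by lemma gathers it under a single one, which is the effect the slide describes for retrieval, classification and indexing.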
-
Applicable language technologies 3
-
Applicable language technologies 4
-
Applicable language technologies 5
-
Applicable language technologies 6
Named Entity Recognition and Classification (NERC)
NEs introduce exact information from the outer world into the world-of-text
  represent answers to the basic journalistic questions: who?, where?, when?, how much?
types of NEs (according to the MUC conferences)
  person
  organization
  location
  date
  time
  currency and measurements
  percentage
system that works for Croatian with >90% precision
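
As a rough illustration of the task only: a minimal sketch that tags entities with a small gazetteer plus two regular expressions. This is not the NERC system described in the talk; the gazetteer entries, patterns and function name are assumptions made for the example.

```python
import re

# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = {
    "Zagreb": "location",
    "Dubrovnik": "location",
    "University of Zagreb": "organization",
}

# Simple patterns for two of the numeric/temporal NE types.
PATTERNS = [
    ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("percentage", re.compile(r"\b\d+(?:\.\d+)?\s?%")),
]


def find_entities(text):
    """Return (start, surface form, type) triples sorted by position."""
    found = []
    taken = []  # character spans already claimed by a gazetteer match
    # Try the longest names first so "University of Zagreb" wins over "Zagreb".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue  # overlaps a longer match, skip it
            taken.append((m.start(), m.end()))
            found.append((m.start(), name, GAZETTEER[name]))
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    return sorted(found)


text = "ITN2008 was held in Dubrovnik on 2008-05-21 by the University of Zagreb."
entities = [(surface, label) for _, surface, label in find_entities(text)]
```

A lookup of this kind cannot resolve ambiguous names or handle inflected forms, which is one reason a production system for Croatian needs the morphological resources listed earlier.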
-
Applicable language technologies 7
system that works for Croatian with >90% precision
-
Applicable language technologies 8
semantic networks as language resources
covering the general lexicon and NEs in a language
WordNet: words are linked by meaning
  synonyms, antonyms, hypo-/hyperonyms, meronyms
realized as ontologies or taxonomies
allow, for words and/or NEs:
  synonymy/antonymy search
  evoking upper levels in the taxonomy
    e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
explicit social networking connections between NEs
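
The idea of evoking upper levels can be sketched as a walk up child-to-parent links. This is a minimal sketch under stated assumptions: the `PARENT` fragment and both function names are hypothetical and stand in for a real taxonomy such as a WordNet hierarchy.

```python
# Hypothetical fragment of a geographic taxonomy (child -> parent).
PARENT = {
    "Dubrovnik": "Dalmatia",
    "Dalmatia": "Croatia",
    "Croatia": "Europe",
}


def upper_levels(entity, parent=PARENT):
    """Walk child -> parent links and return all ancestors, nearest first."""
    chain = []
    seen = {entity}
    while entity in parent:
        entity = parent[entity]
        if entity in seen:  # guard against accidental cycles in the data
            break
        seen.add(entity)
        chain.append(entity)
    return chain


def expand_query(terms):
    """Add the upper taxonomy levels of every term, keeping first-seen order."""
    expanded = []
    for term in terms:
        for candidate in [term] + upper_levels(term):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

For example, `expand_query(["Dubrovnik"])` activates the region, state and continent entries that sit above the city in this toy taxonomy.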
-
Applicable L&K technologies
-
Applicable L&K technologies
-
Applicable language technologies 8 (continued)
semantic processing: roles in sentences (agent, patient, instrument etc.)
event detection: from verbal frames and scenarios
connection with geo-data
-
Applicable knowledge technologies
automatic document indexing
eCADIS system
  developed for Croatian legal documents
  applicable to any document collection
  uses machine learning techniques
  automatically attaches keywords (descriptors) from a controlled thesaurus to a document
    descriptors represent a description of the document content
  integrates corpus and document analysis
-
CADIS system
-
eCADIS system
integrates the information from the whole document collection
greyed n-grams are statistically relevant in the corpus, i.e. collocations
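
The slide does not say which statistic marks an n-gram as relevant. As one common possibility, the sketch below scores adjacent word pairs by pointwise mutual information (PMI); the choice of PMI, the `min_count` threshold and the toy corpus are all assumptions for illustration, not a description of eCADIS.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for the example.
corpus = (
    "the official gazette published the law and "
    "the official gazette published the decision"
).split()
scores = bigram_pmi(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Pairs whose words co-occur more often than their individual frequencies predict, such as "official gazette" here, rise to the top, while pairs that begin with a very frequent word such as "the" score lower.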
-
eCADIS system
automatic suggestion of relevant descriptors, hence the automatic indexing
-
eCADIS system
compare the suggested descriptors to the manually attached ones
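
Such a comparison can be sketched with plain set operations. The function name and the descriptor labels below are hypothetical and chosen only for the example.

```python
def compare_descriptors(suggested, manual):
    """Split descriptors into agreed, spurious (only suggested) and missed (only manual)."""
    suggested, manual = set(suggested), set(manual)
    return {
        "agreed": sorted(suggested & manual),
        "spurious": sorted(suggested - manual),
        "missed": sorted(manual - suggested),
    }


# Hypothetical descriptor labels, for illustration only.
report = compare_descriptors(
    suggested=["fisheries", "customs tariff", "import"],
    manual=["fisheries", "import", "trade agreement"],
)
```

The three lists give an indexer a quick view of where the automatic suggestions agree with the manual indexing and where they diverge.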
-
Applicable knowledge technologies
automatic document classification
uses a series of classifiers: 3500 classifiers combined
results represented in a vector-space model
dimensionality reduction
  matrices can be huge (Vjesnik: 90,000 x 600,000)
features selected
  types
  lemmas
  collocations
  NEs
evaluated by the F1 measure (combination of precision/recall)
  F1 > 90% in most cases
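
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes all three for one class from parallel lists of gold and predicted labels; the function name and the sample labels are assumptions for illustration (the codes "po" and "go" are borrowed from the Croatia Weekly legend in this talk).

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class, from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented labels: po = politics, go = economy.
gold = ["po", "po", "go", "po", "go"]
pred = ["po", "go", "go", "po", "po"]
precision, recall, f1 = precision_recall_f1(gold, pred, positive="po")
```

Because the harmonic mean is pulled toward the lower of the two values, a classifier cannot reach a high F1 by trading away either precision or recall.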
-
Applicable knowledge technologies
visualisation of classification between pages
Croatia Weekly, English side
go = economy, ks = culture/sport, te = tourism/ecology, po = politics
-
Applicable knowledge technologies
visualisation of classification between culture (lower right) and sport (upper left)
Croatia Weekly, English side
go = economy, ks = culture/sport, te = tourism/ecology, po = politics
-
Applicable knowledge technologies
visualisation of classification for documents that differentiate between home policy (blue upward) and foreign policy (blue downward)
Croatia Weekly, English side
go = economy, ks = culture/sport, te = tourism/ecology, po = politics