Proceedings of MLP 2017
The First Workshop on Multi-Language Processing in a Globalising World
4–5 September 2017
ADAPT Centre, Dublin City University, Dublin, Ireland
Edited by Mikel L. Forcada, Chao-Hong Liu, Jinhua Du, Qun Liu



  • Foreword by the chairs

It’s our great pleasure to welcome you to the First Workshop on Multi-Language Processing in a Globalising World (MLP 2017) to be held in Dublin, Ireland.

We are living in a rapidly changing world of which globalisation is one of the most notable features. On the one hand, globalisation has brought us significant growth in free international trade and cross-cultural communication, as well as access to newly-developed technology, media, education, healthcare, consumers, goods, etc. On the other hand, it may have negative impacts on local societies, such as cultural homogenization.

Nowadays, the development of information technology has provided solid infrastructure for a globalising world. However, without pervasive and high-quality language technologies, the free flow of human knowledge and intelligence in the globalising world is unimaginable. Unfortunately, for most of the human languages in the world, there are no good enough or even usable language technologies.

To embrace cultural diversity and the multilingual challenge, experts with research interests in different languages and research areas (natural language processing and linguistics) were invited to participate in MLP 2017. The response of the community to the call for papers for this workshop was very encouraging. What you have in your hands now is the result of work by contributors from all over the world to answer this call.

The papers have been grouped into a one-and-a-half-day workshop programme. We have four long papers, four short papers, three invited speeches, and three shared task papers.

These papers will therefore give a wide but focused view of the current status of multi-language processing in a globalising world and give attendees the opportunity to discuss the current status and future directions of research in multilingualism and minority languages in this globalising world.

We will finish this foreword by thanking the authors who submitted their papers; the reviewers, who did a great job and gave many suggestions to improve the papers that were finally accepted; and the local organizers. We also warmly thank the ADAPT Centre at Dublin City University for hosting this workshop.

    Dublin, September 2017

Mikel L. Forcada
Programme Chair
Universitat d’Alacant
E-03690 Sant Vicent del Raspeig, Spain.

Qun Liu
General Chair


ADAPT Centre
Dublin City University
Glasnevin, Dublin 9, Ireland.


  • Foreword by the shared task chair

The Shared Task on Cross-lingual Word Segmentation and Morpheme Segmentation is organised in this workshop to review the fundamental problems of 1) how words are formed in each language, 2) how they could be segmented and 3) if we could use cross-lingual tools to segment either words or morphemes in multiple languages. These are the most fundamental natural language processing (NLP) technologies we use to extract basic processing units for further NLP tasks in many languages.

There are broadly two groups of segmentation tasks related to word formation, i.e. morpheme segmentation and word segmentation. Morpheme segmentation is required in languages such as Turkish, where words are formed by stems, roots, prefixes, and/or suffixes. It is the foundation for further morphological analysis tasks. Word segmentation is necessary in languages such as Mandarin Chinese, where there is no word boundary in the writing system.

In this shared task, we encourage the participants to submit the results of one system/method that could be applied to either word or morpheme segmentation. These systems are expected to demonstrate the ability of cross-lingual processing on the segmentation tasks, which would give insights to our community on the building of fundamental NLP tools for low-resource languages.

Corpora of ten languages are curated for this shared task. For morpheme segmentation, these are Amharic, Basque, Farsi, Filipino, Finnish, Kazakh, Marathi, and Uyghur. For word segmentation, these are Japanese, traditional Chinese and Vietnamese. The sentences in most of these corpora were initially collected from the web and all of them are annotated by native speakers. These corpora allow us to review segmentation tasks in multiple languages and we plan to release them at some point after the workshop.

I would like to express my sincere gratitude to the many researchers who prepared the corpora, helped with annotation and gave comments during their preparation; without them, the shared task would have been impossible. They are Alberto Poncelas, Alex Huynh, Dinh Dien, Erlyn Manguilimotan, Francis Tyers, Jinhua Du, Jonathan North Washington, Majid Latifi, Nasun-Urt, Nathaniel Oco, Rachel Costello, Sara Garvey, Teresa Lynn, Tommi A Pirinen, Qun Liu, Vinit Ravishankar, Yalemisew Abgaz, Yasufumi Moriya, Yating Yang, as well as many other researchers who discussed with us during the course. Thank you so much!

    Dublin, September 2017

Chao-Hong Liu
Shared task chair
ADAPT Centre
Dublin City University
Glasnevin, Dublin 9, Ireland.


  • Welcome

I am very pleased indeed to welcome you all to MLP2017, The First Workshop on Multi-Language Processing in a Globalising World, organized by the ADAPT Centre at Dublin City University, here in Dublin, Ireland.

DCU is a young, dynamic and ambitious university; since admitting its first students in 1980, DCU has grown in both student numbers and size and now occupies several sites in North Dublin, with the successful incorporation of three other campuses in 2016. To date, well over 50,000 students have graduated from DCU and are now playing significant roles in enterprise and business globally. Today, in 2017, DCU delivers more than 200 programmes to over 16,000 students across its five faculties. DCU’s excellence is recognised internationally and it is regularly ranked among the top 50 young universities worldwide.

The ADAPT Centre for Digital Content Technology is one of Science Foundation Ireland’s centres of excellence. It is a 50 million euro Centre, with funding from SFI, H2020 and industry. ADAPT focuses on developing next-generation digital technologies that transform how people communicate by helping to analyse, personalise and deliver digital data more effectively for businesses and individuals. ADAPT researchers are based in four leading universities: Trinity College Dublin, Dublin City University, University College Dublin and Dublin Institute of Technology. ADAPT’s transformative tools allow users to explore video, text, speech and image data in a natural way across languages and devices, helping companies unlock opportunities that exist within digital content to re-imagine how to connect people, process and data to realise new economic value.

In the ADAPT Centre at DCU, we have world-leading strengths in Natural Language Processing, Machine Translation and Information Retrieval, so it is very appropriate that we are the first hosts of what we hope will be a long-standing series of workshops on Multi-Language Processing. We are very pleased with the programme that has been assembled for you, comprising one and a half days of invited talks by a range of experts, together with peer-reviewed presentations on issues of key interest.

We are sure that all of you participating at MLP2017 will enjoy the time you spend here in Ireland, and will look back on the event with fond memories. Thanks to all of you for coming. We hope you all enjoy the workshop, that you benefit from the programme that has been assembled, and that you go away from here having made new friends.

    Dublin, September 2017

Andy Way
Deputy Director
ADAPT, School of Computing,
Dublin City University,


Glasnevin, Dublin 9, Ireland.


  • Editors and Chairs:

Mikel L. Forcada, Universitat d’Alacant
Chao-Hong Liu, Dublin City University
Jinhua Du, Dublin City University
Qun Liu, Dublin City University

    Programme Committee, Main Workshop:

Tsering Gyal (Qinghai Normal University, China)
Mikel L. Forcada (Universitat d’Alacant, Spain)
Qun Liu (Dublin City University, Ireland)
Sudip Kumar Naskar (Jadavpur University, India)
Delyth Prys (Bangor University, Wales, UK)
Kepa Sarasola (University of the Basque Country, Spain)
Kevin Scannell (St Louis University, Missouri, USA)
Francis Tyers (The Arctic University of Norway, Norway)
Jonathan North Washington (Swarthmore College, PA, USA)
Jiajun Zhang (Institute of Automation, Chinese Academy of Sciences, China)

    Programme Committee, Shared tasks track:

Yalemisew Abgaz (Dublin City University, Ireland)
Alex Huynh (CLC Center, University of Science, VNU-HCMC-VN, Vietnam)
Majid Latifi (Universitat Politècnica de Catalunya, Spain)
Erlyn Manguilimotan (Weathernews Inc., formerly with NAIST, Japan)
Yasufumi Moriya (Dublin City University, Ireland)
Tommi A Pirinen (Universität Hamburg, Germany)
Alberto Poncelas (Dublin City University, Ireland)
Vinit Ravishankar (University of Malta, Malta)

    Local Organising Committee:

Jinhua Du (Dublin City University, Ireland)
Joachim Wagner (Dublin City University, Ireland)
Chao-Hong Liu (Dublin City University, Ireland)
Teresa Lynn (Dublin City University, Ireland)
Qi Zhang (Dublin City University, Ireland)
Nasun Urt (Inner Mongolia University, China)

    Sponsoring Institution:

    ADAPT: The Global Centre of Excellence for Digital Content and Media Innovation, Ireland


  • Table of Contents

    Main Workshop (September 4)

    Invited speeches

NLP for all languages? A cross-lingual learning viewpoint
Zeljko Agić, IT University of Copenhagen . . . . . . . . . . . . 3

Language technology for morphologically rich languages
Trond Trosterud, The Arctic University of Norway . . . . . . . . . . . . 5

The landscape of Irish Language Technology
Teresa Lynn, Dublin City University . . . . . . . . . . . . 13

    Long papers

Quantifying the Effects of Language Contact
Gualberto A. Guzman, Barbara E. Bullock & Almeida Jacqueline Toribio . . . . . . . . . . . . 15

Language Independent Proposal to Profile-based Named Entity Classification
Isabel Moreno, María Teresa Romá Ferri & Paloma Moreda . . . . . . . . . . . . 21

Linking Knowledge Graphs across Languages with Semantic Similarity and Machine Translation
John P. McCrae, Mihael Arčan & Paul Buitelaar . . . . . . . . . . . . 31

Mongolian Lexical Analysis Using Labeled and Unlabeled Data
Zhenxin Yang, Miao Li, Lei Chen & Tao Feng . . . . . . . . . . . . 39

    Short papers

A Tibetan input method for mobile intelligent devices
Heming Huang . . . . . . . . . . . . 47

An Introduction to the Tibetan Case Grammar Structure and Its Grammatical Functions
Hua Cai, Nuo Qun & Drup Ngo . . . . . . . . . . . . 51

Phonetic Analysis of HMM-based Tibetan Speech Synthesis
Cairang Zhuoma & Caizhi Jie . . . . . . . . . . . . 57


Processing of the Affixes in Tibetan Word Segmentation
Huaquecairang . . . . . . . . . . . . 63

    Shared task (September 5)

Introduction to the Shared Tasks on Cross-lingual Word Segmentation and Morpheme Segmentation
Chao-Hong Liu & Qun Liu . . . . . . . . . . . . 71

Cross-lingual Word Segmentation and Morpheme Segmentation as Sequence Labelling
Yan Shao . . . . . . . . . . . . 75

A Simple Multi-language Word Segmentation System with Conditional Random Field
Vipas Sutantayawalee & Thepchai Supnithi . . . . . . . . . . . . 81

Modeling an Intermediate-Level Learner for Word Segmentation
Yongkyoon No . . . . . . . . . . . . 85


  • Main workshop

  • Invited speech

    NLP for all languages? A cross-lingual learning viewpoint

    Zeljko Agić, IT University of Copenhagen

Abstract: The world speaks in 7,000 languages today. Only 1% of those enjoy some natural language processing support. For the remainder, not even the most basic processing is available. We don’t even have part-of-speech taggers for 99% of the world’s languages. Are there ways to approach this challenge that don’t involve simply annotating more data? In this talk, I take the viewpoint of realistic cross-lingual learning. I discuss a set of recent approaches to creating basic NLP (almost) from scratch for even the most hostile environments. The talk focuses on massively multilingual learning through annotation projection, in comparison to other realistic transfer learning and the last resort that is manual annotation.


  • Invited speech

    Language technology for morphologically rich languages

Trond Trosterud, UiT The Arctic University of Norway

Abstract: This paper presents infrastructure and language technology work for languages with a rich and productive morphology. The infrastructure (called Giella) may be characterised as a development environment for language processing. It contains language-independent modules for compiling and building both finite-state grammatical models and a wide range of language technology applications, as well as language-specific modules, one for each language in question. The infrastructure, grammatical models and language applications are all available as open source.


  • Language technology for morphologically rich languages

Trond Trosterud
UiT The Arctic University of Norway

    [email protected]

    Abstract

This paper presents infrastructure and language technology work for languages with a rich and productive morphology. The infrastructure (called Giella) may be characterised as a development environment for language processing. It contains language-independent modules for compiling and building both finite-state grammatical models and a wide range of language technology applications, as well as language-specific modules, one for each language in question. The infrastructure, grammatical models and language applications are all available as open source.

1 Introduction

A prerequisite for many language technology applications is the ability to link wordforms and grammatical words (lexeme and grammatical tags). For morphology-poor languages, this may be done by listing the pairs of wordforms and grammatical words. For large, and especially for productive, morphological systems, listing becomes harder. This paper reports on work addressing this problem, making grammatical modelling available for a wide array of morphologically rich languages, with a focus on languages with marginal access to corpus resources.

The language-specific modules contain the morphological and syntactic models for each language, thereby making it possible to handle languages with rich morphology and paucity of text. As the language-dependent and language-independent software is kept apart, the infrastructure is scalable, new languages are easily added, and developmental work carries over to all the languages. The technology chosen makes it possible to do deep grammatical analysis, as well as language generation, and thereby offer applications not available for morphologically rich languages under other approaches. A schematic overview of the infrastructure is given in Figure 1.

Figure 1: Schematic overview of the infrastructure

This article will present the background for the infrastructure, some applications that may be derived from it, and its implications for work on morphologically rich languages. Finally comes a conclusion.

    2 Background

Language technology for morphology-rich languages rests upon the ability to identify the lemma and grammatical properties of the wordform, and, for language production, upon the ability to generate the appropriate wordform for each grammatical context.

Mikel L. Forcada, Chao-Hong Liu, Jinhua Du, Qun Liu (eds.), Proceedings of MLP 2017: The First Workshop on Multi-Language Processing in a Globalising World, p. 7–11, Dublin, Ireland, 4–5 September 2017

The basic technology behind Giella’s morphological modelling is finite-state transducers. It goes back to (Koskenniemi, 1983). Good compilers were developed by Xerox in the early nineties, cf. (Beesley and Karttunen, 2003). The development of a family of compilers under open licenses, the so-called Helsinki Finite State Compilers, was initiated a good decade later, cf. (Lindén et al., 2011), (Lindén et al., 2013). Compared to the Xerox compilers, this generation of transducers also contains new features, such as the possibility to assign weights to the transducers, as well as the possibility of integrating the transducers into different end-user programs. An overview of the history of finite-state morphology is given in (Karttunen, 2001).

Koskenniemi’s original idea was to distinguish between segmental and suprasegmental morphophonology, in order to be able to make alternations to the stem based upon subsequent morphological processes. An example is given in Figure 2, which shows two transducers. The first one, segmental morphology, links lemma and grammatical tags to stem, suffixes and metagrammatical symbols. The second one, for morphophonology, turns this abstract string into actual wordforms. The process is reversible, and the same model may thus be used both for analysis and generation.
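The two-stage design can be sketched in miniature. The following Python toy is not the Giella/HFST implementation: the `^WG` trigger symbol, the lexicon entries and the simplified paradigm are invented for illustration only. It mimics a segmental-morphology mapping composed with a morphophonological gradation rule, and shows the reversibility mentioned above:

```python
# Toy two-level-style model (illustrative sketch, not Giella/HFST code).
# Stage 1 ("segmental morphology"): lemma + tags -> abstract string carrying
# a weak-grade trigger symbol ^WG (invented notation).
# Stage 2 ("morphophonology"): realise the abstract string as a surface form.

LEXICON = {
    ("gussa", "N+Sg+Nom"): "gussa",
    ("gussa", "N+Sg+Acc"): "guss^WGa",  # ^WG marks weak-grade consonant gradation
}

def morphophonology(abstract: str) -> str:
    """Realise an abstract string (gradation ss -> s, then drop the trigger)."""
    return abstract.replace("ss^WG", "s").replace("^WG", "")

def generate(lemma: str, tags: str) -> str:
    """Generation: composition of the two stages, lemma + tags -> wordform."""
    return morphophonology(LEXICON[(lemma, tags)])

def analyse(surface: str):
    """Analysis: the same model run in reverse, surface -> (lemma, tags)."""
    return [(lemma, tags) for (lemma, tags), abstract in LEXICON.items()
            if morphophonology(abstract) == surface]

print(generate("gussa", "N+Sg+Acc"))  # -> gusa
print(analyse("gusa"))                # -> [('gussa', 'N+Sg+Acc')]
```

A real transducer stores these mappings as a finite-state network rather than a dictionary, which is what keeps productive morphology tractable.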

Although the finite-state methodology in itself is usable for all languages, it is crucial for languages with complex morphology. An obvious alternative to using transducers is to list all pairs of grammatical word : wordform. Even though modern computers may handle such lists also when each lemma has hundreds and thousands of realisations, manually correcting such lists for morphologically rich languages becomes impossible. Productive morphological processes, that cannot be listed, be it compounding or derivation, come on top of this.

Figure 2: A composed finite-state transducer gives the accusative of North Saami gussa, ‘cow’

Within generative grammar, syntax was modelled as a set of rewriting rules of the type S → NP VP, VP → V NP, etc. Formally, this rule set acted like a filter, letting through only sentences able to be generated by the rule set. In practice, sentences occurring in running text were too complex to be subsumed under such a rule set, and this approach never resulted in robust computational grammars.

The Giella approach is a different one, Constraint grammar (CG), cf. (Karlsson, 1990), (Karlsson et al., 1995), (Bick, 2000). Here, the output of the morphological transducer will give all possible morphological analyses of each word, and CG will then constrain the number of analyses, ideally to one, on the basis of the context. Further CG components will thereafter add syntactic functions and dependency relations. As shown by (Antonsen et al., 2010), the potential of rule generalisation increases as the level of abstraction gets higher. Thus, whereas homonymy resolution is dependent upon language-specific homonymy patterns, all languages share the generalisation that the mother of a postposition complement is the postposition to its right.
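The remove-readings-by-context idea can be mimicked in a few lines. The sketch below uses plain Python rather than real CG rule syntax, and the sentence, tags and rule are invented examples, not a rule from any Giella grammar:

```python
# Constraint-grammar-style disambiguation, sketched in Python. Each token
# starts with all readings the morphology offers; a context rule then
# removes readings. (Toy data; real CG rules are written in CG syntax.)

sentence = [
    ("the",  [{"pos": "Det"}]),
    ("saw",  [{"pos": "Noun"}, {"pos": "Verb"}]),   # ambiguous wordform
    ("fell", [{"pos": "Verb"}, {"pos": "Noun"}]),   # still ambiguous
]

def remove_verb_after_det(tokens):
    """Toy rule: REMOVE Verb IF (-1 Det) -- discard a verb reading when the
    preceding word is a determiner, keeping the noun reading."""
    out = []
    for i, (form, readings) in enumerate(tokens):
        if i > 0 and any(r["pos"] == "Det" for r in tokens[i - 1][1]):
            # never remove the last remaining reading
            kept = [r for r in readings if r["pos"] != "Verb"] or readings
        else:
            kept = readings
        out.append((form, kept))
    return out

disambiguated = remove_verb_after_det(sentence)
print(disambiguated[1])  # ('saw', [{'pos': 'Noun'}])
```

Real CG grammars chain hundreds of such rules, and later modules add syntactic function and dependency labels to the surviving readings.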

The Giella infrastructure contains approximately 100 languages, a quarter of which are in use in several practical applications. Figure 3 shows some of them. A presentation of the infrastructure is given in (Moshagen et al., 2013).

3 Applications

The modularity of the infrastructure ensures that as soon as the model of a new language is set up, the pipeline for making practical applications is ready. This section looks at some of them.

Figure 3: Circumpolar languages in the Giella infrastructure

Figure 4: E-dictionary as transducer lookup

    3.1 E-dictionaries

The most popular language technology application for morphologically rich languages is the morphologically enriched e-dictionary. It has turned out that speakers of languages with a complex morphophonology or with predominantly prefixing morphology gladly let the computer handle the task of finding the lemma of the wordform and looking it up in the dictionary. The Giella infrastructure contains a pipeline for making such dictionaries, and equips them with a click-in-text functionality, as shown in Figure 4. The principles behind the dictionary are discussed in (Johnson et al., 2013).

This functionality is by no means unique to the infrastructure presented here; on the contrary, it is found in all dictionaries shipped with modern computers. But whereas languages with limited morphology may list the wordforms referring to their respective lemma articles, this is not a viable option for languages with a rich morphology.
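The click-in-text lookup described above amounts to composing the analyser with a lemma-indexed article store. A minimal sketch, with invented entries and glosses rather than Giella data (the real pipeline runs a transducer, not a Python dict):

```python
# E-dictionary lookup sketch: surface wordform -> lemma -> dictionary article.
# All entries and glosses below are invented toy data.

ANALYSER = {          # the analysis direction of the morphological model
    "gussa":  "gussa",
    "gusa":   "gussa",
    "gussat": "gussa",
}
DICTIONARY = {"gussa": "gussa (n.) 'cow'"}

def click_lookup(wordform: str) -> str:
    """Return the dictionary article for a clicked wordform, via its lemma."""
    lemma = ANALYSER.get(wordform.lower())
    if lemma is None:
        return "no analysis"
    return DICTIONARY.get(lemma, "no article")

print(click_lookup("gusa"))  # -> gussa (n.) 'cow'
```

The point of the design is that the reader never needs to know the lemma: any inflected form in the text leads to the right article.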

3.2 Spellcheckers and grammar checkers

Finite-state transducers have been used as spellcheckers since the very beginning. The lower side of a transducer contains all and only the wordforms in the language, and any wordform not a member of this set will be flagged as a typographical error.
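As a sketch of this membership test (real Giella spellers operate on the transducer itself and also produce correction suggestions; the accepted forms below are invented):

```python
# Spellchecking as membership in the set of surface forms ("lower side").
# Toy data; a real speller queries the transducer rather than a set.

KNOWN_FORMS = {"gussa", "gusa", "gussat"}   # forms the model accepts

def spellcheck(text: str):
    """Return the words the model does not recognise (flagged as typos)."""
    return [w for w in text.lower().split() if w not in KNOWN_FORMS]

print(spellcheck("gussa gussq gusa"))  # -> ['gussq']
```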

The novelty of the Giella infrastructure is that it provides an open and modular pipeline for turning a grammatical model into a spellchecker in word processors like LibreOffice, and, in case there is a language code available, in Microsoft Office.

The importance of this technology for morphologically rich languages is illustrated by the situation for language technology in North America. To our knowledge, the only Native American languages in North America for which there are spellcheckers are Plains Cree and Northern Haida, both made with the Giella infrastructure. Language technology for other languages has used other approaches, and has therefore not been able to provide models of the language suitable for distinguishing words from non-words. Due to the morphological structure of North American languages, wordform lists have not been an option.

Another text processing tool is the grammar checker, a program that corrects real-word errors, be they typos resulting in existing words or the choice of a wrong inflection form. A robust way of finding such errors is to write rules detecting known error patterns (say, nominative for accusative), and a prerequisite for doing this is an explicit grammatical analysis of the input.

3.3 Machine translation

To the general public, machine translation is perhaps the most well-known example of language technology, via Google Translate and similar services. Good quality translation into morphology-rich languages is still rare. The Giella infrastructure is integrated with the Apertium MT platform, which was designed to translate between closely related morphology-rich languages. The morphological models are exported to Apertium, where an intersection of the models and the Apertium bilingual dictionary is compiled into an MT system. For a presentation, see (Antonsen et al., 2017). Apertium is presented in (Forcada et al., 2011).
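The analyse–transfer–generate pipeline this section describes can be illustrated as follows. All dictionary entries below are invented toy data, and Apertium compiles these stages into finite-state modules rather than Python dictionaries:

```python
# Shallow-transfer MT sketch: source analysis -> bilingual lexical transfer
# -> target generation. Toy data only; not an actual Apertium language pair.

SRC_ANALYSER = {"gusa": ("gussa", "N+Sg+Acc")}   # surface -> (lemma, tags)
BILINGUAL = {"gussa": "ku"}                      # source lemma -> target lemma
TGT_GENERATOR = {("ku", "N+Sg+Acc"): "kuni"}     # (lemma, tags) -> surface

def translate_word(surface: str) -> str:
    lemma, tags = SRC_ANALYSER[surface]      # 1. morphological analysis
    tgt_lemma = BILINGUAL[lemma]             # 2. lexical transfer
    return TGT_GENERATOR[(tgt_lemma, tags)]  # 3. morphological generation

print(translate_word("gusa"))  # -> kuni
```

Because the grammatical tags are carried across unchanged, this design works best between closely related languages, which is exactly the niche the text describes.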

4 Different approaches to morphology-rich languages

The approach to language technology advocated here is by no means the one dominating the field. During the last decades, various machine learning techniques have become dominant, both because of their results and because of their scalability. A setup for learning from the dataset of one language may be reused for the dataset of the next. The scalability has not quite included morphologically rich languages, though. As shown by (Koehn, 2005), translation into Finnish shows a BLEU score only half as good as translation into 9 other EU languages.

There is another way of ensuring scalability, though. For all languages with at least some literacy, there are one or more grammarians devoted to the description and analysis of that language. For these grammarians, machine learning techniques are not attractive in themselves, as the models do not offer any explicit insight per se. The typical language team within the Giella framework, on the other hand, consists of three persons (roles): the grammarian, the computational linguist, and (common to the whole infrastructure) the programmer. For the grammarian, the motivation to participate is that the infrastructure provides both a formal framework for writing a concise descriptive grammar, and a way of empirically testing the generalisations given in the grammar. When the models are turned into practical applications, they need to be comprehensive to a degree not met by existing grammars, and native speakers must be added to the team, a very welcome development seen from the perspective of the grammarian.

    5 Conclusion

Computational treatment of morphologically rich languages is dependent upon models that recognise, disambiguate and generate wordforms, grammatical words (lemma + analysis) and syntactic functions. Since this modelling involves empirical testing of grammatical analyses of the language in question, language technology becomes relevant also for the language expert. When the infrastructure needed to compile and test both models and applications is in place, the result is a scalable setup for language technology also for low-resource languages.

6 Acknowledgements

The infrastructure presented here was made by Sjur Moshagen and Tommi Pirinen. The grammatical models are the result of joint work by members of the groups Giellatekno and Divvun at UiT over the last two decades (http://giellatekno.uit.no and http://divvun.no), and of collaboration with numerous colleagues across the world, especially Jack Rueter, who has been involved in the setup of several Uralic languages, and Antti Arppe, who has led the work on North American languages.

References

Lene Antonsen, Ciprian Gerstenberger, Maja Kappfjell, Sandra Nystø Rahka, Marja-Liisa Olthuis, Trond Trosterud, and Francis M. Tyers. 2017. Machine translation with North Saami as a pivot language. In Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22–24 May 2017, Gothenburg, Sweden. Linköping University Electronic Press, Linköpings universitet, 131, pages 123–131.

Lene Antonsen, Trond Trosterud, and Linda Wiechetek. 2010. Reusing grammatical resources for new languages. In Proceedings of LREC-2010. ELRA, Valletta, Malta.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. Studies in Computational Linguistics. CSLI Publications, Stanford, California.

Eckhard Bick. 2000. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, Aarhus.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O’Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25(2):127–144.

Ryan Johnson, Lene Antonsen, and Trond Trosterud. 2013. Using finite state transducers for making efficient reading comprehension dictionaries. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22–24, 2013, Oslo University, Norway. NEALT Proceedings Series 16. Linköping University Electronic Press, Linköpings universitet, 85, pages 59–71.

Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. Helsinki, pages 168–173.

Fred Karlsson, Atro Voutilainen, Juha Heikkilä, and Atro Anttila. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Natural Language Processing. Mouton de Gruyter, Berlin, New York.

Lauri Karttunen. 2001. A short history of two-level morphology.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit 5:79–86.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-form Recognition and Production. Publications of the Department of General Linguistics, University of Helsinki. University of Helsinki, Helsinki.

Krister Lindén, Erik Axelson, Senka Drobac, Sam Hardwick, Miikka Silfverberg, and Tommi A. Pirinen. 2013. Using HFST for Creating Computational Linguistic Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 3–25.

Krister Lindén, Miikka Silfverberg, Erik Axelson, Sam Hardwick, and Tommi Pirinen. 2011. HFST-Framework for Compiling and Applying Morphologies. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, volume 100 of Communications in Computer and Information Science, pages 67–85.

Sjur N. Moshagen, Tommi A. Pirinen, and Trond Trosterud. 2013. Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22–24, 2013, Oslo University, Norway. NEALT Proceedings Series 16. Linköping University Electronic Press, Linköpings universitet, 85, pages 343–352.


  • Invited speech

    The landscape of Irish Language Technology

    Teresa Lynn, Dublin City University

Abstract: In the area of language technology, low-resourced languages present additional challenges. Low-resourced in this context means that there is a lack of the language resources that are required to build robust applications that can handle language processing: for example, parallel data for machine translation, machine-readable dictionaries, knowledge bases (syntactic, semantic), linguistically tagged corpora (e.g. POS-tagged, syntax trees), sentiment lexicons, and so on. Significant research and development has been carried out in these areas for the world’s major languages. This is made possible through funding and training that facilitate work in these languages. The problem, however, with minority languages is that they often lack financial support, and finding skilled researchers to carry out this work is extremely challenging; such languages therefore wind up in the category of “low-resourced”.

Until several years ago, much of the research carried out on Irish Speech and Language Technology was either voluntary or made possible through piecemeal funding from national organisations. Progress was slow, and the Irish language looked to be at risk of digital extinction. However, in this talk, I will report on changes in the landscape of funding for Irish language technology, increased interest from the Irish Government and the European Commission, and how the future of Ireland in a Digital Age now appears brighter.


  • Quantifying the Effects of Language Contact

Gualberto A. Guzman, Barbara E. Bullock & Almeida Jacqueline Toribio
University of Texas at Austin

{gualbertoguzman}@utexas.edu, {bbullock,toribio}@austin.utexas.edu

    1 Introduction

The linguistic effects of language contact appear to be diverse, ranging from lexical borrowing to linguistic fusion to the rise of new, mixed languages (Auer, 1999). Comparison between contact corpora is potentially informative of the limits on the manner and extent to which languages may become intertwined in multilingual settings (Adamou, 2016). Yet, to date, linguists have lacked adequate methods of comparing language mixing between corpora, so there is no objective way to gauge the nature of code-switching (C-S) between corpora. Our objective is to provide the tools to computationally characterize the degree, complexity, and regularity of C-S within a corpus. Here we present a set of metrics that represent C-S in a corpus across these dimensions. We apply our metrics to quantify the differences between various corpora of endangered languages in contact with majority languages in Europe and compare our results to those of Adamou (2016).

    2 Existing Work

Scholars of language contact argue that the degree of language mixing is predicated on social factors such as the intensity of contact or degree of bilingualism among speakers and on the structures of the languages involved (Weinreich 1953, Myers-Scotton 1993, Thomason & Kaufman 1988, Poplack & Sankoff 1988, Muysken 2000). Adamou (2016) challenges these assumptions, finding instead a range of C-S and borrowing patterns that cannot be predicted by intensity of contact or degree of bilingualism. She argues that a quantitative scale of language mixing is required to empirically test typologies of language contact and to investigate variation in C-S between and within language pairings. Indeed, variation in C-S has been documented for identical language pairings. For instance, Vu et al. (2013) demonstrate that Mandarin-English bilinguals in Malaysia and Singapore show individual and regional differences in C-S, which they refer to as C-S attitudes. In her analysis of data from the Pangloss collection of endangered European languages, Adamou quantifies mixing by a word count, which represents the frequency of the majority-language insertions relative to the endangered-language base. On the basis of the word-count method, she is able to divide the corpora into two types of C-S, one type in which 0–5% of the word tokens are from the contact language and the other type in which 20–35% of the word tokens are from the majority language. In a finer-grained analysis, she considers how other multilingual properties of speech correlate with these two broad types: morphological integration of contact tokens, borrowing of syntactic patterns, probability of particular word classes as borrowed insertions, etc.

The metrics that we propose here are limited to C-S defined by language tags and transitions. Other methods of quantifying C-S defined this way have been proposed. In an early effort to create a standardized way of calculating the ratio of languages in a multilingual corpus, the LIPPS Group (2000) proposed the M-Index, which is bounded by 0 (monolingual) and 1 (equally balanced between languages) and measures the balance of languages across a corpus. The Code-Mixing Index of Das & Gambäck (2014a, 2014b) and Gambäck & Das (2016) determines the most frequent language of an utterance (usually defined as a tweet) and calculates the frequencies of the other languages within the utterance. Their method is particularly well-suited for a corpus of tweets, where it can be assumed that the base language may change from one user to another. Guzmán et al. (2016, 2017a) present the I-index

Mikel L. Forcada, Chao-Hong Liu, Jinhua Du, Qun Liu (eds.), Proceedings of MLP 2017: The First Workshop on Multi-Language Processing in a Globalising World, pp. 15–19. Dublin, Ireland, 4–5 September 2017.

Name  Languages                          M-index  I-index  Burstiness  Memory  LE    SE
HRV   Burgenland Croatian, German        0.04     0.04     0.30        -0.26   0.15  3.52
HSB   Colloquial Upper Sorbian, German   0.07     0.06     0.37        -0.19   0.22  3.79
MKD   Balkan Slavic, Greek               0.10     0.09     0.38        -0.17   0.29  3.60
RMN   Thrace Romani, Turkish, Greek      0.39     0.26     0.15        -0.24   0.66  2.91
SVM   Molise Slavic, Italian             0.35     0.24     0.19        -0.22   0.62  2.96

    Table 1: C-S Metrics of Pangloss Corpora

to measure the integration of languages in C-S. Guzmán et al. (2017b) use measures of burstiness and memory from complex systems research (Goh & Barabási, 2008) to characterize the regularity of C-S through time.

    3 Methods: Metrics

In our approach we categorize C-S corpora using the ratio of the languages represented, the probability of switching between them, and the time course of switching throughout the corpus. This allows us to capture how monolingual versus bilingual a corpus is, how integrated the languages are within a corpus, and the regularity of switching, respectively. These measures help us to quantify the desiderata expressed by Adamou (2016), by which comparative analyses of contact corpora should be able to determine 1) whether there is a dominant language, 2) whether a text is more monolingual or more integrated, 3) how subcorpora compare, and 4) whether there is a preference with respect to C-S or borrowing.

    3.1 Ratio

3.1.1 M-Index

The Multilingual Index (M-index) was developed by a group of European sociolinguists (Barnett et al., 2000) who were interested in having language-independent measures of the amount of C-S in a given corpus across a variety of language pairings. In line with Adamou, the M-index is based on word count and quantifies the balance of languages within a corpus. The M-index is calculated as follows, where k > 1 is the total number of languages represented in the corpus, pj is the total number of words in language j over the total number of words in the corpus, and j ranges over the languages present in the corpus:

$$\text{M-Index} \equiv \frac{1 - \sum_j p_j^2}{(k-1)\sum_j p_j^2}. \tag{2}$$

The index is bounded between 0 (monolingual corpus) and 1 (each language in the corpus is represented by an equal number of tokens).
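As a rough illustration (ours, not the authors' code), the M-index can be computed directly from a list of per-token language tags; the function name and the tag representation are our assumptions:

```python
from collections import Counter

def m_index(tags):
    """Balance of languages in a corpus of per-token language tags (Eq. 2).

    Returns 0 for a monolingual corpus and 1 when all k languages are
    equally represented; requires k > 1 languages.
    """
    counts = Counter(tags)
    k = len(counts)
    total = sum(counts.values())
    p_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - p_sq) / ((k - 1) * p_sq)

# A perfectly balanced two-language corpus reaches the maximum of 1.
print(m_index(["en", "es", "en", "es"]))  # → 1.0
```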

3.1.2 Language Entropy

An alternative way to express the inequality of languages in a corpus that is widely used among digital humanists is the Shannon entropy. For our purposes, we describe this metric, language entropy, as quantifying the complexity of the language-tag distribution, returning the number of bits needed to describe it. Using the same conventions of notation as defined above, language entropy is calculated as

$$\mathrm{LE} = -\sum_{j=1}^{k} p_j \log_2(p_j) \tag{3}$$

and is bounded from below by 0 (representing a completely monolingual text) and bounded from above by

$$-\sum_{j=1}^{k} \frac{1}{k}\log_2\!\left(\frac{1}{k}\right) = \log_2(k), \tag{4}$$

which is the maximum entropy for a corpus with k languages (and, in such a case, each language is represented equally). In the case of two languages, the M-index and LE can be derived from one another.
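A minimal sketch of Eq. 3 (our code, assuming the same per-token tag lists as above):

```python
import math
from collections import Counter

def language_entropy(tags):
    """Shannon entropy, in bits, of the language-tag distribution (Eq. 3)."""
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally represented languages reach the log2(2) = 1-bit maximum.
print(language_entropy(["en", "es", "en", "es"]))  # → 1.0
```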

3.1.3 Probability of Switching (I-index)

In an effort to describe the integration of languages in a corpus, we developed the Integration Index, a metric that describes the probability of switching within a text (Guzmán et al., 2016) (see also Gambäck and Das, 2014, 2016), based on the bigram transitions between tokens. We define a switch point as a change in language tags between tokens. The I-index expresses the proportion of existing switch points relative to the possible number of switch points in the corpus. It is the approximate probability that any given bigram in the corpus is a switch point.


Given a corpus composed of tokens tagged by language {l_j}, where j ranges from 1 to n, the size of the corpus, and i = j − 1, the I-index is calculated by the expression

$$\text{I-Index} \equiv \frac{1}{n-1}\sum_{1 \le i = j-1 \le n-1} S(l_i, l_j), \tag{5}$$

where S(l_i, l_j) = 1 if l_i ≠ l_j and 0 otherwise, and the factor of 1/(n − 1) reflects the fact that there are n − 1 possible switch sites in a corpus of size n. The I-index has the potential to differentiate between corpora in which different patterns of switching occur despite a similar M-index. For example, a text followed by its translation would return an I-index very close to zero, whereas an oral corpus among frequent code-switchers would return values relatively farther away from zero.
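Eq. 5 amounts to counting switch points among the n − 1 bigrams; a sketch in our own code:

```python
def i_index(tags):
    """Proportion of bigram transitions that are switch points (Eq. 5)."""
    n = len(tags)
    switches = sum(1 for i in range(n - 1) if tags[i] != tags[i + 1])
    return switches / (n - 1)

print(i_index(["en", "es", "en", "es"]))  # → 1.0 (every bigram is a switch)
print(i_index(["en", "en", "es", "es"]))  # one switch in three bigrams ≈ 0.33
```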

    3.2 Time-course Measures

In an effort to quantify the regularity of C-S within a corpus, we adapted two metrics, burstiness and memory, from the field of complex systems (Goh and Barabási, 2008). If we define a language span as a sequence of words in a single language, then we can frame these two metrics as quantifying the patterns of switching throughout a corpus. Imagine a mixed language where all verbs come from one language and all nouns are drawn from another (e.g. Gurindji Kriol (McConvell and Meakins, 2005)). In essence, a language switch occurs within every utterance. In such a case, the pattern of switching would be very regular and the monolingual spans would be very short. In contrast, we can imagine a situation in which a speaker uses only sporadic borrowings or interjections from a contact language. In this case, the pattern of switching is irregular, with long monolingual spans followed by very short ones. Burstiness and memory combined attempt to account for the intermittency versus regularity of switching patterns. To be consistent with information theory, we can also define burstiness through Shannon entropy with the notion of span entropy.

    3.2.1 Burstiness

Our burstiness measure is from Goh and Barabási (2008). Let στ denote the standard deviation of the language spans and mτ the mean of the language spans. Burstiness is calculated as

$$\text{Burstiness} \equiv \frac{\sigma_\tau/m_\tau - 1}{\sigma_\tau/m_\tau + 1} = \frac{\sigma_\tau - m_\tau}{\sigma_\tau + m_\tau} \tag{6}$$

and is bounded within the interval [−1, 1]. In the above example, Gurindji Kriol (McConvell and Meakins, 2005) would likely take on burstiness values close to −1, whereas the corpus with brief interjections from a contact language would take on values closer to 1.
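A sketch of Eq. 6 over span lengths (our code; the use of the population standard deviation is our assumption, as the paper does not specify the estimator):

```python
from itertools import groupby
from statistics import mean, pstdev

def spans(tags):
    """Lengths of maximal monolingual runs in a tag sequence."""
    return [len(list(g)) for _, g in groupby(tags)]

def burstiness(tags):
    """(sigma - m) / (sigma + m) over the span-length distribution (Eq. 6)."""
    s = spans(tags)
    sigma, m = pstdev(s), mean(s)
    return (sigma - m) / (sigma + m)

# Perfectly regular alternation: all spans have length 1, sigma = 0.
print(burstiness(["en", "es", "en", "es"]))  # → -1.0
```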

3.2.2 Span Entropy

The span entropy is a measure of how many bits of information are needed to quantify the complexity of the language-span distribution. Let M denote the total number of states within the language-span distribution, and l denote a specific span within that distribution, where p_l represents the sample probability of a span of length l. The span entropy is then defined as

$$\mathrm{SE} = -\sum_{l=1}^{M} p_l \log_2(p_l) \tag{7}$$

and is bounded below by 0 (in which case all language spans are of equal length) and above by

$$-\sum_{l=1}^{M} \frac{1}{M}\log_2\!\left(\frac{1}{M}\right) = \log_2(M), \tag{8}$$

which is the maximum entropy for a corpus with M possible language-span states (and in such a case, each possible span has probability 1/M).
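Eq. 7 is the same Shannon entropy as Eq. 3, applied to span lengths rather than tags; a sketch in our code:

```python
import math
from collections import Counter
from itertools import groupby

def span_entropy(tags):
    """Shannon entropy, in bits, of the span-length distribution (Eq. 7)."""
    lengths = [len(list(g)) for _, g in groupby(tags)]
    counts = Counter(lengths)
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All spans of length 1: a single span state, so SE = 0.
print(span_entropy(["en", "es", "en", "es"]))
# Spans [2, 1, 1]: two states with probabilities 2/3 and 1/3, SE ≈ 0.918 bits.
print(span_entropy(["en", "en", "es", "en"]))
```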

    3.3 Memory

While burstiness and span entropy measure the regularity or intermittency of C-S in a corpus, they do not account for the chronological ordering of the language spans. Memory (Goh and Barabási, 2008) informs us about the ordering of spans and the degree to which their lengths are autocorrelated. If all spans were of an equal length, the memory would be closer to 1. However, if there is an asymmetry of languages represented, we would expect a memory value closer to -1. In all likelihood, our corpora are not long enough to yield reliable results for memory. Let n_r be the number of language spans in the distribution and τ_i denote a specific language span in that distribution ordered by i. Let σ_1 and m_1 be the standard deviation and mean of all language spans but the last, where σ_2 and m_2 are the standard deviation and mean of all language spans but the first. Memory is calculated as

$$\text{Memory} \equiv \frac{1}{n_r - 1}\sum_{i=1}^{n_r-1} \frac{(\tau_i - m_1)(\tau_{i+1} - m_2)}{\sigma_1\sigma_2} \tag{9}$$

and is bounded within the interval [−1, 1].
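Eq. 9 is a lagged correlation over consecutive span lengths; a sketch in our code (again assuming the population standard deviation):

```python
from itertools import groupby
from statistics import mean, pstdev

def memory(tags):
    """Autocorrelation of consecutive span lengths (Eq. 9), in [-1, 1]."""
    spans = [len(list(g)) for _, g in groupby(tags)]  # monolingual run lengths
    first, rest = spans[:-1], spans[1:]               # tau_i and tau_{i+1}
    s1, m1 = pstdev(first), mean(first)
    s2, m2 = pstdev(rest), mean(rest)
    n_r = len(spans)
    return sum((a - m1) * (b - m2)
               for a, b in zip(first, rest)) / ((n_r - 1) * s1 * s2)

# Strictly alternating long/short spans are perfectly anticorrelated:
print(memory(["en"] * 3 + ["es"] + ["en"] * 3 + ["es"]))  # ≈ -1.0
```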

Burstiness (or span entropy) in conjunction with memory describes the intermittency of switching behavior. For our purposes, burstiness coupled with the I-index provides us with an indication of whether a text comprises insertional or alternational switching. For instance, a relatively high I-index with a low burstiness value would yield a text that could be described as alternational, whereas the reverse could be characterized as insertional.

    Figure 1: Pangloss Span Density by Corpus

    4 Data

The data are from the XML text files that accompany the natural recordings of endangered European languages in La Collection Pangloss (lacito.vjf.cnrs.fr/pangloss/). We include in our analysis only the data from files in which multiple languages are represented. The corpora include 6k words of Upper Sorbian in contact with German (HSB), 2.6k words of Thrace Romani in contact with Turkish and Greek (RMN), 27.2k words of Molise Slavic in contact with Italian (SVM), 2.4k words of Burgenland Croatian in contact with German (HRV), and 8k words of Balkan Slavic in contact with Greek (MKD). Every token from a contact language in these corpora carries a manually encoded language identification tag. Tokens with no language tag are assumed to come from the endangered matrix language. In pre-processing these data, we filled the unknown language tags with the matrix language identified by the titles of the files within the corpus (e.g. RMN = Thrace Romani). We use the counts of these language tags to determine the ratios in our metrics, and we use counts of bigram transitions to determine language spans for calculating the probability and time course of switching.
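The tag-filling step might look like the following sketch; the (word, tag) pair representation, the `None` placeholder for untagged tokens, and the language codes are our assumptions about the preprocessed data, not the Pangloss XML format itself:

```python
def fill_tags(tokens, matrix_lang):
    """Assign the endangered matrix language to every untagged token.

    `tokens` is a list of (word, tag-or-None) pairs; only contact-language
    tokens carry an explicit tag in the source annotation.
    """
    return [(word, tag if tag is not None else matrix_lang)
            for word, tag in tokens]

# Hypothetical fragment: two untagged (matrix) tokens, two Greek ones.
print(fill_tags([("word1", None), ("word2", None),
                 ("kai", "el"), ("meta", "el")], "rmn"))
```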

    5 Results

As indicated in Table 1 and depicted by the span distributions in Figure 1, these endangered-language corpora appear to fall into the two distinct sets also found by Adamou on the basis of word count. In particular, the Romani (RMN) and Molise Slavic (SVM) sub-corpora pattern distinctly from the others. As indicated by their higher M-indices, Thrace Romani and Molise Slavic present the greatest ratio of contact to endangered languages. This is also seen in their higher language entropy (LE) scores. These values capture similar information to Adamou's word-count method. Their I-indices indicate that these corpora have a much higher probability of C-S relative to the Croatian (HRV), Sorbian Slavic (HSB), and Balkan Slavic (MKD) corpora. This information cannot be captured solely by the word-count approach, since the latter does not account for the distribution of one language relative to the other. Importantly, the Romani (RMN) and Molise Slavic (SVM) also show low burstiness values and, consequently, lower span entropy scores; thus, the lengths of language spans are distributed throughout these corpora in a more regular fashion than in the other sub-corpora. This implies that Adamou's characterization of the Thrace Romani and Molise Slavic as alternational C-S is correct. By contrast, the remaining corpora appear to present largely insertional C-S, with interpolated contact-language items. The negative memory values indicate that longer spans tend to be followed by shorter spans and vice versa across all corpora. This suggests that every corpus in this dataset is asymmetrical, i.e., there is a discernible majority language in every corpus, despite the differences in the percentage of contact forms contained therein.

    6 Conclusion

The metrics that we have applied demonstrate that the endangered-language corpora under study pattern in one of two ways: either with intermittent, short insertions of a contact language within the endangered language, or more regularly as a form of alternational C-S (Adamou, 2016). While the measures for the degree of bilingualism in these corpora (M-index, LE) appear to be correlated with the measures for integration (I-index, SE) in these subcorpora, this need not be the case in all contact corpora, and other scenarios are attested (see Guzmán et al., 2017b). The advantages of our metrics are multiple: they can quantify the nature of C-S in a language-independent way, can be fit as continuous variables in statistical models, and can be applied to any grammatical component annotated for language (morphemes, conversational turns, etc.). With these metrics, we can test whether these quantifiable dimensions of mixing predict other contact phenomena (borrowing and calquing) and outcomes (fused or mixed languages) or correlate with social variables. In future work, we hope to employ these measures as variables in predicting C-S sites, as well as to explore the degree to which these measures help to classify C-S typology.

    7 Acknowledgements

We thank Lila Adamou for sharing her annotated data and Joseph Ricard for sharing his mathematical expertise.

    References

Evangelia Adamou. 2016. A Corpus-Driven Approach to Language Contact: Endangered Languages in a Comparative Perspective, volume 12. Walter de Gruyter GmbH & Co KG.

Peter Auer. 1999. From codeswitching via language mixing to fused lects: Toward a dynamic typology of bilingual speech. International Journal of Bilingualism, 3(4):309–332, December.

Ruthanna Barnett, Eva Codó, Eva Eppler, Montse Forcadell, Penelope Gardner-Chloros, Roeland van Hout, Melissa Moyer, Maria Carme Torras, Maria Teresa Turell, Mark Sebba, Marianne Starren, and Sietse Wensing. 2000. The LIDES coding manual: A document for preparing and analyzing language interaction data, version 1.1—July, 1999. International Journal of Bilingualism, 4(2):131–132, June.

Amitava Das and Björn Gambäck. 2014. Identifying languages at the word level in code-mixed Indian social media text. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pages 169–178.

Björn Gambäck and Amitava Das. 2014. On measuring the complexity of code-mixing. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pages 1–7.

Björn Gambäck and Amitava Das. 2016. Comparing the level of code-switching in corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1850–1855.

K.-I. Goh and A.-L. Barabási. 2008. Burstiness and memory in complex systems. EPL (Europhysics Letters), 81(4):48002.

Gualberto Guzmán, Jacqueline Serigos, Barbara E. Bullock, and Almeida Jacqueline Toribio. 2016. Simple tools for exploring variation in code-switching for linguists. EMNLP 2016, pages 2–20.

Gualberto Guzmán, Jacqueline Serigos, Barbara E. Bullock, and Almeida Jacqueline Toribio. 2017a. Metrics for modeling code-switching across corpora. Interspeech 2017.

Gualberto A. Guzmán, Joseph Ricard, Jacqueline Serigos, Barbara Bullock, and Almeida Jacqueline Toribio. 2017b. Moving code-switching research toward more empirically grounded methods. In CDH2017: Corpora in the Digital Humanities.

Patrick McConvell and Felicity Meakins. 2005. Gurindji Kriol: A mixed language emerges from code-switching. Australian Journal of Linguistics, 25(1):9–30.

Pieter Muysken. 2000. Bilingual Speech: A Typology of Code-Mixing. Cambridge University Press, Cambridge.

Carol Myers-Scotton. 1993. Duelling Languages: Grammatical Structure in Codeswitching. Oxford University Press (Clarendon Press), Oxford.

Shana Poplack, David Sankoff, and Christopher Miller. 1988. The social correlates and linguistic processes of lexical borrowing and assimilation. Linguistics, 26(1):47–104.

Sarah Grey Thomason and Terrence Kaufman. 1992. Language Contact, Creolization, and Genetic Linguistics. University of California Press.

Ngoc Thang Vu, Heike Adel, and Tanja Schultz. 2013. An investigation of code-switching attitude dependent language modeling. Statistical Language and Speech Processing, pages 297–308.

Uriel Weinreich. 1953. Languages in Contact: Findings and Problems. Number 1. Walter de Gruyter.


• Language Independent Proposal to Profile-based Named Entity Classification

Isabel Moreno
Department of Software and Computing Systems, University of Alicante, Alicante, [email protected]

María Teresa Romá-Ferri
Department of Nursing, University of Alicante, Alicante, [email protected]

Paloma Moreda
Department of Software and Computing Systems, University of Alicante, Alicante, [email protected]

    Abstract

This paper presents a named entity classification system, designed to be language independent. Our methodology employs Random Forest, a supervised machine learning algorithm, and its features are generated in an unsupervised manner. Our feature set includes local information from the entity and profiles (context information), without external knowledge resources (such as gazetteers) or complex linguistic analysis. This system is tested on three different languages: (i) the Spanish CoNLL2002 dataset (overall Fβ=1 = 81.40), (ii) the Dutch CoNLL2002 dataset (overall Fβ=1 = 81.40), and (iii) the English CoNLL2003 dataset (overall Fβ=1 = 81.40). Although our results are slightly lower than previous work, they are reached without external knowledge or complex linguistic analysis. Last, using the same configuration for all corpora, the difference in overall Fβ=1 is only 9.07 points (worst case Spanish–English). Thus, these results support our hypothesis that our approach is language independent and does not require any external knowledge or complex linguistic analysis.

    1 Introduction

Named entities (NE) are essential information which play a key role in other text-processing tasks such as opinion mining (Ding et al., 2009; Jin et al., 2009), natural language generation (Vicente and Lloret, 2016), information retrieval (Guo et al., 2009; Chen et al., 1998) or question answering (Peregrino et al., 2012; Lee et al., 2007, 2006), among others. Furthermore, NEs have proven their importance through a positive effect on performance when NEs are extracted as part of the pipeline of, for instance, automatic text summarization systems (Alcón and Lloret, 2015; Fuentes and Rodríguez, 2002).

The named entity recognition and classification (NERC) task has two aims: (i) to identify the NE mentions and their boundaries in text, which is known as named entity recognition (NER), and (ii) to classify them into a predefined set of types of interest, which is referred to as named entity classification (NEC). In order to build a NERC system, both objectives can be tackled in joint or separate modules (Feldman and Sanger, 2007, pp. 96–97).

Although NEs are an important source of information, adapting existing NERC tools to new languages is not always a straightforward process. The reason behind this fact is that most of them are designed ad hoc for a specific corpus and, consequently, there is a language dependence on such a corpus. Therefore, the adaptation of a NERC system to a new language, with different constraints or characteristics, is conditioned mainly by three factors. First, NERC systems often rely on linguistic analysis tools, which are not always available for all languages (Indurkhya, 2014). For example, Freeling (Padró and Stanilovsky, 2012) is a language analysis tool suite that deals with more than 10 languages, but there are more than 7000 known living languages in the world as of today (Simons and Fennig, 2017). Second, NERC tools usually require knowledge resources which vary between languages (Marrero et al., 2013), if they exist. For instance, Wikipedia is a common knowledge source. Currently there are almost 300 Wikipedias for several languages with different sizes1, but again not all known languages are covered. Last, each language has different rules (e.g. morphological or syntactical) and, thus,

1https://meta.wikimedia.org/wiki/List_of_Wikipedias Last access: August 3rd, 2017

Mikel L. Forcada, Chao-Hong Liu, Jinhua Du, Qun Liu (eds.), Proceedings of MLP 2017: The First Workshop on Multi-Language Processing in a Globalising World, pp. 21–29. Dublin, Ireland, 4–5 September 2017.

different challenges. These may not be trivial to tackle, even more so if they need to be tackled simultaneously for several languages. In fact, the performance of language-independent NERC systems can be affected (Tjong Kim Sang, 2002; Sang and De Meulder, 2003). Consequently, if the target language changes, then the effort required to port an existing NERC system increases.

Bearing that in mind, our final aim is to develop a language-independent NERC system that consists of two separate modules, NER and NEC. To that end, this work develops a language-independent NEC based on local information as well as context, through profiles (Lopes and Vieira, 2015). This system assumes the output of a "perfect NER" to avoid propagation of errors from the identification phase.

To confirm such language independence, our NEC module is evaluated on three corpora from three different languages: English (Sang and De Meulder, 2003), and Spanish and Dutch (Tjong Kim Sang, 2002). The obtained results show the stability of our proposal (i.e. the maximum overall Fβ=1 difference is 9.07 points).

The rest of the paper is structured as follows. Section 2 reviews previous NERC systems. Then, Section 3 defines our approach. Later, the experimental set-up is described in Section 4. The evaluation is provided in Section 5, and Section 6 discusses our results. Last, Section 7 concludes the paper and outlines future work.

    2 Existing Work

Traditionally, NERC research focused only on one language, which was usually English (Nadeau and Sekine, 2007). However, efforts to tackle other languages, as well as multiple languages, have also appeared. For instance, the CoNLL conference presented two shared tasks to deal with language-independent NERC systems: CoNLL2002 (Tjong Kim Sang, 2002) and CoNLL2003 (Sang and De Meulder, 2003). These two were focused on the Spanish–Dutch and English–German language pairs, respectively. In both editions, each pair of languages obtained different results regardless of the NERC system used, including the best systems in each edition.

In the 2002 shared task, Carreras et al. (2002, 2003) obtained the best results but were ranked 5th in 2003, thus achieving a maximum overall Fβ=1 difference of 15.85 points between the four languages (Spanish Fβ=1=81.39, Dutch Fβ=1=77.05, English Fβ=1=85 and German Fβ=1=69.15). This system performs NER and NEC sequentially with separate modules. Both components use a binary AdaBoost classifier to ensemble small decision trees. Regarding its NEC module, it considers as features lexical (word forms, lemmas, their position and NE length) and orthographic (e.g. capitalization or prefixes) information from the context and the NE being classified, linguistic tags (such as PoS and syntactic chunks) and external gazetteers.

First place in the 2003 edition went to Florian et al. (2003), with an overall Fβ=1 difference of 16.35 points between English and German (English Fβ=1=88.76 and German Fβ=1=72.41). In this case, a voting scheme was implemented to tackle the NERC task in one single step. Specifically, four machine learning algorithms were trained and combined: a robust linear classifier, maximum entropy, transformation-based learning and hidden Markov models. These algorithms take advantage of local features, such as prefixes and suffixes, as well as linguistic features (e.g. PoS tags or lemmas) and external gazetteers.

Once these competitions finished, several researchers addressed this task using the CoNLL corpora. An example is the NERC system proposed by Konkol et al. (2015), which addresses this task in one step. It is based on the Conditional Random Fields (CRF) algorithm, employing unsupervised features from clusters of semantic spaces (COALS - Correlated Occurrence Analogue to Lexical Semantics - and HAL - Hyperspace Analogue to Language) as well as Latent Dirichlet allocation, which can be seen as a substitute for a gazetteer. This system obtained different results for each language, with a maximum overall Fβ=1 difference of 15.10 points (English Fβ=1=89.18, Spanish Fβ=1=82.74, Dutch Fβ=1=83.01 and Czech Fβ=1=74.08). Ixa-pipes (Agerri and Rigau, 2016) is another example. It learns Perceptron models, using the OpenNLP machine learning framework2, to tackle NER and NEC jointly. Their inferred model is based on features gathered from local information (e.g. token, shape, n-grams, prefix and suffix), clusters (i.e. Brown, word2vec and Clark) and external knowledge (gazetteers). This system also achieved different results for each language, showing a maximum overall Fβ=1

2https://opennlp.apache.org/ Last access: August 3rd, 2017


difference of 14.94 points between the four languages (Spanish Fβ=1=84.16, Dutch Fβ=1=85.04, English Fβ=1=91.36 and German Fβ=1=76.42).

Furthermore, all the NERC systems presented here employed different levels of linguistic analysis (lexical or syntactical), as well as external knowledge (gazetteers or substitutes). Consequently, all of them have shown a performance gap between languages. Although these systems were assessed on at least two languages, all require an adaptation process to tune feature sets for each language and its requirements. Therefore, they cannot be considered truly language independent.

3 Method Proposed: Named Entity Classification through Profiles

This NEC methodology is based on previous work (Lopes and Vieira, 2015), in which profiles were generated in an unsupervised manner for authorship detection. Beyond the purpose for which the profiles are used, there are two main differences between their work and ours. On the one hand, Lopes and Vieira (2015) obtain profiles from a concept extractor system, while ours are derived from lemmas of nouns, verbs, adjectives and adverbs. On the other hand, the categorization of Lopes and Vieira (2015) is performed by ranking possible entities using their own similarity measure, whereas ours calculates similarity between profiles and entities through a machine learning algorithm. The architecture of the proposed method can be seen in Figure 1. This method consists of two main stages (solid squared boxes in Fig. 1):

1. Profile generation process, whose main goal is to train a machine learning system to perform NEC, is an off-line process that works as follows:

(a) Linguistic annotation: a corpus previously annotated with NEs is tokenized, sentence-split, morphologically analysed and PoS-tagged (dotted oval titled "Linguistic analyzer" in Fig. 1). In our case, Freeling (Padró and Stanilovsky, 2012) is used for Spanish, whereas Treetagger (Schmid, 1994, 1995) is chosen for Dutch and English.

(b) Descriptors extraction: For each entity instance, we extract descriptors3 in a window, together with their frequency as the number of occurrences. The size of the window can be parametrized, but a window of fixed length is used. In this work, the length of the window was set to 10 lemmas (5 descriptors before and 5 after). Then, the occurrences (occ) of each descriptor (d) are aggregated by entity type (type): occ(d, type).

3Descriptors represent lemmas of content-bearing terms, that is, nouns, verbs, adverbs and adjectives.
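Step (b) can be sketched as follows; this is our illustration, and the coarse PoS tag set is an assumption (the actual Freeling and TreeTagger tagsets differ):

```python
from collections import Counter

# Coarse PoS tags taken here to mark content words (see footnote 3);
# the real tagsets used by the linguistic analyzers are richer.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def window_descriptors(lemmas, entity_index, size=5):
    """Content-word lemmas in a window of `size` lemmas on each side
    of the entity occurrence at `entity_index`."""
    lo = max(0, entity_index - size)
    window = (lemmas[lo:entity_index]
              + lemmas[entity_index + 1:entity_index + 1 + size])
    return [lem for lem, pos in window if pos in CONTENT_POS]

# Aggregate occurrences by entity type, occ(d, type):
occ = Counter()
sent = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
        ("PERSON_NE", "PROPN"), ("jump", "VERB")]
for d in window_descriptors(sent, entity_index=3):
    occ[(d, "PER")] += 1
print(occ)
```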

(c) Split the training corpus: For each NE type (e), the original training corpus is divided into two sets, called target (Te) and contrasting (Ge). The former (Te) represents a fragment of the corpus capable of characterizing that a given NE belongs to a certain class (i.e. positive examples of this NE type), while the latter (Ge) is composed of a set of negative examples (i.e. examples of the remaining NE categories) aggregated by entity type.

(d) Descriptors division: For each NE type, the extracted descriptors are split into two lists, named the unique (Ue) and common (Ce) descriptor lists. The former (Ue) contains descriptors only present in the target set (Te), whereas the latter (Ce) includes descriptors common to both the target (Te) and contrasting (Ge) sets for a given entity type (e).

(e) Relevance computation: For both the unique (Ue) and common (Ce) descriptor lists of each NE type (e), a relevance index is assigned to weight them and determine their importance for a given NE category. The Term Frequency, Disjoint Corpora Frequency (TFDCF) index (Lopes and Vieira, 2015), defined in Equation 1, is applied to items from the unique descriptors list (Ue), whereas the relevance common index, defined in Equation 2, is computed for each item in the common descriptors list (Ce) to penalize descriptors found in the contrasting set (Ge) as well as in the target set (Te).

    idx_{unique}(d, T_e, G_e) = \log\left(1 + \frac{occ(d, T_e)}{\prod_{\forall g \in G_e} \left(1 + \log\left(1 + occ(d, g)\right)\right)}\right)   (1)

    idx_{common}(d, T_e, G_e) = \log\left(1 + occ(d, T_e) - \frac{occ(d, T_e)}{\prod_{\forall g \in G_e} \left(1 + \log\left(1 + occ(d, g)\right)\right)}\right)   (2)

where d is a descriptor, Te is the target set for an entity type e, g is a contrasting set, Ge contains all contrasting sets from the contrasting entities for an entity type e, and occ(d, Te) is the number of occurrences of a term d in the set Te for an entity type e.

    Figure 1: Architecture of the proposed NEC approach based on profiles and local features

    This step produces a profile Pe for each entity type e (the profiles for all types are represented as a database in Fig. 1). Pe is constituted by its unique (Ue) and common (Ce) descriptor lists: Pe = {Ue, Ce}. In turn, each item in these lists is a pair {d, idx(d)}, where d represents a descriptor (i.e. a term's lemma) and idx is its relevance index. It is important to remember that the relevance indexes are computed from descriptor occurrences extracted from both the target (Te) and contrasting (Ge) sets. These lists only contain the most frequent descriptors; specifically, this work employs up to the 1000 most frequent descriptors in both lists.
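Equations 1 and 2 translate directly into code. A minimal sketch (the function names and data layout are ours, not the authors'):

```python
import math

def _contrast_product(d, contrasting_occ):
    """Shared denominator of Equations 1 and 2: the product over the
    contrasting sets g in G_e of (1 + log(1 + occ(d, g)))."""
    prod = 1.0
    for occ_g in contrasting_occ:  # one {descriptor: count} table per contrasting type
        prod *= 1.0 + math.log(1.0 + occ_g.get(d, 0))
    return prod

def idx_unique(d, target_occ, contrasting_occ):
    """TFDCF index (Equation 1) for descriptors of the unique list U_e."""
    return math.log(1.0 + target_occ.get(d, 0) / _contrast_product(d, contrasting_occ))

def idx_common(d, target_occ, contrasting_occ):
    """Common relevance index (Equation 2): the descriptor's target-set
    frequency, penalized by its occurrences in the contrasting sets."""
    freq = target_occ.get(d, 0)
    return math.log(1.0 + freq - freq / _contrast_product(d, contrasting_occ))
```

For a descriptor that is truly unique to the target set, every occ(d, g) is 0, so the product is 1 and idx_unique reduces to log(1 + occ(d, Te)), while idx_common is 0 in that degenerate case.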

(f) Local features extraction: profiles are complemented with local information from the NE itself, regardless of its category. As before, this information is obtained easily, without requiring semantic or syntactic analysis or external knowledge. Concretely, three types of features, adopted from state-of-the-art NEC systems, are extracted from the training data: the words of the entity4 (denoted NE); the entity length without stop-words (denoted NElen); and affixes, distinguishing between suffixes and prefixes with a length between 1 and 4 characters, inclusive, from the first and last words respectively (denoted affix4). This information is drawn in Figure 1 as a rounded rectangle box called “Local features generation”.

    4Please bear in mind that any special character is replaced with “ ”.
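A sketch of these three feature families (which word supplies the prefixes versus the suffixes, and the "_" replacement character, are our reading of the text, not confirmed details):

```python
import re

def local_features(entity_tokens, stop_words=frozenset()):
    """Local features of a recognized NE: its words (NE), its length
    without stop-words (NElen), and affixes of 1-4 characters (affix4).
    Replacing special characters with "_" is our assumption; the paper
    only says they are replaced."""
    words = [re.sub(r"[^\w]", "_", w) for w in entity_tokens]
    features = {
        "NE": " ".join(words),  # entity words
        "NElen": sum(1 for w in words if w.lower() not in stop_words),
    }
    first, last = words[0], words[-1]
    for n in range(1, 5):  # affix lengths 1 to 4, inclusive
        features["prefix%d" % n] = first[:n]  # prefixes from the first word
        features["suffix%d" % n] = last[-n:]  # suffixes from the last word
    return features
```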

(g) Model training: our proposal creates a machine learning model (a rectangle with double-struck vertical edges in Fig. 1) for computing profile similarity between all NE classes, local NE features and the gold-standard candidates. Thus, in this step, a multi-class classification model is generated by joining the local features and the profiles of all NE types. As a result, profiles are represented as follows: for each NE type, each descriptor lemma from the top list (Te) is a feature whose value is its relevance index idx_unique (Equation 1); similarly, each descriptor lemma from the common list (Ce) of this NE type is a feature whose value is idx_common (Equation 2). The Random Forest (RF) algorithm (Breiman, 2001) from Weka 3.6.7 (Hall et al., 2009) (dotted oval titled “Machine Learning Suit” in Fig. 1) has been employed owing to its ability to deal with more than two classes. Moreover, its selection was motivated by its fast training, its stability with respect to data changes and its automatic variable selection. The RF algorithm uses the default parameters, except that the number of trees was set to 100.
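The paper trains Weka 3.6.7's Random Forest with 100 trees; as a rough equivalent, here is a sketch with scikit-learn (a substitution, not the authors' toolchain), where each training instance is a dict merging the profile relevance indexes with the local features:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def train_nec_model(instances, labels):
    """instances: list of dicts merging, per candidate, the idx_unique /
    idx_common values of the profile descriptors and the local features
    (NE, NElen, affix4).  labels: gold NE types (e.g. PER/ORG/LOC)."""
    model = make_pipeline(
        # one column per feature name; string values are one-hot encoded
        DictVectorizer(sparse=True),
        # 100 trees, as in the paper; other parameters left at defaults
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    model.fit(instances, labels)
    return model
```

DictVectorizer makes the sparse, name-keyed feature space explicit: unseen descriptors at prediction time are simply ignored rather than causing errors.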

2. Profile application process (solid squared box in Fig. 1), whose aim is to classify a previously recognized NE into one of a set of predefined types, takes these steps:

(a) Linguistic annotation: the text is tokenized, sentence-split, morphologically analyzed and PoS-tagged, as in the generation phase.

(b) Candidate extraction: candidates are gathered directly from the gold standard. It should be noted that candidates can be extracted by any NER module, but here the output of a “perfect NER” is used to avoid any bias.

(c) Descriptor extraction: for each candidate and each possible entity type, we extract the descriptors that appear in a window, using the same restrictions as in the generation phase.

(d) Relevance computation: the unique relevance index idx_unique (Equation 1) is computed for all candidates and all possible NE types.

(e) Local features extraction: for each candidate, the local information is gathered (NE, NElen, affix4).

(f) Similarity computation and classification: once an entity candidate has filled in its profile and its local information, these data are compared against those generated from the training data to compute their similarity. The RF algorithm estimates similarity with a forest of trees whose features are the local information as well as the descriptors of all possible entity types (P = {T, C}).

Last, it should be noted that language adaptation is direct, since there is no need to change the NEC system or its parameters. It only has two requirements: (i) a training corpus previously annotated with the target NEs, together with its tag set; and (ii) a basic linguistic analyzer able to split sentences, tokenize, lemmatize and PoS-tag the new language.

    4 Datasets and experimental set-up

Three experiments were conducted using corpora from different languages:

    1. Spanish CoNLL2002 (Tjong Kim Sang, 2002): a collection of news wire articles from the EFE News Agency.

    2. Dutch CoNLL2002 (Tjong Kim Sang, 2002): four editions of the De Morgen newspaper.

    3. English CoNLL2003 (Sang and De Meulder, 2003): a collection of news stories from Reuters.

In all cases, the corpora contain four entity types (person, organization, location and miscellaneous). However, the miscellaneous type is discarded because it has no practical application, as suggested by Marrero et al. (2013).

All these corpora are divided into three sets: training, development and testing. A machine learning model is inferred on the training set for all NEs, and the results reported for this model are obtained on the test set. The development set is not used, since no parameter tuning was done.

The performance of each experiment is evaluated using the traditional Precision, Recall and F-measure (β=1) of the positive class (e.g. “it is an Organization”). Last, overall macro-averaged results are calculated as the arithmetic mean over all entity types (E=3), so as to avoid a possible bias towards the most frequent type.
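The macro-averaged evaluation described above can be sketched as follows (assuming per-type precision and recall have already been computed):

```python
def macro_scores(per_type_scores):
    """per_type_scores: {entity_type: (precision, recall)} for the positive
    class of each type.  Returns macro-averaged precision, recall and
    F-measure (beta=1): the arithmetic mean over types, so that frequent
    types do not dominate the overall result."""
    n = len(per_type_scores)
    precs, recs, f1s = [], [], []
    for p, r in per_type_scores.values():
        precs.append(p)
        recs.append(r)
        f1s.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```

Note that the macro F-measure is the mean of the per-type F-measures, not the F-measure of the mean precision and recall; the two generally differ.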

    5 Evaluation Results

The aim of our experiments is to assess our system on corpora from three different languages, as well as to find out the contribution of each local feature (i.e. NE, NElen and affix4). To that end, Table 1 collects the overall results for each corpus, as well as the performance for each entity type, including local information gradually.

Regarding the results for Spanish (left in Table 1), Person and Location entities usually obtain higher Fβ=1 results than Organization. However, when all local features are taken into account (p+NElen+NE+affix4), Person is the type with the lowest Recall (50.07%) and, consequently, the lowest Fβ=1 (65.19%). All in all, an improvement of up to 44.40% (22.24 points) in overall macro-averaged Fβ=1 is achieved with this configuration.

Concerning the Dutch results (center of Table 1), Organization is the entity type that most often gets the highest Fβ=1 results. Regardless of the entity, the best Fβ=1 results are accomplished when all features are integrated in our pipeline (p+NElen+NE+affix4). In this case, the improvement is 66.71% (31.01 points).

Regarding the results for English (right in Table 1), the entity type that performs best in terms of Fβ=1 is Location. But, again, the inclusion of context (i.e. profiles) and all local features (NElen+NE+affix4) leads to the highest results, regardless of the type. As a result, an improvement of up to 50.49% (27.31 points) in overall macro-averaged Fβ=1 is achieved with this configuration.

Finally, overall performance in each language is compared for the best configuration (p+NElen+NE+affix4).


Table 1: Precision, Recall and F-measureβ=1 results for Spanish, Dutch and English

                                Spanish CoNLL2002        Dutch CoNLL2002          English CoNLL2003
    Features                    Pr     Re     Fβ=1       Pr     Re     Fβ=1       Pr     Re     Fβ=1
    p                    ORG    46.94  28.16  35.20      46.66  68.12  55.39      55.91  43.84  49.15
                         PER    58.71  49.72  53.85      48.33  39.15  43.25      51.99  58.10  54.88
                         LOC    53.66  71.29  61.23      49.43  29.37  36.84      55.82  60.92  58.26
                         M      53.10  49.72  50.09      48.14  45.55  45.16      54.57  54.29  54.09
    p + NElen            ORG    48.22  27.62  35.12      50.93  72.31  59.77      62.61  51.55  56.55
                         PER    59.40  50.74  54.73      52.40  46.51  49.28      57.83  63.15  60.37
                         LOC    53.53  71.57  61.25      50.39  29.02  36.83      57.62  62.19  59.82
                         M      53.71  49.98  50.36      51.24  49.28  48.63      59.36  58.96  58.91
    p + NE               ORG    63.72  29.39  40.22      56.70  80.15  66.42      69.98  44.65  54.52
                         PER    71.84  59.32  64.98      68.54  58.27  62.99      61.06  80.55  69.46
                         LOC    66.96  87.58  75.90      73.53  45.35  56.10      66.88  68.91  67.88
                         M      67.51  58.76  60.37      66.26  61.26  61.83      65.97  64.70  63.95
    p + affix4           ORG    81.67  33.33  47.34      60.33  90.71  72.46      82.97  65.73  73.35
                         PER    83.97  46.86  60.15      79.16  53.49  63.84      80.65  75.57  78.03
                         LOC    56.74  93.79  70.71      82.76  54.42  65.66      67.16  85.00  75.03
                         M      74.13  57.99  59.40      74.08  66.21  67.32      76.93  75.43  75.47
    p + NElen + NE       ORG    70.29  32.52  44.47      59.09  84.34  69.49      76.07  54.35  63.40
                         PER    75.64  62.73  68.58      73.64  62.79  67.78      67.33  82.65  74.21
                         LOC    60.51  85.57  70.89      74.19  44.33  55.50      66.19  69.39  67.75
                         M      68.81  60.27  61.31      68.97  63.82  64.26      69.86  68.80  68.45
    p + NE + affix4      ORG    90.74  51.97  66.09      67.17  92.62  77.87      89.14  66.85  76.40
                         PER    89.58  66.61  76.40      85.54  63.44  72.85      81.04  80.17  80.60
                         LOC    66.62  94.79  78.24      85.14  64.29  73.26      71.29  89.23  79.26
                         M      82.31  71.12  73.58      79.28  73.45  74.66      80.49  78.75  78.75
    p + NElen + affix4   ORG    77.36  33.47  46.72      60.74  92.44  73.31      84.59  68.59  75.76
                         PER    85.86  46.49  60.32      85.54  55.81  67.55      85.15  76.77  80.74
                         LOC    56.74  93.79  70.71      80.97  53.06  64.11      67.37  86.45  75.73
                         M      73.32  57.92  59.25      75.75  67.11  68.32      79.04  77.27  77.41
    p + NElen + NE       ORG    65.44  95.64  77.71      87.08  64.17  73.89      64.45  92.80  76.07
      + affix4           PER    93.40  50.07  65.19      67.71  94.72  78.97      82.10  46.34  59.24
                         LOC    88.58  63.65  74.07      89.44  65.63  75.71      81.50  59.14  68.55
                         M      82.47  69.79  72.33      81.41  74.84  76.19      83.21  81.36  81.40

    Acronyms, ordered by appearance: (i) Pr: Precision; (ii) Re: Recall; (iii) Fβ=1: F-measureβ=1; (iv) p: profile; (v) ORG: ORGANIZATION; (vi) PER: PERSON; (vii) LOC: LOCATION; (viii) M: macro-averaged result; (ix) NElen: entity length without stop-words; (x) NE: entity words; (xi) affix4: entity suffixes and prefixes with a length between 1 and 4 characters (inclusive) from the first and last words, respectively.

Table 2: Comparison with existing NERCs in terms of Fβ=1 and the highest difference between corpora. Systems are ranked by difference.

    System                         Spanish   Dutch   English   German   Czech   Dif
    Profiles+local                 72.33     76.19   81.40     -        -       9.07
    Carreras et al. (2002, 2003)   81.39     77.05   85        69.15    -       12.24
    Agerri and Rigau (2016)        84.16     85.04   91.36     76.42    -       14.94
    Konkol et al. (2015)           82.74     83.01   89.19     -        74.08   15.10
    Florian et al. (2003)          -         -       88.76     72.41    -       16.35

    The Dif column refers to the absolute difference in overall Fβ=1 between the best and worst reported results.


The highest results are achieved for English (MFβ=1=81.40%), closely followed by Dutch (MFβ=1=76.19%). Spanish produced the lowest results (MFβ=1=72.33%), as a consequence of the recall for Person (50.07%). Nevertheless, for almost all entities, regardless of the language, Fβ=1 is higher than 65% for this setting. In conclusion, the best configuration is the same across languages, and the results obtained are adequate regardless of the language.

    6 Discussion

In view of the results, the combination of profiles and local information has proved to be appropriate for NEC. In terms of macro-Fβ=1, using the same configuration (i.e. the same machine learning algorithm with equal settings, plus identical sizes for profiles and windows), the lowest overall result is achieved for Spanish (MFβ=1=72.33%), whereas English obtains the highest (MFβ=1=81.40%). In spite of this, these results support our hypothesis that this NEC system is language independent, since only a small difference in global MFβ=1 is observed: in the worst case (Spanish-English) the difference between languages is 9.07 points, while in the best case (Spanish-Dutch) it is 3.86 points.

A comparison between our system and the NERCs presented in Section 2 is made, even though it is not free of certain limitations: (i) not all systems worked with the same corpora (e.g. only Konkol et al. (2015) experimented with the Czech Named Entity Corpus); consequently, some languages are missing and different entity types are used (e.g. this work discarded the miscellaneous entity). (ii) Performance is assessed in various ways5. And (iii) this work gathers the NE boundaries directly from the gold standard to provide results for the NEC task alone, whereas these systems reported results for the NERC task as a whole.

Still, such a comparison is made in two steps. On the one hand, our approach is compared with the systems trained on the CoNLL datasets to determine the appropriateness and extent of our contribution. On the other hand, a comparison in terms of the absolute overall Fβ=1 difference between the best and worst reported results is provided, in order to show that this difference6 is in line with theirs or lower. These data are summarized in Table 2, where our NEC approach (Profiles+local) is listed first.

    5For example, CoNLL reported micro-averages, while this paper reports macro-averages.

Regarding the adequacy of our NEC, it can be observed that our results are slightly lower than those of the remaining systems trained on the CoNLL corpora. However, it should be noted that our results are obtained without using external knowledge (such as gazetteers) or complex linguistic analysis (e.g. syntactic analysis), whereas the NERCs compared achieved their results using gazetteers, which are language dependent, and usually also include syntactic information.

Regarding language independence (the Dif column in Table 2), our profile-based NEC system obtains the smallest difference (9.07 points) when compared to the worst case of these systems.

In light of all of this, our hypothesis that this NEC approach is language independent is reinforced. Nevertheless, language independence does not mean that our approach needs no language resources. As previously mentioned, our method requires lexical and morphological analysis (i.e. a sentence splitter, tokenizer, lemmatizer and PoS-tagger). This level of analysis is generally available regardless of the language, but not all analyzers support the same set of languages; for instance, Dutch is not currently supported in Freeling. Although the tools that perform this low-level analysis are mature enough and have similar performance across languages, the information provided by the analyzers may differ and affect the resulting output. In fact, in our opinion, the observed differences in performance between languages are partially explained by the fact that different linguistic analyzers were employed: Freeling for Spanish and TreeTagger for the remaining languages.

Hence, we plan to experiment with a different type of descriptor: instead of using the lemmas of nouns, verbs, adjectives and adverbs, descriptors can simply be tokens, with function words filtered out using a stop-words list. This lifts the lemmatizer and PoS-tagger requirements, which is extremely important in cases where such analyzers are not available. In this manner, the approach can easily be ported to other languages, regardless of the capabilities of their linguistic analyzers, enabling us to evaluate languages that either have fewer resources or are more different from the ones in this work.

    6The difference is calculated from the best and the worst language among the reported results, which are micro-averaged.

    7 Conclusion

In this paper, a language-independent named entity classification system is presented. It employs Random Forest, a supervised machine learning algorithm. Its feature set is based on profiles (the context of the surrounding words) and local information gathered from the NE itself.

Language adaptation is performed not only without external resources, but also with the system parameters unchanged. It only requires a corpus annotated with the target NEs, and a linguistic analyzer to lemmatize and PoS-tag the new language.

The experiments involved adapting our approach to three different languages tackled as part of the CoNLL shared tasks (Tjong Kim Sang, 2002; Sang and De Meulder, 2003), namely Spanish (Fβ=1=72.33), Dutch (Fβ=1=76.19) and English (Fβ=1=81.40). The evaluation shows a maximum Fβ=1 difference of 9.07 points between languages in the worst case, which is less than for other multilingual systems. Our results are encouraging, but slightly lower than previous work; thus, classification needs to be improved to reduce those differences.

Hence, as future work, we plan to enhance profiles with new language-independent features derived automatically from the training data. Besides, we intend to generate profiles using only tokens, in order to remove the basic linguistic analyzer requirement. Additionally, aiming to reinforce our hypothesis, our approach will be applied to other languages that either have fewer resources or are more different from the ones selected for this paper.

    Acknowledgements

This research work has been partially funded by the Spanish Government, Generalitat Valenciana, the University of Alicante and Ayudas Fundación BBVA a equipos de investigación científica 2016, through the projects TIN2015-65100-R, TIN2015-65136-C02-2-R, PROMETEOII/2014/001, GRE16-01 (“Plataforma inteligente para recuperación, análisis y representación de la información generada por usuarios en Internet” [Intelligent platform for the retrieval, analysis and representation of user-generated information on the Internet]) and “Análisis de Sentimientos Aplicado a la Prevención del Suicidio en las Redes Sociales” (ASAP) [Sentiment Analysis Applied to Suicide Prevention in Social Networks].

References

Rodrigo Agerri and German Rigau. 2016. Robust multilingual Named Entity Recognition with shallow semi-supervised features. Artificial Intelligence 238:63–82. https://doi.org/10.1016/j.artint.2016.05.003.

Óscar Alcón and Elena Lloret. 2015. Estudio de la influencia de incorporar conocimiento léxico-semántico a la técnica de Análisis de Componentes Principales para la generación de resúmenes multilingües [Studying the influence of adding lexical-semantic knowledge to the Principal Component Analysis technique for multilingual summarization]. Linguamática 7(1):43–53.

Leo Breiman. 2001. Random Forests. Machine Learning 45(1):5–32. https://doi.org/10.1023/A:1010933404324.

Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning. https://doi.org/10.3115/1119176.1119197.

Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003. A simple named entity extractor using AdaBoost. In Proceedings of the 7th Conference on Natural Language Learning. https://doi.org/10.3115/1119176.1119197.

Hsinhsi Chen, Yungwei Ding, and Shihchung Tsai. 1998. Named Entity Extraction for Information Retrieval. Computer Processing of Oriental Languages, volume 11.

Xiaowen Ding, Bing Liu, and Lei Zhang. 2009. Entity Discovery and Assignment for Opinion Mining Applications. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1125–1134. https://doi.org/10.1145/1557019.1557141.

Ronen Feldman and James Sanger. 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York. https://doi.org/10.1017/CBO9780511546914.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named Entity Recognition through Classifier Combination. In Proceedings of the 7th Conference on Natural Language Learning. https://doi.org/10.3115/1119176.1119201.

Maria Fuentes and Horacio Rodríguez. 2002. Using cohesive properties of text for automatic summarization. In Actas de las Jornadas de Tratamiento y Recuperación de la Información (JOTRI 2002).

Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named Entity Recognition in Query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, Massachusetts, USA, pages 267–274. https://doi.org/10.1145/1571941.1571989.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations Newsletter 11(1):10–18. https://doi.org/10.1145/1656274.1656278.

Nitin Indurkhya. 2014. Natural Language Processing. In Teofilo Gonzalez, Jorge Díaz-Herrera, and Allen Tucker, editors, Computing Handbook, Third Edition: Computer Science and Software Engineering, CRC Press, chapter 40, pages 40:1–17.

Wei Jin, Hung Hay Ho, and Rohini K. Srihari. 2009. OpinionMiner: A Novel Machine Learning System for Web Opinion Mining and Extraction. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, pages 1195–1204. https://doi.org/10.1145/1557019.1557148.

Michal Konkol, Tomáš Brychcín, and Miloslav Konopík. 2015. Latent semantics in Named Entity Recognition. Expert Systems with Applications 42(7):3470–3479. https://doi.org/10.1016/j.eswa.2014.12.015.

Changki Lee, Yi-Gyu Hwang, and Myung-Gil Jang. 2007. Fine-Grained Named Entity Recognition and Relation Extraction for Question Answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pages 799–800. https://doi.org/10.1145/1277741.1277915.

Chung-Hee Lee, Y. G. Hwang, H. J. Oh, Soojong Lim, Jeong Heo, C. H. Lee, H. J. Kim, J. H. Wang, and M. G. Jang. 2006. Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering. In Information Retrieval Technology, Proceedings, Springer, Berlin, Heidelberg, volume 4182, pages 581–587. https://doi.org/10.1007/11880592_49.

Lucelene Lopes and Renata Vieira. 2015. Building and Applying Profiles Through Term Extraction. In Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology, Sociedade Brasileira de Computação, pages 91–100. http://aclweb.org/anthology/W15-5613.

Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and Juan Miguel Gómez-Berbís. 2013. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards and Interfaces 35(5):482–489. https://doi.org/10.1016/j.csi.2012.09.004.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26. https://doi.org/10.1075/li.30.1.03nad.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards Wider Multilinguality. In Proceedings of the Language Resources and Evaluation Conference. ELRA, Istanbul, Turkey.

Fernando S. Peregrino, David Tomás, and Fernando Llopis Pascual. 2012. Question Answering and Multi-search Engines in Geo-Temporal Information Retrieval. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 13th International Conference, CICLing 2012, New Delhi, India, March 11-17, 2012, Proceedings, Part II, Springer, Berlin, Heidelberg, pages 342–352. https://doi.org/10.1007/978-3-642-28601-8_29.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the 7th Conference on Natural Language Learning, pages 142–147. https://doi.org/10.3115/1119176.1119195.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT Workshop, pages 47–50. https://doi.org/10.1007/978-94-017-2390-9_2.

Gary F. Simons and Charles D. Fennig, editors. 2017. Ethnologue: Languages of the World. SIL International, Dallas, Texas, twentieth edition. http://www.ethnologue.com.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task. In Proceedings of the 6th Conference on Natural Language Learning. https://doi.org/10.3115/1118853.1118877.

Marta Vicente and Elena Lloret. 2016. Exploring Flexibility in Natural Language Generation throughout D