ATLAS – The Multilingual Language Processing Platform

8
ATLAS – The Multilingual Language Processing Platform * ATLAS – La Plataforma de Procesamiento del Lenguaje Multiling¨ ue Maciej Ogrodniczuk Institute of Computer Science Polish Academy of Sciences Warsaw, Poland [email protected] Diman Karagiozov Tetracom Interactive Solutions Ltd. Sofia, Bulgaria [email protected] Resumen: En este trabajo se presenta la plataforma ATLAS – marco multil- ing¨ ue de procesamiento del lenguaje que integra el conjunto com´ un de herramientas ling¨ ısticas para un grupo de lenguas europeas (con menos recursos: b´ ulgaro, croata, griego, polaco y rumano, junto con ingl´ es y alem´an como lenguas de referencia). La m´as avanzada funcionalidad PNL que ofrece la plataforma permite la anotaci´on de textos multiling¨ ues en los niveles inferiores (segmentaci´on, morfosintaxis) y a su vez soporta el procesamiento de m´as alto nivel como la categorizaci´on autom´atica, extracci´on de informaci´on, la traducci´on autom´atica o de resumen. M´ etodos de an- otaci´on m´as elaborados como la extracci´on de la entidad nombrada o lematizaci´on unitaria de varias palabras tambi´ en est´an disponibles. La anotaci´on multinivel de los textos se rige por las cadenas de procesamiento de lenguaje construidas con el est´andar de la industria UIMA. Para demostrar las capacidades del marco, se han construido en la parte superior del mismo tres servicios informados ling¨ ısticamente: ”i-Publisher” (plataforma de gesti´on de contenidos basada en la Web), ”i-Librarian” (una biblioteca digital de tra- bajos cient´ ıficos) y “EUDocLib” (p´agina para la navegaci´on y la b´ usqueda a trav´ es de documentos de EUR-LEX). Palabras clave: herramientas ling¨ ısticas, recursos ling¨ ısticos, servicios Web, sis- tema de gesti´on de contenidos, servicios en l´ ınea, UIMA Abstract: This paper presents the ATLAS platform – multilingual language pro- cessing framework integrating the common set of linguistic tools for a group of European languages (less-resourced: Bulgarian, Croatian, Greek, Polish and Ro- manian together with English and German as reference languages). State-of-the-art NLP functionality offered by the platform allows for multilingual annotation of texts on lower levels (segmentation, morphosyntax) which in turn supports higher-level processing such as automated categorization, information extraction, machine trans- lation or summarization. More elaborate annotation methods such as named entity extraction or multiword unit lemmatization are also available. Multilevel annotation of texts is governed by language processing chains constructed with UIMA (Unstruc- tured Information Management Application ) industry standard. To demonstrate capabilities of the framework, three linguistically-aware online ser- vices have been built on top of it: i-Publisher (Web-based content management platform), i-Librarian (a digital library of scientific works) and EUDocLib (site for browsing and searching through EUR-LEX documents). Keywords: linguistic tools, language resources, Web services, content management system, online services, UIMA * The work reported here was carried out within the Applied Technology for Language-Aided CMS project co-funded by the European Commission under the In- formation and Communications Technologies (ICT) Policy Support Programme (Grant Agreement No 250467). The authors would like to thank all repre- sentatives of project partners for their contribution. Procesamiento del Lenguaje Natural, Revista nº 47 septiembre de 2011, pp 241-248 recibido 30-04-2011 aceptado 23-05-2011 ISSN 1135-5948 © 2011 Sociedad Española Para el Procesamiento del Lenguaje Natural

Transcript of ATLAS – The Multilingual Language Processing Platform

Page 1: ATLAS – The Multilingual Language Processing Platform

ATLAS – The Multilingual Language Processing Platform∗

ATLAS – La Plataforma de Procesamiento del Lenguaje Multilingue

Maciej OgrodniczukInstitute of Computer SciencePolish Academy of Sciences

Warsaw, [email protected]

Diman KaragiozovTetracom Interactive Solutions Ltd.

Sofia, [email protected]

Resumen: En este trabajo se presenta la plataforma ATLAS – marco multil-ingue de procesamiento del lenguaje que integra el conjunto comun de herramientaslinguısticas para un grupo de lenguas europeas (con menos recursos: bulgaro, croata,griego, polaco y rumano, junto con ingles y aleman como lenguas de referencia). Lamas avanzada funcionalidad PNL que ofrece la plataforma permite la anotacion detextos multilingues en los niveles inferiores (segmentacion, morfosintaxis) y a suvez soporta el procesamiento de mas alto nivel como la categorizacion automatica,extraccion de informacion, la traduccion automatica o de resumen. Metodos de an-otacion mas elaborados como la extraccion de la entidad nombrada o lematizacionunitaria de varias palabras tambien estan disponibles. La anotacion multinivel delos textos se rige por las cadenas de procesamiento de lenguaje construidas con elestandar de la industria UIMA.Para demostrar las capacidades del marco, se han construido en la parte superiordel mismo tres servicios informados linguısticamente: ”i-Publisher” (plataforma degestion de contenidos basada en la Web), ”i-Librarian” (una biblioteca digital de tra-bajos cientıficos) y “EUDocLib” (pagina para la navegacion y la busqueda a travesde documentos de EUR-LEX).Palabras clave: herramientas linguısticas, recursos linguısticos, servicios Web, sis-tema de gestion de contenidos, servicios en lınea, UIMA

Abstract: This paper presents the ATLAS platform – multilingual language pro-cessing framework integrating the common set of linguistic tools for a group ofEuropean languages (less-resourced: Bulgarian, Croatian, Greek, Polish and Ro-manian together with English and German as reference languages). State-of-the-artNLP functionality offered by the platform allows for multilingual annotation of textson lower levels (segmentation, morphosyntax) which in turn supports higher-levelprocessing such as automated categorization, information extraction, machine trans-lation or summarization. More elaborate annotation methods such as named entityextraction or multiword unit lemmatization are also available. Multilevel annotationof texts is governed by language processing chains constructed with UIMA (Unstruc-tured Information Management Application) industry standard.To demonstrate capabilities of the framework, three linguistically-aware online ser-vices have been built on top of it: i-Publisher (Web-based content managementplatform), i-Librarian (a digital library of scientific works) and EUDocLib (site forbrowsing and searching through EUR-LEX documents).Keywords: linguistic tools, language resources, Web services, content managementsystem, online services, UIMA

∗ The work reported here was carried out within theApplied Technology for Language-Aided CMS projectco-funded by the European Commission under the In-formation and Communications Technologies (ICT)Policy Support Programme (Grant Agreement No250467). The authors would like to thank all repre-sentatives of project partners for their contribution.

Procesamiento del Lenguaje Natural, Revista nº 47 septiembre de 2011, pp 241-248 recibido 30-04-2011 aceptado 23-05-2011

ISSN 1135-5948 © 2011 Sociedad Española Para el Procesamiento del Lenguaje Natural

Page 2: ATLAS – The Multilingual Language Processing Platform

1 IntroductionThe need for a common multilingual NLPframework, offering a coherent set of lan-guage tools for a number of European lan-guages is getting more and more urgent inthe integrating Europe. Recently, a num-ber of pan-European initiatives fostering suchdevelopment appeared, but they seem to beconcentrating on gathering and standardizingresources rather than their practical interop-erability. ATLAS project (Applied Technol-ogy for Language-Aided CMS) which startedin 2010 responds to this need by offering aset of state-of-the-art NLP tools for severalEuropean languages wrapped in a commonlanguage processing framework which can beused and reused in various higher-level appli-cations.

The most obvious of them is language-powered document management, made easywith seamless integration of NLP tools, hot-swap/hot add-on of LPC engines, scala-bility of NLP infrastructure, content- andcontext-based navigation and true multi-liguality: language-independent informationextraction, document categorization andclustering (similar documents in foreign lan-guages), machine translation of abstracts(automated summarization).

This article presents the capabilities of theplatform, its architecture, components andformats together with sample interfaces torun the integrated linguistic machinery andperformance results.

2 Language ProcessingArchitecture

The architecture of the platform allows forasynchronous, queue-based processing of re-quests (see figure 1). When content process-ing is initiated (e.g. when a user saves a con-tent item in the user interface) the item isstored in the master NLP queue which is aprimary method of interaction between theintegrated subsystems: subsequent process-ing engines are using the content of the queuea processing source and return its result tothe queue.

All texts are first pre-processed withMIME type detectors, text converters (fromMS Office, Open Office, PDF, HTML, RTF,ePub etc. to plain text), language detectorsand cleaners – and finally sent back to thequeue. The core processing is governed bylanguage processing chains (see section 3.1)

Figure 1: Linguistic processing in ATLAS

which annotate text with properties relevantfor further processing – most often tokens,paragraph and sentence boundaries, namedentities and nominal phrases. As the laststep, post-processing engines are producinghigher-level annotation (summaries, classifi-cation).

2.1 The UIMA FrameworkUIMA (Unstructured Information Manage-ment Application, see http://uima.apache.org/) is a pluggable component architectureand software framework designed especiallyfor the analysis of unstructured content andits transformation into structured informa-tion. Currently UIMA is the only industrystandard (OASIS standard) for content ana-lytics.

Apart from offering common components(e.g. the type system for document and textannotations) UIMA builds on the concept ofanalysis engines (in our case, language spe-cific components) taking form of primitiveengines which can wrap up NLP (natural lan-guage processing) tools adding annotationsand aggregate engines which define the se-quence of execution of chained primitives.

2.2 The Annotation ModelMaking the tools chainable requires ensur-ing their interoperability on various levels.Firstly, compatibility of formats of linguisticinformation is maintained within the definedscope of required annotation.

The UIMA type system requires devel-opment of a uniform representation model

Maciej Ogrodniczuk y Diman Karagiozov

242

Page 3: ATLAS – The Multilingual Language Processing Platform

Table 1: Summary of text annotations

Annotationtype

Parameters1

Paragraph –Sentence –

Token

POS tag, MSD,lemma, gender,number, case, wordsense

Noun Phrase head, base form

Named Entity

type (one of: Date,Location, Money,Organization,Percentage, Person,Time), normalizedvalue

Markable type, reference

which helps to normalize heterogeneous an-notations of the component NLP tools. WithATLAS it covers properties vital for fur-ther processing of the annotated data, e.g.lemma, values for attributes such as gender,number and case for tokens necessary to runcoreference module to be subsequently usedfor text summarisation, categorization andmachine translation (see table 1; optional pa-rameters are shown in italics).

To facilitate introduction of further levelsof annotation a general markable type hasbeen introduced, carrying type and an op-tional reference to another markable object(which allows both for marking text frag-ments and creating mention chains). Thisway new annotation concepts can be testedand later included into the core model.

3 Language-related Functionality

The annotations are used as input of thehigher-level functionality of the platform. In-dividual functions are available as compo-nents which can be wrapped up in interfaces(see section 4).

3.1 Language ProcessingLanguage processing chains (LPCs, see fig-ure 2) provide the core text annotation. Aprocessing chain for a given language includesa number of existing tools, adjusted and/orfine-tuned to ensure their interoperability. Inmost respects a language processing chain

1Extending the standard set of parameters: beginoffset, end offset and text value.

Figure 2: Language Processing Chain

does not require development of new softwaremodules but rather combining existing tools.

The minimal set of annotation tools avail-able across all integrated languages includes:

• tokeniser,

• sentence boundary detector,

• paragraph boundary detector,

• lemmatizer,

• POS tagger,

• NP (noun phrase) chunker,

• NE (named entity) extractor

(when quality of the tool is questionablefor a given language, tools from the basic setare improved by project partners).

Currently, in the first phase of the project,the set of integrated tools (see table 2) islimited to English and is heavily OpenNLP-related. Tools for other project languages arecurrently being integrated and will be avail-able in the next released of the platform.

Table 3 presents the current average per-formance of the linguistic chain.

3.2 Information RetrievalThe retrieval component can automaticallycompiles a summary of each document, fea-turing all relevant information – contextu-ally important words and phrases, capital-ized phrases, URLs, similar documents, ex-tractive summary etc. With these extractsusers can create personal keywords for eachfile and categorize their collections accordingto their personal liking.

ATLAS – Multilingual Language Processing Platform

243

Page 4: ATLAS – The Multilingual Language Processing Platform

Table 2: NLP tools in ATLAS

Tool typeTool name /Source

Paragraphsplitter

Regexp-basedsolution by Tetracom

Sentence splitter OpenNLPTokenizer OpenNLPLemmatizer RASPPOS tagger OpenNLPWSD LESK-based2

NP extractor14 English-specificrules by Tetracom

NE extractor OpenNLP

SummarizerLexRank3 and OpenText Summarizer4

Categorization

MULAN5 with ak-NearestNeighbor-basedalgorithm

Table 3: Current performance of EnglishLPC

Tool type s/doc% of totalprocessing

timeParagraphsplitter

0,01 0,2%

Sentencesplitter

0,13 3,7%

Tokenizer 0,17 4,9%Lemmatizer 0,10 3,0%POS tagger 0,13 3,7%WSD 1,67 49,4%NP extractor 0,07 2,1%NE extractor 1,11 32,9%

3,37 100,0%

3.3 Automatic CategorizationUnlike most other repository approaches, nomanual document categorization is necessary– after upload, the categorization componentcan automatically catalogue documents us-ing a comprehensively trained model. Thecomponent can also make suggestions about

2See (Banerjee, 2002). The initial tool available inPerl was rewritten in C++ to improve performance(30 times).

3See (Erkan and Radev, 2004).4See http://libots.sourceforge.net/.5See (Tsoumakas et al., 2000; Tsoumakas,

Katakis, and Vlahavas, 2010) and http://mulan.sourceforge.net/.

other topics that are relevant to the docu-ment, therefore reducing the time spent oncategorization to a minimum.

3.4 Full-text Search and SimilaritySearch

The documents are indexed by a powerfulhigh-speed full-text search engine, based onLucene. Using a simple, Google-like searchform, users can quickly find words or phrasesin all their documents. The search resultsprovide up to three excerpts from the text,best matching the search terms. Addition-ally, similarity search is available basing onextracted essence of the documents.

3.5 Machine TranslationCurrently integrated machine translation ap-proach is based on the Serverland API(REST or XMLRPC) by DFKI – a middle-ware that provides a common interface toseveral machine translation services such asGoogle, Bing, Lucy, and Moses. The com-ponent is used to translate document sum-maries, noun phrases and the “long text”metadata fields (such as content item descrip-tions).

3.6 Text SummarizationSummarization tools will be prepared byadjusting and fine-tuning existing softwarecomponents for the project target languagesbasing on a discourse-parsing based method.

After the text is segmented into ele-mentary discourse units (mainly clauses),a discourse tree is composed for each sen-tence based on cue-phrases recognized by theparser. The sequence of sentence trees is ar-ranged into discourse tree by maximizing ascore contributed from centering transitionsand anaphoric links. The discourse tree isin turned used for computing the summaries,both general and focused, with the support ofspecialized resources and tools such as a col-lection of discourse markers and (optionally)an anaphora resolver.

4 InterfacesFor demonstration purposes three interfaceshas been developed to illustrate how thelinguistic building blocks can be intercon-nected into a useful application (Belogay etal., 2011). All three operate in a multilin-gual setting. Although in the first year of theproject the implemented functionality is of-fered only for English (a reference language),

Maciej Ogrodniczuk y Diman Karagiozov

244

Page 5: ATLAS – The Multilingual Language Processing Platform

Figure 3: i-Publisher architecture

it will be soon available for other project lan-guages – Bulgarian, Croatian, German, Ger-man, Greek, Polish and Romanian; hopefullyfor more due to the flexible architecture of thesystem.

The first of the currently available inter-faces is i-Publisher – the online Web Con-tent Management System integrating thelanguage-based technology to make its pro-cessing output available on the managed Websites.

Two other are thematic content-drivenWeb sites, i-Librarian and EUDocLib, builton top of ATLAS platform, using i-Publisheras content management layer. i-Librarianis intended to be a user-oriented personalworkspace for storing, sharing and publish-ing various types of documents automaticallyassigned to appropriate subject categories,summarized and annotated with importantwords, phrases and names. EUDocLib is apublicly accessible repository of EU legal doc-uments from the EUR-LEX collection withenhanced navigation and multilingual access.

4.1 i-Publisher: Web ContentManagement System

The i-Publisher service (see fig-ures 3 and 4) available at http://www.i-publisher.atlasproject.eu(restricted access) offers a flexible WebCMSconfiguration interface with dynamic datamodel, content versioning, filtering andmulti-level grouping, Web site layout anddesign editor, granular user access rights andlibrary of predefined themes. The easy point-and-click graphical Multi-Document userinterface supports drag-and-drop actions.

Linguistic features are available through

Figure 4: i-Publisher interface: welcomescreen and working area

widgets placed on Web pages; currently awide set of predefined functionalities for au-tomated processing of textual content is inte-grated, with categorization, summarization,identification of named entities and nounphrases etc.

4.2 i-Librarian: A PersonalLibrary

The i-Librarian service (see figures 5 and 6)available at http://www.i-librarian.eu(restricted access) is a demonstration of AT-LAS NLP capabilities in the form of a digi-tal library Web site created using i-Publisher.It can be used e.g. as a repository of scien-tific papers, therefore addressing the needsof authors, students and researchers by giv-ing them the ability to easily create, organizeand publish various types of documents andthen searching for similar documents in dif-ferent languages, sharing personal works withother people, and locating the most essentialtexts from large collections of unfamiliar doc-uments.

The library offers the public and pri-vate workspace and employs language tech-nology to extract important phrases and

ATLAS – Multilingual Language Processing Platform

245

Page 6: ATLAS – The Multilingual Language Processing Platform

Figure 5: i-Librarian architecture

named entities from indexed documents; sim-ilar items are then displayed on demand, ab-stract translated and document summariesproduced.

To evaluate performance and scalability ofthe site even before users started uploadingtheir papers, 4.4 K documents (165 M to-kens) from Project Gutenberg have been up-loaded.

Figure 6: i-Librarian interface: library viewand item view

4.3 EUDocLib: Electronic Libraryof Legal EU Documents

Another demonstration of ATLAS function-ality is the EUDocLib Web site (freely avail-able at http://eudoclib.atlasproject.eu, see figure 7), also created with i-

Publisher. The library offers easy accessto EU documents in all project languageswith automatic categorization (by assign-ing documents to Eurovoc, the EU’s multi-lingual thesaurus), extraction of importantphrases, named entities, similar items andcross-lingual information retrieval.

Figure 7: EUDocLib interface: search re-sults, browse perspective and individual item

Currently the site covers 140 K documents(182 M tokens).

5 Further Steps

The first obvious direction to be followed isinternationalization of the platform and im-provement of quality offered by currently in-tegrated tools. The generic OpenNLP tools

Maciej Ogrodniczuk y Diman Karagiozov

246

Page 7: ATLAS – The Multilingual Language Processing Platform

will be supplemented with their local coun-terparts and statistical models are planned tobe combined with rule-based approach (forexample, combining the OpenNLP namedentity extractor with a rule-based named en-tity recognition tool focusing mainly of ex-ceptions of the general rules, covered byOpenNLP).

Another important general further stepwill be improvement of the overall per-formance of the chained tools in ATLAS,mainly post-processing engines and datastores. However, the most innovative re-search areas to be covered are related to threesubjects: machine translation, summariza-tion and categorization.

5.1 Machine TranslationExisting translation approach is plannedto be amended with a system to processuser-submitted translations. Concept-basedcross-lingual search engine together with itsunderlying ontology, both developed duringthe LT4eL6 project (Degorski, Marcinczuk,and Przepiorkowski, 2008; Monachesi et al.,2006), will be adapted and further improved.Basing on categories and keywords a concep-tual space will be constructed and integratedinto a conceptual search mechanism (Vertanet al., 2007).

5.2 SummarizationSummarization tools are planned to be ad-justed and fine-tuned basing on a discourseparser developed by Iasi University. Afterthe text is segmented into elementary dis-course units (mainly clauses), a discoursetree is composed for each sentence basedon cue-phrases recognized by the parser.The sequence of sentence trees is arrangedinto discourse tree by maximizing a scorecontributed from centering transitions andanaphoric links. The discourse tree is inturned used for computing the summaries,both general and focused, with the supportof specialized resources and tools such as acollection of discourse markers and (option-ally) an anaphora resolver.

5.3 CategorizationA language-independent text categorizationtool fine-tuned to work with each project lan-

6Language Technology for eLearning FP6 Spe-cific Targeted Research Project (Information SocietyTechnologies), contract number 027391. See http://www.lt4el.eu/.

guage will be prepared and tuned to hetero-geneous domains. Its ultimate goal will beeffective organization of content in the onlineservices by using vector space models (VSMs)with lexical distribution patterns or alterna-tive features selected from the documents.Initial VSMs will be generated on the basis oflexical distribution in documents, using var-ious lexical windows, of 1 to n N-grams, aswell as normalization methods for matrix re-duction (i.e. by elimination of specific lexi-cal classes of elements, or elimination of lex-ical covariation etc.). Finally, classificationof documents will be performed by apply-ing similarity measures over the vector spacemodels of classes and particular documents.

Apart from the already mentioned ones,Support Vector Machines, Latent SemanticAnalysis, Naıve Bayes, MaxEnt, Centroidand advanced feature space reduction algo-rithms will be used for classification.

6 Conclusions

The ATLAS platform opens the door to stan-dardized multilingual online processing oflanguage and it offers localized demonstra-tion tools built on top of the linguistic mod-ules. We intend it to be a contribution tothe development of text processing chains forthe Web, especially for underrepresented lan-guages.

The framework is ready for integration ofnew types of tools and new languages to pro-vide wider online coverage of the needful lin-guistic services in a standardized manner.

New versions of the online services areplanned to be launched in the beginning of2012.

References

Banerjee, Satanjeev. 2002. Adapting theLesk algorithm for word sense disambigua-tion to WordNet. University of Min-nesota, Duluth. Master’s thesis.

Belogay, Anelia, Damir Cavar, Dan Cristea,Diman Karagiozov, Svetla Koeva,Roumen Nikolov, Maciej Ogrodniczuk,Adam Przepiorkowski, Polivios Raxis,and Cristina Vertan. 2011. i-Publisher,i-Librarian and EUDocLib – linguisticservices for the Web. To appear in theproceedings of PALC 2011.

Degorski, #Lukasz, Micha#l Marcinczuk, andAdam Przepiorkowski. 2008. Defini-

ATLAS – Multilingual Language Processing Platform

247

Page 8: ATLAS – The Multilingual Language Processing Platform

tion extraction using a sequential com-bination of baseline grammars and ma-chine learning classifiers. In Proceed-ings of the Sixth International Conferenceon Language Resources and Evaluation,LREC2008, Marrakech. ELRA.

Erkan, Gunes and Dragomir R. Radev.2004. LexRank: Graph-based Centralityas Salience in Text Summarization. Jour-nal of Artificial Intelligence Research 22.

Monachesi, Paola, Dan Cristea, Diane Evans,Alex Killing, Lothar Lemnitzer, KirilSimov, and Cristina Vertan. 2006. In-tegrating Language Technology and Se-mantic Web techniques in eLearning. InProceedings of The International Interac-tive Computer-Aided Learning Conference(ICL 2006), Villach, Austria, September.

Tsoumakas, Grigorios, Ioannis Katakis, andIoannis Vlahavas. 2010. Mining Multi-label Data. In O. Maimon and L. RokachData Mining and Knowledge DiscoveryHandbook, Springer, 2nd edition.

Tsoumakas, Grigorios, Jozef Vilcek, Elefthe-rios Spyromitros, and Ioannis Vlahavas.2000. Mulan: A Java Library for Multi-Label Learning. Journal of MachineLearning Research 1.

Vertan, Cristina, Paola Monachesi, KirilSimov, Petya Osenova, Lothar Lemnitzer,Alex Killing, and Diane Evans. 2007.Crosslingual retrieval in an eLearningenvironment. In Roberto Basili andMaria Teresa Pazienza, editors, Proceed-ings of The 10th Congress of the Ital-ian Association for Artificial Intelligence(AIIA 2007), pages 839–847, Berlin, Hei-delberg. Springer-Verlag.

Maciej Ogrodniczuk y Diman Karagiozov

248