Second year published papers - UPC

Second year published papers

Document Number: Working Paper 9.3
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: German Rigau (UPV/EHU)

INFORMATION SOCIETY TECHNOLOGIES

WP9-Working Paper 9.3 Version: FINAL
Second year published papers

Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: March 2003
Actual date of delivery: March 17, 2004
Document Number: Working Paper 9.3
Type: Report
Status & version: v FINAL
Number of pages: 24
WP contributing to the deliverable: WP9
WP/Task responsible: German Rigau (UPV/EHU)
Authors: German Rigau (UPV/EHU)
Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords:
Abstract: This document provides a brief summary of the published work resulting from the second year of Meaning.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies

Contents

1 Executive Summary
1.1 Conferences
1.2 Workshops
1.3 Journals

2 Papers related to WP2: Methodology and Design

3 Papers related to WP3: Linguistic Processors and Infrastructure

4 Papers related to WP4: (Knowledge) Acquisition

5 Papers related to WP5: Acquisition

6 Papers related to WP6: WSD

7 Papers related to WP7: Evaluation and Assessment

8 Papers related to WP8: User Validation

9 Papers related to WP9: Exploitation and Dissemination

1 Executive Summary

This document provides a brief summary of the published work resulting from the second year of Meaning. Last year, the consortium published 51 papers: 32 papers in International Conference Proceedings, 13 papers in International Workshops and 6 papers in international journals. These papers cover different Meaning working parts, from WP2 to WP9.

Next, we provide the complete list of Conferences and Workshops attended by the Meaning partners. Journals and books are also included here.

The remaining sections of this Working Paper provide, one per Work Part, a detailed list of the published papers. Each paper is accompanied by a brief summary.

1.1 Conferences

• (8 papers) 19th Congreso Anual de la Sociedad Española de Procesamiento del Lenguaje Natural (SEPLN’2003)

• (7 papers) Second International Global WordNet Conference (GWC’2004)

• (6 papers) International Conference on Recent Advances in Natural Language Processing (RANLP’2003)

• (2 papers) 7th International Conference on Natural Language Learning (CoNLL’2003)

• (2 papers) Corpus Linguistics

• (1 paper) 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’2003)

• (1 paper) 17th Annual Conference on Neural Information Processing Systems (NIPS’2003)

• (1 paper) International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’2004)

• (1 paper) International Conference on Text Speech and Dialogue (TSD’2003)

• (1 paper) 18th ACM Symposium on Applied Computing, Special Track on Information Access and Retrieval Systems (SAC’2003)

• (1 paper) Advances in Artificial Intelligence, 8th Congress of the Italian Association for Artificial Intelligence (AI*IA’2003)

• (1 paper) 12th Text Retrieval Conference (TREC’2003)

1.2 Workshops

• (2 papers) TALN’03 workshop on NLP of Minority Languages and Small Languages

• (1 paper) RANLP’2003 Workshop on Information Extraction for Slavonic and Other Central and Eastern European Languages (IESL’03)

• (1 paper) ACL’2003 Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models (MulNER’03)

• (1 paper) EACL’2003 Workshop on Morphological Processing of Slavic Languages

• (1 paper) EACL’2003 workshop on Finite-State Methods in Natural Language Processing

• (1 paper) ESSLLI’2003 Workshop on The Meaning and Implementation of Discourse Particles in 15th European Summer School in Logic, Language and Information

• (1 paper) II Jornadas de Tratamiento y Recuperación de Información (JOTRI’2003)

• (1 paper) IJCNLP’2004 workshop on Named Entity Recognition

• (1 paper) Workshop on Treebanks and Linguistic Theories (TLT’2003)

• (1 paper) ACL-SIGSEM workshop on Linguistic dimensions of prepositions and their use in computational linguistics formalisms and applications

• (1 paper) ACL’2003 workshop on Multiword Expressions: Analysis, Acquisition and Treatment

• (1 paper) ISWC’2003 workshop on Human Language Technology for the Semantic Web and Web Services

1.3 Journals

• (2 papers) Revista Iberoamericana de Inteligencia Artificial, Special Issue on Multilingual Information Access

• (1 paper) Journal of Natural Language Engineering

• (1 paper) Computational Linguistics

• (1 paper) User Modeling and User Adapted Interaction

• (1 paper) Journal of Computer Speech and Language, special issue on Word Sense Disambiguation

2 Papers related to WP2: Methodology and Design

1. [Rigau et al., 2003] The MEANING project

Progress is being made in Natural Language Processing (NLP) but there is still a long way to go towards Natural Language Understanding. An important step towards this goal is the development of technologies and resources that deal with concepts rather than words. However, to be able to build the next generation of intelligent open domain Human Language Technology (HLT) application systems we need to solve two complementary and intermediate tasks: Word Sense Disambiguation (WSD) and automatic large-scale enrichment of Lexical Knowledge Bases. Meaning proposes an innovative bootstrapping process to deal with the inter-dependency between WSD and knowledge acquisition.

3 Papers related to WP3: Linguistic Processors and Infrastructure

1. [Carreras and Marquez, 2003a] Online Learning via Global Feedback for Phrase Recognition

This paper presents a system to recognize phrases based on perceptrons, and a global online learning algorithm to train them together. The recognition strategy applies learning in two layers: a filtering layer, which reduces the search space by identifying plausible phrase candidates, and a ranking layer, which discriminatively builds the optimal phrase structure. The paper also provides a recognition-based feedback rule which reflects to each local function its committed errors from a global point of view, and allows the functions to be trained together online as perceptrons. Experimentation on a syntactic parsing problem, the recognition of clause hierarchies, gives state-of-the-art results and evinces the advantages of the global training method over optimizing each function locally, as in the traditional approach.

2. [Carreras and Marquez, 2003b] Phrase Recognition by Filtering and Ranking with Perceptrons

This paper presents a phrase recognition system based on perceptrons, and an online learning algorithm to train them together. The recognition strategy applies learning in two layers: first at word level, to filter words and form phrase candidates, and second at phrase level, to rank phrases and select the optimal ones. The paper also provides a global feedback rule which reflects the dependencies among perceptrons and allows them to be trained together online. Experimentation on Partial Parsing problems and Named Entity Extraction gives state-of-the-art results on the CoNLL public datasets. The paper also shows empirical evidence that training the functions together is clearly better than training them separately, as in the conventional approach.

3. [Gimenez and Marquez, 2003] Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited

This paper presents a very simple and effective part-of-speech tagger based on Support Vector Machines (SVM). Simplicity and efficiency are achieved by working with linear separators in the primal formulation of SVM, and by using a greedy left-to-right tagging scheme. By means of a rigorous experimental evaluation, we conclude that the proposed SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it really practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 97.0% on the WSJ corpus, which is comparable to the best taggers reported to date.

4. [Oliver et al., 2003b] Use of Internet for Augmenting Coverage in a Lexical Acquisition System from Raw Corpora

This paper presents a methodology for the automatic acquisition of lexical resources from raw corpora. This methodology has proved to be efficient for those languages that, like Russian, present a rich and mainly concatenative morphology. This method can be applied for the creation of new resources, as well as in the enrichment of existing ones. This paper also presents an extension of the system that uses automatic querying of the Internet to acquire those entries for which there is not enough information in our corpus. The new basic acquisition methodology achieves results similar to those of the previous methods, but the use of Internet queries makes it possible to increase recall levels with only a slight decrease in precision, obtaining significantly better overall results.

5. [Oliver et al., 2003a] Automatic Lexical Acquisition from Raw Corpora: An Application to Russian

This paper presents a methodology for the automatic acquisition of lexical and morpho-syntactic information from raw corpora. The system uses information about the inflectional morphology declared by rules and is based on the co-occurrence of different forms of the same paradigm in the corpus. A direct application of this methodology gives very poor precision rates due to rule interaction between paradigms. We present a rule analysis algorithm that solves this problem, giving considerably better precision rates, although recall decreases dramatically. Finally, this paper investigates some techniques to raise the recall, achieving recall rates around 67% with a precision of 92%.

6. [Alonso et al., 2003d] An Analytic Account of Discourse Markers for Shallow NLP

This paper presents a feature-based approach to the description of discourse markers (DMs) oriented to automated discourse analysis for shallow Natural Language Processing (NLP). DM describing features have been chosen based on previous work, descriptive adequacy and our concrete NLP needs and capacities. An organization of these features in dimensions of DM meaning has been inferred via data-driven techniques, and finally implemented in a computational DM lexicon.

7. [Carreras et al., 2003c] A Simple Named Entity Extractor Using AdaBoost

This paper presents a Named Entity Extraction (NEE) system for the CoNLL-2003 shared task competition. As in the previous year's edition, the authors have approached the task by treating the two main sub-tasks of the problem, recognition (NER) and classification (NEC), sequentially and independently with separate modules. Both modules are machine learning based systems, which make use of binary and multiclass AdaBoost classifiers.

Named Entity recognition is performed as a greedy sequence tagging procedure under the well-known BIO labelling scheme. This tagging process makes use of three binary classifiers trained to be experts on the recognition of B, I, and O labels, respectively. Named Entity classification is viewed as a 4-class classification problem (with LOC, PER, ORG, and MISC class labels), which is straightforwardly addressed by the use of a multiclass learning algorithm.

The system presented here consists of a replication, with some minor changes, of the system that obtained the best results in the CoNLL-2002 NEE task. Therefore, it can be considered as a benchmark of the state-of-the-art technology for the current edition, and will also allow comparisons between the training corpora of both editions.

8. [Carreras et al., 2003a] Learning a Perceptron-Based Named Entity Chunker via Online Recognition Feedback

This paper presents a novel approach for the problem of Named Entity Recognition and Classification (NERC), in the context of the CoNLL-2003 Shared Task. This work is framed within the learning and inference paradigm for recognizing structures in Natural Language. The authors make use of several learned functions which, applied at local contexts, discriminatively select optimal partial structures. On top of this local recognition, an inference layer explores the partial structures and builds the optimal global structure for the problem.

9. [Carreras et al., 2003b] Named Entity Recognition For Catalan Using Spanish Resources

This work studies Named Entity Recognition (NER) for Catalan without making use of annotated resources of this language. The approach presented is based on machine learning techniques and exploits Spanish resources, either by first training models for Spanish and then translating them into Catalan, or by directly training bilingual models. The resulting models are retrained on unlabelled Catalan data using bootstrapping techniques. Exhaustive experimentation has been conducted on real data, showing competitive results for the obtained NER systems.

10. [de Ilarraza et al., 2003] HIZKING21: Integrating language engineering resources and tools into systems with linguistic capabilities

This paper presents the main lines of the HIZKING21 project. Its main objective is to promote basic research in language engineering, orienting this investigation towards the requirements of the globalized environment of the present day. The scope of work is the development of language technologies for the Basque language, as well as the integration of resources and tools for the language industry, both already existing resources and resources to be developed in this project, into different devices (PCs, PDAs, electrical household appliances, car equipment and so on). The goal is to contribute to easy and user-friendly interaction with all kinds of devices, using language as the natural means of communication. The launch of this project was made possible by the advances in Natural Language Processing and Language Engineering for Basque made by the project participants over the last fifteen years, and by their shared vision of the most suitable strategy for developing these technologies for a minority language like Basque.

11. [Alegria et al., 2003] Named Entity Recognition and Classification for texts in Basque

This paper presents a system for Named Entity (NE) recognition in written Basque to be used in a CLIR application. Being an agglutinative language, Basque has highly inflected forms, so prior linguistic preprocessing is required. The tool we present relies on a combined method that carries out the identification and recognition of entity names in two successive steps. First, a grammar based on morphological information is applied in order to extract the entity names from the text, and then the identified entities are classified by applying a heuristic that combines contextual information and gazetteers.

12. [Alegria et al., 2004] Design and Development of a Named Entity Recognizer for an Agglutinative Language

This paper presents the conclusions reached from the development of a system for Named Entity recognition in written Basque. The system was designed in four steps: first, the development of a recognizer based on linguistic information represented in finite-state transducers; second, the generation of semi-automatically annotated corpora from the result of these transducers; third, the achievement of the best possible recognizer by training different ML techniques on these corpora; and finally, the combination of the different recognizers obtained. Since Basque is an agglutinative language, linguistic preprocessing was required before these steps.

13. [Aduriz et al., 2003a] Finite State Applications for Basque

This paper presents the application of finite state technology (FST) to several kinds of linguistic processing of Basque, which can serve as a representative of agglutinative languages and languages with free order of constituents. Three main tools will be described in this context: a morphological analyzer, a morphosyntactic disambiguation tool and a surface syntax processor.

14. [Aduriz et al., 2003b] Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing

This paper focuses on the methodology currently followed for the design of the Basque corpus for NLP applications. First, the process of building the corpus has been carried out in different steps: (1) compilation of texts and their classification; (2) design of the tagset; (3) morphosyntactic tagging of the corpus; (4) manual disambiguation of the corpus; (5) disambiguation process; (6) delimiting the chunks; (7) syntactic tagging following the dependency style (treebank); (8) formalization of all the input text used in the previous steps by means of SGML. Finally, the paper discusses some possible improvements and future research.

15. [Aduriz et al., 2003c] Construction of a Basque Dependency Treebank

This paper presents the process followed to build the Basque Dependency Treebank. We think that it is a necessary resource for linguistic research in general and for the development of real applications in the area of NLP. This work is part of a general project (http://www.dlsi.ua.es/projectes/3lb) whose objective is to build annotated corpora with linguistic annotation at syntactic, semantic and pragmatic levels. We annotate the Eus3LB corpus syntactically following the dependency-based formalism as explained in Aduriz et al. (2002). This formalism is also used in the Prague Dependency Treebank for Czech (Hajic, 1998) and in Oflazer et al. (1999), and is the one that could best deal with the free word order (Skut et al., 1997) displayed by Basque syntax.

16. [Aduriz et al., 2004] A Cascaded Syntactic Analyser for Basque

This article presents a robust syntactic analyser for Basque and the different modules it contains. Each module is structured in different analysis layers, in which each layer takes the information provided by the previous layer as its input, thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation process of the parsing formats using XML.

17. [Negri and Magnini, 2004] Using WordNet Predicates for Multilingual Named Entity Recognition

Wordnet predicates (WN-preds) establish relations between words in a certain language and concepts of a language independent ontology. In this paper we show how WN-preds can be profitably used in the context of multilingual tasks where two or more wordnets are aligned. Specifically, we report on the extension to Italian of a previously developed Named Entity Recognition (NER) system for written English. Experimental results demonstrate the validity of the approach and confirm the suitability of WN-preds for a number of different NLP tasks.

18. [Bentivogli et al., 2003] The MEANING Italian Corpus

The MEANING Italian Corpus (MIC) is a large corpus of written contemporary Italian, which is being created at ITC-irst, in the framework of the EU-funded MEANING project. Its novelty consists in the fact that domain-representativeness has been chosen as the fundamental criterion for the selection of the texts to be included in the corpus. A core set of 42 basic domains, broadly representative of all the branches of knowledge, has been chosen to be represented in the corpus. The MEANING Italian corpus will be encoded using XML and taking into account, whenever possible according to the requirements of our NLP applications, the XML version of the Corpus Encoding Standard (XCES) and the new standard ISO/TC 37/SC 4 for language resources. A multi-level annotation is planned in order to encode seven different kinds of information: orthographic features, the structure of the text, morphosyntactic information, multiwords, syntactic information, named entities, and word senses.
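
The BIO labelling scheme mentioned in the summary of [Carreras et al., 2003c] in this section can be illustrated with a small sketch. Under BIO, each token is labelled B (begins an entity), I (inside an entity) or O (outside any entity), and entity spans are read off the label sequence. The function name `bio_to_spans`, the lenient treatment of a stray I label and the example sentence are assumptions made for illustration only; the paper itself produces the labels with learned AdaBoost classifiers, which are not reproduced here.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Turn a sequence of B/I/O labels into (start, end_exclusive) spans."""
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new entity begins; close the previous one if still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Assumption: an I with no open entity is treated as a start.
            if start is None:
                start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        else:
            raise ValueError(f"unexpected label: {tag!r}")
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical example sentence and labels (not taken from any corpus).
tokens = ["Maria", "visited", "New", "York", "yesterday"]
labels = ["B", "O", "B", "I", "O"]
entities = [" ".join(tokens[s:e]) for s, e in bio_to_spans(labels)]
```

In a two-stage design like the one summarised above, each recovered span would then be passed to a separate classifier that assigns one of the LOC, PER, ORG or MISC classes.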

4 Papers related to WP4: (Knowledge) Acquisition

1. [Daude et al., 2003a] Making Wordnet Mappings Robust

Building appropriate resources for broad-coverage semantic processing is a hard and expensive task, involving large research groups during long periods of development. The outcomes of these projects are, usually, large and complex semantic structures, not compatible with resources developed in previous projects and efforts. To maintain compatibility between wordnets of different languages and versions, past and new, it is essential to have a highly accurate tool. This paper presents an accurate, quantitative and qualitative validation of the methodology used by [Daude et al., 2001] to map two WordNet versions. The accuracy of the technique is checked by applying it to map a WordNet version onto itself, which enables not only quantitative evaluation but also a qualitative study of the error cases and algorithm tuning.

2. [Daude et al., 2003b] Validation and Tuning of Wordnet Mapping Techniques

This paper presents an accurate, quantitative and qualitative validation of the methodology used by [Daude et al., 2001] to map two WordNet versions. The accuracy of the technique is evaluated by applying it to map a WordNet version onto itself, which enables not only quantitative evaluation but also a qualitative study of the error cases and algorithm tuning. In addition, we also evaluate the behaviour of the technique when mapping non-identical hierarchies by randomly erasing synsets from either the target or the source copy of the used WordNet.

3. [Atserias et al., 2003c] Starting up the Multilingual Central Repository

This paper presents the first version of the Multilingual Central Repository (Mcr) of Meaning. The Mcr acts as a multilingual interface for integrating and distributing all the knowledge acquired by Meaning for five European languages (Basque, Catalan, English, Italian and Spanish).

4. [Atserias et al., 2003b] Integrating and porting Knowledge across Languages

This paper describes the first version of the Multilingual Central Repository. Currently, Mcr integrates into the same EuroWordNet framework five local wordnets (including three versions of the English WordNet from Princeton), the EuroWordNet Top Ontology, MultiWordNet Domains, and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. In fact, the resulting Mcr is set to become the largest and richest multilingual lexical knowledge base ever built.

5. [Atserias et al., 2004] The MEANING Multilingual Central Repository

This paper describes the first version of the Multilingual Central Repository, a lexical knowledge base developed in the framework of the Meaning project. Currently the Mcr integrates into the EuroWordNet framework five local wordnets (including four versions of the English WordNet from Princeton), an upgraded version of the EuroWordNet Top Concept ontology, the MultiWordNet Domains, and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. We believe that the resulting Mcr will be the largest and richest Multilingual Lexical Knowledge Base in existence.

6. [Bentivogli and Pianta, 2004] Extending WordNet with Syntagmatic Information

This paper presents a proposal to extend WordNet-like lexical databases by adding information about the co-occurrence of word meanings in texts. More specifically, the authors propose to add phrasets, i.e. sets of free combinations of words which are recurrently used to express a concept (let’s call them Recurrent Free Phrases). Phrasets are a useful source of information for different NLP tasks, and particularly in a multilingual environment to manage lexical gaps. At least a part of recurrent free phrases can also be represented through a new set of syntagmatic (lexical and semantic) WordNet relations.
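
The validation idea described in the two Daude et al. summaries of this section (mapping a WordNet version onto itself, then erasing synsets at random from one copy) can be sketched in a few lines. This is only an illustration of the evaluation setup under stated assumptions: the mapping technique itself is not reproduced, the toy dictionary `toy_mapping` stands in for the output of a mapper, and the names `self_mapping_accuracy` and `erase_random` as well as the synset identifiers are hypothetical.

```python
import random
from typing import Dict, Iterable, Optional, Set


def self_mapping_accuracy(mapping: Dict[str, Optional[str]],
                          synsets: Iterable[str]) -> float:
    """Fraction of synsets that the mapping sends to themselves.

    When a resource is mapped onto an identical copy, the correct
    target of every synset is the synset itself, so no manually
    built gold standard is needed.
    """
    synsets = list(synsets)
    if not synsets:
        return 0.0
    correct = sum(1 for s in synsets if mapping.get(s) == s)
    return correct / len(synsets)


def erase_random(synsets: Set[str], fraction: float, seed: int = 0) -> Set[str]:
    """Return a copy with roughly `fraction` of the synsets removed at random."""
    rng = random.Random(seed)
    k = round(len(synsets) * fraction)
    removed = set(rng.sample(sorted(synsets), k))
    return set(synsets) - removed


# Hypothetical synset identifiers and mapper output.
source = {"n1", "n2", "n3", "n4"}
toy_mapping = {"n1": "n1", "n2": "n2", "n3": "n4", "n4": None}

accuracy = self_mapping_accuracy(toy_mapping, source)  # 2 of 4 map to themselves
reduced_target = erase_random(source, 0.5)  # a non-identical copy to map onto
```

The synsets that fail the identity check (here `n3` and `n4`) are the cases one would inspect by hand in the qualitative part of such a study.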

5 Papers related to WP5: Acquisition

1. [Atserias et al., 2003a] Exploring large-scale Acquisition of Multilingual Semantic Models for Predicates

This paper investigates the feasibility of obtaining large-scale semantic patterns for any language based only on shallow parsing and some basic semantic generalizations. Since this is an exploratory experiment, only a qualitative evaluation was performed. This work compares several semantic patterns coming from translation equivalent verbs selected from different languages and domains.

2. [Bentivogli and Pianta, 2003] Beyond Lexical Units: Enriching WordNets with Phrasets

This paper presents a proposal to extend WordNet-like lexical databases by adding phrasets, i.e. sets of free combinations of words which are recurrently used to express a concept (let’s call them recurrent free phrases). Phrasets are a useful source of information for different NLP tasks, and particularly in a multilingual environment to manage lexical gaps. Two experiments are presented to check the possibility of acquiring recurrent free phrases from dictionaries and corpora.

3. [Castillo et al., 2003] Asignacion Automatica de Etiquetas de Dominios en WordNet

This paper describes a process to automatically assign WordNet domain labels to WordNet glosses. One of the main goals of this work is to enrich lexical sources with WordNet information. WordNet Domains are used as the knowledge source. Finally, domain labels for nouns and verbs are suggested and verified.

4. [Castillo et al., 2004] Automatic Assignment of Domain Labels to WordNet

This paper describes a process to automatically assign domain labels to WordNet glosses. One of the main goals of this work is to show different ways to systematically and automatically enrich dictionary definitions (or glosses of new WordNet versions) with MultiWordNet domains. Finally, we show how this technique can be used to verify the consistency of the current version of MultiWordNet Domains.

5. [Lascurain et al., 2003] Disambiguation of case suffixes in Basque

The goal of this paper is to build a tool for automatic classification of grammatical cases in Basque. To achieve this goal two inductive learning techniques are applied, namely the Tilde and Timbl systems. WordNet is also used to locate synsets and hyperonyms of words in a context. For Tilde the method reached accuracy higher than 70% and for Timbl 63%.

6. [Lersundi and Agirre, 2003] Semantic interpretations of postpositions and prepositions: a multilingual inventory for Basque, English and Spanish

This article describes a common inventory of interpretations for postpositions (Basque) and prepositions (English and Spanish). The inventory is a flat list of tags, based mainly on thematic roles. Using the same inventory makes it possible to know, for each postposition or preposition, the translations for each possible interpretation. We think this resource will be useful for studies on machine translation, but also for lexical acquisition experiments on the syntax-semantic interface that make use of multilingual data. The method used to derive the inventory and the list of interpretations for Basque postpositions and Spanish and English prepositions aims to be systematic.

7. [Agirre et al., 2003] A pilot study of English Selectional Preferences and their Cross-Lingual Compatibility with Basque

The specific goals of this experiment are to study automatically acquired Englishselectional preferences from a number of sources, and to assess portability and com-patibility issues with regard to selectional preferences acquired for Basque. This workstudies a wide-range of techniques and issues, with the aim of providing an analysisof the interplay of selectional-learning techniques, domain and multilinguality. Theoverall goal is the acquisition of complex lexical information for verbs (both syntacticand semantic) using multilingual sources.

8. [Agirre and de Lacalle, 2003] Clustering WordNet Word Senses

This paper presents the results of a set of methods to cluster WordNet word senses.The methods rely on different information sources: confusion matrixes from Senseval-2 Word Sense Disambiguation systems, translation similarities, hand-tagged exam-ples of the target word senses and examples obtained automatically from the web forthe target word senses. The clustering results have been evaluated using the coarse-grained word senses provided for the lexical sample in Senseval-2. Cluto, a generalclustering environment, is used in order to test different clustering algorithms. Thebest results are obtained for the automatically obtained examples, yielding purityvalues up to 84% on average over 20 nouns.

9. [Agirre, 2004] Clustering word senses

WordNet does not provide any information about the relations among the word senses of a given word; that is, the word senses are given as a list. The primary goal of this work is to tackle the fine-grainedness and lack of structure of WordNet word senses, using the clusters to improve Word Sense Disambiguation results. The authors plan to make this resource publicly available for all WordNet nominal word senses, and expect the similarity measure to be valuable in better acquiring the explicit relations among WordNet word senses, including specialization, systematic polysemy and metaphorical relations.

10. [Alfonseca et al., 2004] Approximating hierarchy-based similarity for WordNet nominal synsets using Topic Signatures.

Topic signatures are context vectors built for concepts. They can be automatically acquired for any concept hierarchy using simple methods. This paper explores the correlation between a distributional semantic similarity based on topic signatures and several hierarchy-based similarities. The authors show that topic signatures can be used to approximate link distance in WordNet (0.88 correlation), which allows for various applications, e.g. classifying new concepts in existing hierarchies. Two methods have been explored for building topic signatures (monosemous relatives vs. all relatives), exploring a large number of different parameters for both methods.

11. [Gibert et al., 2004] Towards binding Spanish senses to WordNet senses

This work tries to enrich the Spanish WordNet using a Spanish taxonomy as a knowledge source. The Spanish taxonomy is composed of Spanish senses, while WordNet is composed of synsets (English senses). A set of weighted associations between Spanish words and WordNet synsets is used to infer associations between the two taxonomies.

12. [McCarthy et al., 2003] Detecting a Continuum of Compositionality

This paper investigates the use of an automatically acquired thesaurus for measures designed to indicate the compositionality of candidate multiword verbs, specifically English phrasal verbs identified automatically using a robust parser. The authors examine various measures using the nearest neighbours of the phrasal verb, and in some cases the neighbours of the simplex counterpart, and show that some of these correlate significantly with human rankings of compositionality on the test set. They also show that whilst the compositionality judgements correlate with some statistics commonly used for extracting multiwords, the relationship is not as strong as that using the automatically constructed thesaurus.

13. [Catala et al., 2003] A portable method for acquiring information extraction patterns without annotated corpora

The main issue when building Information Extraction (IE) systems is how to obtain the knowledge needed to identify relevant information in a document. Most approaches require expert human intervention in many steps of the acquisition process. In this paper we describe ESSENCE, a new method for acquiring IE patterns that significantly reduces the need for human intervention. The method is based on ELA, a specifically designed learning algorithm for acquiring IE patterns without tagged examples. The distinctive features of ESSENCE and ELA are that (1) they permit the automatic acquisition of IE patterns from unrestricted and untagged text representative of the domain, due to (2) their ability to identify regularities around semantically relevant concept-words for the IE task by (3) using non-domain-specific lexical knowledge tools such as WordNet, and (4) restricting the human intervention to defining the task, and validating and typifying the set of IE patterns obtained. Since ESSENCE does not require a corpus annotated with the type of information to be extracted and it uses a general purpose ontology and widely applied syntactic tools, it reduces the expert effort required to build an IE system and therefore also reduces the effort of porting the method to any domain. The results of the application of ESSENCE to the acquisition of IE patterns in a MUC-like task are shown.

14. [Avancini et al., 2003] Expanding Domain-Specific Lexicons by Term Categorization

This paper discusses an approach to the semi-automatic expansion of domain-specific lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, the authors view the expansion of such lexicons as an iterative process of learning previously unknown associations between terms and domains (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each $c_i$ in a set $C = \{c_1, \ldots, c_m\}$ of domains, a sequence $L^i_0 \subseteq L^i_1 \subseteq \ldots \subseteq L^i_n$ of lexicons, bootstrapping from an initial lexicon $L^i_0$ and a set of text corpora $\Theta = \{\theta_0, \ldots, \theta_{n-1}\}$ given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of domains, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, the authors formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labelled with domains. As a learning device they adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of “data cleaning”, thereby making the process of generating a lexicon an iteration of generate-and-test steps.
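The dual representation used in term categorization can be illustrated with a small sketch. It is not the authors' system: the toy corpus, the seed lexicons and the function names are hypothetical, and a simple nearest-centroid rule with cosine similarity stands in for the boosting learner described in the abstract. The sketch only shows terms represented as vectors over documents and one "generate" step that extends each seed lexicon.

```python
import math

# Toy corpus (hypothetical data): each document is a list of tokens.
docs = [
    ["bank", "loan", "interest", "mortgage"],
    ["loan", "interest", "credit", "bank"],
    ["goal", "striker", "referee", "penalty"],
    ["referee", "goal", "league", "penalty"],
]

def term_vectors(docs):
    """Dual view: each term becomes a vector of counts over documents."""
    vocab = sorted({t for d in docs for t in d})
    return {t: [d.count(t) for d in docs] for t in vocab}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def expand_lexicons(seed, docs):
    """One generate step: add each unlabelled term to the closest domain.

    Assumes every seed term occurs in the corpus. Nearest-centroid
    assignment is a stand-in for the boosting learner, not the paper's method.
    """
    vecs = term_vectors(docs)
    cents = {dom: centroid([vecs[t] for t in terms]) for dom, terms in seed.items()}
    known = {t for terms in seed.values() for t in terms}
    expanded = {dom: set(terms) for dom, terms in seed.items()}
    for term, vec in vecs.items():
        if term in known:
            continue
        best = max(cents, key=lambda dom: cosine(vec, cents[dom]))
        if cosine(vec, cents[best]) > 0:
            expanded[best].add(term)
    return expanded

seed = {"finance": {"bank", "loan"}, "sport": {"goal", "referee"}}
lexicons = expand_lexicons(seed, docs)
print(lexicons)
```

Each expanded lexicon contains its seed lexicon, which mirrors the inclusion $L^i_0 \subseteq L^i_1$; repeating the step on a further corpus would give the next lexicon in the sequence.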

6 Papers related to WP6: WSD

1. [McCarthy and Carroll, 2003] Disambiguating Nouns, Verbs and Adjectives Using Automatically Acquired Selectional Preferences

Selectional preferences have been used by word sense disambiguation (WSD) systems as one source of disambiguating information. We evaluate WSD using selectional preferences acquired for English adjective-noun, subject, and direct object grammatical relationships with respect to a standard test corpus. The selectional preferences are specific to verb or adjective classes, rather than individual word forms, so they can be used to disambiguate the co-occurring adjectives and verbs, rather than just the nominal argument heads. We also investigate use of the one-sense-per-discourse heuristic to propagate a sense tag for a word to other occurrences of the same word within the current document in order to increase coverage. Although the preferences perform well in comparison with other unsupervised WSD systems on the same corpus, the results show that for many applications, further knowledge sources would be required to achieve an adequate level of accuracy and coverage. In addition to quantifying performance, we analyze the results to investigate the situations in which the selectional preferences achieve the best precision and in which the one-sense-per-discourse heuristic increases performance.

2. [Vazquez et al., 2003] Método de desambiguación léxica basada en el recurso semántico Dominios Relevantes

This paper presents a new WSD method that exploits WordNet Domains and the information contained in WordNet glosses to build a new resource: Relevant Domains. The new method is based on this new lexical resource and has been tested using the SENSEVAL-2 English all-words task.

3. [Magnini and Strapparava, 2004] User modelling for news Web sites with word sense based techniques

This paper proposes to use a content-based document representation as a starting point to build a model of the user’s interests. Documents passed over are processed and relevant senses (disambiguated over WordNet) are extracted and then combined to form a semantic network. A filtering procedure dynamically predicts new documents on the basis of the semantic network.

There are two main advantages of a content-based approach: first, the model predictions, being based on senses rather than words, are more accurate; second, the model is language independent, allowing navigation in multilingual sites. The authors report the results of a comparative experiment that has been carried out to give a quantitative estimation of these improvements.

4. [Magnini et al., 2003a] Making Explicit the Semantics Hidden in Schema Models

Most of the data stored in the Semantic Web is organized in schema models, which can be represented as labeled graphs where labels are short natural language expressions. Examples of schema models include ER-schema automata, ontologies, taxonomies, and Web Directories. The semantics of schema models is not explicit but is hidden in their structures and labels. To obtain semantic interoperability we need to make their semantics explicit by taking into account both the interpretation of the labels and the structures described by the arcs. The authors propose a methodology for interpreting schema models on the basis of the taxonomic relations and the linguistic material they contain. The authors rely on a set of linguistic repositories, such as WordNet, and explore a number of crucial linguistic issues such as disambiguation of polysemous words, multiwords, and coordinations. The Web Directories of Google and Yahoo! have been chosen as an evaluation set. The paper shows that there is a considerable amount of information to be made explicit and discusses the performance of an implementation of the analysis.

5. [Magnini et al., 2003b] Making Explicit the Hidden Semantics of Hierarchical Classifications

Hierarchical classifications are concept hierarchies used to organize large amounts of documents. File systems, products’ taxonomies for the market place and the directories provided by Web portals are common examples of hierarchical classifications. As semi-structured knowledge sources, hierarchical classifications have peculiar features: they differ both from plain texts, since they are based on a taxonomy of concepts, and from structured data sources (such as databases and formal ontologies), because many semantic relations are implicit. The authors propose a methodology for building a semantic interpretation of hierarchical classifications on the basis of the analysis of the taxonomic relations and the linguistic material they contain. The authors provide a formal semantics for hierarchical classifications and then use that formal framework to interpret the implicit knowledge represented, by exploring a number of crucial linguistic issues. Relevant phenomena addressed include the disambiguation of polysemous words, the semantics of multiwords, and the interpretation of coordinations. The Web Directories of Google and Yahoo! have been chosen as an evaluation set. The paper shows that there is a considerable amount of information to be made explicit and discusses the performance of an implementation of the analysis.

6. [Gliozzo et al., 2004] Unsupervised and Supervised Exploitation of Semantic Domains in Lexical Disambiguation

Domains are common areas of human discussion, such as economics, politics, law, science etc., which demonstrate lexical coherence. This paper explores the dual role of domains in word sense disambiguation (WSD). On the one hand, domain information provides generalized features at the paradigmatic level that are useful to discriminate among many word senses. On the other hand, domain distinctions constitute a useful level of coarse-grained sense distinctions, which lends itself to more accurate disambiguation with lower amounts of knowledge.

This paper extends and grounds the modeling of domains and the exploitation of WordNet Domains, an extension of WordNet in which each synset is labeled with domain information. The authors propose a novel unsupervised probabilistic method for the critical step of estimating domain relevance for contexts, and suggest utilizing it within unsupervised Domain Driven Disambiguation (DDD) for word senses, as well as within a traditional supervised approach.

The paper presents empirical assessments of the potential utilization of domains in WSD in a wide range of comparative settings, both supervised and unsupervised. Following the dual role of domains, the authors report experiments that evaluate both the extent to which domain information provides effective features for WSD and the accuracy obtained by WSD at domain-level sense granularity. Furthermore, the authors demonstrate the potential for either avoiding or minimizing manual annotation thanks to the generalized level of information provided by domains.
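The idea of disambiguating a word through the domain of its context can be sketched as follows. This is only an illustration under stated assumptions: the word-to-domain table and the sense inventory are hypothetical stand-ins for WordNet Domains, and the relevance estimate is a plain relative frequency, not the probabilistic method proposed in the paper.

```python
from collections import Counter

# Hypothetical word-to-domain labels, standing in for WordNet Domains.
WORD_DOMAINS = {
    "loan": ["economy"],
    "interest": ["economy"],
    "money": ["economy"],
    "river": ["geography"],
    "water": ["geography"],
    "shore": ["geography"],
}

# Hypothetical sense inventory: each sense of a target word carries one domain.
SENSE_DOMAINS = {
    "bank": {"bank#1": "economy", "bank#2": "geography"},
}

def domain_relevance(context):
    """Relative frequency of each domain among the labelled context words."""
    counts = Counter(d for w in context for d in WORD_DOMAINS.get(w, []))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}

def disambiguate(target, context):
    """Return the sense whose domain is most relevant to the context.

    Returns None when no context word carries a domain matching any sense.
    """
    relevance = domain_relevance(context)
    senses = SENSE_DOMAINS[target]
    best = max(senses, key=lambda s: relevance.get(senses[s], 0.0))
    return best if relevance.get(senses[best], 0.0) > 0 else None

print(disambiguate("bank", ["the", "river", "water", "near", "the", "shore"]))  # bank#2
print(disambiguate("bank", ["loan", "with", "low", "interest"]))  # bank#1
```

Because the decision is made at the level of domains, senses that share a domain are not distinguished, which corresponds to the coarse-grained, domain-level sense granularity discussed in the abstract.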

7 Papers related to WP7: Evaluation and Assessment

1. [Civit et al., 2003] Análisis cualitativo y cuantitativo del acuerdo entre anotadores en el desarrollo de corpus interpretados lingüísticamente

The main goal of this work is to present a qualitative and quantitative analysis of disagreements among annotators during the syntactic labelling of the Cast3LB corpus. A one-thousand-sentence corpus has been annotated by five annotators. Consecutive evaluations of the results have been performed and have led to successive improvements of the guidelines. Finally, a qualitative analysis and a classification of the differences among annotators are presented.

8 Papers related to WP8: User Validation

1. [Massot et al., 2003] QA UdG-UPC System at TREC-12

This paper describes a prototype multilingual Q&A system designed to participate in the Q&A Track of TREC-12. The system returns concrete answers, so the authors participate in the Q&A main task for factoid questions. The main components of the system are: (1) Inductive Logic Programming to learn the question type, (2) clustering of Named Entities to improve Information Retrieval, and (3) semantic relations and EuroWordNet synsets to perform language-independent answer extraction.

2. [Turmo, 2003] Information Extraction, Multilinguality and Portability

The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition approaches to textual data, such as Information Extraction (IE), have led to an increase in IE research. Some examples of these applications are the generation of databases from documents, as well as the acquisition of knowledge useful for emerging technologies like question answering and information integration, among others related to text mining. However, one of the main drawbacks of the application of IE is its intrinsic language and domain dependence. To reduce the high cost of manually adapting IE applications to new domains and languages, the research community has applied different Machine Learning (ML) techniques. This survey describes and compares the main approaches to IE and the different ML techniques used to achieve adaptable IE technology.

3. [Alonso et al., 2003c] Approaches to Text Summarization: Questions and Answers

This paper presents a comparative study of Automated Text Summarization (TS) systems. It describes the factors to be taken into account for evaluating those systems and outlines three alternative classifications. The paper provides extensive examples of working TS systems according to their characterizing features, performance, and obtained results, with special emphasis on the multilingual aspect of summarization.

4. [Alonso et al., 2003b] Combining heterogeneous knowledge sources in e-mail summarization

This paper presents Carpanta, an e-mail summarization system that applies a knowledge-intensive approach to obtain highly coherent summaries. Robustness and portability are guaranteed by the use of general-purpose NLP, but it also exploits language- and domain-dependent knowledge. The system is evaluated against a corpus of human-judged summaries, and the contribution of each kind of information to summary goodness is assessed.

5. [Alonso et al., 2003a] CARPANTA eats words you don’t need from e-mail

This paper presents Carpanta, an e-mail summarization system that applies a knowledge-intensive approach to obtain highly coherent summaries. Robustness and portability are guaranteed by the use of general-purpose NLP, but it also exploits language- and domain-dependent knowledge. The system is evaluated against a corpus of human-judged summaries, reaching satisfactory levels of performance.

6. [Magnini et al., 2003a] Making Explicit the Semantics Hidden in Schema Models

See abstract in section 6

7. [Magnini et al., 2003b] Making Explicit the Hidden Semantics of Hierarchical Classifications

See abstract in section 6

9 Papers related to WP9: Exploitation and Dissemination

1. [Rigau et al., 2003] The MEANING project

See abstract in section 2

References

[Aduriz et al., 2003a] I. Aduriz, I. Aldezabal, I. Alegria, J. Arriola, A. Díaz de Ilarraza, N. Ezeiza, and K. Gojenola. Finite state applications for Basque. In Proceedings of the EACL’2003 workshop on Finite-State Methods in Natural Language Processing, Budapest, Hungary, 2003.

[Aduriz et al., 2003b] I. Aduriz, M. Aranzabe, J. Arriola, A. Atutxa, A. Díaz de Ilarraza, M. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing. In Proceedings of Corpus Linguistics 2003, Lancaster, UK, 2003.

[Aduriz et al., 2003c] I. Aduriz, M. Aranzabe, J. Arriola, A. Atutxa, A. Díaz de Ilarraza, A. Garmendia, and M. Oronoz. Construction of a Basque dependency treebank. In Proceedings of the TLT 2003 workshop on Treebanks and Linguistic Theories, Växjö, Sweden, 2003.

[Aduriz et al., 2004] I. Aduriz, M. Aranzabe, J. Arriola, A. Atutxa, A. Díaz de Ilarraza, K. Gojenola, M. Oronoz, and L. Uria. A cascaded syntactic analyser for Basque. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’04), Seoul, Korea, 2004.

[Agirre and de Lacalle, 2003] E. Agirre and O. López de Lacalle. Clustering WordNet word senses. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03), Borovets, Bulgaria, 2003.

[Agirre et al., 2003] E. Agirre, I. Aldezabal, and E. Pociello. A pilot study of English selectional preferences and their cross-lingual compatibility with Basque. In Proceedings of the International Conference on Text Speech and Dialogue (TSD’2003), České Budějovice, Czech Republic, 2003.

[Agirre, 2004] E. Agirre. Clustering word senses. In Proceedings of the Second International Global WordNet Conference (GWC’04). Panel on figurative language, Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Alegria et al., 2003] I. Alegria, I. Balza, N. Ezeiza, I. Fernández, and R. Urizar. Named entity recognition and classification for texts in Basque. In Actas de II Jornadas de Tratamiento y Recuperación de Información, JOTRI’03, Madrid, Spain, 2003.

[Alegria et al., 2004] I. Alegria, O. Arregi, I. Balza, N. Ezeiza, I. Fernández, and R. Urizar. Design and development of a named entity recognizer for an agglutinative language. In Proceedings of the IJCNLP-04 workshop on Named Entity Recognition, Hong Kong, China, 2004.

[Alfonseca et al., 2004] E. Alfonseca, E. Agirre, and O. López de Lacalle. Approximating hierarchy-based similarity for WordNet nominal synsets using topic signatures. In Proceedings of the Second International Global WordNet Conference (GWC’04). Panel on figurative language, Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Alonso et al., 2003a] L. Alonso, B. Casas, I. Castellón, S. Climent, and L. Padró. Carpanta eats words you don’t need from e-mail. In Proceedings of SEPLN’03, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.

[Alonso et al., 2003b] L. Alonso, B. Casas, I. Castellón, S. Climent, and L. Padró. Combining heterogeneous knowledge sources in e-mail summarization. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03), Borovets, Bulgaria, 2003.

[Alonso et al., 2003c] L. Alonso, I. Castellón, S. Climent, M. Fuentes, L. Padró, and H. Rodríguez. Approaches to text summarization: Questions and answers. Revista Iberoamericana de Inteligencia Artificial, Special Issue on Multilingual Information Access, 5(22):34–52, 2003.

[Alonso et al., 2003d] L. Alonso, J. Shih, I. Castellón, and L. Padró. An analytic account of discourse markers for shallow NLP. In Proceedings of the ESSLLI’03 workshop on The Meaning and Implementation of Discourse Particles in 15th European Summer School in Logic, Language and Information, Vienna, Austria, 2003.

[Atserias et al., 2003a] Jordi Atserias, Mauro Castillo, Francis Real, Horacio Rodríguez, and German Rigau. Exploring large-scale acquisition of multilingual semantic models for predicates. In Proceedings of SEPLN’03, pages 39–46, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.

[Atserias et al., 2003b] Jordi Atserias, Luís Villarejo, and German Rigau. Integrating and porting knowledge across languages. In RANLP’03, pages 31–37, Borovets, Bulgaria, 2003.

[Atserias et al., 2003c] Jordi Atserias, Luís Villarejo, and German Rigau. Starting up the multilingual central repository. In SEPLN’03, pages 261–268, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.

[Atserias et al., 2004] Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek Vossen. The MEANING multilingual central repository. In Proceedings of the Second International Global WordNet Conference (GWC’04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Avancini et al., 2003] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, and R. Zanoli. Expanding domain-specific lexicons by term categorization. In Proceedings of the 18th ACM Symposium on Applied Computing, Special Track on Information Access and Retrieval Systems (SAC’03), Melbourne, Florida, 2003.

[Bentivogli and Pianta, 2003] L. Bentivogli and E. Pianta. Beyond lexical units: Enriching wordnets with phrasets. In Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, 2003.

[Bentivogli and Pianta, 2004] L. Bentivogli and E. Pianta. Extending WordNet with syntagmatic information. In Proceedings of the Second International Global WordNet Conference (GWC’04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Bentivogli et al., 2003] L. Bentivogli, C. Girardi, and E. Pianta. The MEANING Italian corpus. In Proceedings of the conference on Corpus Linguistics 2003, Lancaster, UK, 2003.

[Carreras and Marquez, 2003a] X. Carreras and L. Màrquez. Online learning via global feedback for phrase recognition. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems, NIPS’03, Vancouver, Canada, 2003.

[Carreras and Marquez, 2003b] X. Carreras and L. Màrquez. Phrase recognition by filtering and ranking with perceptrons. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03), Borovets, Bulgaria, 2003.

[Carreras et al., 2003a] X. Carreras, L. Màrquez, and L. Padró. Learning a perceptron-based named entity chunker via online recognition feedback. In Proceedings of CoNLL 2003: Shared Task Contribution, Edmonton, Canada, 2003.

[Carreras et al., 2003b] X. Carreras, L. Màrquez, and L. Padró. Named entity recognition for Catalan using Spanish resources. In 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, 2003.

[Carreras et al., 2003c] X. Carreras, L. Màrquez, and L. Padró. A simple named entity extractor using AdaBoost. In Proceedings of CoNLL 2003: Shared Task Contribution, Edmonton, Canada, 2003.

[Castillo et al., 2003] M. Castillo, F. Real, and G. Rigau. Asignación automática de etiquetas de dominios en WordNet. In Proceedings of SEPLN’03, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.

[Castillo et al., 2004] M. Castillo, F. Real, and G. Rigau. Automatic assignment of domain labels to WordNet. In Proceedings of the Second International Global WordNet Conference (GWC’04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Catala et al., 2003] N. Català, N. Castell, and M. Martín. A portable method for acquiring information extraction patterns without annotated corpora. Natural Language Engineering, 9(2):151–179, 2003.

[Civit et al., 2003] M. Civit, A. Ageno, B. Navarro, N. Bufi, and M.A. Martí. Análisis cualitativo y cuantitativo del acuerdo entre anotadores en el desarrollo de corpus interpretados lingüísticamente. In Proceedings of the XIX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN’03), Alcalá de Henares, Spain, 2003.

[Daude et al., 2001] J. Daudé, L. Padró, and G. Rigau. A complete WN1.5 to WN1.6 mapping. In Proceedings of the NAACL Workshop “WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, Pittsburgh, PA, United States, 2001.

[Daude et al., 2003a] J. Daudé, L. Padró, and G. Rigau. Making WordNet Mappings Robust. In Proceedings of the 19th Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural, SEPLN’03, Universidad de Alcalá de Henares, Madrid, Spain, 2003.

[Daude et al., 2003b] J. Daudé, L. Padró, and G. Rigau. Validation and Tuning of WordNet Mapping Techniques. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03), Borovets, Bulgaria, 2003.

[de Ilarraza et al., 2003] A. Díaz de Ilarraza, K. Sarasola, A. Gurrutxaga, I. Hernáez, and N. López de Gereñu. Hizking21: Integrating language engineering resources and tools into systems with linguistic capabilities. In Proceedings of the TALN’03 workshop on NLP of Minority Languages and Small Languages, Nantes, France, 2003.

[Gibert et al., 2004] K. Gibert, J. Farreres, and H. Rodríguez. Towards binding Spanish senses to WordNet senses. In Proceedings of the Second International Global WordNet Conference (GWC’04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Gimenez and Marquez, 2003] J. Giménez and L. Màrquez. Fast and accurate part-of-speech tagging: The SVM approach revisited. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03), Borovets, Bulgaria, 2003.

[Gliozzo et al., 2004] A. Gliozzo, C. Strapparava, and I. Dagan. Unsupervised and supervised exploitation of semantic domains in lexical disambiguation. Submitted to Journal of Computer Speech and Language, special issue on Word Sense Disambiguation, 2004.

[Lascurain et al., 2003] V. Lascurain, E. Agirre, M. Lersundi, and L. Popelínský. Disambiguation of case suffixes in Basque. In Proceedings of the TALN’03 workshop on NLP of Minority Languages and Small Languages, Nantes, France, 2003.

[Lersundi and Agirre, 2003] M. Lersundi and E. Agirre. Semantic interpretations of postpositions and prepositions: a multilingual inventory for Basque, English and Spanish. In Proceedings of the ACL-SIGSEM workshop on Linguistic dimensions of prepositions and their use in computational linguistics formalisms and applications, Toulouse, France, 2003.

[Magnini and Strapparava, 2004] Bernardo Magnini and Carlo Strapparava. User modelling for news web sites with word sense based techniques. To appear in User Modeling and User Adapted Interaction, UMUAI, 14, special issues, 2004.

[Magnini et al., 2003a] B. Magnini, L. Serafini, and M. Speranza. Making explicit the semantics hidden in schema models. In Proceedings of the ISWC workshop on Human Language Technology for the Semantic Web and Web Services, Sanibel Island, USA, 2003.

[Magnini et al., 2003b] B. Magnini, L. Serafini, and M. Speranza. Making explicit the hidden semantics of hierarchical classifications. In A. Cappelli and F. Turini, editors, AI*IA 2003: Advances in Artificial Intelligence, Proceedings of AI*IA 2003, Lecture Notes in Artificial Intelligence 2829 (LNCS/LNAI), pages 436–448, Pisa, Italy, 2003. Springer Verlag.

[Massot et al., 2003] M. Massot, H. Rodríguez, and D. Ferrés. QA UdG-UPC system at TREC-12. In Proceedings of the 12th Text Retrieval Conference (TREC), Gaithersburg, USA, 2003.

[McCarthy and Carroll, 2003] D. McCarthy and J. Carroll. Disambiguating nouns, verbs and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654, 2003.

[McCarthy et al., 2003] D. McCarthy, B. Keller, and J. Carroll. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 73–80, Sapporo, Japan, 2003.

[Negri and Magnini, 2004] M. Negri and B. Magnini. Using WordNet predicates for multilingual named entity recognition. In Proceedings of the Second International Global WordNet Conference (GWC’04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Oliver et al., 2003a] A. Oliver, I. Castellón, and L. Màrquez. Automatic lexical acquisition from raw corpora: An application to Russian. In Proceedings of the EACL-2003 Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary, 2003.

[Oliver et al., 2003b] A. Oliver, I. Castellón, and L. Màrquez. Use of internet for augmenting coverage in a lexical acquisition system from raw corpora. In Proceedings of the RANLP’03 International Workshop on Information Extraction for Slavonic and Other Central and Eastern European Languages (IESL’03), Borovets, Bulgaria, 2003.

[Rigau et al., 2003] G. Rigau, E. Agirre, and J. Atserias. The MEANING project. In Proceedings of the XIX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN’03), Alcalá de Henares, Spain, 2003.

[Turmo, 2003] J. Turmo. Information extraction, multilinguality and portability. Revista Iberoamericana de Inteligencia Artificial, Special Issue on Multilingual Information Access, 5(22):57–78, 2003.

[Vazquez et al., 2003] S. Vázquez, A. Montoyo, and G. Rigau. Método de desambiguación léxica basada en el recurso semántico dominios relevantes. In Proceedings of the XIX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN’03), Alcalá de Henares, Spain, 2003.
