Page 1:

N. Calzolari, 2nd KYOTO Workshop, Gifu, Japan, January 2011

Nicoletta Calzolari

Istituto di Linguistica Computazionale – CNR – Pisa

glottolo@ilc.cnr.it

The Future of KYOTO

… with some historical notes to show a path along an evolving vision

Language Resources in today’s EU context: META-SHARE, ...

Page 2:

Why are such needed LRs still lacking after 30 years of R&D in the field?

1) Because the main trend until the mid-’80s was to privilege the processing of so-called “critical” phenomena, studied by the dominating linguistic theories, rather than focusing on the deep analysis of the real uses of a language

As a result, CL was focusing on:

• few examples, often artificially built
• lexicons made of few entries (toy lexicons)
• grammars with poor coverage

2) Because large-scale LRs are costly & their production requires a big organizing effort


Old slide with Antonio Zampolli (’80s/early ‘90s)

Why do we still lack them??

Page 3:

… back from the early ‘80s

It became evident that: part of the results of meaning extraction, e.g. many meaning distinctions, which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level, and had to be blurred into unique features and values

Unfortunately, it is still today difficult to constrain word-meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries


Automatic acquisition of lexical information from MRDs

• Was my first research & became central in the Pisa group (ACQUILEX)
• And also Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, …
• The trend was: “large-scale computational methods for the transformation of machine readable dictionaries into machine tractable dictionaries”
• Instead of relying on linguists’ introspection

Pioneering research

Historical notes

Page 4:

… back from the late ’80s

Historical notes

After acquisition from MRDs, automatic acquisition of info from texts: this trend has today become a consolidated & pervasive fact

From acquisition of “linguistic information”

To acquisition of “general knowledge”, with more data-intensive, robust, reliable methods

Lesson learned: Need of adequate models to handle actual usage of language

Lesson learned: (In-)adequacy of (current) lexicons

Lesson learned: Going from core sets to large coverage has implications not just in quantitative terms, but more interestingly in terms of changes to the models and the strategies of processes

Page 5:

Looking into the past

All started with the situation we had in the late ’80s – early ’90s

With all the Xxx-LEX projects: MultiLex, GeneLex, AcquiLex, Xxx-Lex

A. Zampolli: Let’s be coherent: Xxx-Lex

After the “Grosseto Workshop” (1985): a turning point

EAGLES/ISLE: Standards, Best Practices, ...

Page 6:

ISO LMF: Lexical Markup Framework

Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry

+ various extensions: Morphology, NLP Multilingual notations, NLP MWE pattern, NLP Paradigm class, NLP Semantic, MRD, NLP Syntax, Constraint Expression

Modular framework

LMF specs comply with UML modelling principles

An XML DTD allows implementation

Builds on EAGLES/ISLE

The field is mature

[Diagram labels: NEDO Asian Languages; NICT Language-Grid Service Ontology; ICT KYOTO; LIRICS; LexInfo; New initiatives…]

Page 7:

KYOTO

A search environment using semantic technologies

A “compass” for the web2.0

Interdisciplinarity: scientific community (LRT, web technologies, knowledge engineers), companies, domain experts

Multilingualism: 7 languages (2 Asian languages)

The “resource” perspective

Needs to share lexical/knowledge bases & tools, both general & domain-related, under the form of lexical/ontological & sw repositories

Kyoto Core System is open & free

Page 8:

KYOTO Annotation Format (KAF)

Multi-level Annotation Format
• stand-off annotation
• uniform representation for 7 languages

Shared through the languages:
• Text: tokenisation, sentences, paragraphs with reference to the sources
• Terms: words & multi-words, parts-of-speech, etc.
• Chunks: constituents & syntagmatic realization
• Dependencies: grammatical functions

● L1 – Semantic modules: Multiword tagging, Sense Tagging, Named Entity Recognition, OntoTagging

● L2 – Semantic module: event/fact extraction

from Piek Vossen
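
The layering on this slide can be illustrated with a minimal Python sketch of stand-off annotation. It is not the KAF schema: the class names, field names and identifier scheme (`w1`, `t1`, `c1`) are assumptions made for illustration, and so are the example sentence and its analysis. What it aims to show is that each layer points to the layer below by identifier instead of copying the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:            # text layer: one tokenised word form
    wid: str            # hypothetical token id, e.g. "w1"
    text: str
    sent: int           # sentence number
    offset: int         # character offset in the source text


@dataclass
class Term:             # term layer: lemma and part of speech
    tid: str
    lemma: str
    pos: str
    span: List[str]     # ids of the tokens this term covers


@dataclass
class Chunk:            # chunk layer: a constituent over terms
    cid: str
    phrase: str
    span: List[str]     # ids of the terms in the chunk


@dataclass
class Dependency:       # dependency layer: grammatical function
    rfunc: str
    head: str           # term id of the head
    dependent: str      # term id of the dependent


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenisation that records character offsets."""
    tokens, offset = [], 0
    for i, word in enumerate(text.split(), start=1):
        offset = text.index(word, offset)
        tokens.append(Token(f"w{i}", word, 1, offset))
        offset += len(word)
    return tokens


def surface(term: Term, tokens: List[Token]) -> str:
    """Recover the surface string of a term through its token ids."""
    by_id = {t.wid: t for t in tokens}
    return " ".join(by_id[w].text for w in term.span)


def chunk_text(chunk: Chunk, terms: List[Term], tokens: List[Token]) -> str:
    """Recover the surface string of a chunk through terms, then tokens."""
    by_id = {t.tid: t for t in terms}
    return " ".join(surface(by_id[tid], tokens) for tid in chunk.span)


text = "Frogs eat insects"
tokens = tokenize(text)
terms = [
    Term("t1", "frog", "N", ["w1"]),
    Term("t2", "eat", "V", ["w2"]),
    Term("t3", "insect", "N", ["w3"]),
]
chunks = [Chunk("c1", "NP", ["t1"]), Chunk("c2", "VP", ["t2", "t3"])]
deps = [Dependency("subj", "t2", "t1"), Dependency("obj", "t2", "t3")]

print(chunk_text(chunks[1], terms, tokens))
```

Because the upper layers hold only identifiers, a semantic module could add its own layer over the terms without touching the text or the layers already present.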

Page 9:

KYOTO System & Adoption of Standards

[Architecture diagram, from Piek Vossen. Labels: Source Documents; Linear MAF/SYNAF; Linear SEMAF; Term extraction Tybot; Generic TMF; Semantic annotation; Linear Generic FACTAF; Fact extraction Kybot; Domain editing Wikyoto; Wordnet; Domain Wordnet; LMF API; Ontology; Domain ontology; OWL API; Concept User; Fact User]

Could be at the basis of a new standard?

Page 10:

A common representation format for WordNets

Seven WordNets (WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES), similar but not identical: hampered interoperability

Endow WordNet with a representation format allowing easy access, integration & interoperability among resources

To be accessed both intra- and inter-linguistically to support easier integration

Page 11:

A common representation format: WordNet-LMF Data Categories

[Class diagram, from Monica Monachini. Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs, MonolingualExternalRef, Synset, Definition, Statement, SynsetRelations, SynsetRelation, SenseAxes, SenseAxis, InterlingualExternalRefs, InterlingualExternalRef, Meta. Cardinalities shown on the associations: 1..1, 1..*, 0..1, 0..*]

Page 12:

Towards a Centralized WordNet DC Registry

A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid

Inter-WN / Intra-WN

Page 13:

WordNet-LMF Multilingual level - Cross-lingual Relations

SWN <fuego_3, llama_1> 09686541-n

IWN <fuoco_1, fiamma_1> 00001251-n

WN3.0 <fire_1 flame_1 flaming_1> 13480848-n

<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id ID #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>

groups monolingual synsets corresponding to each other and sharing the same relations to English

link to ontology/(ies)

specifies the type of correspondence

from Monica Monachini
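
As an illustration of the DTD fragment on this slide, the following Python sketch builds one `SenseAxis` instance with the standard-library `xml.etree.ElementTree` module. Element and attribute names are taken from the DTD; everything else is an assumption: the identifier scheme for the `Target` IDs (`swn-…`, `iwn-…`, `wn30-…`), the `relType` value `eq_synonym`, and the external system `ExampleOntology` with reference `Fire` are hypothetical values chosen only to make the example concrete. The optional `Meta` element is left out.

```python
import xml.etree.ElementTree as ET

# Enumerated values allowed by the DTD for InterlingualExternalRef/@relType
ALLOWED_REF_TYPES = {"at", "plus", "equal"}


def build_sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build a SenseAxis element following the content model
    (Meta?, Target+, InterlingualExternalRefs?); Meta is omitted here."""
    if not target_ids:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            attrs = {"externalSystem": system, "externalReference": reference}
            if ref_type is not None:  # relType is #IMPLIED, so it may be absent
                if ref_type not in ALLOWED_REF_TYPES:
                    raise ValueError(f"relType must be one of {sorted(ALLOWED_REF_TYPES)}")
                attrs["relType"] = ref_type
            ET.SubElement(refs, "InterlingualExternalRef", attrs)
    return axis


# Hypothetical ids for the three synsets shown on the slide
axis = build_sense_axis(
    "sa_fire",
    "eq_synonym",  # assumed value; the DTD only says CDATA
    ["swn-09686541-n", "iwn-00001251-n", "wn30-13480848-n"],
    [("ExampleOntology", "Fire", "equal")],  # hypothetical ontology link
)

sense_axes = ET.Element("SenseAxes")
sense_axes.append(axis)
xml_text = ET.tostring(sense_axes, encoding="unicode")
print(xml_text)
```

The sketch only mirrors the content model; it does not validate against the DTD, which would need a validating parser.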

Page 14:

Complex picture! Is there anything we need to do for Interoperability?

Work within ISO:
• LMF: abstract meta-model for lexical representation
• Ontology Group or more Groups?
• Language Resource Ontologies: ontology of data categories

Real life:
• Lexicons (e.g. WordNets) that are called Ontologies
• Lexicons linked to Ontologies: to be used in applications, in multilingual systems, domains, …
• Work on “ontologising” Lexicons: to allow exploiting various relations, to make inferences, …
• Semantic Lexicons, with many types of relations among semantic units: these are often of “conceptual/world-knowledge” nature. Do we want DCs for these?

ISO SC 4/WG 4 – Lexicon-Ontology relations

New work item: PWI 24622

KYOTO can contribute

Page 15:

To explore the need of doing something within ISO about the relations between Lexicon and Ontology

Do we/ISO need to address another (lexical) layer?
• How lexicons and ontologies are linked and information mapped from one to the other
• The ontological layer in a/connected to a lexicon

Possible issues/questions:
• Is LMF enough to represent Ontological links?
• How to connect work being done in ISO Lexical group and ISO Ontology groups?
• Lexicon and Ontologies: separation? or lexicalised ontologies? or ontologies lexicons?
• Lexicon, Ontologies and Domains
• On a very different dimension: Ontology of lexical/semantic/conceptual categories? Standardised semantic categories, ontology labels?
• Relation to multilinguality
• ...

KYOTO can contribute

Page 16:

Input to Multilingual Web http://www.multilingualweb.eu/

The MultilingualWeb project is exploring standards and best practices that support the creation, localization and use of multilingual web-based information

It aims to raise the visibility of existing best practices and standards and identify gaps

The core vehicle for this is a series of four workshops, for networking across communities that span the various aspects involved

Next workshop on best practices aimed at development of Content for the Web, including creation of content ranging from personal authoring for blogs and social networking sites to development of large corporate or organizational enterprises:

“Content on the Multilingual Web”, 4-5 April 2011, Pisa, Italy

KYOTO can contribute

Page 17:

A new paradigm of R&D in LRs & LT

For a few years now

Open & distributed linguistic infrastructures for LRs & LT

Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results

Ability to build on each other’s achievements, allowing controlled & effective cooperation of many groups on common tasks (see Human Genome Project)

e.g. initiatives to achieve international consensus on annotation guidelines

Emerging concept of collective intelligence

Emphasize interoperability among LRs & LT

Page 18:

Some steps for a “new generation” of LRs

From huge efforts building static, large-scale, general-purpose LRs

To dynamic LRs rapidly built on-demand, tailored to specific user needs

From closed, locally developed and centralized resources

To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them

From Language Resources To Language Services

Need of an infrastructure that makes this vision operational

Page 19:

Lexical WEB

As a critical step for semantic mark-up in the Semantic Web

[Diagram labels: ComLex; SIMPLE; WordNets; FrameNet; NomLex; Lex_x; Lex_y; BioLexicon; SIMPLE-WEB; Global WordNet GRID; LMF with intelligent agents; Standards for Content Interoperability; Standards]

Enough??

Page 20:

(Distributed) Language Services

A scenario implying:
• content interoperability standards
• supra-national cooperation
• architectures enabling accessibility

Enabling:
• Create new resources on the basis of existing
• Exchange & integrate information across repositories
• Compose new services on demand

Collaborative & collective/social development & validation, cross-resource integration & exchange of information

Can KYOTO contribute?

Page 21:

Today

Which Communities? Language Resources; Language Technologies; Standardisation; Content/Ontologies; System developers; Integrators; SSH; EC; National funding agencies; Industry

Many applications/domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, …

Core: Multilinguality

Many LRs & LTs exist, but a global vision, policy & strategy is needed

EU Forum with focus on cooperation

CLARIN for SSH; FLaReNet Network; META-NET NoE

Many dimensions: need to consider together technical, organisational, strategic, economic, social, cultural, legal, political issues wrt LRs & LTs

Page 22:

FLaReNet at a glance

Fostering Language Resources Network

http://www.flarenet.eu

An international Forum to facilitate interaction, to:
• Overcome the fragmentation in LR & LT & recreate a community
• Anticipate the needs of new types of LR & LT & Language Infrastructures
• Create a shared policy for the next years
• Foster a European strategy for consolidating the sector

98 Institutional Members from 33 countries

351 Individual Subscribers

Essential Community mobilisation (also to prepare the ground for a RI)

A “roadmap”: a plan of actions as input to policy development

A (EU) model for the LRs/LTs area of the next years

Ambitious!

Page 23:

Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation

Common repositories for tools & language data should be established that are universally and easily accessible by everyone

Coordinate input to ISO/W3C standardisation work

Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe

Standards, Interoperability & Metadata are topics to be approached in cooperation

Access to LRs is critical & should involve all the community

Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”

For a new world-wide language infrastructure

Page 24:

2nd Blueprint

Result of a permanent and cyclical consultation
• Inside the community it represents
• Outside it, through connections with neighbouring projects, associations, initiatives, funding agencies

Organised along three main “directions”:
• Infrastructural Aspects
• Research and Development
• Political and Strategic Issues

Reflect three major development factors that can boost or hinder the growth of the field of LRT

Provide feedback!

http://www.flarenet.eu/sites/default/files/D8.2b.pdf

Page 25:

Sources: many meetings

[Diagram labels: Operational Interoperability; Asian Collaboration Workshop; FL-SILT Workshop; Lexicon/Ontology; Standards NEERI; 2nd FLaReNet Forum; Less-resourced Languages; Automatic Acquisition; Legal Issues; Standards; International Cooperation]

Page 26:

3rd FLaReNet Forum

The European Language Resources and Technologies Forum:
• Important role in defining recommendations
• In Barcelona: 120 Participants from 22 Countries

Define final recommendations

Next Forum in Venice! 26-27 May 2011

Previous Proceedings & Reports on the web

Blueprint will be discussed

Also for adoption & endorsement by FLaReNet Institutional Members

Page 27:

Infrastructural Aspects

Issue: Metadata

Challenge: Interoperability of Metadata sets
Recommended Actions:
• Set up a global infrastructure of common and uniform and/or interoperable metadata sets

Challenge: Metadata usable both by humans and by machines
Recommended Actions:
• Create machine-understandable metadata with formal syntax and clear semantics
• Automate the process of metadata creation
• Develop structured metadata

Issue: Documentation

Challenge: Reliable documentation of LRs according to common best practices
Recommended Actions:
• Collect all possible and existing LR documentation
• Devise and adopt a widely agreed standard documentation template for all types of resources
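
To make the recommendation of machine-understandable, structured metadata concrete, here is a minimal Python sketch of a resource description with a fixed set of required fields and a machine-readable serialisation. It is only an illustration: the field names, the example resource and the URL are hypothetical and do not correspond to any FLaReNet or META-SHARE schema.

```python
import json

# Hypothetical minimal field set; a real metadata schema would be much richer.
REQUIRED_FIELDS = ("name", "resource_type", "languages", "licence", "documentation_url")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "name": "Example Domain Wordnet",          # invented resource
    "resource_type": "lexicon",
    "languages": ["it", "en"],
    "licence": "CC-BY",
    "documentation_url": "https://example.org/docs/example-domain-wordnet",
}

# A formal syntax (JSON here) makes the same record usable by machines,
# while the field names keep it readable for humans.
serialized = json.dumps(record, sort_keys=True, indent=2)
print(serialized)
print("missing:", missing_fields(record))
```

A check like `missing_fields` is the kind of step that could be automated when resources are submitted, which is one way to read the "automate the process of metadata creation" action.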

Page 28:

Political and Strategic dimensions

Issue: Funding Agencies policies

Challenge: Devise models to allow different types of players easy access to resources
Recommended Actions:
• Ensure that publicly funded resources are publicly available either free of charge or at a small distribution cost
• Encourage/enforce use of best practices or standards in LR production projects
• Make sustainability and sharing/distribution plans mandatory in projects concerning LR production

Issue: LR citation

Challenge: Appropriate citation of Language Resources like traditional publications
Recommended Actions:
• Develop a standard protocol for citing language resources

KYOTO can be an example

Page 29:

LRE Map: Why??

Problem: Lack of information & documentation about resources is, in the e-science paradigm, a very critical issue

Non-documented resources don’t exist!!

The Map as an answer to start to fill this gap, but also:
• To encourage the needed “change in culture”
• A collective enterprise: each researcher must become aware of the importance of his/her personal engagement in documenting resources
• A task as important as creating new resources and not an accessory to be disregarded
• As the necessary service to the whole community

Will become an essential instrument to monitor the field

www.resourcebook.eu

Page 30:

How many LRs & Types at LREC?

Submissions: 1288; LR forms: 1994

• Corpora: 785
• Lexicons: 289
• Tagger/Parser: 181
• Annotation tool: 134
• Ontology: 73
• Evaluation data: 40
• Annotation Guidelines: 35
• ...

How many LRs & Types at COLING?

Submissions: 880; LR forms: 735

• Corpora: 359 - 50%
• Tagger/Parser: 81 - 11%
• Lexicons: 71 - 10%
• Evaluation data: 51 - 7%
• Ontology, Annotation tool, Evaluation tool, Tokenizer, NER: < 20 - 2%

Languages: 170!
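
As a quick check on the COLING figures, the shares per resource type can be recomputed from the counts. The sketch below assumes that the percentages on the slide are taken relative to the 735 LR forms, which the slide does not state; the variable names are illustrative.

```python
# Counts as reported on the slide for COLING
coling_lr_forms = 735
coling_counts = {
    "Corpora": 359,
    "Tagger/Parser": 81,
    "Lexicons": 71,
    "Evaluation data": 51,
}

# Share of each resource type, in percent of all LR forms (assumed denominator)
shares = {
    name: round(100 * count / coling_lr_forms, 1)
    for name, count in coling_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Under that assumption the recomputed shares come out close to the rounded percentages shown on the slide.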

Page 31:

Languages: 170!!

But obviously …

image courtesy of Wordle (http://www.wordle.net)

Page 32:

Availability

Freely available!

The large majority of resources are freely available

[Pie charts for LREC and COLING; percentages shown: 54%, 3%, 15%, 25%, 57%]

Page 33:

The Project META-NET

META-NET is a Network of Excellence (coord. Hans Uszkoreit) dedicated to fostering the technological foundations of the European multilingual information society

Objectives:
• Prepare the ground for a large-scale concerted effort by building a strategic alliance of national and international research programmes, corporate users and commercial technology providers and language communities
• Strengthen the European research community through research networking and by creating new schemes and structures for sharing resources and efforts
• Build bridges by approaching open problems in collaboration with other research fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content

Final goal: META – The Multilingual Europe Technology Alliance

Page 34:

The META Alliance

[Diagram labels: language communities; policy makers and funding bodies; user industries; provider industries; language technology community; machine learning community; semantic technologies community; cognitive systems community; multimedia content technologies]

Page 35:

Founding Members

• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany
• Barcelona Media – Centre d'Innovació, Spain
• Consiglio Nazionale Ricerche – Istituto di Linguistica Computazionale “Antonio Zampolli”, Italy
• Institute for Language and Speech Processing, R.C. “Athena”, Greece
• Charles University in Prague, Czech Republic
• Centre National de la Recherche Scientifique – Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France
• Universiteit Utrecht, The Netherlands
• Aalto University, Finland
• Fondazione Bruno Kessler, Italy
• Dublin City University, Ireland
• Rheinisch-Westfälische Technische Hochschule Aachen, Germany
• Jožef Stefan Institute, Slovenia
• Evaluations and Language Resources Distribution Agency, France

Page 36:

Three Lines of Action

The META-NET objectives translate into three lines of action:


Page 37:

The Process

[Timeline, 2010 2011 2012: Visions; Strategic Research Agenda; Roadmap]

• communication within META-NET (META-VISION)
• communication in the wider LT community and among other stakeholders
• communication to policy makers, funding bodies, public

Page 38:

Data has become a key factor in LT R&D

A few indicators:
• Increasing size & importance of LREC conference, corpora mailing list, etc.
• Citation ranks of publications on language resources

Language research and language technology belong to the Data Intensive Sciences

Expensive data become valuable through sharing

However, the long demanded and well-contemplated instruments for managing and sharing this data are still missing

Page 39:

META-SHARE: Key Features

META-SHARE is an open, integrated, secure, interoperable exchange infrastructure (resp. Stelios Piperidis) for language data & tools for the Human Language Technologies domain
• ever-evolving, scalable, including free and for-a-fee LRs/LTs and services
• including legacy, contemporary and emerging datasets, tools and technologies

A marketplace where language data & tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, aiming to support a data economy (includes free and for-a-fee LRs/LTs and also services)

Standards-compliant, overcoming format, terminological and semantic differences

Based on distributed networked repositories accessible through common interfaces

Page 40:

What we’re offering

• A channel to share and distribute language data and tools
• Technical solutions for building your own repositories
• Protocols and mechanisms for making the descriptions of your resources (and the actual resources) harvestable
• Guidelines and recommendations on standards used in the LR production and documentation processes
• Recommendations on data and tools licensing issues
• Access to large catalogues of documented, high-quality resources, as well as the actual data and tools

KYOTO can be among the first

Page 41:

Features

• Single Sign-On
• Easy Administration
• Metadata Harvesting
• Persistent Identifiers (PIDs)
• Intuitive Search
• Open Source
• Service-Oriented
• Distributed
• Replication/Backup
• Reporting & Statistics

Page 42:

v0 architecture

Page 43:

Challenges

On the communication/mobilisation side

A change of culture

Convincing arguments that data assets and their value do not necessarily grow if locked in the drawer

Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (share does not necessarily mean for free, nor for unbridled use)

Interoperability, common metadata, formats, etc.

In other words we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models

The present time window seems appropriate

KYOTO can be a “model” for other projects to follow

N. Calzolari, Multilingual Web, Madrid, 2010

Page 44:

Collaborative Resources

LR building as collaborative “common shared task”

New methodology of work

Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (LRE Map)

Interoperability acquires even more value

Needs consensual planning of common strategies towards shared objectives

Not just the sum of many individual efforts

But an organised, well-structured, collective enterprise

Similar to more mature sciences: physicists’/astronomers’ experiments … of X,000 people working on the same big enterprise

META-SHARE is a big step that needs a real paradigm shift

Page 45:

Preamble

We wanted more & more data ... Have we been too successful?!?

We experience today a sort of statistical “intoxication”!

It started as a new strategy, a revolution maybe? But it has turned to tactics. Stuck with it? In a narrow loop of small advances, not linked to each other

Can we add also a new strategy? and hopefully a vision?

Main Statement: We tend to forget about “language” & the need to understand its properties & complexities

Where do we (try to) encode what we know about language properties? In annotations

BUT: Is there any theoretical knowledge of, or any serious methodology of studying and exploiting, the interactions among the various annotation layers?

Vision: Like the big Genome project, ... a large Language initiative

Page 46:

Strategy

A Multilingual Annotation Plan as a Very Large International Initiative

• MANY (parallel) texts for MANY languages
• With ALL possible annotation layers
• Similar to more mature sciences, e.g. physics, … of thousands of people working together on the same big experiment

Create a sustainable infrastructure for a large Language repository plan, where we accumulate all the knowledge we have about language & encourage analysis of linguistic interrelations

Collaborative Resources: A new paradigm for a big language map

Means a change of mentality: going beyond “individual” research interests

From “my approach” to some “compromise” allowing to go for big amounts/integration/building on each other/…

Page 47:

From no infrastructure ... To many infrastructures/networks

We were complaining there was no infrastructure ...

Have we been too successful??

Now many infrastructural/networking initiatives

Very good opportunity

But only if we are able to act in a coordinated & coherent way

Otherwise we spoil & confuse the field