CLARIN Common Language Resources and Technology Infrastructure

CLARIN Common Language Resources and Technology Infrastructure. Daan Broeder & Dieter van Uytvanck, Max-Planck Institute for Psycholinguistics. TF-EMC2 Meeting, Dec 3 2008


Page 1: CLARIN Common Language Resources and Technology Infrastructure

CLARIN: Common Language Resources and Technology Infrastructure

Daan Broeder & Dieter van Uytvanck

Max-Planck Institute for Psycholinguistics

TF-EMC2 Meeting, Dec 3 2008

Page 2: CLARIN Common Language Resources and Technology Infrastructure

What is CLARIN

The CLARIN project is a large-scale pan-European collaborative effort to coordinate and make language resources and technology available and readily useable for Language & SSH (Social Sciences & Humanities) researchers.

• Resources: Lexica, text corpora, multi-media/multi-modal recordings, …

• Technology: applications & (web-)services such as parsers, tokenizers, speech recognizers & segmenters, …


Page 3: CLARIN Common Language Resources and Technology Infrastructure

The problem

• Existence and location of resources only known to insiders

• Archives mostly unconnected islands

• Every archive has its own standards for storage and access

• Normally need to download first when processing resources

• Social sciences and humanities researchers are not language or speech technologists

• They are often not aware of the potential benefits of using language and speech technology

• Available tools are hard to use for non-specialists


Page 4: CLARIN Common Language Resources and Technology Infrastructure

CLARIN overview

• CLARIN is an EU infrastructure project with 4.2 million euro of funding for a 3-year preparatory phase

• Additional funding from national governments (at this moment at least 7 million + 9 million euro)

• The CLARIN consortium now has 32 partners from 26 EU countries

• The CLARIN community has 146 member organizations in 32 countries (mostly from NLP organizations)

• CLARIN is based on earlier initiatives with many participants: LangWeb, EARL, TELRI, LIRICS and, more recently, DAM-LR


Page 5: CLARIN Common Language Resources and Technology Infrastructure

CLARIN Organization

WP1 Management & Coordination, Steven Krauwer, OTS U. Utrecht, NL

WP2 Technical Infrastructure, Peter Wittenburg, MPI Nijmegen, NL

WP3 Humanities Overview, Tamas Varadi, Hung. Ac. Sc., Hun

DLO DARIAH Liaison, Martin Wynne, Oxford Univ., UK

WP5 Language Resources & Technology, Erhard Hinrichs, U. Tuebingen, Ger

WP6 Dissemination, Dan Cristea, U. Iasi, Rom

WP7 IPR and Business Models, Kimmo Koskenniemi, Univ. Helsinki, Fin

WP8 Organizational Agreements, Bente Maegaard, Univ. Copenhagen, Dk


Page 6: CLARIN Common Language Resources and Technology Infrastructure

Time plan

• 2008 - 2010 Preparatory Phase

– Limited set of federated centers (10+)

– Showcases, demonstrators

– WP8: Investigate embedding in national funding schemes for construction phase & maintenance

• 2010 - 2020 Construction Phase

– No important European funding

– Depend on national project commitments

• 2020? - … Maintenance Phase


Page 7: CLARIN Common Language Resources and Technology Infrastructure

CLARIN “Holy Grail” User Scenario

A researcher authenticates himself with his own organization and creates a “virtual” collection of resources from different repositories. He does this on the basis of browsing a catalogue, searching through metadata, or searching in resource content. He is then able to use a workflow specification tool and process this virtual collection with possibly a mix of home grown and remote service components. Resulting data can be added to the origin repositories with proper access rights and the “virtual” collection specification can be stored for future reference.

For our domain this is very ambitious and challenging, but even a partial realization is worthwhile!
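The “virtual” collection in this scenario can be pictured as a stored list of references to resources that remain in their home repositories. The following is a minimal sketch under that assumption; the class names, repository names and `hdl:`-style identifiers are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceRef:
    """Reference to a resource that stays in its home repository."""
    repository: str  # hypothetical repository name
    identifier: str  # hypothetical persistent identifier


@dataclass
class VirtualCollection:
    """A named list of references; no resource data is copied."""
    name: str
    owner: str
    members: list = field(default_factory=list)

    def add(self, ref: ResourceRef) -> None:
        # Skip duplicates so the stored specification stays stable.
        if ref not in self.members:
            self.members.append(ref)

    def by_repository(self) -> dict:
        """Group identifiers per repository, e.g. to dispatch processing."""
        groups: dict = {}
        for ref in self.members:
            groups.setdefault(ref.repository, []).append(ref.identifier)
        return groups


collection = VirtualCollection(name="sample", owner="researcher@example.org")
collection.add(ResourceRef("archiveA", "hdl:0000/example-1"))
collection.add(ResourceRef("archiveB", "hdl:0000/example-2"))
collection.add(ResourceRef("archiveA", "hdl:0000/example-1"))  # duplicate, ignored
print(collection.by_repository())
```

Because only references are stored, the collection specification itself is small and can be saved for future reference, as the scenario requires.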


Page 8: CLARIN Common Language Resources and Technology Infrastructure

DAM-LR EU project (2005-2007)

Small EU project on archive integration with 4 partners from corpus/computational linguistics and endangered language documentation

• Resource discovery: sharing a single metadata set for searching & browsing

• Authentication & Authorization: single user identity, single sign-on by using Shibboleth.

• Referencing and citing “archived resources” using a single persistent identifier system.


Page 9: CLARIN Common Language Resources and Technology Infrastructure

AAI & Federation issues

• Experiences from DAM-LR with respect to AAI:

– Standard eduPerson attribute set is probably sufficient (but CCs …)

– Shibboleth is nice when using web applications, but applications need access too!

– Shibboleth is efficient when dealing with groups, e.g. staff, student, … But our domain also has to deal with individuals -> store user IDs in authorization records

– DAM-LR was a federation of both IdPs & SPs; CLARIN aims at a much larger potential user group whose home organizations do not want to run a CLARIN-specific IdP -> use the national IDFs
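The point about groups versus individuals can be sketched as an authorization record that holds both kinds of entries. This is an illustration only: the record layout is hypothetical, and the choice of `eduPersonAffiliation` for groups and `eduPersonPrincipalName` as the user ID is an assumption, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRule:
    """Authorization record for one resource (hypothetical layout)."""
    allowed_affiliations: set = field(default_factory=set)  # group-based
    allowed_user_ids: set = field(default_factory=set)      # individual-based


def is_authorized(rule: AccessRule, attributes: dict) -> bool:
    """Grant access if a group attribute or the individual user ID matches."""
    affiliations = set(attributes.get("eduPersonAffiliation", []))
    if affiliations & rule.allowed_affiliations:
        return True
    user_id = attributes.get("eduPersonPrincipalName")
    return user_id in rule.allowed_user_ids


rule = AccessRule(allowed_affiliations={"staff"},
                  allowed_user_ids={"fieldworker@example.org"})

staff_member = {"eduPersonAffiliation": ["staff"],
                "eduPersonPrincipalName": "someone@example.org"}
named_individual = {"eduPersonAffiliation": ["student"],
                    "eduPersonPrincipalName": "fieldworker@example.org"}
other_student = {"eduPersonAffiliation": ["student"],
                 "eduPersonPrincipalName": "other@example.org"}

print(is_authorized(rule, staff_member))      # matches via group
print(is_authorized(rule, named_individual))  # matches via stored user ID
print(is_authorized(rule, other_student))     # matches neither
```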


Page 10: CLARIN Common Language Resources and Technology Infrastructure

CLARIN Federation Infrastructure I

CLARIN wants to be a LR&T “service federation”

• simplified and unified rules for licensing, accessing

• agreements with national identity federations

• must make sure all necessary attributes are available

• cater also for AA

– of non-web applications

– and web services

• interaction with GRID AAI

[Diagram: national Identity Federations linked by trust agreements to eJournal Service Providers and LRT Service Providers]


Page 11: CLARIN Common Language Resources and Technology Infrastructure

Applications need Authentication too

User scenario: copying resources from different repositories to the local machine

[Diagram: user application (IMDIcopier), IdP, and archiveA and archiveB, each behind a Shib. apache server]

• The application speaks only HTTP with basic authentication. It does not understand the form-based authentication employed by the Shib. IdP

• The application is also not able to profit from the SSO over archives

Possible solution: use certificates for authentication, obtained by SLCS. But can the auth. handshake be mimicked by software?
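The mismatch can be illustrated with a small sketch of what such a non-browser client is able to do. It builds the standard HTTP Basic `Authorization` header, and it uses a simple heuristic to notice that a server answered with a browser-oriented login step instead of the resource. The heuristic and the example URL are assumptions for illustration; they do not describe how IMDIcopier or Shibboleth actually behave.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic 'Authorization' header (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def needs_browser_login(status: int, headers: dict) -> bool:
    """Heuristic (assumption): a redirect, or an HTML page where a resource
    was expected, suggests a form-based login this client cannot complete."""
    if status in (301, 302, 303, 307):
        return True
    content_type = headers.get("Content-Type", "")
    return status == 200 and content_type.startswith("text/html")


print(basic_auth_header("user", "secret"))
print(needs_browser_login(302, {"Location": "https://idp.example.org/login"}))
print(needs_browser_login(200, {"Content-Type": "application/xml"}))
```

A client limited to the first function has no way to follow the redirect and fill in the login form, which is why the slide looks at certificates instead.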

Page 12: CLARIN Common Language Resources and Technology Infrastructure

Searching through annotations

[Diagram: MPI Archive (CHAT, EAF, Shoebox), parsers, DB/SE, search service, Auth DB, IdP]

Parsers “normalize” the structural format

The scenario of searching through the content of just one archive is no problem: there is just one SP that needs to check whether the user has access to the annotations.

Page 13: CLARIN Common Language Resources and Technology Infrastructure

Federative search scenario

[Diagram: searching through annotations across MPI Archive (CHAT, EAF, Shoebox) and Archive B (CHAT), each with its own DB/SE, search service and Auth DB; a specialized web portal in front of both search services; IdP]

Parsers “normalize” the structural format

The web portal app would like to act on behalf of the user and access the search services.
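A minimal sketch of this federative search follows, assuming the portal simply forwards a user ID and each archive checks it against its own Auth DB. All names and data are hypothetical. How the search services come to trust the identity the portal asserts is exactly the open issue on this slide and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class SearchService:
    """One archive's search service with its own authorization data."""
    name: str
    index: dict    # resource ID -> annotation text
    auth_db: dict  # resource ID -> set of user IDs allowed to read it

    def search(self, user_id: str, term: str) -> list:
        # Each archive enforces its own access rules for the asserted user.
        return [
            (self.name, rid)
            for rid, text in self.index.items()
            if term in text and user_id in self.auth_db.get(rid, set())
        ]


def portal_search(services: list, user_id: str, term: str) -> list:
    """The portal forwards the user's identity to every search service."""
    hits = []
    for service in services:
        hits.extend(service.search(user_id, term))
    return hits


mpi = SearchService("MPI Archive",
                    {"r1": "the dog barks", "r2": "a dog sleeps"},
                    {"r1": {"alice"}, "r2": {"bob"}})
archive_b = SearchService("Archive B",
                          {"r9": "dog and cat"},
                          {"r9": {"alice", "bob"}})

print(portal_search([mpi, archive_b], "alice", "dog"))
```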

Page 14: CLARIN Common Language Resources and Technology Infrastructure

Licenses & Codes of Conduct 1

[Diagram: user (browser), IdP, SPa and SPb, each SP with its own CC DB]

• SP requires CC signed and takes care of this, but only for its own domain

• This can break the SSO if the user is required to sign the same CC several times

• CLARIN will harmonize the CCs and licenses to a limited number

Page 15: CLARIN Common Language Resources and Technology Infrastructure

Licenses & Codes of Conduct 2

[Diagram: user (browser), IdP, SPa, SPb, CC DB]

Store the CC DB info in the user attributes at the IdP

But how does it get there?

• Special app?

• Not every IdP will/can run this

Page 16: CLARIN Common Language Resources and Technology Infrastructure

Licenses & Codes of Conduct 3

[Diagram: user (browser), IdP, SPa, SPb, and a CC service with its CC DB]

Create a special CC service. This is part of the SPF, independent of the IDFs
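The idea of a shared CC service can be sketched as follows: every SP asks one central service whether the user has signed the required code of conduct, so a single signature is enough for all SPs. The class names, the method names and the `clarin-cc-1` identifier are hypothetical; the slides do not define an interface.

```python
class CCService:
    """Central record of which user has signed which code of conduct (CC)."""

    def __init__(self) -> None:
        self._signed: dict = {}  # user ID -> set of CC identifiers

    def sign(self, user_id: str, cc_id: str) -> None:
        self._signed.setdefault(user_id, set()).add(cc_id)

    def has_signed(self, user_id: str, cc_id: str) -> bool:
        return cc_id in self._signed.get(user_id, set())


class ServiceProvider:
    """An SP that consults the shared CC service instead of a local CC DB."""

    def __init__(self, name: str, required_cc: str, cc_service: CCService) -> None:
        self.name = name
        self.required_cc = required_cc
        self.cc_service = cc_service

    def access(self, user_id: str) -> str:
        if self.cc_service.has_signed(user_id, self.required_cc):
            return "granted"
        return "sign-required"


cc_service = CCService()
sp_a = ServiceProvider("SPa", "clarin-cc-1", cc_service)
sp_b = ServiceProvider("SPb", "clarin-cc-1", cc_service)

first_attempt = sp_a.access("alice")     # not signed yet
cc_service.sign("alice", "clarin-cc-1")  # the user signs once
print(first_attempt, sp_a.access("alice"), sp_b.access("alice"))
```

With per-SP CC databases, as on the first of these three slides, the same user would have been asked to sign again at SPb.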

Page 17: CLARIN Common Language Resources and Technology Infrastructure

The End

Thank you for your attention

More info: www.clarin.eu