ENABLER, BLARK, what’s next? Steven Krauwer Utrecht University / ELSNET.

Page 1:

ENABLER, BLARK, what’s next?

Steven Krauwer

Utrecht University / ELSNET

Page 2:

Overview

• ENABLER
• BLARK
• BLARK Results
• Recent developments
• CLARIN
• Some reflections
• MyBLARK
• Concluding remarks

Page 3:

ENABLER

• EU Project, FP5, under Information Society Technologies (see www.enabler-network.org)

• bringing together national language resources projects in many EU countries

• aimed at providing a cooperative framework to foster cooperation and interoperability

• with a strong industrial drive

• led by Pisa, and ended (as an EU project) in 2004

• … but still existing as a community, in close collaboration with ELSNET (www.elsnet.org)

Page 4:

BLARK (1)

• Basic Language Resource Kit

• Idea (first launched in 1998): definition of the minimal set that is needed to do any (precompetitive) R&D and education at all

• Definition should be in principle language independent (although specific languages may require specific adaptations)

Page 5:

BLARK (2)

• Definition should include both data collections (corpora, lexicons) and modules (taggers, parsers, synthesizers, annotation tools)

• It should include both qualitative aspects (e.g. standards) and quantitative aspects (e.g. size)

Page 6:

BLARK (3)

• Once the definition is available, it can be used as a common reference point that makes it possible to

– assess the resources situation of a language (how much of the BLARK is available, and what is still missing)

– make priority plans for bringing the resources situation up to date
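The assessment step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names in BLARK_COMPONENTS and the assess function are hypothetical and do not represent any agreed BLARK definition.

```python
# Hypothetical BLARK definition: these component names are illustrative
# placeholders, not an actual BLARK specification.
BLARK_COMPONENTS = {
    "written corpus",
    "spoken corpus",
    "lexicon",
    "tagger",
    "parser",
    "synthesizer",
}


def assess(available):
    """Return (fraction of the BLARK available, sorted list of missing components)."""
    present = BLARK_COMPONENTS & set(available)
    missing = sorted(BLARK_COMPONENTS - present)
    coverage = len(present) / len(BLARK_COMPONENTS)
    return coverage, missing


# Made-up inventory for an unnamed language.
coverage, missing = assess(["lexicon", "tagger", "written corpus"])
print(f"{coverage:.0%} available; missing: {', '.join(missing)}")
```

The list of missing components is what a priority plan would start from; a real assessment would also weigh the quantitative and qualitative aspects (size, standards) rather than only presence or absence.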

Page 7:

BLARK (4)

• Note that the BLARK is necessarily dynamic, as new technological developments will come with new requirements

• Note that the BLARK for a language will only work if there is a body that takes responsibility for its implementation and for the maintenance and distribution of the resources created

Page 8:

BLARK Results

• First adopted by the Dutch Language Union, resulting in a first 12 million euro implementation programme launched at the end of 2004

• Explored and developed for Arabic in the NEMLAR project (CST, ELDA, ELSNET, and others; see www.nemlar.org and the presentation at this conference O27-G on Thursday)

• BLARK concept included in a number of proposals, but without tangible results

• Suggestions for a more advanced variant (ELARK) have been put forward by ELDA and others

Page 9:

Recent developments

• CLARIN: Common Language Resources and Technology Infrastructure (see LREC 2006 workshop on May 22, or otherwise www.mpi.nl/clarin)

• NOT a project proposal, but rather a proposal for a Research Infrastructure to be included in the European Roadmap for Research Infrastructures

Page 10:

CLARIN (1)

• Creation of open European Language Resources Network with strong service centers and repositories, providing the humanities community at large (i.e. not just the language and speech technology community) with

– knowledge about which language resources and tools exist and how to use them
– access to existing language resources
– coordinated creation of new resources
– access to advanced services for access and adaptation
– bundling of expertise in specific problem areas
– training centers

Page 11:

CLARIN (2)

Three important observations:

• CLARIN has no industrial drive

• CLARIN aims at addressing all languages in the EU (and associated countries)

• One of CLARIN’s objectives is the definition and the coordinated creation of BLARKs for all languages of the EU

Page 12:

Some reflections

• Whatever progress has been made (DLU, NEMLAR, ELARK) was mostly inspired by industrial needs

• Industrial considerations do not favour smaller languages

• Progress of the BLARK since 1998 has been slow

• No new funding opportunities in FP6 to get anything done

• CLARIN may offer exciting opportunities (if successful), but this will take a lot of time

Page 13:

More reflections

• The present (embryonic) BLARK definition may be one or more steps too far for under-resourced languages

• So, why not add to the concept the BLARKette, which should represent a very basic entry level variant of the BLARK, targeting exclusively the research and (especially) education community

• Small and simple, should fit on a CDROM

Page 14:

And yet more reflections

• Nothing funded will happen before well into 2007

• Why wait until then, e.g. until CLARIN is in place and some formal process has been put into motion to define the BLARK (and the BLARKette)?

• Why not start an action to consult the language communities and to arrive at a first proposal for a BLARK and BLARKette definition?

Page 15:

MyBLARK, the proposal

• We initiate MyBLARK, aiming at collecting (for each language in the EU)

– a description of the essential components of the BLARK
– and of the BLARKette

• We try to distill from this a broadly supported proposal for the definition of both concepts

• We offer this as an input to the CLARIN project if it ever happens, or otherwise use it to launch other initiatives

Page 16:

MyBLARK, the process

• ELSNET (possibly in collaboration with COCOSDA/WRITE) will send out a simple questionnaire to all known language resources centers, asking for descriptions of BLARK and BLARKette components

• ELSNET (maybe with COCOSDA/WRITE) will set up a committee to synthesize the results in the form of recommendations

Page 17:

MyBLARK participants

• Language resources centers for languages of EU and associated countries known to us

• Language resources centers in the EU (+associated countries) that send me a message that they are willing to participate ([email protected])

Page 18:

MyBLARK Questionnaire

• Language
• Type of resource
• Usage
• Size
• Annotation required
• Brief description
• Available for your language?
• If so: pointer to it
• If not, pointer to similar resource for another language
• References
• Comments
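One way to picture a single questionnaire response is as a small record type. The sketch below is an assumption about how the fields listed above might be encoded; the class name, field names, types and the sample values are all hypothetical and not part of the actual MyBLARK questionnaire.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuestionnaireEntry:
    """Hypothetical encoding of one response; one entry per resource type."""
    language: str
    resource_type: str
    usage: str
    size: str
    annotation_required: str
    description: str
    available: bool                 # "Available for your language?"
    pointer: Optional[str] = None   # the resource itself, or a similar one elsewhere
    references: List[str] = field(default_factory=list)
    comments: str = ""


def pointer_kind(entry: QuestionnaireEntry) -> str:
    """Say what the pointer refers to, following the if so / if not rule."""
    if entry.pointer is None:
        return "no pointer given"
    if entry.available:
        return "the resource itself"
    return "a similar resource for another language"


# Placeholder values only; nothing here describes a real resource.
entry = QuestionnaireEntry(
    language="(placeholder language)",
    resource_type="lexicon",
    usage="education",
    size="(placeholder size)",
    annotation_required="part of speech",
    description="Basic monolingual lexicon",
    available=False,
    pointer="https://example.org/similar-lexicon",
)
print(pointer_kind(entry))
```

Keeping the availability flag separate from the pointer mirrors the questionnaire: the same pointer field serves both cases, and the flag decides how it should be read.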

Page 19:

MyBLARK Schedule

• June – August 2006: collection of contacts
• Sept 2006: questionnaires sent out
• October 2006: questionnaires in, 1st analysis and draft definition proposals
• November 2006: proposals sent out for feedback
• December 2006 – January 2007: collecting feedback
• February 2007: Final report

Page 20:

Concluding remarks

• I have proposed the introduction of a slightly weaker variant of the BLARK, the BLARKette, for under-resourced languages

• I have proposed an action entitled MyBLARK to arrive at an initial definition of both the BLARK and the BLARKette

• I hope that this will (a) speed up the process, and (b) provide an intermediate coverage level for under-resourced languages