Creating & Testing CLARIN Metadata Components
A CLARIN-NL project

Folkert de Vriend
Meertens Institute, Amsterdam
18/05/2010, LREC, Malta




Project partners

Daan Broeder Dieter Van Uytvanck

Folkert de Vriend

Laura van Eerten Griet Depoorter


Outline

1. What is CMDI?
2. What is the goal of our project?
3. How to go from a resource to harvestable metadata?
4. Findings of the project and future challenges


1) What is CMDI?

CLARIN MetaData Infrastructure (CMDI) is the infrastructure used for descriptive metadata in CLARIN (Common Language Resources and Technology Infrastructure).

Descriptive metadata is used to characterize data resources and tools, to facilitate discovery and management in large (virtual) infrastructures and repositories.


Advantages of CMDI

Compared to other metadata infrastructures:

- Flexibility
  - Researchers can decide what metadata fits their needs and use ready-made metadata components.
  - Researchers can also create new metadata components if they want.
- Complete infrastructure: software for metadata modeling, editing, harvesting and exploitation
- Still compatible with existing frameworks: OLAC, IMDI, TEI


Basic Component Metadata Modeling

Let's describe a sound recording, adding one component at a time:

- TechnicalMetadata: Sample frequency, Format, Size
- Language: Name, Id
- Actor: Sex, Language, Age, Name
- Location: Continent, Country, Address, …
- Project: Name, Contact, …

Together, the components Language, TechnicalMetadata, Actor, Location and Project form the metadata profile for the recording.
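The component build-up on these slides can be sketched in code. What follows is a minimal, hypothetical Python model and not the CMDI component specification: the class names `Component` and `Profile`, the method `element_paths` and the profile name `SoundRecording` are assumptions made for illustration. Only the component and element names come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A reusable metadata building block with named elements (hypothetical model)."""
    name: str
    elements: list[str] = field(default_factory=list)


@dataclass
class Profile:
    """A metadata profile: a named selection of components (hypothetical model)."""
    name: str
    components: list[Component] = field(default_factory=list)

    def element_paths(self) -> list[str]:
        """Return a 'Component.Element' path for every element in the profile."""
        return [f"{c.name}.{e}" for c in self.components for e in c.elements]


# Components and elements as named on the slides.
technical = Component("TechnicalMetadata", ["Sample frequency", "Format", "Size"])
language = Component("Language", ["Name", "Id"])
actor = Component("Actor", ["Sex", "Language", "Age", "Name"])
location = Component("Location", ["Continent", "Country", "Address"])
project = Component("Project", ["Name", "Contact"])

# "SoundRecording" is an assumed profile name, not one used by the project.
sound_recording = Profile(
    "SoundRecording",
    [language, technical, actor, location, project],
)

for path in sound_recording.element_paths():
    print(path)
```

Because a profile only references components, the same `Language` or `Actor` object could be reused in a profile for a different kind of resource, which is the reuse idea behind the component approach.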


Main principles behind CMDI

A flexible component approach that lets you design your own metadata profile.

But semantics need to be declared explicitly by making use of concepts that are stored in the ISOcat registry. This way interoperability can still be guaranteed.
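The role of concept links can be illustrated with a small sketch. It assumes a hypothetical `Element` class and uses placeholder concept identifiers that are not real ISOcat entries. The point is only that two differently named elements count as equivalent when they point at the same concept, and that elements without a link can be detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """A metadata element with an optional link to a registry concept (hypothetical)."""
    name: str
    concept: Optional[str] = None  # placeholder identifier, NOT a real ISOcat entry


def unlinked(elements: list[Element]) -> list[str]:
    """Names of elements whose semantics are not declared via a concept link."""
    return [e.name for e in elements if not e.concept]


def same_concept(a: Element, b: Element) -> bool:
    """Treat two elements as semantically equivalent if they share a concept."""
    return a.concept is not None and a.concept == b.concept


# Two profiles may name an element differently yet point at the same concept.
name_a = Element("Name", concept="placeholder:language-name")
name_b = Element("LanguageName", concept="placeholder:language-name")
age = Element("Age")  # no concept link declared yet

print(same_concept(name_a, name_b))
print(unlinked([name_a, name_b, age]))
```

A check like `unlinked` could be run before publishing a profile, so that every element has explicitly declared semantics.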


2) What is the goal of our project?

Testing the CMDI principles by applying them to existing resources at the Meertens Institute (MI) and the Institute for Dutch Lexicology (INL).


Resources at MI and INL used

- Lexical resources (with proper names, monolingual and bilingual lexica, historical and scientific dictionaries)
- Linguistic databases (with syntactical, morphological and phonological dialect variation)
- Ethnological databases (containing data about folktales, songs, probate inventories and pilgrimages)
- Corpora (spoken and written)
- Historical documents (Bible texts)


3) Workflow from resource to harvestable metadata instance

Resource -> A -> B -> C -> Harvestable metadata instance

A. Resource analysis
B. Construction of XML metadata profiles for each granularity level present in the resource
C. Add metadata to instances

A very basic tool kit is available for creating schemas and instances.


Let’s apply this workflow to one of the resources in the project

Dynamic Syntactic Atlas of the Dutch dialects (DynaSAND)

A linguistic database of speech and text that charts syntactic variation at the clausal level in 267 dialects of Dutch spoken in the Netherlands, Belgium and north-west France.


A) Resource analysis

Questions for DynaSAND:

- Data, information, metadata?
- Granularity levels?


B) Profile construction

Construction of XML metadata profiles for each granularity level present in the resource:

- Use existing components
- Introduce new components
- Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
- Introduce new ISOcat concepts (ensuring semantic interoperability)

[Slides illustrating each step: Existing components; New components; Link concepts in new components to existing ISOcat concepts; Introduce new ISOcat concepts]

Result 1: DynaSAND collection profile

Result 2: DynaSAND subcollection profile

C) Generate schemas and add metadata to instances

Schemas are generated from the XML metadata profiles constructed in step B, and metadata is then added to instances. A very basic tool kit is available for creating schemas and instances.
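To give a feel for what adding metadata to an instance amounts to, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It does not follow the actual CMDI schema and is not the project's tool kit. The function `build_instance`, the root name `ExampleCollection` and all field values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_instance(profile_name: str, components: dict[str, dict[str, str]]) -> ET.Element:
    """Build an XML tree with one child per component and one grandchild per element."""
    root = ET.Element(profile_name)
    for component_name, elements in components.items():
        component = ET.SubElement(root, component_name)
        for element_name, value in elements.items():
            ET.SubElement(component, element_name).text = value
    return root


# Illustrative values only; none of them describe a real resource.
instance = build_instance(
    "ExampleCollection",
    {
        "Project": {"Name": "Example project", "Contact": "contact@example.org"},
        "Location": {"Continent": "Europe", "Country": "Netherlands"},
    },
)

# Serialize to a string that could be written to a file and validated against a schema.
xml_text = ET.tostring(instance, encoding="unicode")
print(xml_text)
```

In a real setup the instance would be validated against the schema generated from the profile before being offered for harvesting.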

Instance for DynaSAND collection metadata

Workflow from resource to harvestable metadata instance

Resource -> A -> B -> C -> Harvestable metadata instance

A. Resource analysis
   - Data, information, metadata?
   - Granularity levels?
B. Construction of XML metadata profiles for each granularity level present in the resource
   - Use existing components
   - Introduce new components
   - Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
   - Introduce new ISOcat concepts (ensuring semantic interoperability)
C. Add metadata to instances

A very basic tool kit is available for creating schemas and instances.


4) Most important findings of the project

CMDI appeared flexible enough for the resources selected at MI and INL:

- Many existing components could be reused.
- Where this was not possible, the framework indeed made it possible to make new components.

This was the case for both IMDI and non-IMDI types of resources.

A very general issue when making existing resources available through a metadata infrastructure (not CMDI-specific) is how to deal with the "data, information, metadata" distinction and with granularity levels.
-> Advice: keep an end-user perspective (discovery and management).

A document with best practices will be made available on the CLARIN.EU website.


Future challenges for CMDI

Existing ISOcat concept definitions can be too specific or too broad ("birth year" versus "birth date", for instance).

What if too many components and concepts are created and the semantics become too diffuse to be useful?

- Will we need increasingly more standardization and "cleaning" effort from ISOcat in the future?
- Will we need more ways of encouraging reuse of existing components and concepts?
  - Should we add success indicators? "This component is already being used by 1 million satisfied customers!"
  - Should we make more explicit what the benefits of reuse are? "All of these great tools can be used on your data too when you reuse components X and Y!"
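A success indicator of the kind suggested here could be as simple as a usage count per component. The sketch below assumes a hypothetical mapping from profile names to the component names they use. Neither the function `component_usage` nor the profiles correspond to an existing registry feature.

```python
from collections import Counter


def component_usage(profiles: dict[str, list[str]]) -> Counter:
    """Count in how many profiles each component name occurs."""
    usage = Counter()
    for component_names in profiles.values():
        usage.update(set(component_names))  # count each component once per profile
    return usage


# Hypothetical profiles; the component names are taken from the modeling example.
profiles = {
    "ProfileA": ["Language", "TechnicalMetadata", "Actor"],
    "ProfileB": ["Language", "Location", "Project"],
    "ProfileC": ["Language", "Actor"],
}

usage = component_usage(profiles)
for name, count in usage.most_common(2):
    print(f"{name}: used in {count} profiles")
```

Shown next to each component in a browser, such a count would let a researcher see at a glance which components are widely reused.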


Some links

CLARIN-NL components: http://www.clarin.eu/cmd/components/clarin-nl/

ISOcat data category registry: http://www.isocat.org

Tools for creating CMDI:

- XML-toolkit: http://www.clarin.eu/toolkit
- Component registry and browser and Arbil metadata editor: http://www.clarin.eu/cmdi


Thank you
