Creating & Testing CLARIN Metadata Components
A CLARIN-NL project
Folkert de Vriend
Meertens Institute, Amsterdam
18/05/2010
LREC, Malta
Project partners
Daan Broeder
Dieter Van Uytvanck
Folkert de Vriend
Laura van Eerten
Griet Depoorter
Outline
1. What is CMDI?
2. What is the goal of our project?
3. How to go from a resource to harvestable metadata?
4. Findings of the project and future challenges
1) What is CMDI?
CLARIN MetaData Infrastructure (CMDI) is the infrastructure used for descriptive metadata in CLARIN (Common Language Resources and Technology Infrastructure)
Descriptive metadata is used to characterize data resources and tools, to facilitate discovery and management in large (virtual) infrastructures and repositories.
Advantages of CMDI
Compared to other metadata infrastructures:
- Flexibility
  - Researchers can decide what metadata fits their needs and use ready-made metadata components.
  - Researchers can also create new metadata components if they want.
- Complete infrastructure: software for metadata modeling, editing, harvesting, exploitation
- Still compatible with existing frameworks: OLAC, IMDI, TEI
Basic Component Metadata Modeling
Let's describe a sound recording
- TechnicalMetadata: Sample frequency, Format, Size, …
- Language: Name, Id, …
- Actor: Sex, Language, Age, Name, …
- Location: Continent, Country, Address, …
- Project: Name, Contact, …
Together, these components form a metadata profile.
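The component idea can be sketched in a few lines of Python. This is only an illustration of the modeling principle, not the CMDI component format itself: the class names `Component` and `Profile` and the profile name `SoundRecording` are assumptions made for this example, while the component and element names are the ones shown on the slides (the "…" entries are left out).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A reusable metadata component: a name plus the elements it groups."""
    name: str
    elements: List[str] = field(default_factory=list)


@dataclass
class Profile:
    """A metadata profile: a selection of components for one resource type."""
    name: str
    components: List[Component] = field(default_factory=list)

    def element_paths(self) -> List[str]:
        # Flatten the profile into "Component.Element" paths.
        return [f"{c.name}.{e}" for c in self.components for e in c.elements]


# Components and elements as listed on the slides.
technical = Component("TechnicalMetadata", ["SampleFrequency", "Format", "Size"])
language = Component("Language", ["Name", "Id"])
actor = Component("Actor", ["Sex", "Language", "Age", "Name"])
location = Component("Location", ["Continent", "Country", "Address"])
project = Component("Project", ["Name", "Contact"])

# "SoundRecording" is a hypothetical profile name.
sound_recording = Profile(
    "SoundRecording",
    [technical, language, actor, location, project],
)
```

Because each component is a separate object, the same `Location` or `Project` component could be placed in a different profile without redefinition, which is the reuse the slides describe.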
Main principles behind CMDI
Component approach which is flexible and lets you design your own metadata profile
But semantics need to be declared explicitly by making use of concepts that are stored in the ISOcat registry. This way interoperability can still be guaranteed.
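A minimal sketch of why explicit concept links preserve interoperability: two profiles may name their elements differently, yet elements that point to the same concept can still be matched. The concept identifiers below (`concept:age` and so on) and the element names of the second profile are hypothetical placeholders, not real ISOcat entries.

```python
# Each element declares its meaning by pointing at a concept identifier.
# All identifiers here are hypothetical placeholders, not real ISOcat entries.
PROFILE_A = {
    "Actor.Age": "concept:age",
    "Actor.Sex": "concept:sex",
    "Language.Name": "concept:language-name",
}
PROFILE_B = {
    "Speaker.YearsOld": "concept:age",
    "Speaker.Gender": "concept:sex",
    "Dialect.Label": "concept:dialect-name",
}


def shared_concepts(a, b):
    """Map each concept used by both profiles to the pair of element names."""
    by_concept_b = {concept: element for element, concept in b.items()}
    return {
        concept: (element, by_concept_b[concept])
        for element, concept in a.items()
        if concept in by_concept_b
    }


matches = shared_concepts(PROFILE_A, PROFILE_B)
```

Here `matches` pairs `Actor.Age` with `Speaker.YearsOld` and `Actor.Sex` with `Speaker.Gender`, even though the element names differ; elements without a shared concept remain unmatched.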
2) What is the goal of our project?
Testing of CMDI principles by applying them to existing resources at MI and INL
Resources used at MI and INL
- Lexical resources (with proper names, monolingual and bilingual lexica, historical and scientific dictionaries)
- Linguistic databases (with syntactical, morphological and phonological dialect variation)
- Ethnological databases (containing data about folktales, songs, probate inventories and pilgrimages)
- Corpora (spoken and written)
- Historical documents (bible texts)
3) Workflow from resource to harvestable metadata instance
Resource -> Harvestable metadata instance
A. Resource analysis
B. Construction of XML metadata profiles for each granularity level present in the resource
C. Add metadata to instances
A very basic tool kit is available for creating schemas and instances.
Let’s apply this workflow to one of the resources in the project
Dynamic Syntactic Atlas of the Dutch dialects (DynaSand)
A linguistic database of speech and text to chart the syntactic variation at the clausal level in 267 dialects of Dutch spoken in the Netherlands, Belgium and North-West France.
A) Resource analysis
Questions to answer for DynaSAND:
- Data, information, metadata?
- Granularity levels?
B) Profile construction
Construction of XML metadata profiles for each granularity level present in the resource:
- Use existing components
- Introduce new components
- Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
- Introduce new ISOcat concepts (ensuring semantic interoperability)
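The first two profile-construction steps (reuse what exists, introduce new components for the rest) can be sketched as a simple set computation. This is an assumption-laden illustration, not the project's actual procedure: the function name `plan_profile` and the field `DialectFeature` are hypothetical, and the two existing components are the `Location` and `Project` components from the sound-recording example.

```python
def plan_profile(needed_fields, existing_components):
    """Split needed fields into those covered by existing components and the rest.

    needed_fields: set of field names found during resource analysis.
    existing_components: dict mapping component name -> set of field names.
    """
    reused = {}
    remaining = set(needed_fields)
    for name, fields in existing_components.items():
        covered = remaining & fields
        if covered:
            reused[name] = covered
            remaining -= covered
    # Whatever is left cannot be described with existing components
    # and would need a new component.
    return reused, remaining


existing = {
    "Location": {"Continent", "Country", "Address"},
    "Project": {"Name", "Contact"},
}
# "DialectFeature" is a hypothetical field not covered by any existing component.
needed = {"Country", "Contact", "DialectFeature"}

reused, to_create = plan_profile(needed, existing)
```

In this example `Country` and `Contact` are covered by reuse, while `DialectFeature` ends up in `to_create` and would call for a new component whose concepts are then linked to ISOcat.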
Result 1: DynaSand collection profile
Result 2: DynaSand subcollection profile
C) Generate schemas and add metadata to instances
A very basic tool kit is used for creating schemas and instances.
Instance for DynaSand collection metadata
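To give an idea of what adding metadata to an instance amounts to, the following sketch fills a nested XML structure from component values using Python's standard `xml.etree.ElementTree`. It does not reproduce the CLARIN tool kit or the real CMDI instance format; the function `build_instance`, the root name `CollectionProfile`, and the values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_instance(profile_name, values):
    """Build an XML instance from {component: {element: value}}."""
    root = ET.Element(profile_name)
    for component, elements in values.items():
        comp_node = ET.SubElement(root, component)
        for element, value in elements.items():
            ET.SubElement(comp_node, element).text = value
    return root


# Illustrative values only; profile and element names are assumptions.
instance = build_instance(
    "CollectionProfile",
    {
        "Project": {"Name": "DynaSAND"},
        "Location": {"Country": "Netherlands"},
    },
)
xml_text = ET.tostring(instance, encoding="unicode")
```

The resulting `xml_text` is a plain XML string with one element per component and one child per metadata element; a real instance would additionally have to validate against the schema generated from the profile.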
Workflow from resource to harvestable metadata instance
Resource -> Harvestable metadata instance
A. Resource analysis
- Data, information, metadata?
- Granularity levels?
B. Construction of XML metadata profiles for each granularity level present in the resource
- Use existing components
- Introduce new components
- Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
- Introduce new ISOcat concepts (ensuring semantic interoperability)
C. Add metadata to instances
- Very basic tool kit for creating schemas and instances
4) Most important findings of the project
CMDI appeared flexible enough for the resources selected at MI and INL:
- Many existing components could be reused.
- Where this was not possible, the framework indeed made it possible to make new components. This was the case for both IMDI and non-IMDI types of resources.
A very general issue when making existing resources available through a metadata infrastructure (not CMDI-specific) is how to deal with the "data, information, metadata" distinction and with granularity levels. Advice: keep an end-user perspective (discovery and management).
A document with best practices will be made available on the CLARIN.EU website.
Future challenges for CMDI
Existing ISOcat concept definitions can be too specific or too broad ("birth year" versus "birth date", for instance).
What if too many components and concepts are created and the semantics become too diffuse to be useful?
- Will we need increasingly more standardization and "cleaning" effort from ISOcat in the future?
- Will we need more ways of encouraging reuse of existing components and concepts?
  - Should we add success indicators? "This component is already being used by 1 million satisfied customers!"
  - Should we make more explicit what the benefits of reuse are? "All of these great tools can be used on your data too when you reuse components X and Y!"
Some links
CLARIN-NL components: http://www.clarin.eu/cmd/components/clarin-nl/
ISOcat data category registry: http://www.isocat.org
Tools for creating CMDI:
- XML-toolkit: http://www.clarin.eu/toolkit
- Component registry and browser and Arbil metadata editor: http://www.clarin.eu/cmdi
Thank you