ISOcat Data Category Registry
Defining widely accepted linguistic concepts
Menzo Windhouwer
CLARIN-NL MD tutorial, 24-25 September 2009

ISOcat: a reference implementation
• ISO 12620:2009
  – Terminology and other language and content resources: Specification of data categories and management of a Data Category Registry for language resources
  – ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures
• ISO Technical Committee 37
  – Terminology and other language and content resources

ISO 24613:2008 Lexical Markup Framework
[Diagram: LMF class model with Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form, linked with multiplicities 0..* and 1..*, and annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType]

Data categories
• “result of the specification of a given data field” (ISO 12620:2009)
• data element concept (ISO 11179)
  – “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
• complex data categories are data element concepts

Data category types
• complex:
  – open: writtenForm (string)
  – closed: grammaticalGender (string), with the values neuter, masculine, feminine
  – constrained: string with Constraint: .+@.+
• simple:
  – neuter, masculine, feminine (the values of a closed data category)
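
The three kinds of conceptual domain can be sketched in Python. This is a minimal illustration, not ISOcat's data model: the class `ComplexDataCategory`, its field names and the identifier `emailAddress` are hypothetical, and the constraint `.+@.+` is assumed to be a regular expression that has to match the whole value.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ComplexDataCategory:
    """Hypothetical model of a complex data category and its conceptual domain."""
    identifier: str
    domain: str                            # "open", "closed" or "constrained"
    values: FrozenSet[str] = frozenset()   # simple data categories (closed only)
    constraint: Optional[str] = None       # rule text (constrained only)

    def accepts(self, value: str) -> bool:
        if self.domain == "open":
            return True                    # any string is allowed
        if self.domain == "closed":
            return value in self.values    # must be one of the listed values
        if self.domain == "constrained":
            # Assumption: the constraint is a regex matched against the whole value.
            return re.fullmatch(self.constraint, value) is not None
        raise ValueError(f"unknown domain kind: {self.domain}")


written_form = ComplexDataCategory("writtenForm", "open")
grammatical_gender = ComplexDataCategory(
    "grammaticalGender", "closed",
    values=frozenset({"neuter", "masculine", "feminine"}),
)
email_address = ComplexDataCategory("emailAddress", "constrained", constraint=r".+@.+")

print(grammatical_gender.accepts("feminine"))   # True
print(grammatical_gender.accepts("plural"))     # False
print(email_address.accepts("no-at-sign"))      # False
```

Keeping the domain kind as data rather than as three subclasses keeps the sketch short; a real registry client would more likely mirror the DCR data model.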

Data category relationships
• Value domain membership
• Subsumption relationships between simple data categories
• Relationships between complex data categories are not stored in the DCR
[Diagram: partOfSpeech (string), pronoun, personal pronoun]
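
The two relationship types can be sketched as two small lookup tables. The tables, the identifier `personalPronoun` and the function names are hypothetical. The sketch also assumes that a value counts as belonging to a value domain when it, or a value that subsumes it, is listed there; the registry itself may list both explicitly.

```python
from typing import Dict, List, Set

# Hypothetical excerpt: value domain membership of a complex data category.
VALUE_DOMAIN: Dict[str, Set[str]] = {"partOfSpeech": {"pronoun"}}
# Hypothetical excerpt: simple data category -> the simple data category subsuming it.
SUBSUMED_BY: Dict[str, str] = {"personalPronoun": "pronoun"}


def ancestors(simple_dc: str) -> List[str]:
    """Follow the subsumption chain upwards, guarding against cycles."""
    chain: List[str] = []
    seen = {simple_dc}
    current = SUBSUMED_BY.get(simple_dc)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = SUBSUMED_BY.get(current)
    return chain


def is_valid_value(complex_dc: str, simple_dc: str) -> bool:
    """True if the value, or something that subsumes it, is in the value domain."""
    domain = VALUE_DOMAIN.get(complex_dc, set())
    return simple_dc in domain or any(a in domain for a in ancestors(simple_dc))


print(ancestors("personalPronoun"))                       # ['pronoun']
print(is_valid_value("partOfSpeech", "personalPronoun"))  # True
print(is_valid_value("partOfSpeech", "noun"))             # False
```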

Data category specification
• Administration Information Section
• Description Section
  – Data Element Name
  – Language Section
    • Name Section
• Conceptual Domain
• Linguistic Section
  – Conceptual Domain

Mandatory:
1. A mnemonic identifier
2. An English definition
3. An English name
4. A conceptual domain
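
A check for the four mandatory items could look like the sketch below. The dictionary layout and the field names are hypothetical and are not the DCR's own serialization.

```python
from typing import Dict, List

# The four mandatory items from the slide, under hypothetical field names.
MANDATORY_FIELDS = (
    "identifier",
    "english_definition",
    "english_name",
    "conceptual_domain",
)


def missing_mandatory_fields(spec: Dict[str, str]) -> List[str]:
    """Return the mandatory fields that are absent or blank in a draft."""
    return [name for name in MANDATORY_FIELDS if not spec.get(name, "").strip()]


# Illustrative draft that still lacks its English definition.
draft = {
    "identifier": "partOfSpeech",
    "english_name": "part of speech",
    "conceptual_domain": "closed",
}
print(missing_mandatory_fields(draft))  # ['english_definition']
```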

Guidelines for data categories (I)
• Identifier:
  – camel case and XML-valid element name (without a namespace)
    • valid: partOfSpeech
    • invalid: my:POS (namespace prefix), 123POS (starts with a digit)
• Data Element Name:
  – language independent name for the data category used in a specific application domain (specified in the source)
    • PoS in TBX
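
The identifier guideline can be approximated with a regular expression. This is a conservative sketch, not the rule ISOcat applies: it assumes lower camel case and ASCII letters and digits only, which is stricter than what XML allows in an element name.

```python
import re

# Assumption: lower camel case, ASCII only. The pattern excludes ':' (a namespace
# prefix) and a leading digit, both of which the guideline rules out.
IDENTIFIER_PATTERN = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(identifier: str) -> bool:
    """True if the whole identifier matches the conservative pattern."""
    return IDENTIFIER_PATTERN.fullmatch(identifier) is not None


for candidate in ("partOfSpeech", "my:POS", "123POS"):
    print(candidate, is_valid_identifier(candidate))
# partOfSpeech True
# my:POS False
# 123POS False
```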

Guidelines for data categories (II)
• Name Section in a Language Section
  – legible name
    • ‘part of speech’ in the English language section
    • ‘partie du discours’ in the French language section
• Definition:
  – intensional definitions (ISO 704)
  – should consist of a single sentence fragment
• Source:
  – add a source for any quoted material

Guidelines for data categories (III)
• Justification:
  – a simple statement justifying the relevance of the data category to the field of language resources
  – especially needed for standardization

Private versus standard
• The standard subset of data categories in the registry should be coherent
• This coherence is guarded by Thematic Domain Groups and the DCR Board
• Standard data categories need to meet more constraints than private ones:
  – mandatory justification
  – DC relations demand profile overlap
  – …

Data Category Selections
• Anyone
  1. can register with ISOcat
  2. can create data categories
  3. can create data category selections (DCSs)
  4. can share DCSs
  5. can make DCSs public
  6. may submit DCSs for standardization

Profiles versus DCSs
• Profile membership is part of the DC specification
  – the profile indicates the thematic domain of the DC
  – the profile view in the UI is created by a query
  – there are a limited number of profiles
• A DCS is a collection of DCs
  – hand-picked by a user for a specific purpose
  – can contain DCs from various profiles
  – there can be an unlimited number of DCSs
• There isn’t (yet) a profile-specific view on a DCS

ISO standardization process
[Diagram: workflow involving the Submission group, the Thematic Domain Group (Evaluation), the Data Category Registry Board (Validation), the Stewardship group, and ISO (Publication)]

Submission group
• The owner, possibly together with a group of users, who submits a DCS for standardization
• The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible)
  – justification
  – profile(s)
  – …

Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification

• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of judges (assigned by SC P-members)
• A number of expert members (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by an SC

Harmonization
• When a DC belongs to multiple profiles belonging to different TDGs, harmonization may be needed
  – one TDG becomes the owner of the DC
  – judges from the other TDG(s) are involved in the evaluation process

Stewardship group
• Members of the TDG who will maintain the data category
• The TDG becomes the owner of a standardized data category
• Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)

Using data categories (I)
• Each data category has a Persistent Identifier (PID): http://www.isocat.org/datcat/DC-1297
  – once a data category has been created it can never be deleted, only deprecated or superseded
  – the registration authority of ISO 12620 is obliged to keep these URLs working

Using data categories (II)
• This PID can be embedded in the schemata of linguistic resources:
  – CMD
    <CMD_Element name="Role" ValueScheme="string" ConceptLink="…/DC-1234">
  – Relax NG
    <rng:element name="gender" dcr:datcat="…/DC-1297">
  – XML Schema, TEI ODD, TBX, RDF, XML, …
• DC Reference vocabulary:
  – http://www.isocat.org/12620/
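
Once PIDs are embedded in a schema, a tool can collect them. The sketch below reads `dcr:datcat` attributes from a Relax NG grammar with Python's standard library. The grammar is a made-up example that spells out the full PID quoted on the previous slide, and the namespace URI bound to the `dcr` prefix is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RNG_NS = "http://relaxng.org/ns/structure/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumption: URI bound to the dcr prefix

# Made-up grammar with a single annotated element.
SCHEMA = f"""<rng:grammar xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}">
  <rng:start>
    <rng:element name="gender" dcr:datcat="http://www.isocat.org/datcat/DC-1297">
      <rng:text/>
    </rng:element>
  </rng:start>
</rng:grammar>"""


def datcat_references(schema_xml: str) -> Dict[str, str]:
    """Map each annotated element name to the data category PID it refers to."""
    root = ET.fromstring(schema_xml)
    refs: Dict[str, str] = {}
    for element in root.iter(f"{{{RNG_NS}}}element"):
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            refs[element.get("name")] = pid
    return refs


print(datcat_references(SCHEMA))
# {'gender': 'http://www.isocat.org/datcat/DC-1297'}
```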

Using data categories (III)
• The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)
  – DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information
  – DCIF vocabulary:
    • http://www.isocat.org/12620/

Usage scenarios
• DC references only:
  – find semantic overlap between two or more resources by comparing their DC references
• DC references and a schema/component registry:
  – find interesting resource (types) by comparing the DC references of schemas/components in the registry
• DC references and a network of registries:
  – find (in)directly related resources by related DCs
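
The first scenario comes down to a set intersection over the PIDs that each resource references. In the sketch below the two resources and the `example.org` PIDs are hypothetical placeholders; only DC-1297 is taken from the slides.

```python
from typing import Set


def shared_data_categories(refs_a: Set[str], refs_b: Set[str]) -> Set[str]:
    """PIDs referenced by both resources, i.e. their semantic overlap."""
    return refs_a & refs_b


# Hypothetical DC references collected from two resources.
lexicon_refs = {
    "http://www.isocat.org/datcat/DC-1297",
    "http://example.org/datcat/A",
}
corpus_refs = {
    "http://www.isocat.org/datcat/DC-1297",
    "http://example.org/datcat/B",
}

print(shared_data_categories(lexicon_refs, corpus_refs))
# {'http://www.isocat.org/datcat/DC-1297'}
```

Because the comparison uses PIDs rather than element names, two schemas that name the same concept differently still show up as overlapping.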

Relation Registry
• ISOcat contains a ‘flat’ list of concepts
• The Relation Registry will support storing (user-specific) relations between these concepts
  – is-a
  – part-of
  – equivalent-to
  – related-to
  – …

Will support:
1. Ontologies and taxonomies on top of data categories
2. Searches across related data categories
3. …
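
A search across related data categories can be sketched as a breadth-first traversal over relation triples. Since the Relation Registry is only planned at this point, the triple layout, the identifiers and the choice to follow relations in both directions are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (subject, relation type, object)

# Hypothetical user-specific relations between data categories.
RELATIONS = [
    ("personalPronoun", "is-a", "pronoun"),
    ("pronoun", "equivalent-to", "tds:Pronoun"),
    ("pronoun", "related-to", "determiner"),
]


def related_categories(start: str, relations: Iterable[Relation],
                       follow: Set[str]) -> Set[str]:
    """Categories reachable from `start` via the relation types in `follow`."""
    neighbours: Dict[str, Set[str]] = {}
    for subject, kind, obj in relations:
        if kind in follow:
            # Assumption: for search purposes a relation is followed both ways.
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    found: Set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt != start and nxt not in found:
                found.add(nxt)
                queue.append(nxt)
    return found


print(sorted(related_categories("personalPronoun", RELATIONS, {"is-a", "equivalent-to"})))
# ['pronoun', 'tds:Pronoun']
```

Passing the relation types to follow as a parameter lets a user decide, per search, whether a loose link such as related-to should widen the result.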

Registry network
[Diagram: three layers. Linguistic resources: MPI archive, TDS database, resource. Data category registries: MPI DCR, ISO DCR. Relation registries: Typological Database System RR, MPI RR.]

Status of ISOcat
• ISOcat is under active development:
  – Now:
    • You can access public data categories and selections
    • You can create your own data categories and selections
    • You can share your data categories and selections with others (everyone, or a specified group)
  – Future:
    • Some social features (forum to discuss specific data categories)
    • Cleanup of profiles by TDGs
    • Import external ‘data category’ sets, such as:
      – parts of the ISO Concept Database
      – Dublin Core
      – TEI
    • Standardization workflow
    • High availability (mirrors)
    • Relation registry

Thanks for your attention!
http://www.isocat.org/