Integrating Ontologies and Thesauri to Build RDF Schemascedric.cnam.fr/fichiers/RC15.pdfIntegrating...

21
Integrating Ontologies and Thesauri to Build RDF Schemas B. Amann Cedric CNAM, 292 Rue St. Martin 75141 Paris Cedex 03 France I. Fundulaki INRIA-Rocquencourt 78153 Le Chesnay Cedex Abstract In this paper we present a new approach for building RDF schemas by inte- grating existing ontologies and structured vocabularies (thesauri). We will present a simple mechanism based on the specification of inclusion relationships between thesaurus terms and ontology concepts and show how these relationships can be exploited to create application-specific RDF schemas incorporating the structural views of ontologies and deep classification schemes provided by thesauri. 1 Introduction With the emergence of the World Wide Web, Internet and Intranet technologies, a large number of information sources from a variety of different application domains have become available on line. In such open and evolving environments, discovering, ac- cessing and integrating information are difficult and complex tasks due to the existence of semantic heterogeneities [35], resulting from the different terminologies and con- ceptualizations employed by the various information providers and consumers. A partial solution to the semantic heterogeneity problem is the exchange of domain- specific metadata [22, 41, 35] between interconnected systems, describing the seman- tics of the underlying information. More specifically, these semantics are expressed by metadata schemas, defined by specific resource description communities. A metadata schema is comprised of (1) a vocabulary, i.e. a set of element names to be used for the description of information in a domain (e.g. the creator, title elements of the Dublin Core [12] metadata element set), and (2) a set of semantic relationships to structure this information. One of the several roles of metadata schemas in open and evolving environments such as the Web, is to support a sharable, structural view of information with rich semantics to be communicated between users and applications. Email: [email protected] Email:[email protected] 1

Transcript of Integrating Ontologies and Thesauri to Build RDF Schemascedric.cnam.fr/fichiers/RC15.pdfIntegrating...

  • IntegratingOntologiesandThesaurito BuildRDF Schemas

    B. Amann�

    CedricCNAM,292RueSt. Martin

    75141ParisCedex 03France

    I. Fundulaki�

    INRIA-Rocquencourt78153Le ChesnayCedex

    Abstract

    In this paperwe presenta new approachfor building RDF schemasby inte-gratingexistingontologiesandstructuredvocabularies(thesauri).We will presenta simplemechanismbasedon thespecificationof inclusionrelationshipsbetweenthesaurustermsandontologyconceptsandshow how theserelationshipscanbeexploited to createapplication-specificRDF schemasincorporatingthestructuralviews of ontologiesanddeepclassificationschemesprovidedby thesauri.

    1 Intr oduction

    With theemergenceof theWorld WideWeb,InternetandIntranettechnologies,a largenumberof informationsourcesfrom a variety of differentapplicationdomainshavebecomeavailableon line. In suchopenandevolving environments,discovering,ac-cessingandintegratinginformationaredifficult andcomplex tasksdueto theexistenceof semanticheterogeneities[35], resultingfrom the different terminologiesandcon-ceptualizationsemployedby thevariousinformationprovidersandconsumers.

    A partialsolutionto thesemanticheterogeneityproblemis theexchangeof domain-specificmetadata[22, 41, 35] betweeninterconnectedsystems,describingtheseman-ticsof theunderlyinginformation.Morespecifically, thesesemanticsareexpressedbymetadataschemas, definedby specificresourcedescriptioncommunities.A metadataschemais comprisedof (1) a vocabulary, i.e. asetof elementnamesto beusedfor thedescriptionof informationin a domain(e.g. the creator, title elementsof the DublinCore [12] metadataelementset),and(2) a setof semanticrelationshipsto structurethis information. Oneof the several rolesof metadataschemasin openandevolvingenvironmentssuchastheWeb,is to supporta sharable,structuralview of informationwith rich semanticsto becommunicatedbetweenusersandapplications.�Email: [email protected]�Email:[email protected]

    1

  • Metadataspecificationlanguages,suchas the Resource DescriptionFramework(RDF)[34, 6], supportstandardmechanismsfor therepresentationof metadataschemasaswell assourcespecificmetadata(sourcedescriptions).RDFis anongoingstandard-izationeffort of theWorld-Wide WebConsortium(W3C) for thecreationof metadatadescribingWebresources.Althoughit enablesthedescriptionandexchangeof meta-dataschemas,it doesnot provide a mechanismto facilitatetheir construction,whichis a difficult andtimeconsumingtaskespeciallyin environmentsthatcomprisea largenumberof informationsources.Moreover, it offersno mechanismto decidewhethera particularmetadataschemameetsthe needsof an applicationor domain. For thatsake,we needto considersemanticcomponentsandstructuralviews thatdescribetheorganizationof theunderlyinginformation.

    In thispaper, wepresentamodularapproachfor thecreationof RDFschemasbasedon the integrationof existing ontologiesandthesaurushierarchiesdefinedaccordingto theISO2788[20] standardfor monolingualthesauri.Ontologiesandthesauricanbeconsideredasorthogonalwaysfor describinginformation. Theformerprovide struc-tural,sharableviewsof information,with usuallyshallow semantics,capturedin meta-dataschemas.They aredeclarativespecificationsof theconceptsandrolesin adomainof discourse.Thesauriarestructuredvocabularies,with rich semanticsbut little or nostructure.For example,althoughtheArt & Architecture Thesaurus, oneof the largestthesauriin thefield of westernart terminology, includesextendedtaxonomiesof cul-turalartifactsandstyles,thereis noexplicit relationshipdenotingthefactthatartifactshave a style. In the context of our approach,ontologiesareperceived to have a dualrole: provideagenericview of informationanda structuralinterfaceover thesauri.

    We follow a three-stepapproachto the constructionof RDF schemas.In a firststep,we specifyfor eachthesaurusterm,a setof ontologyconcepts,theformerbeingconsideredassub-conceptsof thelatter. Theresultof thisstepis aconnectionrelationbetweentermsandconceptswith inclusionsemantics.In a second,intermediatestep,we extract automaticallyfor eachconcepta conceptthesaurus. This thesauruscon-tainsonly the termsconnectedto this conceptby the connectionrelation,alongwithbroader-genericrelationshipsderived from the initial thesaurus.In the final stepweintegratethesethesauriwith theontologyto produceanRDFschemaconsistingof (1)a structural view providedby theontology, (2) connectionrelationsbetweenconceptsandterms,and(3) thesaurushierarchies. With this intermediatestepit is possibletoconstructtheresultingschemaincrementallyby extractingondemandconceptthesaurithatcorrespondto differentontologyconcepts.

    Ourcontributionis two-fold. First,by usingexistingcomponents,weminimizethetime andeffort to specifyappropriatenotionsthat describethe contentandstructureof a domainin theform of anRDF schema.Second,theresultingRDF schemais notboundto a specificimplementationandcanbeusedby any applicationwhich is basedon theRDFstandard.

    To illustrateourapproach,we takeexamplesfrom theculturalapplicationdomain.ThesaurusexamplesaretakenfromtheArt & ArchitectureThesaurus(AAT). TheArt &ArchitectureThesaurusisoneof theGettyInformationInstitute’s(http://www.gii.getty.edu/)ongoingprojectsandknown asoneof the largestthesauriin the areaof westernarthistoricalterminology. Ontologyexamplesareinspiredfrom the ICOM/CIDOCRef-erenceModel. The ICOM/CIDOC ReferenceModel [19] is the result of oneof the

    2

  • mostsignificantefforts for a formal representationof thebasicnotionsof theculturalapplicationdomain.

    This paperis organizedasfollows. Relatedwork is presentedin Section2. Sec-tion 3 givesa shortpresentationof RDF. In Sections4 and6, we describethe notionof ontologyandthesaurusrespectively. In Section8, we presentour approachto theautomaticconstructionof RDF schemasby integratingontologiesandthesaurushier-archies.Conclusionsandfuturework aregivenin Section11.

    2 RelatedWork

    Over the pastyearsa greatamountof effort hasbeeninvestedin the developmentofmetadatavocabulariesfor the exchangeof information acrossdifferent applicationsand domains[12, 25, 40, 10]. Dublin Core [12] contributesto semanticinteroper-ability by promotinga commonsetof elementswhich can be usedto describein aconsistentmannerinformationconcerningthecontentsof electronicdocuments,suchastheir title, creator, or subject. USMARC [40] definesa setof descriptive elementsfor the representationandexchangeof bibliographicdata. In theculturaldomain,theAquarelleProject[31] usestheSGML CI DTD (DataTypeDefinition) of theFrenchMinistry of Culture[10] to describea setof elementnames,dedicatedto territory in-ventorymaking.All theabovemetadataelementsetsaretheresultof thecollaborationof anumberof usercommunitiesandotherauthoritiesin thecorrespondingfields.Ourapproachcanbe consideredasa methodologyto provide suchmetadataelementsetsby usingexisting semanticcomponentsof the domainof interest,namelyontologiesandthesaurushierarchies.

    Besidesspecificmetadataelementsets,ontologieshave beendevelopedandusedin severalprojectsto structureandaccessWebknowledge.TheOntoSeek[18] systemis usedfor gatheringandorganizingWeb sourcedescriptions. It exploits the SEN-SUSOntology[24] whichis basedontheWordNet[32] linguisticontologyto describesourcecontents.TheWebKB setof tools[29] buildson a terminologicalontologyandconceptualgraphsto represent(andindex) documents.Our approachcanbe consid-eredcomplementaryto the above systems,in the sensethat not only do we providea methodologyto defineontologiesenrichedwith thesaurushierarchies,but also thechoiceof RDF as the representationlanguageenablestheir exchangein a machinereadableformat.

    BesidesstructuringandrepresentingWeb data,metadataschemas(referredto asdomainmodels) arealsousedin mediationbasedsystemssuchasInformationMan-ifold [4], SIMS [9] andCarnot[11]. They provide a uniform view of informationina domainof discourseandareusedto describethe contentsof different informationsources.For example,Carnotis an information integrationsystemthat relieson theCYC [28] knowledgebasefor describingsourcecontents.TheCYC knowledgebaseis a formalizedrepresentationof a “vast quantityof fundamentalhumanknowledge”andcontainsabout ����� generalconceptsand ���� assertionson theseconcepts.We donot aim at providing a mediationsystembut rathera methodologyto definemediatordomainmodels.The interestingissueis that in somecases,underlyingsourcesmightusethesaurushierarchiesthatcouldbeintegratedin themediatorto produceexpressive

    3

  • domainmodels.Ourapproachcanbeconsideredasafirst steptowardsthis integration,which is a requirementfor thenext generationinformationsystems[35, 33].

    Integratingontologiesandthesauricanbeconsideredasaschemaintegrationprob-lem [5]. An importantissuein this field concernsthecoherentintegrationof databaseschemaswith overlappingconcepts,rolesanddata.Ourapproachis moresimplesinceontologiesandthesauricanbeconsideredasorthogonalwaysof describinginforma-tion. First,ontologiescapturemoregeneralsemanticsthanthesauri,andconsequentlywe considerthesaurustermsto bespecializationsof ontologyconcepts.Second,the-sauri incorporateonly a fixed setof semanticrelationshipsdefinedindependentlyofany applicationor domain.Finally, theconsistency of theresultingmetaschemais notbasedon actualdata,but on the meaningof ontologiesandthesauriasperceived byexpertsin thedomain.

    3 ResourceDescription Framework

    The Resource Description Framework (RDF) is a foundationfor processingmeta-data [34, 6] which supportsstandardmechanismsfor the representationof metadataschemasaswell assourcedescriptions.It relieson a simple,graph-baseddatamodelandusesXML (eXtensibleMarkupLanguage)[42], to communicateandprocessmeta-datain amachinereadableandhumanunderstandableformat.Similar to theseparationof schemaandinstancein traditionaldatabases,we candistinguishbetweenRDF de-scriptionsandRDFschemas,theformerconsideredasinstancesof thelatter.

    3.1 RDF descriptions

    RDF can be usedto describeany kind of resource [26] that is identified by a URI(Uniform ResourceIdentifier),suchasaWebserver, anXML documentor anelementof an HTML page(e.g. an image). RDF supportsthe definition of resourceproper-tieswhosevaluescanbeotherresourcesor literals(strings,integers).A collectionofproperty/valuepairsthatrefersto a specificresourceis calledanRDF descriptionandcanberepresentedasa labeleddirectedgraphwherenodescorrespondto resourcesorliterals(values)andedgesto resourceproperties.

    Figure 1 shows an RDF descriptionfor a Web pagethat describesa paintingoftheFrenchpainterClaudeMonet. RDF usestheXML namespacemechanismto dis-tinguishamongdifferentRDF schemas(Section3.2) usedin RDF descriptions.Forexample, lines 2 and3 definetwo XML namespaceswhere the first (web-page)containsgeneralpropertiesof HTML pages(title, presents, creator) andthesecond(artifact) specifiespropertiesof culturalartifacts(title, style,type, period). Thismechanismis very importantsinceit permitsthereuseof ex-isting,distinctRDF schemaswithin thesameRDF description,without creatingnam-ing conflicts(e.g.web-page:title, artifact:title). Line 4 tells usthatthedescriptionthat follows concernstheHTML pagewhich canbeaccessedby theURLhttp://metalab.unc.edu/louvre/paint/monet/first/impression/.The title of this pageis “Web Museum: Monet, Claude: Impression: soleil lev-ant ” (line 5) andhasbeencreatedby NicolasPioch(line 14). To describeproper-

    4

  • 1. 4. 5. Web Museum: Monet, Claude: Impres-sion: soleil levant

    6. 7. 8. oil painting9. Impression : soleil lev-ant10. impressionism11. first-impressionism12. 13. 14. Nicolas Pioch15. 16.

    Figure 1: An RDF description for resourcehttp://metalab.unc.edu/-louvre/paint/monet/first/impression.

    ties of the painting, it is necessaryto definea local resourcewhich is identified byURI soleil levant that refersto the painting. The painting’s propertiesare itstype (oil painting, line 8), title (Impression : soleil levant, line9), style (impressionism, line 10) andperiod(first-impressionism, line11).

    3.2 RDF Schemas

    TheRDFSchemaSpecificationLanguage[7] is adeclarativelanguageusedfor thedef-inition of RDFschemas1 incorporatingaspectsfrom knowledgerepresentationmodels(e.g. semanticnets),databaseschemadefinition languagesandgraphmodels. It is asimplelanguageof restrictedexpressive power comparedto predicatecalculusbasedspecificationlanguagessuchasCycL [28] andKIF [23].

    An RDF schemadefinesclassesandpropertieswhich canbe instantiatedin RDFdescriptions.More specifically, anRDF schemais comprisedof (1) a vocabulary, i.e.asetof classandpropertynamesto describeinformationin a domain(asfor example,thecreator, title elementsof theDublin Coremetadataelementset),and(2) asetof se-

    1In thefollowing, RDF Schemawill denotethespecificationlanguageusedto defineRDF schemas.

    5

  • 1. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.

    Figure2: An RDFschemafor describingculturalresources.

    manticrelationshipsto structurethis information.Classesareorganizedin hierarchiesusingthepropertyrdfs:subclassOfwhichis definedin RDFSchema(namespacerdfs) andhasthe standardsemanticsof inheritancerelationshipin object-orienteddatamodels.For example,the RDF schemaillustratedin Figure2 definesclassManMade Object (line 4) andits subclassIconographic Object (lines5,6). ItalsodefinesclassesStyle (line7), Period (line8). RDFSchemaallowsbothtypedanduntypedproperties.Propertiesin our examplearetyped(i.e. they havearestricteddomainandrange).In Figure2, propertyperiod (line15) is definedbetweenclassesMan Made Object (line 16) andPeriod (line 17), usingtheRDF Schemaprop-ertiesrdfs:domain andrdfs:range respectively.

    Summarizing,RDFoffersarich, comparativelysimplegraph-baseddatamodelandsupportsthe definition of sourcespecificmetadata(RDF descriptions)andmetadataschemata(RDF schemas).It usesXML for the syntacticalrepresentation,exchange,andprocessingof thesemetadata.

    4 Ontologies

    Thetermontologyhasbeenusedin severaldisciplines,from philosophy, to knowledgeengineering,whereanontologyis consideredasacomputationalentity, containingcon-ceptsandtheir properties,relationshipsbetweenconceptsandconstraints.Ontologies

    6

  • aredefinedindependentlyof the actualdata[16], reflecta commonunderstandingofthesemanticsof thedomainof discourseandareusedto shareandexchangesemanticinformationbetweensources[15, 33]. They aredeclarative specificationsof thebasicconceptsandrolesin anapplicationdomain.We only considerontologieswith inheri-tancerelations������� andtypedrolesbetweenconcepts,sufficientto modelalargeclassof ontologies[17] thatcanbeeasilyrepresentedasRDFschemas(Section3.2).

    Definition 5 Anontology is a triple ������������������� definedasfollows:1. � � !�"�#$�%"'&(�*)�)*)'�+"',.- is a setof concepts,where each concept"'/ refers to a set

    of realworld objects(conceptinstances),

    2. �0�0!�1 # �21 & �*)�)*)*�%1�34- is a setof binary typedrolesbetweenconcepts,3. ���� is a setof inheritancerelationshipsdefinedbetweenconcepts.Inheritance

    relationshipscarry subsetsemanticsanddefinea partial orderover concepts.

    Ontologiescanberepresentedasdirectedgraphswherenodescorrespondto con-ceptsandarcscorrespondto rolesand ���� relationships.Figure3 illustratesanexam-ple ontology, inspiredfrom the ICOM/CIDOC ReferenceModel [19] which is usedto describecultural information. ConceptPhysical Object collectsall physicalobjects,thelattercomposedof otherphysicalobjects.Activities (conceptActivity)areassociatedwith physicalobjects,theformerperformedbypersons,institutionsandorganizations(conceptActor). ConceptsBiological Object andMan-MadeObject are sub-conceptsof Physical Object and inherit all roles definedintheir superclass.Instancesof Man-Made Object have a title (role has-title) andhave beencreatedin a specificperiod(role of-period). Iconographic Object isa sub-conceptof Man-Made Object. Iconographicobjectshave a style(role style)which is aninstanceof conceptStyle.

    Iconographic Object

    Biological Object

    Physical Object

    Man-Made Object

    Period

    Title

    Style

    Actor

    performed by

    is-composed-of

    style

    isA

    isAisA

    Activityassociated with

    of-period

    has-title

    Figure3: A simpleculturalontology.

    7

  • 6 Thesauri : Structured Vocabularies

    A vocabulary is a collectionof termsthatdescribeinformationin a domainof interest.Examplesof suchvocabulariesaretheACM ComputingClassificationSystem[2], theLibrary of CongressSubjectHeadings[27], theUnifiedMedicalLanguageSystem[39]for medicineand the Art & Architecture Thesaurus[38, 1] for the cultural domain.Thesauriarestructuredvocabulariesof thousandsof termswhich have andarebeingusedasefficientmeansfor consistentindexingandretrieval of information[14].

    Thesaurustermsareconsideredasthe “r epresentationof conceptsin the form ofa nounor a nounphrase” [20]. Conceptsareperceived by thesaurusdevelopersasreferringcollectively to a setof objects(conceptinstances) [30] thatareconsideredassuchnot with respectto a formal classificationprocessbut througha commonagree-ment. Under this perspective, the interpretationof a thesaurusterm is a set of ob-jects,which we will call theextensionof the term. Thesauriaresaidto bestructuredsincethey includea fixedsetof semanticterm relationships.Due to thesettheoreticdefinition of terms,thesesemanticrelationshipsare interpretedasrelationsbetweensets[13, 21, 36].

    TheISO 2788Standard[20] for thedocumentationandestablishmentof monolin-gual thesauridefinesthe following four kindsof termrelationshipswhich distinguishstructuredthesaurifrom arbitrarycollectionsof terms:

    1. generalization(broadertermgeneric- btg),

    2. instance(broaderterm- bt),

    3. partitiveor part-of (broadertermpartitive - btp),

    4. associative(relatedterm- rt) and

    5. equivalence(usedfor term- uf).

    Term relationships57698 and 5:6 visual works? in Figure4). We only assumemono-hierarchicalthesauri,i.e. eachtermhasexactly onebroaderterm.

    In the following, we will considera thesaurusasa setof hierarchies,organizedusingthehierarchical57698 -relationship.Althoughthedefinitionwegiveis notcompletew.r.t. all possibletermrelationsexisting in realthesauriit is sufficient for creatingrichmetadataschemata.

    2The interestedreadercanrefer to ISO 2788[20] for a deeperpresentationof the remainingtermrela-tionships.

    8

  • < visual works >

    < visual works by medium or technique >

    paintings sculpture drawings

    < paintings by material or technique >< paintings by form >

    oil paintings miniatures broader term generic relationship

    Figure4: Part of theArt & ArchitectureThesaurushierarchyVisual Workswhich col-lectsall artifactsthatareusedfor visualcommunication(paintings,sculptures,photos).

    < international post-1945 styles and movements >

    first-impressionismimpressionism post-impressionism

    renaissance

    high renaissancelate renaissance

    early renaissance

    european

    < post-1945 fine arts stylesand movements >

    abstract impressionist

    Art Deco

    broader term generic relationship

    Figure5: Part of theArt & ArchitectureThesaurushierarchyStyles& Periodswhichcollectsall styles,periodsandmovementsof Art in thewesternworld.

    9

  • Definition 7 A thesaurusis a couple@A�B�CD�757698�� such that1. CE�B!�6%#$�26�&F��)*)�)�6�,G- is a setof terms,2. 5:698 is a binary relationshipbetweentermssuch that for each pair of terms(t,t’)

    there existsat mostone 57698 -pathbetweent,t’ (mono-hierarchical thesaurus).

    8 CreatingRDF Schemasfr omOntologiesandThesauri

    In this section,we presenta methodologyfor theconstructionof RDF schemasbasedontheintegrationof ontologiesandthesaurushierarchies.Theconstructionof anRDFschemais donein threesteps.In a first step,we specifyfor eachontologyconcept,asetof terms,thelatterconsideredassub-conceptsof theformer. This stepis similar toestablishinginter-schemaassertions[8, 11] for databaseschemaintegrationandcan-notbeacompletelyautomatedprocesssinceit requirestheknowledgeof thethesaurusand ontology semantics.Nevertheless,we must note that this knowledgecould bepartially derivedfrom sourcedatawhenthesauriareusedto index conceptinstances.In a second,intermediatestep,we extract automaticallyfor eachconcepta conceptthesaurus. This thesauruscontainsonly the termsconnectedto this conceptby theconnectionrelation,alongwith broader-genericrelationshipsderivedfrom the initialthesaurus.Thisprocesscanbedoneautomaticallyanddoesnot requiretheknowledgeof theontology. In thefinal stepwe integratethesethesauriwith theontologyto pro-duceanRDF schemaconsistingof (1) a structural view providedby theontology, (2)connectionrelationsbetweenconceptsandterms,and(3) thesaurushierarchies.

    Observethatby theintermediatestepit is possibleto constructtheresultingschemaincrementallyby extractingon demandconceptthesaurithat correspondto differentontologyconcepts.Moreover, anotherbenefitof theseparationof the integrationintoseveral stepsis that we areable to monitor the result at any level of the integrationprocess.Last,it is importantto mentionthatourmethodologyis notrelatedto aspecificimplementationplatform.

    8.1 Step1 : Specializationof Conceptswith Terms

    In thefirst stepof theintegrationprocess,thesaurustermsare“connected”to ontologyconcepts.Theseconnectionshave inclusionsemanticsandarerepresentedby a binaryconnectionrelation �IHKJMLONBPQ� overa setof thesaurusterms N anda setof ontol-ogy concepts� . An exampleof a connectionrelationis presentedin Figure6. Termsimpressionism, post-impressionismandabstract impressionismof theArt & Architec-tureThesaurushierarchyStyles& Periods(Figure5) describespecificstyles(ontologyconceptStyle in Figure3). Termfirst-impressionismof thesamehierarchydescribesbothastyleandaperiod(conceptsStyle andPeriod respectively). Similarly, termrenaissanceand its narrower termsof Styles& Periods hierarchydescribedifferenttypesof stylesandperiods(ontologyconceptsStyle andPeriod respectively). Fi-nally, termspaintings, oil paintingsandsculpture of theAAT hierarchyVisual Works(Figure4) definedifferentkinds of iconographicobjects(ontologyconceptIcono-graphic Object).

    10

  • Term Concept Term Conceptimpressionism Style paintings Iconographic Objectpost-impressionism Style oil paintings Iconographic Objectabstract impressionism Style sculpture Iconographic Objectrenaissance Style early renaissance Stylerenaissance Period early renaissance Periodlate renaissance Style high renaissance Stylelate renaissance Period high renaissance Periodfirst-impressionism Period first-impressionism Style

    Figure 6: A connectionrelation �IHKJ for AAT hierarchiesStyles& Periods, VisualWorksandontologyconceptsStyle, Period andIconographic Object.

    In thefollowing wewill saythat,if term 6 is connectedto concept" , it is labeledbyc. Theway theuseractuallylabelsterms(choosesconceptsto beconnectedto a giventerm) will not be discussedin this paperbecauseof lack of space.Briefly speaking,eitheronelabels 6 with someconcept" with theassumptionthatall descendantsof 6areconnectedto (labeledwith) " or onechoosesexplicitly amongthedescendantsof 6 .

    In the previousexample,we do not connectthe whole thesaurushierarchyStyles& Periods to conceptsStyle andPeriod. We adoptthis selectiveapproach, i.e.relatingthesaurustermsto ontologyconceptsexplicitly, for severalreasons.An obvi-ousreasonis that sometermscouldbeout of the scopeof theapplicationthathastobe describedby the resultingRDF schema.For example,if someapplicationis onlyconcernedwith paintings,then termsreferring to artifactsother than paintings(e.g.sculpture, drawings) neednot beconsideredin the resultingschema.Anotherreasonis that someterms(e.g. guide terms in [20, 38]) areusedto organizethesaurushi-erarchies(e.g. > visual worksby mediumor technique? ) andmight have no usefordescribinginformation.Finally, anotherimportantreasonis that thesaurushierarchiesmightcontaintermswhich canbeconnectedto differentconcepts.For example,termsof theAAT hierarchyStyles& Periods(Figure5) describestyles(e.g. impressionism),periods(e.g.art deco), or bothstylesandperiods(e.g.renaissance). Connectingtermsto conceptsin a selective mannerallows usersto clarify betweenthemultiple seman-tics of a term (e.g. as in the caseof homonyms) andconsequentlyresolve semanticambiguitiesat thethesauruslevel.

    8.2 Step2 : ThesaurusExtraction

    After having definedtheconnectionrelationsbetweentermsandconcepts,we extractfor eachconceptin theconnectionrelation,a thesaurus,calledconceptthesaurus. Thisis donein two steps.First,eachtermin thesaurus@ is labeledby theconceptsto whichit is connectedin the relation �IHKJ . Observe that a term canbe connectedto severalconcepts,i.e. term labelsaresetsof conceptnames.For example,in the connectionrelationillustratedin Figure6, term first-impressionismis connectedto bothStyleandPeriod concepts.In this case,the labelof termfirst-impressionismis thesetof

    11

  • concepts! Style, Period - .Second,we definea selectionoperationR thatconstructsfrom a labeledthesaurus

    @TS anda setof conceptnamesU a new labeledthesaurusthat contains(1) the setoftermsin @TS whoselabelscontainat leastoneconceptin U and(2) 57698 relationsbetweentheseterms,inducedby the 5:698 relationsin theinitial thesaurus.More precisely:Definition 9 Let @ S �B�CD�+5:698�� bea labeledthesauruswhere each termis labeledbya (possiblyempty)setof concepts: VXWYC[Z]\�^ . Let U bea setof conceptnames.TheselectionR_�U`�2@ S � createsa new thesaurusasfollows:

    1. keepall terms 6 of @ S which are labeledby at least one conceptnamein U( UbacVY�6%��d�Oe ),

    2. create 57698 relationsbetweenall terms 6 and 6�f in R_9U`�%@ S � which are relatedbya 57698 pathin @ S that containsno termin R_�U`�2@ S � .

    A naive algorithmcalculating R_9U`�%@TS�� is shown in AppendixA. Since R_�U`�2@TS��can be evaluatedwithout the knowledgeof the underlyingontology, this algorithmcanbe executedon the thesaurussite without accessinginformationfrom the ontol-ogy. Thispropertyis usefulin adistributedenvironmentwherethesauriandontologiesmightbestoredon differentsites.

    Using this selectionoperationit is possibleto definea labeledthesaurus@hgS foreachconcept" in thesetof ontologyconcepts� asfollows :Definition 10 Let @TSO���CD�+5:698�� be a labeledthesaurus. Let " be a conceptandU g be the setof sub-conceptsof " in � including " . Then,we can definea labeledconceptthesaurus@ gS �iR_9U g �%@ S � which containsall termsconnected(in �IHKJ ) to "or a sub-conceptof " .

    For thedefinitionof a conceptthesauruswe exploit not only the 57698 relationsbe-tweenthe termsbut alsothe ���� relationshipsat the ontologylevel. Considertheex-amplein Figure 7. Term j is labeledby concept k , term 6 by " and l by m . Theselectionoperationon concept" will constructthethesaurus@ gS thatcontainsbesidesterm 6 , terms j and l that arelabeledby its sub-concepts.Observe alsothat a termcanappearin multiple conceptthesauri,andtermsthatarenot labeledby any concepthave disappearedfrom theconceptthesauri.For example,term n is not connectedtoany conceptandhasdisappearedfrom theconceptthesauriin Figure7. Moreover, theselectionoperationon concept" createda 57698 relationbetweenterms l and 6 whichwerenot directly relatedin theoriginal thesaurus.

    Eachconceptthesauruscanbeextractedindependentlyandcontainsonly a subsetof thetermsdefinedin theconnectionrelation.Thismeansthatthesizeof theextractedthesauriis boundby thenumberof termsin theconnectionrelationandis independentof thesizeof theoriginal thesaurus.

    At this point,we shouldmentionthata conceptthesauruscanbeinducedby thoseof its super-concepts. For example, if k is a sub-conceptof " , and @ogS is the con-cept thesaurusof " , then the conceptthesaurusof k can be extractedas follows :@opS �qR_�U p �2@ gS � . In the previous example,only conceptthesaurus@ gS of concept"

    12

  • t

    uv

    w

    Thesaurus T

    t

    Tc

    Td

    Te

    Concept Thesauri

    v w

    v w

    w

    r r rs st t tu u

    v v vw wisa e

    c

    disa

    Ontology OLabeling function

    Selection

    Figure7: ExtractedThesaurusExamples.

    hasto beextractedfrom theoriginal thesaurus@ S . All thesauricorrespondingto sub-conceptsof " might thenbecreatedondemandduringthecreationof theRDFschema(Section10.1).

    10.1 Creationof the RDF Schema

    In this sectionwe will presenthow theRDFschemais constructedout of a setof con-cept thesauri. This schemawill incorporatethe setof ontology conceptsand roles,theconceptthesauridefinedfor eachontologyconceptandconnectionsbetweentermsandconcepts.In short,ontologyconceptsandthesaurustermsaremodeledasRDFclasses, ontologyrolesasRDF properties. Ontology ���� relationships,connectionre-lationsbetweentermsandconceptsand 57698 relationsbetweentermsall carryinclusionsemanticsandaremodeledwith theRDFsubclassOfproperty.

    Thecreationof theRDF schemax for anontology �E�y9���%���2������ , anda setofconceptthesauri@ gS �z�CD�757698�� is straightforward:

    1. Thesetof RDFclassesin x is obtainedasfollows:(a) for eachontologyconcept" defineRDFclassc,(b) for eachterm 6 in @ gS defineRDFclassc:t,

    2. Thesetof RDF propertiesis obtainedby definingfor eachtypedrole 1{�"F�%k|� in� anRDFpropertywith domain RDFclassc andrange RDFclassd.

    3. Thesetof RDFsubclassOf propertiesis obtainedasfollows:

    (a) for each����}�"$�%k|� relationshipbetweenontologyconcepts" and k defineanRDFsubclassOf propertybetweenRDF classesc andd.

    13

  • (b) for eachRDFclassc:t, correspondingto a root termin thesaurus@ gS , addanRDFsubclassOf propertybetweenRDFclassesc:t andc.

    (c) for each 57698 relation betweentwo terms 6 and 6 f in a conceptthesaurus@ gS , defineanRDFsubclassOf propertybetweenRDFclassesc:t andc:t’.

    It is interestingto notethatweconnectonly theroot termof eachconceptthesaurustothecorrespondingconcept.Due to the transitivity of theRDF subclassOfproperty, itcanbeinducedthata term 6 is a subclassOfanotherterm 6 or a concept" .

    The RDF schemaillustratedin Figure8 hasbeenconstructedfrom the ontologyin Figure3, the thesaurusclassificationschemesin Figures4, 5, andthe connectionrelationin Figure6.OntologyconceptsMan Made Object, Iconographic Object, Style, Pe-riod andtermsoil paintings, paintings, impressionismandfirst-impressionismareallrepresentedasRDF classes(lines5,7,9,11,21,23). For simplification,we onlyprefix termswith thecorrespondingconceptif they arecontainedin differentconceptthesauri.RDF classpaintings is definedasa subclassof classIconographicObject (line 22), sinceterm paintings is the root term of Iconographic Ob-ject conceptthesaurus.In the sameway, classesimpressionism andfirst-impressionism aredefinedassubclassesof conceptsStyle andPeriod respec-tively (lines 26,28). Classoil paintings is a subclassof classpaintings(line 24) (definedby the 57698 -relationsbetweentermoil paintingsandtermpaintings).Ontologyrolestyle, is definedasanRDFproperty, its domainbeingtheclassIcono-graphic Object (line 19) and its rangeclassStyle (line 20). By definitionof the subclassOfproperty, all subclassesof Iconographic Object inherit thisproperty.

    Usingthis RDF schema,onecanprovide RDF descriptionsaboutspecificwebre-sources.For example,a new RDF descriptionfor the sourcedescribedin Figure 1is shown in Figure9. Whencomparingthis new descriptionwith the previous one,onecanobserve that we have replacednamespaceartifact by a new namespaceint which correspondsto the RDF schemain Figure 8. In this RDF description,semanticinformation that was capturedas a value in the previous descriptionhasbeenaddedat the schemalevel. For example, the fact that the resourcedescribedan impressionistpaintingwas encodedin the valueof tag > artifact:style ? .This value correspondsin fact to a term in the AAT and is representedas an in-stanceof classint:impressionism (line 11) in the new schema. The sameargumentholds for the valuefirst-impressionismwhich is now representedasan in-stanceof RDF classint:first-impressionism (line 12). Observe also (tag> rdf:Description ? ) (Figure1, line 7), hasbeenreplacedby a typednodetag> int:oil paintings ? (line 8) indicatingthat the describedresourceis an oilpainting.

    14

  • 1. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29.

    Figure 8: The RDF schemaresultingfrom the integration of the ontology and the-saurus.

    15

  • 1. 4. 6. Web Museum: Monet, Claude :Impres-sion :

    soleil lev-ant7. 8. 10. Impression : soleil levant11. 12. 13 14. 15. 16. Nicolas Pioch17. 18.

    Figure9: RDFdescriptionfor ClaudeMonetpaintingusingtheintegratedschema.

    16

  • 11 Conclusionsand Futur e Work

    In thispaper, wehavepresentedamodular, componentbasedapproachto theconstruc-tion of RDF schemasbasedon theintegrationof ontologiesandthesaurushierarchies.Ourexamplesweretakenfrom theculturalapplicationdomain,however thepresentedapproachcanalsobeappliedto othersemanticallyrich scientific(e.g.medicine,biol-ogyor chemistry)or electroniccommerce(e.g.electroniccatalogue)applications.

    An interestingissueconcernsthespecificationof theconnectionrelation.Whereas,this relationcanbespecifiedmanuallyfor a limited numberof termsandconcepts,itscreationgetscumbersomewhenthenumberof connectedtermsandconceptsincreases.Therearetwo possiblesolutionsto thisproblem.First,asalreadymentioned,thesaurustermsareusedfor indexing documentsandotheractualdata. The existenceof thesetermsat the datalevel might thenbe exploited for the automaticcreationof the con-nectionrelationby usingdatamining techniques.Thesecondsolutionconsistsin thedefinition of a query languagefor thesauriwhich allows to extract setsof termsbysimpledeclarativequeries(e.g.pathexpressions).In this way, theconnectionrelationcanberepresentedasa conceptualview on the thesauruswhereconceptsaredefinedby termqueries.

    The resultingRDF schemacan be perceived as the domainmodel in mediationbasedsystemssuchasInformationManifold andSIMSandit playsanessentialrole inachieving semanticinteroperability betweenthesources.This domainmodelprovidesa uniform view of informationin thedomainof discourse.Usersposequeriesagainstthis modelandinformationsourcesexport their contentdescriptionsasviews on thismodel.Thesourcedescriptions,areusedby mediatorsto identify thesetof relevantin-formationsourceswith respectto auserquery. We intendto applyourapproachto thedefinitionof semanticallyrich domainmodelsin thecontext of theArtemisproject[3],which proposesa flexible framework for the definition of cultural mediators. In thesameapplicationcontext, we considerthat an importantissueis the exploitation ofthe relatedterm (synonym) relationfor queryrewriting andconceptsubsumption.Inthis direction,we intendto studymorethoroughlytheintegrationof multilingual the-sauri andthe exploitation of inter-thesaurusrelationsfor creatingmulti-lingual RDFschemas.Suchschemascanbeusedeffectively in thecontext of queryingmulti-lingualinformationsources.

    A first prototypeis currentlyunderimplementation.Wehavealreadyimplementeda loaderof RDFdescriptionsinto the ~ & object-orienteddatabasemanagementsystemof ArdentSoftwareusingtheSiRPAC [37] RDFparserto parseRDFdocuments(RDFschemasanddescriptions).Thenext stepis to providetoolsfor specifyingtheconnec-tion relationandcreateconceptthesauri.Finally, we wantto providea queryinterfacebasedonOQL for selectingresourcesaccordingto their descriptions.

    Acknowledgments Wewouldliketo thankMichel SchollandVassilisChristophidesfor their preciouscommentson preliminaryversionsof this paper. The authorsalsothankSergeAbiteboul,Marie-ChristineRousset,Rodney Topor, andAnne-MarieVer-coustrefor fruitful discussions.Last,but not leastwewould like to thankMartin Doerrfor his helpin theunderstandingof theICOM/CIDOC ReferenceModel.

    17

  • References

    [1] The Art & Architecture Thesaurus. http://www.ahip.getty.edu/vocabulary/-aat intro.html.

    [2] ACM ComputingClassificationSystem.http://www.iicm.edu:8080/jucsclassifi-cation.

    [3] B. Amann,V. Christophides,I. Fundulaki,M. Scholl,andA.M. Vercoustre.In-telligentMediationof CulturalInformationSources.ERCIMNews, (35),October1998.

    [4] J.Ordille A.Y. Levy, A. Rajaraman.Queryingheterogeneousinformationsourcesusingsourcedescriptions.pages251–262,Bombay, India,September1996.Mor-ganKaufmann.

    [5] C.Batini,M. Lenzerini,andS.B. Navathe.A comparativeanalysisof methodolo-giesfor databaseschemaintegration. ACM ComputingSurveys, 18(4):323–364,December1986.

    [6] T. Bray. RDF andMetadata,June1998. http://www.xml.com/xml/pub/98/06/-rdf.html.

    [7] D. Brickley andR.V. Guha. ResourceDescriptionFramework (RDF) SchemaSpecification.TechnicalReportREC-rdf-model-19990303, W3C, March 1999.W3CProposedRecommendation.

    [8] DiegoCalvanese,GiuseppeDeGiacomo,MaurizioLenzerini,DanieleNardi,andRiccardoRosati.Descriptionlogic framework for informationintegration.To ap-pearin Proceedingsof the6th InternationalConferenceon Principlesof Knowl-edgeRepresentationandReasoning(KR’98).

    [9] C. Y. Chee,Y. Arens,C. A. Knoblock,andC. N. Hsu. Retrieving andIntegratingDatafrom Multiple InformationSources.InternationalJournalof IntelligentandCooperativeInformationSystems, 2(2):127–158,1993.

    [10] Lesdossiersdel’in ventairegéńeral.http://aquarelle.inria.fr/Inventaire.

    [11] C. Collet,M. Huhns,andW. Shem.ResourceIntegrationUsinga LargeKnowl-edgeBasein Carnot.IEEE Computer, pages55–62,December1991.

    [12] Dublin CoreMetadataInitiative. http://purl.oclc.org/dc/.

    [13] M. DoerrandI. Fundulaki. A proposalon extendedinterthesauruslinks seman-tics. TechnicalReportTR-215, Instituteof ComputerScience-FORTH, March1998.

    [14] D.J. Foskett. Readingsin Information Retrieval, chapterThesaurus. MorganKaufmann,1997.

    18

  • [15] Thomas.R.Gruber. A TranslationApproachto PortableOntologySpecifications.TechnicalReportKSL 92-71,KnowledgeSystemsLaboratory,ComputerScienceDepartment,StanfordUniversity, April 1993.

    [16] N. Guarino. Understanding,Building, and Using Ontologies. A commentaryto ”Using Explicit Ontologiesin KBS Developemtn”,by Heijst, Schreiber, andWielinga.

    [17] N. Guarino.InformationExtraction: A MultidisciplinaryApproach to anEmerg-ing Information Technology, chapterSemanticMatching : Formal OntologicalDistinctions for Information Organization,Extraction, and Integration, pages139–170.SpringerVerlag,1998.

    [18] N. Guarino,C. Masolo,andG. Vetere. OntoSeek:Using Large Linguistic On-tologiesfor GatheringInformation Resourcesfrom the Web. Techicalreport,LADSEB, March1998.Submittedfor Publication.

    [19] InternationalGuidelinesfor MuseumObjectInformation: TheCIDOC Informa-tion Categories.http://www.cidoc.icom.org/guide/.

    [20] Documentation- Guidelinesfor theestablishmentanddevelopmentof monolin-gual thesauri.InternationalOrganizationfor Standardization,11 1986. Ref. NoISO2788-1986.

    [21] Documentation- Guidelinesfor theestablishmentanddevelopmentof multilin-gual thesauri. InternationalOrganizationfor Standardization,2 1985. Ref. No.ISO5964-1985.

    [22] K. Jeffery. Metadata:An Overview andsomeIssues.ERCIMNEWS, (35),Octo-ber1998.

    [23] KnowledgeInterchangeFormat(KIF). http://logic.stanford.edu/kif/kif.html.

    [24] K. Knight and S. Luk. Building a Large-ScaleKnowedgeBasefor MachineTranslation.In Proc.of theAmericanAssociationof Artificial IntelligenceAAAI-94, Seattle,WA, 1994.

    [25] C.Lagoze.TheWarwickFramework : A ContainerArchitecturefor DiverseSetsof Metadata.D-Lib Magazine, July/August1996.

    [26] O. LassilaandR. Swick. ResourceDescriptionFramework (RDF) Model andSyntaxSpecification.TechnicalReportREC-rdf-syntax-19990202,W3C,Febru-ary1999.W3CProposedRecommendation.

    [27] The Library of CongressSubject Headings. http://www.grci.com/services/-library/libcongress/index.shtml.

    [28] D. B. Lenat.CYC: A Large-ScaleInvestmentin KnowledgeInfrastructure.Com-municationsof theACM, 38(11):33–38,1995.

    19

  • [29] P. Martin. The WebKB set of tools. Technical report, Griffith University,School of Information Technology, Australia, October 1997. available at:http://meganesia.int.gu.edu.au/phmartin/WebKB/doc/index.html.

    [30] R.S.Michalski. CategoriesandConcepts,TheoreticalViewsandInductiveDataAnalysis, chapterBeyondPrototypesandFrames:TheTwo-TieredConceptRep-resentation.AcademicPress,1993.

    [31] A. Michard, V. Christophides,M. Scholl, M. Stapleton,D. Sutcliffe, andA-M.Vercoustre.The AquarelleResourceDiscovery System. Journal of ComputerNetworksandISDNSystems, 30(13):1185–1200,August1998.

    [32] G. A. Miller. WordNet:A Lexical Databasefor English.Communicationsof theACM, 38(11):39–41,1995.

    [33] R. Neches,R. Fikes,T. Finin, T. Gruber, R. Patil, T. Senator, andW. Swartout.EnablingTechnologyfor KnowledgeSharing.AI Magazine, Fall 1991.

    [34] W3C Technologyand Society Domain : ResourceDescription Framework(RDF). http://www.w3.org/RDF/.

    [35] A. Sheth.InteroperatingGeographicInformationSystems, chapterChangingFo-cuson Interoperabilityin InformationSystems:FromSystem,Syntax,Structureto Semantics.Kluwer, 1999.

    [36] M. Sintichakisand P. Constantopoulos.A Method for Monolingual ThesauriMerging. In Proc.20thInternationalConferenceon Research andDevelopmentin InformationRetrieval, ACM SIGIR, PhiladeplphiaPA, USA, July 1997.

    [37] SiRPAC - Simple RDF Parser & Compiler. http://www.w3.org/RDF/-Implementations/SiRPAC/.

    [38] D. Soergel. TheArt andArchitectureThesaurus,AAT. A critical appraisal.Tech-nical report,College of Library andInformationSciences,University of Mary-land,1995.

    [39] UnifiedMedicalLanguageSystem.http://gmedserv.nlm.nih.gov/research/umls/.

    [40] MARC STANDARDS. http://lcweb.loc.gov/marc/marc.html.

    [41] StuartWeibel.Metadata:TheFoundationsof ResourceDescription.D-Lib Mag-azine, July1995.

    [42] W3C TechnologyandSocietyDomain: ExtensibleMarkup Language(XML).http://www.w3.org/XML/.

    20

  • A A NaiveAlgorithm for ThesaurusExtraction

    Input : (i1) setof conceptsU(i2) a thesaurus@A�BN�757698T�(i3) a labelingfunction V(i4) a setof root terms� in @

    Output: (o1)a thesaurusR_�@ S �Procedure:

    NW�e ; 57698cW�efor all terms1 in �

    for all narrow terms6 of 1":1$mK�6�m 5:698.�1$�26%�

    ":1$mK�6�m 57698G1$�%6%� : if V1(�abU�en�6

    elseadd 1 to N if VY�6%�abUd�e

    add 6 to Nadd 6:�%1(� to 57698n�6

    elsen�1

    for all narrow terms6�f of 6":1$m��6�m 57698Gn`�26�f�

    21