Union Catalog and Knowledge Engineering for TELDAP

Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica

2012.04.20

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Introduction

Integrating and managing digital contents has become an important issue as the volume of digital contents produced by different projects and institutions grows rapidly.

The goal of our project is to achieve optimized preservation, retrieval, and presentation of digital collections.

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

What is the union catalog?

• It is a catalog and portal for all digital collections of TELDAP.
• It is an integrated platform for browsing and searching the entire digital contents of TELDAP.
• Metadata provides core descriptions and licensing information of each digital collection.

Some improved functions for IR

• Keyword suggestion

• Keyword extension

• Recommendation of related collections

[Figure: screenshots of the union catalog. Callouts: keyword suggestion; keyword extension; digital image; recommendation of related collections; hyperlink to database; metadata; citation; social networking service; licensing information.]

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Metadata models for different types of objects

Archived digital items
• Union catalog metadata model: Dublin Core+

Web sites
• DCCAP (Dublin Core Collections Application Profile)
• Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History

Documents
• Document metadata: Dublin Core

Metadata for digital items

• Over 4 million digital items and still increasing

Element: Definition
Title: A name given to the resource
Creator: An entity primarily responsible for making the content of the resource
Subject and Keywords: The topic of the content of the resource
Description: An account of the content of the resource
Publisher: An entity responsible for making the resource available
Contributor: An entity responsible for making contributions to the content of the resource
Date: A date associated with an event in the life cycle of the resource
Resource Type: The nature or genre of the content of the resource
Format: The physical or digital manifestation of the resource
Resource Identifier: An unambiguous reference to the resource within a given context
Source: A reference to a resource from which the present resource is derived
Language: A language of the intellectual content of the resource
Relation: A reference to a related resource
Coverage: The extent or scope of the content of the resource
Rights Management: Information about rights held in and over the resource
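As a minimal sketch, a record built on the fifteen elements listed above could be held as a plain mapping from element name to value. The sample values and the helper name `missing_elements` are hypothetical; they are not taken from the TELDAP catalog.

```python
# The fifteen element names from the table above.
DC_ELEMENTS = [
    "Title", "Creator", "Subject and Keywords", "Description", "Publisher",
    "Contributor", "Date", "Resource Type", "Format", "Resource Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights Management",
]


def missing_elements(record):
    """Return the element names that a record omits or leaves empty."""
    return [name for name in DC_ELEMENTS if not record.get(name)]


# Hypothetical record; the values are invented for illustration only.
record = {
    "Title": "Landscape scroll",
    "Creator": "Unknown",
    "Resource Type": "Image",
    "Language": "zh",
}

print(missing_elements(record))
```

A check like this is one way to flag incomplete records before they are loaded into a union catalog.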

Metadata for websites

• Over 690 websites and still increasing
• Metadata
  – DCCAP (Dublin Core Collections Application Profile)
  – Combines the standard with our requirements: 19 data fields

[Figure: website record page, from http://digitalarchives.tw. Callouts:]
• The website homepage picture
• URL, Project Information
• Type, Name, Author, Subject, Description, Language, Item Type, Target
• Archived information: URL, time, authorization
• Copyright, Purpose, Other Information
• Social networking service

Uses of Metadata

• Search collections by matching keyword and features
• Provide basic information of each collection
• Dynamic categorization
• Provide information to compute similarity or relatedness of two collections
• Extract keywords

(1) Chinese Keyword Search

• Data flow: Keyword + (Features) → Synonyms, hyponyms → Matched Collections → Collections + Weights → Display Results
• Processing steps: Keyword Extension → Keyword Matching → Ranking → Filtering
• Resources: AAT-Taiwan & TELDAP Thesaurus; Keyword Dictionary

English Keyword Search

• Data flow: English Keyword + (Features) → Translations, Synonyms, Hyponyms → Matched Collections → Collections + Weights → Display Results
• Processing steps: Keyword Translation & Extension → Keyword Matching → Ranking → Filtering
• Resources: AAT-Taiwan & TELDAP Thesaurus; Keyword Dictionary
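The search flow above (translation and extension, matching, ranking, filtering) can be sketched in a few lines. This is only an illustration under stated assumptions: the thesaurus, translation table, and keyword dictionary contents are invented, the rule of keeping the best weight per collection is an assumption, and the ranking step is a plain sort on that stand-in weight.

```python
# Hypothetical thesaurus: keyword -> synonyms and hyponyms.
THESAURUS = {
    "painting": {"synonyms": ["picture"],
                 "hyponyms": ["ink painting", "oil painting"]},
}

# Hypothetical English-to-Chinese translation table.
TRANSLATIONS = {"painting": ["繪畫"]}

# Hypothetical keyword dictionary: keyword -> {collection id: weight}.
KEYWORD_DICTIONARY = {
    "painting": {"c1": 0.9},
    "ink painting": {"c2": 0.8, "c1": 0.4},
    "繪畫": {"c3": 0.7},
}


def extend_keyword(keyword):
    """Keyword translation and extension: add translations, synonyms, hyponyms."""
    entry = THESAURUS.get(keyword, {})
    terms = [keyword]
    terms += TRANSLATIONS.get(keyword, [])
    terms += entry.get("synonyms", [])
    terms += entry.get("hyponyms", [])
    return terms


def match_collections(terms):
    """Keyword matching: keep the best weight seen for each collection (assumption)."""
    weights = {}
    for term in terms:
        for cid, weight in KEYWORD_DICTIONARY.get(term, {}).items():
            weights[cid] = max(weights.get(cid, 0.0), weight)
    return weights


def search(keyword, min_weight=0.5):
    """Extend, match, rank by weight, then filter out weak matches."""
    weights = match_collections(extend_keyword(keyword))
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(cid, w) for cid, w in ranked if w >= min_weight]


print(search("painting"))
```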

Ranking Algorithm

Rank Value(item) = W1 * Association(Keyword, item) + W2 * Quality(item)
– Association(Keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of relation(Keyword, item)
– Quality(item) = W1 * Image quality(item) + W2 * Qualification of provider(item) + W3 * Metadata(item)
• Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
• Importance of relation(Keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
• Keyword-from Value = 1 if keyword is contained in title(item); 0.5 if keyword is contained in description(item)
• Mutual Information(keyword, Topic(item)) = P(Keyword, Topic(item)) / {P(Keyword) * P(Topic(item))}
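The formulas above compose bottom-up, which a short sketch makes concrete. Several points here are assumptions rather than part of the slides: the weights are treated as local to each formula and given arbitrary default values, the Keyword-from Value is taken as 0 when the keyword appears in neither title nor description, and the ontology-based topical similarity is passed in as a precomputed number.

```python
def keyword_from_value(keyword, title, description):
    """1 if the keyword is in the title, 0.5 if in the description."""
    if keyword in title:
        return 1.0
    if keyword in description:
        return 0.5
    return 0.0  # assumption: the slide does not specify this case


def mutual_information(p_joint, p_keyword, p_topic):
    """Ratio form as written on the slide: P(k, t) / (P(k) * P(t))."""
    return p_joint / (p_keyword * p_topic)


def importance_of_relation(kfv, mi, w=(0.5, 0.5)):
    return w[0] * kfv + w[1] * mi


def association(topical_similarity, importance, w=(0.5, 0.5)):
    return w[0] * topical_similarity + w[1] * importance


def quality(image_quality, provider_qualification, metadata_score,
            w=(0.4, 0.3, 0.3)):
    return (w[0] * image_quality + w[1] * provider_qualification
            + w[2] * metadata_score)


def rank_value(assoc, qual, w=(0.7, 0.3)):
    return w[0] * assoc + w[1] * qual


# Hypothetical inputs for one keyword/item pair.
kfv = keyword_from_value("orchid", "orchid specimen", "pressed plant")
mi = mutual_information(p_joint=0.02, p_keyword=0.1, p_topic=0.1)
assoc = association(0.8, importance_of_relation(kfv, mi))
score = rank_value(assoc, quality(0.9, 0.8, 0.7))
print(score)
```

Because the mutual information ratio is unbounded while the other terms lie in a fixed range, a real system would likely need to normalize it before weighting; the slides do not say how.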

Algorithm for Recommending Related Collections

i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, …}

Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item)
  + W2 * Institute Similar(i-th item, j-th item)
  + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + …

where Delta(Keyword1) = 1 if Keyword1 of i-th item is also a keyword of j-th item; otherwise 0.

Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
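A minimal sketch of the similarity and recommendation scores above follows. It assumes that Topic Similar and Institute Similar are simple exact-match indicators (the slides do not define them), and the `Item` class, the weights, and the sample items are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    topic: str
    institute: str
    keywords: set = field(default_factory=set)
    evaluation: float = 0.0


def similar(a, b, keyword_weights, w_topic=1.0, w_institute=0.5):
    """Weighted sum of topic match, institute match, and shared keywords."""
    # Assumption: topic and institute similarity are exact-match indicators.
    score = w_topic * (a.topic == b.topic)
    score += w_institute * (a.institute == b.institute)
    for kw in a.keywords:
        delta = 1 if kw in b.keywords else 0
        score += keyword_weights.get(kw, 0.0) * delta
    return score


def recommendation(a, b, keyword_weights):
    """Similarity to the candidate plus the candidate's own evaluation."""
    return similar(a, b, keyword_weights) + b.evaluation


# Hypothetical items and keyword weights.
a = Item("painting", "Palace Museum", {"landscape", "ink"})
b = Item("painting", "Academia Sinica", {"ink", "scroll"}, evaluation=0.2)
weights = {"landscape": 0.3, "ink": 0.4}

print(similar(a, b, weights), recommendation(a, b, weights))
```

Adding Evaluation(j-th item) means a well-rated collection can outrank a slightly more similar one, which matches the form of the Recommendation formula.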

(2) Dynamic categorization

User-oriented categorization
• General, elementary school students, high school students, researchers, … etc.

Topical-based categorization
• Archaeology, painting, animal, plant, document, … etc.

Functional-based categorization
• Research, education, business, technology, …

Categorization based on institutions
• Academia Sinica, Taiwan U., Palace Museum, …
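The four categorizations above can all be produced from the same metadata by grouping on a different field each time. The sketch below assumes each collection record carries one value per facet; the field names and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical collection records, one value per facet.
COLLECTIONS = [
    {"id": "c1", "audience": "researchers", "topic": "archaeology",
     "function": "research", "institution": "Academia Sinica"},
    {"id": "c2", "audience": "elementary school students", "topic": "painting",
     "function": "education", "institution": "Palace Museum"},
    {"id": "c3", "audience": "general", "topic": "painting",
     "function": "education", "institution": "Academia Sinica"},
]


def categorize(collections, facet):
    """Group collection ids by the value of the chosen metadata facet."""
    groups = defaultdict(list)
    for collection in collections:
        groups[collection[facet]].append(collection["id"])
    return dict(groups)


print(categorize(COLLECTIONS, "topic"))
print(categorize(COLLECTIONS, "institution"))
```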

(3) Multi-purposes of Core IR System and Databases

TELDAP
– Whole collections
– Searched by institutes, domains, and media types (documents, images, videos, and web sites)
– Monolingual

Digital Shop
– Whole collections or only fine arts
– General search and searched by licensing types
– Rely on multilingual thesaurus

Taiwan Academy
– Fine arts
– Searched by institutes and domains
– Multilingual
– Rely on multilingual thesaurus

[Figure: Digitalarchives.tw. Callouts:]
• Purpose: Education. Target: Elementary school student, Junior high school student, Teacher…
• Purpose: Creative applications
• Purpose: Academic research. Subject: Animal, Archaeology, Anthropology…

[Figure: Taiwan Academy. Callouts:]
• Categorization based on institutions
• Topical-based categorization

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Plans of making knowledge structures for TELDAP

• Construct metadata models for different objects.
• Establish hyperlinks between contents and objects.
  – Develop keyword extraction tools.
  – Design automatic tagging tools.
• Construct TELDAP ontology and thesaurus.
  – Art & Architecture Thesaurus by Getty
  – Chinese WordNet

(1) Metadata models for different objects

• Digital collections
  – Union catalog metadata model: Dublin Core+
• Web sites
  – DCCAP (Dublin Core Collections Application Profile)
  – Public fields
  – Private fields: Unique Identifier, Format, Evaluation, Cataloging History
• Documents
  – Document metadata: Dublin Core

(2) Create keyword dictionary

• Extract from metadata
• Collect from Google search terms
• By social tagging
• Manually collect while tagging hyperlinks

Lexical Entry of Keyword Dictionary
• Keyword id
• Keyword
• Synset id
• Hypernym id
• Hyponym id
• Features
• Related Collections + Association Strengths
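The lexical entry above maps naturally onto a small record type. The following is a sketch only: the field types are assumptions, and storing several hyponym ids per entry is an assumption as well, since the slide lists a single "Hyponym id" field. The sample entries are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    keyword_id: int
    keyword: str
    synset_id: Optional[int] = None
    hypernym_id: Optional[int] = None
    # Assumption: an entry may have several hyponyms.
    hyponym_ids: List[int] = field(default_factory=list)
    features: List[str] = field(default_factory=list)
    # Related collection id -> association strength.
    related_collections: Dict[str, float] = field(default_factory=dict)


def hyponyms_of(entry, dictionary):
    """Look up the hyponym keywords of an entry by id."""
    return [dictionary[i].keyword for i in entry.hyponym_ids if i in dictionary]


# Hypothetical dictionary keyed by keyword id.
dictionary = {
    1: LexicalEntry(1, "painting", synset_id=10, hyponym_ids=[2],
                    related_collections={"c1": 0.9}),
    2: LexicalEntry(2, "ink painting", synset_id=11, hypernym_id=1,
                    related_collections={"c2": 0.8}),
}

print(hyponyms_of(dictionary[1], dictionary))
```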

(2) Establish hyperlinks between contents and objects

• Identify keywords in contents.
• Tag keywords with related object hyperlinks.

Develop hyperlink tagging tools

• Word segmentation tools
  – Resolve word segmentation ambiguities and identify keywords.
  – CKIP word segmentation system: http://ckipsvr.iis.sinica.edu.tw/
• TELDAP keyword dictionary
  – Extract keywords from metadata and establish object-keyword relations.
    · Extract text from XML data for each object.
    · The text is classified by topics, titles, descriptions, authors, locations, eras, etc.
    · From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
  – The current dictionary contains more than 120,000 keywords.

Prototype system for hyperlink tagger

• Identify and select keywords from the input text
• Produce text with hyperlinks
• Hyperlinks point to the related digital collections
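The three steps above can be illustrated with a toy tagger. This sketch is not the prototype itself: it replaces word segmentation with a greedy longest-match lookup against the keyword dictionary, and the keywords and `example.org` URLs are placeholders.

```python
import html

# Hypothetical keyword -> collection URL table (placeholder URLs).
KEYWORD_LINKS = {
    "jadeite cabbage": "https://example.org/c/1",
    "cabbage": "https://example.org/c/2",
}


def tag_hyperlinks(text, keyword_links):
    """Wrap each dictionary keyword in an HTML link, preferring longer matches."""
    max_len = max(map(len, keyword_links), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "jadeite cabbage" beats "cabbage".
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in keyword_links:
                url = html.escape(keyword_links[candidate], quote=True)
                out.append(f'<a href="{url}">{html.escape(candidate)}</a>')
                i += n
                break
        else:
            # No keyword starts here; copy one character and move on.
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)


print(tag_hyperlinks("the jadeite cabbage", KEYWORD_LINKS))
```

Character-level matching like this can link a keyword that is only part of a longer word, which is one reason a segmentation step that resolves word boundaries comes first.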

(3) Construct TELDAP ontology and thesaurus

• Establish association links between Chinese keywords and Getty AAT.
• Merge TELDAP keywords with Chinese AAT.

AAT Browsing trees of Taiwan Academy

AAT subject search of Taiwan Academy

Recommendation of related items

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Future Perspective

• Technology development
  – Construct multilingual thesauri: Getty AAT.
  – Maintain the TELDAP keyword-and-object relation database.
  – Construct name authority files, gazetteers, and universal calendars.
  – Design hyperlink taggers and keyword extension tools.
  – Design an authoring tool which automatically provides hyperlinks to keyword-related digital contents.
  – Design a knowledge-based content retrieval system.

Future Perspectives

• Content enrichment
  – Within TELDAP:
    · Standardize object metadata model and data format.
    · Provide object metadata in controlled vocabulary.
    · Write scripts and stories for different topics with Wiki-like knowledge structure.
    · Enrich the digital collections.
    · Establish hyperlinks between text books and TELDAP collections.
  – Extend the knowledge sources: e.g. Wikipedia