Union Catalog and Knowledge Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica
2012.04.20
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Introduction
The integration and management of digital contents has become an important issue as the amount of digital contents produced from different projects and institutions increases rapidly.
The goal of our project is to achieve optimized preservation, retrieval, and presentation of digital collections.
What is the union catalog?
• It is a catalog and portal for all digital collections of TELDAP.
• It is an integrated platform for browsing and searching the entire digital contents of TELDAP.
• Metadata provides core descriptions and licensing information for each digital collection.
Figure: Home Page of Union Catalog (browsing by topics, search by keywords)
Some improved functions for IR
• Keyword suggestion
• Keyword extension
• Recommendation of related collections
Figure labels: digital image, recommendation of related collections, hyperlink to database, metadata, citation, social networking service, licensing information
Metadata models for different types of objects
Archived digital items
• Union catalog metadata model: Dublin Core+
Web sites
• DCCAP (Dublin Core Collections Application Profile)
• Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History
Documents
• Document metadata: Dublin Core
Metadata for digital items
• Over 4 million digital items and still increasing
Element: Definition
Title: A name given to the resource
Creator: An entity primarily responsible for making the content of the resource
Subject and Keywords: The topic of the content of the resource
Description: An account of the content of the resource
Publisher: An entity responsible for making the resource available
Contributor: An entity responsible for making contributions to the content of the resource
Date: A date associated with an event in the life cycle of the resource
Resource Type: The nature or genre of the content of the resource
Format: The physical or digital manifestation of the resource
Resource Identifier: An unambiguous reference to the resource within a given context
Source: A reference to a resource from which the present resource is derived
Language: A language of the intellectual content of the resource
Relation: A reference to a related resource
Coverage: The extent or scope of the content of the resource
Rights Management: Information about rights held in and over the resource
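A record built on the fifteen Dublin Core elements listed for digital items could be represented roughly as follows. This is a minimal sketch: the lower-case field spellings, the sample values, and the `missing_elements` helper are assumptions made for illustration, not the union catalog's actual schema.

```python
# The 15 Dublin Core elements from the element/definition list.
# The short lower-case spellings are an assumption made for this sketch.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def missing_elements(record):
    """Return the Dublin Core elements that are absent or empty in a record."""
    return [name for name in DC_ELEMENTS if not record.get(name)]


# Hypothetical record; every value is invented for illustration only.
sample_record = {
    "title": "Sample painting",
    "creator": "Unknown artist",
    "subject": ["painting"],
    "description": "A placeholder description.",
    "type": "image",
    "format": "image/jpeg",
    "identifier": "sample-0001",
    "language": "zh",
    "rights": "Placeholder licensing statement",
}

print(missing_elements(sample_record))
```

A check like this could flag incomplete records before they are added to the catalog.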
Metadata for websites
• Over 690 websites and still increasing
• Metadata
– DCCAP (Dublin Core Collections Application Profile)
– Combines the standard with our requirements: 19 data fields
Field groups shown in the figure:
• The website homepage picture
• URL, Project Information
• Type, Name, Author, Subject, Description, Language, Item Type, Target
• Archived information: URL, time, authorization
• Copyright, Purpose, Other Information
Figure: http://digitalarchives.tw
Social networking service
Uses of Metadata
• Search collections by matching keywords and features
• Provide basic information about each collection
• Dynamic categorization
• Provide information to compute the similarity or relatedness of two collections
• Extract keywords
(1) Chinese Keyword Search
Data at each stage:
• Keyword + (Features)
• Synonyms, hyponyms
• Matched collections
• Collections + weights
• Display results
Processing steps:
• Keyword extension
• Keyword matching
• Ranking
• Filtering
Resources:
• AAT-Taiwan & TELDAP Thesaurus
• Keyword Dictionary
English Keyword Search
Data at each stage:
• English keyword + (Features)
• Translations, synonyms, hyponyms
• Matched collections
• Collections + weights
• Display results
Processing steps:
• Keyword translation & extension
• Keyword matching
• Ranking
• Filtering
Resources:
• AAT-Taiwan & TELDAP Thesaurus
• Keyword Dictionary
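The search flow (extension, matching, ranking, filtering) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the thesaurus, translation table, keyword dictionary, weights, and the `min_weight` threshold are all hypothetical, and taking the strongest association per collection is one possible way to combine matches, not necessarily what the TELDAP system does.

```python
# Hypothetical thesaurus: keyword -> synonyms and hyponyms.
THESAURUS = {
    "painting": {"synonyms": ["picture"], "hyponyms": ["ink painting"]},
}

# Hypothetical English-to-Chinese translation table.
TRANSLATIONS = {"painting": "繪畫"}

# Hypothetical keyword dictionary: keyword -> {collection id: association strength}.
KEYWORD_DICTIONARY = {
    "painting": {"c1": 0.9, "c2": 0.4},
    "picture": {"c2": 0.3},
    "ink painting": {"c3": 0.8},
    "繪畫": {"c4": 0.7},
}


def extend(keyword, translate=False):
    """Keyword extension: add synonyms, hyponyms, and optionally a translation."""
    entry = THESAURUS.get(keyword, {})
    terms = [keyword] + entry.get("synonyms", []) + entry.get("hyponyms", [])
    if translate and keyword in TRANSLATIONS:
        terms.append(TRANSLATIONS[keyword])
    return terms


def search(keyword, translate=False, min_weight=0.5):
    """Extend the keyword, match collections, rank by weight, then filter."""
    weights = {}
    for term in extend(keyword, translate):
        for collection, strength in KEYWORD_DICTIONARY.get(term, {}).items():
            # Keep the strongest association seen for each collection.
            weights[collection] = max(weights.get(collection, 0.0), strength)
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    return [(c, w) for c, w in ranked if w >= min_weight]


print(search("painting"))
print(search("painting", translate=True))
```

With `translate=True` the same function covers the English search path, where the keyword is also translated before matching.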
Ranking Algorithm
Rank Value(item) = W1 * Association(Keyword, item) + W2 * Quality(item)
– Association(Keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of relation(Keyword, item)
– Quality(item) = W1 * Image quality(item) + W2 * Qualification of provider(item) + W3 * Metadata(item)
• Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
• Importance of relation(Keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
• Keyword-from Value = 1 if keyword is contained in title(item); 0.5 if keyword is contained in description(item)
• Mutual Information(keyword, Topic(item)) = P(Keyword, Topic(item)) / {P(Keyword) * P(Topic(item))}
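The nested ranking formulas can be composed as in the sketch below. The slides do not give any weight values, so every weight here is an assumed placeholder; topical similarity is passed in as a precomputed number because the ontology distance measure is not specified, and returning 0 when the keyword appears in neither title nor description is also an assumption.

```python
def keyword_from_value(keyword, title, description):
    """1 if the keyword is in the title, 0.5 if it is in the description.

    The slide gives no value for the remaining case; 0 is assumed here.
    Matching is case-insensitive, which is also an assumption.
    """
    keyword = keyword.lower()
    if keyword in title.lower():
        return 1.0
    if keyword in description.lower():
        return 0.5
    return 0.0


def mutual_information(p_joint, p_keyword, p_topic):
    """P(keyword, topic) / (P(keyword) * P(topic)), as written on the slide."""
    return p_joint / (p_keyword * p_topic)


def quality(image_quality, provider_qualification, metadata_score,
            weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three quality factors; weights are assumed."""
    w1, w2, w3 = weights
    return w1 * image_quality + w2 * provider_qualification + w3 * metadata_score


def rank_value(topical_similarity, keyword_from, mutual_info, item_quality,
               w_assoc=0.7, w_quality=0.3,
               w_topic=0.5, w_relation=0.5,
               w_from=0.5, w_mi=0.5):
    """Combine the formulas bottom-up; every weight is an assumed value."""
    importance = w_from * keyword_from + w_mi * mutual_info
    association = w_topic * topical_similarity + w_relation * importance
    return w_assoc * association + w_quality * item_quality


# Hypothetical inputs, invented for illustration.
kf = keyword_from_value("jade", "Jade carving", "A carved ornament")
mi = mutual_information(p_joint=0.02, p_keyword=0.1, p_topic=0.1)
q = quality(image_quality=0.8, provider_qualification=0.5, metadata_score=0.5)
print(rank_value(topical_similarity=0.8, keyword_from=kf,
                 mutual_info=mi, item_quality=q))
```

Each level uses its own weights, mirroring how the slide reuses the names W1 and W2 inside each formula.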
Algorithm for Recommending Related Collections
i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, …}
Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item) + W2 * Institute Similar(i-th item, j-th item) + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + …
where Delta(Keyword1) = 1 if Keyword1 of the i-th item is also a keyword of the j-th item; otherwise 0.
Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
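A minimal sketch of the recommendation score follows. The slides do not define Topic Similar or Institute Similar, so both are reduced to exact-match tests here; that simplification, the weight values, and the sample items are all assumptions.

```python
def similar(item_i, item_j, keyword_weights, w_topic=1.0, w_institute=0.5):
    """Similarity of two items as a weighted sum of matching features.

    Topic and institute similarity are exact-match tests in this sketch,
    which is an assumption; the slide does not define them.
    """
    score = 0.0
    if item_i["topic"] == item_j["topic"]:
        score += w_topic
    if item_i["institute"] == item_j["institute"]:
        score += w_institute
    # Delta(keyword) is 1 when a keyword of item i is also a keyword of item j.
    for keyword in item_i["keywords"]:
        if keyword in item_j["keywords"]:
            score += keyword_weights.get(keyword, 0.0)
    return score


def recommendation(item_i, item_j, keyword_weights):
    """Similarity plus the candidate item's evaluation score."""
    return similar(item_i, item_j, keyword_weights) + item_j["evaluation"]


# Hypothetical items and keyword weights, invented for illustration.
a = {"topic": "painting", "institute": "Museum A",
     "keywords": {"landscape", "ink"}, "evaluation": 0.2}
b = {"topic": "painting", "institute": "Museum B",
     "keywords": {"ink", "scroll"}, "evaluation": 0.3}
weights = {"landscape": 0.4, "ink": 0.6}

print(recommendation(a, b, weights))
```

Candidates for a given item would then be sorted by this score, with the Evaluation term favoring better-rated collections among equally similar ones.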
(2) Dynamic categorization
User-oriented categorization
• General, elementary school students, high school students, researchers, etc.
Topical-based categorization
• Archaeology, painting, animal, plant, document, etc.
Functional-based categorization
• Research, education, business, technology, …
Categorization based on institutions
• Academia Sinica, Taiwan U., Palace Museum, …
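Because each categorization draws on a different metadata field, the same collections can be regrouped on demand. The sketch below illustrates the idea; the field names (`target`, `topic`, `institution`) and the sample records are assumptions, not the actual TELDAP fields.

```python
from collections import defaultdict


def categorize(collections, facet):
    """Group collection ids by the values of one metadata field."""
    groups = defaultdict(list)
    for collection in collections:
        for value in collection.get(facet, []):
            groups[value].append(collection["id"])
    return dict(groups)


# Hypothetical metadata records; field names and values are assumptions.
collections = [
    {"id": "c1", "target": ["researchers"], "topic": ["archaeology"],
     "institution": ["Academia Sinica"]},
    {"id": "c2", "target": ["general", "high school students"],
     "topic": ["painting"], "institution": ["Palace Museum"]},
    {"id": "c3", "target": ["general"], "topic": ["painting"],
     "institution": ["Academia Sinica"]},
]

print(categorize(collections, "topic"))
print(categorize(collections, "institution"))
```

Switching the `facet` argument yields the user-oriented, topical, or institutional view without storing a separate category tree for each.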
(3) Multi-purposes of Core IR System and Databases
• TELDAP
– Whole collections
– Searched by institutes, domains, and media types (documents, images, videos, and web sites)
– Monolingual
• Digital Shop
– Whole collections or only fine arts
– General search and search by licensing types
– Relies on multilingual thesaurus
• Taiwan Academy
– Fine arts
– Searched by institutes and domains
– Multilingual
– Relies on multilingual thesaurus
Figure: http://digitalarchives.tw (Digitalarchives.tw)
• Purpose: Education. Target: elementary school students, junior high school students, teachers, …
• Purpose: Creative applications
• Purpose: Academic research. Subject: animal, archaeology, anthropology, …
Figure: http://taiwanacademy.tw (Taiwan Academy)
• Categorization based on institutions
• Topical-based categorization
Plans for building knowledge structures for TELDAP
• Construct metadata models for different objects.
• Establish hyperlinks between contents and objects.
– Develop keyword extraction tools.
– Design automatic tagging tools.
• Construct TELDAP ontology and thesaurus.
– Art & Architecture Thesaurus by Getty
– Chinese WordNet
(1) Metadata models for different objects
• Digital collections
– Union catalog metadata model: Dublin Core+
• Web sites
– DCCAP (Dublin Core Collections Application Profile)
– Public fields
– Private fields: Unique Identifier, Format, Evaluation, Cataloging History
• Documents
– Document metadata: Dublin Core
(2) Create keyword dictionary
• Extract from metadata
• Collect from Google search terms
• By social tagging
• Manually collect while tagging hyperlinks
Lexical Entry of Keyword Dictionary
• Keyword id
• Keyword
• Synset id
• Hypernym id
• Hyponym id
• Features
• Related collections + association strengths
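The lexical entry fields map naturally onto a small record type. The sketch below is one possible shape; the Python field names, the types, and the choice to store hyponyms as a list are assumptions, since the slide only names the fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    """One keyword dictionary entry; the field types are assumed."""
    keyword_id: str
    keyword: str
    synset_id: Optional[str] = None
    hypernym_id: Optional[str] = None
    # Stored as a list here; the slide lists a single "Hyponym id" field.
    hyponym_ids: List[str] = field(default_factory=list)
    features: List[str] = field(default_factory=list)
    # Related collection id -> association strength.
    related_collections: Dict[str, float] = field(default_factory=dict)

    def strongest_collections(self, n=3):
        """Return up to n collection ids, strongest association first."""
        ranked = sorted(self.related_collections.items(),
                        key=lambda pair: pair[1], reverse=True)
        return [collection_id for collection_id, _ in ranked[:n]]


# Hypothetical entry; ids and strengths are invented for illustration.
entry = LexicalEntry(
    keyword_id="k001",
    keyword="ink painting",
    synset_id="s010",
    hypernym_id="k000",
    features=["art"],
    related_collections={"c1": 0.4, "c2": 0.9, "c3": 0.7},
)

print(entry.strongest_collections(2))
```

Keeping association strengths inside the entry lets keyword matching return weighted collections in a single lookup.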
(2) Establish hyperlinks between contents and objects
• Identify keywords in contents.
• Tag keywords with related object hyperlinks.
Develop hyperlink tagging tools
• Word segmentation tools
– Resolve word segmentation ambiguities and identify keywords.
– CKIP word segmentation system: http://ckipsvr.iis.sinica.edu.tw/
• TELDAP keyword dictionary
– Extract keywords from metadata and establish object-keyword relations.
    • Extract text from XML data for each object.
    • The text is classified by topics, titles, descriptions, authors, locations, eras, etc.
    • From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
– The current dictionary contains more than 120,000 keywords.
Prototype system for hyperlink tagger
• Identify and select keywords from the input text
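The tagging step can be illustrated with a small function that finds dictionary keywords in a text and wraps them in links. This is a sketch under stated assumptions: it uses greedy longest-match scanning as a simple stand-in for a real word segmentation system such as CKIP, and the keyword list and `example.org` URLs are placeholders.

```python
import html

# Hypothetical keyword -> object URL mapping; example.org is a placeholder.
KEYWORD_LINKS = {
    "ink painting": "https://example.org/objects/ink-painting",
    "painting": "https://example.org/objects/painting",
    "scroll": "https://example.org/objects/scroll",
}


def tag_hyperlinks(text, links=KEYWORD_LINKS):
    """Wrap dictionary keywords found in text with <a> tags.

    Greedy longest-match scanning is used here; it does not resolve
    segmentation ambiguities the way a real segmenter would.
    """
    keywords = sorted(links, key=len, reverse=True)  # try longest first
    out = []
    i = 0
    while i < len(text):
        for kw in keywords:
            if text.startswith(kw, i):
                url = html.escape(links[kw], quote=True)
                out.append('<a href="%s">%s</a>' % (url, html.escape(kw)))
                i += len(kw)
                break
        else:
            # No keyword starts here: copy one escaped character and move on.
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)


print(tag_hyperlinks("An ink painting on a scroll"))
```

Trying longer keywords first keeps "ink painting" from being tagged as the shorter "painting".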
(3) Construct TELDAP ontology and thesaurus
• Establish association links between Chinese keywords and Getty AAT.
• Merge TELDAP keywords with Chinese AAT.
Future Perspective
• Technology development
– Construct multilingual thesauri (Getty AAT).
– Maintain the TELDAP keyword-and-object relation database.
– Construct name authority files, gazetteers, and universal calendars.
– Design hyperlink taggers and keyword extension tools.
– Design an authoring tool that automatically provides hyperlinks to keyword-related digital contents.
– Design a knowledge-based content retrieval system.
Future Perspectives
• Content enrichment
– Within TELDAP:
    • Standardize the object metadata model and data format.
    • Provide object metadata in a controlled vocabulary.
    • Write scripts and stories for different topics with a Wiki-like knowledge structure.
    • Enrich the digital collections.
    • Establish hyperlinks between textbooks and TELDAP collections.
– Extend the knowledge sources, e.g. Wikipedia.