Content Management and the role of taxonomies Judith Molka-Danielsen Oct. 13, 2003.


Primary Challenges for Content Management Systems

- Heterogeneous data sources: create a normalized representation of the data so that it is equally accessible (readable) to humans and machines.
  - Retrieving data from an RDBMS involves programmatic access (ODBC, SQL).
  - HTML files consist of tagged text that mixes stylistic and structural information. Browsers interpret the same code in different ways, which confuses automated programs, although humans manage it.
  - Word-processing applications (Word, Acrobat) store binary data that is converted to text by a proprietary interpreter and an associated viewer. We want viewers to interoperate with other formats.
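As a minimal sketch of such a normalized representation, the Python below reads one record from a relational database and one from an HTML fragment and reduces both to the same plain dictionary. The table name `articles`, the field names `source` and `text`, and the sample content are assumptions made for illustration, and an in-memory SQLite database stands in for an RDBMS reached through ODBC.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def from_html(html):
    """Normalize tagged text into the common record shape."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"source": "html", "text": " ".join(parser.chunks)}


def from_rdbms(conn):
    """Normalize database rows (programmatic SQL access) into the same shape."""
    rows = conn.execute("SELECT body FROM articles").fetchall()
    return [{"source": "rdbms", "text": body} for (body,) in rows]


# Hypothetical sample data: the same sentence stored in two different sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (body TEXT)")
conn.execute("INSERT INTO articles VALUES (?)", ("Ship arrives Monday",))

records = from_rdbms(conn) + [from_html("<p>Ship <b>arrives</b> Monday</p>")]
```

Once both sources share one record shape, downstream tools such as classifiers or indexers no longer need to know where a record came from.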


Distribution of Data Sources

- Access involves the use of protocols (HTTP, HTTPS, FTP, SCP, …) to go through firewalls.
- With business applications we still need security, and we need to limit views to selected individuals and groups.
- Additional protocols (XML, IIOP, SOAP and Web Services) are being used to build tools for integrating systems.
- To deliver messages to components through HTTP, a protocol is needed. The Simple Object Access Protocol (SOAP), written in XML, is emerging as that protocol.


What is being used to identify distributed data sources: Distributed Directories and protocols

- The Domain Name System (DNS) is a hierarchically distributed directory of names (home.himolde.no) and IP addresses.
- The X.500 directory service is a hierarchically distributed directory of objects. Object attribute-value pairs may be stored and looked up.
- LDAP is a protocol for accessing a directory service.
- Most visions of the Web imagine "federated" servers to help find objects.
- UDDI is one protocol for advertising and discovery.

The Web Today

[Figure: a Client, a Web Server, and four DN Servers. Steps: 1. Location Lookup; 2. Object Request.]

The Web with Object Directories

[Figure: a Client, two Web Servers, four DN Servers, and an LDAP Server. Steps: 1. Registration; 2. Attribute/Value Request and Object/Location Response; 3. The Rest.]
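The three steps named in this diagram (registration, an attribute/value request answered with object locations, and "the rest") can be sketched with a toy in-memory directory. This is only an illustration of the idea and not an LDAP client; the class `ObjectDirectory`, its methods, the attribute names, and the example.org locations are all hypothetical.

```python
class ObjectDirectory:
    """Toy in-memory stand-in for an LDAP-style object directory."""

    def __init__(self):
        self._entries = []  # each entry: (attribute dict, object location)

    def register(self, location, **attributes):
        """Step 1: a web server registers an object and its attributes."""
        self._entries.append((dict(attributes), location))

    def lookup(self, **query):
        """Step 2: return locations of objects whose attributes match the query."""
        return [
            location
            for attributes, location in self._entries
            if all(attributes.get(key) == value for key, value in query.items())
        ]


directory = ObjectDirectory()
directory.register("http://example.org/reports/q3.html", type="report", topic="shipping")
directory.register("http://example.org/reports/q4.html", type="report", topic="finance")
directory.register("http://example.org/img/ship.png", type="image", topic="shipping")

locations = directory.lookup(type="report", topic="shipping")
# Step 3 ("the rest") would be the usual DNS lookup and HTTP request for each location.
```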


Data Size and the Relevance Factor

- Large repositories, like the WWW, need a system to drill down to subsets of relevant information.
- Speed and automation are critical. (Find not just more results, but better ones.)
- Find a particular needle in a haystack with a billion needles.
- Find all the needles that are similar to some other needle that has already been discovered.

What can help? Semantic web technology

- XML and the Resource Description Framework (RDF) will allow XML tags to be labeled in conjunction with a referential knowledge representation.
- Machine-based inference engines should replace today's search engines.
- New editors are needed to infuse semantic information into content easily, just as some editors let users who do not know HTML syntax create web pages.

Syntactic Integration

Structural Integration

Semantic Integration

RDF

RDF provides a simple data model for expressing statements using (subject, predicate, object) triples, and an associated serialization syntax in XML. All three elements of the triple can be defined within the current document or refer to another resource on the Web.

As an example of RDF applied in a logistics context, we model the three entities ship, container and item.

RDF in use

In RDF we can express relations between entities, such as a ship transports a container, and a container contains an item. These relations can be, but need not be, hierarchical: a business can be the owner of the transported item and at the same time the user of the container. It is important to note that these relations can change over time: ownership moves from one business to another, and containers move from ships to trucks for further transportation. These transitions may trigger events, like financial transactions or notifications.

An ontology can be used to define all the concepts and their meaning used in a certain (set of) schema(s).
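The ship, container and item relations can be sketched as (subject, predicate, object) triples held in a plain Python set. This is an illustration of the data model only, not RDF/XML serialization or any particular RDF library; the `ex:` identifiers and the helper names `match` and `move_container` are hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("ex:ship1", "ex:transports", "ex:container7"),
    ("ex:container7", "ex:contains", "ex:item42"),
    ("ex:acme", "ex:owns", "ex:item42"),       # non-hierarchical relation
    ("ex:acme", "ex:uses", "ex:container7"),   # same business, different role
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def move_container(store, container, new_carrier):
    """Relations change over time: hand the container to a new carrier."""
    kept = {t for t in store if not (t[1] == "ex:transports" and t[2] == container)}
    return kept | {(new_carrier, "ex:transports", container)}


acme_relations = match(triples, s="ex:acme")
after_unloading = move_container(triples, "ex:container7", "ex:truck3")
carriers = match(after_unloading, p="ex:transports")
```

A real system could hook a notification or a financial transaction onto `move_container`, which is where the events mentioned above would be triggered.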

Components of Semantic Technology

- Classification
- Metadata
- Ontologies (taxonomies)

Classification

- General keyword searches lead to many irrelevant results.
- An automatic classification system could, for example, divide 1000 stories into 5 categories, so that keyword searches would be more relevant.
- Techniques for classification:
  - Statistical analysis and pattern matching
  - Rule-based methods
  - Linguistic analysis
  - Bayesian theory (probabilistic)
  - Ontology driven: name-entity and domain-phrase recognition
  - Committee-based approaches use various techniques
- Classification is more precise if documents are tagged with metadata and conform to a predetermined schema.
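To make the Bayesian (probabilistic) technique concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is one simple instance of the approach and not the method of any product named in these slides; the training sentences and the category labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: two categories, two stories each.
model = train([
    ("ship docks at port", "shipping"),
    ("container loaded on ship", "shipping"),
    ("stock price rises", "finance"),
    ("bank reports profit", "finance"),
])
```

With only four training stories the model is a toy, but it shows why a training set lets later stories be sorted into categories automatically.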

Metadata

- Data about the data
- Levels of metadata:
  - Syntactic
  - Structural
  - Semantic

Syntactic Metadata

- General information
- Of little use for determining context
- Document size, location, date of creation, ...
- Used in:
  - Assessment of the document's relevance
  - Version tracking
  - User-level access policies
- Email and documents in file systems have this information.
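For files, this kind of metadata can be read straight from the file system. The sketch below uses Python's standard `os.stat`; the function name `syntactic_metadata`, the dictionary keys, and the sample file are assumptions, and the last-modified time stands in for the creation date because creation time is not available portably.

```python
import os
import tempfile
from datetime import datetime, timezone


def syntactic_metadata(path):
    """Collect size, location and a timestamp for one file."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size_bytes": info.st_size,
        # Creation time is platform dependent, so last-modified time is used.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Hypothetical usage: describe a small file written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "note.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("hello")
    metadata = syntactic_metadata(path)
```

Nothing in this record says what the document is about, which is the limitation noted above.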

Structural Metadata

- Information about the structure of content
- Varies widely with document type
- XML allows creators to enclose content within meaningful tags.
- Can make associations between content from multiple documents.
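A small sketch of how meaningful tags let a program associate content across documents, using Python's standard `xml.etree.ElementTree`. The two XML fragments and their tag names (`manifest`, `invoice`, `container`, and so on) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents of different types that share one tagged value.
DOC_A = "<manifest><ship>Aurora</ship><container>C7</container></manifest>"
DOC_B = "<invoice><container>C7</container><item>pump</item></invoice>"


def tagged_values(xml_text):
    """Return the set of (tag, text) pairs found in a document."""
    root = ET.fromstring(xml_text)
    return {
        (element.tag, element.text.strip())
        for element in root.iter()
        if element.text and element.text.strip()
    }


# The association between the documents is the tagged content they share.
shared = tagged_values(DOC_A) & tagged_values(DOC_B)
```

The match relies on both authors having chosen the same tag name, which is exactly the gap that semantic metadata is meant to close.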

Semantic Metadata

Semantic Metadata is “data which may be associated explicitly or implicitly with a given piece of content (such as a document) and whose relevance for that content is determined by its ontological position (its context) within one or more domains of knowledge.”

Semantic Metadata

Metadata receives its contextual information from a reference knowledgebase.

Metadata that is extracted from any document may be stored as a snapshot of that document’s relevant information.

The metadata contained within this snapshot simply references the instances of name-entities, which are stored in the ontology.

Each name-entity has related information stored: synonyms, attributes, related entities.

Semantic Metadata

- Documents can link to each other in several ways:
  - Explicit metadata: docs that mention exactly the same metadata.
  - Implicitly related metadata: docs that contain synonyms or hierarchically related name-entities.
  - Ontological associations: by name-entity associations, e.g. one doc mentions a company name while another mentions the ticker symbol.
- Standards: DCML defines a generic element set that is not specific to any domain of knowledge. It can be used as a top domain.
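The company-name and ticker-symbol case can be sketched as follows: each document is reduced to the set of name-entities it references in a small reference ontology, and two documents are related when those sets overlap. The ontology entry, the company "Example Shipping Ltd", its ticker "EXSH", and the sample sentences are all hypothetical.

```python
import re

# Hypothetical reference ontology: one name-entity with its surface forms.
ONTOLOGY = {
    "example-shipping": {
        "names": ["Example Shipping Ltd", "Example Shipping"],
        "ticker": "EXSH",
    },
}


def entities_in(text):
    """Return the ids of all name-entities mentioned in the text."""
    found = set()
    for entity_id, entry in ONTOLOGY.items():
        surface_forms = entry["names"] + [entry["ticker"]]
        for form in surface_forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(entity_id)
                break
    return found


doc_a = "Example Shipping Ltd orders two new vessels."
doc_b = "EXSH closed higher today."
doc_c = "Rain is expected on Monday."

# doc_a and doc_b share no keyword, yet both reference the same name-entity.
shared_entities = entities_in(doc_a) & entities_in(doc_b)
```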

Forms of knowledge representation

- Dictionary: terms are the keys and definitions are the values. There are no links between terms.
- Thesaurus: includes antonyms and synonyms. The pieces of knowledge are linked.
- Taxonomy: includes etymological information (derivation), and synonyms are organized hierarchically (inheritance).
  - Flower is a subclass of plant. But a rose may be related to love.
  - Associations may be emotional, cultural, temporal.
  - Relevant associations can be discovered by a data-analysis system utilizing a reference knowledge base.
- Ontology: is the labeling of the relationships in the taxonomy.
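The four forms differ mainly in how much linking they carry, which a few Python structures can illustrate. This is a sketch under invented sample entries; the names `SUBCLASS_OF`, `LABELED_RELATIONS`, `is_a` and `related` are hypothetical.

```python
# Dictionary: term -> definition, with no links between terms.
dictionary = {"rose": "a shrub with prickly stems and fragrant flowers"}

# Thesaurus: terms linked to their synonyms and antonyms.
thesaurus = {"big": {"synonyms": {"large"}, "antonyms": {"small"}}}

# Taxonomy: terms organized hierarchically (child -> parent).
SUBCLASS_OF = {"rose": "flower", "flower": "plant"}

# Ontology: relationships carry labels, including non-hierarchical ones.
LABELED_RELATIONS = {("rose", "symbolizes", "love")}


def is_a(term, ancestor):
    """Follow the hierarchy upward to test inherited class membership."""
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        if term == ancestor:
            return True
    return False


def related(term):
    """Return the labeled associations that start at the given term."""
    return {(label, obj) for subj, label, obj in LABELED_RELATIONS if subj == term}
```

Here `is_a("rose", "plant")` follows two hierarchy links, while the association between rose and love is found only through the labeled relation.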

Types of Metadata

Ontology Description Languages

Knowledge model building in a given domain is subjective

Problems combining independently developed ontologies

The Resource Description Framework (RDF) and RDF-Schema (RDF-S) data model tries to address this:

- Resource: an item of interest at the atomic level, such as an entity, concept or document. Each resource is uniquely identified by a URI.
- Properties: descriptive characteristics and attributes of a resource. They may be associative, relating one resource to another.
- Statement: what is known as an RDF triple. It contains a reference to a resource, a property name, and that property's value. These identifiers take the form of link addresses.

Ontology Description Languages

- RDF-S (specification for ontology modeling): http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
- Dublin Core Metadata Initiative: http://dc2003.ischool.washington.edu/program.html
- DARPA Agent Markup Language + Ontology Inference Layer (DAML+OIL) expands on RDF-S. Classes are defined as elements and can be related to other classes in disjunction, union, or equality.
- The W3C has a Web Ontology Language (OWL) that is based on DAML+OIL.

Meta-data Interpretation

- DAML (DARPA) endeavors to interpret a simple ontology to infer information about resources.
- Put very simply:
  - If people have names
  - If students are people
  - If resource X is about a student
  - Resource X should have a name
- This kind of inference could easily be constructed within the context of an object-oriented directory.
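The people, students and names example can be sketched in a few lines by walking a subclass chain and collecting the properties each class is expected to have. This is an illustration of the inference pattern only, not DAML syntax or a DAML reasoner; the names `SUBCLASS_OF`, `EXPECTED_PROPERTIES` and `expected_properties` are hypothetical.

```python
# "Students are people" expressed as a subclass link.
SUBCLASS_OF = {"Student": "Person"}

# "People have names" expressed as an expected property of the class.
EXPECTED_PROPERTIES = {"Person": {"name"}}


def expected_properties(cls):
    """Collect properties declared on the class and on all of its ancestors."""
    properties = set()
    while cls is not None:
        properties |= EXPECTED_PROPERTIES.get(cls, set())
        cls = SUBCLASS_OF.get(cls)
    return properties


# "Resource X is about a student", so a name can be inferred for it.
resource_x = {"about": "Student"}
inferred = expected_properties(resource_x["about"])
```

The loop plays the role that inheritance plays in an object-oriented directory: a property declared once on `Person` applies to every subclass.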

Schema Interpretation – and integration

- Consider two sets of resources:
  - For set A, the attributes are structured in accord with the kind of metadata described on the previous slide.
  - Imagine the same for set B, but using different attribute names and values.
- Accept that the attribute-values are called resource descriptions, and that a document called a resource description schema defines the relations for each set.
- Imagine the two schemas are related through a third schema.
- Finally, imagine an engine that relates resources in set A to resources in set B based on schema-level inferences.
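A minimal sketch of that engine, assuming each schema can be mapped onto a shared third vocabulary by simple attribute renaming. The attribute names in both schemas, the common vocabulary, the sample descriptions, and the value normalization (trimming and lower-casing) are all assumptions made for illustration.

```python
# Mappings from each set's attribute names onto a shared (third) schema.
A_TO_COMMON = {"fullName": "name", "programme": "programme"}
B_TO_COMMON = {"navn": "name", "studieprogram": "programme"}


def to_common(description, mapping):
    """Translate one resource description into the shared vocabulary."""
    return {
        mapping[key]: str(value).strip().lower()
        for key, value in description.items()
        if key in mapping
    }


def relate(set_a, set_b):
    """Pair resources whose descriptions agree once both are translated."""
    return [
        (id_a, id_b)
        for id_a, desc_a in set_a.items()
        for id_b, desc_b in set_b.items()
        if to_common(desc_a, A_TO_COMMON) == to_common(desc_b, B_TO_COMMON)
    ]


# Hypothetical resource descriptions using different names and value styles.
set_a = {"a1": {"fullName": "Kari Nordmann", "programme": "Logistics"}}
set_b = {
    "b1": {"navn": "kari nordmann", "studieprogram": "logistics"},
    "b2": {"navn": "Ola Nordmann", "studieprogram": "logistics"},
}

matches = relate(set_a, set_b)
```

Exact equality after translation is the simplest possible rule; a real engine would need richer inferences, such as subclass or synonym relations between attribute values.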

The Semantic Web Vision

[Figure: a Client, two Web Servers, four DN Servers, three LDAP Servers, and four Schema Servers. Steps: 1. Schema Registration; 2. Description Association; 3. Object Query; 4. Inferencing; 5. The Rest.]

Sample Knowledgebases

- WordNet is a networked thesaurus, developed at Princeton, in the form of a lexical matrix. It maps word forms to word meanings in a many-to-many relationship. The set of word forms that express one meaning is a synset.
  - It is not an ontology because it does not contain the real-world information required in labeled relationships, such as: a "branch" is an administrative division with a chairman above it.
- Open Directory Project: http://www.dmoz.org/
- The National Library of Medicine has an ontology system, the Unified Medical Language System (UMLS), with researchers and institutions contributing to it. http://www.nlm.nih.gov/research/umls/

Toolkits should provide for:

- Establishing configurable parameters
- Extraction agents and classifier modules
- The system should accept training sets of data and learn from patterns, so that future items are classified without a manual trigger.
- An easily navigable visual environment
- Tracking date and time of data entry
- ROADS provides tools for creating subject gateways: http://www.ilrt.bristol.ac.uk/roads/

Extracting Wrapper Technologies

- WysiWyg Web Wrapper Factory (W4F): crawls and retrieves data from web pages to create wrappers that represent the content of the pages.
- ANDES: uses XPath rules.
- XWRAP toolkit: has interactive rules formulation.
- S-CREAM (semiautomatic creation of metadata): lets the user annotate documents.
- Ontoprise (product by Semagix): http://www.ontoprise.com
- BUT, an ontology-driven classifier and domain-specific metadata annotator allows searching on classification by keyword AND on implied entity association. (SEE example on next slide.)
- Semagix Visualizer: a visualization tool for viewing an ontology or schema.

Related References

- The E-Speak Initiative at the University of Pittsburgh: http://bazaar.sis.pitt.edu/
- E-Speak Overview (http://bazaar.sis.pitt.edu/es_ppt_over/AIntrotoESpeak_files/frame.htm)
- E-Speak Revised (http://bazaar.sis.pitt.edu/es_ppt_over/AESpeakRevisited_files/frame.htm)
- Oracle9i Data Mining Concepts: Oracle9i AS Personalization is used to build data mining models.
