Content Management and the role of taxonomies Judith Molka-Danielsen Oct. 13, 2003.
Content Management
and the role of taxonomies
Judith Molka-Danielsen, Oct. 13, 2003
Primary Challenges for Content Management Systems
Heterogeneous Data Sources – create a normalized representation of data so that humans and machines can read it equally well.
Retrieving data from an RDBMS involves programmatic access (ODBC, SQL).
HTML files consist of tagged text that mixes stylistic and structural information. Browsers interpret the same code in different ways, which confuses automated programs, although humans manage it.
Word processing applications (Word, Acrobat) store binary data that is converted to text by a proprietary interpreter and an associated viewer. We want viewers to interoperate with other formats.
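As a minimal sketch of what a "normalized representation" could look like, the Python below reads one row from a relational table and one HTML fragment and turns both into the same kind of record. The table name `notes`, the record fields and the sample content are assumptions made for illustration; they are not part of any particular content management system.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def record_from_html(source_id, html):
    """Normalize tagged text into a plain record."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"source": source_id, "kind": "html", "text": " ".join(parser.chunks)}


def records_from_db(conn):
    """Normalize rows of the (hypothetical) notes table into plain records."""
    rows = conn.execute("SELECT id, body FROM notes ORDER BY id")
    return [{"source": f"db:{row_id}", "kind": "rdbms", "text": body}
            for row_id, body in rows]


# Sample data: an in-memory database and a small HTML fragment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES ('Container 42 loaded')")

records = records_from_db(conn) + [
    record_from_html("page1", "<h1>Ship</h1><p>Arrived <b>today</b></p>")
]
```

Once both sources share one record shape, a downstream indexer or classifier does not need to know where a piece of text came from.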
Primary Challenges for Content Management Systems
Distribution of Data Sources
Access involves the use of protocols (HTTP, HTTPS, FTP, SCP, …) to go through firewalls.
With business applications we still need security, and we need to limit views to selected individuals and groups.
Additional protocols and formats (XML, IIOP, SOAP and Web Services) are being used to build tools for integrating systems.
To deliver messages to components over HTTP, a protocol is needed. The Simple Object Access Protocol (SOAP), which is written in XML, is emerging as that protocol.
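To make "a protocol written in XML" concrete, here is a small sketch that builds and reads a SOAP 1.1 envelope with Python's standard library. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:cms` and the `GetDocument` operation are hypothetical names chosen for the example, and no message is actually sent over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:cms"  # hypothetical application namespace


def build_request(doc_id):
    """Wrap a (hypothetical) GetDocument call in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetDocument")
    ET.SubElement(call, f"{{{APP_NS}}}id").text = str(doc_id)
    return ET.tostring(envelope, encoding="unicode")


def read_doc_id(message):
    """Receiver side: pull the document id back out of the envelope."""
    root = ET.fromstring(message)
    node = root.find(f"{{{SOAP_NS}}}Body/{{{APP_NS}}}GetDocument/{{{APP_NS}}}id")
    return node.text


message = build_request(216)
```

In practice the string in `message` would be the body of an HTTP POST, which is how the message reaches a component behind a firewall.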
Primary Challenges for Content Management Systems
What is being used to identify distributed data sources: distributed directories and protocols
The Domain Name System (DNS) is a hierarchically distributed directory of names (home.himolde.no) and IP addresses.
The X.500 directory service is a hierarchically distributed directory of objects. Object attribute-value pairs may be stored and looked up.
LDAP is a protocol for accessing a directory service.
Most visions of the Web imagine “federated” servers to help find objects.
UDDI is one protocol for advertising and discovery.
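The two kinds of directory described above can be sketched with toy in-memory data. The first structure resolves a name by walking its labels from right to left, as a hierarchical name directory does; the second stores attribute-value pairs and answers searches on them, in the spirit of X.500. All entries are made up, and the address comes from the documentation range 192.0.2.0/24 rather than from the real host.

```python
# Toy hierarchical name directory: each level delegates to the next label.
# The address is a documentation-range placeholder, not the real one.
NAME_TREE = {"no": {"himolde": {"home": "192.0.2.10"}}}


def resolve(name):
    """Walk the labels right to left: no -> himolde -> home."""
    node = NAME_TREE
    for label in reversed(name.split(".")):
        node = node[label]
    return node


# Toy object directory: every entry is a set of attribute-value pairs.
ENTRIES = [
    {"cn": "Course notes", "type": "document", "host": "home.himolde.no"},
    {"cn": "Staff list", "type": "document", "host": "home.himolde.no"},
    {"cn": "Printer 3", "type": "device", "host": "home.himolde.no"},
]


def search(**wanted):
    """Return the entries whose attributes match every requested value."""
    return [e for e in ENTRIES if all(e.get(k) == v for k, v in wanted.items())]
```

For example, `search(type="document")` returns the two document entries, and `resolve` can then turn the `host` attribute of either one into an address.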
The Web Today
[Figure: a Client, a Web Server and several DN Servers. Steps shown: 1. Location Lookup; 2. Object Request.]
The Web with Object Directories
[Figure: a Client, Web Servers, several DN Servers and an LDAP Server. Steps shown: 1. Registration; 2. Attribute/Value Request and Object/Location Response; 3. The Rest.]
Primary Challenges for Content Management Systems
Data Size and the Relevance Factor
Large repositories like the WWW need a system to drill down to subsets of relevant information.
Speed and automation are critical. (Find not just more results, but better ones.)
Find a particular needle in a haystack with a billion needles.
Find all the needles that are similar to some other needle that has already been discovered.
What can help? Semantic web technology
XML and the Resource Description Framework (RDF) will allow XML tags to be labeled in conjunction with a referential knowledge representation.
Machine-based inference engines should replace today's search engines.
New editors are needed to infuse semantic information into content easily, just as some editors let users who do not know HTML syntax create web pages.
Syntactic Integration
Structural Integration
Semantic Integration
RDF
RDF provides a simple data model for expressing statements using (subject, predicate, object) triples, and an associated serialization syntax in XML. All three elements of the triple can be defined within the current document or refer to another resource on the Web.
As an example of RDF applied in a logistics context, we model three entities: ship, container and item.
RDF in use
In RDF we can express relations between entities, such as a ship transports a container, and a container contains an item. These relations can, but need not, be hierarchical; for example, a business can be the owner of the transported item and at the same time the user of the container. It is important to note that these relations can change over time: ownership moves from one business to another, and containers move from ships to trucks for further transportation. These transitions may trigger events, such as financial transactions or notifications.
An ontology can be used to define all the concepts and their meaning used in a certain (set of) schema(s).
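The ship, container and item example can be sketched as plain (subject, predicate, object) triples. This is an illustration in ordinary Python, not the RDF/XML serialization itself; the `ex:` names stand in for full URIs and, like the business `ex:acme` and the truck, are hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
# The "ex:" names are hypothetical identifiers standing in for URIs.
triples = {
    ("ex:ship1", "ex:transports", "ex:container7"),
    ("ex:container7", "ex:contains", "ex:item99"),
    ("ex:acme", "ex:owns", "ex:item99"),
    ("ex:acme", "ex:uses", "ex:container7"),
}


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


def move_container(container, new_carrier):
    """Relations change over time: swap the 'transports' statement
    and return the notifications that the transition triggers."""
    old = {t for t in triples if t[1] == "ex:transports" and t[2] == container}
    triples.difference_update(old)
    triples.add((new_carrier, "ex:transports", container))
    return [f"notify: {container} moved from {t[0]} to {new_carrier}"
            for t in sorted(old)]
```

Asking `match(o="ex:item99")` shows the non-hierarchical side of the model: the item is contained by a container and owned by a business at the same time.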
Components of Semantic Technology
Classification
Metadata
Ontologies (taxonomies)
Classification
General keyword searches lead to many irrelevant results.
An automatic classification system could, for example, divide 1000 stories into 5 categories, so keyword searches would be more relevant.
Techniques for classification:
Statistical analysis and pattern matching
Rule-based methods
Linguistic analysis
Bayesian theory (probabilistic)
Ontology driven: name-entity and domain-phrase recognition
Committee-based approaches, which use various techniques
Classification is more precise if documents are tagged with metadata and conform to a predetermined schema.
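As one illustration of the Bayesian technique listed above, here is a minimal multinomial naive Bayes classifier with add-one smoothing. The two categories and the four training sentences are invented for the example; a real system would train on far more data and use better tokenization.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Pick the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # category prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training set with two categories.
training = [
    ("ship docks at port with containers", "logistics"),
    ("container moved from ship to truck", "logistics"),
    ("shares rise as bank reports profit", "finance"),
    ("bank cuts interest rate", "finance"),
]
model = train(training)
```

With this model, `classify("truck carries container", model)` selects the logistics category even though "carries" never appeared in training, which is the job the smoothing term does.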
Metadata
Data about the data
Levels of metadata: syntactic, structural, semantic
Syntactic Metadata
General information
Offers little for context determination
Document size, location, date of creation, ...
Used in: assessment of the document’s relevance, version tracking, user-level access policies
Email and documents in file systems have this information.
Structural Metadata
Information about the structure of content
Varies widely with document type
XML allows creators to enclose content within meaningful tags.
Can make associations between content from multiple documents.
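A short sketch of how meaningful tags allow content from different documents to be associated: both documents below tag a company name, so a program can link them without understanding the prose. The two XML fragments and their tag names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Two invented documents of different types that share one tag.
doc_a = "<story><company>Acme</company><topic>shipping</topic></story>"
doc_b = "<report><company>Acme</company><quarter>Q3</quarter></report>"


def tagged_values(xml_text):
    """Map each leaf tag to its text: the structural metadata we keep."""
    root = ET.fromstring(xml_text)
    return {el.tag: el.text for el in root.iter() if len(el) == 0}


def shared(a, b):
    """Tags that carry the same value in both documents."""
    meta_a, meta_b = tagged_values(a), tagged_values(b)
    return {tag: meta_a[tag] for tag in meta_a if meta_b.get(tag) == meta_a[tag]}
```

Here `shared(doc_a, doc_b)` reports the common `company` value, which is the association between the story and the report.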
Semantic Metadata
Semantic Metadata is “data which may be associated explicitly or implicitly with a given piece of content (such as a document) and whose relevance for that content is determined by its ontological position (its context) within one or more domains of knowledge.”
Semantic Metadata
Metadata receives its contextual information from a reference knowledgebase.
Metadata that is extracted from any document may be stored as a snapshot of that document’s relevant information.
The metadata contained within this snapshot simply references the instances of name-entities, which are stored in the ontology.
Each name-entity has related information stored: synonyms, attributes, related entities.
Semantic Metadata
Documents can link to each other in several ways:
Explicit metadata – docs that mention exactly the same metadata
Implicitly related metadata – docs that contain synonyms or hierarchically related name entities
Ontological associations – by name-entity associations; for example, one doc mentions a company name while another mentions its ticker symbol
Standards: DCML defines a generic element set, non-specific to domain of knowledge. Can be used as a top domain.
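The three ways documents can link (explicit, implicit and ontological) can be sketched against a miniature reference ontology. Each document is represented only by the set of terms extracted from it. The company, its synonyms and the ticker symbol are hypothetical, and real systems would also handle hierarchical relations, which this sketch leaves out.

```python
# Hypothetical miniature ontology: each name-entity lists its synonyms and
# the entities it is associated with (a company and its ticker symbol).
ONTOLOGY = {
    "acme_corp": {"synonyms": {"acme corp", "acme corporation"},
                  "related": {"acme_ticker"}},
    "acme_ticker": {"synonyms": {"acm"},
                    "related": {"acme_corp"}},
}


def entity_of(term):
    """Look up which name-entity a term refers to, if any."""
    for name, info in ONTOLOGY.items():
        if term in info["synonyms"]:
            return name
    return None


def link_type(terms_a, terms_b):
    """Classify how two documents' extracted terms connect them."""
    if terms_a & terms_b:
        return "explicit"        # the same exact metadata
    ents_a = {entity_of(t) for t in terms_a} - {None}
    ents_b = {entity_of(t) for t in terms_b} - {None}
    if ents_a & ents_b:
        return "implicit"        # synonyms of the same entity
    if any(ONTOLOGY[e]["related"] & ents_b for e in ents_a):
        return "ontological"     # associated entities
    return None
```

A document mentioning "acme corporation" and another mentioning "acm" share no term and no entity, yet `link_type` still connects them through the association stored in the ontology.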
Forms of knowledge representation
Dictionary – terms are the keys and definitions are the values. There are no links between terms.
Thesaurus – includes antonyms and synonyms. The pieces of knowledge are linked.
Taxonomy – includes etymological information (derivation), and synonyms are organized hierarchically (inheritance). Flower is a subclass of plant. But a rose may be related to love. Associations may be emotional, cultural or temporal. Relevant associations can be discovered by a data-analysis system utilizing a reference knowledge base.
Ontology – the labeling of the relationships in the taxonomy.
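The difference between a taxonomy and an ontology can be sketched with the flower example. The taxonomy holds only unlabeled parent links, so all it can answer is "is X a kind of Y". The ontology adds labeled relationships such as the rose and love association. The rose-to-flower link and the label `symbolizes` are assumptions added for the example.

```python
# Taxonomy: child -> parent links only ("is a subclass of").
# The rose -> flower link is an assumption added for illustration.
PARENT = {"rose": "flower", "flower": "plant"}


def is_a(term, ancestor):
    """Follow parent links upward; inheritance makes the relation transitive."""
    while term in PARENT:
        term = PARENT[term]
        if term == ancestor:
            return True
    return False


# Ontology: the relationships carry labels and need not be hierarchical.
RELATIONS = [
    ("flower", "subclass_of", "plant"),
    ("rose", "subclass_of", "flower"),
    ("rose", "symbolizes", "love"),  # a cultural association, not inheritance
]


def related(term, label):
    """Objects linked to a term by one specific labeled relationship."""
    return [obj for subj, lbl, obj in RELATIONS if subj == term and lbl == label]
```

`is_a("rose", "plant")` holds through inheritance, while `related("rose", "symbolizes")` can only be answered because the relationship has a label.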
Types of Metadata
Ontology Description Languages
Knowledge model building in a given domain is subjective
Problems combining independently developed ontologies
The Resource Description Framework (RDF) and RDF Schema (RDF-S) data model tries to address this:
Resource – an item of interest at the atomic level: an entity, concept or document. Each resource is uniquely identified by a URI.
Properties – descriptive characteristics and attributes of a resource. They may be associative, relating one resource to another.
Statement – what is known as an RDF triple. It contains a reference to a resource, a property name, and that property’s value. These identifiers take the form of link addresses.
Ontology Description Languages
RDF-S (specification for ontology modeling): http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
Dublin Core Metadata Initiative: http://dc2003.ischool.washington.edu/program.html
DARPA Agent Markup Language + Ontology Inference Layer (DAML+OIL) expands on RDF-S. Classes are defined as elements and can be related to other classes in disjunction, union, or equality.
The W3C has a Web Ontology Language (OWL) that builds on DAML+OIL.
Meta-data Interpretation
DAML (DARPA) endeavors to interpret a simple ontology to infer information about resources.
Put very simply:
If people have names
If students are people
If resource X is about a student
Then resource X should have a name
This kind of inference could easily be constructed within the context of an object-oriented directory.
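The student example can be written down directly as three small tables and one function that walks up the class hierarchy. This is a sketch of the inference pattern only, not of how DAML itself is implemented; the table and function names are made up.

```python
SUBCLASS_OF = {"Student": "Person"}        # students are people
CLASS_PROPERTIES = {"Person": {"name"}}    # people have names
RESOURCE_TYPE = {"resourceX": "Student"}   # resource X is about a student


def expected_properties(resource):
    """Collect the properties a resource should have by inheriting
    them from its class and every superclass above it."""
    props = set()
    cls = RESOURCE_TYPE.get(resource)
    while cls is not None:
        props |= CLASS_PROPERTIES.get(cls, set())
        cls = SUBCLASS_OF.get(cls)
    return props
```

Nothing states that resource X has a name; `expected_properties("resourceX")` derives it from the two general rules, which is the same lookup an object-oriented directory performs when it resolves inherited attributes.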
Schema Interpretation – and integration
Consider two sets of resources:
For set A, the attributes are structured in accord with the kind of metadata described on the previous slide.
Imagine the same for set B, but using different attribute names and values.
Accept that the attribute-values are called resource descriptions, and that a document called a resource description schema defines the relations for each set.
Imagine the two schemas are related through a third schema.
Finally, imagine an engine that relates resources in set A to resources in set B based on schema-level inferences.
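A minimal sketch of that idea: sets A and B describe the same kind of resource with different attribute names and values, a third mapping plays the role of the shared schema, and a small function relates resources across the sets after translating both into shared terms. The attribute names, the value table and the sample resources are all hypothetical.

```python
# Two sets of resource descriptions with different names and values.
set_a = [{"id": "a1", "author": "Molka-Danielsen", "lang": "en"}]
set_b = [{"id": "b7", "creator": "Molka-Danielsen", "language": "English"}]

# Hypothetical third schema: maps each local attribute to a shared one.
TO_SHARED = {
    "A": {"author": "creator", "lang": "language"},
    "B": {"creator": "creator", "language": "language"},
}
# Shared vocabulary for values that are spelled differently in A and B.
VALUE_MAP = {"language": {"en": "english", "English": "english"}}


def normalize(resource, schema):
    """Translate one resource description into the shared schema."""
    out = {}
    for attr, value in resource.items():
        shared_attr = TO_SHARED[schema].get(attr)
        if shared_attr:
            out[shared_attr] = VALUE_MAP.get(shared_attr, {}).get(value, value)
    return out


def related_pairs():
    """Relate resources in A to resources in B via the shared schema."""
    return [(a["id"], b["id"]) for a in set_a for b in set_b
            if normalize(a, "A") == normalize(b, "B")]
```

Neither set needs to change its own attribute names; only the mapping through the third schema has to be maintained.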
The Semantic Web Vision
[Figure: a Client, Web Servers, several DN Servers, LDAP Servers and Schema Servers. Steps shown: 1. Schema Registration; 2. Description Association; 3. Object Query; 4. Inferencing; 5. The Rest.]
Sample Knowledgebases
WordNet is a networked thesaurus, developed at Princeton, in the form of a lexical matrix. It maps word forms to word meanings in a many-to-many relationship. The set of word forms that share a meaning is a synset. It is not an ontology because it does not contain the real-world information required in labeled relationships, such as: a “branch” is an administrative division with a chairman above it.
Open Directory Project http://www.dmoz.org/
The National Library of Medicine has an ontology system, the Unified Medical Language System (UMLS), with researchers and institutions contributing to it. http://www.nlm.nih.gov/research/umls/
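The lexical matrix behind WordNet can be pictured as a many-to-many mapping between word forms and meanings. The toy data below is invented and is not taken from WordNet itself; the meaning ids and glosses in the comments are placeholders.

```python
# Toy lexical matrix: meaning id -> the word forms that express it (a synset).
SYNSETS = {
    "m1": {"branch", "limb", "bough"},       # part of a tree
    "m2": {"branch", "subdivision", "arm"},  # division of an organization
    "m3": {"arm", "limb"},                   # part of the body
}


def meanings(word):
    """One word form can map to several meanings."""
    return sorted(m for m, forms in SYNSETS.items() if word in forms)


def synonyms(word):
    """All other word forms that share at least one meaning with the word."""
    found = set()
    for forms in SYNSETS.values():
        if word in forms:
            found |= forms
    return found - {word}
```

The matrix says that "branch" and "subdivision" can mean the same thing, but it carries no labeled relationship stating that a branch has a chairman above it, which is the gap between a thesaurus and an ontology noted above.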
Toolkits should provide for:
Establishing configurable parameters
Extraction agents and classifier modules
Accepting training sets of data and learning from patterns, so that future items are classified without a manual trigger
An easily navigable visual environment
Tracking the date and time of data entry
ROADS provides tools for creating subject gateways, http://www.ilrt.bristol.ac.uk/roads/
Extracting Wrapper Technologies
WysiWyg Web Wrapper Factory (W4F): crawls and retrieves data from web pages to create wrappers that represent the content of the pages.
ANDES: uses XPath rules.
XWRAP toolkit: has interactive rule formulation.
S-CREAM (semiautomatic creation of metadata): lets the user annotate documents.
Ontoprise (product by Semagix) http://www.ontoprise.com
BUT an ontology-driven classifier and domain-specific metadata annotator allows searching on classification by keyword AND on implied entity association. (SEE example on next slide.)
Semagix Visualizer – is a visualization tool for viewing an ontology or schema.