
Institut für Informatik der

Friedrich-Schiller-Universität Jena

An approach for semantic enrichment of social media resources for context dependent processing

Diplomarbeit zur Erlangung des akademischen Grades

Diplom-Informatiker

vorgelegt von

Oliver Schimratzki

betreut von

Birgitta König-Ries

Fedor Bakalov

January 26, 2010


Department of Computer Science at

Friedrich-Schiller-University Jena

An approach for semantic enrichment of social media resources for context dependent processing

Diploma Thesis submitted for the degree of

Diplom-Informatiker

submitted by

Oliver Schimratzki

supervised by

Birgitta König-Ries

Fedor Bakalov

January 26, 2010


Abstract

This diploma thesis provides the functional basis for information filtering in the domain of complexity. It helps to create the domain-specific, adaptive portal CompleXys, which filters blog entries and similar social media resources according to their relevance to a specific context.

The first of the two modules developed in this work is a semantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data should be suitable both for deciding whether a document is relevant to the domain of complexity and for further use in the filter module. The module utilizes various approaches to perform a multi-label text classification against a fixed complexity thesaurus.

The second implemented module is a content filter module. It provides a dynamic system of filters, which forms an access interface to the document store. It uses the previously extracted annotation and classification data to enable complex, semantically based filter queries.

Although the overall system performance will only be testable after the complete system is implemented, this thesis also conducts a first proof-of-concept evaluation of the two created modules. It investigates the classification quality of the semantic enrichment module as well as the response time behavior of the content filter module.


Acknowledgements

This thesis is the result of my research and implementation work in a project of the Heinz-Nixdorf Endowed Chair of Practical Computer Science at the Friedrich-Schiller University of Jena. I have been very fortunate to be given the opportunity to finish my studies in such a pleasant and interesting environment. For this chance I would like to give special thanks to my two supervisors, Birgitta König-Ries and Fedor Bakalov. Without them I would never have been able to create this work.

Furthermore, I would like to thank Adrian Knoth and, again, Fedor Bakalov, who contributed a lot to the basic project architecture upon which I built my work and who implemented the basic functions I relied on. Adrian was also kind enough to provide a database server for my work.

Additionally, I am indebted to my thesis reviewers, Fedor Bakalov, Birgitta König-Ries and Gerald Albe. They all helped to improve the text with their various comments and suggestions.

Another important source of support for this thesis were the developers of the tools I worked with and the authors of the papers and books I cited. Special mention should go to the makers of GATE and KEA++, who were the most important external supporters of my work.

Of course this page also has a place reserved solely for my parents, who hold my personal long-time record of twenty-six years of non-stop support. I can hardly express how grateful I am. Just... thank you!

Last but not least, I would like to give a huge thank-you to my beloved fiancée Monika Heyer for her steady patience, encouragement and love. You are great. =o)


Contents

1 Introduction 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Scope Of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4.1 Complex Systems Portal . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4.2 Semantic Content Enrichment . . . . . . . . . . . . . . . . . . . . 5

1.4.3 Complexity Domain Model . . . . . . . . . . . . . . . . . . . . . . 5

1.4.4 Semantic Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 CompleXys 9

2.1 Requirement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.2 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.3 Performance Requirements . . . . . . . . . . . . . . . . . . . . . . 17

2.1.4 Design Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Essentials 23

3.1 Notation of Semantic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.2 Semantic Annotations . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.3 Microformats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.1 Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.2 Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.3 Morphological Analysis . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.4 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


3.2.5 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.6 Pragmatic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.7 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Tools and Standards 31

4.1 SIOC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.2.1 Corpus Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.2 ANNIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.3 SKOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 KEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4.1 Candidate Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.4.3 KEA++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.5 Calais . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.5.1 OpenCalais WebService . . . . . . . . . . . . . . . . . . . . . . . . 41

4.5.2 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 Related Work 45

6 Semantic Content Annotator 51

6.1 CompleXys Domain Ontology . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.1.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.1.2 CompleXys Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.2 Semantic Content Annotator Pipeline . . . . . . . . . . . . . . . . . . . . 54

6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.2.2 Crawled Content Reader . . . . . . . . . . . . . . . . . . . . . . . 57

6.2.3 Onto Gazetteer Annotator . . . . . . . . . . . . . . . . . . . . . . . 58

6.2.4 Kea Annotator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6.2.5 Open Calais Annotator . . . . . . . . . . . . . . . . . . . . . . . . . 59

6.2.6 Content Writer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

7 Semantic Filter 61

7.1 Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

7.1.1 Filter Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

7.1.2 Abstract Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7.1.3 Logic Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7.1.4 Basic Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


7.2 Output Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

7.2.1 RSS Converter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7.2.2 Sesame Triplestore Converter . . . . . . . . . . . . . . . . . . . . . 66

8 Evaluation 69

8.1 Classification Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

8.1.1 Document Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

8.1.2 Test Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

8.1.3 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

8.2 Response Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

8.2.1 Test Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

8.2.2 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

9 Summary and Future Work 79

References 82

List of Figures 89

List of Tables 91


CHAPTER 1

Introduction

This chapter introduces the subject of this thesis, describes the task and outlines how the work proceeds. First, it introduces and motivates the general topic in Sections 1.1 and 1.2. Then it clarifies the objectives in Section 1.3 and sets the scope of the thesis in Section 1.4. Finally, it outlines the further chapters in Section 1.5.

1.1 Background

The world wide web is by far the greatest data repository mankind has created. But the majority of the information stored there is incomprehensible when one lacks the semantic context it is embedded in. Most people are able to reconstruct this context from a text manually, but searching the web for complex information is often an incredibly time-consuming and hard task, even in times of elaborate search engines. Enhanced automation capabilities would therefore be a great achievement in the evolution of the web. Unfortunately, machines so far perform far worse at text understanding than people do. One possibility to overcome this problem is to change the structure of the data itself and to explicitly provide the additional semantics that people normally add implicitly in their minds.

Tim Berners-Lee, who is credited with the invention of the world wide web, proposed a corresponding concept back in 1989 [4]. Back then he suggested not just mere hyperlinks, but typed ones: "the web of relationships amongst named objects". These ideas resulted in the first HTML version [60], which contains the type element as well as the rel and rev attributes. Type is used to define the kind of relationship the source document has towards another resource. The rel attribute is applicable to other HTML elements and can be used to describe the appropriate type of semantic relationship towards a second resource. Rev is the reverse: a passive counterpart to the active rel attribute. Indeed, type became popular for defining structural references like the related stylesheet document of a website and for linking the alternate printing version or RSS feed, but it never got widely established as a carrier of semantics, while rel and rev remained mostly unused. The semantic HTML elements were misused for presentational purposes for a long time, until the W3C CSS Level 1 Recommendation [38] in 1996 started the slowly progressing counter-revolution of strictly separating presentation and content. This development finally led to a rediscovery of semantic HTML in today's microformats movement, which is further described in Subsection 3.1.3.

Figure 1.1: The Semantic Web layers¹

However, these initial problems did not hinder Tim Berners-Lee from pursuing his vision further and from publishing a Semantic Web Road Map in 1998 [59], which marks the starting point of the W3C's Semantic Web activities towards the machine-understandable web. The Semantic Web layer diagram in Figure 1.1 shows the components that should finally achieve what the first attempt did not. It is easy to see that the whole approach is based on traditional URIs² and the Resource Description Framework RDF³, which are used to reference and describe resources in a standardized way. Furthermore, the ontology component is of special interest for the topic of this thesis, because ontologies can be used to describe a domain in a machine-understandable way.

¹ Accessed on January 20, 2010: http://www.w3.org/2007/03/layerCake.png
² http://tools.ietf.org/html/rfc3986
³ http://www.w3.org/RDF/


While the development and establishment of the Semantic Web is still the focus of many researchers and organizations, the Web 2.0, another fundamental change to the usage of the internet, seemingly overtook their efforts throughout the preceding decade. It is often characterized as the step from a read-only (Web 1.0) environment to a read-write ecology [56]. With blogs, social networks and wikis, today's web is no longer mostly a consumption medium, but an infrastructure for everyone to collaborate and publish. Among other things, this leads to interesting usage approaches such as crowdsourcing [28], which harnesses the collective knowledge and creativity of network users to produce outcomes that are competitive with those of task experts, but often notably less expensive.

Another important development in web-related systems is context awareness. Web applications are nowadays often aware of the habits and interests of the accessing user and tailor the content and structure of the user interface to his individual needs. This approach is, for example, successfully applied by the Amazon⁴ and Sevenload⁵ recommendations to improve their respective portals.

1.2 Motivation

Aware of their characteristics, it is a reasonable step to combine Web 2.0, the Semantic Web and context awareness in order to achieve an even more useful web environment. While the emergence of Web 2.0 provides an enormous and steadily growing amount of new information, the Semantic Web and context awareness are powerful approaches to efficiently utilize all this data for individual users without them getting lost in a state of information overload. A contribution to such efforts can be made by exploring the possibilities of using Semantic Web components for crowdsourcing and context-aware content systems. Accordingly, this thesis investigates the question of how semantic data and Semantic Web technologies can be applied to the task of utilizing resources that are freely published in scientific blogs or news sites as content for a fixed-domain, adaptive portal.

This could help to improve information portals in two ways. It assists in the process of automatically picking potentially relevant social media resources out of a multitude of distributed sources. Furthermore, the automatic extraction and annotation of semantic metadata can be used to estimate the resources' usefulness for the users. This information can then be used to provide content recommendations and to dynamically adapt the portal in order to display the optimal content for each individual.

⁴ http://www.amazon.com
⁵ http://sevenload.com

1.3 Objectives

The goal of this thesis is to investigate the applicability of semantic data that is automatically extracted from heterogeneous social media resources for various tasks in the environment of a domain-specific, adaptive information portal. The first task is the binary decision whether a particular resource should be regarded as relevant to a certain domain and hence be processed further. The second task is the categorization of the relevant resources into several main domain categories, which can, among other things, be used to organize the contents for intuitive browsing across several subpages. The third task is the assignment of resources to a domain set of finer-grained topical terms that can be used to outline their subject.

The result of the third task can also be used to match user interest information against the available set of resources in order to identify suitable content recommendations. However, to do so the user interest has to be recorded in a way that is comparable or identical to the same topical term set. Assuming that this is the case, the second goal of this thesis is to explore the possibilities of efficiently picking out resources that match certain, potentially complex conditions concerning their previously annotated semantic attributes.

The two described goals cannot be successfully accomplished without an underlying set of domain-relevant terms. Thus a third goal of this thesis is necessarily to provide a sufficient domain model in order to perform a proof of concept and enable a proper evaluation of the previous goals.

1.4 Scope Of Thesis

This section clarifies the scope of this thesis and thereby states what should be achieved and what should not. Subsection 1.4.1 places the work of this thesis in the context of the CompleXys project, of which it is a component. The succeeding three subsections describe the scope of the implementation units that emerge from the particular goals identified in Section 1.3. Subsection 1.4.2 provides the scope of the Semantic Content Enrichment module, Subsection 1.4.3 that of the CompleXys Domain Model, and Subsection 1.4.4 is dedicated to the scope of the Semantic Filter module.


1.4.1 Complex Systems Portal

This work is part of the CompleXys project, which intends to provide a domain-specific, adaptive portal for complexity. This portal should be able to provide complexity-related social media resources chosen by context. To achieve this, it needs to collect the resources from the internet, enrich them with semantic data, match them with a domain model for classification, filter them based on the collected data as well as on the context, and finally display them to the user. This thesis contributes to the system by providing a module for the semantic enrichment and classification of the collected social resources, a complexity domain model as the necessary basis for these tasks, and a module for the content filtering itself. The scope of the individual elements is detailed further in the succeeding subsections.

1.4.2 Semantic Content Enrichment

The first module to be provided aims to enrich social resources with semantic content and to classify them on that basis. To achieve this, there is first a need to find and apply ways to analyze the resources and extract semantic data from them. This task involves complex subfields of natural language processing and has received extensive research over many decades, so it is reasonable to assume that a single part of this thesis is unlikely to outperform the existing solutions. Thus the focus is on identifying and utilizing state-of-the-art tools that fit the special requirements of this module. Furthermore, the module needs to be able to use the semantic data for several classification tasks and to persist the extracted data in a readily accessible way.

1.4.3 Complexity Domain Model

To provide a sufficient domain model, a set of complexity-specific terms has to be collected and usefully structured. The model will be used as a basis for the classification and filter processes, so it has to be extensive and specific enough to successfully identify many texts from the broad, interdisciplinary area of complexity. However, the creation of a comprehensive model is a very time-consuming task and beyond the scope of this thesis. So a good prototype is sufficient, as long as the access interface to it is flexible and abstract enough to allow the model to be improved later without problems. Accordingly, a suitable data structure for the representation of the model has to be found.


1.4.4 Semantic Filter

The second module should provide a method for content filtering, so that social resources can be displayed according to a set of predefined filter criteria. The filter should provide a way to express complex queries regarding the semantic attributes of the resources and to efficiently access the subsets of resources that match these queries. The intention of this thesis is not to provide an exhaustive set of imaginable filters, but a flexible, freely extendable system as well as a basic set of useful filters for example and presentation purposes.
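To illustrate the kind of freely extendable filter system this scope implies, here is a minimal, hypothetical Java sketch. The names (Resource, Filter, CategoryFilter, AndFilter) are illustrative stand-ins only and do not reflect the actual AbstractFilter design, which is described in Chapter 7.

import java.util.List;
import java.util.Set;

// Hypothetical illustration of a freely extendable filter system.
// A resource carries the semantic attributes annotated by the enrichment module.
class Resource {
    Set<String> categories;  // main domain categories
    Set<String> topicTerms;  // finer-grained topical terms

    Resource(Set<String> categories, Set<String> topicTerms) {
        this.categories = categories;
        this.topicTerms = topicTerms;
    }
}

// Every filter answers a single question about a resource.
interface Filter {
    boolean matches(Resource resource);
}

// Basic filter: does the resource belong to a given category?
class CategoryFilter implements Filter {
    private final String category;

    CategoryFilter(String category) {
        this.category = category;
    }

    public boolean matches(Resource resource) {
        return resource.categories.contains(category);
    }
}

// Logic filter: combines arbitrary filters into more complex queries.
class AndFilter implements Filter {
    private final List<Filter> parts;

    AndFilter(List<Filter> parts) {
        this.parts = parts;
    }

    public boolean matches(Resource resource) {
        for (Filter part : parts) {
            if (!part.matches(resource)) {
                return false;
            }
        }
        return true;
    }
}

New filter types are added by implementing the single interface, and logic filters such as the conjunction above allow arbitrarily nested queries over the annotated semantic attributes.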

1.5 Thesis Outline

This introductory chapter gave an insight into the general topic as well as the particular research problem. It clarified the motivation, the objectives and the scope of the thesis.

Chapter 2 presents the CompleXys project and provides information about general considerations, needs and design decisions. Section 2.1 identifies and formulates the requirements of the system. Section 2.2 introduces the CompleXys architecture and its several working steps.

Chapter 3 provides the background knowledge for the remainder of this work. Section 3.1 treats options for notating semantic data. Section 3.2 gives an overview of natural language processing as a fertile research field for semantic data extraction.

Chapter 4 presents tools and standards that proved useful within the practical part of this work. Section 4.1 introduces SIOC as a standard for the metadata of social media resources. Section 4.2 describes GATE, which is essentially an architecture and framework for language processing systems. Section 4.3 deals with the taxonomy standard SKOS. Section 4.4 pays attention to the elaborate keyphrase extraction package KEA. Finally, Section 4.5 gives an insight into the OpenCalais toolkit and web service, which is capable of enriching content with semantic data upon request.

Chapter 5 provides an overview of previous research in the field of information filtering. Various approaches to the task are presented and related to this thesis.

Chapter 6 discusses the Semantic Content Annotator module. Section 6.1 is devoted to the CompleXys domain model, and Section 6.2 explains the concept of the Semantic Content Annotator pipeline and gives a detailed description of its implementations.

Chapter 7 provides an insight into the Semantic Filter module. Section 7.1 treats the AbstractFilter concept and its implementations. Section 7.2 presents the output variants for the filtered data that have been implemented for presentation purposes.

Chapter 8 evaluates the two introduced modules. Section 8.1 examines the classification quality of the Semantic Content Annotator module. Section 8.2 tests the runtime performance of the Semantic Filter module. The evaluation results are discussed in Section 8.3.

Finally, Chapter 9 provides a summary of the thesis and considers possible future work based on the obtained results.


CHAPTER 2

CompleXys

The CompleXys project is the environment in which the work of this thesis is embedded. This chapter introduces the system: a requirement analysis is performed in Section 2.1 and a general architectural overview is given in Section 2.2.

2.1 Requirement Analysis

This section is dedicated to the requirements of CompleXys. To obtain these, it is helpful to first identify the relevant actors and their respective use cases and only then deduce the actual requirements. Accordingly, Subsection 2.1.1 introduces the identified actors. Subsection 2.1.2 is dedicated to the use cases, linking the actors to the system. Subsection 2.1.3 specifies the performance requirements of the system and Subsection 2.1.4 the design constraints. This section loosely follows standard requirements analyses, but due to the much smaller scope of this chapter it is of course considerably shortened and abstracted in comparison to a full-sized software requirements analysis.

2.1.1 Actors

The actors are parties outside the system that interact with the system [10]. These parties can be users or other systems, and they can be divided into consuming entities, which use the functionality of the system, and assisting entities, which help the system to achieve its purpose. Five different actors could be identified for CompleXys:

• Information Consumers

• Information Providers

• Assisting Systems


• Administrators

• Developers

Information Consumers are the central clients of the system. They are the ones getting value out of the system, whose main purpose is indeed to be a mediating software layer between large resource sets and these same information consumers. They are characterized by a focus of interest on complexity-related topics. Furthermore, every information consumer is supposed to have personal preferences and special fields of interest within this domain, so it is sensible to treat him as an individual. His interests may change over time. Information consumers are expected to be average world wide web users with at least basic web browsing skills. The usage frequency can vary from one-time use to many times a day, depending on individual information needs, time and access possibilities. In the short term the number of information consumers could vary from very few to hundreds, and in the long term it may become notably higher.

Information Providers are the second most important actor class, because they provide the resources that will be displayed to the information consumers. Potentially anyone on the internet could be an information provider, as long as he publishes content and allows agents to crawl his site. They do not necessarily care or even know that their resources are processed within CompleXys. Thus there is no implicit control over topic, quality, publishing frequency, size, form, language, media type or subsequent modification of resources. Likely examples of information providers for CompleXys are scientific bloggers or researchers who publish their papers freely on the internet.

Assisting systems are all those external systems that are utilized within CompleXys. They may serve various purposes that are beyond the scope of CompleXys. Up to now this is only the OpenCalais web service, because the majority of the reused software runs internally, but this might change in the future. However, depending on external systems always comes with the risk of externally caused outages or errors, so new systems must be integrated with care.

Administrators are the entities that assist and support the running system. They are responsible for the general maintenance of the system. Furthermore, they can manually add information sources and resources to the resource set. The expertise level of the administrators is naturally very high within their respective working domain, but for reasons of specialization it cannot be assumed that every administrator is able to maintain every part of the system. They should be promptly available whenever problems with the system occur. Their number depends on the size of the system and the specialization level of the administrators. Dedicated data administrators may manage the resources that should be harvested, and further administrators may be entrusted with database, system and network management.

Developers are the only actor class that is not occupied with the running system but with the code itself. They are responsible for evolving and extending CompleXys beyond its initial version. Possible goals for these actors are the elimination of flaws, new functions or performance enhancements. They need to be skilled programmers.

2.1.2 Use Cases

Use cases are descriptions of how an actor uses a system to achieve a goal and of what the system does for the actor to achieve that goal. They tell the story of how the system and its actors collaborate to deliver something of value for at least one of the actors [6]. Use cases are strongly related to the functional requirements of a software requirements analysis, and their relationship to the actors makes them quite convenient for identifying the external interfaces. Due to the coarse-grained perspective of this thesis on the requirements, and the fact that functional requirements tend to be much more detailed than the corresponding use cases, this subsection acts as an abstract surrogate for the functional requirements and external interface requirements subsections.

The following eleven use cases could be identified:

• get information recommendations

• search

• modify user interest manually

• get digest

• gather resources

• use assisting service

• manage source or resource list manually

• maintain system

• add feature

• identify users

• record user interest

Figure 2.1 visualizes how the particular use cases relate the previously identified actors, in their respective roles, to the system.


Figure 2.1: The relationships between actors and use cases

"Get Information Recommendations" is one of the most essential features of the sys-

tem and a typical use case of the information consumer. More precisely the use case

involves the selection and dynamic display of resources depending on their estimated

level of interest to a particular information consumer. In order to perform successfully,

the use case assumes, that several conditions are fulfilled. First it is dependent on the

use case "Identify Users", because a user had to be recognized, before a system can make

useful personal assumptions about him. Furthermore it is dependent of "Record User

Interest", because the system needs a possibility to store user interest representations,

for afterwards matching them to the available resources. And finally it is dependent

on the use case "Provide Information", because the displayed resources must obviously

be obtained in the first place. Additionally the resources have to actually match the

user interest. Avoiding front-end error messages, the system had to behave sensible,

whenever these assumptions are not given. There should be a possibility to display

resources in an unpersonalized way, if a user can not be identified, if no user inter-

est data has been raised yet or if there are simply no matching resources available for

the particular user. The use case involves the sequential steps "Load Stored User In-

terest", "Load Resources", "Match Resources to User Interest" and "Display Matching

Resources". Important demands are an acceptable response time, high recall and high

precision. These do reappear in the Subsection 2.1.3 and are discussed closer at this

point of the text. Furthermore the use case must be intuitively accessible for users with

the assumed average internet expertise level of the information consumer actors. This

involves the need of dynamically reflecting the probability of a resource to be inter-

Page 23: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data

2.1 Requirement Analysis 13

esting in display attributes like size, position and highlighting. To efficiently achieve

this the implicit relevance rating done in the step "Match Resources To User Interest"

should be expressed in relative probability values instead of binary decisions or an in-

teger sorted order. Because of its core importance to the system and its meaning for

information consumers as the central actor class "Get Information Recommendations"

can be rated with highest priority1.

"Search" is another important use case, that is related to information consumer. While

"Get Information Recommendations" is characterized by a passive information con-

sumption of the user, "Search" is the active querying of needed data. It is basically in-

dependent of other use cases than "Gather Resources", but may be used as information

source by the "Record User Interest" use case, when the user is additionally identified by

the "Identify User" use case. It involves the sequential steps "User Send Search Query",

"Search For Matching Resources" and "Display Search Results". Important demands are

good response times, that do not seriously interrupt the users’ browsing flow and an

intuitively understandable and controllable user interface. The use case is rated with

high priority, because albeit it is not actually a core feature of CompleXys, the internet

user is highly accustomed to this function and is likely to insist on needing it.

"Manually Modify User Interest" is another use case related to the information con-

sumer. Its goal is to visualize the recorded interest model to the described user and

let him alter it as he wishes. This assists to improve system transparency and possibly

also the system’s value to the user, because it is capable to establish a very up-to-date

and correct user interest model. This helps to smooth away three common flaws of in-

formation filtering systems. Firstly it helps to rapidly adapt the system to new interest

emphases of the user, secondly it helps to remove expired interests instantly and thirdly

it provides the possibility to correct erroneously added interest entries. Traditional sys-

tems may require quite a long time to autonomically adapt to the cases one and two,

because they usually require a certain amount of related behavior data. The third case

is worse, because the system may repeatedly draw the wrong conclusion and the un-

wanted topic may not even lose importance over time, when autonomic adaption is the

only option to change user models. This use case is dependent on the "Identify Users"

use case, because the actual user must be recognized by the system in order to find and

visualize his user model for him and to persistently store changes for future usage. Fur-

thermore, the use case is dependent on the assumption, that users benefit from a more

accurate interest profile. This is true as long as use cases like "Get Information Recom-

mendation" apply the profiles to produce value for the user. The use case involves the

1Priority levels: Highest = 5/5, High = 4/5, Middle = 3/5, Low = 2/5, Lowest = 1/5

Page 24: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data

14 CompleXys

sequential steps "Load User Interest Data", "Display User Interest Data" and optionally

"User Alters User Interest Data" as well as "Store Altered User Interest Data". Important

demands are good response times, that do not seriously interrupt the users’ browsing

flow and an intuitively understandable and controllable user interface, that orientates

on the assumed average internet expertise level of information consumers. The use

cases’ priority is middle, because it provides a useful, but not an essential feature for

the system.

The use case "Get Digest" is related to the information consumer too. It provides the

users with the possibility to subscribe to a digest, that regularly delivers email sum-

maries of recent resources, that may be of personal interest to them. Beside the obvi-

ous "Gather Resources" the use case is also dependent on the "Identify User" and the

"Record User Interest" use cases, because in order to make assumptions about the rel-

evancy of resources to him the user needs to be assigned to his interest records. The

use case involves the sequential steps "User Subscribes To Digest" and regularly "Load

Stored User Interest", "Load Recent Resources", "Match Recent Resources to User In-

terest", "Summarize Matching Resources" and "Send Summary To User". Furthermore

there had to be an optional step "User Unsubscribes" to end the digest subscription.

The use cases’ priority is low, because it is a nice feature, but it is not essential.

"Gather Resources" is a use case, that is related to the information provider actors. Its

purpose is to collect resources, that may be of interest to the information consumers and

can be displayed later on. It assumes, that it has access to a set of source addresses and

that it is able to access the corresponding sources. Furthermore it assumes, that these

sources frequently provide new resources, that are potentially relevant to the domain

of complexity and interesting for the users of CompleXys. It involves the sequential

steps "Get Source Address", "Load Source", "Crawl Source For New Resources" and in

case of successfully finding one or more new resources "Store New Resources". The

size and number of processed and stored resources may range from few and little ones

to millions with possible sizes up to whole books or long-term discussion archives.

Complexity related scientific resources are usually not very time-critical, so a gathering

frequency of one to three times a day should be sufficient. Furthermore the system

must give attention to the needs of the information providers. This involves respect

of privacy and thereby to the crawler access policy described in the sources’ robots.txt

file [51]. It also includes respecting copyrights and do only crawl and display resources

from those sources, which allow to do so. This use cases’ priority is highest, because an

information filtering system without information would be worthless.

"Use Assisting Service" is a use case, that is related to the actor class of assisting systems.

Page 25: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data

2.1 Requirement Analysis 15

Its purpose is to outsource tasks, that are beyond the scope of CompleXys, to external

services, that are capable to master them. It assumes, that the external service works

correct, reliable and sufficiently fast. Furthermore it assumes, that the connection to the

external service is stable and also fast enough. A useful service utilization is depen-

dent of the system’s capability to provide the data the respective service needs and to

understand the service response. A successful use case is composed of the sequential

steps "Send Request To Service", "Service Receives Call", "Service Processes Request",

"Service Sends Response" and "Receive Response". The priority depends on the impor-

tance of the respective problem, that is solved by the service, and can therefore not be

generally specified.

"Manually Manage Source Or Resource List" is a use case, which is managed by the

actor class of the administrators. The sources of the initial source list and the resources,

which are autonomically found on them, provide by nature a very limited set of in-

formation, that may moreover become outdated very quickly inside its fast moving

internet environment. This problem can be treated by instructing one or more adminis-

trators to manually add, modify and remove sources or solitary resources and thereby

keep the resource list up-to-date. In order to achieve a successful use case termination,

the administrator needs to have access rights and a possibility to modify the source

and resource list. Additionally administrators are assumed to be capable of indepen-

dently finding new relevant sources and resources, identifying outdated or erroneously

added resources and reacting in a suitable way to these cases. The use case involves the

sequential steps "User Modifies Source Or Resource List", "Store Modified Source Or

Resource List" and optionally for modifying and deleting tasks "Display Source or Re-

source List". These steps can all either refer to the source list or the resource list within

a single step sequence and are just summarized for clarity reasons. The use case is gen-

erally not time-critic, but involves human-machine interaction and should therefore

provide acceptable response times. Not every data administrator can be assumed to be

a computer expert, so the interface should also be intuitively accessible.The priority is

rated high, because inaccurate and outdated resources could strongly disturb the user

experience, but then again it is not a core feature of the system

"Maintain System" is a very general use case, that involves all works, that help to keep

the running system in a useful state or to bring it back to one, which involves cor-

rect functionality, accessibility and sufficient performance. The use case is mainly per-

formed by administrators, handling the network, software and hardware environment,

in which CompleXys is embedded, but it also involves developers, who can optimize

the code for better performance or eliminate bugs. Of course the administrators and

Page 26: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data

16 CompleXys

developers need high expertise in and access to their respective responsibility layer

in order to accomplish their maintenance duties. To grant the needed access rights,

involved actors need to be identified first, so this use case is dependent of "Identify

Users". Due to the versatile characteristics of the various maintenance tasks involved

actors are often quite heterogeneously skilled. Thus the system needs to be partitioned

carefully in order to clearly divide between the different layers and to allow a main-

tainer to do his job without detailed knowledge about the job of another maintainer.

The priority is the highest, because an unmaintained system is not likely to be usable

at all.

The initial system is probably not the final version of CompleXys. Software systems are usually subject to evolutionary progress and should therefore be prepared to evolve. Accordingly, the "Add Feature" use case represents the developer task of adding a new feature to CompleXys. This assumes that the developer possesses high programming expertise, which is clearly reasonable. Furthermore, it demands an easily expandable system structure, which is characterized by low coupling, high cohesion and standard compliance. The use case's priority is rated middle, because it is not essential for the running system, but steadily helps to improve it.

The use case "Identify User" aims to recognize a certain user and relate him to his role,

stored profile data and access rights. It requires the user or a third party to make an in-

put, that identifies him. The identification procedure should happen very rapidly and

require no expertise above average internet skills. If a user is needed to enter identifi-

cation data it should be in a form, that he can remember it by heart, so he do not need

to write it down. The use case is likely to divide in an initial step "User Registrates"

and repeating "User logs in" steps at the beginning of each new session. A "User logs

out" step at the end of each session is optionally and can alternatively be automatically

accomplished by session expiration. On the other hand the identification procedure

should be secure. No other entity than the specific user should be able to successfully

claim, that he is the user. The two highest priority use cases "Get Information Recom-

mendations" and "Manually Modify User Interest" are dependent of this use case so it

is also rated as highest priority.

The "Record User Interest" use case is related to the information consumer. Its purpose

is to record the interests of the actual user, in order to use the gathered information

to improve the quality of the recommended resources. It assumes, that the user ex-

plicitly or implicitly provides information about his interests to the system and that

these are digitally record- and classifiable. The use case involves the sequential steps

"User Provides User Interest Clues", "Classify User Interest" and "Recalculate And Store

Page 27: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data

2.1 Requirement Analysis 17

User Model". The former step may alternatively be "Third Party Provides User Interest

Clues", when the information is gathered from external sources like social networks or

other information filtering systems. However, user data from external sources should

for privacy reasons never be collected without the permission of the respective user.

Furthermore, "Recalculate And Store User Interest" involves a recalculation, because

new interests may replace old ones and all relative values in the system may change.

The process of implicit user interest recording should usually run in background and

therefore not slow the system down in a noticeable way. The highest priority use case

"Get Information Recommendations" is dependent of this use case so it is also rated as

highest priority.

2.1.3 Performance Requirements

According to the IEEE Recommended Practice for Software Requirements Specifica-

tions [30] performance requirements should specify both the static and the dynamic

numerical requirements imposed on the software. Static numerical requirements may

include the number of simultaneous users to be supported or the amount and type

of information to be handled. Dynamic numerical requirements may include, for ex-

ample, the numbers of transactions and tasks and the amount of data to be processed

within certain time periods for both normal and peak workload conditions.

CompleXys is a web application and thereby potentially accessible to an extremely large audience of internet users. Admittedly, the restricted interest domain of complexity scales the number of likely users down again, but due to the adaptability of the system to other domains it is nevertheless reasonable to design CompleXys in such a way that it scales to use by thousands of people, which suggests at most linear performance and memory complexity in every user-relevant system aspect.

Likewise, regarding the world wide web environment and its steady flood of information, the system is expected to be faced with an enormous number of documents over time. The size of individual documents is quite hard to estimate and may range from the restricted number of symbols in a microblogging post up to whole books and long-term discussion archives. This suggests linear performance and memory complexity in every resource-relevant system aspect as well. In particular, effective ways of storing, accessing and filtering large amounts of differently sized documents have to be implemented.

CompleXys is an information filtering system that relies heavily on user interaction. Active information consumption is simultaneously the main value provided to its users and a way of further improving the system performance by implicitly shaping correct user interest models. However, this assumes that the user actually wishes to use the system in everyday life. To attract him to do so, the system has to provide a comfortable and undisturbed usage experience. One relevant performance topic in this respect is the response time. The response time is regarded as the time passing between a given input to a system or system component and the corresponding response. This is essential because exceeding particular tolerance values in waiting time is likely to provoke user discontent and a loss of attention to the system. Andrew B. King states in "Website Optimization" [35] that this long-known effect is even amplified by broadband users' familiarization with rapid access. Acceptable response times for entry-level DSL users are estimated to lie between two and three seconds. These shall be achieved in ninety-five percent of all transactions at normal workload and in sixty percent of all transactions at peak workload. Furthermore, a transaction should, if possible, never take longer than twenty seconds.

The performance of an information filtering system obviously also depends on the provided filtering quality. This quality is measured by the two metrics recall and precision. Precision is a measure of the usefulness and recall a measure of the completeness of the calculated document ranking [40]. Stated more formally, recall is defined as:

recall = (# relevant hits in the hit list) / (# relevant documents in the collection)

It measures how well the system identifies relevant documents. The recall is optimal

when every relevant document is rated above the display threshold. Of course this

can be easily done by simply rating every document with a high value. This is why

precision is needed as a second metric. Precision is defined as:

precision = (# relevant hits in the hit list) / (# hits in the hit list)

It measures how well the system filters out non-relevant documents, which are rated low in CompleXys' approach. The precision is optimal when every document returned to the user is relevant to the query. Both metrics have a fixed, relative range between 0.0 and 1.0. A useful information filtering system must have a high recall, so that most of the relevant documents are displayed to the user. But to be comfortable, and to stand out in comparison to simple search engines, it must also achieve a high precision value, so that the user does not have to search laboriously for relevant documents like needles in a haystack. Using the evaluation results of [36] and [7] as reference points, the average recall should be higher than 0.5 and the average precision higher than 0.7.
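To make the two metrics concrete, the following is a small illustrative Java sketch (not part of the thesis implementation) that computes them for a hit list and a set of relevant documents, both assumed to be given as document identifiers.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterQuality {

    // Counts the hits in the hit list that are also in the relevant set.
    private static int relevantHits(Set<String> relevant, List<String> hitList) {
        int count = 0;
        for (String hit : hitList) {
            if (relevant.contains(hit)) {
                count++;
            }
        }
        return count;
    }

    // recall = relevant hits in the hit list / relevant documents in the collection
    static double recall(Set<String> relevant, List<String> hitList) {
        return relevant.isEmpty() ? 0.0
                : (double) relevantHits(relevant, hitList) / relevant.size();
    }

    // precision = relevant hits in the hit list / all hits in the hit list
    static double precision(Set<String> relevant, List<String> hitList) {
        return hitList.isEmpty() ? 0.0
                : (double) relevantHits(relevant, hitList) / hitList.size();
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d2", "d3", "d4"));
        List<String> hitList = Arrays.asList("d1", "d2", "d5");
        System.out.println("recall    = " + recall(relevant, hitList));    // 2 / 4 = 0.5
        System.out.println("precision = " + precision(relevant, hitList)); // 2 / 3 = 0.67
    }
}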

2.1.4 Design Constraints

This subsection specifies design constraints, which can be imposed by standards, hardware limitations and other factors [30].

According to the needs of the developers identified in Subsection 2.1.2, the system has to be expandable. This includes a properly modularized system structure, characterized by low coupling and high cohesion of the individual components. In addition to this horizontal division into functionality modules, the system needs to be clearly partitioned into vertical abstraction layers that are connected by well-designed interfaces and allow the replacement of one layer without affecting another. Furthermore, it should allow administrators to maintain their layer without any knowledge of adjacent layers apart from what the interface looks like.

CompleXys utilizes semantic data to provide a semantic topic classification and is thereby a strong candidate to become a native part of the emerging Semantic Web itself. To achieve this, and to further benefit from improving linked data clouds, semantic data reasoners and the like, CompleXys needs to apply the common Semantic Web standard languages and protocols whenever it is sensible to do so.

2.2 Overview

CompleXys is a word construct derived from "complex sys(tem)", referring to its purpose of providing a user-context-sensitive, adaptive portal for the domain of complexity. Fedor Bakalov and Adrian Knoth identified five basic modules and two sorts of resources as relevant for the CompleXys project. Based on this, they developed a system architecture that reflects the requirements of the previous section. It is visualized in Figure 2.2.

The first module is the harvester. Its purpose is to proactively collect resources and it

is composed of three components. The crawler searches for new, potentially interesting

resources and stores their access data in the resource list. But some special data, such as
that about university resources, is entered manually by the administrator instead of

being crawled. The resource list is a data storage containing the URLs as well as op-

tional invocation methods and authentication information for accessing the resources
that should be harvested. The fetcher collects those resources that are specified in the
resource list and fetches their content. The outcome of this module is a growing collec-
tion of raw resources.

Figure 2.2: The CompleXys overview schema

The second module is the Content Type Indexer. It performs an analysis of the content’s

format and sources to extract metadata like the source type, the content type or simply

the title. The metadata is restructured to RDF, using Dublin Core1 as reference for com-

mon metadata and SIOC2 for online community related type data like the affinity to a
special blog or forum (see Section 4.1 for more information on SIOC). The outcome of
this module is a set of documents with metadata that was extracted from the
source or predefined as markup in the content.

The third module is the Semantic Content Annotator. Its implementation is a main

emphasis of this diploma thesis. The Semantic Content Annotator extracts machine-

readable semantics out of the received content and annotates it. To achieve this, natural

language preprocessing, keyphrase extraction and annotation web services can be uti-

lized. The preprocessing and some basic semantic matching is done by using the GATE3

framework and some of its plugins. It is also used for the persistent storage of the pro-

cessed documents and semantics, employing a prepared Hibernate4 implementation

(see Section 4.2 for GATE). Keyphrase extraction, as a possibility for extracting topic se-
mantics from the text, is implemented by means of KEA5 with a modified version of

1 http://dublincore.org/
2 http://rdfs.org/sioc/spec/
3 http://gate.ac.uk/
4 https://www.hibernate.org/
5 http://www.nzdl.org/Kea/index.html


the encapsulating GATE Plugin (see Section 4.4 for KEA). Finally the OpenCalais web

service6 is called by another encapsulating GATE plugin, returning semantic entities

and facts (see Section 4.5 for Calais). The extracted semantic data is mapped to the do-

main ontology concepts, thereby providing the probability for the text to be a member

of the particular category. The structure and implementation of the Semantic Content

Annotator module is further described in Chapter 6.

The fourth module is the Semantic Filter. It is also the second module that will be
realized in the scope of this thesis. Its purpose is to apply the collected semantics for
information filtering. It is meant to provide a dynamically configurable interface
for accessing those documents that fit certain filter conditions. An abstract Java
filter bundled with a FilterIterator data structure is utilized to achieve this. The filter
is implemented as a set of logic filters and proposition filters, forming a filter system
that resembles propositional logic. The implementation of the Semantic Filter will be
detailed in Chapter 7.
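The general idea of such a propositional filter system can be sketched in a few lines of Java; the class and method names below are illustrative assumptions and do not necessarily match the actual implementation described in Chapter 7:

import java.util.List;

// Illustrative sketch of a propositional filter system; names are assumptions.
interface FilteredDocument {
    double getCategoryProbability(String category);
}

interface Filter {
    boolean accepts(FilteredDocument document);
}

// A proposition filter checks one atomic condition, e.g. a minimum category probability.
class CategoryProbabilityFilter implements Filter {
    private final String category;
    private final double minProbability;

    CategoryProbabilityFilter(String category, double minProbability) {
        this.category = category;
        this.minProbability = minProbability;
    }

    public boolean accepts(FilteredDocument document) {
        return document.getCategoryProbability(category) >= minProbability;
    }
}

// Logic filters combine other filters analogously to propositional connectives.
class AndFilter implements Filter {
    private final List<Filter> operands;

    AndFilter(List<Filter> operands) {
        this.operands = operands;
    }

    public boolean accepts(FilteredDocument document) {
        return operands.stream().allMatch(filter -> filter.accepts(document));
    }
}

An OrFilter and a NotFilter can be built in the same manner, so that arbitrary propositional filter expressions can be composed and evaluated against the annotated documents.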

The fifth and last module is the CompleXys Portal. It utilizes the collected and seman-

tically annotated resources as well as a user model to deliver those resources to the
user that he is most probably interested in. The resources are accessed by users
through a number of navigation topologies. A general topology will be available to
every user, and additionally there will be personalized recommendations tailored to the
interests of the individual user. The user interests are defined in a user model, which is up-
dated automatically based on the user's browsing history. Furthermore, the user himself
will be able to update his model manually.

6 http://www.opencalais.com/


CHAPTER 3

Essentials

This chapter provides the background knowledge for the remainder of this thesis. A

main task of this work is the semantic enrichment of text within the Semantic Content
Annotator module. Therefore, Section 3.1 introduces existing options to notate seman-
tic data. But the semantic enrichment can obviously not be done without collecting
the semantic data in the first place. For that reason, Section 3.2 gives an overview of
natural language processing as a fertile research field for semantic data extraction.

3.1 Notation of Semantic Data

Semantic data can be displayed and stored in various ways, depending on the quality

and quantity of the data, as well as on the kind of intended reuse. The three succeed-

ing subsections will introduce important notation possibilities. Subsection 3.1.1 will

deal with ontologies, Subsection 3.1.2 with annotations and Subsection 3.1.3 with mi-

croformats.

3.1.1 Ontologies

An ontology is a formal, explicit specification of a shared conceptualization [21]. It pro-

vides the needed syntax and semantics to describe relevant aspects of a domain in a

way others and especially machines can understand. This is achieved by determining

concepts and the relations between them. A tiny example ontology might be a concept

dogOwner, a concept dog and a relation owns that can connect both. Special properties

may add more information to a concept. For example, dog may need to have a dogTag
property. Furthermore, axioms are defined to assign semantic information to those con-

cepts and relations. Axioms are sets of logical terms, which can be used to describe


facts like: Every dogOwner needs to have at least one owns relation towards a dog.
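Written as a first-order logic formula (a plain illustration, independent of any concrete ontology language), this axiom reads:

\[
\forall x \, \big( dogOwner(x) \rightarrow \exists y \, ( dog(y) \wedge owns(x, y) ) \big)
\]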

Frequently used ontology modeling languages are nowadays the Web Ontology Lan-

guage OWL1, the Web Service Modeling Language WSML2 and the Simple Knowledge

Organization System SKOS3. The latter will be described in detail in Section 4.3, be-

cause it fits the requirements of this thesis best.

3.1.2 Semantic Annotations

The knowledge structure represented in ontologies is an important step towards a

working semantic web, but up to this point the data is still abstract and not yet con-

nected to the actual world wide web. Therefore, today's websites need additional meta-
data that describes their semantic meaning in a machine-understandable way. The pro-

cess of adding this metadata to a document is called semantic annotation.

There are basically three ways to link semantic data to a document [52]: Embedded,

internally and externally referenced annotations. Embedded annotations are directly

written into the HTML document. This can be done either by using an object or script
element or by writing it into an HTML comment. Either way, it is not displayed by

common browsers, but can be parsed and used by any semantically based application.

The advantage of this possibility is that the semantics are always present and do not

need to be fetched in a second loading step. The disadvantage is that much semantic

data may result in confusing source code and annotations in elements like script may

violate the code’s validation rules.

Internally referenced annotations refer to an external annotation storage from within
their code. This can be done in a link element with the rel attribute set to 'meta' and

type attribute for instance set to ’application/rdf+xml’ in case of RDF based metadata

notation. References starting from object elements or anchors are also possible.

As a third option, the external metadata document can reference the annotated one.
To address special parts of the website, XPointer or simple offset values may be used.
While the other possibilities require direct write access to the source document, this
can be done by external parties and can thereby be applied to a wide set of scenarios like

personalized annotations or social meta-tagging systems.

Besides the question how annotations are linked to a document, it is also interesting who
actually does it. Manual annotation is of course a valuable option. But it is probably

1 http://www.w3.org/TR/owl-features/
2 http://www.wsmo.org/wsml/
3 http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/


not sufficient, because even when incentives in combination with crowd sourcing prin-

ciples and useful annotation tools like SMORE [50], CREAM [22] and Annotea [33] may

accomplish a lot, the sheer mass of documents in the www is still likely to exceed what

can be achieved this way. Fortunately, the field of natural language processing provides
promising approaches for automatic annotation. These will be discussed in Section
3.2.

3.1.3 Microformats

The web itself was originally conceptualized for managing semantic information as we

already stated in Section 1.1. The microformats idea is about utilizing this fact and pro-

ducing machine-understandable semantics just by providing special purpose notation

standards based on this POSH, which is a recently created abbreviation for ’Plain Old

Semantic HTML'. Besides predefined semantic elements like address for contact informa-

tion or blockquote for quotes, the class attribute is applicable to every element and can

be used to assign other semantic descriptors. But to be useful for machine processing

these descriptors need to follow a common convention that can be parsed. There-

fore, the microformats community defines modular, open data format standards. The

schema in Figure 3.1 reflects the principles of coherence and reusability that are central
to the microformats idea. Fine-grained elemental microformats are always reused to

build up the more complex, compound microformats like hCalendar4 or hCard5.

Figure 3.1: The basic microformats schema6

An example for microformats may be the following hCard, which describes identity

4 http://microformats.org/wiki/hcalendar
5 http://microformats.org/wiki/hcard
6 Accessed on January 12, 2010: http://microformats.org/media/2008/micro-diagram.gif


and contact information of the FSU Jena:

<div class="vcard"><a class="fn org url" href="http://www.uni-jena.de/">

Friedrich-Schiller-University Jena</a><div class="adr">

<span class="type">Work</span>:<div class="street-address">Fürstengraben 1</div><span class="locality">Jena</span>,<abbr class="region" title="Thuringia">TH</abbr><span class="postal-code">07737</span><div class="country-name">Germany</div>

</div></div>

It is obvious how naturally anyone with HTML skills can adapt to this style. It is ex-
tremely simple, lightweight and pragmatic. It concentrates on modular, specific

topics and is quite human-readable. Furthermore it is self-contained because it is based

on embedded annotations and avoids language redundancy by reusing existing and

well-known HTML elements.

On the other hand microformats do not support URI identification of entities, which

leads to problems when trying to interoperate with the Semantic Web concepts around

the RDF-based W3C initiative. Additionally microformats do have a flat vocabulary

structure without namespaces, which may become problematic when different micro-

formats with equal class attributes are supposed to be combined on a single page. And

finally, microformats are controlled by a small, closed community that standardizes ex-
isting and common formats. This approach makes it unlikely to ever provide dozens

of domain-specific formats, therefore "the long tail" [1] of the social web will probably

always stay excluded from this kind of semantics.


3.2 Natural Language Processing

Natural language processing (NLP) is an interdisciplinary research field that resides

between linguistics and computer science and strongly interrelates with artificial intel-

ligence. It is concerned with the processing of natural language by computers. NLP

emerged originally from machine translation research in the middle of the twenti-
eth century [46]. Today's applications involve useful tasks like spell checkers, machine

translation, speech recognition and information extraction. In this particular thesis the

subproblem of text classification is a central task of the Semantic Content Annotator

module. Therefore, it will be treated separately in Subsection 3.2.7.

There are various basic approaches to handle the problems of natural language process-

ing, which will be discussed in Subsection 3.2.1. The essential subfields of text analysis

are mostly derived from the linguistic language description layers. Namely they are

lexical, morphological, syntactic, semantic and pragmatic analysis [37]. These are treated
in Subsections 3.2.2 to 3.2.6.

3.2.1 Approaches

The basic approaches for natural language processing can be broadly divided into sym-

bolic, statistical, connectionist and hybrid approaches [37]. The symbolic approach rests

on the usage of explicit knowledge representations like logic propositions, rule sets or

semantic networks for language analysis. It is based on the assumption that an exhaus-

tive formal representation of words, grammar rules, possible syntactic and semantic

word relations and other linguistic data must provide a machine with all necessary in-

formation to perform text processing. A given text shall thereby be stepwise analyzed
and transformed until it is directly represented in the intended machine-understandable

format.

The statistical approaches are to a widely varying degree based on mathematical statis-
tics and often strongly related to machine learning techniques. The corresponding

methods make use of large sets of already worked out machine-understandable text

data. These data sets can for example be used to train naive Bayesian networks, which

thereon build up a statistical model. This model can afterwards be used to transform

unprocessed texts in the same way that was shown in the training data.

Connectionist approaches are based on the idea of neural networks, namely that intelligence emerges
from the parallel interaction of many single neuron units. These approaches combine sym-

bolic knowledge representation with statistical methods. The knowledge is stored in


the weights of the neuronal connections, but the network is trained like the statisti-
cal approaches until it is capable of solving unprocessed cases itself.

Finally, hybrid approaches pay attention to the fact that all three preceding approaches
have strengths and weaknesses and may be optimally used in combination by utilizing
each of them in those NLP subtasks whose individual requirements it fits best.

3.2.2 Lexical Analysis

Lexical analysis deals with text segmentation tasks. The central program of this anal-
ysis is called a tokenizer; it divides the text into known token units like words, punc-
tuation and numbers. The sentence splitter is responsible for the segmentation of the
text into separate sentences. Another related tool is the part of speech tagger, which matches
sentence parts to word classes like noun, verb and adjective. This is necessary to resolve
ambiguities for the tokenizer.

3.2.3 Morphological Analysis

Morphological analysis is concerned with the word structure and the morphological pro-
cesses it results from. The goal is to normalize a word into a morphology-independent
form. This is important to reduce the size and complexity of the underlying lex-
icons. It is easier to store morphology-independent forms and a set of rules expressing
how any word can be reduced to them, than to store every possible morphologically trans-

formed form and maybe even to add word heritage relations just to gain a comparable

expressiveness.

Morphology-independent forms can be stems or lexemes. Stems are the remaining part
of a word when all suffixes are cut off. For example, the words 'category', 'categorical'

and ’categories’ do share the same stem ’categ’. Lexemes on the other hand are basic

words like those one can find in a lexicon. For example the lexeme of words like ’took’,

’taken’ and ’taking’ would be ’take’. The latter is harder to implement but also more

expressive, because a stemmer would not be able to match 'took' to 'take' or 'better' to

’good’ while a program for lexemes will.

Part of speech taggers, which were already introduced in Subsection 3.2.2, are of importance for this

analysis layer too, because the way morphological processing is done relies strongly on

the affected part of speech type. Therefore, it is reasonable to use morphological data

for part of speech tagging and vice versa.


3.2.4 Syntactic Analysis

Syntactic analysis works with the syntax of sentences. It deals with word order and

phrase structure. Phrases are thereby word groups with a collective function in this

particular sentence. The word order in those phrases and the phrase distribution in

the sentence follows language inherent rules and carries information of grammatical

states. So syntactic analysis contributes to the extraction of central linguistic concepts

like sentence type, tense and morphologic case. Another application for this analysis

level is text parsing, which is the verification of a sentence by means of syntactic well-

formedness.

3.2.5 Semantic Analysis

The semantic analysis aims to perceive the meaning of text. It is generally dividable into

lexical and compositional semantics. Lexical semantics deals with the semantics of single
words or phrases. This may involve the classification into a relation network of similar

synonyms, hierarchically related higher-classed hypernyms or lower-classed hyponyms,

contrary antonyms and others. An important application of this analysis step is the

word sense disambiguation.

Compositional semantics deals with those semantics emerging from the composition of
words and phrases into bigger clusters like sentences or whole texts. An instance for this

is the semantic deduction, which is drawn from a reflexive pronoun referring to the

noun of a preceding sentence. The application of semantic analysis to the interrelation

between sentences of a text is called discourse analysis.

3.2.6 Pragmatic Analysis

Pragmatic analysis is responsible for the highest level of text understanding. The se-

mantic meaning is considered by its relation to a wide-ranging set of context, back-

ground knowledge and conventions to extract hidden inherent information like action

implications, speaker motivation, irony or citation. The ability to master this anal-
ysis level is probably a main obstacle for a machine to reliably pass the Turing test and
may, according to Alan Turing [62], therefore count as equivalent to human intelligence.

However, highest does in no way suggest that other levels are of lesser importance:
pragmatic understanding cannot be achieved without profound preparatory work at

the preceding levels.


3.2.7 Text Classification

Text classification is a subfield of natural language processing. It determines whether a

text is a member of certain categories. Such categories may for instance refer to the text

genre or to topic domains. The latter categorization task can be labeled separately as

term assignment, but in this thesis we will include it under the term text classification

for reasons of clarity and simplicity. Generally, classification is useful for supporting

effective access to large amounts of information. Hence it is especially of great interest in

regard to the rapidly growing world wide web.

Text classification is based on a controlled vocabulary, which lists all permitted clas-

sification terms. The opposite is text clustering, which freely arranges document sets

according to shared words, phrases or even just shared relations to words. On the one

hand, this free indexing strategy has the advantage that it is domain independent and
more flexible towards unexpected inputs. Controlled indexing on the other hand pro-
vides better performance in its special domain and provides predictable output that is
easier to work with on the application side. Furthermore, it can be semantically exploited
more easily by preparing a specialized semantic net for the anticipated outputs, and the
consistency with human classification is higher.

Text classifiers usually consist of a knowledge extractor and a filter. The knowledge
extractor creates class models containing sets of weighted features. These are mostly
represented as word or letter n-grams and represent extracted text data like frequency
counts, entropy and correlations. Each module can either work in a static way, which
is usually symbolic and rule-based, or in a self-learning way, which involves training data and

statistical or connectionist methods.


CHAPTER 4

Tools and Standards

This chapter describes important tools and standards that were used during the thesis

work. Section 4.1 explains the purpose and features of the SIOC ontologies, which are

used by CompleXys’ Content Type Indexer to express social media specific metadata.

Section 4.2 surveys the GATE project, which is utilized as a basic framework for the Se-

mantic Content Annotator module and provides one of the implemented methods for

semantic extraction and annotation. Section 4.3 treats the taxonomy description lan-

guage SKOS, which serves as ontology description language for the CompleXys domain
taxonomy. Section 4.4 discusses the keyphrase extraction package KEA and its follow-
ups, on which another approach for semantic extraction is based. Finally, Section
4.5 introduces the Calais initiative and its semantic annotation web service OpenCalais,
which was the third semantic data source utilized within the Semantic Content Annotator.

4.1 SIOC

The abbreviation SIOC [9] is short for Semantically-Interlinked Online Communities.

The initiative aims to bridge the gap between the social web and semantic web tech-
nologies. To achieve this, it provides a series of ontologies, defining a description stan-

dard for the domain of online communities.

The ontology structure specifies different abstraction levels that relate to each other.
For example, Figure 4.1 presents the semantic net of the SIOC main classes. The abstract

items relate to a superordinate container, which in turn belongs to a certain space. In

case more details are known, the item may be more precisely described as a post and

the container as a forum, both located in a concrete site. A post may have replies, tags,

categories and a creator. The creator may be a member of a usergroup, have a function

in the forum and may be further related to special person description ontologies like


FOAF [11].

Generally speaking, these concepts enable people to describe and consolidate their identity

across the social web and possibly merge all the multiple accounts of today’s web life

into a coherent web identity. This is coupled with rapid access and processing capacity

for community related data and thereby with many interesting application options.

But on the other hand it may also lead to an increased potential of abusive data storage,

hence increasing the necessity for public awareness towards data parsimony.

Figure 4.1: The SIOC main classes in relation1

4.2 GATE

The abbreviation GATE [16] is short for a General Architecture for Text Engineering. It

is an infrastructure for language processing software development. The contained soft-

ware architecture defines a fundamental organization schema for NLP software based

on loosely coupled GATE layers. These can also be utilized externally by accessing the

corresponding open source API set of the GATE Embedded framework, whose compo-

nents are visualized in Figure 4.2.

1Accessed on January 25, 2010: http://wiki.sioc-project.org/images_sioc/f/f2/Sioc_spec_5_small.png


Figure 4.2: The APIs, which form the GATE architecture2

It is easily perceivable that GATE has a meticulous focus on clean layer separation, di-
viding its APIs into IDE GUI, Application, Processing, Language Resource, Corpus,
Document Format, DataStore and Index layers as well as Web Services. The inter-

nal resources are structured in three categories. Basic data and language documents

like lexicons, ontologies and corpora are termed as Language Resources (LR). Algo-

rithmic components like part of speech taggers, tokenizers and parsers are called Process-
ing Resources (PR). Visualization and GUI related components are denoted as Visual
Resources, a division that obviously mirrors the Model-View-Controller architectural
pattern. The combined set of these three resource types is collectively known as CREOLE,

which is short for Collection of REusable Objects for Language Engineering.

Furthermore, GATE contains a graphical IDE, a ready-to-use data model for corpora
and documents discussed in Subsection 4.2.1 and an elaborate Information Extraction
system called ANNIE, which will be discussed in detail in Subsection 4.2.2.

2Accessed on January, 2010: http://gate.ac.uk/sale/talks/gate-apis.png


4.2.1 Corpus Data Model

The corpus data model is used as document and annotation format for the Semantic

Content Annotator and Semantic Filter modules of CompleXys. It can be described by

the six essential data objects, whose relation network is visualized in Figure 4.3.

Figure 4.3: A data model diagram for GATE’s corpus layer

The corpus object is per definition a large, structured set of texts and therefore contains
an arbitrarily big set of documents. Additionally, it is identified by a name and may con-

tain a FeatureMap, which lists descriptive features of an object in key-value pairs. The

documents possess the actual document content, a name, a source URL, a FeatureMap

and AnnotationSets. An AnnotationSet contains any number of annotations and an

identifying set name.

An Annotation has an id and a type, potentially connecting it to an ontology concept.

Further information can also be noted in an attached FeatureMap. The Annotation ob-

jects are implementations of the externally referenced semantic annotation approach
that is discussed in Subsection 3.1.2. This means that the annotations are neither em-
bedded in the content itself nor even referred to from within the content, but point
to the respective text interval by simply externally describing a start node and an end
node offset. The format, which is a modified form of TIPSTER [20], is useful on the

one hand, because it cleanly divides content and semantic description and preserves

the original text. On the other hand even slight modifications of the text will result

in reference inconsistencies from annotation to content. Thus a more flexible reference

approach would be desirable.
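As an illustration, the following minimal Java sketch shows how a document and an externally referenced annotation can be created with the GATE Embedded API (assuming an initialized GATE installation; the content, annotation set name, type and feature values are made up for this example):

import gate.AnnotationSet;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;

public class GateCorpusExample {
    public static void main(String[] args) throws Exception {
        Gate.init(); // initialize the GATE Embedded framework

        // Create a document from plain text content.
        Document document = Factory.newDocument("Complex systems show emergent behavior.");

        // Annotations reference the text only by start and end offsets.
        AnnotationSet annotations = document.getAnnotations("SemanticAnnotations");
        FeatureMap features = Factory.newFeatureMap();
        features.put("ontologyConcept", "complexity"); // illustrative feature
        annotations.add(0L, 15L, "Mention", features); // offsets cover "Complex systems"
    }
}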


4.2.2 ANNIE

GATE is bundled with a ready set of Processing Resources for information ex-
traction named A Nearly-New Information Extraction system, or short ANNIE. Its cen-
tral processing resources are tokenizer, sentence splitter, part of speech tagger, gazetteers

and for our purposes the semantic tagger.

The tokenizer is responsible for dividing the text into known token units like words, punc-
tuation and numbers, while the sentence splitter has to identify the sentences and split
the text into them. The part of speech tagger categorizes tokens and token clusters as
parts of speech like noun, verb or adjective. All three were already introduced as lexical
analysis related tools in Subsection 3.2.2.

A gazetteer is by word heritage a geographical directory listing information about

places. However, in the domain of NLP the term's meaning changed and now gener-
ally implies a set of word lists, each referring to a certain category, for example lists
of persons, cities or companies. Listed words need thereby not exclusively be entity
names, but can also be mere indicators, like Ltd. is for a company. Furthermore, a GATE
gazetteer module provides lookup functionality to match text parts to words occur-
ring in the respective list and to annotate them with the respective list category. This

functionality can be implemented by finite state machines or hashtables.
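A hashtable-based lookup can be sketched as follows (a deliberately simplified illustration, not the actual GATE gazetteer implementation):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified hashtable-based gazetteer lookup; not the actual GATE implementation.
public class SimpleGazetteer {
    private final Map<String, String> entryToCategory = new HashMap<>();

    public void addList(String category, Set<String> entries) {
        // Store every entry in lower case to allow case-insensitive lookups.
        for (String entry : entries) {
            entryToCategory.put(entry.toLowerCase(), category);
        }
    }

    // Returns the list category of a text part, or null if it is unknown.
    public String lookup(String textPart) {
        return entryToCategory.get(textPart.toLowerCase());
    }
}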

The semantic tagger builds upon the gazetteer principle, using JAPE rules to further

describe matching patterns and the resulting annotations. JAPE [17] is short for Java

Annotations Pattern Engine and it provides finite state transduction over annotations

based on regular expressions. Hereby, it is possible to assign and process rules like
"When a text part has already been tagged by the gazetteer with the name x, then

add a feature ontology referring to the corresponding ontology concept y.". In this way

gazetteers can be used to automatically assign semantic annotations to text. This is one

of the semantic extraction approaches that are used in the Semantic Content Annotator.

Its application is explained in detail within Subsection 6.2.3.

4.3 SKOS

SKOS [31] is a W3C standard ontology description language and a particular imple-

mentation language for the ontology concept described in Subsection 3.1.1. The ab-

breviation SKOS is short for Simple Knowledge Organization System. It is based on

RDF and thereby natively integrated into the Semantic Web environment. Furthermore,
it is a lightweight modeling language specialized in hierarchical data structures like


thesauri, taxonomies and classification schemes.

SKOS concepts can possess three kinds of labels: prefLabel, altLabel and hiddenLabel.
The prefLabel property defines the preferred label of the concept and altLabel defines al-
ternative labels, which is useful to assign synonyms, acronyms and abbreviations. A
hiddenLabel is a label that can be used internally for tasks like search operations and
text-based indexing, but should never be visibly displayed. A practical example for this
might be common misspellings of actual labels. Every kind of label can optionally
have a language tag that restricts the scope of a label to this particular language and,
by doing this, enables an executing entity to preferably display the label in the native

language of the calling instance.

Figure 4.4: An exemplary SKOS taxonomy

SKOS allows three relation types - broader, narrower and related. Broader and narrower can

be used to build up a concept hierarchy as demonstrated in the example in Figure 4.4.

A relation broader of a concept tinyBooks towards a concept books would express that
tinyBooks is a subconcept of books and so every narrower instance of tinyBooks is also
an instance of books. A relation narrower of the concept books towards tinyBooks implic-

itly expresses the same. The relation related can be used to express a non-hierarchical

connection between two concepts. For example dog and dogOwner are in no way sub-

concepts of each other, but they are naturally related.
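For illustration, such a concept with its labels and relations can be represented in memory roughly as follows (a simplified Java sketch; an actual SKOS taxonomy is stored and exchanged as RDF):

import java.util.ArrayList;
import java.util.List;

// Simplified in-memory view of a SKOS concept; real SKOS data is expressed in RDF.
public class SkosConcept {
    private final String uri;
    private final String prefLabel;
    private final List<String> altLabels = new ArrayList<>();
    private final List<String> hiddenLabels = new ArrayList<>();
    private final List<SkosConcept> broader = new ArrayList<>();
    private final List<SkosConcept> narrower = new ArrayList<>();
    private final List<SkosConcept> related = new ArrayList<>();

    public SkosConcept(String uri, String prefLabel) {
        this.uri = uri;
        this.prefLabel = prefLabel;
    }

    // Adding a broader concept implicitly yields the inverse narrower relation.
    public void addBroader(SkosConcept parent) {
        broader.add(parent);
        parent.narrower.add(this);
    }

    public void addRelated(SkosConcept other) {
        related.add(other);
        other.related.add(this);
    }
}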


4.4 KEA

The abbreviation KEA [63] is short for Keyphrase Extraction Algorithm. This piece of

Java-based open source software is supposed to analyze text documents to extract a set

of keywords or keyphrases, which are multi-word units. Keyphrases are widely used

in corpora to shortly describe the content of single documents and to provide a basic

sort of semantic metadata that can be reused by other processing tasks.

The task of assigning keyphrases to a document is called keyphrase indexing. Tradi-
tionally, authors or special indexing experts have done this task manually, but with the
increasing amount of texts in digital libraries and the whole world wide web this ap-

proach is no longer sufficient. KEA provides a software-driven, free indexing approach

to automate this task.

Figure 4.5: The KEA algorithm diagram together with KEA++3

The diagram in Figure 4.5 visualizes the overall process of KEA. It can be divided in

two basic subtasks. The first step, candidate extraction, is accordingly termed as extract
candidates within the schema. It is further described in Subsection 4.4.1. The second step
is the filter process to extract those keyphrases that are most likely to be useful. This

3Accessed January 6, 2010: http://www.nzdl.org/Kea/img/kea_diagram.gif


subtask essentially involves the schema entities compute features and compute entities
when it actually works, but also includes compute model while being in training mode.
It will be detailed in Subsection 4.4.2. Finally, Subsection 4.4.3 will introduce the KEA
advancement KEA++.

4.4.1 Candidate Extraction

Candidate Extraction is responsible for extracting candidate phrases out of a plain text
using lexical methods (see Subsection 3.2.2 for general information about lexical anal-
ysis). This subtask is again dividable into the three basic steps input cleaning, candidate

identification and normalization of phrase candidates.

Input cleaning normalizes the raw input text into a standardized format. Therefore,

the text is divided into tokens, using spaces and punctuation as splitting clues. The

outcome is modified by separating out single or framing symbols like marks, brackets,

numbers and apostrophes as well as non-token characters and those tokens that do not
contain any letters. Furthermore, hyphenated words are split into pieces.

Phrase identification is the task of considering all token sequences as possible phrases

and finding the suitable candidates among them. KEA uses three conditions to match

suitability. The first condition claims that a candidate phrase is composed of a limited
number of tokens; three words appeared to be a good choice for the length configuration.
The second condition claims that proper names cannot be candidate phrases and the
third one that phrases cannot begin or end with a stopword. Stopwords are thereby a
word list containing types like conjunctions, articles, particles, prepositions, pronouns,
anomalous verbs, adjectives and adverbs that are unlikely to begin or end a useful

phrase.

The third task is meant to normalize the identified phrase candidates by stemming and

case-folding. Stemming is usually achieved by iteratively cutting the suffix of the can-

didate until just the stem is left. Case-folding is simply done by a general lower case

conversion. Additionally, multi-word phrases can be re-ordered, so that for instance
Technical Supervisor and supervising technician both result in the same normalized form.
This extracted form is called a pseudo-phrase. Besides, the most frequent original
phrase of every pseudo-phrase is determined and presented as phrase label to human
users.
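The normalization into pseudo-phrases can be sketched as follows (a strongly simplified illustration with a trivial suffix-stripping stemmer; the actual KEA implementation uses more elaborate stemming and stopword handling):

import java.util.Arrays;
import java.util.stream.Collectors;

// Simplified sketch of pseudo-phrase normalization; not the actual KEA code.
public class PseudoPhraseNormalizer {

    public static String normalize(String phrase) {
        return Arrays.stream(phrase.toLowerCase().split("\\s+")) // case-folding and tokenization
                .map(PseudoPhraseNormalizer::stem)               // reduce each token to a crude stem
                .sorted()                                         // re-order tokens alphabetically
                .collect(Collectors.joining(" "));
    }

    // Very crude suffix stripping, only for illustration.
    private static String stem(String word) {
        for (String suffix : new String[] {"ing", "ical", "ies", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}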


4.4.2 Filtering

The filtering task is responsible for choosing the most suitable keyphrases out of a given

set of keyphrase candidates. To achieve this, the candidates have to be measured in a
way that makes them comparable. Thereupon, it must be decided which candidates

will be chosen as keyphrases. The three applied metrics for free indexing are TFxIDF,

first occurrence and phrase length.

TFxIDF is a frequency metric that relates the phrase occurrence in a particular docu-
ment to its occurrence frequency in all preceding documents. The idea behind this
is that a phrase is more likely to be a keyphrase when it occurs often inside the respec-
tive document and this occurrence is also generally rare on corpus average. Rareness
relates in this case to unpredictability and thereby to a higher amount of information
gain. For example, the fact that a document is related to chemistry is not very surprising
inside a purely chemical corpus, but may be a useful descriptor when the document
is a member of a computer science corpus. The formula for TFxIDF is:

\[
\mathrm{TF{\times}IDF} = \frac{freq(P, D)}{size(D)} \times \left( -\log_2 \frac{df(P)}{N} \right)
\]
where

1. TF is the term frequency in the actual document
2. IDF is the inverse document frequency, which measures the probability of a term to
occur in a document of the corpus

3. freq(P,D) is the number of times term P occurs in document D

4. size(D) is the number of words in document D

5. df(P) is the number of documents containing the term P in the global corpus

6. N is the size of the global corpus.

The second feature metric is the relative position of first occurrence. It is calculated

with the following formula:

\[
FO = \frac{prec(P, D)}{size(D)}
\]
where

1. FO is the relative value of the first term occurrence
2. prec(P,D) is the number of words in document D preceding the first occurrence of the
term P

3. size(D) is the total number of words in document D

The third feature is the phrase length. This metric gives attention to the idea that
human indexing experts tend to choose two-word phrases instead of one- or three-word

phrases. Therefore, it might be reasonable to weight these candidates higher.
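The two formulas above translate into code along the following lines (a simplified sketch with self-explanatory parameter names, not the actual KEA implementation):

// Simplified sketch of the two KEA feature formulas; not the actual KEA implementation.
public class KeaFeatures {

    // TFxIDF = freq(P,D) / size(D) * (-log2(df(P) / N))
    public static double tfIdf(int freqInDocument, int documentSize,
                               int documentsContainingPhrase, int corpusSize) {
        double tf = (double) freqInDocument / documentSize;
        double idf = -(Math.log((double) documentsContainingPhrase / corpusSize) / Math.log(2));
        return tf * idf;
    }

    // FO = prec(P,D) / size(D): relative position of the first occurrence.
    public static double firstOccurrence(int wordsBeforeFirstOccurrence, int documentSize) {
        return (double) wordsBeforeFirstOccurrence / documentSize;
    }
}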


After the features have been derived, the selection itself must be processed. Therefore,
KEA applies a machine-learning algorithm based on WEKA [64], which is supposed to
learn how to evaluate a phrase candidate and afterwards do it autonomously. Within the
approach categorization introduced in Subsection 3.2.1, KEA uses a statistical approach,
working with naive Bayesian networks to build up the prediction model. Hence KEA

needs a set of already annotated documents in the first place to train the model how

to distinguish useful keyphrases among the candidates. Once the model is sufficiently

trained, KEA is able to differentiate useful and useless keyphrases from unknown doc-

uments quite well.

4.4.3 KEA++

KEA++ [43] is an advancement of KEA that enhances it by the possibility of controlled

indexing. Since version 4 it is also included in the KEA main distribution. Controlled

indexing in contrast to free indexing restricts the number of possible keyphrases to a

fixed set of determined phrases. The advantages and disadvantages of these two ap-

proaches were already discussed in Subsection 3.2.7. Summarizing, one may say that

controlled indexing is very useful in fixed domains and in cases where predictable

keyphrases are an important requirement. The most fundamental resource of con-

trolled indexing is the thesaurus. KEA++ is hereby designed for using SKOS tax-
onomies (see Section 4.3 for information on SKOS).

An effect of the advancement on the candidate extraction is that a phrase has to be
successfully matched with a thesaurus entry before it is considered as a keyphrase can-

didate. The matching process is done by normalizing the thesaurus entries in the same

way as the candidates and comparing the pseudo-phrases instead of the originals in

order to avoid complex morphology handling.

An additional metric in the feature extraction process of the controlled index approach

is the node degree, which uses the number of direct semantic relations from one can-

didate to the others as clue for the representativity, thus modifying its weight. This

feature has the interesting effect that even phrases that do not actually appear in the
text might become keyphrase candidates, just because they are well connected to the
other candidates. For instance, a text that mentions astronomy, biology, physics, chem-

istry and earth science might be well described by the common related term natural

science, albeit it does not appear in the text.

Another new metric, directly resulting from the preceding one, is the actual appearance

of a phrase in the text. Although a particular node degree might lead to the conclusion
that a non-appearing thesaurus term may be a good candidate, appearance is still a
strong indicator that should be considered in the selection process.

4.5 Calais

Calais [12] is a strategic initiative at Thomson Reuters that aims to improve the inter-

operability of content. Therefore, it utilizes state-of-the-art natural language processing

techniques to "turn static text into Smart Media that is enriched with open data and con-

nected to a dynamic Linked Content Economy" [61]. More precisely, Calais provides free
metatagging services, developer tools and an open standard for the generation of se-
mantic content. The key component of their efforts is the OpenCalais web service, which
will be detailed in Subsection 4.5.1. Finally, the underlying data format will be treated
separately in Subsection 4.5.2.

Figure 4.6: Input and output data of the OpenCalais web service4

4.5.1 OpenCalais WebService

OpenCalais is the web service at the core of the Calais initiative. It is an API that takes

unstructured plain text as input, processes it with natural language processing and ma-

4Accessed on January 8, 2010:http://enioaragon.files.wordpress.com/2009/12/12-03-calais.jpg?w=450&h=325


chine learning methods and returns a semantically annotated text version to the user.

The access to the web service is basically free, even for commercial use, but requires a
registration to get an API key, which is mandatory for every request. Furthermore, the re-

quest frequency for a single API key is currently limited to fifty thousand transactions

per day and four transactions per second.

Method invocation can be done by sending either SOAP or REST requests. Calais takes

the committed content, which must not be larger than one hundred thousand characters,
identifies the occurring entities therein and tags them with metadata. Relevant entity
classes are categories, named entities, facts and events, as shown in Figure 4.6. The exact
data model is described further in Subsection 4.5.2. Web service responses return the

improved content with all the assigned tags, document IDs and URIs as RDF, Microfor-

mats, JSON [15] or Calais’ hybrid Simple Format.
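Such a REST invocation boils down to a plain HTTP POST request, as the following simplified Java sketch shows; the endpoint URL and parameter names are assumptions for illustration and have to be taken from the current OpenCalais documentation:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative REST call; endpoint and parameter names are assumptions.
public class OpenCalaisClient {
    public static int submit(String apiKey, String content) throws Exception {
        URL endpoint = new URL("http://api.opencalais.com/enlighten/rest/"); // hypothetical endpoint
        String body = "licenseID=" + URLEncoder.encode(apiKey, "UTF-8")      // hypothetical parameter names
                + "&content=" + URLEncoder.encode(content, "UTF-8");

        HttpURLConnection connection = (HttpURLConnection) endpoint.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return connection.getResponseCode(); // the annotated RDF could be read from the input stream
    }
}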

4.5.2 Data Model

OpenCalais' data model is strongly oriented towards the linked data design principle, prop-

agated by W3C director Tim Berners-Lee [5]. This principle can be described by four

simple rules:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards

(RDF, SPARQL)

4. Include links to other URIs. so that they can discover more things.

According to points one and two, OpenCalais identifies every relevant object with

an HTTP URI. Common URIs relate to types, type instances, documents, text instances

or resolution nodes. Types are the predefined entity categories, that Calais provides.

Their URIs are statically formed like http://s.opencalais.com/1/type/em/e/Company.

Type instances are special individuals of a certain type. For example the ClearForest

Ltd. would be a type instance of Company. Their URIs are composed of a type related

prefix and an instance specific hash token. The URI of the ClearForest Ltd. would be

http://d.opencalais.com/comphash-1/899a2db3-ce69-3926-ba4f-6dea099c3fc9. If the

relevance feature is turned on, the RDF also includes a score that estimates the im-

portance of the entity for the document.

Document URIs refer to the actual text, that is sent within the request, and are com-

posed like type instances, but with a prefix referring to document. An example URI is

http://d.opencalais.com/dochash-1/00b00ecd-7e8b-3773-b30f-2169abd75efe.


Type instances like the ClearForest Ltd. on their part possess instances in the docu-

ment, each describing an actual occurrence in the text. These text instances are referred

to by an ID composed of their container document’s URI and a document internal in-

stance counter as suffix. For example the sixth instance of a document may be named

http://d.opencalais.com/dochash-1/00b00ecd-7e8b-3773-b30f-2169abd75efe/Instance/6.

A special problem are ambiguous type instances. These cases are resolved by noting a
general ambiguous type entity followed by a resolution node that represents the most
likely disambiguated instance. For example, the city Paris would be described by an am-

biguous type instance http://d.opencalais.com/genericHasher-1/56fc901f-59a3-3278-

addc-b0fc69b283e7 and the additional resolution node http://d.opencalais.com/er/geo/city/ralg-

geo1/797c999a-d455-520d-e5cf-04ca7fb255c1, describing the special individual Paris,

France and rating the certainty of disambiguation correctness by an attribute score.

Figure 4.7: An example application of linked data5

Like points three and four suggest, the URIs can be looked up again to gain additional,

RDF-based information, including new related URIs for further research. In doing so

OpenCalais helps to semantically link the data, creating an expanding semantic net-

work. Furthermore it links with external open data sources like Wikipedia6, IMDB7,

5Accessed on January 8, 2010:http://www.slideshare.net/KristaThomas/simple-opencalais-whitepaper

6 http://wikipedia.org/
7 http://www.imdb.com/


Geonames8, Citeseer9, IEEE10, Project Gutenberg11 and many more to even intensify

the usefulness of the provided metadata.

In some cases the resulting linked data can yet be used to identify relationships between
people, businesses and others, thereby potentially easing investigation processes a lot
and providing many other automation possibilities. Traditionally, a human infor-

mation search would take some serious time to link the IBM Corporation with Warren

Buffett, whereas a software with an extracted semantic net like that of Figure 4.7 may

draw the conclusion quite rapidly.

8 http://www.geonames.org/
9 http://citeseer.ist.psu.edu/
10 http://www.ieee.org/portal/site
11 http://www.gutenberg.org/wiki/Main_Page


CHAPTER 5

Related Work

This chapter provides an overview of the related work that has been done in the field
of information filtering, which is the most relevant research area to the CompleXys
project. The focus will lie on recent papers, theses and projects that are mostly repre-
sentative of extensive subfields. They were chosen due to their similarities and differ-
ences towards the characteristics of CompleXys. Each of them is suitable to situate the
accomplished work within the context of germane, international research.

Various different definitions have been made for information filtering, especially con-

cerning the relationship of information filtering to information retrieval. A frequently

cited paper in this aspect is [3], which states that "Information filtering is a name used

to describe a variety of processes involving the delivery of information to people who

need it." Belkin and Croft finally concluded, that information retrieval is on such an ab-

stract level just another side of the same coin and even a superior discipline to filtering

as its specialization. This view influences the research up to today, but also receives

criticism for constricting the perspective and thereby suppressing attention to the spe-

cific attributes of information filtering [47]. Oard phrases another popular definition

in 1996, stating "Text Filtering is an information seeking process in which documents

are selected from a dynamic text stream to satisfy a relatively stable and specific infor-

mation need." and thereby classifying it among related fields as specified in Table 5.1

[48]. Others draw the line between access styles, dividing between information filtering

as passive push and information retrieval as active pull research [34]. And yet others

see information retrieval as equivalent to one-time-querying and information filtering

as equivalent to continuous querying or selective dissemination of information [66].

One early idea of information filtering was already presented in 1982 in [18]. The au-

thors proposed an approach to automatic ordering of incoming emails according to

their priority. This work marked the beginning of email classification as a very


Process                  Information Need       Information Sources
Information Filtering    Stable and Specific    Dynamic and Unstructured
Information Retrieval    Dynamic and Specific   Stable and Unstructured
Database Access          Dynamic and Specific   Stable and Structured
Information Extraction   Specific               Unstructured
Alerting                 Stable and Specific    Dynamic
Browsing                 Broad                  Unspecific
Entertainment            Unspecific             Unspecific

Table 5.1: Examples of information seeking processes [48]

intensive application field of information filtering. Nowadays nearly all providers use

some kind of spam filter, making them probably the most frequent information filtering

applications.

Information filtering generally needs some kind of source metadata to match it with cer-

tain filter conditions and thereby decide whether to filter or display a single resource.

So it can be divided by the coarse sources of data into content-based filtering and col-

laborative filtering. In the subfield of content-based filtering this data is extracted out

of the content itself, while the competing collaborative filtering uses averaged com-

munity behavior in comparable contexts as model. The latter received much attention

recently, due to several very successful applications like the Amazon1 and Last.fm2

portals, which leverage the approach for product and music recommendations. Fur-

thermore the social web hype is a fortunate time for building and leveraging web com-

munities. Ongoing research work for collaborative filtering involves projects like the

social web page aggregator MakeMyPage [29]. This thesis however focuses on content

based filtering, because we can not assume to always have an adequately big commu-

nity present. Content-based filtering has not yet been as successfully utilized as its

competing approach [47], albeit the great potential of personalized access to informa-

tion networks beyond community borders is obvious and often recognized.

The content data itself is not of any filtering use without the information about what the
user actually wants to get. There are various approaches to select and present informa-

tion tailored to the user interest. The two most important possibilities are personalized

models and queries. Queries are the most common expression of information need in

today's information systems. They are traditionally collections of metadata phrases
that describe a special one-time resource request. They are used in nearly every web-

1 http://www.amazon.com
2 http://www.last.fm


site from search engines over weblogs to e-stores. However, these queries are normally

restricted to relatively simple tasks and lack any possibility of support for long-term
research. Instead, one has to manually combine a lot of those queries over time, provid-

ing all necessary context metadata over and over again. This is just natural in respect

of their heritage in information retrieval, but makes them unhandy for information

filtering. A special idea to overcome these shortcomings are continuous [65] or retroac-

tive [14] [24] queries. They are based on the idea that some queries cannot be sufficiently
answered in the first research step and ongoing investigation might be interesting. Rea-
sons for this can be that new information could become available later and the researcher
wants to stay up to date, or that some information's value is simply coupled to certain
events that have not happened yet. Retroactive queries are an intersecting approach be-

tween queries and personalized models, because they are structured like queries, but

persistently retained like personalization data.

CompleXys itself uses a user-model-driven, personalized filtering approach. User models can be short-term, long-term or intermediate models. Short-term models concentrate on the recent behavior of the user, while long-term models focus on permanent user interests. Task models [32] are an intermediate approach between both: they can spread over many sessions and time units, but are restricted to a certain task scope. Their purpose is similar to that of retroactive queries, trying to satisfy the special needs of complex, multi-session tasks. This thesis' work is essentially combinable with all of them, although the ontology annotation implicitly suggests a corresponding ontology-based profile. The benefit of ontologies in user modeling has been investigated intensively during the last years [23] [55] [53] and is likely to become quite common with the emergence of the Semantic Web, both of which underline this suggestion.

Information filtering is often seen as a binary text classification problem [49] [36]. This thesis' work differs from most text classification based filters insofar as it does not traditionally derive one fixed category, but lists category candidates ordered by their probabilities, thereby combining a multi-value classification with fuzzy data. This is useful because a text can be relevant in many aspects: for example, one may want to read this thesis because of its relation to information filtering, because it applies GATE or because it is part of an adaptive portal project. However, traditional text classification would be inappropriate to achieve this, because it narrows the perspective to a single aspect and deletes the less probable data. Furthermore, a simple binary decision for each classification candidate ignores certainty levels and provides no possibility to order documents based on their category relevance. Using the probability order approach, prioritization can be a natural part of the information filtering process, among other things easing smart, dynamic display styles in information portals.

A fundamental design decision of each information filtering system is its domain dependency. A domain-specific approach can take the unique circumstances of a certain domain into account and harness them to raise its performance. Such approaches were for instance investigated for high energy physics in [54] and for the agriculture domain in [43]. Both, however, used domain ontologies for this purpose, so their basic systems are theoretically applicable to any other domain by simply exchanging the underlying domain model. This is also the approach that was chosen for the CompleXys project, because it focuses on the complexity science domain and competing domain-independent approaches like the Wikipedia-based one in [58] do not yet achieve a comparable classification performance, albeit making appreciable progress.

Recent research also indicates that the consideration of background knowledge is capable of raising the classification performance. Domain ontologies are an obvious source for such approaches, even when they are automatically created [7]. But other thesauri like WordNet have also been utilized successfully [19] [41] [25]. Within this thesis' work the keyphrase extractor KEA [44] was employed to harness the relation data between the domain-specific keyphrase concepts.

A comparable project to CompleXys was h-TechSight [42], which in 2004 implemented

an information filtering system on a controlled vocabulary text classification basis. Be-

sides their goal of better performance, they also utilize ontology technologies, thereby

being natively compatible with the Semantic Web. This leads to various possibilities

like achieving performance gains by utilizing external linked data and elaborate Se-

mantic Web reasoners.

A more recent related project is the PIRATES framework [2], whose name is an abbreviation for Personalized Intelligent Recommender and Annotator TEStbed. They provide a promising architecture for text-based content retrieval and categorization with a focus on constructive interoperability with social bookmarking tools, describing their project as "a first step towards the generation and sharing of personal information spaces described in [13]". There are some obvious intersections with the current approach of the CompleXys project, including a module that utilizes GATE for information extraction on the basis of part-of-speech tagging and one module that uses a variation of the KEA algorithm for keyphrase extraction. Albeit the main goals are quite different, a properly configured PIRATES implementation including a fitting domain model could possibly provide a suitable alternative to the CompleXys system. But although a detailed comparison would have been interesting, the scope of this thesis did not allow for one.


Recent research is often influenced by the ongoing emergence of the social web. Besides the PIRATES project, there has been research on the multi-value classification of very short texts like comments or microblogging entries in [26] and on the categorization of blogger interests based on short blog post snippets in [39]. Up to now CompleXys mostly uses whole websites and reformats them into the SIOC format. There is no differentiation by content size yet, but due to the positive results of the mentioned papers this may change in the future.


CHAPTER 6

Semantic Content Annotator

The Semantic Content Annotator module, which was the first of the two practical development goals, is the focus of this chapter. It extracts and annotates semantic data from input text and classifies the text according to the concepts of a CompleXys domain ontology, which was required to perform the text classification in this module. Section 6.1 describes the design and implementation of the central domain ontology. Section 6.2 introduces the principle of CompleXys Tasks and describes their implementation instances, thereby explaining how the module's goals are achieved.

6.1 CompleXys Domain Ontology

This section is dedicated to the design of a CompleXys domain ontology. Therefore,

Subsection 6.1.1 will outline the focus and adjacent topics of complexity. Afterwards

Subsection 6.1.2 will describe the CompleXys taxonomy, that is derived from the for-

merly identified issues.

6.1.1 Complexity

Complexity is basically an interdisciplinary approach to science and society. It tries to

describe, understand and apply a wide range of chaotic, dissipative, adaptive, nonlin-

ear and complex systems and phenomena [8]. It cannot be strictly defined, but only sit-

uated in between order and disorder. A complex system is usually modeled as a set of

interacting agents, which represent diverse components like people, cells or molecules.

Because of the non-linearity of the interactions, the overall system evolution is to an im-

portant degree unpredictable and uncontrollable. Still the system tends to self-organize,

in the sense that local interactions eventually produce global coordination and synergy.


The resulting structure can often be modeled as a network, with stabilized interactions

functioning as links connecting the agents. Such complex, self-organized networks typ-

ically exhibit the properties of clustering, being scale-free, and forming a small world

[27]. Complex patterns can be found in many traditional research fields, so complexity is nearly as universally applicable as information science and mathematics and has attracted increasing attention throughout the last decades. However, this universality is hard to capture and to describe in a formal way, such as a comprehensive domain model.

6.1.2 CompleXys Taxonomy

The first considerations regarding the domain model led, among other things, to the conclusion that complexity is a topic with strong relations to many scientific areas. These scientific areas are classified into certain disciplines like biology, mathematics and computer science, each of which involves many subfields in turn. So it is obvious to reuse this hierarchical classification for the scaled-down area of complexity-related terms. Moreover, this style of hierarchical topic division is well known, easy to visualize and intuitively understandable and usable by the users of the system. These thoughts also led to the decision that a hierarchical taxonomy would be sufficient for CompleXys and that the taxonomy modeling language SKOS (see Section 4.3) is therefore a sensible choice.

The task of building an exhaustive domain model is beyond the scope of this thesis. So, for a proof-of-concept implementation, the domain model was restricted to a semiautomatically created version that may be revised as part of future work. First, a set of representative and broadly scoped text data had to be found. A good place to look for such data is a specialized encyclopedia, so the first chosen source was the topical table of contents1 of the Encyclopedia of Complexity and Systems Science [45]. The second identified source were the titles of talks given at several complexity conferences. Three conferences were chosen for this purpose:

• Complexity in Science and Society : International Conference & Summer School,

Greece, 14-26 July 20042

• Conference on Nonlinear Science and Complexity, China, 7 - 12 August 20063

1 11.01.2010: http://www.springer.com/cda/content/document/cda_downloaddocument/Topical+Table+of+Contents.pdf?SGWID=0-0-45-783798-p173779107
2 11.01.2010: http://www.math.upatras.gr/ crans/complexity/
3 11.01.2010: http://www.siue.edu/ENGINEER/ME/NSC2006/


• Conference on Nonlinear Science and Complexity, Portugal, 28 - 31 July 20084

In the next step these text documents had to be analyzed in order to extract important terms. A suitable tool for this is the keyphrase extractor Maui [44]. Maui is based on KEA, which was already introduced in Subsection 3.4, but is designed to natively accomplish a broader range of tasks. It works on a machine learning basis, so it is necessary to train a model before the actual keyphrase extraction can be done. In order to train Maui well for our requirements, the training set needs to cover an extensive range of general topics, because, as stated in Subsection 6.1.1, complexity is distributed over many traditional research fields. Such a training set is the CiteULike-180 data set5 [44], which has been automatically extracted from the huge collection of tags assigned on the bookmarking platform CiteULike6.

After being trained, Maui is used to extract a large number of tag candidates from the four source text files by calling the MauiTopicExtractor with the n parameter set to 1000. The resulting .key files contain the extracted tag candidates, but also include many that are too general to be suitable as terms of a domain model. So the files are manually scanned on a superficial level and obviously unfitting terms like "group" or "number" are sorted out. Then the remaining terms were manually clustered into main categories. These categories are derived from the major scientific disciplines the terms are most related to. Additionally some other terms, resulting from a preceding investigation, were added manually. The final taxonomy includes 297 terms divided into ten main categories. The terms are shallowly organized on two hierarchical levels, main categories and appendant terms. They may be further hierarchically classified to improve their expressiveness in the future. Furthermore, some terms are interconnected by the relation type related to express either topical closeness between two terms or an ambiguity of belonging, when a term could be assigned to more than one main classification. Figure 6.1 shows an excerpt of the model as a taxonomy circle. The ten main categories are displayed in the inner circle, while the outer circle contains examples of appendant terms. The connections between some of the terms are exemplary for the use of the related relationships.

4 11.01.2010: http://www.gecad.isep.ipp.pt/NSC08/
5 11.01.2010: http://maui-indexer.googlecode.com/files/citeulike180.tar.gz
6 http://www.citeulike.org/


Figure 6.1: An excerpt of the CompleXys taxonomy

6.2 Semantic Content Annotator Pipeline

This section introduces the Semantic Content Annotator pipeline. This pipeline is responsible for the extraction of semantic data from the incoming documents and for the annotation of this data back to the resources. Furthermore, it is meant to decide whether a resource is relevant for complexity and to which main classification it should be assigned. The succeeding subsections describe how these problems are solved by explaining the principle of the CompleXys Tasks and their implementation instances. Subsection 6.2.1 gives an overview of the structure and purpose of the pipeline and the Subsections 6.2.2 to 6.2.6 describe the components Crawled Content Reader, Onto Gazetteer Annotator, Kea Annotator, Open Calais Annotator and Content Writer.


6.2.1 Introduction

The Semantic Content Annotator module is meant to take a potentially high number of documents as input, to analyze them in order to extract semantic data, to decide whether they are relevant for complexity, to fuzzily classify them into the topics of the domain model and finally to output them again. Obviously this involves several sequential steps in which each document has to be processed. That makes this module a perfect candidate for a parallel processing pipeline structure. The main advantage of such an approach is the effective exploitation of distributed processing and, most of all, of multi-core processor systems. It is therefore an effective way to raise processing performance and scalability.

Figure 6.2: The CompleXysTask principle

The Semantic Content Annotator utilizes the Java package java.util.concurrent to implement such a pipeline. The basic principle is visualized in Figure 6.2. Every coherent component is implemented as a runnable task object and submitted to a thread pool. ConcurrentLinkedQueues handle the communication between the several tasks. Each queue has a sender task and a receiver task. Whenever a sender task has finished its function on a certain document, it sends it to its output queue. At the other side of the queue the receiver task takes every document in first-in-first-out order and starts processing it. Every task possesses a unique name and a set of features that can be used to transmit all kinds of special information a task may need. For example, the only standard feature used so far is debug, which enables a centralized control of debugging output in the Semantic Content Annotator main class. Generally there are three kinds of tasks in this module, differentiated by the number and usage type of their queues, by their termination dependency and by their basic duties: the initiating task, CompleXys Tasks and the finishing task. Furthermore, every task is linked to a Future object, which is basically a flag that describes the termination state of the thread it runs in. Every task except the first listens to the preceding task of the pipeline and terminates exactly when the preceding task has terminated and no document is left in its input queue.
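The following sketch condenses this principle into compilable Java. The class name CompleXysTask and the debug feature are taken from the description above, while everything else (the method names, the use of Thread.yield() and the wiring in main) is a simplified assumption rather than the actual CompleXys source.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the CompleXysTask principle: every pipeline stage is a Runnable that
// reads documents from its input queue, processes them and forwards them to its
// output queue. It terminates once the preceding task is done and the input
// queue has run empty. Documents are represented by plain Objects here.
abstract class CompleXysTask implements Runnable {

    protected final ConcurrentLinkedQueue<Object> in;
    protected final ConcurrentLinkedQueue<Object> out;
    protected final Map<String, Object> features;   // e.g. the "debug" flag
    private final Future<?> predecessor;             // completion state of the preceding task

    CompleXysTask(ConcurrentLinkedQueue<Object> in, ConcurrentLinkedQueue<Object> out,
                  Future<?> predecessor, Map<String, Object> features) {
        this.in = in;
        this.out = out;
        this.predecessor = predecessor;
        this.features = features;
    }

    /** The actual analysis step of the concrete annotator. */
    protected abstract Object process(Object document);

    @Override
    public void run() {
        // keep polling until the preceding task has terminated and nothing is left
        while (!(predecessor.isDone() && in.isEmpty())) {
            Object document = in.poll();
            if (document == null) {
                Thread.yield();                  // nothing to do yet, let other tasks run
                continue;
            }
            out.offer(process(document));        // forward the processed document (FIFO)
        }
    }
}

class PipelineSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        ConcurrentLinkedQueue<Object> q1 = new ConcurrentLinkedQueue<>();
        ConcurrentLinkedQueue<Object> q2 = new ConcurrentLinkedQueue<>();
        Map<String, Object> features = new HashMap<>();
        features.put("debug", Boolean.TRUE);

        // the initiating task has no input queue; here its output is faked by one document
        q1.offer("document 1");
        Future<?> reader = pool.submit(() -> { /* Crawled Content Reader would run here */ });

        pool.submit(new CompleXysTask(q1, q2, reader, features) {
            @Override protected Object process(Object document) { return document; } // identity stage
        });
        pool.shutdown();
    }
}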

The initiating task is the first task in the pipeline. Accordingly it possesses an output,

but no input queue. Instead it is responsible for collecting the necessary resources and

the already included metadata itself, which is its main purpose as well. It terminates,

when no documents are left to collect. The implemented initiating task of CompleXys

is the Crawled Content Reader. In Subsection 6.2.2 it will be described in detail.

Figure 6.3: The Semantic Content Annotator Pipeline

CompleXys Tasks are implementations of the abstract class CompleXysTask. They are

characterized by possessing both input and output queue. Their purpose is the actual

analysis, semantic data extraction and classification of the incoming documents. Three

CompleXysTask instances were implemented throughout this thesis’ work. The Onto

Gazetteer Annotator is further described in Subsection 6.2.3, the KEA Annotator in

Subsection 6.2.4 and the Open Calais Annotator in Subsection 6.2.5.

The finishing task is the last task in the pipeline. Accordingly it possesses an input,

but no output queue. Instead it is responsible for outputting the documents itself. The

whole pipeline and therewith the Semantic Content Annotator terminates, when the

finishing task is done. In CompleXys the finishing task is called Content Writer. It will

be further described in Subsection 6.2.6.

Figure 6.3 provides an overview of the implemented pipeline and its connected data

stores.


6.2.2 Crawled Content Reader

The Crawled Content Reader is the first component of the pipeline and its main purpose

is to gather the documents from the input data store, to decide whether they should be

processed, to wrap them into the GATE data format (see Subsection 4.2.2) and to send

them into the output queue for further processing.

First it builds up connections to both the input data store, where the unprocessed documents are stored, and the output data store, where the processed documents are stored. The former will be referred to as Harvester DB, because the Harvester module is responsible for steadily filling it with documents, and the latter will be referred to as Semantic DB, because it is filled by the Semantic Content Annotator module and the stored data additionally includes the semantic information. The connection to the Semantic DB is built by using an intermediate persistence layer consisting of a set of DataAccessObjects7 (DAO), Factories and GATE's Hibernate Persistence Layer, resulting in a strong layer division and high data store exchangeability.

It fetches all documents stored in the Harvester DB and checks whether each document is already stored in the Semantic DB and thus was already processed once. This must be done because the Harvester DB actually needs the old documents to check for subsequent modifications and to decide whether a resource is new. This step should become obsolete in the future, because a modification time stamp or an unprocessed flag could help to access only new and modified documents in a targeted way. If the document can be found there, both versions are compared by a hash value of their content to find out whether the text was modified in the meantime. If it was not modified and the hash values are equal, the document is ignored, because everything is still up to date; but if something was changed, the correctness of all recent annotations and potentially even of the classification is uncertain. This dilemma is currently solved by simply deleting the document from the Semantic DB and treating it as if it were a new one.
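A minimal sketch of this modification check, using an SHA-1 hash over the document content; the class and method names are illustrative and not the actual CompleXys persistence API.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the Crawled Content Reader's decision: a document is (re-)processed
// only if it is unknown to the Semantic DB or its content hash has changed.
class ModificationCheck {

    /** Hex-encoded SHA-1 hash of the document content. */
    static String contentHash(String content) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] bytes = digest.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true if the harvested document has to be processed by the pipeline. */
    static boolean needsProcessing(String harvestedContent, String storedHash)
            throws NoSuchAlgorithmException {
        if (storedHash == null) {
            return true;                         // unknown document: process it
        }
        // equal hashes: still up to date, skip it; different hashes: the stored
        // annotations are stale, so the old version is deleted and the document
        // is treated as a new one
        return !contentHash(harvestedContent).equals(storedHash);
    }
}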

If a document needs further processing, it is wrapped as a GATE document object, thereby committing it to the GATE persistency management of the Semantic DB. After this is accomplished, the document is sent to the output queue and the next document is handled. The Crawled Content Reader terminates when no unprocessed or modified document is left in the Harvester DB.

7 13.01.2010: http://java.sun.com/blueprints/corej2eepatterns/Patterns/DataAccessObject.html


6.2.3 Onto Gazetteer Annotator

The Onto Gazetteer Annotator searches the text for keywords that are listed in the gazetteers and annotates found terms with the corresponding concept. The frequency of occurring concept annotations can be used as a simple indicator for the classification. The implemented version does not make use of preprocessing NLP methods like stemmers or part-of-speech tagging and is therefore likely to provide worse recall and precision results than the KEA Annotator, which is described in Subsection 6.2.4.

The central element of this component is the OntoGazetteer or semantic tagger that is included in the information extraction system ANNIE (see Subsection 4.2.3). It is not directly applicable to the SKOS CompleXys taxonomy, but can make use of a derived, rule-based version. Therefore, every main category of the domain model gets its own .lst gazetteer file, in which all subordinate terms are listed one per line. The terms are noted in both singular and plural form to prevent at least the worst effects of the missing stemmer. A file mappings.def defines the mapping rules from the .lst files to SKOS concepts in the form "COMPLEXITY.lst:SkosTaxonomy.rdf:Complexity", where COMPLEXITY is the name of the .lst file, SkosTaxonomy the path of the SKOS taxonomy and Complexity the name of the concept. However, the expressiveness of the gazetteer data is very limited, so the relationships cannot be transformed and an extension of the taxonomy to more than two hierarchical layers would complicate the mapping enormously.
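For illustration, a gazetteer list and the corresponding mapping entries could look like the following lines; the listed terms and the MATHEMATICS list are invented examples, only the entry format is the one described above.

COMPLEXITY.lst:
chaos theory
chaos theories
complex system
complex systems

mappings.def:
COMPLEXITY.lst:SkosTaxonomy.rdf:Complexity
MATHEMATICS.lst:SkosTaxonomy.rdf:ComplexMathematics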

At the beginning, the Onto Gazetteer Annotator component initializes the OntoGazetteer object. In doing so, it disables case sensitivity, because case sensitivity would unnecessarily filter out words and cannot compensate this disadvantage by any perceptible effect. GATE's default gazetteer is used as the processing basis and the resulting annotations are written into a dedicated "OntoGazetteerAnnotator" annotation set, so they can be accessed in a targeted way later on. When a document is received through the input queue, it is simply given as a parameter to the OntoGazetteer, where it is tagged. After taking the document back, the component forwards it to the output queue. The Onto Gazetteer Annotator terminates when the Crawled Content Reader has terminated and no documents are left in the input queue.

6.2.4 Kea Annotator

The KEA Annotator classifies a document into the concepts of the CompleXys domain model. To achieve this, it utilizes automatically extracted KEA keyphrases (see Section 4.4) as indicators for a term's relevancy to a text. To ensure that the keyphrases are matchable to the domain model, it simply uses the CompleXys taxonomy itself as the controlled vocabulary for the extraction process. Additionally, it utilizes the related relationships as weight boosting functions. This approach is likely to raise the classification precision by using elaborate relevance indicators, semantic background knowledge and a pre-filtering of unlikely candidates. The Kea Annotator is therefore expected to outperform the Onto Gazetteer Annotator.

A pre-implemented GATE plugin for KEA, which converts native KEA keyphrases to GATE annotations, already existed. However, it was not yet adapted to the use of controlled indexing and the general KEA++ functionality that was required for the fixed-term classification task. So the missing intermediary functions were manually implemented as part of this thesis and the plugin was adapted to KEA++. As classification model, CompleXys uses a pre-trained model, again based on the CiteULike-180 data set8 that was already applied in the ontology extraction process (see Section 6.1).

The Kea Annotator component itself is quite simple. It just fetches every new document

from the input queue, gives it as parameter to the Kea class of the GATE plugin, that

extracts and annotates the keyphrases into a special "KEAAnnotator" annotation set,

and finally writes it back into the output queue. The Kea Annotator terminates, when

the Onto Gazetteer Annotator has terminated and no documents are left in the input

queue.

6.2.5 Open Calais Annotator

The Open Calais Annotator utilizes the OpenCalais web service (see Subsection 4.5.2) to semantically annotate the text's entities. The data obtained this way is not yet used for the domain classification, but links the document to Calais' extensive external knowledge base. Exploiting these relations has great potential for further improving the classification, but also for other features like enriching the displayed resources in the front end with additional data. However, actually making use of this data has to be part of future work on CompleXys.

CompleXys uses the existing OpenCalais GATE plugin to access the web service and to convert its responses into GATE annotations. The plugin has to be initialized with the web service URL and an OpenCalais API key, which can be requested on the Calais website9. Then every incoming document can be given to the plugin, which processes it and writes the annotations into a dedicated "OpenCalaisAnnotator" annotation set, so that they can be accessed in a targeted way later on. The processed document is forwarded to the output queue. The Open Calais Annotator terminates when the Kea Annotator has terminated and no documents are left in the input queue.

8 11.01.2010: http://maui-indexer.googlecode.com/files/citeulike180.tar.gz
9 http://www.opencalais.com/

6.2.6 Content Writer

The Content Writer ensures that every document is correctly stored in the Semantic DB before the pipeline terminates. Optionally it can calculate comparing or summarizing values from the results of the preceding components, because its finishing position guarantees that every value has reached its final version when it arrives at the Content Writer.

First it builds up a connection to the Semantic DB and waits for the first documents to arrive at the end of the pipeline. After receiving a document from the input queue, it calculates one or more main categories according to the results of the Onto Gazetteer Annotator and the Kea Annotator in the corresponding annotation sets and writes them as a "mainCategory" feature to the feature set of the document. Then it stores the current version of the document in the Semantic DB and tries to process another document. The Content Writer terminates when the Open Calais Annotator has terminated and no documents are left in the input queue. When it has terminated, the whole pipeline terminates too.
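A simplified sketch of how such a main category decision could be derived from the annotation counts; the weight corresponds to the relative share of a category's term annotations among all term annotations of a document, as used in the evaluation in Chapter 8, and the class and method names are illustrative assumptions.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: compute the relative share (weight) of every main category among the
// term annotations of one document; the categories with the highest weights can
// then be written to the "mainCategory" feature.
class MainCategoryCalculator {

    /** annotatedCategories holds the main category of every term annotation of the document. */
    static Map<String, Double> categoryWeights(List<String> annotatedCategories) {
        Map<String, Integer> counts = new HashMap<>();
        for (String category : annotatedCategories) {
            counts.merge(category, 1, Integer::sum);   // count annotations per category
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            weights.put(entry.getKey(), entry.getValue() / (double) annotatedCategories.size());
        }
        return weights;
    }
}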


CHAPTER 7

Semantic Filter

This chapter focuses on the Semantic Filter module, which is the second practical development goal of this thesis. It provides a filter mechanism that extracts resource subsets by their compliance with semantic conditions. Section 7.1 introduces the principle of the applied filters and their implementation. In addition, the matching resources must be further converted into a standardized, useful output format. Section 7.2 describes the converters for the two formats RSS and Sesame triples that have been chosen for this purpose.

7.1 Filter

This section introduces a filter system for CompleXys. Subsection 7.1.1 discusses certain filter approaches that were considered. Subsection 7.1.2 describes the central AbstractFilter class and its internal iterating data structure. Subsection 7.1.3 explains how these filters can be used to implement a framework for propositional logic. Subsection 7.1.4 introduces the basic filters that actually evaluate the semantic data of the documents.

7.1.1 Filter Approaches

The first step in designing a useful filter system for CompleXys was to identify a suitable filter approach. One requirement was that it has to be able to handle the large amount of resources that is likely to arise in the long-term use of an information filtering system on the world wide web. Furthermore, the approach has to be expressive enough to describe composed filter queries, because the dynamic user interests are unlikely to be reducible to a single standard condition. And finally the filter has to provide good runtime performance, because it tends to be directly involved in the front end's output composition, thereby directly influencing the response time for the users of CompleXys.

A possible approach is an abstract filter layer on top of the underlying data store itself, which would try to directly convert user interest queries into data store queries. Advantages of such an approach are the high expressiveness of potentially usable query languages like SQL, the native capability to handle big amounts of resources and good performance, because data stores are usually optimized for rapid access. Main disadvantages are the complexity of matching complex user interest queries to data store queries and, above all, the high coupling of the filter component to the underlying data store implementation and structure.

Another approach is the direct conversion of the resources into a triple store format that could be used with Sesame1 (see Subsection 7.2.2). Filter queries could be made with suitably powerful query languages such as SPARQL2 and the queries themselves would be optimized for good performance too. However, the main disadvantage of this approach is that each document has to be converted to the triple store format in the first place. This leads to unacceptably high performance penalties and inevitable problems with huge amounts of documents. A possibility to overcome these problems would have been to exchange the persistence layer of the Semantic Content Annotator and to replace the GATE solution with a triple store, so no conversion would have been necessary. However, this possibility was discarded because of the high expense of creating a new persistence layer between the GATE document model and Sesame and because another suitable alternative was available.

The third approach, and the one that was finally chosen, is based on the FilterIterator pattern [57]. Iterators are data structures that are designed to iteratively access the elements they contain. Usually each element is returned, but contrary to this common behavior the FilterIterator checks the elements beforehand. In doing so it tests whether the elements meet a certain filter condition and simply skips those that do not. This approach apparently performs worse than the direct data store access, because in order to evaluate whether they meet the filter criteria, it has to fetch all documents indiscriminately in the first place. On the other hand it is supposed to outperform the triple store conversion approach, because it discards unnecessary documents before investing any further processing in them. In addition, the performance of logically composed filter conditions can be further improved by short circuits as discussed in Subsection 7.1.3.

1 http://www.openrdf.org/
2 http://www.w3.org/TR/rdf-sparql-query/


The main advantage of this approach is the flexibility of the system. The filters can be arbitrarily composed into complex, logical expressions and the system can be extended by simply writing new condition methods. The actual implementation of the FilterIterator pattern and the particular filters is described in the succeeding subsections.

7.1.2 Abstract Filter

The central element of the Semantic Filter module is the AbstractFilter, which is based on the FilterIterator pattern [57]. It is an abstract Java class that contains three elements.

The first element is the private data structure FilterIterator. It implements Java's Iterator interface, but differs from the standard implementations in two important points. First, its constructor takes another iterator as a parameter and wraps it. This makes it possible to recursively wrap FilterIterators in one another, thereby creating complex, composed FilterIterators. This possibility will be leveraged by the logic filters that are described in Subsection 7.1.3. The other difference is the toNext() method, which is responsible for shifting the position pointer of the iterator one step forward. However, this particular implementation does not take one step, but as many steps as are required to find an element that meets the filter condition.

The filter condition that is checked in the toNext() method is described by the abstract method passes(). This method has to be implemented by any instance of the AbstractFilter and characterizes the behavior of the particular filter. It takes an element as a parameter, checks whether it meets the filter condition and simply returns the truth value.

The third and last element of the AbstractFilter is the filter() method. This is merely an intermediate between any external caller and the private FilterIterator data structure. It takes an iterator as an input parameter, uses it to instantiate a FilterIterator and returns it to the calling instance.
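The described structure can be condensed into the following sketch; it follows the FilterIterator pattern, but simplifies the element type to a plain Object and omits details of the real implementation such as remove() handling.

import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of the AbstractFilter: filter() wraps an iterator so that elements
// failing the passes() condition are silently skipped.
abstract class AbstractFilter {

    /** The filter condition; implemented by every concrete filter. */
    public abstract boolean passes(Object element);

    /** Wraps the given iterator so that only passing elements are returned. */
    public Iterator<Object> filter(Iterator<Object> source) {
        return new FilterIterator(source);
    }

    private class FilterIterator implements Iterator<Object> {
        private final Iterator<Object> wrapped;
        private Object next;
        private boolean hasNext;

        FilterIterator(Iterator<Object> wrapped) {
            this.wrapped = wrapped;
            toNext();                            // position on the first passing element
        }

        /** Advances as many steps as needed to reach the next passing element. */
        private void toNext() {
            hasNext = false;
            while (wrapped.hasNext()) {
                Object candidate = wrapped.next();
                if (passes(candidate)) {         // the concrete filter decides
                    next = candidate;
                    hasNext = true;
                    return;
                }
            }
        }

        @Override
        public boolean hasNext() {
            return hasNext;
        }

        @Override
        public Object next() {
            if (!hasNext) {
                throw new NoSuchElementException();
            }
            Object current = next;
            toNext();
            return current;
        }
    }
}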

7.1.3 Logic Filters

Single filter methods can take parameters in order to provide a basic level of flexibility, but it is unnecessarily time-consuming to provide a new filter class for every possible condition of a semantic query. A possibility to partially overcome this is the use of logic filters. These are filters that accept an arbitrarily large number of other filters as parameters and link them in a logical way.


Three of those filters have been implemented in the Semantic Filter module: the AndFilter, the OrFilter and the NotFilter. These filters apply the corresponding operators and, or and not to those filters that have been committed to them. Due to the well-known fact that the missing operators implication and equivalence can be simulated by compositions of the three implemented ones, these logic filters form a complete framework for propositional logic.

The AndFilter takes an arbitrarily large number of AbstractFilters and stores them in an iterable array. Its passes() method iterates over the stored filters and checks each subordinate passes() condition for fulfillment. Since all propositions of an and operation have to be true in order for the operation itself to succeed, the operator is cut short and falsified whenever a proposition is found to be false. This behavior can potentially improve the performance of more complex queries. AndFilters are useful for precise searches, which makes them a prominent choice for queries of the "Search" use case that is defined in Subsection 5.2.2. The OrFilter is structured like the AndFilter, but differs insofar as just one proposition has to be true to prove the truth of the whole operator. Analogously, a shortcut to a true termination is taken whenever one proposition is found to be true. OrFilters are useful in personalized queries that try to express a general "give me all resources that might be interesting for the user". The NotFilter does nothing else than negate the boolean result of the committed filter's passes() method. In contrast to the other logic filters it accepts just one AbstractFilter as input.
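Building on the AbstractFilter sketch above, the three logic filters could look roughly as follows; the short-circuit behavior in passes() mirrors the description in the text, while the constructor signatures are simplified assumptions.

// Logic filters compose other filters. The AndFilter short-circuits to false as
// soon as one subordinate filter rejects the element, the OrFilter short-circuits
// to true as soon as one accepts it.
class AndFilter extends AbstractFilter {
    private final AbstractFilter[] filters;

    AndFilter(AbstractFilter... filters) {
        this.filters = filters;
    }

    @Override
    public boolean passes(Object element) {
        for (AbstractFilter filter : filters) {
            if (!filter.passes(element)) {
                return false;    // one false proposition falsifies the whole conjunction
            }
        }
        return true;
    }
}

class OrFilter extends AbstractFilter {
    private final AbstractFilter[] filters;

    OrFilter(AbstractFilter... filters) {
        this.filters = filters;
    }

    @Override
    public boolean passes(Object element) {
        for (AbstractFilter filter : filters) {
            if (filter.passes(element)) {
                return true;     // one true proposition suffices for the disjunction
            }
        }
        return false;
    }
}

class NotFilter extends AbstractFilter {
    private final AbstractFilter inner;

    NotFilter(AbstractFilter inner) {
        this.inner = inner;
    }

    @Override
    public boolean passes(Object element) {
        return !inner.passes(element);   // simply negates the wrapped filter
    }
}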

Figure 7.1 provides an example of the application of propositional logic in filtering queries. The visualized case is equivalent to the verbalized condition "Give me all documents that are not related to ComplexMathematics, but related to GraphTheory and DataMining, as well as to at least one of the topics SystemsGenetics and ComplexComputerScience."

Figure 7.1: An example for the application of propositional logic in filtering queries

7.1.4 Basic Filters

If the logic filters described in Subsection 7.1.3 build up a framework for propositional logic, the basic filters can be seen as its propositional elements. They filter documents according to certain concrete criteria. Up to now two basic filters have been implemented in CompleXys, the GazetteerAnnotationFilter and the KeaAnnotationFilter. Both utilize the document's semantic annotations, which were extracted within the Semantic Content Annotator module, to decide whether a document is sufficiently related to a certain term. Therefore, they take the particular term string and a floating point approval threshold between zero and one as input parameters. In the passes() method they look up the semantic annotations that were stored in the respective annotation set. In case of the GazetteerAnnotationFilter this is the "OntoGazetteerAnnotator" set and in case of the KeaAnnotationFilter the "KeaAnnotator" set. Then the annotation terms are iteratively compared to the criterion term. If both match, they further check whether the classification probability of the annotation is equal to or greater than the approval threshold. If it is, the criterion is matched and a true value is returned. Otherwise the passes() method returns a false value.
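Combined with the logic filters sketched above, a simplified basic filter and the composition of the query from Figure 7.1 could look like this. The document is reduced here to a map from annotated terms to classification probabilities, which abstracts from the actual GATE annotation sets, and TermFilter is a hypothetical stand-in for the GazetteerAnnotationFilter and KeaAnnotationFilter.

import java.util.Map;

// Simplified stand-in for the basic annotation filters: the document is modeled
// as a map of annotated terms to their classification probabilities.
class TermFilter extends AbstractFilter {
    private final String term;
    private final float threshold;

    TermFilter(String term, float threshold) {
        this.term = term;
        this.threshold = threshold;
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean passes(Object element) {
        Map<String, Float> annotations = (Map<String, Float>) element;
        Float probability = annotations.get(term);
        // the criterion is matched if the term is annotated with sufficient probability
        return probability != null && probability >= threshold;
    }
}

class QueryExample {
    public static void main(String[] args) {
        // "not ComplexMathematics, but GraphTheory and DataMining, and at least one
        // of SystemsGenetics and ComplexComputerScience" (cf. Figure 7.1)
        AbstractFilter query = new AndFilter(
                new NotFilter(new TermFilter("ComplexMathematics", 0.5f)),
                new TermFilter("GraphTheory", 0.5f),
                new TermFilter("DataMining", 0.5f),
                new OrFilter(
                        new TermFilter("SystemsGenetics", 0.5f),
                        new TermFilter("ComplexComputerScience", 0.5f)));

        Map<String, Float> document = Map.of(
                "GraphTheory", 0.8f, "DataMining", 0.6f, "SystemsGenetics", 0.7f);
        System.out.println(query.passes(document));   // prints true for this document
    }
}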

7.2 Output Variants

Up to this point the output of the filter process is still a non-standardized Java object, although the documents have to be forwarded to the front end module, which converts them into rendered user output. Two converter classes were developed in order to remedy this flaw, to support the requirements of the front end module and for demonstration purposes. The RSS Converter is described in Subsection 7.2.1 and the Sesame Triplestore Converter in Subsection 7.2.2.


7.2.1 RSS Converter

RSS3 is a content syndication format based on XML. In version 2.0 its name is defined as an abbreviation for Really Simple Syndication. Besides its competitor Atom4, it is the de facto standard for news feeds on the internet, which makes it an interesting and highly reusable output choice.

To convert the documents into the RSS format, CompleXys uses JDOM to dynamically build up the corresponding XML tree. The XML structure of an RSS feed always contains an rss element with a subordinate channel element. This channel element contains any number of item elements, which usually relate to entities like articles of a news site, entries of a blog or posts of a forum. In the case of CompleXys these items represent the filtered documents. The document's metadata is stored in subordinate attribute elements. The document title is written to the title as well as to the description element, the document URL to the link element and the categories, sorted by probability and separated by commas, to the category element. A converted item looks the following way:

<item>

<title>This is the document title</title>

<link>http://www.source.com/original_resource_url.html</link>

<description>This is the document title</description>

<category>cx:example,cx:complexity,cx:thesis</category>

</item>
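With JDOM (version 1.x API), building such an item could look roughly like the following sketch; the field contents are the placeholder values from the example above.

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

// Sketch: build a minimal RSS 2.0 tree with a single item, mirroring the example above.
class RssConverterSketch {
    public static void main(String[] args) {
        Element item = new Element("item");
        item.addContent(new Element("title").setText("This is the document title"));
        item.addContent(new Element("link")
                .setText("http://www.source.com/original_resource_url.html"));
        item.addContent(new Element("description").setText("This is the document title"));
        item.addContent(new Element("category").setText("cx:example,cx:complexity,cx:thesis"));

        Element channel = new Element("channel").addContent(item);
        Element rss = new Element("rss").setAttribute("version", "2.0").addContent(channel);

        String xml = new XMLOutputter(Format.getPrettyFormat()).outputString(new Document(rss));
        System.out.println(xml);
    }
}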

7.2.2 Sesame Triplestore Converter

Sesame5 is a storage and querying middleware for RDF and is thus based on the triple store technique. Additionally, it supports SPARQL queries and is in this combination a powerful candidate to become a common persistence solution for the Semantic Web. It was also considered for the filter itself. This possibility was dropped mainly because of the high implementation expense of a new persistence layer between the GATE document model and Sesame in the Semantic Content Annotator module, as well as the need for an abstraction layer between Semantic Filter queries and SPARQL, which together would have exceeded the scope of this thesis.

3 http://www.rssboard.org/rss-specification
4 http://www.atompub.org/rfc4287.html
5 http://www.openrdf.org/


However, Sesame triple stores remain an interesting candidate for Semantic Web interoperability and further processing, so the Sesame Triplestore Converter was implemented to provide a corresponding output possibility.

Triple stores are named this way because every stored basic fact is composed of a triple of elements: a subject, a predicate and an object. Each of these elements is identified by a URL, so a main task of the mapping was to find unambiguous, readable URLs for each individual document and annotation we want to store. This was solved by identifying them with a URL prefix that clarifies their type and a hash id as suffix that identifies them by hashing a combination of their metadata and content. The relations between objects are stored as fixed URL terms with a common predicate prefix. The concepts of the CompleXys domain model do not need to be converted, because they already possess a taxonomy URL. To convert the document into the XML format, CompleXys again uses JDOM to dynamically build up the corresponding DOM tree. Finally, an exemplary fact, expressing that a specific document is annotated with the term "ChaosTheory", looks the following way:

<http://complexys.de/datamodel/annotationhash-1/a3cca2b2aa1e3b>

<http://complexys.de/datamodel/pred/annotates>

<http://complexys.de/datamodel/dochash-1/5b3b5aad99a8529074>

<http://complexys.de/datamodel/annotationhash-1/a3cca2b2aa1e3b>

<http://complexys.de/datamodel/pred/hasCategory>

<cx:ChaosTheory>
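The URLs of this example could be minted roughly as follows; the prefix strings are taken from the example above, while the concrete hashing scheme shown here is only an illustrative assumption.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive readable, unambiguous URLs from a type prefix plus a content hash.
class UrlMinter {
    private static final String DOC_PREFIX = "http://complexys.de/datamodel/dochash-1/";
    private static final String ANNOTATION_PREFIX = "http://complexys.de/datamodel/annotationhash-1/";

    static String documentUrl(String metadataAndContent) throws NoSuchAlgorithmException {
        return DOC_PREFIX + hash(metadataAndContent);
    }

    static String annotationUrl(String annotationData) throws NoSuchAlgorithmException {
        return ANNOTATION_PREFIX + hash(annotationData);
    }

    /** Hex-encoded SHA-1 hash that serves as the id suffix. */
    private static String hash(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}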


CHAPTER 8

Evaluation

This thesis proposes a solution for the two CompleXys modules Semantic Content Annotator and Semantic Filter. It accomplishes the semantic enrichment of complexity-related content, text classification and the filtering of documents depending on their semantic data. This chapter evaluates this solution to decide whether the requirements that were defined in Section 2.2 are met and how the system generally performs. Due to the limited time frame of this thesis, only the most critical evaluations have been carried out. Section 8.1 discusses the achieved quality of the important text classification task in terms of the common metrics precision and recall. This is only done for the Semantic Content Annotator module, because the Semantic Filter performs no additional classification, but merely utilizes the already existing ones. Section 8.2 analyses the response time behavior of the Semantic Filter solution, because this module can directly influence the response time for the end user and is therefore time-critical. In contrast, the response time of the Semantic Content Annotator is not as critical, because it usually runs concurrently in the system background and does not directly influence the end user response time. Finally, Section 8.3 summarizes the evaluation results and concludes with strengths and weaknesses of the system.

8.1 Classification Quality

This section is dedicated to the evaluation of the achieved text classification quality

of CompleXys. The quality is measured in the popular classification quality metrics

precision and recall (see also Subsection 2.2.3). Subsection 8.1.1 introduces the utilized

set of test documents and explains how it was created. Subsection 8.1.2 explains the

applied test strategy and finally Subsection 8.1.3 discusses the test results.


8.1.1 Document Set

Text classification systems are usually evaluated by performing a classification on huge corpora like the popular Reuters-215781 collection or the succeeding RCV12. Then the classification results of the system are compared to the already existing human-made classifications, which are taken as the ideal standard. However, the problem of a domain-specific system like CompleXys is that it must be tested in its own domain to gain a useful measure of its quality. In fact, it cannot even classify texts into most of the categories that are used in the general data sets, so these cannot reasonably be used. Comparable domain-specific text classification systems also tend to have a huge domain corpus with existing classifications available [54], but apparently there is none for the domain of complexity. The only suitable alternative is to create a new complexity data set, but doing this manually was hardly feasible with regard to the scope of this thesis. So a test document set was finally created automatically.

The documents of the set need to fulfill five conditions. They have to be written in English, be scientific and be nested in markup, because CompleXys is supposed to collect its resources mostly from English scientific blogs and news sites. They have to be clearly dividable into those that are related to complexity and those that are not, because it needs to be automatically decidable whether a certain decision of CompleXys to keep and store a document is correct. Finally, they have to be numerous enough to be statistically relevant, because small samples can too easily be unrepresentative.

In order to comply with these conditions, the final document set was derived from resources that are tagged on CiteULike3. Ten documents were chosen for each of the ten main classifications in the CompleXys taxonomy. These hundred documents were complemented by another two hundred that were not tagged with any term of the CompleXys taxonomy, so that wrong choices become possible too. All of these documents are excerpts from English scientific websites, so the first three conditions are naturally met. Furthermore, by referring to their tags, they are dividable into a complexity-related part and one that is not. This is not an exact classification, but an appropriate rule of thumb that approximates a precise classification to an acceptable degree. Finally, the test set contains three hundred documents, which should be enough to avoid unrepresentative behavior.

1 http://www.daviddlewis.com/resources/testcollections/reuters21578/
2 http://trec.nist.gov/data/reuters/reuters.html
3 http://www.citeulike.org/


8.1.2 Test Strategy

The evaluation of the Semantic Content Annotator is basically a measurement of text classification quality. The test data set that was introduced in the preceding subsection is processed by the Semantic Content Annotator pipeline. This execution is done for a series of comparable configurations. In the first test the Onto Gazetteer Annotator is the only component that performs the classification. In the following ten test configurations the Kea Annotator processes the documents. The latter differ by the occurrence threshold that has to be exceeded by a term before it counts as relevant for the text. This variable is supposed to significantly influence the performance of the classification, because complexity terms like "complexity" or "chaos" also frequently occur in texts that are not pertinent to complexity as such. But those irrelevant words are likely to occur significantly less often than relevant words, so a well-chosen threshold can help to sort the wheat from the chaff.

The obtained binary classification data is used to calculate the standard metrics precision and recall, which can be compared between the configuration cases, but also to the performance of other text classification systems. In addition to the relevance decision, the test stores the URL, the main category and the weight of the main category. The main category weight is the relative share of terms from a certain main category within the set of all term annotations of a resource. It is used to compare main categories in order to identify the most important ones for the text. This data is used to take random samples to empirically evaluate the quality of the categorization into main categories and to analyze the correlation of correctness and weight.
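For reference, precision and recall are computed in the usual way from the numbers of true positives (TP), false positives (FP) and false negatives (FN) of the binary relevance decision:

precision = TP / (TP + FP)        recall = TP / (TP + FN)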

8.1.3 Test Results

The results of the classification quality tests have to be considered in the trade-off between recall and precision. Accordingly, Figure 8.1 presents the measured values distributed across these two dimensions. The particular points represent the several test runs. The label gaz refers to the Onto Gazetteer Annotator test and the kea labels to the Kea Annotator tests, with the number standing for the different minimum term occurrence values. It can be seen that the gaz test achieves a very high recall value, but in doing so clearly fails to fulfill the precision target value of 0.7. The kea2 test performs even worse, but the higher the occurrence threshold is adjusted, the better the precision values become. This tendency continues up to kea80, which misses the target value by just 0.06, while still complying with the recall requirement. However, kea90 significantly declines in both recall and precision, so it can be assumed that kea80 forms a local optimum and cannot simply be further improved by increasing the minimum occurrence threshold.

Figure 8.1: The distribution of the quality test runs across the dimensions precision and recall

The main category values that were annotated to the resources are evaluated by taking random classification samples. These were manually compared to the actual content to get an empirical clue of how well this classification task performs. The samples generally achieve a success rate of approximately 50%. Considering the negative effect that a one-out-of-two error rate has on the user experience, this is obviously not a very good result. However, this flaw seems to be partially a consequence of previous false complexity classifications. This impression is underlined by the fact that the success rate of those resources of the random samples that were correctly classified as complexity is approximately three out of four, which is significantly higher. Furthermore, the main category classifications are often understandable and vaguely right, but rated as false, because one or two other categories would definitely fit better. Due to the interdisciplinary nature of complexity, this is a frequently occurring case. So this key characteristic is likely to be utilizable in order to improve the quality of main category classifications. Further improvement proposals are made in the future work considerations in Chapter 9.

Additionally, the analysis of the main category classifications and their weights uncovers another interesting relation. Table 8.1 shows that the resources that were classified to their main category with a total weight of 1.0 have a significantly lower complexity classification precision than the others.

weight range    number of documents    average precision
0.0 - 0.4               11                   0.72
0.4 - 0.5               15                   0.66
0.5 - 0.6               17                   0.65
0.6 - 1.0               23                   0.78
1.0                     21                   0.38

Table 8.1: Average precision and number of documents within the top test kea80, clustered by main category weight ranges with at least 10 documents

The rapidly falling trend line in Figure 8.2 visualizes

this effect. Based on this knowledge, the result of the top performing test kea80 was re-

evaluated by simply discarding all resources with the 1.0 value. This virtual test run is referred to as kea80+ and it is listed among the other tests in Figure 8.1. On the one hand, the Kea Annotator finally exceeds the precision requirement value in this configuration, but on the other hand it also slightly misses the target recall by 0.02. However, it is still the best result that was achieved throughout the tests and if it can be improved a little further, it will fully meet the requirements. Suggestions for how these improvements might look are discussed in Chapter 9.

Figure 8.2: The correlation of main category weight and precision within the top test kea80


8.2 Response Time

This section is dedicated to the performance evaluation of the Semantic Filter. The

performance is measured in response time (see also Subsection 2.2.3). Subsection 8.2.1

explains the applied test strategy and Subsection 8.2.2 discusses the test results.

8.2.1 Test Strategy

The Semantic Filter is evaluated by its response time behavior, because it is time-critical insofar as it can directly influence the time a user has to wait for the system response. To evaluate this response time, a test should be able to simulate various influencing variables. Thus the tests vary in four basic dimensions. The number of documents that have to be filtered scales in the steps 10, 100 and 1000. The number of considered terms scales from 1 to 251 in steps of fifty. The usage of logical filters is varied by using either an AndFilter or an OrFilter, each with all BasicFilters inverted by NotFilters, without any inversion, or with both randomly mixed. Finally, a complex nested filter is simulated by randomly chunked BasicFilters that are nested in randomly chosen logical filters, which can in turn be nested within other filters, and so on. This mix is supposed to simulate complexity in the structure of filter systems and to measure its effect on the performance. All filter combinations are visualized in Table 8.2; a sketch of how such a nested test filter could be assembled follows the table. The tested documents are only required to possess a certain number of random semantic annotations, so they can be created instantly and automatically.

Test             and   or   not   random not
and plain         x
and not           x          x
and random not    x                   x
or plain                x
or not                  x     x
or random not           x             x
mixed             x     x             x

Table 8.2: The characteristics of the performed test series
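As announced above, the following sketch shows how such a nested test filter could be assembled from BasicFilters, NotFilters and logical filters. The Filter and Document interfaces, the class names and the constructors are assumptions made for this example; the actual Semantic Filter classes may differ in their signatures.

import java.util.Arrays;
import java.util.List;

/** Illustrative composition of a nested test filter. All types shown here are
 *  assumed for this sketch, not taken from the implemented module. */
public class NestedFilterSketch {

    /** Assumed common interface: every filter decides whether a document passes. */
    interface Filter { boolean accepts(Document doc); }

    /** Assumed document abstraction carrying its semantic annotations. */
    interface Document { boolean hasTerm(String term); }

    /** Checks a single taxonomy term against the document's annotations. */
    static class BasicFilter implements Filter {
        private final String term;
        BasicFilter(String term) { this.term = term; }
        public boolean accepts(Document doc) { return doc.hasTerm(term); }
    }

    /** Inverts the decision of the wrapped filter. */
    static class NotFilter implements Filter {
        private final Filter inner;
        NotFilter(Filter inner) { this.inner = inner; }
        public boolean accepts(Document doc) { return !inner.accepts(doc); }
    }

    /** All nested filters must accept the document. */
    static class AndFilter implements Filter {
        private final List<Filter> parts;
        AndFilter(Filter... parts) { this.parts = Arrays.asList(parts); }
        public boolean accepts(Document doc) {
            for (Filter f : parts) {
                if (!f.accepts(doc)) return false;   // short-circuit on first failure
            }
            return true;
        }
    }

    /** Example of a small nested structure: term1 AND NOT term2. */
    static Filter buildNestedExample() {
        return new AndFilter(
                new BasicFilter("self-organization"),
                new NotFilter(new BasicFilter("emergence")));
    }
}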

The tests were performed on a MacBook with a two gigahertz Intel Core 2 Duo processor, two gigabytes of DDR2 SDRAM, the operating system Mac OS X 10.4.11 and Java 5. This is not a representative server system, but it should be sufficient to reveal the basic runtime behavior and potential scalability problems.

8.2.2 Test Results

The performance evaluation depends on many variables, so the results are displayed from two perspectives. The first one averages over the respective term numbers in order to examine the relation between the number of documents and the response time. Its results are visualized in Figure 8.3.

Figure 8.3: Average response times over several test series and numbers of handled documents

The results reveal that the OrFilter test runs are not significantly influenced by the number of documents. The response time of the AndFilter test runs, on the other hand, increases steadily over the three document scales. Furthermore, those AndFilters that contain additional NotFilters are slower than those that do not, and the response time of the mixed filters also increases constantly. This behavior can be explained by the combination of two facts. Firstly, it was more likely for a proposition to be false than true, with approximately 120 matches occurring in 1000 documents. Secondly, NotFilters form an additional filter layer that always costs extra time. The first fact leads to a better performance of and plain and the inverted OrFilters, because they can frequently short-circuit their decisions. Opposed to that, or plain and the inverted AndFilters have to check more BasicFilters before they can return their results. However,


the increased number of iterations alone apparently does not cause any noteworthy problems. These only emerge when the iterations are multiplied by the additional execution time of a NotFilter.
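The short-circuit effect described above can be illustrated by adding an OrFilter to the assumed filter sketch from Subsection 8.2.1 (it reuses the Filter and Document interfaces declared there and is likewise illustrative, not the implemented class). With mostly-false propositions, wrapping the BasicFilters in NotFilters turns them mostly true, so the inverted OrFilter can terminate after its first nested check.

/** Illustrative OrFilter, written as another nested class of NestedFilterSketch:
 *  it returns true as soon as any nested filter accepts, so the inverted variant
 *  benefits from propositions that are mostly false (~120 matches in 1000 documents). */
static class OrFilter implements Filter {
    private final List<Filter> parts;
    OrFilter(Filter... parts) { this.parts = Arrays.asList(parts); }
    public boolean accepts(Document doc) {
        for (Filter f : parts) {
            if (f.accepts(doc)) return true;   // short-circuit on first success
        }
        return false;
    }
}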

Speed requirements alone can often be satisfied by simply using better server hardware, but this approach quickly stops being feasible if the response times grow exponentially. The software attribute that captures this behavior is scalability. To observe it for the number of documents, the response times are normalized to a relative response time per document. The results of this procedure are presented in Figure 8.4. They reveal that none of the test series grows faster than the linearly increasing number of documents, so it can be concluded that the system is scalable within this dimension. More than that, the relative response time decreases, which is likely caused by a relatively high initial loading time for the code, followed by an efficient processing iteration.

Figure 8.4: Normalized average response times over several test series and numbers of handled documents

The second perspective of the evaluation is the number of used terms. In order to evaluate this dimension, the average values of the test runs with different numbers of documents are calculated and presented in Figure 8.5. It can be seen that the OrFilters are again uniformly fast in every test. The and plain test run rises to a saturation level and stagnates. Only those filters that include both AndFilter and NotFilter depend steadily on the number of terms. But, as the normalized presentation in Figure 8.6 shows, this growth is not exponential and hence not critical. The behavior of the mixed test is too random to provide useful clues in this dimension.

Figure 8.5: Average response times over several test series and numbers of terms

Figure 8.6: Normalized average response times over several test series and numbers of terms


8.3 Discussion

The evaluation of the complexity relevance decision reveals that the unadjusted KEA Annotator as well as the Onto Gazetteer Annotator fail to achieve the precision requirements that were stated in Subsection 2.2.3. However, further configuration of the minimal term occurrence variable and the additional discarding of a special class of documents, whose main category classification was performed with a weight of 1.0, leads to a test run that fulfills the precision requirement and only slightly misses the target recall value. So after adjustment the module already performs this task nearly satisfactorily. Approaches for a further improvement of its performance are suggested in Chapter 9.

According to the random samples, the main category classification performs with an approximate error rate of fifty percent. This is an unusable state, so it is necessary to investigate the causes of this flaw and to further improve the solution of this task.

The response time of the Semantic Filter module never grows exponentially, so it can be considered scalable. The tests revealed a clear performance difference between the runtimes of OrFilters and those of AndFilters with and without nested NotFilters; AndFilters combined with NotFilters are apparently a slow combination. Several theories for the causes of certain performance patterns were constructed, but have not yet been verified. Generally, the performance requirements of the Semantic Filter should be achievable if some further code optimization is performed and the system runs on more powerful server hardware.


CHAPTER 9

Summary and Future Work

This thesis investigated the applicability of semantic metadata for the task of utilizing social media resources in topic-specific and context-aware systems. It is embedded into the CompleXys project, which develops an adaptive information portal for the field of complexity. More precisely, this thesis implemented the two modules Semantic Content Annotator and Semantic Filter. The former uses GATE, KEA and OpenCalais to extract semantic data from incoming documents. Thereupon it decides whether the resource is considered relevant for the topic of complexity and into which domain category it should be classified. A newly created complexity taxonomy was used as a controlled vocabulary for this process. The Semantic Filter applies the combined concept of filter iterators and propositional logic to provide a flexible access interface to the semantically indexed documents.

The evaluation of this work is split into a quality evaluation of the classification process in the Semantic Content Annotator and a performance evaluation of the time-critical Semantic Filter. The quality requirements that were stated in Subsection 2.2.3 demand a precision value of at least 0.7 and a recall value of at least 0.5 for the complexity classification. While several tested configurations were able to meet one of these two goals, none was able to meet both at the same time. However, the kea80+ test run exceeds the precision threshold and misses the required recall by just 0.02. So it can be regarded as a good starting point for further quality improvement. This top configuration is based on a surprisingly high minimal term occurrence value of eighty. It can be assumed that the success of this value was caused by the large average text size of the scientific documents in the test set. But not all documents of the set are as big as this average, and a broader range of sources is likely to cause an even bigger variance of text sizes. Unfortunately, high occurrence values are nearly a guarantee that smaller-than-average documents are discarded without distinction. A possibility to handle this is to implement the minimal term occurrence not as a constant, but as a value relative to the size of the document.
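A minimal sketch of this idea follows; the class, the helper method and the concrete ratio are assumptions made for illustration and are not part of the implemented module or its evaluated configuration.

/** Illustrative sketch only: derives the minimum term occurrence threshold from
 *  the document length instead of using a fixed constant such as 80. */
public final class RelativeOccurrenceThreshold {

    /** Assumed example value: one required occurrence per 500 tokens of text. */
    private static final double OCCURRENCES_PER_TOKEN = 1.0 / 500.0;
    private static final int MINIMUM = 2;

    public static int thresholdFor(int documentTokenCount) {
        int relative = (int) Math.round(documentTokenCount * OCCURRENCES_PER_TOKEN);
        // never fall below a small absolute floor, so tiny documents still need some evidence
        return Math.max(MINIMUM, relative);
    }

    public static boolean passes(int termOccurrences, int documentTokenCount) {
        return termOccurrences >= thresholdFor(documentTokenCount);
    }
}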


Another effect that was harnessed by the kea80+ test run was that a main classification with a one-sided weight of 1.0 towards a single category is likely to be a false success and can be rejected to increase the precision. However, it has to be clear that doing so probably also decreases the recall value, because it skips all documents that contain words from just one main category. Additionally, it is possible that this approach fails for small documents, because short texts are far more likely to contain terms of only one category.

A further empirical evaluation of the main category classifications revealed an error rate of approximately fifty percent. It must therefore be regarded as a current weakness of the system and should be improved. A first approach to do so is to increase the precision of the complexity classification, because documents that are not relevant for complexity can hardly be classified correctly into a complexity category. Furthermore, many classifications are not strictly wrong, but merely choose a category that would not be the first choice of a human classifier. The interdisciplinary nature of complexity even boosts this effect, because most of the texts could plausibly be classified into more than one main category. Therefore, it is worth considering whether a multi-value classification would be a better choice than the current one.

General improvements to the quality of the Semantic Content Annotator can also be made by utilizing the document structure, for example by increasing the term candidate weight of words carrying emphasizing markup such as bold text or headline elements (a small sketch of such a boost follows at the end of this paragraph). Yet another possibility is the use of additional data sources. For example, the links in the text could be loaded and analyzed as well, the already annotated OpenCalais data can be applied, and the title can be searched on sites like Google, CiteULike, Technorati or Delicious to extract additional context and collaborative classification suggestions. User tags in CompleXys itself can also help to improve the classification. They can not only refine the classification quality subsequently, but also provide feedback on certain classification decisions, which can be used for steady training of the classifier.
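The markup-based weighting idea mentioned above could look roughly like the following sketch, which multiplies a term candidate's base weight by a factor depending on the enclosing HTML element. The class name, the method and the boost factors are assumptions for this example and are not part of the current CompleXys code.

/** Illustrative sketch: boost the weight of term candidates that appear inside
 *  emphasizing HTML markup. The boost factors are assumed example values. */
public final class MarkupWeightBooster {

    public static double boostedWeight(double baseWeight, String enclosingTag) {
        switch (enclosingTag.toLowerCase()) {
            case "h1":
            case "h2":
            case "h3":
                return baseWeight * 2.0;   // headline elements count double
            case "b":
            case "strong":
            case "em":
                return baseWeight * 1.5;   // bold or emphasized text gets a smaller boost
            default:
                return baseWeight;         // plain body text keeps its base weight
        }
    }
}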

The performance evaluation of the Semantic Filter revealed a sufficient scalability of the filter systems. Minor flaws are likely to be compensated by powerful hardware and additionally reduced by further code optimization. Performance improvements can generally be made by decoupling the text annotations from the filter process. Up to now the BasicFilters iterate over all semantic annotations of a certain text to find fitting terms. This step can be sped up by separately storing the occurring terms and their occurrence counts, which would limit the maximum number of accesses to the number of terms in the taxonomy. If this data is additionally sorted, the filters can also apply advanced search techniques to accelerate the processing. Apart from the performance, the usefulness for information filtering purposes can be improved by implementing fuzzy filters that do not pick documents according to discrete criteria, but instead return the top matching candidates sorted by their additive interest probability. A sketch of both ideas follows below.
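The following sketch illustrates both suggestions under stated assumptions: a per-document term index backed by a sorted map, so that a BasicFilter check becomes a single lookup instead of a scan over all annotations, and a fuzzy ranking step that returns the top candidates by an additive score. All names and the scoring scheme are illustrative and do not correspond to the actual module.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: a precomputed, sorted term index per document and a
 *  simple fuzzy ranking over it. */
public class TermIndexSketch {

    /** term -> number of occurrences, kept sorted for fast ordered lookups */
    private final Map<String, Integer> termCounts = new TreeMap<>();

    public void addOccurrence(String term) {
        termCounts.merge(term, 1, Integer::sum);
    }

    /** A filter check becomes a single map lookup instead of a full annotation scan. */
    public int occurrencesOf(String term) {
        return termCounts.getOrDefault(term, 0);
    }

    /** Fuzzy variant: score each document additively over the query terms and
     *  return the top-k candidates instead of applying discrete criteria. */
    public static List<TermIndexSketch> topMatches(List<TermIndexSketch> documents,
                                                   List<String> queryTerms, int k) {
        List<TermIndexSketch> ranked = new ArrayList<>(documents);
        ranked.sort(Comparator.comparingInt((TermIndexSketch d) -> {
            int score = 0;
            for (String term : queryTerms) {
                score += d.occurrencesOf(term);
            }
            return score;
        }).reversed());
        return ranked.subList(0, Math.min(k, ranked.size()));
    }
}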


References

[1] C. Anderson. The Long Tail. Random House Business, 2006.

[2] A. Baruzzo, A. Dattolo, N. Pudota, and C. Tasso. A general framework for personalized text classification and annotation. In Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, 2009.

[3] N.J. Belkin and W.B. Croft. Information filtering and information retrieval: two

sides of the same coin? Commun. ACM, 1992.

[4] T. Berners-Lee. Information Management: A Proposal, 1989. URL http://www.

w3.org/History/1989/proposal.html.

[5] T. Berners-Lee. Linked Data - Design Issue, 2006. URL http://www.w3.org/

DesignIssues/LinkedData.html.

[6] K. Bittner. Use Case Modeling. Addison-Wesley Longman Publishing Co., Inc.,

2002.

[7] S. Bloehdorn, P. Cimiano, and A. Hotho. Learning ontologies to improve text clustering and classification. In From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the German Classification Society (GfKl'05), 2005.

[8] J. Bogg and R. Geyer. Complexity, science and society. Radcliffe Medical Press, Oxford, 2008.

[9] U. Bojars and J.G. Breslin et al. SIOC Core Ontology Specification, 2009. URL http:

//rdfs.org/sioc/spec/. Revision 1.33.

[10] G. Booch, I. Jacobson, and J. Rumbaugh. The Unified Modeling Language User Guide.

Addison-Wesley, 1999.

[11] D. Brickley and L. Miller. FOAF Vocabulary Specification 0.96, 2009. URL http:

//xmlns.com/foaf/spec/.


[12] Open Calais Documentation. Calais, 2009. URL http://www.opencalais.com/

documentation/opencalais-documentation.

[13] P. Casoto, A. Dattolo, F. Ferrara, N. Pudota, P. Omero, and C. Tasso. Generating and sharing personal information spaces. In Proc. of the Workshop on Adaptation for the Social Web, 5th ACM Int. Conf. on Adaptive Hypermedia and Adaptive Web-Based Systems, 2008.

[14] J. Chen, D. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query

system for internet databases. In Proc. of SIGMOD, 2000.

[15] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON), 2006. URL http://tools.ietf.org/html/rfc4627.

[16] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework

and graphical development environment for robust NLP tools and applications.

In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.

[17] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. The GATE User Guide,

2002. URL http://gate.ac.uk/.

[18] P.J. Denning. Electronic junk. Commun. ACM, 1982.

[19] S.J. Green. Building hypertext links by computing semantic similarity. In IEEE Transactions on Knowledge and Data Engineering, 11, 1999.

[20] R. Grishman. TIPSTER Architecture Design Document Version 2.3., 1997. URL http:

//www.itl.nist.gov/div894/894.02/relatedprojects/-tipster/.

[21] T.R. Gruber. A translation approach to portable ontology specifications. Knowl. Acquis., 1993.

[22] S. Handschuh and S. Staab. Cream - creating metadata for the semantic web. Computer Networks, 2003.

[23] D. Heckmann, E. Schwarzkopf, J. Mori, D. Dengler, and A. Kröner. The user model

and context ontology gumo revisited for future web 2.0 extensions. In C&O:RR,

2007.

[24] J. Heinz. Implementation of an approximate information filtering approach

(maps). Bachelor thesis, Universität des Saarlandes, 2008.


[25] P. Herron. Automatic text classification of consumer health web sites using wordnet. Technical report, The University of North Carolina at Chapel Hill, 2005.

[26] A. Heß, P. Dopichaj, and C. Maaß. Multi-value classification of very short texts.

In KI ’08: Proceedings of the 31st annual German conference on Advances in ArtificialIntelligence, 2008.

[27] F. Heylighen. Encyclopedia of Library and Information Sciences, chapter Complexity

and Self-organization. Marcel Dekker, 2008.

[28] J. Howe. Crowdsourcing: A definition. Crowdsourcing: Tracking the Rise of the Amateur (weblog, 2 June), 2006. URL http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html (accessed on Jan 10, 2010).

[29] F. Iacobelli, K. Hammond, and L. Birnbaum. Makemypage: Social media meets

automatic content generation. In Proc. of ICWSM 2009, 2009.

[30] IEEE. IEEE Recommended Practice for Software Requirements Specifications, 1998. URL

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=720574.

[31] A. Isaac and E. Summers. SKOS Simple Knowledge Organization System Primer, 2009.

URL http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/.

[32] J. Ahn, P. Brusilovsky, D. He, J. Grady, and Q. Li. Personalized web exploration with task models. In WWW 2008 / Refereed Track: Browsers and User Interfaces, 2008.

[33] J. Kahan and M.R. Koivunen. Annotea: An Open RDF Infrastructure for Shared

Web Annotations. In Proc. of the International World Wide Web Conference (WWW), 2001.

[34] J. Kim, D. Oard, and K. Romanik. Using implicit feedback for user modeling in

internet and intranet searching. Technical report, University of Maryland CLIS,

2000.

[35] A.B. King. Website optimization. O’Reilly, 2008.

[36] C. Lanquillon. Enhancing Text Classification to Improve Information Filtering. PhD

thesis, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg, 2001.

[37] E. D. Liddy. Encyclopedia of Library and Information Science, chapter Natural Language Processing. Marcel Dekker, 2003.


[38] H.W. Lie and B. Bos. Cascading Style Sheets, level 1, 1996. URL http://www.w3.org/TR/REC-CSS1/.

[39] J. Liu, L. Birnbaum, and B. Pardo. Categorizing blogger’s interests based on short

snippets of blog posts. In CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, 2008.

[40] K. Mahesh. Text retrieval quality: A primer, 1999. URL http://www.oracle.com/technology/products/text/htdocs/imt_quality.htm.

[41] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, and G. Weikum.

Word sense disambiguation for exploiting hierarchical thesauri in text classification. In Knowledge discovery in databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005.

[42] D. Maynard, M. Yankova, N. Aswani, and H. Cunningham. Automatic creation

and monitoring of semantic metadata in a dynamic knowledge portal. In AIMSA,

2004.

[43] O. Medelyan. Automatic Keyphrase Indexing with a Domain-Specific Thesaurus.

Master’s thesis, Philologische, Philosophische Fakultät | Wirtschafts- und Verhal-

tenswissenschaftliche Fakultät, Albert-Ludwigs-Universität Freiburg i. Br., 2005.

[44] O. Medelyan. Human-competitive automatic topic indexing. PhD thesis, Department

of Computer Science, University of Waikato, Hamilton, New Zealand, 2009.

[45] R.A. Meyers, editor. Encyclopedia of Complexity and Systems Science. Springer, 2009.

[46] R. Mitkov. The Oxford handbook of computational linguistics. Oxford University Press,

2003.

[47] N. Nanas, A. Roeck, and M. Vavalis. What happened to content-based information

filtering? In ICTIR '09: Proceedings of the 2nd International Conference on Theory of Information Retrieval, 2009.

[48] D. W. Oard and G. Marchionini. A Conceptual Framework for Text Filtering, 1996.

[49] D.W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction 7(3). Kluwer Academic Publishers, 1997.

[50] B. Parsia, A. Kalyanpur, and J. Golbeck. Smore - semantic markup, ontology, and

rdf editor, 2005. URL http://www.mindswap.org/papers/SMORE.pdf.


[51] I. Peacock. Showing robots the door, (w)hat is (r)obots (e)xclusion (p)rotocol? Ariadne, 1998.

[52] T. Pellegrini and A. Blumauer. Semantic Web: Wege zur vernetzten Wissensgesellschaft. X.media.press, 2006.

[53] C. Da Costa Pereira and A. Tettamanzi. An evolutionary approach to ontology-

based user model acquisition. In WILF, volume 2955 of Lecture Notes in Computer Science, 2003.

[54] A. Montejo Raez, L.A. Urena-Lopez, and R. Steinberger. Automatic Text Categorization of Documents in the High Energy Physics Domain. PhD thesis, Granada Univ., 2006.

[55] L. Razmerita, S. Antipolis, G. Gouardères, E. Conté, and M. Saber. Ontology based

user modeling for personalization of grid learning services. In ELeGI Conference,

2005.

[56] D. Rosen and C. Nelson. Web 2.0: A new generation of learners and education.

Computers in the Schools, 2008.

[57] S. Schmidt. PHP Design Patterns. O’Reilly, 2006.

[58] A.V. Smirnov and A.A. Krizhanovsky. Information filtering based on wiki index

database. CoRR, 2007.

[59] T. Berners-Lee. Semantic Web Road map, 1998. URL http://www.w3.org/

DesignIssues/Semantic.html.

[60] T. Berners-Lee and D. Connolly. Hypertext Markup Language (HTML) - A Representation of Textual Information and MetaInformation for Retrieval and Interchange, 1993.

URL http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt.

[61] K. Thomas. Opencalais whitepaper, 2009. URL http://www.slideshare.net/

KristaThomas/simple-opencalais-whitepaper.

[62] A.M. Turing. Computing machinery and intelligence. Mind, 1950.

[63] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. Kea:

Practical automatic keyphrase extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, 1999.

[64] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.


[65] B. Yang and G. Jeh. Retroactive answering of search queries. In WWW '06: Proceedings of the 15th international conference on World Wide Web, 2006.

[66] C. Zimmer. Approximate Information Filtering in Structured Peer-to-Peer Networks.

PhD thesis, Universität des Saarlandes, 2008.


List of Figures

1.1 The Semantic Web layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 The relationships between actors and use cases . . . . . . . . . . . . . . . 12

2.2 The CompleXys overview schema . . . . . . . . . . . . . . . . . . . . . . 20

3.1 The basic microformats schema . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 The SIOC main classes in relation . . . . . . . . . . . . . . . . . . . . . . . 32

4.2 The APIs, which form the GATE architecture . . . . . . . . . . . . . . . . 33

4.3 A data model diagram for GATE’s corpus layer . . . . . . . . . . . . . . . 34

4.4 An exemplary SKOS taxonomy . . . . . . . . . . . . . . . . . . . . . . . . 36

4.5 The KEA algorithm diagram together with KEA++ . . . . . . . . . . . . . 37

4.6 Input and output data of the OpenCalais web service . . . . . . . . . . . 41

4.7 An example application of linked data . . . . . . . . . . . . . . . . . . . . 43

6.1 An excerpt of the CompleXys taxonomy . . . . . . . . . . . . . . . . . . . 54

6.2 The CompleXysTask principle . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.3 The Semantic Content Annotator Pipeline . . . . . . . . . . . . . . . . . . 56

7.1 An example for the application of propositional logic in filtering queries 65

8.1 The distribution of the quality test runs across the dimensions precision

and recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

8.2 The correlation of main category weight and precision within the top test

kea80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

8.3 Average response times over several test series and numbers of handled

documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8.4 Normalized average response times over several test series and numbers

of handled documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

8.5 Average response times over several test series and numbers of terms . . 77


8.6 Normalized average response times over several test series and numbers

of terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


List of Tables

5.1 Examples of information seeking processes [48] . . . . . . . . . . . . . . . 46

8.1 Average precision and number of documents within the top test kea80,

clustered by main category weight ranges with at least 10 documents . . 73

8.2 The characteristics of the performed test series . . . . . . . . . . . . . . . 74


Author’s Statement

I hereby certify that I have prepared this diploma thesis independently, and that only

those sources, aids and advisors that are duly noted herein have been used and / or

consulted.

January 26, 2010

Oliver Schimratzki