UNIVERSITY OF MANCHESTER
CoKo - A Semantic Web Application for the Semantic Web Challenge
A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences
2011
Priyam Maheshwari
School of Computer Science
Word Count: 19,651

Table of Contents
Abstract
Declaration
Copyright
Acknowledgement
Chapter 1 Introduction
1.1 What is CoKo?
1.2 Motivation
1.3 Attempt at Semantic Web Challenge
1.4 Project Objectives
Chapter 2 Background and Initial Research
2.1 Semantic Web Challenge
2.1.1 Former entries
2.2 Linked data user-interfaces
2.2.1 Faceted browsers
2.2.2 Query builders
2.3 Crowdsourcing
2.3.1 Architecture for Collective Knowledge Bases
2.4 Provenance and Trust
2.4.1 Trust assessment
2.4.2 Types of Provenance
2.4.3 Provenance Representation
2.4.4 Provenance metadata
2.5 Visualization
Chapter 3 System Architecture
3.1 QA System Architecture
3.1.1 WolframAlpha Architecture
3.1.2 TrueKnowledge Architecture
3.1.3 CoKo Architecture
3.2 System Requirements
Chapter 4 Implementation
4.1 Technical Overview
4.2 System Architecture
4.2.1 Presentation Tier
4.2.2 Application Tier
4.2.3 Data Tier
Chapter 5 Design Decisions
5.1 Data Set Description Language (DSDL)
5.2 Types of SPARQL queries
5.3 Property Mapping
Chapter 6 Evaluation
6.1 Case Studies
Case Study 1: Contained Dataset
Case Study 2: Distributed Datasets
6.2 Overall Evaluation
6.3 Contender for Semantic Web Challenge
Chapter 7 Conclusion and Future Work
7.1 Reflection
7.2 Problems which still need to be solved
7.3 Suggestions for the future
7.3.1 Critical
7.3.2 Other extensions
References
Appendices
Appendix A – Technologies Used
Appendix B – Data Set Description Language (DSDL) Schema
Appendix C – Case Studies
C.1 SPARQL query using GRAPH clause
C.2 Full version of DSDL for case study 1
C.3 DSDL for additional query upload
C.4 DSDL describing CIA Factbook dataset
C.5 DSDL describing property mappings
C.6 DSDL describing DBpedia dataset and a query against multiple endpoints
Appendix D – Google Visualization Data Format
Table of Tables
Table 1 Overview of features of former entries of SWC as compared to CoKo
Table 2 Comparison of Semantic Web Browsers
Table 3 High level system requirements
Table 4 Summary of CoKo source code statistics
Table 5 Data Set Description Language elements
Table 6 Progress towards fulfilling minimal requirements of the SWC
Table 7 Progress towards fulfilling additional requirements of the SWC
Table 8 Progress of CoKo towards meeting its objectives
Table of Figures
Figure 1 Section of Wolfram|Alpha result interface
Figure 2 Section of TrueKnowledge result interface
Figure 3 Input-output view of a collective knowledge base
Figure 4 Abstract view of QA system architecture
Figure 5 Abstract functional architecture of WolframAlpha
Figure 6 Abstract functional architecture of TrueKnowledge
Figure 7 Proposed functional architecture of CoKo
Figure 8 CoKo's three-tier architecture
Figure 9 Data flow between key classes
Figure 10 Building a full-featured search application using Lucene
Figure 11 Illustration of meta-variable replacement with the result from metaquery execution
Figure 12 Flow of operations involved in mapping a user query to a SPARQL query
Figure 13 From pile of triples to an intelligible interface
Figure 14 Additional dispatcher
Figure 15 Abstract view of query workflow
Abstract
With the rise in popularity of the web, more and more people are looking for services
which can help them to find information quickly. One such service is a question answering
(QA) system. It is a technique to provide accurate answers to users’ questions. Given a
question such as “Which is the longest river in the world?”, a keyword-based search
engine (e.g. Google) will return a list of URLs to web pages that probably contain the answer,
whereas a QA system attempts to directly answer the question with the name of the
river along with some other background details.
CoKo (Collaborative Knowledge) is an attempt at a Semantic Web based open-domain
question answering (QA) system, built upon a community-curated structured knowledge
base. It is envisioned to return coherent answers to users’ natural language questions,
along with appropriate visualizations (e.g. charts, maps, tables etc.) in order to make the
answers more intelligible and analyzable. This thesis describes the conception, design and
development of a prototype of this system.
This prototype was developed to determine the utility, feasibility and challenges of
developing a QA system designed to work on a collaboratively curated Linked data
knowledge base. It also aims to analyze the potential of such a system as an entry in the
Semantic Web Challenge.
During the evaluation of this prototype, we identified several bottlenecks that currently
limit the curation process. Finally, this thesis contemplates some of the remaining
challenges and provides suggestions for any subsequent attempt at the system.
Declaration
No portion of the work referred to in this dissertation has been submitted in support of
an application for another degree or qualification of this or any other university or other
institute of learning.
Copyright
I. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and
s/he has given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
II. Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs and
Patents Act 1988 (as amended) and regulations issued under it or, where
appropriate, in accordance with licensing agreements which the University has
entered into. This page must form part of any such copies made.
III. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of
copyright works in the dissertation, for example graphs and tables
(“Reproductions”), which may be described in this dissertation, may not be owned
by the author and may be owned by third parties. Such Intellectual Property and
Reproductions cannot and must not be made available for use without the prior
written permission of the owner(s) of the relevant Intellectual Property and/or
Reproductions.
IV. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in
any relevant Dissertation restriction declarations deposited in the University
Library, The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgement
I would like to thank my supervisor Dr. Bijan Parsia for his continued guidance throughout
this project. His guidance not only helped me to understand the potential benefits of the
project, but also helped me to overcome a number of difficulties throughout the course
of implementing the project.
I would also like to thank my family; my parents and sister who have supported and
encouraged me to pursue further education. None of this would have been possible
without their help.
Chapter 1
Introduction
This thesis proposes a Semantic Web collaborative open-domain question answering
(QA) system. What does it mean to be a Semantic Web collaborative open-domain
question answering system? Traditional keyword-based search engines like Google,
Yahoo, Bing etc. are efficient at providing a list of best possible results from their large
repositories of indexed HTML documents, but they fail to provide direct answers to user
queries like “what is the nutritional value of an apple?” or “distance between Manchester
and London” etc. For users involved with complex information gathering tasks these
search engines are far from providing a complete web search solution, as the user has to
manually identify and aggregate the pieces of relevant information from a selection of
various recommended web sites, where each site presents information in its own format.
Besides the problem of time-consuming manual aggregation of appropriate information,
other problems faced by web users are high recall and keyword-sensitive results [1].
Along with the web pages containing relevant information, these search engines also
return a huge number of mildly or completely irrelevant pages, and often, due to
keyword sensitivity, the search engines are unable to retrieve the desired page if it
uses terms different from those in the posed query.
The major difficulty faced by these search engines is the lack of machine-interpretable
web content. The way HTML documents on the web are currently deployed, the real
content of a web page is contained in the text, which makes it difficult for machines to
interpret, extract and process the information. In order to overcome this shortcoming,
the Semantic Web initiative provides a suite of technologies to represent Web content in
machine-processable format, allowing machines to determine the meaning of the content
and thus assist in the information gathering process [2].
Collaborative knowledge management is the new buzzword on the web today. Popular
websites like Wikipedia predominantly operate on user contributed content. Creating a
knowledge base from scratch is expensive in terms of both time and effort. Especially for
an open domain system, it would be difficult for an individual or an organization to single-
handedly collect and maintain data about all the knowledge in the world. Therefore, it is
often practical to share this burden with the community of web users. This act of
outsourcing the process of knowledge base creation and curation to an undefined group
of individuals is also known as “Crowdsourcing”[3].
Question answering (QA) systems aren’t a new concept on the web; they have been a topic of
research for several years [4] [5]. Unlike traditional information retrieval systems (like
search engines), which return a list of best-matching documents, a QA system consults its
knowledge base to generate direct answers to users’ questions [6]. There are several such
systems available on the web today, and they can be classified [7] into closed-domain QA
systems, which deal with questions from a specific domain such as medicine or sports, and
open-domain QA systems, which can deal with questions on just about anything. A
common characteristic of all these systems is that they allow users to ask questions in
natural language and then return a concise answer by looking it up in
their knowledge base. However, some of these QA systems are only able to provide
answers to questions about which they have explicit knowledge. This is mainly due to the lack
of semantics in the way the data is represented in their knowledge base; as a result, these
systems are unable to draw meaningful relations between pieces of knowledge in the knowledge
base, or to perform the reasoning and inference needed to answer questions about implicit
knowledge. For example Ask.com1, one of the typical QA systems on the web, could
provide direct answers to the questions “How long is the river Nile?” and “Which is the
longest river in the world?” but could not provide a direct answer to the question “What is
the length of the longest river in the world?”.
1 http://www.ask.com
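To illustrate how a structured, semantic knowledge base supports such implicit answers, the sketch below shows one hypothetical SPARQL query (the dbo:River class and dbo:length property are assumed, DBpedia-style; the actual vocabulary depends on the datasets in use). Rather than looking up a stored fact about "the longest river", the answer is derived by sorting rivers by length:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Assumed vocabulary: rivers are typed dbo:River and carry a numeric dbo:length.
    # The longest river and its length are derived by ordering, not stored explicitly.
    SELECT ?river ?length
    WHERE {
      ?river rdf:type dbo:River .
      ?river dbo:length ?length .
    }
    ORDER BY DESC(?length)
    LIMIT 1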
1.1 What is CoKo?
CoKo (Collaborative Knowledge) is an automatic open-domain question answering
system, exploiting Semantic Web technologies to provide coherent answers to users’
questions. It leverages recent developments in social collaborative knowledge
building by involving its users in the knowledge base building and curation process. Users can
contribute their datasets to the knowledge base and ask questions about explicit as well
as implicit information present in the knowledge base.
Besides finding direct and concise answers to their questions, users are also looking for
ways to organize and visualize their search results, for easy
interpretation and analysis of the results. In addition, there is a growing desire among
web users to be able to share their findings. To this end CoKo will present the answers
along with a range of appropriate visualizations (e.g. charts, maps, tables etc.) so as to
make the answers more intelligible and analyzable for the user. It will also facilitate
sharing of the answers and visualizations via social networking websites (like Facebook,
Twitter etc.), blogs and email.
1.2 Motivation
The design of CoKo was motivated by two popular question answering engines on the
web today - Wolfram|Alpha2 and True Knowledge3. Each of these applications has its
own strengths and weaknesses. CoKo aims to capitalize on their strengths by
implementing the most desirable features from both these applications.
Wolfram|Alpha is a web based computational knowledge engine, which generates
answers to users’ questions by doing computations on its internal knowledge base using
inbuilt algorithms. It has a natural language interface (refer Figure 1) which accepts user
questions in plain language and returns detailed answers accompanied by visualizations
(like graphs, timelines etc.) that allow for easy analysis of the answer. It
also provides facets for alternate visualization or representation of the results. In order to
compute a result, most of the data is derived from multiple sources, and a list of these
sources and references is given at the bottom of the result set.
2 http://www.wolframalpha.com/
3 http://www.trueknowledge.com/
Wolfram|Alpha represents knowledge using its own internal techniques and does not
directly apply Semantic Web technology [40] and therefore the data in its knowledge
base cannot be accessed, linked or reused by other semantic applications across the Web.
Perhaps the use of linked data technology would make it easier to pull data from outside
as well.
Although it provides access to its platform through its API, it doesn’t provide the
flexibility to manipulate or enhance the data. Its knowledge base is built from knowledge
extracted from various sources, combined and hand-curated exclusively by the
Wolfram|Alpha team. This restriction on the data is slowing down the growth of its
knowledge base to some extent [41]; for example, for the query “GDP India vs GDP Sri
Lanka” it could only fetch GDP data up to the year 2008 for Sri Lanka and up to the year 2009
for India.
Figure 1 Section of Wolfram|Alpha result interface for the query "GDP India vs GDP Sri Lanka" (annotated with: natural language input interpretation, specific result compiled into tabular format, useful result visualization, outdated data, source of data, facet provided for an alternate (reverse) view of the data table)
True Knowledge4, on the other hand, is a Semantic technology based answer engine
which exploits community curation to build its knowledge base. It adds facts to
its knowledge base either by importing them from external sources like Wikipedia and Freebase
or through user input by means of a thorough and controlled input form. Validity of these
facts is checked at two levels. Firstly, since the system can understand the facts, it is able
to discard statements which are inconsistent with the existing knowledge in the database.
Secondly users can approve or deny any formerly added facts, thus improving the quality
of knowledge in the database [42].
Although TrueKnowledge allows users to add facts, the knowledge base is still
somewhat reserved and exclusive. Users can add data in two ways: either by providing
facts in terms of objects and classes, or by directly entering an answer as free-flowing plain
text in a text area. It doesn’t give users the freedom to upload complete datasets. An
answer entered as plain text is not stored as facts; rather, it is displayed as such
when the question is asked again.
4 http://www.trueknowledge.com/
Figure 2 Section of TrueKnowledge result interface for the query "Is GDP of India more than GDP of Sri Lanka?" (annotated with: precise answer, input interpretation, user feedback on result, facts used to derive the result, user feedback on facts)
The interface to add facts is quite constrained, and the
user can add only bits of knowledge, which can be a time consuming process in case a
user wants to contribute a large amount of data.
Searching for answers using each of these applications has its own ups and downs.
Where Wolfram|Alpha’s forte is providing an attractive presentation of results, True
Knowledge’s strong point is the active contributions of its users in building and improving the
knowledge base.
CoKo utilizes similar methodologies to these two applications, with emphasis on the
following aspects:
Utilize Semantic Web technologies. Like TrueKnowledge, it is based on Semantic Web
technologies. It utilizes RDF to store datasets, OWL to link datasets and SPARQL to query
over datasets (a small query sketch illustrating this combination follows the list below).
Visualize results. Like Wolfram|Alpha, it presents the results with rich data and graphics.
Users are able to interactively choose alternate visualization methods, in order to obtain
an optimum view for their result.
Enable a collaborative and social environment. Users can contribute entire datasets and
not just facts. They will also be able to share their queries and results on social sites such
as blogs, Twitter and Facebook.
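As a rough, hypothetical sketch of the first aspect (not CoKo's actual schema), the query below assumes that two contributed datasets describe India under different URIs and that an owl:sameAs link between them was asserted when the datasets were uploaded; the ex:gdp property and both URIs are invented for the example. SPARQL can then retrieve the fact regardless of which dataset recorded it:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ex:  <http://example.org/stats/>

    # The UNION checks the owl:sameAs link in either direction,
    # mimicking its symmetry without requiring a reasoner.
    SELECT ?gdp
    WHERE {
      { ?india owl:sameAs <http://example.org/factbook/India> }
      UNION
      { <http://example.org/factbook/India> owl:sameAs ?india }
      ?india ex:gdp ?gdp .
    }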
1.3 Attempt at Semantic Web Challenge
The overall purpose of this project is to implement a Semantic Web application which has
the potential to become a winning entry to the Semantic Web Challenge (SWC)5, Open
Track. SWC is a competition conducted at the International Semantic Web Conference
(ISWC), which invites entries for Semantic Web-based end-user applications.
The approach of this project to the SWC exploits the trend towards making Linked Data
useful by using Semantic Web technologies. A simple combination of basic Semantic
Web technologies - RDF, SPARQL and web services - will enable it to successfully meet all
of the challenge requirements.
5 http://iswc2011.semanticweb.org/calls/semantic-web-challenge/
The Linked Open Data community project has encouraged many data publishers to release
public data according to Linked Data principles6. Through this effort a large number of
datasets have now been published on the web. At the time of this writing, some estimate
that over 200 million data sets7 exist in the wild. Despite the large amount of open linked
data available on the web today, applications that make use of Linked Data are not
mainstream yet. CoKo is an attempt to demonstrate and utilize the power of Linked RDF
data. It will consume linked data and allow users to discover answers to their questions,
using reasoning and inference over these data sets.
1.4 Project Objectives
Functional Objective
Corresponding to the growth of information on the web, there is a growing need for QA
systems that can help to better utilize, organize and analyze the ever-accumulating
information. Hence the functional objective of CoKo is to present rich, meaningful
visualizations (charts, maps, graphs etc.) corresponding to the answers, by exploiting the
semantics of the data, thus making the answers more comprehensible and analyzable.
Technical objective
The underlying technical objective of CoKo is to provide an end-to-end system for sharing
and curating Linked data in a collaborative environment, which would eventually improve
the quality of the answers for the end users of the QA system.
6 http://www.w3.org/DesignIssues/LinkedData.html
7 Sindice claims to be searching on about 228.19 million documents
Chapter 2
Background and Initial Research
2.1 Semantic Web Challenge
The Semantic Web Challenge (SWC) calls for applications that exploit Semantic Web
technology in a way that demonstrates the benefits of the technology. It consists of two
tracks: the “Open Track”, which requires applications to make use of the meaning of
information on the web, and the “Billion Triples Track”, which requires applications to deal
with a huge amount of predefined data gathered from the web. The challenge doesn’t specify any
particular domain or technology; instead, a set of minimum requirements and additional
desired requirements is specified for each track, thus allowing a wide range of
application submissions.
In order to be accepted as a valid entry for the open track of the challenge, the
application is required to realize the following minimum requirements defined by the
organizers:
- An end-user application with practical worth to a general user or at least to a
domain expert.
- The data sources should be syntactically and structurally diverse, should be under
diverse ownership and should contain a considerable amount of real world data.
- The data should be processed in order to derive useful information, which would
not be possible or would be difficult to extract with the help of conventional web
technologies.
Additional desirable requirements:
- The final application exhibits benefits of semantic technologies, has a
functional interface for the end user and is accessible on various devices.
- Innovative use of semantic technology for a domain or task that hasn't been
considered before.
- The application has commercial potential.
- The application is scalable and uses dynamic data in combination with static data,
preferably published on the Semantic Web.
- Validation of the results with the help of contextual ratings or rankings.
- Use of multimedia documents.
- The application provides support for several languages.
- Functionality should not be mere information retrieval.
2.1.1 Former entries
The first SWC was held in the year 2003 and over the past eight years, the Challenge has
attracted more than 140 entries. These entries have demonstrated a range of different
applications, from full web-scale search services to simple recommendation systems,
serving different domains (e.g. biomedical science, academic research etc.) and have even
covered different platforms (e.g. mobile phones, iPad and iPod). During the initial phase
of the project many of the former entries to the challenge were studied, but due to
limitations of space only a few will be discussed in this section.
Many of the former entries to SWC focused on demonstrating how Semantic Web and
presentation technologies can be deployed to provide better search and browsing
support, starting from CS AKTive-Space (CAS) which was the winner of the very first SWC
held in 2003. CAS was a Semantic Web application which allowed funding agencies,
researchers and students to explore the Computer Science research domain in the United
Kingdom. It provided an integrated view of Computer Science research related
information aggregated from multiple heterogeneous sources, such as published RDF
sources, personal web pages, and databases. It allowed users to query, explore and
organize information in order to discover rich relations. [8]
MultimediaN E-Culture demonstrator8 (winner of SWC 2006) is a Semantic Web
application to interactively search, navigate and annotate web based media collections. It
provides a keyword-based search over the annotated collection and returns semantically
grouped clusters of query results. It allows the user to view the results using various
available presentation mechanisms. [9]
8 http://e-culture.multimedian.nl/demo/session/search
Arnetminer9 (one of the entries of SWC 2007) provides mining and searching services for
researcher social networks. A semantic based profile is created for each researcher by
automatically extracting and integrating data (e.g., the bibliographic data and the
researcher profiles) from multiple sources on the web, and is stored in a researcher
network knowledge base (RNKB). It provides three types of search services: person
search, publication search and conference search, and five types of mining services:
expert finding, people association finding, hot-topic finding, sub-topic finding, and survey
paper finding over the RNKB. [10]
Sig.ma10 (winner of SWC 2009) is a Semantic Web Search engine, built on top of Sindice11.
Sindice parses information on the web looking for RDFa and microformats; in
particular, it parses well-known structured information in pages such as Wikipedia,
Wordpress blogs, etc., retrieving information and translating it into triples [11]. Sig.ma
aggregates data about entities from these sources and presents it to the user using a
template-based result interface. A very innovative aspect of the application is the method
that it provides to its users for dealing with the information quality challenges. It allows
the users to approve, reject or add a new source from which the result has been
aggregated, thus learning from user feedback. It provides a customizable result interface
which can be shared and embedded in blogs, websites and other applications.
Visinav12 (winner of SWC 2009) is a system which can be used to search and navigate the
Web of data. It aggregates RDF data sets by crawling the open web and allows for faceted
browsing of these datasets [3]. Its functionality goes beyond keyword based search by
allowing the users to visually construct and refine their queries via facet selection
operation. Users start by providing a keyword to locate objects, and subsequently they
can refine their queries to form more complex queries. The system is intuitive and calculates
the possible next steps based on the current state, thus displaying only legal choices to
the user for query construction and refinement. It is an exploratory search engine.
9 http://www.arnetminer.org
10 http://sig.ma/
11 http://sindice.com/
12 http://visinav.deri.org/
NCBO Resource Index13 (winner of the SWC 2010) is a semantic search application for
researchers to browse and analyze the information stored in 22 diverse biomedical
resources. The textual descriptions of the data residing within these resources are
annotated with the help of various ontology terms and indexed. It can then be searched
for concepts through an intuitive interface. Tag clouds are provided in the results to
visualize concepts related to the current search query, and color intensity is used to represent
more relevant resources based on the current search terms.
Table 1 contrasts the features of CoKo with the former entries discussed above.
13 http://bioportal.bioontology.org/resource_index_ui
Facet based:
CS AKTive-Space
- Dataset: multiple heterogeneous sources, such as published RDF sources, personal web pages, and databases
- Service: investigate an area of research and a researcher based on their scholarly impact and their research grant income
- Result: variety of visualizations and multi-dimensional representations
- End-user: funding council, researchers, graduate students
- Personalisation: none

Keyword based:
MultimediaN E-Culture demonstrator
- Dataset: annotated index of large virtual collections of cultural-heritage objects
- Service: annotation of web resources representing images; search for artwork, artefact, concept, location or person
- Result: clustered thumbnails of paintings along with their titles
- End-user: collection holders
- Personalisation: privileged users can add, delete and edit RDF metadata of paintings

Arnetminer
- Dataset: knowledge base containing integrated data extracted from researchers' profiles and crawled publications
- Service: profile search, expert finding, conference analysis, course search, sub-graph search, topic browser, academic ranks
- Result: different interfaces for different services
- End-user: academic community
- Personalisation: registered users can modify extracted profile information, provide feedback on the search results, and follow researchers

VisiNav
- Dataset: index of structured data crawled from the Web
- Service: object search
- Result: ranked list of objects crawled from the web with the option of alternate views like Table and Timeline
- End-user: general web user
- Personalisation: none

Sig.ma
- Dataset: index of structured data crawled from the Web
- Service: object search
- Result: mashup of information retrieved from various sources
- End-user: general web user
- Personalisation: learns from user feedback; search results can be shared on the web

NCBO Resource Index
- Dataset: index of annotated data from 22 diverse biomedical resources
- Service: search biomedical data based on ontology concepts
- Result: list of relevant data from a selected resource; tag cloud of related concepts
- End-user: biomedical researchers
- Personalisation: none

NLP based:
CoKo
- Dataset: internal knowledge base of structured data contributed by users
- Service: question-answering service
- Result: precise answers; rich visualization (e.g. tables, graphs, maps, etc.)
- End-user: general web user (can be customized for domain-specific use)
- Personalisation: users can contribute data, provide feedback on results, and share results

Table 1 Overview of features of former entries of SWC as compared to CoKo
2.2 Linked data user-interfaces
There is a huge amount of Linked data available on the web today, but it’s a challenge for
lay users to understand the potential of this data, due to their lack of knowledge about
the intricacies of RDF and other Semantic Web technologies. Even though there has been
much research devoted to providing efficient and comprehensible user interfaces for
linked data, it is still considered an open research problem.
There are various kinds of linked data user interfaces available, such as triple query
builders, relationship finders, mash-ups, faceted browsers etc. [13]. Most of these linked data
interfaces allow users to search and explore data in a fashion similar to a traditional
search engine. However, some of these require the user to be familiar with RDF triples
and thus pose a challenge for lay users. It is observed in [15] and [13] that only faceted
and triple query builder interfaces allow non-expert Linked Data users to efficiently pose
complex and expressive queries to large data repositories. For the purpose of this thesis
only faceted browsers and triple query builders will be discussed in the following sections.
2.2.1 Faceted browsers
Faceted browsers are one of the most popular linked data interfaces available on the web
today. These browsers enable users to perform exploratory search by allowing smooth
browsing through the RDF graph [16]. Users start with one resource and are able to
navigate the data space consisting of different data sources by following RDF links. These
browsers have been reported to follow several different approaches to display search
results. On one hand, tools like mSpace [19], Flamenco [20], Longwell14 and Haystack15
display the facets along with a list of results, using only facets directly connected to the
result. On the other hand, tools like Parallax [21], Humboldt [22], Tabulator [23] and
Nested Faceted Browser16 allow hierarchical filtering of the results. In addition to
providing properties and values of resources, some of these browsers (e.g. Tabulator) also
provide visualizations like maps and timelines. However, these browsers only provide a
limited number of visualizations, and the source code needs to be modified in order to
provide new visualizations [24]. In [25] the authors examine some of the current
unconventional linked data browsers and draw a comparison between them with the help
of Table 2.
14 http://simile.mit.edu/wiki/Longwell
15 http://simile.mit.edu/hayloft/
16 http://people.csail.mit.edu/dfhuynh/projects/nfb/
BrownSauce
- Runs: local web server
- Data sources: one at a time
- Data formats: RDF (any)
- Data presentations: indented text list
- Presentation selection: hard-coded at compile-time

Disco
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: predicate-object table
- Presentation selection: hard-coded at compile-time

Exhibit
- Runs: HTML web browser
- Data sources: single, author-time specified
- Data formats: JSON only
- Data presentations: list, timeline, map, graph, table, custom HTML form
- Presentation selection: HTML, author-time

Marbles
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: predicate-object table
- Presentation selection: Fresnel

ObjectViewer
- Runs: desktop Java
- Data sources: multiple, unbounded
- Data formats: RDF (any supported by Jena)
- Data presentations: graph (interactive)
- Presentation selection: hard-coded at compile-time

Tabulator
- Runs: Firefox web browser extension
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: table, calendar, map, friends, RDF/N3, RDF/XML, HTML
- Presentation selection: run-time, manually by user; available presentations are decided by data-type

Zitgist DataViewer
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: templates, predicate-object table
- Presentation selection: automatically selected by data-type

Table 2 Comparison of Semantic Web Browsers
Note. Reprinted from “A review of user interface adaption in current semantic web browsers” by Turner, E., A. Hinze, and S. Jones.
A major drawback of faceted browsers is that a broader view of the dataset being viewed
isn’t supported, as they allow only a limited set of queries. Facets are not efficient for
formulating complex queries and only work well for simple queries like searching for
objects belonging to a particular class. In [26] the authors give two examples where these
browsers are inefficient.
i) “Persons who went to school in Germany”
Faceted browsers do not work well for this query because “in Germany” is
mentioned in the context of the school and not the person (see the query sketch after these examples).
ii) “Persons who live in x”, where x is a small city
In this case there are other more frequently occurring patterns, which would be
offered to the user as facets, rather than the facet “live in x”.
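For comparison, example (i) is straightforward when written directly as a graph pattern. The hypothetical query below (class and property names are assumed, DBpedia-style) makes the indirection explicit: the country constraint applies to the school, which is then joined to the person:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # "Persons who went to school in Germany": the location constraint
    # sits on the join variable ?school, not on the person itself.
    SELECT ?person
    WHERE {
      ?person rdf:type dbo:Person .
      ?person dbo:school ?school .
      ?school dbo:country dbr:Germany .
    }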
2.2.2 Query builders
These can be classified into two categories: i) visual SPARQL query builders, e.g. DBpedia
Query Builder, and ii) NLP based query builders, e.g. PowerAqua.
Visual SPARQL query builders provide a triple based interface, which allows users to pose
queries to a knowledge base by building triple patterns. Users are able to define filters,
pattern variables or identifiers for the subjects, predicates and objects. The major
drawback of this approach is that the users need to have a basic understanding of SPARQL
query syntax and its functioning. Users also need to know the terminology and structure
of the underlying schema to be able to formulate efficient queries. These tools provide
suitable options to the users for building triple patterns, by providing a list of predicates
for each typed identifier. The predicates suggested are the ones which are actually
related to the identifier in the repository, thus helping the user to explore and create
complex queries by analyzing the relationships between instances. However, graph-based
visualizations where all property values are analyzed in order to provide suggestions are
resource expensive [27]. Also, users are not presented with any actual data during
query construction; therefore they do not know what data the repository holds and the
kind of queries it can answer.
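As a concrete, hypothetical illustration of what such a builder assembles, the triple patterns and filter below correspond to a query like "films directed by Steven Spielberg and released after 2000"; once the identifier ?film has been typed, a builder would suggest predicates such as the (assumed) dbo:director and dbo:releaseDate:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Each row of the builder corresponds to one triple pattern;
    # the FILTER is attached to the typed variable ?date.
    SELECT ?film ?date
    WHERE {
      ?film rdf:type dbo:Film .
      ?film dbo:director dbr:Steven_Spielberg .
      ?film dbo:releaseDate ?date .
      FILTER (?date > "2000-01-01"^^xsd:date)
    }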
NLP based query builders allow users to pose NL queries which are then processed and
transformed into queries for the repository. These interfaces can be categorized into full
NL interfaces, e.g. PowerAqua [28], and Controlled Natural Language (CNL) interfaces, e.g.
Ginseng [29]. The main difference between these two approaches is that the latter is not
dependent on a predefined vocabulary and doesn’t generate any syntactical or logical
interpretation of the input queries; instead, it controls the user input by only allowing
queries that conform to a grammar generated from the terms and structure of the
internal ontology. This ensures that every query can be interpreted correctly. One of the
major drawbacks of full NL interfaces is that if the system is unable to provide an answer
to a user query, the user does not know whether the result couldn’t be retrieved because of
an inadequately posed query which couldn’t be interpreted by the system or because the
underlying schema of the repository doesn’t support the query interpretation. In that
case CNL interfaces are better as they guide the user to only input queries that can be
interpreted.
2.3 Crowdsourcing
Building and maintaining a knowledge base from scratch can be a time consuming and
difficult task for any domain. It would certainly be an enormous challenge when
building one for an open-domain question answering engine. As such, it is often practical to
outsource this knowledge base augmentation process to the community. The value
created by the collective contributions of people in the community is often referred to as
"collective intelligence" or "wisdom of crowds" [30]. The term “Crowdsourcing” was
coined by Jeff Howe, who defined it as "the act of a company or institution taking a
function once performed by employees and outsourcing it to an undefined (and generally
large) network of people in the form of an open call." It is a widely used concept on the
web today, with Wikipedia being one of the classic examples, where thousands of users
volunteer to create an encyclopedia that studies show is as accurate as traditional
volumes like Britannica.
2.3.1 Architecture for Collective Knowledge Bases
An input-output view of a collective knowledge base (KB), with two
continuous loops of interaction, is represented in Figure 3. The KB receives three streams of
information from contributors and users (who may or may not be disjoint): 1) Rules
and facts from contributors, 2) Queries and evidence from users and 3) Feedback on the
answers, from users. In turn it produces two streams of information: 1) Answers to
queries and 2) Credit to contributors. As a result, the contributors and users are involved
in a (many-to-many) interaction via the knowledge base: contributions from many
different contributors might be used to derive an answer to a query and the feedback
about the answer will in turn be propagated to different contributors. On the other hand,
many different queries may be answered using a single contribution, which will receive
feedback from all of them. [31]
The key element of this architecture is that the KB is a result of human contributions and
machine learning, drawing value from their respective strengths and weaknesses. Human
beings are good at judging the quality of the end result but fail to efficiently carry out
reasoning over large amounts of data, whereas machines are capable of handling large
amounts of data and doing computations with it. Another key element is that the
contributors are continuously receiving feedback on the quality and validity of
contributions and the evolving knowledge base is being scrutinized through queries and
their outcomes. Thus the knowledge is more likely to be accurate. [31]
In [31] the authors suggest that the above proposed model would help to deal with the
following problems associated with community curated knowledge bases:
Quality
In large scale community driven knowledge bases, it is difficult to ensure the quality of the
information contributed by individuals when little is known about their areas of expertise.
Therefore, it is important that checks be put in place to estimate the quality of the
contributions.
Figure 3 Input-output view of a collective knowledge base.
Reprinted from “Building Large Knowledge Bases by Mass Collaboration,” by M. Richardson and P. Domingos, in
Proceedings of the 2nd international conference on Knowledge capture, New York, NY, USA, 2003, p. 129–137.
All the knowledge contributed by individual users should be subjected to community
feedback, along with some sort of machine learning mechanism. This would encourage
users to provide good quality data, since the efficacy of the knowledge is being tracked.
Consistency
Consistency of a large knowledge base suffers when contradicting facts are added by the
same or different contributors. With the growing size of the knowledge base it is highly
likely that contradicting facts will be encountered, due to the lack of coordination between
the contributors.
In order to filter out the noisy and inconsistent information contributed, measures of quality and accuracy
should be coupled with the information, with the help of probabilistic parameters.
Relevance
In large scale distributed management of a knowledge base, it is often difficult to ensure
that the data contributed by volunteers is in conformance with the actual goal of the
application. If this is not effectively ensured, it may render the volunteers' effort futile.
The knowledge requirements of the application should be properly conveyed to the
contributors and they should be given credit for providing any datasets which are used in
producing highly rated answers. It is expected that users will contribute datasets from
their fields of interests and expertise.
Motivation of contributors
Since the success of a collective knowledge base depends on the quality of data
contributed by the volunteers, they should be given due credit for providing high quality
data, in order to motivate them.
The system should incorporate a fair method to recognize and reward the contributors
(e.g. listing the top contributors, awarding virtual badges or titles, etc.) for sharing their knowledge
and expertise.
2.4 Provenance and trust
Provenance means the origin, or the source, of something.17 As discussed in the previous
section, data quality is one of the major concerns in a collaborative data management
environment. Tracking data provenance helps to estimate the quality of data based on its
source and the transformations applied to it. Provenance is not only important to assess data quality
but also to determine the source of data to ascertain trustworthiness, ownership of data,
timeliness and others as described in [32].
2.4.1 Trust assessment
Trust assessment methods can be classified as follows [33]:
1) Reputation or Ratings-Based
These methods allow users to provide ratings based on their experience or trust in a
particular entity, which can then in turn help other users to decide what or who to
trust or prefer [34]. This category includes ratings-based systems similar to those used by eBay and
Amazon, and Web-Of-Trust mechanisms. The majority of trust architectures proposed for
Semantic Web based applications fall into this category. The major drawback of this
approach is that explicit and topic-wise trust ratings need to be provided, which puts
extra overhead on information consumers to obtain and maintain these ratings.
2) Context-Based
These methods collect metadata about the conditions in which information has been
provided, such as the four W’s – Who, What, When and Why. Trust decisions are based on
roles or memberships of individuals. Example policies given in [33] are: "Prefer
product descriptions published by the manufacturer over descriptions published by a
vendor" or "Distrust everything a vendor says about its competitor."
3) Content-Based
These methods are based on rules and axioms along with the data itself and related
data published by others. Example policies for this approach given in [33] are: “Believe
information which has been stated by at least 5 independent sources.” or “Distrust
product prices that are more than 50% below the average price.”
17 "provenance, n.". OED Online. March 2011. Oxford University Press.
http://www.oed.com/view/Entry/153408 (accessed April 30, 2011).
The ratings based mechanism is a widely deployed trust assessment method on the
web today, being used by sites like Amazon and eBay to rate sellers. However, this method
only captures users’ opinions about a particular element, without any other information
about the source or producer. It limits the ways in which users can express their
justification for trusting a particular entity [34].
Content and context based mechanisms, on the other hand, are independent of trust
ratings and only require background information, which is usually available on the
Semantic Web in the form of provenance metadata and thus can be utilized in making trust
decisions [33].
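To make the content-based style concrete, the sketch below approximates the "at least 5 independent sources" policy with SPARQL 1.1 aggregates, under the illustrative assumption that each contributed dataset is stored in its own named graph (this is not a description of any particular existing system):

    # Keep only statements asserted in at least five distinct named graphs,
    # i.e. by at least five independent sources.
    SELECT ?s ?p ?o (COUNT(DISTINCT ?g) AS ?sources)
    WHERE {
      GRAPH ?g { ?s ?p ?o }
    }
    GROUP BY ?s ?p ?o
    HAVING (COUNT(DISTINCT ?g) >= 5)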
2.4.2 Types of Provenance
In [35] the authors describe two kinds of provenance: workflow provenance and data provenance. A
workflow is a set of steps followed to reach from an initial state to the target state.
Workflow provenance aims to maintain a record about “the entire history of the
derivation of the final output” [35] of a workflow. Whereas, data provenance is
concerned with preserving the details about the origin and derivation of individual pieces
of data. In [36] the authors characterize data provenance as why- and where-provenance, i.e.
information about the origin of a piece of data in the result of a query and the location
from where that data has been extracted. Additionally, how-provenance was introduced
in [37] to describe how the origin was used in deriving the result. Furthermore, an
analogue to data provenance, called knowledge provenance, is discussed in [38]. It is
similar to data provenance except that it includes information about the extensive
reasoning used to derive data either before it is inserted into the knowledge base or after
it is retrieved from the knowledge base.
2.4.3 Provenance Representation
Different schemes can be used to represent provenance information, each having its own
implications on the cost of storing them and the information provided. Two methods to
represent provenance have been described in [39] as:
1) The Inversion method
This method exploits the relationship between the input and the output data. It works
backwards from the output data to determine the input data used in the derivation
process. This is a compact method of representation and the information provided is
limited to the derivation history of the data.
2) The Annotation method
In this method, metadata about the derivation history of data, description about
source of data, and processes are aggregated as annotations. This method pre-
computes the provenance, which can thus be readily used as metadata. This method
provides more elaborate information than the inversion method and may sometimes
also include the parameters used in the derivation process, post-conditions, related
publication references etc.
2.4.4 Provenance metadata
Current research on recording provenance metadata is focused either on associating RDF
triples with a named graph [40] [41] or on extending an RDF triple to a quadruple, where the
fourth element can be a URI, a blank node or an identifier [42] [43]. This fourth element
can be used to represent provenance information.
An RDF named graph is an explicit provenance mechanism where an RDF graph is assigned
a URI, which can then be referenced by other graphs as a normal resource. Thus it
allows assigning explicit provenance information to a set of RDF triples. However, one
drawback of RDF named graphs is that they do not support capturing provenance
information about implicit triples. For this, the use of colors to capture implicit and explicit
provenance information about RDF triples has been proposed in [44]. RDF triples are
augmented with a fourth element named color, which represents the different sources
used to derive the triple. A large number of vocabularies have been published to
represent provenance metadata. These vocabularies clearly describe the relationships
and concepts used in provenance annotations.
One general-purpose vocabulary which is widely used to represent provenance metadata
is Dublin Core. It has properties like dc:creator and dc:publisher, which can be used to
identify the creator and publisher of data, using URIs to identify them. On the other
hand, a popular provenance-specific vocabulary is the Provenance Vocabulary [45], which
was designed to deal with two dimensions of provenance, i.e. data creation and data
access. It contains three sets of terms to store provenance information: general, data
creation, and data access. The general terms contain classes and properties to describe
general provenance elements. The data creation dimension contains classes to describe
how data has been created and properties to identify the source data used during data
creation. The data access dimension provides classes and properties to describe the
retrieval of source data and creation guidelines.
For large datasets, storing provenance metadata at the triple level can lead to the
provenance information being larger than the data itself; for this reason the authors in
[39] describe the voidp vocabulary, a light-weight extension to voiD. It reuses entities
from the Provenance Vocabulary, the Time Ontology in OWL and the Semantic Web
Publishing Vocabulary.
Providing provenance information for answers is a fundamental requirement for any
question answering system. When answers are returned from such applications, users
want to know which sources were used, how reliable those sources are, and how an
implicit answer was derived.
2.5 Visualization
A lot of research is targeted at enhancing the user experience in information discovery
tasks on the web, one area being search result visualization. Visualization of search
results plays an important role in the clear understanding and analysis of the retrieved
information; it helps to give context to otherwise plain text results. For example, the
representation of disease data on a map would assist a public health professional in her
analysis of the spatial distribution of a disease and the effectiveness of disease control
policies.
Two major approaches to visualizing Semantic Web data are proposed in [46]: i)
visualization of the complete RDF graph and ii) visualization of SPARQL query results, i.e.
selective parts of the graph. The first approach is intended for users with a thorough
understanding of and interest in the structural visualization of RDF graphs and is thus
not suitable for general web users, who are only interested in visualizing the result (e.g.
charts, tables, pictures etc.) and not the underlying model.
The Data-gov Wiki18 provides some useful tutorials and demos on different ways of
visualizing data returned by SPARQL queries. Although the demos and tutorials are mainly
aimed at showing how Semantic Web technologies can be used in converting, enhancing
and using linked government data, the same techniques can also be applied to other
linked data sources.
The approach described in one of the tutorials, "How to render SPARQL results using
Google Visualization API" [47], as well as in [48], is based on the following three steps:
1. Query
Generate an appropriate SPARQL query and execute it against an appropriate SPARQL
endpoint to fetch the data of interest.
2. Transform
Transform the result of the query into an appropriate format, depending on the input
requirements of the visualization library being used, e.g. Google Visualization JSON for
the Google Visualization API. The transformation is carried out with a pre-defined XSLT
which converts the SPARQL XML bindings to the required format.
3. Visualize
The resulting document from the previous step can then be submitted to an appropriate
visualization service such as Exhibit or Google Visualization.
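To make the query and transform steps concrete, the following sketch (in Java, using only the standard library) executes a SELECT query against a SPARQL endpoint over HTTP and applies an XSLT stylesheet to the SPARQL/XML result. The sample query and the stylesheet file name (sparql2gviz.xsl) are illustrative placeholders rather than artefacts of the tutorials themselves.

import java.io.File;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SparqlToGoogleJson {
    public static void main(String[] args) throws Exception {
        // 1. Query: send a SPARQL SELECT query to an endpoint, asking for SPARQL/XML results.
        String endpoint = "http://data-gov.tw.rpi.edu/sparql";   // illustrative endpoint
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+xml");

        // 2. Transform: apply a pre-defined XSLT that maps SPARQL XML bindings
        //    to Google Visualization JSON (the stylesheet name is a placeholder).
        Transformer xslt = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("sparql2gviz.xsl")));
        StringWriter json = new StringWriter();
        xslt.transform(new StreamSource(conn.getInputStream()), new StreamResult(json));

        // 3. Visualize: hand the resulting JSON document to a visualization service,
        //    e.g. the Google Visualization API in the browser.
        System.out.println(json);
    }
}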
18 The Data-gov Wiki is a project being pursued in the Tetherless World Constellation to expose open
government datasets using Semantic Web.
Chapter 3
System Architecture
This chapter outlines the proposed architecture of CoKo. It begins with a discussion about
functional architecture of WolframAlpha and TrueKnowledge followed by a high-level
architecture description of CoKo. It also lists the basic system requirements of the
application.
3.1 QA system architecture
After observing the functioning of TrueKnowledge and WolframAlpha during the initial
phase of the project, it was clear that the basic architecture followed by these QA systems
involves at least a curator, an end user and a QA engine, which is the core software.
The core components of a QA engine are:
i) a dispatcher, which takes user input and returns the appropriate result and visualization;
ii) a query processor (QP), which processes the user input to retrieve appropriate data
from the knowledge base in response to a user query and passes it to the dispatcher;
iii) a data curation module, which handles the data augmentation and data cleansing
tasks for the data in the knowledge base.
Figure 4 Abstract view of QA system architecture
Figure 5 Abstract functional architecture of WolframAlpha
The end user interacts with the system via the user interface. The user enters a query
through this interface, which then flows between the dispatcher and the query processor
module of the QA engine to generate an appropriate result for the query. The curator
interacts with the curation module to augment the data and enhance its quality, which
in turn helps produce the desired end user experience. Additionally, a developer develops
and modifies the QA engine to support the curators and the end users in their respective
tasks.
Essentially, the QA engine architecture can be divided into an external environment,
which is under the control of the end user of the system, and an internal environment,
which is controlled by the system owner.
3.1.1 WolframAlpha Architecture
WolframAlpha has a centralized architecture with the curation and development tasks
being under the control of the internal environment. It appears that the development and
curation processes are tightly coupled and the engine is tweaked as part of the curation
process.
The only component which is outside this centralized control is the end user interaction.
The complete curation process is entirely masked from the end user.
Due to the highly centralized architecture of WolframAlpha, it doesn’t have to explicitly
deal with the issues of security, data quality and reliability, as all of these are internally
handled by its team of curators.
The QA engine of WolframAlpha consists of an NLP-based dispatcher, a Mathematica-based
query processor and an internally managed curation module.

Figure 6 Abstract functional architecture of TrueKnowledge
3.1.2 TrueKnowledge Architecture
The architecture of TrueKnowledge is semi-decentralized: it partially shares the task of
curation with its end user community.
It has an internal curation process which co-develops and co-manages the knowledge
base with an external community of curators. The internal curation process adds facts to
the knowledge base by importing from sources like Wikipedia and Freebase, whereas
users add knowledge by means of a thorough, controlled form-based input. The external
curation mechanism is essentially feedback-based, so users cannot add a large amount of
real-world data in one go.
This semi-distributed architecture makes the system susceptible to spoofing and abuse.
To overcome these threats, TrueKnowledge only allows registered users to curate and
contribute data and has implemented an internal curation mechanism which only accepts
knowledge that conforms to the existing knowledge in its knowledge base.
Its QA engine comprises an NLP-based dispatcher, a Semantic Web Technology (SWT)
based query processor and a quasi-distributed curation module.
3.1.3 CoKo Architecture
For CoKo we propose a truly decentralized architecture by moving the curation process
completely outside the internal environment. The task of curation is entirely managed by
the end user community.
This user-driven model requires the system to be more generic, flexible and less
complicated for the users. The proposed functional architecture of CoKo is illustrated in
Figure 7, along with the interaction between the components of the QA engine and the
end user. The system supports two levels of curation: i) adding new data and ii) providing
feedback about existing data.
Since it is an open-domain QA system, users can add new data pertaining to any domain
to the knowledge base. Due to the arbitrary nature of the data which can be fed to the
system, we ask users to submit a description of their data whenever they add it. This
description supports the query processing module in generating answers to user queries.
With this description we ask users to describe not only the data itself but also how to use
it and how it aggregates.
Figure 7 Proposed functional architecture of CoKo
End users can ask NL queries via the user interface, which are handled by an NLP-based
dispatcher. This dispatcher passes the query to the Semantic Web Technologies (SWT)
based query processor, which primarily involves a SPARQL query. The data retrieved from
the knowledge base by the query is then passed back to the dispatcher, which processes
this raw data into a more analyzable result format, essentially some kind of engaging
visualization. The end users can provide feedback about the result and the visualization,
creating a feedback loop which is stored in the knowledge base and helps to improve the
quality of the results in future.
The open system architecture of CoKo makes it vulnerable to many threats, such as:
1) Duplicate data upload due to lack of co-ordination between contributors
2) Spoofing attacks and abuse of the system
3) Poor data quality
4) Untrustworthy data
In this thesis we aim to address some of the problems stated above, to make the system
more robust and secure.
3.2 System Requirements
Based on the system architecture described in the previous section, we define a set of
high-level system requirements in Table 3.
Searching
Provide a keyword-based interface to search for knowledge.
Return a ranked list of all hits found in the knowledge base.
Visualizing the results
Support different visualizations.
Show relevant visualization according to its rating.
Sharing the results
Support sharing of search results and visualization through social networks (like Facebook,
Twitter etc.) and blogs.
Recommendations
Allow users to suggest keywords and visualizations for existing data.
Allow users to rate existing visualizations for a query.
Inspecting
Display complete background information about the dataset (e.g. source, creator, validity
period etc.)
Authentication
Implement an authentication system to prevent abuse of the system and to deal with other
problems related to an open system architecture.
Knowledge base Augmentation
Provide support for contribution of new datasets.
Provide a language for users to be able to describe their datasets.
Provide support to associate new queries with existing datasets.
Table 3 High level System Requirements
Chapter 4
Implementation
A prototype of CoKo has been developed to analyze the feasibility of the conceptual
architecture proposed in Chapter 3. This chapter provides the implementation details
along with key technologies used to realize this prototype.
4.1 Technical Overview
CoKo is a JSP and servlet based application, deployed on Apache Tomcat web server. The
entire application was developed on a Windows 7 machine using NetBeans IDE 6.9.1 with
JDK 1.6. The production version of the application is hosted on a machine running Ubuntu
10.10. The development of the application was carried out over a period of three months
and the version in production consists of 14 classes with over 2100 lines of code. Table 4
provides a summary of source code statistics.
Number of Lines of Code ~ 2196¹⁹
Number of Classes 20
Number of Production Classes 14
Number of Test Classes 6
Number of Methods 44
Table 4 Summary of CoKo source code statistics
The application is portable as all the technologies used in developing it are platform
independent. For a complete list of technologies used, please refer to Appendix A.
4.2 System Architecture
CoKo is based on the standard three-tier client-server architecture, which includes the
presentation tier, application tier and data storage tier. The presentation tier consists of the end
user's workstation running a standard web browser. This tier deals with the way
information is presented to the user, i.e. the GUI design. The server running the business
logic forms the application tier and the knowledge base is the data storage tier. Figure 8
graphically illustrates the various components of CoKo's three-tier architecture.
19 including blank lines and comments
4.2.1 Presentation Tier
JSP has been used as the primary technology to render content to the end user's browser.
For the purpose of articulation, the GUI can be categorized into two views: the end user
view and the curator view. However, these views are non-orthogonal.
End User View
The end user view is the interface for knowledge seekers who wish to search for data
in CoKo's knowledge base. It comprises the following two interfaces:
Query interface: It consists of a simple text box where knowledge seekers can input
their query (in English). The user query is sent to CoKo's Search API and, on finding a
match, the corresponding SPARQL query is executed. The data retrieved from query
execution is then fed to the Transformation API to be transformed into Google JSON
format. The transformed data is passed to the Google Visualization API before being
presented to the user via the Result interface.
Result interface: In order to provide intelligible search results, CoKo provides rich
visualizations of the results generated from SPARQL query execution. Search results and
the visualizations can be downloaded or shared via social networking websites, blogs or
email.

Figure 8 CoKo's Three-tier Architecture
Curator View
The curator view is the interface for users engaged in the task of augmenting the
knowledge base. The current implementation of CoKo supports the following levels of
curation:
Contribute new data by uploading datasets which can be local dataset files,
remote dataset files or links to remote endpoints. The datasets can be uploaded
along with data description files through CoKo’s upload tool. It’s a simple interface
which asks the user to specify whether they want to submit a local file or a remote
file and accordingly they can either upload the dataset file along with data
description file or the description file alone. Additionally, they can also upload
new queries for an existing dataset. Once the user publishes a new query, the
application generates a unique URI for the query, which can be used to retrieve
the SPARQL query for remote execution.
Provide feedback about the search results in the following ways:
recommend new visualizations
rate existing visualizations
eliminate an existing SPARQL query-visualization mapping
recommend new keywords for the query
eliminate an existing keyword-SPARQL query mapping
This creates a feedback loop which is stored in the knowledge base and helps to
improve the quality of the results in future.
Distinctive technology used: Google visualization API
There are a large number of high quality graphing and charting libraries available on the
web. For the purpose of this project Google Visualization API was used, as it is easy to
use, well-documented and provides a rich set of interactive visualizations ranging from
bar charts to word clouds. SPARQL XML bindings obtained as a result of SPARQL query
execution can be readily transformed to Google Visualization JSON format, using an XSLT.
The visualizations are rendered using JavaScript, and thus require that the end user's web
browser has JavaScript enabled. Even though the API provides a wide range of
visualizations, for simplicity the current implementation of the prototype only makes use
of Bar Chart, Pie Chart, Line Chart, Table and Map; other visualizations can be easily
added.
4.2.2 Application Tier
This tier consists of Java Bean classes, Java servlet classes and helper classes (non-servlet
classes). Figure 9 provides an overview of data flow between some of the key classes
which are described below:
Search API
This is a servlet class which accepts the user query as input and passes it to the query
handler, receiving in return a ranked list of all the SPARQL query hits found in the Lucene
index. The API can then forward a SPARQL query to the query processing module on
demand to retrieve the SPARQL/XML results, which are then forwarded to the
Transformation API. Additionally, it interacts with the knowledge base to retrieve
additional data about the query and the dataset corresponding to the query, e.g.
provenance information about the dataset, suggested visualizations for query result
representation etc. It binds this information with the query result before sending it to the
user interface. Although this API doesn't provide external access in the current
implementation, it can be easily extended to provide remote search functionality over
CoKo's knowledge base.
Query Handler
This class receives the user query as input from the Search API and normalizes it before
searching the Lucene index for a match. It uses various functions provided by Lucene to
normalize the keywords (discussed later).
Query Processor
This module receives a SPARQL query as input from the Search API. It parses the query to
determine whether it is a generic query (described in Chapter 5) or a general SPARQL
query. In case it is a generic query, it interacts with the knowledge base to retrieve and
execute the corresponding metaquery. The result of the metaquery, along with the
generic query, is then sent to the user interface to be disambiguated by the user. The
value selected by the user is then used to transform the generic query into a general
query, which is sent to the SPARQL engine for execution, and the retrieved results are
sent back to the Search API.
Transformation API
This API receives the SPARQL/XML results retrieved from running the SPARQL query and
transforms them into Google Visualization JSON format. The current implementation of
this API is limited to transforming the results to Google JSON format, but it can be
extended and exposed as a public API to allow transformation of SPARQL query results
using user-supplied XSLT.
Distinctive Technology used: Apache Lucene
Apache Lucene20 is an open source, highly scalable full-text search Java library. It provides
a simple API, focusing mainly on text indexing and searching and allows for easy
integration of these into any application. It is widely used to power websites like LinkedIn,
Twitter Trends - Twitter Analyzing Tool and many more. Wolfram Research also uses
Lucene for its internal tools, the Demonstrations project, the Mathematica documentation
search and for site searching [49].
20 http://lucene.apache.org/
Figure 9 Data flow between key classes
In order to perform a full-text search on a database, an index can be created for the
database fields on which the search is to be performed. Lucene's index structure is based
on the concept of an inverted index21, which allows for fast full-text searches [50]. It
supports ranked searching as well as many different query types, like phrase queries,
wildcard queries etc. Figure 10 provides an overview of the steps involved in building a
full-text search application using Lucene. It primarily involves indexing data, searching
data, and retrieving results.
Since Lucene is completely written in Java, it allowed for easy integration into our servlet-
based application. The Lucene Java library provides a wide array of classes which allow
customizing the way data is indexed, scored and searched. Some of the key Lucene
classes used in our application are:
21 http://xlinux.nist.gov/dads//HTML/invertedIndex.html
Figure 10 Building a full-featured search application using Lucene. Retrieved August 16, 2011, from: http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/index.html
RAMDirectory
An object of class IndexWriter is used to build a Lucene index. Typically the index is file-
based, but the Lucene API provides support for creating an in-memory index as well. For
file-based indexes, a directory name is passed to the IndexWriter constructor, whereas for
an in-memory index an object of class RAMDirectory is passed to the constructor.
Although the index generated by Lucene can also be stored inside a relational database,
this approach is known to have performance issues, especially in cases where the index is
frequently updated. Therefore, I have used Lucene's RAMDirectory class to maintain an
in-memory index of the keywords associated with the queries.
StandardAnalyzer
Analyzers determine how text is segmented and stored in an index, and at search time
they map query terms to find a match in the index. Lucene provides a number of
different analyzers and also allows custom analyzers to be created for an application.
StandardAnalyzer is one such analyzer; it filters the text by converting it to lower case and
removes stop-words and other characters, such as the dots in acronyms and
apostrophes (').
AnalyzerUtil
This class offers various methods for full-text analysis such as stemming, retrieving
frequently occurring terms etc. One of the methods provided by this class is
getPorterStemmerAnalyzer(), which returns an English stemming analyzer that uses the
Porter stemming algorithm22 to stem tokens from the underlying child analyzer.
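As a minimal sketch of how these classes fit together, assuming the Lucene 3.x API that was current at the time of writing (the field name "keyword" and the indexed phrase are illustrative, not taken from CoKo's source), an in-memory keyword index can be built and searched as follows:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class KeywordIndexSketch {
    public static void main(String[] args) throws Exception {
        // In-memory index: a RAMDirectory is used instead of a file-system directory.
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

        // Index one keyword phrase under the (illustrative) field name "keyword".
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_33, analyzer));
        Document doc = new Document();
        doc.add(new Field("keyword", "Population of Australia",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search the index with the same analyzer; hits are ranked by Lucene's scoring.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
        QueryParser parser = new QueryParser(Version.LUCENE_33, "keyword", analyzer);
        TopDocs hits = searcher.search(parser.parse("population of australia"), 10);
        System.out.println("Matches: " + hits.totalHits);
        searcher.close();
    }
}

To make matching tolerant of different word forms, the StandardAnalyzer above could be wrapped with a stemming analyzer, such as the one returned by AnalyzerUtil.getPorterStemmerAnalyzer(...).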
4.2.3 Data Tier
This tier forms CoKo's knowledge base. It has two components: a triple store, which is
used to store the RDF-based knowledge representations, and an RDBMS, which stores
data descriptions and feedback. A triple store is a special-purpose database designed to
provide persistent storage of and access to RDF graphs via its APIs and query languages.
22 http://tartarus.org/~martin/PorterStemmer/
A unique data id is generated for each dataset uploaded to the system (including
endpoints) and is stored in the RDBMS along with the data descriptions provided by the
curator. Each RDF dataset file uploaded to the system is saved in a new named graph,
whose name is the same as the data id.
Distinctive Technology used: OpenLink Open-Source Virtuoso
There is a wide range of commercial and open-source triple stores available, but for the
purpose of this project OpenLink Virtuoso was used. The open-source OpenLink Virtuoso
is the non-commercial edition of Virtuoso Universal Server23. It is an object-relational
database engine extended into an RDF triple store [51] and provides database
management for RDF, SQL and XML data. It supports the N3/N-Triples and RDF/XML RDF
data serializations. It also supports the SPARQL Query Language, Query Protocol, XML
Query Results Serialization and named graph functionality, and provides a web server to
execute SPARQL queries and upload data over HTTP.
The triples uploaded into the Virtuoso triple store are stored inside a table having four
columns, one for each of GraphID, Subject, Predicate and Object [51]. Each dataset
uploaded into the triple store is assigned a unique graph IRI, which can be used to
execute SPARQL queries over the data in that file. In order to query all the triples in the
triple store, the virt:sponger property needs to be set to "yes" and the rdf:graph property
to the desired Internationalized Resource Identifier (IRI)24; this yields the IRI that can be
used to query all the RDF triples in the triple store.
In addition to storing the RDF datasets in the triple store, Virtuoso's relational database
engine was used to store the data descriptions provided by the curator. The supplied
description file is parsed to retrieve the data, which is then stored in relational database
tables. The data in these tables is then accessed with the help of the Virtuoso JDBC driver.
Since Virtuoso offers SPARQL inside SQL [52], this driver can also be used to execute
SPARQL queries: the SPARQL query is simply prefixed with the SPARQL keyword to
distinguish it from SQL. Internally, SPARQL is translated into SQL at the time of parsing the
query.
23 http://virtuoso.openlinksw.com/
24 http://www.w3.org/International/
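As a rough illustration (not CoKo's actual code), and assuming the Virtuoso JDBC 4 driver class virtuoso.jdbc4.Driver with a default local installation, where the connection details, credentials and graph IRI below are placeholders, a SPARQL query can be issued through plain JDBC simply by prefixing it with the SPARQL keyword:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VirtuosoSparqlSketch {
    public static void main(String[] args) throws Exception {
        // Driver class name and connection details are placeholders for a local Virtuoso install.
        Class.forName("virtuoso.jdbc4.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:virtuoso://localhost:1111", "dba", "dba");

        // "SPARQL inside SQL": the SPARQL keyword at the start tells Virtuoso to
        // translate the rest into SQL internally; the graph IRI is illustrative.
        String sparql = "SPARQL SELECT ?s ?p ?o FROM <urn:coko:dataset:42> "
                + "WHERE { ?s ?p ?o } LIMIT 10";

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(sparql);
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}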
Even though Virtuoso provides free-text indexing capability for text and XML data, it
doesn't support stemming, and therefore Apache Lucene was used to provide full-text
search support with stemming.
Chapter 5
Design Decisions
Several design decisions were made during the implementation of CoKo, in order to
accommodate some of the requirements and to optimize the usability of the system. This
chapter discusses the key decisions and some of the design artifacts produced as a result
of those decisions.
5.1 Data Set Description Language (DSDL)
DSDL is the Data Set Description Language, an XML-based representation language
designed from the ground up to capture metadata about datasets. DSDL was conceived to
serve the following purposes:
i) Provide a standardized format for curators to describe their data.
ii) Collect metadata to support evaluation of the quality and trustworthiness of data
based on its source.
iii) Identify the rightful contributor of the data to enable proper attribution.
iv) Collect some typical queries and related visualizations.
With these purposes in mind, the major challenge was designing a generic format that
allows curators to describe their data. For this, an XML format was devised to capture the
metadata of the datasets, including informational metadata about the data set, like
descriptions of its source and owner, as well as presentational metadata, like the
preferred visualization for a particular query result. The DSDL schema consists of two key
elements: <general>, which encloses elements describing the data set as a whole, and
<presentation>, which encloses elements describing the user-supplied queries and
relevant visualizations of the retrieved results. Table 5 describes the purpose of each
element of the schema; for the full version of the schema refer to Appendix B. DSDL was
designed to concisely capture data pertinent to the data set being published along with
some canonical interesting queries; it is portable and can be reused by any bespoke
application.
Elements of <general>:
<owner>: Encloses descriptive information about the owner of the data set. Includes <name>, <email> and <url> child elements.
<source>: Encloses descriptive information about the source of the data set. Includes <name>, <email> and <url> child elements.
<name>: Identifies the name of the dataset.
<type>: Identifies whether the dataset being uploaded is a file or a reference to a SPARQL endpoint.
<url>: A link to a SPARQL endpoint or to a remote RDF file.
<creator>: Encloses descriptive information about the creator of the data set. Includes <name>, <email> and <url> child elements.
<lastEditor>: Encloses descriptive information about either the last editor or the uploader of the data set. Includes <name>, <email> and <url> child elements.
<date>: Encloses <dateTo> and <dateFrom> elements to represent the validity of the data set (if applicable).
<licenceInfo>: Represents licensing information about the data set.
<description>: Textual description of the data set, i.e. what kind of data is present in the data set.

Elements of <presentation>:
<import-data>: Identifies the name of the graph, if the associated SPARQL query is to be evaluated against more than one data set. In case of an individual query upload, it can be used to provide the name of an existing graph against which the query is to be executed.
<query>: Identifies a SPARQL query. It can be a general SPARQL query or a generic query (explained in the next section).
<description>: Textual description of the SPARQL query, i.e. what information it generates.
<meta-query>: Identifies a metaquery, a special-purpose SPARQL query used to retrieve values to be supplied to generic queries (explained in the next section). It is optional and can be omitted in case of general SPARQL queries.
<keywords>: Identifies keyword-property mappings.
<property>: Identifies the original term used in the vocabulary of the corresponding data set. It is optional and can be omitted in case of general SPARQL queries.
<keyword>: Identifies the keywords associated with the query.
<visualisation> (attribute: rating): Identifies the preferred visualizations for the corresponding SPARQL query's result. The value of the rating attribute is used to determine the order in which the visualizations are presented to the user.

Table 5 Data Set Description Language Elements
The schema of DSDL resembles the DSPL25 language created by Google to process data
for use in the Public Data Explorer26, but it is not as elaborate as DSPL and only captures
metadata sufficient to support provenance, query retrieval and query result rendering.
Even though users are offered rich visualizations in return for uploading their data sets
using DSPL, this is overshadowed by the complexity of describing data in DSPL: even for
relatively small and simple data sets, it demands a lot of explicit description before it is
able to render any visualizations, which can be discouraging for users. Therefore, DSDL
was designed to have a simple format which is easy to comprehend and easy to use.
5.2 Types of SPARQL queries
The initial design of CoKo allowed curators to provide only targeted queries, which merely
provided a one-to-one mapping of keywords to a SPARQL query. For the purpose of
articulation, we will use the following example for this section and the next.
Example 5.2.1: A curator named Bob has a dataset which contains data about countries,
their population and literacy rates. He decides to upload his dataset to CoKo along with a
basic SPARQL query to retrieve the population of "Australia". He uses the following
SPARQL query for this purpose:

SELECT ?population WHERE {
  ?s ns:population ?population ; ns:name "Australia" .
}

In order to submit the above query to CoKo, Bob uses the fragment of DSDL provided in
Listing 1 to describe the query.
25 http://code.google.com/apis/publicdata/
26 http://www.google.com/publicdata/home
Now Bob wants to upload similar queries for all the countries in the dataset, but the
above description with hard-coded queries would be impractical and cumbersome.
To overcome this situation, and to make the description of SPARQL queries and
associated keywords less verbose, two types of queries were introduced:
i) Generic queries
These are provisional SPARQL queries which cannot be directly evaluated by the
SPARQL engine due to the presence of meta-variables. A meta-variable is a special
variable used as a placeholder. These placeholders are supplied with values from the
results of metaquery execution. Once the placeholders are replaced with appropriate
values, these queries are sent to the SPARQL engine for execution.
ii) Metaqueries
These are special-purpose SPARQL queries used to retrieve values for the
placeholders in the generic queries.
Using the aforementioned types of queries, Bob can now provide a data description for a
query covering all countries in his data set. To accomplish this, he can use the fragment of
DSDL provided in Listing 2.
Listing 1. Fragment of DSDL describing a SPARQL query^
…
<query>
  SELECT ?population WHERE {
    ?s ns:population ?population ; ns:name "Australia" .
  }
</query>
<keywords>
  <keyword>Population of Australia</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Listing 2. Fragment of DSDL describing a SPARQL query using generic and meta queries^
…
<query>
  SELECT ?population WHERE {
    ?s ns:population ?population ; ns:name ?o .
    FILTER regex(?o, "{0}", "i")
  }
</query>
<meta-query>
  SELECT ?country WHERE {
    ?s rdfs:label ?country
  }
</meta-query>
<keywords>
  <keyword>Population of {0}</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Figure 11 is a schematic representation of how the meta-variables are replaced with the
results retrieved from execution of a metaquery.
Figure 11 Illustration of meta-variable replacement with the result from metaquery execution
CoKo relies on clarification dialogues with the end user, to determine the appropriate
value to be plugged into the generic query placeholders. Using the value selected by the
user, a generic query is transformed into an executable query. The flow chart in Figure 12
exhibits the flow of operations involved in mapping a user query to a SPARQL query.
(Flow: the user query is matched against keywords in the knowledge base; if the matched
keyword contains meta-variables, the metaquery is executed, the user is prompted to
disambiguate by selecting a value from the retrieved results, and the placeholder is
replaced with the selected value; the resulting SPARQL query is then executed.)
Figure 12 Flow of operations involved in mapping a user query to a SPARQL query
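The placeholder substitution itself is plain string processing. The sketch below (with illustrative names, not CoKo's actual classes) turns a generic query into an executable one once the user has picked a value from the metaquery results; in practice the selected value should also be escaped before being inserted into the query.

public class GenericQuerySketch {
    // Fills the {0} placeholder of a generic query (or keyword) with the value
    // the user selected from the metaquery results.
    static String instantiate(String genericQuery, String userSelectedValue) {
        return genericQuery.replace("{0}", userSelectedValue);
    }

    public static void main(String[] args) {
        String genericQuery =
                "SELECT ?population WHERE { "
              + "?s ns:population ?population ; ns:name ?o . "
              + "FILTER regex(?o, \"{0}\", \"i\") }";
        // Value chosen by the user during the clarification dialogue.
        System.out.println(instantiate(genericQuery, "Australia"));
    }
}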
5.3 Property Mapping
Metaqueries and generic queries enable curators to provide queries covering a wide
range of data, but they were still not sufficient to describe a dataset efficiently: there was
some repetition due to hard-coded property values.
Continuing Example 5.2.1, Bob wants to upload a query to retrieve the literacy rate for
each country, but the queries are syntactically similar, the only difference being the name
of the property.
To overcome the inconvenience of hard-coded property values, we incorporated a new
class of placeholders into the DSDL design. The initial strategy was to ask curators for a
list of vocabulary terms which a query could handle, which would then be normalized into
keywords and stored in the database. For example, terms like based_near27 and
populationTotal28 defined in the DBPedia vocabulary can easily be normalized to "based
near" and "population Total", using string manipulation functions to split the term at
underscores and at uppercase letters respectively. However, during functional testing of
the application, certain vocabularies were found to use terms which cannot be normalized
in this way. For example, the CIA Factbook vocabulary uses terms like
populationgrowthrate29 and internetusers30, which cannot be normalized into meaningful
keywords. Consequently, we introduced a new element <property> in the DSDL to
represent the original term used in the vocabulary, which is mapped to a normalized term
in the keyword element supplied by the curator.
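A minimal sketch of this normalization (splitting at underscores and before uppercase letters) is given below; the class and method names are illustrative rather than taken from CoKo's source.

public class TermNormalizer {
    // Normalizes a vocabulary term into a human-readable keyword by replacing
    // underscores with spaces and inserting a space before each uppercase letter,
    // e.g. "based_near" -> "based near", "populationTotal" -> "population Total".
    static String normalize(String term) {
        String withSpaces = term.replace('_', ' ');
        return withSpaces.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("based_near"));            // based near
        System.out.println(normalize("populationTotal"));       // population Total
        System.out.println(normalize("populationgrowthrate"));  // unchanged: no separators to split on
    }
}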
Building on Example 5.2.1, Bob can now use the XML fragment provided in Listing 3 to
represent a cluster of queries covering all the properties in his data set. This relieves him
of the burden of explicitly creating queries for each property described in his data set.
27 http://mappings.dbpedia.org/index.php/OntologyProperty:Foaf:based_near
28 http://mappings.dbpedia.org/index.php/OntologyProperty:PopulationTotal
29 http://www4.wiwiss.fu-berlin.de/factbook/ns#populationgrowthrate
30 http://www4.wiwiss.fu-berlin.de/factbook/ns#internetusers
Listing 3. Fragment of DSDL describing a SPARQL query using property mappings^
…
<query>
  SELECT ?value WHERE {
    ?s ns:{%property} ?value ; ns:name ?o .
    FILTER regex(?o, "{0}", "i")
  }
</query>
<meta-query>
  SELECT ?country WHERE {
    ?s rdfs:label ?country
  }
</meta-query>
<keywords>
  <property>population</property>
  <keyword>Population of {0}</keyword>
</keywords>
<keywords>
  <property>literacy_rate</property>
  <keyword>Literacy rate of {0}</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Chapter 6
Evaluation
CoKo was developed as a proof-of-concept prototype to assess the potential and
feasibility of a QA system built on top of a collaboratively curated Linked Data knowledge
base. In order to scrutinize the proposed system effectively, two case studies were
devised for a detailed cognitive walkthrough of the application. The focus of these case
studies was to evaluate the functionality of the application and identify its strengths and
weaknesses. This chapter describes the two case studies, followed by a discussion of the
strengths and weaknesses of the application in handling the particular tasks elucidated in
each of them.
6.1 Case Studies
All the functional steps described in these case studies have been recorded and are
available online31.
Case Study 1: Contained dataset
For the purpose of this case study, we will consider a dataset representing information
about Internet usage by rural and urban households in various states of the US, published
by the National Telecommunications and Information Administration (NTIA). The RDF
version of this dataset is available through the Data-gov Wiki32. We can incorporate this
dataset into CoKo's knowledge base so as to provide some typical analysis of the data. We
will use the SPARQL query given in Listing 6.1 a) to compare rural and urban broadband
usage for the various states of the US.
31 http://fishdelish.cs.man.ac.uk:5001/CoKo/evaluation.jsp
32 http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf
The dataset can be integrated into CoKo's knowledge base using any one of the following
techniques:
i) Download the dataset file locally from the Data-gov Wiki and then upload it using
CoKo's upload interface.
ii) Provide the URL of the dataset file in the DSDL file. The corresponding dataset will be
retrieved and loaded into CoKo's knowledge base automatically with the help of the given
URL.
iii) Provide the URL of the Data-gov SPARQL Endpoint33 in the DSDL file and reformulate
the query in Listing 6.1 a) to specify the URI of the dataset in the GRAPH clause
(Appendix C.1).
For the first two techniques, the current system works on the principle of a single upload
without refresh, i.e. the data is stored locally in CoKo's knowledge base and is not checked
for updates. In the case of the third technique, queries are processed in an ad-hoc fashion
and thus any changes to the dataset are automatically reflected in query results.
Therefore, (iii) is more efficient in cases where we want changes in a dynamic data source
to be reflected in query results. However, if the publisher of the dataset doesn't provide a
public SPARQL endpoint, we will have to manually update CoKo's knowledge base each
time the data changes.
To load the dataset we carry out the following steps:
33 http://data-gov.tw.rpi.edu/sparql
Listing 6.1 a)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
  ?s d:state ?state .
  ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
  ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
  FILTER (?state != "TOTAL HOUSEHOLDS")
} ORDER BY ?state
Listing 6.1 b)
<general>
  <owner>
    <name>National Telecommunications and Information Administration</name>
    <url>http://www.ntia.doc.gov/</url>
  </owner>
  <source>
    <name>National Telecommunications and Information Administration survey</name>
    <type>file</type>
    <url>http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf</url>
  </source>
  <creator>
    <name>National Telecommunications and Information Administration</name>
  </creator>
  <lastEditor>
    <name>Priyam</name>
    <email>[email protected]</email>
  </lastEditor>
  <date>
    <from>02/03/2010</from>
  </date>
  <licenceInfo>Open Data</licenceInfo>
  <description>Households using the Internet in and outside the home, by selected characteristics: Total, Urban, Rural, Principal City, 2009</description>
</general>
1. Create the DSDL file using the schema provided in Appendix B. As discussed in
Chapter 5 the dataset description contains two key elements <general> and
<presentation> which we can populate for our dataset as given below (for full
version of DSDL refer to Appendix C.2):
<general> element encloses elements which describe the dataset (Listing 6.1 b).
<owner> and <creator> contain information about NTIA, as it’s the publisher of
the dataset.
<source> element’s child element <name> identifies from where the data was
generated, which in this case is NTIA Survey. Since our dataset is a static dataset,
we can simply upload it by providing the URL of the dataset in <url> and the value
“file” in <type>.
<lastEditor> identifies the uploader of the dataset, which in this case is the author
of this thesis.
<date> contains information about the validity period of a dataset, where the child
element <from> identifies when the dataset came into existence and the element
<to> identifies the date after which the data becomes invalid. As given on the
Data-gov Wiki34 page for this dataset, it was created on "2 March 2010" and, due
to the nature of the data, it will not become invalid. Therefore, the <to> element has
been omitted and the <from> element contains the date of creation of the dataset.
<licenceInfo> contains information about any licence specifications associated
with the dataset.
<description> contains a textual description of the dataset as a whole.
<presentation> element encloses elements which describe the SPARQL query and the
visualization (Listing 6.1 c)
<description> contains textual description of the SPARQL query as a whole
<query> encloses the SPARQL query. Reserved signs like angle brackets must be
escaped.
<keywords> encloses keywords associated with the query
34 http://iw.rpi.edu/wiki/Dataset_10040
Listing 6.1 c)
<presentation>
  <description>Compare rural and urban internet usage for various states in US</description>
  <query>
    PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
    SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
      ?s d:state ?state .
      ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
      ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
      FILTER (?state != "TOTAL HOUSEHOLDS")
    } ORDER BY ?state
  </query>
  <keywords>
    <keyword>rural vs urban Broadband Internet Use</keyword>
  </keywords>
  <visualisation rating="10">ColumnChart</visualisation>
</presentation>
<visualisation> provides the name of the visualization and its rating on a scale of
1-10 (10 being the highest). Since we are comparing two sets of values, an
appropriate visualization is a Column Chart. Care needs to be taken about the
sequence of output variables while formulating the query: the Google Visualization
API expects a particular sequence for the types of values it accepts (refer to
Appendix D), and CoKo is unable to automatically modify the sequence of variables
in the SPARQL query result so as to satisfy the requirements of a particular
visualization.
2. Upload the DSDL file created in the above step through CoKo's upload interface. After
uploading the file, a unique URI is generated for our SPARQL query, which can be used for
remote SPARQL query execution, thus enabling us to reuse the query.
By creating a file spanning ~40 lines of text and following the above two-step process, we
were able to share the dataset, along with a possible visualization of the result, reasonably
easily.
Once the DSDL file is uploaded, the dataset is immediately available for analysis. We can
access CoKo's search interface and analyze the results of our SPARQL query by entering
the keyword-based query "Rural vs Urban broadband internet use". The result interface
displays the result of the SPARQL query on a bar chart, along with the information about
the query and the dataset supplied by us in DSDL. We can easily provide an additional
visualization for the result by clicking the "Edit Visualization" button on the result
interface, which allows us to make the following changes related to the visualization:
i) Provide a new visualization
ii) Update the rating of an existing visualization
iii) Eliminate an existing visualization
We can also share the visualization through a website or a blog by copying the "Embed
visualization" URL available on the result interface. Query results are also available in
XML format and can be downloaded. We can transform this XML result back to RDF
format and upload it to CoKo, thus enriching the knowledge base with data derived from
data.
Listing 6.2 b)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state WHERE {
  ?s d:state ?state .
  FILTER (?state != "TOTAL HOUSEHOLDS") .
} ORDER BY ?state
Furthermore, if we reformulate the SPARQL query given in Listing 6.1 a) to use CoKo's
generic SPARQL query provision, we will be able to compare the rural and urban
broadband usage for each individual state rather than for all states at once.
CoKo doesn't allow us to modify an existing query, since this might render the query
unusable for its original publisher. For example, if we were allowed to update the original
query in this case, it could no longer be used as a remote query, due to the presence of
meta-variables. Therefore, we can only add new queries to the system.
We can upload the additional query as follows:
1. Identify the dataset id of the existing dataset using CoKo's data interface, which
provides an overview of all datasets available in CoKo's knowledge base. The dataset
id is a unique id generated by CoKo for each dataset or endpoint uploaded to its
knowledge base. This id can also be used in the GRAPH clause of SPARQL queries to
refer to the datasets.
2. Formulate a SPARQL query using meta-variables and a supporting metaquery to
retrieve values for meta-variables.
Listing 6.2 a)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
  ?s d:state ?state .
  ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
  ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
  FILTER (?state != "TOTAL HOUSEHOLDS") .
  FILTER regex(?state, "{0}", "i") .
} ORDER BY ?state
3. Upload the query using a truncated version of the previous DSDL, containing
descriptions only in <presentation> element (refer to Appendix C.3). The data id
from step 1 is given in <import-data> element, which automatically links the query
to the existing dataset. Thus, we do not have to provide details about the dataset
again.
Now when we enter the query "Rural and Urban Broadband Internet Use in each state"
in CoKo's search interface, we are prompted to select the name of the state for which we
want to view the results. We can either select a particular state or select the value "All",
in which case the same result as that of the previous query (Listing 6.1 a) is displayed.
CoKo's search mechanism is keyword-sensitive and it doesn't support synonym search.
Therefore, the user query plays a central role in finding a keyword match in the
knowledge base. Although the search is lenient about misspelled words, one or more
keywords in the user query should exactly match those present in the knowledge base. In
case more than one match is found in the knowledge base for a user query, it displays the
results of the SPARQL query associated with the keyword having the highest score35, and
a ranked list of links to other query results is displayed at the bottom of the result
interface.
Case Study 2: Distributed Datasets
For this case study we will start by uploading a query to retrieve “female literacy rates”
(Listing 6.3 a) for countries represented in the CIA Factbook dataset36.
35 based on Lucene Scoring (http://lucene.apache.org/java/3_3_0/scoring.html)
36 http://www4.wiwiss.fu-berlin.de/factbook/
Listing 6.3 a)
PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
SELECT ?female_literacy_rate ?country WHERE {
  ?s ns:literacy_female ?female_literacy_rate ; ns:name ?country
} ORDER BY ?country
The steps for creating and uploading the DSDL file are the same as those described for the
previous case study, the only difference being in the <source> element (Listing 6.3 b). We
provide the value "endpoint" for the <type> element, and the <url> element contains the
URL of the CIA Factbook endpoint37. For the full version of the DSDL please refer to
Appendix C.4.
The SPARQL query given in Listing 6.3 a) can be expanded with the help of property
mappings (elucidated in Chapter 5) to retrieve values for other similar properties
described in the CIA Factbook (e.g. male literacy rate, total literacy rate, population etc.).
In order to expand the scope of the SPARQL query, we load it using the DSDL fragment
given in Appendix C.5. We can similarly provide property mappings for other properties
described in the dataset.
Next, we will upload the following query (Listing 6.4 a), which retrieves the female literacy
rates for each country given in CIA Factbook, along with the URL of that country’s
Wikipedia page.
37 http://www4.wiwiss.fu-berlin.de/factbook/sparql
Listing 6.4 a)
PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?country ?female_literacy_rate ?wikiPage WHERE {
  ?s ns:literacy_female ?female_literacy_rate .
  ?s ns:name ?country .
  ?DBcountry a dbpedia:Country .
  ?DBcountry owl:sameAs ?s .
  ?DBcountry foaf:page ?wikiPage
} ORDER BY ?country
Listing 6.3 b)
<source>
  <name>CIA factbook</name>
  <type>endpoint</type>
  <url>http://www4.wiwiss.fu-berlin.de/factbook/sparql</url>
</source>
This query executes over the union of the CIA Factbook and DBPedia38 graphs. Therefore,
we use the Linked Open Data Cloud Cache Endpoint39 to execute the query.
The current system mandates that a query be associated with at least one dataset, which
the system can identify as the master dataset for that query. The source information
associated with this master dataset is used to determine the data source against which
the query will be executed (CoKo or a remote endpoint). For the query given in
Listing 6.4 a) we can identify either CIA Factbook or DBPedia as the master dataset and
provide the dataset ID of the other dataset in the <import-data> element. Since we have
already loaded the CIA Factbook, let us use DBPedia as the master dataset in this case.
The DSDL file can be generated in a similar fashion to the previous datasets (for the full
version of the DSDL refer to Appendix C.6). The endpoint to which the query is sent for
execution is determined by the value of the source URL, which is linked with the dataset
description rather than the SPARQL query description in the DSDL. Therefore, we have to
give the Linked Open Data Cloud Cache Endpoint, and not the DBPedia endpoint, as the
source URL (Listing 6.4 b).
6.2 Overall Evaluation
The above two case studies enabled us to identify the following key strengths and
weaknesses of the current design of the application.
Strengths
Simplicity
Ease of uploading and sharing a dataset along with some useful queries to analyze
the dataset.
38 http://dbpedia.org
39 http://lod.openlinksw.com/sparql
Listing 6.4 b)
<source>
  <name>DBPedia</name>
  <type>endpoint</type>
  <url>http://lod.openlinksw.com/sparql</url>
</source>
Endpoint wrapping
Endpoint wrapping enables the system to seamlessly pull in and utilize data from huge
and dynamic data sources without having to replicate the data.
SPARQL query aggregation
Enables aggregation of useful SPARQL queries, which can be reused with the help
of the unique query URI.
Scope of queries expanded
With the help of generic queries and property mappings, curators are able to
expand the scope of their queries.
Weaknesses
Keyword sensitivity
In order to find a match, user queries must contain at least one term exactly as
contained in CoKo's knowledge base. Additionally, there is no support for synonym
search.
Visualization mapped only to query
Currently a visualization is only mapped to a query, and therefore the curator needs to
be careful about the output format while formulating a query.
Endpoint information mapped only to dataset
Potentially the URL of the endpoint should be associated not only with the dataset
but also with the SPARQL query. This would allow the query to be evaluated against
an endpoint other than that of the dataset.
No support for remote dataset refresh
Due to the lack of support for remote dataset refresh, data publishers with dynamic
datasets and no public endpoint have to manually push a fresh dataset into CoKo's
knowledge base each time the dataset is updated.
6.3 Contender for Semantic Web Challenge
In order to be accepted as a valid entry to the Semantic Web Challenge, an application
needs to at least meet the minimal requirements defined by the organizers. It is evaluated
by the judges of the competition for fulfillment of the requirements before being
accepted as a contender for the Challenge, and therefore CoKo was designed around
these requirements. It is still a prototype and therefore has not been specifically
evaluated against the competition criteria. With the help of Table 6 and Table 7, we can
theoretically analyze its progress towards accomplishing the requirements for the
competition, as given in Section 2.1 of this thesis.
Minimal requirements

Requirement 1: End-user application.
Progress: It is an end-user application which provides practical value not only to casual Web users but also to domain experts. Casual Web users can query the large, user-curated knowledge base to obtain concise answers accompanied by rich, intuitive and meaningful visualizations. Domain experts can upload datasets from their domain of expertise in order to utilize the application's reasoning and inference capabilities to obtain answers from their domain of interest.

Requirement 2: Information sources.
Progress: The application's knowledge base is built and maintained in a collaborative environment. Users are the creators and curators of information; therefore it is under diverse ownership and control and is highly heterogeneous. The application will contain real-world data, assuming users upload a substantial amount of real-world data in order to obtain useful answers to their questions.

Requirement 3: Meaning of data. The meaning of data plays a central role in deriving answers to user questions about implicit knowledge.
Progress: Data is processed in order to answer user questions about implicit knowledge. Without semantic processing of information, certain user questions would be impossible to answer (as explained in the Introduction section). Semantic processing is also important to generate appropriate and meaningful visualizations.

Table 6 Progress towards fulfilling minimal requirements of the SWC
Additional desirable features
The design of the application doesn't meet all the additional desirable requirements at
present, but due to its extensible nature it can be extended to meet further requirements
as well.
Requirement 1: User interface.
Progress: The application is designed to have a simple search box for query input and a highly intuitive result interface integrated with rich, meaningful visualization of the result.

Requirement 2: Innovative use of semantic technology.
Progress: The application is designed to provide an end-to-end solution from Linked Data publishing to making it useful for a casual web user. It provides a suite of different uses which have previously been pursued individually, but not integrated together into a single application.

Requirement 3: Functionality.
Progress: The functionality of the application goes beyond information retrieval, by presenting the results in a useful format with the help of rich visualisation techniques.

Requirement 4: Use of dynamic and static data.
Progress: Users are allowed to publish their local static RDF dataset files as well as the URL of an endpoint, which allows interaction with a dynamic dataset.

Requirement 5: Contextual information.
Progress: All the answers are provided with their source; therefore, users can decide the trustworthiness of the sources. Users are also encouraged to rate the visualizations returned in response to their queries.

Table 7 Progress towards fulfilling additional requirements of the SWC
Chapter 7
Conclusion and Future Work
Although the design and approach described in this thesis are still far from realizing a completely automated collaborative QA system, they provide the groundwork for future extensions to the system. This chapter presents a reflection on the current state of CoKo, followed by a discussion of suggestions for a future attempt at the system.
7.1 Reflection
A tremendous amount of data is now being published as structured linked data on the web, but merely publishing data as linked data serves very little purpose in realizing its real worth. For general web users this data is far from usable and analyzable. CoKo is an end-to-end system for sharing and curating linked data in a collaborative QA setup. It is a step towards making linked data sources more accessible to general web users by offering an easy-to-use interface for querying these data sources and providing rich, intelligible visualization of the data. Figure 13 contrasts a typical linked dataset file with the visualization produced by CoKo, to assist the end user in analyzing the data in the dataset.
[Figure 13 From pile of triples to an intelligible interface: a raw dataset file (“pile of triples”) shown alongside the CoKo result interface]
CoKo is available online40 and at the time of writing this thesis it could answer over 20 questions, based on 5 datasets and 3 endpoints which were fed to the system during the evaluation and functional testing phase. This milestone was achieved with a setup and development cycle spanning three months. The major overhead of the setup and implementation phase was the difficulty of understanding the workings of the Virtuoso server, due to the lack of clarity in its documentation. After the initial challenges with setup and design, the development of the application was relatively straightforward. Java was chosen as the programming language because of the author’s familiarity with it, but any other server-side language would have worked equally well for the application. Semanticweb.com41 is an active forum of Semantic Web experts and proved to be a helpful resource.
Two objectives were defined in the introduction section of this thesis, and Table 8 describes to what extent these objectives have been achieved.

Technical objective: An end-to-end system for sharing and curating linked data in a collaborative environment.
Progress: Evaluation showed that our approach successfully supports publishers in sharing a small dataset along with some useful queries. It is also easy for other users to give the system feedback about the results and to add new queries for existing data.
Scope for improvement: Even for large datasets, the initial publishing of a dataset to CoKo’s KB is easy, but how to reduce the burden of curation and query generation on the curator is still an unsolved problem.

Functional objective: An automatic open-domain question answering system.
Progress: The QA system is not yet mature, but with a reasonable amount of data description provided by a curator it is able to provide answers for keyword-based user queries.
Scope for improvement: The QA system can be extended to improve the GUI and NLP support.

Table 8 Progress of CoKo towards meeting its objectives
40 http://fishdelish.cs.man.ac.uk:5001/CoKo
41 http://answers.semanticweb.com/
7.2 Problems which still need to be solved
How to handle duplicate data upload?
Although the developed prototype is strict about duplicate query uploads and ensures that there are no syntactically duplicate SPARQL queries, it is tolerant of duplicate data uploads: it does not check for duplicates and loads every dataset into a new named graph.
Duplicate data not only increases the cost of storing the data, it also increases query processing time and hampers the curation process. Duplicate data can enter the system either because two individuals upload the same dataset, or because two datasets describe identical entities with different identifiers. For example, DBpedia and the CIA Factbook use different URIs for the same country [53]. A standard way to reconcile such identical entries is the owl:sameAs property, but this has not been explored in the current implementation.
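As an illustration only (and not part of the current implementation), the following minimal sketch shows how such a reconciliation link could be asserted, assuming the Apache Jena library were available; the two resource URIs are hypothetical examples standing in for the same country as published by two different datasets.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.OWL;

public class SameAsLink {
    public static void main(String[] args) {
        Model links = ModelFactory.createDefaultModel();

        // Hypothetical URIs for the same country in two uploaded datasets
        Resource dbpediaIndia  = links.createResource("http://dbpedia.org/resource/India");
        Resource factbookIndia = links.createResource("http://example.org/factbook/India");

        // Assert that both URIs denote the same real-world entity
        links.add(dbpediaIndia, OWL.sameAs, factbookIndia);

        // The resulting triples could be kept in a dedicated "reconciliation"
        // named graph, so that queries may optionally follow owl:sameAs links
        links.write(System.out, "TURTLE");
    }
}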
Is the system scalable?
The scalability of the current system has not been evaluated and should be considered in any future attempt. The web application needs to be scalable not only in the number of concurrent connections it can support but also in the amount of data it can process and store.
Virtuoso seems to be a good choice as a triple store and has been shown to scale well with large datasets. The Data.gov SPARQL endpoint42, which stores over 6 billion triples on a single open-source Virtuoso instance [54], is a good illustration of its scalability. Setting up a Virtuoso server does, however, require considerable effort because of the lack of clarity in its documentation. The W3C Wiki43 provides a list of large-scale triple stores and also discusses their potential scalability.
42 http://services.data.gov/sparql
43 http://www.w3.org/wiki/LargeTripleStores
How to improve DSDL?
The current version of DSDL has a fairly minimalist schema, focused on collecting enough metadata to capture provenance and to support query retrieval and query result rendering. This schema is expected to evolve in future implementations to further support the task of curation. During the evaluation phase, two issues were identified with the current format of DSDL (refer to Section 6.2 of this thesis):
Endpoint information mapped only with dataset.
The endpoint information should potentially be associated not only with the dataset but also with the SPARQL query.
Visualization mapped only with query.
Curators should be able to specify an alternative ordering of SPARQL query result variables for different visualizations.
7.3 Suggestions for the future
The design, development and evaluation of this first prototype of CoKo has uncovered some interesting points about the application which are worth recording in this thesis. These are features which were envisioned during the development and evaluation process but could not be implemented due to the limitation of time. Some are essential features that should be considered in any future attempt, while others are good-to-have extensions which would help to improve the user experience.
7.3.1 Critical
Minimize curation effort
Meta queries and generic queries were implemented to support the curators in the task of describing large datasets, and this has indeed lowered the burden on the curators. Some other designs were also considered before generic queries were implemented.
Additional Dispatchers: A dispatcher is essentially a function which takes user input and returns an appropriate query result and visualization. Currently, the system uses a single bespoke dispatcher, keyword search, which has been enhanced with stemming, metaqueries and generic queries. Another design which was considered was to allow curators to associate the main SPARQL query with another SPARQL query, a keyword-SPARQL query, instead of textual keywords. These keyword-SPARQL queries, together with the main SPARQL query, form a new dispatcher. The keyword-SPARQL queries would be syntactically similar to the generic queries of the current design and would accept key terms from the user query as input for the placeholders. If the execution of a keyword-SPARQL query returns a result, the associated main SPARQL query is retrieved and executed, as illustrated in Figure 14. Once this is implemented, a keyword search in CoKo’s KB would be followed by the execution of a list of dispatchers to check whether any of them return results. Such dispatchers could also be built into the system for key sites like DBpedia, Data.gov etc. Care needs to be taken that a badly behaved dispatcher does not break the application, so a time-out should be set for each dispatcher.
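The following is a minimal sketch of how such a list of dispatchers could be executed with per-dispatcher time-outs, assuming plain Java (8 or later); the Dispatcher interface and DispatchResult class are hypothetical names and are not part of the current CoKo code base.

import java.util.List;
import java.util.concurrent.*;

interface Dispatcher {
    // Returns a result (query output plus visualisation hint) or null if no match
    DispatchResult dispatch(String userQuery) throws Exception;
}

class DispatchResult { /* SPARQL result set and chosen visualisation */ }

class DispatcherRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    DispatchResult run(List<Dispatcher> dispatchers, String userQuery, long timeoutMillis) {
        for (Dispatcher d : dispatchers) {
            Future<DispatchResult> f = pool.submit(() -> d.dispatch(userQuery));
            try {
                DispatchResult r = f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                if (r != null) {
                    return r;           // first dispatcher that answers wins
                }
            } catch (TimeoutException e) {
                f.cancel(true);         // a slow dispatcher must not block the application
            } catch (Exception e) {
                // a failing dispatcher is skipped rather than propagated to the user
            }
        }
        return null;                    // no dispatcher could answer the query
    }
}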
Extract properties from SPARQL query. A SPARQL query pattern typically contains references to property terms. If a user query contains a keyword which matches one of these property terms, then that SPARQL query becomes a potential match for the user query. This would reduce the burden on curators, as they would no longer have to add as many keywords.
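A rough sketch of this idea is given below, assuming Apache Jena ARQ were used to parse the stored queries; PropertyTermExtractor is a hypothetical helper written for illustration, not existing CoKo code.

import java.util.HashSet;
import java.util.Set;

import org.apache.jena.graph.Node;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.syntax.ElementPathBlock;
import org.apache.jena.sparql.syntax.ElementVisitorBase;
import org.apache.jena.sparql.syntax.ElementWalker;

public class PropertyTermExtractor {

    /** Collects the local names of predicate URIs used in the query's graph pattern. */
    public static Set<String> propertyTerms(String sparql) {
        Query query = QueryFactory.create(sparql);
        Set<String> terms = new HashSet<>();
        ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
            @Override
            public void visit(ElementPathBlock block) {
                block.patternElts().forEachRemaining(tp -> {
                    Node p = tp.getPredicate();          // null for complex property paths
                    if (p != null && p.isURI()) {
                        terms.add(p.getLocalName());     // e.g. "literacy_female"
                    }
                });
            }
        });
        return terms;
    }
}

The extracted terms (for example "literacy_female") could then be matched against stemmed keywords from the user query in the same way as curator-supplied keywords.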
Automatic query generation. The current implementation only allows hand-built SPARQL queries. An additional direction this research could take is ontology-based automatic query generation. Systems like Quelo [55] and Ginseng [29] (discussed in the background research section of this thesis) exemplify this approach; they support guided construction of queries based on feedback. It is, however, unclear how well this approach would perform over an arbitrary ontology.
[Figure 14 Additional dispatcher: the current dispatcher matches a user query against keywords, metaqueries and generic queries, while the additional dispatcher matches it against keyword-SPARQL queries; each path produces a SPARQL result and a visualization]
Chain of transformations
Abstractly, we can consider the query workflow from an end-user query to an output view as given in Figure 15. An end-user query is mapped to a SPARQL query (or a set of queries), which in turn is mapped to an output view. To bridge the gap between the SPARQL query and the output view, the current system employs only a single transformation, a built-in transformation to the Google JSON format.
This bridge between the SPARQL query and the target output view could be extended to allow curators to specify multiple transformations, the mashing-up of results from an additional query execution, or direct rendering of results in tabular format. Such a user-insertable chain of transformations and massaging of SPARQL query results would allow curators to create more sophisticated and compelling views of the query results.
[Figure 15 Abstract view of query workflow: a user query is mapped to a SPARQL query, whose result may pass through multiple transformations, direct tabular rendering, or an additional SPARQL query before being mapped to the output view]
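A minimal sketch of what such a chain could look like is given below; the Transformation interface and ResultTable class are hypothetical names used purely for illustration of the design, not an existing part of the system.

import java.util.Arrays;
import java.util.List;

interface Transformation {
    ResultTable apply(ResultTable input);
}

class ResultTable { /* tabular SPARQL result (variable bindings) */ }

class TransformationChain {
    private final List<Transformation> steps;

    TransformationChain(Transformation... steps) {
        this.steps = Arrays.asList(steps);
    }

    /** Applies each curator-specified step in order, e.g. filter the result,
     *  join it with an additional query result, then serialise it for the view. */
    ResultTable run(ResultTable sparqlResult) {
        ResultTable current = sparqlResult;
        for (Transformation step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}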
Check for data updates and refresh datasets
As discussed in the evaluation section, because there is no support for remote dataset refresh, data publishers with dynamic datasets and no public endpoint have to manually push a fresh dataset into CoKo’s knowledge base each time the dataset is updated. To avoid this constant manual effort, a generic data refresh for remote datasets could be implemented. A generic refresh would pull fresh data into the system after a definite period of time. This can either be a constant period (such as every 24 hours) or can be based on the HTTP header information, which can be used to estimate a plausible expiration time for the data, after which it should be refreshed. If the data changes too much, however, the stored queries may stop working, so an elaborate verification procedure needs to be in place. Data could also be stored as versions. If a user uploads a local file, it is reasonable to assume that they want that particular version of the file.
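The sketch below illustrates one way the refresh delay could be estimated from HTTP headers, assuming plain Java; the class and the dataset URL are hypothetical, and the verification step described above is deliberately omitted.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class RefreshPlanner {

    static final long DEFAULT_REFRESH_MILLIS = TimeUnit.HOURS.toMillis(24);

    /** Returns the delay in milliseconds after which the dataset should be re-fetched. */
    public static long refreshDelay(String datasetUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(datasetUrl).openConnection();
        conn.setRequestMethod("HEAD");              // only the headers are needed
        conn.connect();

        long expires = conn.getExpiration();        // from the Expires header, 0 if absent
        long now = System.currentTimeMillis();
        conn.disconnect();

        if (expires > now) {
            return expires - now;                   // trust the publisher's expiry hint
        }
        return DEFAULT_REFRESH_MILLIS;              // fall back to a fixed 24-hour period
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical remote dataset URL
        long delay = refreshDelay("http://example.org/data/dataset.rdf");
        System.out.println("Next refresh in " + delay + " ms");
    }
}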
Trust and Authentication
An application based on an open, community-curated knowledge base is constantly under threat from spoofing, poor data quality and lack of trust. One way to minimize these threats is to integrate a trust-based authentication system into the application. A trust metric like Advogato44 could be used for this purpose. Advogato’s trust metric [56] is based on network flow and automatically calculates trust for an individual in a network from the ratings provided by other individuals who have a rating higher than the individual being rated. This metric has been proven to be attack resistant and is easy to integrate and implement.
44 http://www.advogato.org/
7.3.2 Other extensions
Public API for search and transformation
The search and transformation API implemented for the system can be extended to provide external access, so that it can be used in other applications and to create mashups with data from CoKo’s knowledge base. This would further increase the usability and accessibility of the existing data.
Richer user queries
The current system only supports keyword-based user queries, with no semantic analysis. It could be extended to support richer queries such as phrase queries, and to expand user queries to search for synonyms. Ontology-based keyword expansion could also be used to improve the search mechanism.
Internal Curation
As described in the introduction section of this thesis, TrueKnowledge employs an internal curation mechanism which assesses any new entity with the help of an inferencing system and rejects it if it contradicts an existing entity. A similar mechanism could be pursued in future extensions of CoKo.
Remote endpoint caching
Once a SPARQL query has been sent to a remote endpoint for evaluation, the retrieved results should be cached in memory to maximize the performance of the system. Maintaining a cache would reduce redundant calls to the remote endpoint on each subsequent execution of a query and would thus decrease the response time for the end user.
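A minimal sketch of such a cache is given below, assuming plain Java; EndpointResultCache and CachedResult are hypothetical names, and the size-bound LRU eviction policy is only one of several reasonable choices (an expiry time per entry would be another).

import java.util.LinkedHashMap;
import java.util.Map;

class CachedResult { /* serialised SPARQL result, e.g. the Google JSON table */ }

class EndpointResultCache {
    private static final int MAX_ENTRIES = 500;

    // an access-ordered LinkedHashMap gives simple LRU behaviour
    private final Map<String, CachedResult> cache =
            new LinkedHashMap<String, CachedResult>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, CachedResult> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    private String key(String endpointUrl, String sparql) {
        return endpointUrl + "\n" + sparql;
    }

    public synchronized CachedResult get(String endpointUrl, String sparql) {
        return cache.get(key(endpointUrl, sparql));
    }

    public synchronized void put(String endpointUrl, String sparql, CachedResult result) {
        cache.put(key(endpointUrl, sparql), result);
    }
}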
Keyword-Operation mapping
The primary focus of the current implementation is on fixed uploads, unlike WolframAlpha, which allows users to query for “Population of China + Population of India” or “male population of China divided by the total population of China”. WolframAlpha understands the operators “+” and “divided” and provides appropriate results accordingly. Keywords could be associated with different kinds of operations, rather than merely with the queries built into the dataset description.
Keyword-Visualization mapping and automatic generation of visualization
When the system receives a user query such as “Map of earthquakes”, it should not only search for queries tagged with the keyword “earthquake” but should also understand the term “Map” and present the query result on a map-based visualization.
In [57] the author proposes an ontology-based approach to the automatic generation of charts from SPARQL queries. This approach could be pursued in future implementations to suggest alternative visualizations to the user for a query result.
References
[1] F. Van Harmelen and G. Antoniou, A Semantic Web Primer, 2nd ed. Cambridge, MA,
USA: The MIT Press.
[2] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker, Searching and
Browsing Linked Data with SWSE: the Semantic Web Search Engine. Technical Report
DERI-TR-2010-07-23, 2010.
[3] J. Howe, Crowdsourcing: How the Power of the Crowd is Driving the Future of
Business. Random House Business, 2009.
[4] L. Hirschman and R. Gaizauskas, “Natural language question answering: the view
from here,” Natural Language Engineering, vol. 7, no. 4, Feb. 2002.
[5] S. J. Athenikos and H. Han, “Biomedical question answering: A survey,” Computer
methods and programs in biomedicine, vol. 99, no. 1, pp. 1–24, 2010.
[6] C. Kwok, O. Etzioni, and D. S. Weld, “Scaling question answering to the web,” ACM
Transactions on Information Systems (TOIS), vol. 19, pp. 242–262, Jul. 2001.
[7] M. R. Kangavari, S. Ghandchi, and M. Golpour, “Information Retrieval: Improving
Question Answering Systems by Query Reformulation and Answer Validation.”
[8] H. Xu, “Interview on Wolfram|Alpha, a Computational Knowledge Engine,” InfoQ, 30-
Jul-2009. [Online]. Available: http://www.infoq.com/. [Accessed: 02-Mar-2011].
[9] N. Spivack, “Wolfram Alpha Computes Answers To Factual Questions. This Is Going To
Be Big.,” TechCrunch, 08-Mar-2009. [Online]. Available: http://techcrunch.com/.
[Accessed: 02-Mar-2011].
[10] “True Knowledge,” Wikipedia. [Online]. Available:
http://en.wikipedia.org/wiki/True_Knowledge. [Accessed: 29-Mar-2011].
[11] m c schraefel, N. R. Shadbolt, N. Gibbins, S. Harris, and H. Glaser, “CS AKTive
space: representing computer science in the semantic web,” in Proceedings of the
13th international conference on World Wide Web, New York, NY, USA, 2004, pp.
384–392.
[12] G. Schreiber et al., “Semantic annotation and search of cultural-heritage
collections: The MultimediaN E-Culture demonstrator,” Web Semantics: Science,
Services and Agents on the World Wide Web, vol. 6, no. 4, pp. 243-249, Nov. 2008.
[13] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “ArnetMiner: extraction and
mining of academic social networks,” in Proceeding of the 14th ACM SIGKDD
international conference on Knowledge discovery and data mining, New York, NY,
USA, 2008, pp. 990–998.
[14] C. Torniai, “Semantic Web for the masses - Part II,” 25-Jul-2009. [Online].
Available: http://blog.carlotorniai.net/semantic-web-for-the-masses-part-ii/.
[Accessed: 19-Apr-2011].
[15] A. Harth and P. Buitelaar, “Exploring Semantic Web Datasets with VisiNav,”
presented at the The 6th Annual European Semantic Web Conference (ESWC2009),
Heraklion, Greece, 2009.
[16] V. Lopez, A. Nikolov, M. Sabou, V. Uren, E. Motta, and M. d’ Aquin, “Scaling Up
Question-Answering to Linked Data,” in Knowledge Engineering and Management by
the Masses, vol. 6317, P. Cimiano and H. S. Pinto, Eds. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2010, pp. 193-210.
[17] E. Rajabi and M. Kahani, “Designing a Step-by-Step User Interface for Finding
Provenance Information over Linked Data,” Web Engineering, pp. 403–406, 2011.
[18] M. Hearst, A. Elliott, J. English, R. Sinha, K. Swearingen, and K.-P. Yee, “Finding the
flow in web site search,” Commun. ACM, vol. 45, no. 9, pp. 42–49, Sep. 2002.
[19] M. C. Schraefel, M. Wilson, A. Russell, and D. A. Smith, “mSpace: Improving
information access to multimedia domains with multimodal exploratory search,”
Commun. ACM, vol. 49, pp. 47-49, 2006.
[20] K.-P. Yee, K. Swearingen, K. Li, and M. Hearst, “Faceted metadata for image
search and browsing,” in Proceedings of the conference on Human factors in
computing systems - CHI ’03, Ft. Lauderdale, Florida, USA, 2003, p. 401.
[21] D. Huynh and D. Karger, “Parallax and companion: Set-based browsing for the
data web,” in Proceedings of 18th International World Wide Web Conference, 2009.
[22] G. Kobilarov and I. Dickinson, “Humboldt: exploring linked data,” context, vol. 6,
p. 7, 2008.
[23] T. Berners-Lee et al., “Tabulator: Exploring and analyzing linked data on the
semantic web,” in Proceedings of the 3rd International Semantic Web User Interaction
Workshop, 2006, vol. 2006.
[24] R. García, J. M. Brunetti, A. López-Muzás, J. M. Gimeno, and R. Gil, “Publishing and
interacting with linked data,” in Proceedings of the International Conference on Web
Intelligence, Mining and Semantics, 2011, p. 18.
[25] E. Turner, A. Hinze, and S. Jones, “A review of user interface adaption in current
semantic web browsers,” 2011.
[26] J. Lehmann and L. Bühmann, “AutoSPARQL: Let Users Query Your Knowledge
Base,” in The Semantic Web: Research and Applications, vol. 6643, G. Antoniou et al.,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 63-79.
[27] S. Auer and J. Lehmann, “What Have Innsbruck and Leipzig in Common?
Extracting Semantics from Wiki Content,” in The Semantic Web: Research and
Applications, vol. 4519, E. Franconi, M. Kifer, and W. May, Eds. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2007, pp. 503-517.
[28] V. Lopez, E. Motta, and V. Uren, “Poweraqua: Fishing the semantic web,” The
Semantic Web: Research and Applications, pp. 393–410, 2006.
[29] A. Bernstein, E. Kaufmann, and C. Kaiser, “Querying the semantic web with
ginseng: A guided input natural language search engine,” in 15th Workshop on
Information Technologies and Systems, Las Vegas, NV, 2005, pp. 112–126.
[30] T. Gruber, “Collective knowledge systems: Where the Social Web meets the
Semantic Web,” Web Semantics: Science, Services and Agents on the World Wide
Web, vol. 6, no. 1, pp. 4-13, Feb. 2008.
[31] M. Richardson and P. Domingos, “Building large knowledge bases by mass
collaboration,” in Proceedings of the 2nd international conference on Knowledge
capture, New York, NY, USA, 2003, pp. 129–137.
[32] Y. L. Simmhan, B. Plale, and D. Gannon, “A Survey of Data Provenance
Techniques,” 2005.
[33] C. Bizer and R. Oldakowski, “Using context-and content-based trust policies on
the semantic web,” in Proceedings of the 13th international World Wide Web
conference on Alternate track papers & posters, 2004, pp. 228–229.
[34] A. Gil and V. Ratnakar, “Trusting information sources one citizen at a time,”
in Proceedings of the First International Semantic Web Conference (ISWC), Sardinia, 2002, pp. 162-176.
[35] J. Cheney, L. Chiticariu, and W.-C. Tan, “Provenance in Databases: Why, How, and
Where,” Foundations and Trends in Databases, vol. 1, no. 4, pp. 379-474, 2007.
[36] P. Buneman, S. Khanna, and T. Wang-Chiew, “Why and where: A characterization
of data provenance,” Database Theory—ICDT 2001, pp. 316–330, 2001.
[37] T. J. Green, G. Karvounarakis, and V. Tannen, “Provenance semirings,” in
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on
Principles of database systems, New York, NY, USA, 2007, pp. 31–40.
[38] P. P. Silva, D. L. McGuinness, and R. McCool, “Knowledge provenance
infrastructure,” IEEE Data Eng. Bull., vol. 26, no. 4, pp. 26–32, 2003.
[39] T. Omitola, C. Gutteridge, I. Millard, H. Glaser, N. Gibbins, and N. Shadbolt,
“Tracing the Provenance of Linked Data using voiD,” 2011.
[40] J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler, “Named graphs, provenance and
trust,” in Proceedings of the 14th international conference on World Wide Web, New
York, NY, USA, 2005, pp. 613–622.
[41] E. R. Watkins and D. A. Nicole, “Named Graphs as a Mechanism for Reasoning
about Provenance,” 18-Jan-2006. [Online]. Available:
http://eprints.ecs.soton.ac.uk/11935/. [Accessed: 08-May-2011].
[42] E. Dumbill, Tracking Provenance of RDF Data. 2003.
[43] R. Macgregor and I.-young Ko, “Representing Contextualized Data using Semantic
Web Tools,” in Practical and Scalable Semantic Systems (Workshop at the 2nd ISWC), 2003.
[44] G. Flouris, I. Fundulaki, P. Pediaditis, Y. Theoharis, and V. Christophides,
“Capturing Provenance of RDF Triples through Colors.”
[45] O. Hartig and J. Zhao, “Publishing and consuming provenance metadata on the
web of linked data,” Provenance and Annotation of Data and Processes, pp. 78–90,
2010.
[46] M. Leida, A. Afzal, and B. Majeed, “Outlines for dynamic visualization of semantic
web data,” in Proceedings of the 2010 international conference on On the move to
meaningful internet systems, Berlin, Heidelberg, 2010, pp. 170–179.
[47] J. G. Zheng and L. Ding, “How to render SPARQL results using Google Visualization
API - Data-gov Wiki,” 07-Oct-2009. [Online]. Available: http://data-gov.tw.rpi.edu/wiki/How_to_render_SPARQL_results_using_Google_Visualization_API. [Accessed: 15-Mar-2011].
[48] J. Tennison, “Creating Google Visualisations of Linked Data,” Jeni’s Musings, 23-
Jul-2009. [Online]. Available: http://www.jenitennison.com/blog/. [Accessed: 15-Mar-
2011].
[49] “Lucene-java Wiki.” [Online]. Available: http://wiki.apache.org/lucene-
java/PoweredBy. [Accessed: 17-Aug-2011].
[50] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files
for text indexing,” ACM Trans. Database Syst., vol. 23, no. 4, pp. 453–490, Dec. 1998.
[51] O. Erling, “Implementing a SPARQL compliant RDF triplestore using SQL-ORDBMS.”
[Online]. Available:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSRDFWP. [Accessed:
17-Aug-2011].
[52] O. Erling and I. Mikhailov, “RDF Support in the Virtuoso DBMS,” Networked
Knowledge-Networked Media, pp. 7–24, 2009.
[53] A. Jaffri, H. Glaser, and I. Millard, “URI Identity Management for Semantic Web
Data Integration and Linkage,” 2007. [Online]. Available:
http://eprints.ecs.soton.ac.uk/14361/. [Accessed: 05-Sep-2011].
[54] L. Ding et al., “TWC LOGD: A portal for linked open government data ecosystems,”
Web Semantics: Science, Services and Agents on the World Wide Web, vol. In Press,
Corrected Proof.
[55] E. Franconi, P. Guagliardo, and M. Trevisan, “An intelligent query interface based
on ontology navigation,” in Proceedings of the Workshop on Visual Interfaces to the
Social and Semantic Web (VISSW 2010), 2010.
[56] R. Levien, “Attack-resistant trust metrics,” Computing with Social Trust, pp. 121–
132, 2009.
[57] M. Leida, “Toward Automatic Generation of SPARQL result set Visualizations,”
presented at the 8th International Joint Conference on e-Business and
Telecommunications, Seville, Spain.
Appendices
Appendix A –Technologies Used
Description Software
Operating System Windows 7
Web Browser Google Chrome
Automated UI testing tool Selenium IDE
AJAX Framework Google Web Toolkit
Development Platform JDK 1.6
Java IDE Netbeans IDE 6.9.1
JavaEE API Servlet, JavaServer Pages, JSP Standard Tag Library
Database Virtuoso Open-Source Edition 6.1.3
Triplestore Virtuoso Open-Source Edition 6.1.3
Visualization API Google Visualization API
Search Engine Apache Lucene 3.3.0
Appendix B – Data Set Description Language (DSDL) Schema
<!--DTD definition for DSDL-->
<!ELEMENT data (general, presentation*)>
<!ELEMENT general (owner, source, creator, lastEditor, date, licenceInfo, description)>
<!ELEMENT owner (name, email?, url?)>
<!ELEMENT source (name, type, url)>
<!ELEMENT creator (name, email?, url?)>
<!ELEMENT lastEditor (name, email, url?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT date (from, to?)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT licenceInfo (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT presentation (description, import-data?, query, keyQuery?, keywords+, visualisation+)>
<!ELEMENT import-data (#PCDATA)>
<!ELEMENT query (#PCDATA)>
<!ELEMENT keyQuery (#PCDATA)>
<!ELEMENT keywords (property?, keyword)>
<!ELEMENT property (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
<!ELEMENT visualisation (#PCDATA)>
<!ATTLIST visualisation rating (1|2|3|4|5|6|7|8|9|10) #REQUIRED>
Appendix C - Case Studies
C.1 SPARQL query using GRAPH Clause
Listing C.1
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband
WHERE {
  GRAPH <http://data-gov.tw.rpi.edu/vocab/Dataset_10040> {
    ?s d:state ?state .
    ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
    ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
    FILTER (?state != "TOTAL HOUSEHOLDS")
  }
}
ORDER BY ?state
C.2 Full version of DSDL for case study 1
<data>
  <general>
    <owner>
      <name>National Telecommunications and Information Administration</name>
      <url>http://www.ntia.doc.gov/</url>
    </owner>
    <source>
      <name>National Telecommunications and Information Administration survey</name>
      <type>file</type>
      <url>http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf</url>
    </source>
    <creator>
      <name>National Telecommunications and Information Administration</name>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
    </lastEditor>
    <date>
      <from>02/03/2010</from>
    </date>
    <licenceInfo>Open Data</licenceInfo>
    <description>Households using the Internet in and outside the home, by selected
    characteristics: Total, Urban, Rural, Principal City, 2009</description>
  </general>
  <presentation>
    <description>Compare rural and urban internet usage for various states in US</description>
    <query>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state ?urban_home_broadband ?rural_home_broadband
      WHERE { ?s d:state ?state .
        ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
        ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
        FILTER (?state != "TOTAL HOUSEHOLDS")} ORDER BY ?state
    </query>
    <keywords>
      <keyword>rural vs urban Broadband Internet Use</keyword>
    </keywords>
    <visualisation rating="10">ColumnChart</visualisation>
  </presentation>
</data>
C.3 DSDL for additional query upload
<data>
  <presentation>
    <import-data>data1</import-data>
    <description>Compare rural and urban internet usage for various states in US</description>
    <query>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state ?urban_home_broadband ?rural_home_broadband
      WHERE {
        ?s d:state ?state .
        ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
        ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
        FILTER (?state != "TOTAL HOUSEHOLDS").
        FILTER regex(?state, "{0}", "i").
      } ORDER BY ?state
    </query>
    <keyQuery>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state
      WHERE {
        ?s d:state ?state .
        FILTER (?state != "TOTAL HOUSEHOLDS").
      } ORDER BY ?state
    </keyQuery>
    <keywords>
      <keyword>Rural and Urban Broadband Internet Use in {0}</keyword>
    </keywords>
    <visualisation rating="10">ColumnChart</visualisation>
  </presentation>
</data>
C.4 DSDL describing CIA Factbook dataset
<data>
  <general>
    <owner>
      <name>Central Intelligence Agency</name>
      <url>https://www.cia.gov/library/publications/the-world-factbook/</url>
    </owner>
    <source>
      <name>CIA World factbook</name>
      <type>endpoint</type>
      <url>http://www4.wiwiss.fu-berlin.de/factbook/sparql</url>
    </source>
    <creator>
      <name>Central Intelligence Agency</name>
      <url>https://www.cia.gov/</url>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
      <url>http://www.linkedin.com/in/priyammaheshwari</url>
    </lastEditor>
    <date>
      <from>27/08/2011</from>
    </date>
    <licenceInfo>Public Domain</licenceInfo>
    <description>The World Factbook provides information on the history, people, government,
    economy, geography, communications, transportation, military, and transnational issues
    for 267 world entities</description>
  </general>
  <presentation>
    <description>Literacy rates of females</description>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      SELECT ?female_literacy_rate ?country
      WHERE {
        ?s ns:literacy_female ?female_literacy_rate;
           ns:name ?country
      } ORDER BY ?country
    </query>
    <keywords>
      <keyword>Female literacy rates around the world</keyword>
    </keywords>
    <visualisation rating="6">Table</visualisation>
  </presentation>
</data>
C.5 DSDL describing property mappings
<data>
  <presentation>
    <import-data>data2</import-data>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      SELECT ?value ?country
      WHERE {
        ?s ns:{%property} ?value;
           ns:name ?country
      } ORDER BY ?country
    </query>
    <keywords>
      <property>literacy_male</property>
      <keyword>Male literacy rates around the world</keyword>
    </keywords>
    <keywords>
      <property>population_total</property>
      <keyword>Population of countries around the world</keyword>
    </keywords>
    <keywords>
      <property>infantmortalityrate_total</property>
      <keyword>Infant Mortality rates around the world</keyword>
    </keywords>
    <visualisation rating="6">Table</visualisation>
  </presentation>
</data>
C.6 DSDL describing DBpedia dataset and a query against multiple endpoints
<data>
  <general>
    <owner>
      <name>DBPedia Project</name>
      <url>http://dbpedia.org/About</url>
    </owner>
    <source>
      <name>DBPedia</name>
      <type>endpoint</type>
      <url>http://lod.openlinksw.com/sparql</url>
    </source>
    <creator>
      <name>Priyam</name>
      <email>[email protected]</email>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
    </lastEditor>
    <date>
      <from>27/07/2011</from>
    </date>
    <licenceInfo>GNU Free Documentation License</licenceInfo>
    <description>DBpedia is a community effort to extract structured information from Wikipedia
    and to make this information available on the Web</description>
  </general>
  <presentation>
    <import-data>data2</import-data>
    <description>Land Area covered by each country and their Wikipedia page</description>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX dbpedia: <http://dbpedia.org/ontology/>
      SELECT DISTINCT ?country ?land_area ?wikiPage
      WHERE {
        ?s ns:area_land ?land_area.
        ?s ns:name ?country.
        ?DBcountry a dbpedia:Country .
        ?DBcountry owl:sameAs ?s .
        ?DBcountry foaf:page ?wikiPage
      } ORDER BY ?country
    </query>
    <keywords>
      <keyword>Geographic information about countries</keyword>
    </keywords>
    <visualisation rating="10">Map</visualisation>
    <visualisation rating="8">Table</visualisation>
  </presentation>
</data>
Appendix D – Google Visualization Data Format
The following table provides details of the data format for various visualizations, as recommended by the Google Visualization API. When SPARQL query results are transformed into the Google JSON table format, the data type and format of each column should follow the format specified for the corresponding visualization.
Visualization: Data Format

Bar Chart: The first column should be a string and represent the label of that group of bars. Any number of columns can follow, all numeric, each representing the bars with the same color and relative position in each group.

Pie Chart: Two columns. The first column should be of type string and contain the slice label. The second column should be a number and contain the slice value.

Line Chart: The first column should be a string and contain the category label. Any number of columns can follow, all of which must be numeric. Each column is displayed as a separate line.

Geo Map: The location is entered in the first column, plus two optional columns:
1. [String, Required] A map location. The following formats are accepted:
   a. A specific address (for example, "1600 Pennsylvania Ave").
   b. A country name as a string (for example, "England"), or an uppercase ISO-3166 code or its English text equivalent (for example, "GB" or "United Kingdom").
   c. An uppercase ISO-3166-2 region code name or its English text equivalent (for example, "US-NJ" or "New Jersey").
   d. A metropolitan area code. These are three-digit metro codes used to designate various regions; US codes only supported.
2. [Number, Optional] A numeric value displayed when the user hovers over this region. If column 3 is used, this column is required.
3. [String, Optional] Additional string text displayed when the user hovers over this region.
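For illustration, the sketch below (plain Java, with hypothetical example countries and values) builds the kind of DataTable JSON that a two-column Pie Chart or Table visualization expects: a string label column followed by a numeric value column.

public class PieChartJsonExample {
    public static void main(String[] args) {
        // Hypothetical example values; a real transformation would fill these
        // from the variable bindings of a SPARQL result set.
        String dataTableJson =
            "{\"cols\": ["
          + "  {\"id\": \"country\", \"label\": \"Country\", \"type\": \"string\"},"
          + "  {\"id\": \"rate\", \"label\": \"Female literacy rate\", \"type\": \"number\"}"
          + "],"
          + "\"rows\": ["
          + "  {\"c\": [{\"v\": \"Austria\"}, {\"v\": 98.0}]},"
          + "  {\"c\": [{\"v\": \"Brazil\"}, {\"v\": 88.8}]}"
          + "]}";
        System.out.println(dataTableJson);
    }
}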