UNIVERSITY OF MANCHESTER
CoKo - A Semantic Web Application for the Semantic Web Challenge
A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences
2011
Priyam Maheshwari
School of Computer Science
Word Count: 19,651

Table of Contents
Abstract
Declaration
Copyright
Acknowledgement
Chapter 1 Introduction
1.1 What is CoKo?
1.2 Motivation
1.3 Attempt at Semantic Web Challenge
1.4 Project Objectives
Chapter 2 Background and Initial Research
2.1 Semantic Web Challenge
2.1.1 Former entries
2.2 Linked data user-interfaces
2.2.1 Faceted browsers
2.2.2 Query builders
2.3 Crowdsourcing
2.3.1 Architecture for Collective Knowledge Bases
2.4 Provenance and Trust
2.4.1 Trust assessment
2.4.2 Types of Provenance
2.4.3 Provenance Representation
2.4.4 Provenance metadata
2.5 Visualization
Chapter 3 System Architecture
3.1 QA System Architecture
3.1.1 WolframAlpha Architecture
3.1.2 TrueKnowledge Architecture
3.1.3 CoKo Architecture
3.2 System Requirements
Chapter 4 Implementation
4.1 Technical Overview
4.2 System Architecture
4.2.1 Presentation Tier
4.2.2 Application Tier
4.2.3 Data Tier
Chapter 5 Design Decisions
5.1 Data Set Description Language (DSDL)
5.2 Types of SPARQL queries
5.3 Property Mapping
Chapter 6 Evaluation
6.1 Case Studies
Case Study 1: Contained Dataset
Case Study 2: Distributed Datasets
6.2 Overall Evaluation
6.3 Contender for Semantic Web Challenge
Chapter 7 Conclusion and Future Work
7.1 Reflection
7.2 Problems which still need to be solved
7.3 Suggestions for the future
7.3.1 Critical
7.3.2 Other extensions
References
Appendices
Appendix A – Technologies Used
Appendix B – Data Set Description Language (DSDL) Schema
Appendix C – Case Studies
C.1 SPARQL query using GRAPH clause
C.2 Full version of DSDL for case study 1
C.3 DSDL for additional query upload
C.4 DSDL describing CIA Factbook dataset
C.5 DSDL describing property mappings
C.6 DSDL describing DBpedia dataset and a query against multiple endpoints
Appendix D – Google Visualization Data Format
Table of Tables
Table 1 Overview of features of former entries of SWC as compared to CoKo
Table 2 Comparison of Semantic Web Browsers
Table 3 High level system requirements
Table 4 Summary of CoKo source code statistics
Table 5 Data Set Description Language elements
Table 6 Progress towards fulfilling minimal requirements of the SWC
Table 7 Progress towards fulfilling additional requirements of the SWC
Table 8 Progress of CoKo towards meeting its objectives
Table of Figures
Figure 1 Section of Wolfram|Alpha result interface
Figure 2 Section of TrueKnowledge result interface
Figure 3 Input-output view of a collective knowledge base
Figure 4 Abstract view of QA system architecture
Figure 5 Abstract functional architecture of WolframAlpha
Figure 6 Abstract functional architecture of TrueKnowledge
Figure 7 Proposed functional architecture of CoKo
Figure 8 CoKo's three-tier architecture
Figure 9 Data flow between key classes
Figure 10 Building a full-featured search application using Lucene
Figure 11 Illustration of meta-variable replacement with the result from metaquery execution
Figure 12 Flow of operations involved in mapping a user query to a SPARQL query
Figure 13 From pile of triples to an intelligible interface
Figure 14 Additional dispatcher
Figure 15 Abstract view of query workflow
Abstract
With the rise in popularity of the web, more and more people are looking for services
which can help them to find information quickly. One such service is a question answering
(QA) system. It is a technique to provide accurate answers to users’ questions. Given a
question such as “Which is the longest river in the world?”, a keyword-based search
engine (e.g. Google) will return a list of URLs to web pages that probably contain the answer,
whereas a QA system attempts to directly answer the question with the name of the
river along with some other background details.
CoKo (Collaborative Knowledge) is an attempt at a Semantic Web based open-domain
question answering (QA) system, built upon a community-curated structured knowledge
base. It is envisioned to return coherent answers to users’ natural language questions,
along with appropriate visualizations (e.g. charts, maps, tables etc.) in order to make the
answers more intelligible and analyzable. This thesis describes the conception, design and
development of a prototype of this system.
This prototype was developed to determine the utility, feasibility and challenges of
developing a QA system designed to work on a collaboratively curated Linked data
knowledge base. It also aims to analyze the potential of such a system as an entry in the
Semantic Web Challenge.
During the evaluation of this prototype, we identified several bottlenecks that currently
limit the curation process. Finally, this thesis contemplates some of the remaining
challenges and provides suggestions for any subsequent attempt at the system.
Declaration
No portion of the work referred to in this dissertation has been submitted in support of
an application for another degree or qualification of this or any other university or other
institute of learning.
Copyright
I. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and
s/he has given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
II. Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs and
Patents Act 1988 (as amended) and regulations issued under it or, where
appropriate, in accordance with licensing agreements which the University has
entered into. This page must form part of any such copies made.
III. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of
copyright works in the dissertation, for example graphs and tables
(“Reproductions”), which may be described in this dissertation, may not be owned
by the author and may be owned by third parties. Such Intellectual Property and
Reproductions cannot and must not be made available for use without the prior
written permission of the owner(s) of the relevant Intellectual Property and/or
Reproductions.
IV. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in
any relevant Dissertation restriction declarations deposited in the University
Library, The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgement
I would like to thank my supervisor Dr. Bijan Parsia for his continued guidance throughout
this project. His guidance not only helped me to understand the potential benefits of the
project, but also helped me to overcome a number of difficulties throughout the course
of implementing the project.
I would also like to thank my family; my parents and sister who have supported and
encouraged me to pursue further education. None of this would have been possible
without their help.
Chapter 1
Introduction
This thesis proposes a Semantic Web collaborative open-domain question answering
(QA) system. What does it mean to be a Semantic Web collaborative open-domain
question answering system? Traditional keyword-based search engines like Google,
Yahoo, Bing etc. are efficient at providing a list of best possible results from their large
repositories of indexed HTML documents, but they fail to provide direct answers to user
queries like “what is the nutritional value of an apple?” or “distance between Manchester
and London” etc. For users involved with complex information gathering tasks these
search engines are far from providing a complete web search solution, as the user has to
manually identify and aggregate the pieces of relevant information from a selection of
various recommended web sites, where each site presents information in its own format.
Besides the problem of time-consuming manual aggregation of appropriate information,
other problems faced by web users are high recall and keyword-sensitive results [1].
Along with the web pages containing relevant information, these search engines also
return a huge number of mildly or completely irrelevant pages, and often, due to
keyword sensitivity, the search engines are unable to retrieve the desired page if it
uses terms different from those in the posed query.
The major difficulty faced by these search engines is the lack of machine-interpretable
web content. The way HTML documents on the web are currently deployed, the real
content of a web page is contained in the text, which makes it difficult for machines to
interpret, extract and process the information. In order to overcome this shortcoming,
the Semantic Web initiative provides a suite of technologies to represent Web content in
machine-processable format, allowing machines to determine the meaning of the content
and thus assist in the information gathering process [2].
Collaborative knowledge management is the new buzzword on the web today. Popular
websites like Wikipedia predominantly operate on user contributed content. Creating a
knowledge base from scratch is expensive in terms of both time and effort. Especially for
an open domain system, it would be difficult for an individual or an organization to single-
handedly collect and maintain data about all the knowledge in the world. Therefore, it is
often practical to share this burden with the community of web users. This act of
outsourcing the process of knowledge base creation and curation to an undefined group
of individuals is also known as “Crowdsourcing”[3].
Question answering (QA) systems aren’t a new concept on the web; they have been a topic of
research for several years [4] [5]. Unlike traditional information retrieval systems (like
search engines), which return a list of best-matching documents, a QA system consults its
knowledge base to generate direct answers to users’ questions [6]. There are several such
systems available on the web today, and they can be classified [7] into closed-domain QA
systems, which deal with questions from a specific domain such as medicine or sports, and
open-domain QA systems, which can deal with questions on just about anything. A
common characteristic of all these systems is that they allow users to ask questions in
natural language and then return a concise answer by looking it up in
their knowledge base. However, some of these QA systems are only able to provide
answers to questions about which they have explicit knowledge. This is mainly due to the lack
of semantics in the way the data is represented in their knowledge base; as a result, these
systems are unable to draw meaningful relations between pieces of knowledge in the knowledge
base, or to perform the reasoning and inference needed to answer questions about implicit
knowledge. For example Ask.com1, one of the typical QA systems on the web, could
provide direct answers to the questions “How long is the river Nile?” and “Which is the
longest river in the world?” but could not provide a direct answer to the question “What is
the length of the longest river in the world?”.
1 http://www.ask.com
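To illustrate how a structured, semantic knowledge base supports such implicit answers, the sketch below shows one hypothetical SPARQL query (the dbo:River class and dbo:length property are assumed, DBpedia-style; the actual vocabulary depends on the datasets in use). Rather than looking up a stored fact about "the longest river", the answer is derived by sorting rivers by length:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Assumed vocabulary: rivers are typed dbo:River and carry a numeric dbo:length.
    # The longest river and its length are derived by ordering, not stored explicitly.
    SELECT ?river ?length
    WHERE {
      ?river rdf:type dbo:River .
      ?river dbo:length ?length .
    }
    ORDER BY DESC(?length)
    LIMIT 1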
1.1 What is CoKo?
CoKo (Collaborative Knowledge) is an automatic open-domain question answering
system, exploiting Semantic Web technologies to provide coherent answers to users’
questions. It leverages recent developments in social collaborative knowledge
building by involving its users in the knowledge base building and curation process. Users can
contribute their datasets to the knowledge base and ask questions about explicit as well
as implicit information present in the knowledge base.
Besides finding direct and concise answers to their questions, users are also looking for
ways to organize and visualize their search results, for easy
interpretation and analysis of the results. In addition, there is a growing desire among
web users to be able to share their findings. To this end CoKo will present the answers
along with a range of appropriate visualizations (e.g. charts, maps, tables etc.) so as to
make the answers more intelligible and analyzable for the user. It will also facilitate
sharing of the answers and visualizations via social networking websites (like Facebook,
Twitter etc.), blogs and email.
1.2 Motivation
The design of CoKo was motivated by two popular question answering engines on the
web today - Wolfram|Alpha2 and True Knowledge3. Each of these applications has its
own strengths and weaknesses. CoKo aims to capitalize on their strengths by
implementing the most desirable features from both these applications.
Wolfram|Alpha is a web based computational knowledge engine, which generates
answers to users’ questions by doing computations on its internal knowledge base using
inbuilt algorithms. It has a natural language interface (refer Figure 1) which accepts user
questions in plain language and returns detailed answers accompanied by visualizations
(like graphs, timelines etc.) that allow for easy analysis of the answer. It
also provides facets for alternate visualization or representation of the results. In order to
compute a result, most of the data is derived from multiple sources, and a list of these
sources and references is given at the bottom of the result set.
2 http://www.wolframalpha.com/
3 http://www.trueknowledge.com/
Wolfram|Alpha represents knowledge using its own internal techniques and does not
directly apply Semantic Web technology [40] and therefore the data in its knowledge
base cannot be accessed, linked or reused by other semantic applications across the Web.
Perhaps the use of linked data technology would make it easier to pull data from outside
as well.
Although it provides access to its platform through its API, it doesn’t provide the
flexibility to manipulate or enhance the data. Its knowledge base is built from knowledge
extracted from various sources, combined and hand-curated exclusively by the
Wolfram|Alpha team. This restriction on the data is slowing down the growth of its
knowledge base to some extent [41]; for example, for the query “GDP India vs GDP Sri
Lanka” it could only fetch GDP data up to the year 2008 for Sri Lanka and up to the year 2009
for India.
Figure 1 Section of Wolfram|Alpha result interface for the query "GDP India vs GDP Sri Lanka" (annotated with: natural language input interpretation, specific result compiled into tabular format, useful result visualization, outdated data, source of data, facet provided for an alternate (reverse) view of the data table)
True Knowledge4, on the other hand, is a Semantic technology based answer engine
which exploits community curation to build its knowledge base. It adds facts to
its knowledge base either by importing them from external sources like Wikipedia and Freebase
or through user input by means of a thorough and controlled input form. Validity of these
facts is checked at two levels. Firstly, since the system can understand the facts, it is able
to discard statements which are inconsistent with the existing knowledge in the database.
Secondly users can approve or deny any formerly added facts, thus improving the quality
of knowledge in the database [42].
Although TrueKnowledge allows users to add facts, the knowledge base is still
somewhat reserved and exclusive. Users can add data in two ways: either by providing
facts in terms of objects and classes, or by directly entering an answer as free-flowing plain
text in a text area. It doesn’t give users the freedom to upload complete datasets. An
answer entered as plain text is not stored as facts; rather, it is displayed as such
when the question is asked again.
4 http://www.trueknowledge.com/
Figure 2 Section of TrueKnowledge result interface for the query "Is GDP of India more than GDP of Sri Lanka?" (annotated with: precise answer, input interpretation, user feedback on result, facts used to derive the result, user feedback on facts)
The interface to add facts is quite constrained, and the
user can add only bits of knowledge, which can be a time consuming process in case a
user wants to contribute a large amount of data.
Searching for answers using each of these applications has its own ups and downs.
Where Wolfram|Alpha’s forte is providing an attractive presentation of results, True
Knowledge’s strong point is the active contributions of its users in building and improving the
knowledge base.
CoKo utilizes similar methodologies to these two applications, with emphasis on the
following aspects:
Utilize Semantic Web technologies. Like TrueKnowledge, it is based on Semantic Web
technologies. It utilizes RDF to store datasets, OWL to link datasets and SPARQL to query
over datasets (a small query sketch illustrating this combination follows the list below).
Visualize results. Like Wolfram|Alpha, it presents the results with rich data and graphics.
Users are able to interactively choose alternate visualization methods, in order to obtain
an optimum view for their result.
Enable a collaborative and social environment. Users can contribute entire datasets and
not just facts. They will also be able to share their queries and results on social sites such
as blogs, Twitter and Facebook.
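As a rough, hypothetical sketch of the first aspect (not CoKo's actual schema), the query below assumes that two contributed datasets describe India under different URIs and that an owl:sameAs link between them was asserted when the datasets were uploaded; the ex:gdp property and both URIs are invented for the example. SPARQL can then retrieve the fact regardless of which dataset recorded it:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ex:  <http://example.org/stats/>

    # The UNION checks the owl:sameAs link in either direction,
    # mimicking its symmetry without requiring a reasoner.
    SELECT ?gdp
    WHERE {
      { ?india owl:sameAs <http://example.org/factbook/India> }
      UNION
      { <http://example.org/factbook/India> owl:sameAs ?india }
      ?india ex:gdp ?gdp .
    }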
1.3 Attempt at Semantic Web Challenge
The overall purpose of this project is to implement a Semantic Web application which has
the potential to become a winning entry to the Semantic Web Challenge (SWC)5, Open
Track. SWC is a competition conducted at the International Semantic Web Conference
(ISWC), which invites entries for Semantic Web-based end-user applications.
The approach of this project to the SWC exploits the trend towards making Linked Data
useful by using Semantic Web technologies. A simple combination of basic Semantic
Web technologies - RDF, SPARQL and web services - will enable it to successfully meet all
of the challenge requirements.
5 http://iswc2011.semanticweb.org/calls/semantic-web-challenge/
The Linked Open Data community project has encouraged many data publishers to release
public data according to Linked Data principles6. Through this effort a large number of
datasets have now been published on the web. At the time of this writing, some estimate
that over 200 million data sets7 exist in the wild. Despite the large amount of open linked
data available on the web today, applications that make use of Linked Data are not
mainstream yet. CoKo is an attempt to demonstrate and utilize the power of Linked RDF
data. It will consume linked data and allow users to discover answers to their questions,
using reasoning and inference over these data sets.
1.4 Project Objectives
Functional Objective
Corresponding to the growth of information on the web, there is a growing need for QA
systems that can help to better utilize, organize and analyze the ever-accumulating
information. Hence the functional objective of CoKo is to present rich, meaningful
visualizations (charts, maps, graphs etc.) corresponding to the answers, by exploiting the
semantics of the data, thus making the answers more comprehensible and analyzable.
Technical objective
The underlying technical objective of CoKo is to provide an end-to-end system for sharing
and curating Linked data in a collaborative environment, which would eventually improve
the quality of the answers for the end users of the QA system.
6 http://www.w3.org/DesignIssues/LinkedData.html
7 Sindice claims to be searching on about 228.19 million documents
Chapter 2
Background and Initial Research
2.1 Semantic Web Challenge
The Semantic Web Challenge (SWC) calls for applications that exploit Semantic Web
technology in a way that demonstrates the benefits of the technology. It consists of two
tracks: the “Open Track”, which requires applications to make use of the meaning of
information on the web, and the “Billion Triples Track”, which requires applications to deal
with a huge amount of predefined data gathered from the web. The challenge doesn’t specify any
particular domain or technology; instead, a set of minimum requirements and additional
desired requirements is specified for each track, thus allowing a wide range of
application submissions.
In order to be accepted as a valid entry for the open track of the challenge, the
application is required to realize the following minimum requirements defined by the
organizers:
- An end-user application with practical worth to a general user or at least to a
domain expert.
- The data sources should be syntactically and structurally diverse, should be under
diverse ownership and should contain a considerable amount of real world data.
- The data should be processed in order to derive useful information, which would
not be possible or would be difficult to extract with the help of conventional web
technologies.
Additional desirable requirements:
- The final application exhibits benefits of semantic technologies, has a
functional interface for the end user and is accessible on various devices.
- Innovative use of semantic technology for a domain or task that hasn't been
considered before.
- The application has commercial potential.
- The application is scalable and uses dynamic data in combination with static data,
preferably published on the Semantic Web.
- Validation of the results with the help of contextual ratings or rankings.
- Use of multimedia documents.
- The application provides support for several languages.
- Functionality should not be mere information retrieval.
2.1.1 Former entries
The first SWC was held in the year 2003 and over the past eight years, the Challenge has
attracted more than 140 entries. These entries have demonstrated a range of different
applications, from full web-scale search services to simple recommendation systems,
serving different domains (e.g. biomedical science, academic research etc.) and have even
covered different platforms (e.g. mobile phones, iPad and iPod). During the initial phase
of the project many of the former entries to the challenge were studied, but due to
limitations of space only a few will be discussed in this section.
Many of the former entries to SWC focused on demonstrating how Semantic Web and
presentation technologies can be deployed to provide better search and browsing
support, starting from CS AKTive-Space (CAS) which was the winner of the very first SWC
held in 2003. CAS was a Semantic Web application which allowed funding agencies,
researchers and students to explore the Computer Science research domain in the United
Kingdom. It provided an integrated view of Computer Science research related
information aggregated from multiple heterogeneous sources, such as published RDF
sources, personal web pages, and databases. It allowed users to query, explore and
organize information in order to discover rich relations. [8]
MultimediaN E-Culture demonstrator8 (winner of SWC 2006) is a Semantic Web
application to interactively search, navigate and annotate web based media collections. It
provides a keyword-based search over the annotated collection and returns semantically
grouped clusters of query results. It allows the user to view the results using various
available presentation mechanisms. [9]
8 http://e-culture.multimedian.nl/demo/session/search
Arnetminer9 (one of the entries of SWC 2007) provides mining and searching services for
researcher social networks. A semantic based profile is created for each researcher by
automatically extracting and integrating data (e.g., the bibliographic data and the
researcher profiles) from multiple sources on the web, and is stored in a researcher
network knowledge base (RNKB). It provides three types of search services: person
search, publication search and conference search, and five types of mining services:
expert finding, people association finding, hot-topic finding, sub-topic finding, and survey
paper finding over the RNKB. [10]
Sig.ma10 (winner of SWC 2009) is a Semantic Web Search engine, built on top of Sindice11.
Sindice parses information on the web looking for RDFa and microformats; in
particular, it parses well-known structured information in pages such as Wikipedia,
Wordpress blogs, etc., retrieving information and translating it into triples [11]. Sig.ma
aggregates data about entities from these sources and presents it to the user using a
template-based result interface. A very innovative aspect of the application is the method
that it provides to its users for dealing with the information quality challenges. It allows
the users to approve, reject or add a new source from which the result has been
aggregated, thus learning from user feedback. It provides a customizable result interface
which can be shared and embedded in blogs, websites and other applications.
Visinav12 (winner of SWC 2009) is a system which can be used to search and navigate the
Web of data. It aggregates RDF data sets by crawling the open web and allows for faceted
browsing of these datasets [3]. Its functionality goes beyond keyword based search by
allowing the users to visually construct and refine their queries via facet selection
operation. Users start by providing a keyword to locate objects, and subsequently they
can refine their queries to form more complex queries. The system is intuitive and calculates
the possible next steps based on the current state, thus displaying only legal choices to
the user for query construction and refinement. It is an exploratory search engine.
9 http://www.arnetminer.org
10 http://sig.ma/
11 http://sindice.com/
12 http://visinav.deri.org/
NCBO Resource Index13 (winner of the SWC 2010) is a semantic search application for
researchers to browse and analyze the information stored in 22 diverse biomedical
resources. The textual descriptions of the data residing within these resources are
annotated with the help of various ontology terms and indexed. It can then be searched
for concepts through an intuitive interface. Tag clouds are provided in the results to
visualize concepts related to the current search query, and color intensity is used to represent
more relevant resources based on the current search terms.
Table 1 contrasts the features of CoKo with the former entries discussed above.
13 http://bioportal.bioontology.org/resource_index_ui
Facet based:
CS AKTive-Space
- Dataset: multiple heterogeneous sources, such as published RDF sources, personal web pages, and databases
- Service: investigate an area of research and a researcher based on their scholarly impact and their research grant income
- Result: variety of visualizations and multi-dimensional representations
- End-user: funding council, researchers, graduate students
- Personalisation: none

Keyword based:
MultimediaN E-Culture demonstrator
- Dataset: annotated index of large virtual collections of cultural-heritage objects
- Service: annotation of web resources representing images; search for artwork, artefact, concept, location or person
- Result: clustered thumbnails of paintings along with their titles
- End-user: collection holders
- Personalisation: privileged users can add, delete and edit RDF metadata of paintings

Arnetminer
- Dataset: knowledge base containing integrated data extracted from researchers' profiles and crawled publications
- Service: profile search, expert finding, conference analysis, course search, sub-graph search, topic browser, academic ranks
- Result: different interfaces for different services
- End-user: academic community
- Personalisation: registered users can modify extracted profile information, provide feedback on the search results, and follow researchers

VisiNav
- Dataset: index of structured data crawled from the Web
- Service: object search
- Result: ranked list of objects crawled from the web with the option of alternate views like Table and Timeline
- End-user: general web user
- Personalisation: none

Sig.ma
- Dataset: index of structured data crawled from the Web
- Service: object search
- Result: mashup of information retrieved from various sources
- End-user: general web user
- Personalisation: learns from user feedback; search results can be shared on the web

NCBO Resource Index
- Dataset: index of annotated data from 22 diverse biomedical resources
- Service: search biomedical data based on ontology concepts
- Result: list of relevant data from a selected resource; tag cloud of related concepts
- End-user: biomedical researchers
- Personalisation: none

NLP based:
CoKo
- Dataset: internal knowledge base of structured data contributed by users
- Service: question-answering service
- Result: precise answers; rich visualization (e.g. tables, graphs, maps, etc.)
- End-user: general web user (can be customized for domain-specific use)
- Personalisation: users can contribute data, provide feedback on results, and share results

Table 1 Overview of features of former entries of SWC as compared to CoKo
2.2 Linked data user-interfaces
There is a huge amount of Linked data available on the web today, but it’s a challenge for
lay users to understand the potential of this data, due to their lack of knowledge about
the intricacies of RDF and other Semantic Web technologies. Even though there has been
much research devoted to providing efficient and comprehensible user interfaces for
linked data, it is still considered an open research problem.
There are various kinds of linked data user interfaces available, such as triple query
builders, relationship finders, mash-ups, faceted browsers etc. [13]. Most of these linked data
interfaces allow users to search and explore data in a fashion similar to a traditional
search engine. However, some of these require the user to be familiar with RDF triples
and thus pose a challenge for lay users. It is observed in [15] and [13] that only faceted
and triple query builder interfaces allow non-expert Linked Data users to efficiently pose
complex and expressive queries to large data repositories. For the purpose of this thesis
only faceted browsers and triple query builders will be discussed in the following sections.
2.2.1 Faceted browsers
Faceted browsers are one of the most popular linked data interfaces available on the web
today. These browsers enable users to perform exploratory search by allowing smooth
browsing through the RDF graph [16]. Users start with one resource and are able to
navigate the data space consisting of different data sources by following RDF links. These
browsers have been reported to follow several different approaches to display search
results. On one hand, tools like mSpace [19], Flamenco [20], Longwell14 and Haystack15
display the facets along with a list of results, using only facets directly connected to the
result. On the other hand, tools like Parallax [21], Humboldt [22], Tabulator [23] and
Nested Faceted Browser16 allow hierarchical filtering of the results. In addition to
providing properties and values of resources, some of these browsers (e.g. Tabulator) also
provide visualizations like maps and timelines. However, these browsers only provide a
limited number of visualizations, and the source code needs to be modified in order to
provide new visualizations [24]. In [25] the authors examine some of the current
unconventional linked data browsers and draw a comparison between them with the help
of Table 2.
14 http://simile.mit.edu/wiki/Longwell
15 http://simile.mit.edu/hayloft/
16 http://people.csail.mit.edu/dfhuynh/projects/nfb/
BrownSauce
- Runs: local web server
- Data sources: one at a time
- Data formats: RDF (any)
- Data presentations: indented text list
- Presentation selection: hard-coded at compile-time

Disco
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: predicate-object table
- Presentation selection: hard-coded at compile-time

Exhibit
- Runs: HTML web browser
- Data sources: single, author-time specified
- Data formats: JSON only
- Data presentations: list, timeline, map, graph, table, custom HTML form
- Presentation selection: HTML, author-time

Marbles
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: predicate-object table
- Presentation selection: Fresnel

ObjectViewer
- Runs: desktop Java
- Data sources: multiple, unbounded
- Data formats: RDF (any supported by Jena)
- Data presentations: graph (interactive)
- Presentation selection: hard-coded at compile-time

Tabulator
- Runs: Firefox web browser extension
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: table, calendar, map, friends, RDF/N3, RDF/XML, HTML
- Presentation selection: run-time, manually by user; available presentations are decided by data-type

Zitgist DataViewer
- Runs: web server
- Data sources: multiple, unbounded
- Data formats: RDF (any)
- Data presentations: templates, predicate-object table
- Presentation selection: automatically selected by data-type

Table 2 Comparison of Semantic Web Browsers
Note. Reprinted from “A review of user interface adaption in current semantic web browsers” by Turner, E., A. Hinze, and S. Jones.
A major drawback of faceted browsers is that a broader view of the dataset being viewed
isn’t supported, as they allow only a limited set of queries. Facets are not efficient for
formulating complex queries and only work well for simple queries like searching for
objects belonging to a particular class. In [26] the authors give two examples where these
browsers are inefficient.
i) “Persons who went to school in Germany”
Faceted browsers do not work well for this query because “in Germany” is
mentioned in the context of the school and not the person (see the query sketch after these examples).
ii) “Persons who live in x”, where x is a small city
In this case there are other more frequently occurring patterns, which would be
offered to the user as facets, rather than the facet “live in x”.
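For comparison, example (i) is straightforward when written directly as a graph pattern. The hypothetical query below (class and property names are assumed, DBpedia-style) makes the indirection explicit: the country constraint applies to the school, which is then joined to the person:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # "Persons who went to school in Germany": the location constraint
    # sits on the join variable ?school, not on the person itself.
    SELECT ?person
    WHERE {
      ?person rdf:type dbo:Person .
      ?person dbo:school ?school .
      ?school dbo:country dbr:Germany .
    }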
2.2.2 Query builders
These can be classified into two categories: i) visual SPARQL query builders, e.g. DBpedia
Query Builder, and ii) NLP based query builders, e.g. PowerAqua.
Visual SPARQL query builders provide a triple based interface, which allows users to pose
queries to a knowledge base by building triple patterns. Users are able to define filters,
pattern variables or identifiers for the subjects, predicates and objects. The major
drawback of this approach is that the users need to have a basic understanding of SPARQL
query syntax and its functioning. Users also need to know the terminology and structure
of the underlying schema to be able to formulate efficient queries. These tools provide
suitable options to the users for building triple patterns, by providing a list of predicates
for each typed identifier. The predicates suggested are the ones which are actually
related to the identifier in the repository, thus helping the user to explore and create
complex queries by analyzing the relationships between instances. However, graph-based
visualizations where all property values are analyzed in order to provide suggestions are
resource expensive [27]. Also, users are not presented with any actual data during
query construction; therefore they do not know what data the repository holds and the
kind of queries it can answer.
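As a concrete, hypothetical illustration of what such a builder assembles, the triple patterns and filter below correspond to a query like "films directed by Steven Spielberg and released after 2000"; once the identifier ?film has been typed, a builder would suggest predicates such as the (assumed) dbo:director and dbo:releaseDate:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Each row of the builder corresponds to one triple pattern;
    # the FILTER is attached to the typed variable ?date.
    SELECT ?film ?date
    WHERE {
      ?film rdf:type dbo:Film .
      ?film dbo:director dbr:Steven_Spielberg .
      ?film dbo:releaseDate ?date .
      FILTER (?date > "2000-01-01"^^xsd:date)
    }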
NLP based query builders allow users to pose NL queries which are then processed and
transformed into queries for the repository. These interfaces can be categorized into full
NL interfaces, e.g. PowerAqua [28], and Controlled Natural Language (CNL) interfaces, e.g.
Ginseng [29]. The main difference between these two approaches is that the latter is not
dependent on a predefined vocabulary and doesn’t generate any syntactical or logical
interpretation of the input queries; instead, it controls the user input by only allowing
queries that conform to a grammar generated from the terms and structure of the
internal ontology. This ensures that every query can be interpreted correctly. One of the
major drawbacks of full NL interfaces is that if the system is unable to provide an answer
to a user query, the user does not know whether the result couldn’t be retrieved because of
an inadequately posed query which couldn’t be interpreted by the system or because the
underlying schema of the repository doesn’t support the query interpretation. In that
case CNL interfaces are better as they guide the user to only input queries that can be
interpreted.
2.3 Crowdsourcing
Building and maintaining a knowledge base from scratch can be a time consuming and
difficult task for any domain. It would certainly be an enormous challenge when
building one for an open-domain question answering engine. As such, it is often practical to
outsource this knowledge base augmentation process to the community. The value
created by the collective contributions of people in the community is often referred to as
"collective intelligence" or "wisdom of crowds" [30]. The term “Crowdsourcing” was
coined by Jeff Howe, who defined it as "the act of a company or institution taking a
function once performed by employees and outsourcing it to an undefined (and generally
large) network of people in the form of an open call." It is a widely used concept on the
web today, with Wikipedia being one of the classic examples, where thousands of users
volunteer to create an encyclopedia that studies show is as accurate as traditional
volumes like Britannica.
2.3.1 Architecture for Collective Knowledge Bases
An input-output view of a collective knowledge base (KB), with two
continuous loops of interaction, is represented in Figure 3. The KB receives three streams of
information from contributors and users (who may or may not be disjoint): 1) Rules
and facts from contributors, 2) Queries and evidence from users and 3) Feedback on the
answers, from users. In turn it produces two streams of information: 1) Answers to
queries and 2) Credit to contributors. As a result, the contributors and users are involved
in a (many-to-many) interaction via the knowledge base: contributions from many
different contributors might be used to derive an answer to a query and the feedback
about the answer will in turn be propagated to different contributors. On the other hand,
many different queries may be answered using a single contribution, which will receive
feedback from all of them. [31]
The key element of this architecture is that the KB is a result of human contributions and
machine learning, drawing value from their respective strengths and weaknesses. Human
beings are good at judging the quality of the end result but fail to efficiently carry out
reasoning over large amounts of data, whereas machines are capable of handling large
amounts of data and doing computations with it. Another key element is that the
contributors are continuously receiving feedback on the quality and validity of
contributions and the evolving knowledge base is being scrutinized through queries and
their outcomes. Thus the knowledge is more likely to be accurate. [31]
In [31] the authors suggest that the above proposed model would help to deal with the
following problems associated with community curated knowledge bases:
Quality
In large scale community driven knowledge bases, it is difficult to ensure the quality of the
information contributed by individuals when little is known about their areas of expertise.
Therefore, it is important that checks be put in place to estimate the quality of the
contributions.
Figure 3 Input-output view of a collective knowledge base.
Reprinted from “Building Large Knowledge Bases by Mass Collaboration,” by M. Richardson and P. Domingos, in
Proceedings of the 2nd international conference on Knowledge capture, New York, NY, USA, 2003, p. 129–137.
All the knowledge contributed by individual users should be subjected to community
feedback, along with some sort of machine learning mechanism. This would encourage
users to provide good quality data, since the efficacy of the knowledge is being tracked.
Consistency
Consistency of a large knowledge base suffers when contradicting facts are added by the
same or different contributors. With the growing size of the knowledge base it is highly
likely that contradicting facts will be encountered, due to the lack of coordination between
the contributors.
In order to filter out the noisy and inconsistent information contributed, measures of quality and accuracy
should be coupled with the information, with the help of probabilistic parameters.
Relevance
In large scale distributed management of a knowledge base, it is often difficult to ensure
that the data contributed by volunteers is in conformance with the actual goal of the
application. If this is not effectively ensured, it may render the volunteers' effort futile.
The knowledge requirements of the application should be properly conveyed to the
contributors and they should be given credit for providing any datasets which are used in
producing highly rated answers. It is expected that users will contribute datasets from
their fields of interests and expertise.
Motivation of contributors
Since the success of a collective knowledge base depends on the quality of data
contributed by the volunteers, they should be given due credit for providing high quality
data, in order to motivate them.
The system should incorporate a fair method to recognize and reward the contributors
(e.g. listing the top contributors, awarding virtual badges or titles, etc.) for sharing their knowledge
and expertise.
2.4 Provenance and trust
Provenance means the origin, or the source, of something.17 As discussed in the previous
section, data quality is one of the major concerns in a collaborative data management
environment. Tracking data provenance helps to estimate the quality of data based on its
source and the transformations applied to it. Provenance is not only important to assess data quality
but also to determine the source of data to ascertain trustworthiness, ownership of data,
timeliness and others as described in [32].
2.4.1 Trust assessment
Trust assessment methods can be classified as follows [33]:
1) Reputation or Ratings-Based
These methods allow users to provide ratings based on their experience or trust in a
particular entity, which can then in turn help other users to decide what or who to
trust or prefer [34]. This category includes ratings-based systems similar to those used by eBay and
Amazon, and Web-Of-Trust mechanisms. The majority of trust architectures proposed for
Semantic Web based applications fall into this category. The major drawback of this
approach is that explicit and topic-wise trust ratings need to be provided, which puts
extra overhead on information consumers to obtain and maintain these ratings.
2) Context-Based
These methods collect metadata about the conditions in which information has been
provided, such as the four W’s – Who, What, When and Why. Trust decisions are based on
roles or memberships of individuals. Example policies given in [33] are: "Prefer
product descriptions published by the manufacturer over descriptions published by a
vendor" or "Distrust everything a vendor says about its competitor."
3) Content-Based
These methods are based on rules and axioms along with the data itself and related
data published by others. Example policies for this approach given in [33] are: “Believe
information which has been stated by at least 5 independent sources.” or “Distrust
product prices that are more than 50% below the average price.”
17 "provenance, n.". OED Online. March 2011. Oxford University Press.
http://www.oed.com/view/Entry/153408 (accessed April 30, 2011).
The ratings based mechanism is a widely deployed trust assessment method on the
web today, being used by sites like Amazon and eBay to rate sellers. However, this method
only captures users’ opinions about a particular element, without any other information
about the source or producer. It limits the ways in which users can express their
justification for trusting a particular entity [34].
Content and context based mechanisms, on the other hand, are independent of trust
ratings and only require background information, which is usually available on the
Semantic Web in the form of provenance metadata and thus can be utilized in making trust
decisions [33].
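To make the content-based style concrete, the sketch below approximates the "at least 5 independent sources" policy with SPARQL 1.1 aggregates, under the illustrative assumption that each contributed dataset is stored in its own named graph (this is not a description of any particular existing system):

    # Keep only statements asserted in at least five distinct named graphs,
    # i.e. by at least five independent sources.
    SELECT ?s ?p ?o (COUNT(DISTINCT ?g) AS ?sources)
    WHERE {
      GRAPH ?g { ?s ?p ?o }
    }
    GROUP BY ?s ?p ?o
    HAVING (COUNT(DISTINCT ?g) >= 5)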
2.4.2 Types of Provenance
In [35] the authors describe two kinds of provenance: workflow provenance and data provenance. A
workflow is a set of steps followed to reach from an initial state to the target state.
Workflow provenance aims to maintain a record about “the entire history of the
derivation of the final output” [35] of a workflow. Whereas, data provenance is
concerned with preserving the details about the origin and derivation of individual pieces
of data. In [36] the authors characterize data provenance as why- and where-provenance, i.e.
information about the origin of a piece of data in the result of a query and the location
from where that data has been extracted. Additionally, how-provenance was introduced
in [37] to describe how the origin was used in deriving the result. Furthermore, an
analogue to data provenance, called knowledge provenance, is discussed in [38]. It is
similar to data provenance except that it includes information about the extensive
reasoning used to derive data either before it is inserted into the knowledge base or after
it is retrieved from the knowledge base.
2.4.3 Provenance Representation
Different schemes can be used to represent provenance information, each having its own
implications on the cost of storing them and the information provided. Two methods to
represent provenance have been described in [39] as:
1) The Inversion method
This method exploits the relationship between the input and the output data. It works
backwards from the output data to determine the input data used in the derivation
process. This is a compact method of representation and the information provided is
limited to the derivation history of the data.
2) The Annotation method
In this method, metadata about the derivation history of data, description about
source of data, and processes are aggregated as annotations. This method pre-
computes the provenance, which can thus be readily used as metadata. This method
provides more elaborate information than the inversion method and may sometimes
also include the parameters used in the derivation process, post-conditions, related
publication references etc.
2.4.4 Provenance metadata
Current research on recording provenance metadata is focused either on associating RDF
triples with a named graph [40] [41] or on extending an RDF triple to a quadruple, where the
fourth element can be a URI, a blank node or an identifier [42] [43]. This fourth element
can be used to represent provenance information.
An RDF named graph is an explicit provenance mechanism where an RDF graph is assigned
a URI, which can then be referenced by other graphs as a normal resource. Thus it
allows assigning explicit provenance information to a set of RDF triples. However, one
drawback of RDF named graphs is that they do not support capturing provenance
information about implicit triples. For this, the use of colors to capture implicit and explicit
provenance information about RDF triples has been proposed in [44]. RDF triples are
augmented with a fourth element named color, which represents the different sources
used to derive the triple. A large number of vocabularies have been published to
represent provenance metadata. These vocabularies clearly describe the relationships
and concepts used in provenance annotations.
One general-purpose vocabulary which is widely used to represent provenance metadata
is Dublin Core. It has properties like dc:creator and dc:publisher, which can be used to
identify the creator and publisher of data, using URIs to identify them. On the other
hand, a popular provenance-specific vocabulary is the Provenance Vocabulary [45], which
was designed to deal with two dimensions of provenance, i.e. data creation and data
access. It contains three sets of terms to store provenance information: general, data
creation, and data access. The general terms contain classes and properties to describe
general provenance elements. The data creation dimension contains classes to describe
how data has been created and properties to identify the source data used during data
creation. The data access dimension provides classes and properties to describe the
retrieval of source data and creation guidelines.
For large datasets, storing provenance metadata at the triple level can lead to the
provenance information being larger than the data itself; for this reason the authors in
[39] describe the voidp vocabulary, a light-weight extension to voiD. It reuses entities
from the Provenance Vocabulary, the Time Ontology in OWL and the Semantic Web
Publishing Vocabulary.
Providing provenance information for answers is a fundamental requirement for any
question answering system. When answers are returned from such applications, users
want to know which sources were used, how reliable those sources are, and how an
implicit answer was derived.
2.5 Visualization
A lot of research is targeted at enhancing the user experience in information discovery
tasks on the web, one area being search result visualization. Visualization of search
results plays an important role in the clear understanding and analysis of the retrieved
information; it helps to give context to otherwise plain text results. For example, the
representation of disease data on a map would assist a public health professional in her
analysis of the spatial distribution of a disease and the effectiveness of disease control
policies.
Two major approaches to visualizing Semantic Web data are proposed in [46]: i)
visualization of the complete RDF graph and ii) visualization of SPARQL query results, i.e.
selective parts of the graph. The first approach is intended for users with a thorough
understanding of and interest in the structural visualization of RDF graphs and is thus
not suitable for general web users, who are only interested in visualizing the result (e.g.
charts, tables, pictures etc.) and not the underlying model.
The Data-gov Wiki18 provides some useful tutorials and demos on different ways of
visualizing data returned by SPARQL queries. Although the demos and tutorials are mainly
aimed at showing how Semantic Web technologies can be used in converting, enhancing
and using linked government data, the same techniques can also be applied to other
linked data sources.
The approach described in one of the tutorials, "How to render SPARQL results using
Google Visualization API" [47], as well as in [48], is based on the following three steps:
1. Query
Generate an appropriate SPARQL query and execute it against an appropriate SPARQL
endpoint to fetch the data of interest.
2. Transform
Transform the result of the query into an appropriate format, depending on the input
requirements of the visualization library being used, e.g. Google Visualization JSON for
the Google Visualization API. The transformation is carried out with a pre-defined XSLT
which converts the SPARQL XML bindings to the required format.
3. Visualize
The resulting document from the previous step can then be submitted to an appropriate
visualization service such as Exhibit or Google Visualization.
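To make the query and transform steps concrete, the following sketch (in Java, using only the standard library) executes a SELECT query against a SPARQL endpoint over HTTP and applies an XSLT stylesheet to the SPARQL/XML result. The sample query and the stylesheet file name (sparql2gviz.xsl) are illustrative placeholders rather than artefacts of the tutorials themselves.

import java.io.File;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SparqlToGoogleJson {
    public static void main(String[] args) throws Exception {
        // 1. Query: send a SPARQL SELECT query to an endpoint, asking for SPARQL/XML results.
        String endpoint = "http://data-gov.tw.rpi.edu/sparql";   // illustrative endpoint
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+xml");

        // 2. Transform: apply a pre-defined XSLT that maps SPARQL XML bindings
        //    to Google Visualization JSON (the stylesheet name is a placeholder).
        Transformer xslt = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("sparql2gviz.xsl")));
        StringWriter json = new StringWriter();
        xslt.transform(new StreamSource(conn.getInputStream()), new StreamResult(json));

        // 3. Visualize: hand the resulting JSON document to a visualization service,
        //    e.g. the Google Visualization API in the browser.
        System.out.println(json);
    }
}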
18 The Data-gov Wiki is a project being pursued in the Tetherless World Constellation to expose open
government datasets using Semantic Web.
Chapter 3
System Architecture
This chapter outlines the proposed architecture of CoKo. It begins with a discussion about
functional architecture of WolframAlpha and TrueKnowledge followed by a high-level
architecture description of CoKo. It also lists the basic system requirements of the
application.
3.1 QA system architecture
After observing the functioning of TrueKnowledge and WolframAlpha during the initial
phase of the project, it was clear that the basic architecture followed by these QA systems
involves at least a curator, an end user and a QA engine, which is the core software.
The core components of a QA engine are:
i) a dispatcher, which takes user input and returns the appropriate result and visualization;
ii) a query processor (QP), which processes the user input to retrieve appropriate data
from the knowledge base in response to a user query and passes it to the dispatcher;
iii) a data curation module, which handles the data augmentation and data cleansing
tasks for the data in the knowledge base.
Figure 4 Abstract view of QA system architecture
Figure 5 Abstract functional architecture of WolframAlpha
The end user interacts with the system via the user interface. The user enters a query
through this interface, which then flows between the dispatcher and the query processor
module of the QA engine to generate an appropriate result for the query. The curator
interacts with the curation module to augment the data and enhance its quality, which
in turn helps produce the desired end user experience. Additionally, a developer develops
and modifies the QA engine to support the curators and the end users in their respective
tasks.
Essentially, the QA engine architecture can be divided into an external environment,
which is under the control of the end user of the system, and an internal environment,
which is controlled by the system owner.
3.1.1 WolframAlpha Architecture
WolframAlpha has a centralized architecture with the curation and development tasks
being under the control of the internal environment. It appears that the development and
curation processes are tightly coupled and the engine is tweaked as part of the curation
process.
The only component which is outside this centralized control is the end user interaction.
The complete curation process is entirely masked from the end user.
Due to the highly centralized architecture of WolframAlpha, it doesn’t have to explicitly
deal with the issues of security, data quality and reliability, as all of these are internally
handled by its team of curators.
The QA engine of WolframAlpha consists of an NLP-based dispatcher, a Mathematica-based
query processor and an internally managed curation module.

Figure 6 Abstract functional architecture of TrueKnowledge
3.1.2 TrueKnowledge Architecture
The architecture of TrueKnowledge is semi-decentralized: it partially shares the task of
curation with its end user community.
It has an internal curation process which co-develops and co-manages the knowledge
base with an external community of curators. The internal curation process adds facts to
the knowledge base by importing from sources like Wikipedia and Freebase, whereas
users add knowledge by means of a thorough, controlled form-based input. The external
curation mechanism is essentially feedback-based, so users cannot add a large amount of
real-world data in one go.
This semi-distributed architecture makes the system susceptible to spoofing and abuse.
To overcome these threats, TrueKnowledge only allows registered users to curate and
contribute data and has implemented an internal curation mechanism which only accepts
knowledge that conforms to the existing knowledge in its knowledge base.
Its QA engine comprises an NLP-based dispatcher, a Semantic Web Technology (SWT)
based query processor and a quasi-distributed curation module.
3.1.3 CoKo Architecture
For CoKo we propose a truly decentralized architecture by moving the curation process
completely outside the internal environment. The task of curation is entirely managed by
the end user community.
This user-driven model requires the system to be more generic, flexible and less
complicated for the users. The proposed functional architecture of CoKo is illustrated in
Figure 7, along with the interaction between the components of the QA engine and the
end user. The system supports two levels of curation: i) adding new data and ii) providing
feedback about existing data.
Since it is an open-domain QA system, users can add new data pertaining to any domain
to the knowledge base. Due to the arbitrary nature of the data which can be fed to the
system, we ask users to submit a description of their data whenever they add it. This
description supports the query processing module in generating answers to user queries.
With this description we ask users to describe not only the data itself but also how to use
it and how it aggregates.
Figure 7 Proposed functional architecture of CoKo
End users can ask NL queries via the user interface, which are handled by an NLP-based
dispatcher. This dispatcher passes the query to the Semantic Web Technologies (SWT)
based query processor, which primarily involves a SPARQL query. The data retrieved from
the knowledge base by the query is then passed back to the dispatcher, which processes
this raw data into a more analyzable result format, essentially some kind of engaging
visualization. The end users can provide feedback about the result and the visualization,
creating a feedback loop which is stored in the knowledge base and helps to improve the
quality of the results in future.
The open system architecture of CoKo makes it vulnerable to many threats, such as:
1) Duplicate data upload due to lack of co-ordination between contributors
2) Spoofing attacks and abuse of the system
3) Poor data quality
4) Untrustworthy data
In this thesis we aim to address some of the problems stated above, to make the system
more robust and secure.
3.2 System Requirements
Based on the system architecture described in the previous section, we define a set of
high-level system requirements in Table 3.
Searching
Provide a keyword-based interface to search for knowledge.
Return a ranked list of all hits found in the knowledge base.
Visualizing the results
Support different visualizations.
Show relevant visualization according to its rating.
Sharing the results
Support sharing of search results and visualization through social networks (like Facebook,
Twitter etc.) and blogs.
Recommendations
Allow users to suggest keywords and visualizations for existing data.
Allow users to rate existing visualizations for a query.
Inspecting
Display complete background information about the dataset (e.g. source, creator, validity
period etc.)
Authentication
Implement an authentication system to prevent abuse of the system and to deal with other
problems related to an open system architecture.
Knowledge base Augmentation
Provide support for contribution of new datasets.
Provide a language for users to be able to describe their datasets.
Provide support to associate new queries with existing datasets.
Table 3 High level System Requirements
Chapter 4
Implementation
A prototype of CoKo has been developed to analyze the feasibility of the conceptual
architecture proposed in Chapter 3. This chapter provides the implementation details
along with key technologies used to realize this prototype.
4.1 Technical Overview
CoKo is a JSP and servlet based application, deployed on Apache Tomcat web server. The
entire application was developed on a Windows 7 machine using NetBeans IDE 6.9.1 with
JDK 1.6. The production version of the application is hosted on a machine running Ubuntu
10.10. The development of the application was carried out over a period of three months
and the version in production consists of 14 classes with over 2100 lines of code. Table 4
provides a summary of source code statistics.
Number of Lines of Code ~ 2196¹⁹
Number of Classes 20
Number of Production Classes 14
Number of Test Classes 6
Number of Methods 44
Table 4 Summary of CoKo source code statistics
The application is portable as all the technologies used in developing it are platform
independent. For a complete list of technologies used, please refer to Appendix A.
4.2 System Architecture
CoKo is based on the standard three-tier client-server architecture, which includes the
presentation tier, application tier and data storage tier. The presentation tier consists of the end
user's workstation running a standard web browser. This tier deals with the way
information is presented to the user, i.e. the GUI design. The server running the business
logic forms the application tier and the knowledge base is the data storage tier. Figure 8
graphically illustrates the various components of CoKo's three-tier architecture.
19 including blank lines and comments
4.2.1 Presentation Tier
JSP has been used as the primary technology to render content to the end user's browser.
For the purpose of articulation, the GUI can be categorized into two views: the end user
view and the curator view. However, these views are non-orthogonal.
End User View
The end user view is the interface for knowledge seekers who wish to search for data
in CoKo's knowledge base. It comprises the following two interfaces:
Query interface: It consists of a simple text box where knowledge seekers can input
their query (in English). The user query is sent to CoKo's Search API and, on finding a
match, the corresponding SPARQL query is executed. The data retrieved from query
execution is then fed to the Transformation API to be transformed into Google JSON
format. The transformed data is passed to the Google Visualization API before being
presented to the user via the Result interface.
Result interface: In order to provide intelligible search results, CoKo provides rich
visualizations of the results generated from SPARQL query execution. Search results and
the visualizations can be downloaded or shared via social networking websites, blogs or
email.

Figure 8 CoKo's Three-tier Architecture
Curator View
The curator view is the interface for users engaged in the task of augmenting the
knowledge base. The current implementation of CoKo supports the following levels of
curation:
Contribute new data by uploading datasets which can be local dataset files,
remote dataset files or links to remote endpoints. The datasets can be uploaded
along with data description files through CoKo’s upload tool. It’s a simple interface
which asks the user to specify whether they want to submit a local file or a remote
file and accordingly they can either upload the dataset file along with data
description file or the description file alone. Additionally, they can also upload
new queries for an existing dataset. Once the user publishes a new query, the
application generates a unique URI for the query, which can be used to retrieve
the SPARQL query for remote execution.
Provide feedback about the search results in the following ways:
recommend new visualizations
rate existing visualizations
eliminate an existing SPARQL query-visualization mapping
recommend new keywords for the query
eliminate an existing keyword-SPARQL query mapping
This creates a feedback loop which is stored in the knowledge base and helps to
improve the quality of the results in future.
Distinctive technology used: Google visualization API
There are a large number of high quality graphing and charting libraries available on the
web. For the purpose of this project Google Visualization API was used, as it is easy to
use, well-documented and provides a rich set of interactive visualizations ranging from
bar charts to word clouds. SPARQL XML bindings obtained as a result of SPARQL query
execution can be readily transformed to Google Visualization JSON format, using an XSLT.
The visualizations are rendered using JavaScript, and thus require that the end user's web
browser has JavaScript enabled. Even though the API provides a wide range of
visualizations, for simplicity the current implementation of the prototype only makes use
of Bar Chart, Pie Chart, Line Chart, Table and Map; other visualizations can be easily
added.
4.2.2 Application Tier
This tier consists of Java Bean classes, Java servlet classes and helper classes (non-servlet
classes). Figure 9 provides an overview of data flow between some of the key classes
which are described below:
Search API
This is a servlet class which accepts the user query as input and passes it to the query
handler, receiving in return a ranked list of all the SPARQL query hits found in the Lucene
index. The API can then forward a SPARQL query to the query processing module on
demand to retrieve the SPARQL/XML results, which are then forwarded to the
Transformation API. Additionally, it interacts with the knowledge base to retrieve
additional data about the query and the dataset corresponding to the query, e.g.
provenance information about the dataset, suggested visualizations for query result
representation etc. It binds this information with the query result before sending it to the
user interface. Although this API doesn't provide external access in the current
implementation, it can be easily extended to provide remote search functionality over
CoKo's knowledge base.
Query Handler
This class receives the user query as input from the Search API and normalizes it before
searching the Lucene index for a match. It uses various functions provided by Lucene to
normalize the keywords (discussed later).
Query Processor
This module receives a SPARQL query as input from the Search API. It parses the query to
determine whether it is a generic query (described in Chapter 5) or a general SPARQL
query. In case it is a generic query, it interacts with the knowledge base to retrieve and
execute the corresponding metaquery. The result of the metaquery, along with the
generic query, is then sent to the user interface to be disambiguated by the user. The
value selected by the user is then used to transform the generic query into a general
query, which is sent to the SPARQL engine for execution, and the retrieved results are
sent back to the Search API.
Transformation API
This API receives the SPARQL/XML results retrieved from running the SPARQL query and
transforms them into Google Visualization JSON format. The current implementation of
this API is limited to transforming the results to Google JSON format, but it can be
extended and exposed as a public API to allow transformation of SPARQL query results
using user-supplied XSLT.
Distinctive Technology used: Apache Lucene
Apache Lucene20 is an open source, highly scalable full-text search Java library. It provides
a simple API, focusing mainly on text indexing and searching and allows for easy
integration of these into any application. It is widely used to power websites like LinkedIn,
Twitter Trends - Twitter Analyzing Tool and many more. Wolfram Research also uses
Lucene for its internal tools, the Demonstrations project, the Mathematica documentation
search and for site searching [49].
20 http://lucene.apache.org/
Figure 9 Data flow between key classes
In order to perform a full-text search on a database, an index can be created for the
database fields on which the search is to be performed. Lucene's index structure is based
on the concept of an inverted index21, which allows for fast full-text searches [50]. It
supports ranked searching as well as many different query types, like phrase queries,
wildcard queries etc. Figure 10 provides an overview of the steps involved in building a
full-text search application using Lucene. It primarily involves indexing data, searching
data, and retrieving results.
Since Lucene is completely written in Java, it allowed for easy integration into our servlet-
based application. The Lucene Java library provides a wide array of classes which allow
customizing the way data is indexed, scored and searched. Some of the key Lucene
classes used in our application are:
21 http://xlinux.nist.gov/dads//HTML/invertedIndex.html
Figure 10 Building a full-featured search application using Lucene. Retrieved August 16, 2011, from: http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/index.html
RAMDirectory
An object of class IndexWriter is used to build a Lucene index. Typically the index is file-
based, but the Lucene API provides support for creating an in-memory index as well. For
file-based indexes, a directory name is passed to the IndexWriter constructor, whereas for
an in-memory index an object of class RAMDirectory is passed to the constructor.
Although the index generated by Lucene can also be stored inside a relational database,
this approach is known to have performance issues, especially in cases where the index is
frequently updated. Therefore, I have used Lucene's RAMDirectory class to maintain an
in-memory index of the keywords associated with the queries.
StandardAnalyzer
Analyzers determine how text is segmented and stored in an index, and at search time
they map query terms to find a match in the index. Lucene provides a number of
different analyzers and also allows custom analyzers to be created for an application.
StandardAnalyzer is one such analyzer; it filters the text by converting it to lower case and
removes stop-words and other characters, such as the dots in acronyms and
apostrophes (').
AnalyzerUtil
This class offers various methods for full-text analysis such as stemming, retrieving
frequently occurring terms etc. One of the methods provided by this class is
getPorterStemmerAnalyzer(), which returns an English stemming analyzer that uses the
Porter stemming algorithm22 to stem tokens from the underlying child analyzer.
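As a minimal sketch of how these classes fit together, assuming the Lucene 3.x API that was current at the time of writing (the field name "keyword" and the indexed phrase are illustrative, not taken from CoKo's source), an in-memory keyword index can be built and searched as follows:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class KeywordIndexSketch {
    public static void main(String[] args) throws Exception {
        // In-memory index: a RAMDirectory is used instead of a file-system directory.
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

        // Index one keyword phrase under the (illustrative) field name "keyword".
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_33, analyzer));
        Document doc = new Document();
        doc.add(new Field("keyword", "Population of Australia",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search the index with the same analyzer; hits are ranked by Lucene's scoring.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
        QueryParser parser = new QueryParser(Version.LUCENE_33, "keyword", analyzer);
        TopDocs hits = searcher.search(parser.parse("population of australia"), 10);
        System.out.println("Matches: " + hits.totalHits);
        searcher.close();
    }
}

To make matching tolerant of different word forms, the StandardAnalyzer above could be wrapped with a stemming analyzer, such as the one returned by AnalyzerUtil.getPorterStemmerAnalyzer(...).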
4.2.3 Data Tier
This tier forms CoKo's knowledge base. It has two components: a triple store, which is
used to store the RDF-based knowledge representations, and an RDBMS, which stores
data descriptions and feedback. A triple store is a special-purpose database designed to
provide persistent storage of and access to RDF graphs via its APIs and query languages.
22 http://tartarus.org/~martin/PorterStemmer/
A unique data id is generated for each dataset uploaded to the system (including
endpoints) and is stored in the RDBMS along with the data descriptions provided by the
curator. Each RDF dataset file uploaded to the system is saved in a new named graph,
whose name is the same as the data id.
Distinctive Technology used: OpenLink Open-Source Virtuoso
There is a wide range of commercial and open-source triple stores available, but for the
purpose of this project OpenLink Virtuoso was used. The open-source OpenLink Virtuoso
is the non-commercial edition of Virtuoso Universal Server23. It is an object-relational
database engine extended into an RDF triple store [51] and provides database
management for RDF, SQL and XML data. It supports the N3/N-Triples and RDF/XML RDF
data serializations. It also supports the SPARQL Query Language, Query Protocol, XML
Query Results Serialization and named graph functionality, and provides a web server to
execute SPARQL queries and upload data over HTTP.
The triples uploaded into the Virtuoso triple store are stored inside a table having four
columns, one for each of GraphID, Subject, Predicate and Object [51]. Each dataset
uploaded into the triple store is assigned a unique graph IRI, which can be used to
execute SPARQL queries over the data in that file. In order to query all the triples in the
triple store, the virt:sponger property needs to be set to "yes" and the rdf:graph property
to the desired Internationalized Resource Identifier (IRI)24; this yields the IRI that can be
used to query all the RDF triples in the triple store.
In addition to storing the RDF datasets in the triple store, Virtuoso's relational database
engine was used to store the data descriptions provided by the curator. The supplied
description file is parsed to retrieve the data, which is then stored in relational database
tables. The data in these tables is then accessed with the help of the Virtuoso JDBC driver.
Since Virtuoso offers SPARQL inside SQL [52], this driver can also be used to execute
SPARQL queries: the SPARQL query is simply prefixed with the SPARQL keyword to
distinguish it from SQL. Internally, SPARQL is translated into SQL at the time of parsing the
query.
23 http://virtuoso.openlinksw.com/
24 http://www.w3.org/International/
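As a rough illustration (not CoKo's actual code), and assuming the Virtuoso JDBC 4 driver class virtuoso.jdbc4.Driver with a default local installation, where the connection details, credentials and graph IRI below are placeholders, a SPARQL query can be issued through plain JDBC simply by prefixing it with the SPARQL keyword:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VirtuosoSparqlSketch {
    public static void main(String[] args) throws Exception {
        // Driver class name and connection details are placeholders for a local Virtuoso install.
        Class.forName("virtuoso.jdbc4.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:virtuoso://localhost:1111", "dba", "dba");

        // "SPARQL inside SQL": the SPARQL keyword at the start tells Virtuoso to
        // translate the rest into SQL internally; the graph IRI is illustrative.
        String sparql = "SPARQL SELECT ?s ?p ?o FROM <urn:coko:dataset:42> "
                + "WHERE { ?s ?p ?o } LIMIT 10";

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(sparql);
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}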
Even though Virtuoso provides free-text indexing capability for text and XML data, it
doesn't support stemming, and therefore Apache Lucene was used to provide full-text
search support with stemming.
Chapter 5
Design Decisions
Several design decisions were made during the implementation of CoKo, in order to
accommodate some of the requirements and to optimize the usability of the system. This
chapter discusses the key decisions and some of the design artifacts produced as a result
of those decisions.
5.1 Data Set Description Language (DSDL)
DSDL is the Data Set Description Language, an XML-based representation language
designed from the ground up to capture metadata about datasets. DSDL was conceived to
serve the following purposes:
i) Provide a standardized format for curators to describe their data.
ii) Collect metadata to support evaluation of the quality and trustworthiness of data
based on its source.
iii) Identify the rightful contributor of the data to enable proper attribution.
iv) Collect some typical queries and related visualizations.
With these purposes in mind, the major challenge was designing a generic format that
allows curators to describe their data. For this, an XML format was devised to capture the
metadata of the datasets, including informational metadata about the data set, like
descriptions of its source and owner, as well as presentational metadata, like the
preferred visualization for a particular query result. The DSDL schema consists of two key
elements: <general>, which encloses elements describing the data set as a whole, and
<presentation>, which encloses elements describing the user-supplied queries and
relevant visualizations of the retrieved results. Table 5 describes the purpose of each
element of the schema; for the full version of the schema refer to Appendix B. DSDL was
designed to concisely capture data pertinent to the data set being published along with
some canonical interesting queries; it is portable and can be reused by any bespoke
application.
Elements of <general>:
<owner>: Encloses descriptive information about the owner of the data set. Includes <name>, <email> and <url> child elements.
<source>: Encloses descriptive information about the source of the data set. Includes <name>, <email> and <url> child elements.
<name>: Identifies the name of the dataset.
<type>: Identifies whether the dataset being uploaded is a file or a reference to a SPARQL endpoint.
<url>: A link to a SPARQL endpoint or to a remote RDF file.
<creator>: Encloses descriptive information about the creator of the data set. Includes <name>, <email> and <url> child elements.
<lastEditor>: Encloses descriptive information about either the last editor or the uploader of the data set. Includes <name>, <email> and <url> child elements.
<date>: Encloses <dateTo> and <dateFrom> elements to represent the validity of the data set (if applicable).
<licenceInfo>: Represents licensing information about the data set.
<description>: Textual description of the data set, i.e. what kind of data is present in the data set.

Elements of <presentation>:
<import-data>: Identifies the name of the graph, if the associated SPARQL query is to be evaluated against more than one data set. In case of an individual query upload, it can be used to provide the name of an existing graph against which the query is to be executed.
<query>: Identifies a SPARQL query. It can be a general SPARQL query or a generic query (explained in the next section).
<description>: Textual description of the SPARQL query, i.e. what information it generates.
<meta-query>: Identifies a metaquery, a special-purpose SPARQL query used to retrieve values to be supplied to generic queries (explained in the next section). It is optional and can be omitted in case of general SPARQL queries.
<keywords>: Identifies keyword-property mappings.
<property>: Identifies the original term used in the vocabulary of the corresponding data set. It is optional and can be omitted in case of general SPARQL queries.
<keyword>: Identifies the keywords associated with the query.
<visualisation> (attribute: rating): Identifies the preferred visualizations for the corresponding SPARQL query's result. The value of the rating attribute is used to determine the order in which the visualizations are presented to the user.

Table 5 Data Set Description Language Elements
The schema of DSDL resembles the DSPL25 language created by Google to process data
for use in the Public Data Explorer26, but it is not as elaborate as DSPL and only captures
metadata sufficient to support provenance, query retrieval and query result rendering.
Even though users are offered rich visualizations in return for uploading their data sets
using DSPL, this is overshadowed by the complexity of describing data in DSPL: even for
relatively small and simple data sets, it demands a lot of explicit description before it is
able to render any visualizations, which can be discouraging for users. Therefore, DSDL
was designed to have a simple format which is easy to comprehend and easy to use.
5.2 Types of SPARQL queries
The initial design of CoKo allowed curators to provide only targeted queries, which merely
provided a one-to-one mapping of keywords to a SPARQL query. For the purpose of
articulation, we will use the following example for this section and the next.
Example 5.2.1: A curator named Bob has a dataset which contains data about countries,
their population and literacy rates. He decides to upload his dataset to CoKo along with a
basic SPARQL query to retrieve the population of "Australia". He uses the following
SPARQL query for this purpose:

SELECT ?population WHERE {
  ?s ns:population ?population ; ns:name "Australia" .
}

In order to submit the above query to CoKo, Bob uses the fragment of DSDL provided in
Listing 1 to describe the query.
25 http://code.google.com/apis/publicdata/
26 http://www.google.com/publicdata/home
Now Bob wants to upload similar queries for all the countries in the dataset, but the
above description with hard-coded queries would be impractical and cumbersome.
To overcome this situation, and to make the description of SPARQL queries and
associated keywords less verbose, two types of queries were introduced:
i) Generic queries
These are provisional SPARQL queries which cannot be directly evaluated by the
SPARQL engine due to the presence of meta-variables. A meta-variable is a special
variable used as a placeholder. These placeholders are supplied with values from the
results of metaquery execution. Once the placeholders are replaced with appropriate
values, these queries are sent to the SPARQL engine for execution.
ii) Metaqueries
These are special-purpose SPARQL queries used to retrieve values for the
placeholders in the generic queries.
Using the aforementioned types of queries, Bob can now provide a data description for a
query covering all countries in his data set. To accomplish this, he can use the fragment of
DSDL provided in Listing 2.
Listing 1. Fragment of DSDL describing a SPARQL query^
…
<query>
  SELECT ?population WHERE {
    ?s ns:population ?population ; ns:name "Australia" .
  }
</query>
<keywords>
  <keyword>Population of Australia</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Listing 2. Fragment of DSDL describing a SPARQL query using generic and meta queries^
…
<query>
  SELECT ?population WHERE {
    ?s ns:population ?population ; ns:name ?o .
    FILTER regex(?o, "{0}", "i")
  }
</query>
<meta-query>
  SELECT ?country WHERE {
    ?s rdfs:label ?country
  }
</meta-query>
<keywords>
  <keyword>Population of {0}</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Figure 11 is a schematic representation of how the meta-variables are replaced with the
results retrieved from execution of a metaquery.
Figure 11 Illustration of meta-variable replacement with the result from metaquery execution
CoKo relies on clarification dialogues with the end user, to determine the appropriate
value to be plugged into the generic query placeholders. Using the value selected by the
user, a generic query is transformed into an executable query. The flow chart in Figure 12
exhibits the flow of operations involved in mapping a user query to a SPARQL query.
(Flow: the user query is matched against keywords in the knowledge base; if the matched
keyword contains meta-variables, the metaquery is executed, the user is prompted to
disambiguate by selecting a value from the retrieved results, and the placeholder is
replaced with the selected value; the resulting SPARQL query is then executed.)
Figure 12 Flow of operations involved in mapping a user query to a SPARQL query
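The placeholder substitution itself is plain string processing. The sketch below (with illustrative names, not CoKo's actual classes) turns a generic query into an executable one once the user has picked a value from the metaquery results; in practice the selected value should also be escaped before being inserted into the query.

public class GenericQuerySketch {
    // Fills the {0} placeholder of a generic query (or keyword) with the value
    // the user selected from the metaquery results.
    static String instantiate(String genericQuery, String userSelectedValue) {
        return genericQuery.replace("{0}", userSelectedValue);
    }

    public static void main(String[] args) {
        String genericQuery =
                "SELECT ?population WHERE { "
              + "?s ns:population ?population ; ns:name ?o . "
              + "FILTER regex(?o, \"{0}\", \"i\") }";
        // Value chosen by the user during the clarification dialogue.
        System.out.println(instantiate(genericQuery, "Australia"));
    }
}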
5.3 Property Mapping
Metaqueries and generic queries enable curators to provide queries covering a wide
range of data, but they were still not sufficient to describe a dataset efficiently: there was
some repetition due to hard-coded property values.
Continuing Example 5.2.1, Bob wants to upload a query to retrieve the literacy rate for
each country, but the queries are syntactically similar, the only difference being the name
of the property.
To overcome the inconvenience of hard-coded property values, we incorporated a new
class of placeholders into the DSDL design. The initial strategy was to ask curators for a
list of vocabulary terms which a query could handle, which would then be normalized into
keywords and stored in the database. For example, terms like based_near27 and
populationTotal28 defined in the DBPedia vocabulary can easily be normalized to "based
near" and "population Total", using string manipulation functions to split the term at
underscores and at uppercase letters respectively. However, during functional testing of
the application, certain vocabularies were found to use terms which cannot be normalized
in this way. For example, the CIA Factbook vocabulary uses terms like
populationgrowthrate29 and internetusers30, which cannot be normalized into meaningful
keywords. Consequently, we introduced a new element <property> in the DSDL to
represent the original term used in the vocabulary, which is mapped to a normalized term
in the keyword element supplied by the curator.
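A minimal sketch of this normalization (splitting at underscores and before uppercase letters) is given below; the class and method names are illustrative rather than taken from CoKo's source.

public class TermNormalizer {
    // Normalizes a vocabulary term into a human-readable keyword by replacing
    // underscores with spaces and inserting a space before each uppercase letter,
    // e.g. "based_near" -> "based near", "populationTotal" -> "population Total".
    static String normalize(String term) {
        String withSpaces = term.replace('_', ' ');
        return withSpaces.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("based_near"));            // based near
        System.out.println(normalize("populationTotal"));       // population Total
        System.out.println(normalize("populationgrowthrate"));  // unchanged: no separators to split on
    }
}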
Building on Example 5.2.1, Bob can now use the XML fragment provided in Listing 3 to
represent a cluster of queries covering all the properties in his data set. This relieves him
of the burden of explicitly creating queries for each property described in his data set.
27 http://mappings.dbpedia.org/index.php/OntologyProperty:Foaf:based_near
28 http://mappings.dbpedia.org/index.php/OntologyProperty:PopulationTotal
29 http://www4.wiwiss.fu-berlin.de/factbook/ns#populationgrowthrate
30 http://www4.wiwiss.fu-berlin.de/factbook/ns#internetusers
Listing 3. Fragment of DSDL describing a SPARQL query using property mappings^
…
<query>
  SELECT ?value WHERE {
    ?s ns:{%property} ?value ; ns:name ?o .
    FILTER regex(?o, "{0}", "i")
  }
</query>
<meta-query>
  SELECT ?country WHERE {
    ?s rdfs:label ?country
  }
</meta-query>
<keywords>
  <property>population</property>
  <keyword>Population of {0}</keyword>
</keywords>
<keywords>
  <property>literacy_rate</property>
  <keyword>Literacy rate of {0}</keyword>
</keywords>
…
^ Namespace prefix statements omitted for brevity
Chapter 6
Evaluation
CoKo was developed as a proof-of-concept prototype to assess the potential and
feasibility of a QA system built on top of a collaboratively curated Linked Data knowledge
base. In order to scrutinize the proposed system effectively, two case studies were
devised for a detailed cognitive walkthrough of the application. The focus of these case
studies was to evaluate the functionality of the application and identify its strengths and
weaknesses. This chapter describes the two case studies, followed by a discussion of the
strengths and weaknesses of the application in handling the particular tasks elucidated in
each of them.
6.1 Case Studies
All the functional steps described in these case studies have been recorded and are
available online31.
Case Study 1: Contained dataset
For the purpose of this case study, we will consider a dataset representing information
about Internet usage by rural and urban households in various states of the US, published
by the National Telecommunications and Information Administration (NTIA). The RDF
version of this dataset is available through the Data-gov Wiki32. We can incorporate this
dataset into CoKo's knowledge base so as to provide some typical analysis of the data. We
will use the SPARQL query given in Listing 6.1 a) to compare rural and urban broadband
usage for the various states of the US.
31 http://fishdelish.cs.man.ac.uk:5001/CoKo/evaluation.jsp
32 http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf
The dataset can be integrated into CoKo's knowledge base using any one of the following
techniques:
i) Download the dataset file locally from the Data-gov Wiki and then upload it using
CoKo's upload interface.
ii) Provide the URL of the dataset file in the DSDL file. The corresponding dataset will be
retrieved and loaded into CoKo's knowledge base automatically with the help of the given
URL.
iii) Provide the URL of the Data-gov SPARQL Endpoint33 in the DSDL file and reformulate
the query in Listing 6.1 a) to specify the URI of the dataset in the GRAPH clause
(Appendix C.1).
For the first two techniques, the current system works on the principle of a single upload
without refresh, i.e. the data is stored locally in CoKo's knowledge base and is not checked
for updates. In the case of the third technique, queries are processed in an ad-hoc fashion
and thus any changes to the dataset are automatically reflected in query results.
Therefore, (iii) is more efficient in cases where we want changes in a dynamic data source
to be reflected in query results. However, if the publisher of the dataset doesn't provide a
public SPARQL endpoint, we will have to manually update CoKo's knowledge base each
time the data changes.
To load the dataset we carry out the following steps:
33 http://data-gov.tw.rpi.edu/sparql
Listing 6.1 a)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
  ?s d:state ?state .
  ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
  ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
  FILTER (?state != "TOTAL HOUSEHOLDS")
} ORDER BY ?state
Listing 6.1 b)
<general>
  <owner>
    <name>National Telecommunications and Information Administration</name>
    <url>http://www.ntia.doc.gov/</url>
  </owner>
  <source>
    <name>National Telecommunications and Information Administration survey</name>
    <type>file</type>
    <url>http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf</url>
  </source>
  <creator>
    <name>National Telecommunications and Information Administration</name>
  </creator>
  <lastEditor>
    <name>Priyam</name>
    <email>[email protected]</email>
  </lastEditor>
  <date>
    <from>02/03/2010</from>
  </date>
  <licenceInfo>Open Data</licenceInfo>
  <description>Households using the Internet in and outside the home, by selected characteristics: Total, Urban, Rural, Principal City, 2009</description>
</general>
1. Create the DSDL file using the schema provided in Appendix B. As discussed in
Chapter 5 the dataset description contains two key elements <general> and
<presentation> which we can populate for our dataset as given below (for full
version of DSDL refer to Appendix C.2):
<general> element encloses elements which describe the dataset (Listing 6.1 b).
<owner> and <creator> contain information about NTIA, as it’s the publisher of
the dataset.
<source> element’s child element <name> identifies from where the data was
generated, which in this case is NTIA Survey. Since our dataset is a static dataset,
we can simply upload it by providing the URL of the dataset in <url> and the value
“file” in <type>.
<lastEditor> identifies the uploader of the dataset, which in this case is the author
of this thesis.
<date> contains information about the validity period of a dataset, where the child
element <from> identifies when the dataset came into existence and the element
<to> identifies the date after which the data becomes invalid. As given on the
Data-gov Wiki34 page for this dataset, it was created on "2 March 2010" and, due
to the nature of the data, it will not become invalid. Therefore, the <to> element has
been omitted and the <from> element contains the date of creation of the dataset.
<licenceInfo> contains information about any licence specifications associated
with the dataset.
<description> contains a textual description of the dataset as a whole.
<presentation> element encloses elements which describe the SPARQL query and the
visualization (Listing 6.1 c)
<description> contains textual description of the SPARQL query as a whole
<query> encloses the SPARQL query. Reserved signs like angle brackets must be
escaped.
<keywords> encloses keywords associated with the query
34 http://iw.rpi.edu/wiki/Dataset_10040
Listing 6.1 c)
<presentation>
  <description>Compare rural and urban internet usage for various states in US</description>
  <query>
    PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
    SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
      ?s d:state ?state .
      ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
      ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
      FILTER (?state != "TOTAL HOUSEHOLDS")
    } ORDER BY ?state
  </query>
  <keywords>
    <keyword>rural vs urban Broadband Internet Use</keyword>
  </keywords>
  <visualisation rating="10">ColumnChart</visualisation>
</presentation>
<visualisation> provides the name of the visualization and its rating on a scale of
1-10 (10 being the highest). Since we are comparing two sets of values, an
appropriate visualization is a Column Chart. Care needs to be taken about the
sequence of output variables while formulating the query: the Google Visualization
API expects a particular sequence for the types of values it accepts (refer to
Appendix D), and CoKo is unable to automatically modify the sequence of variables
in the SPARQL query result so as to satisfy the requirements of a particular
visualization.
2. Upload the DSDL file created in the above step through CoKo's upload interface. After
uploading the file, a unique URI is generated for our SPARQL query, which can be used for
remote SPARQL query execution, thus enabling us to reuse the query.
By creating a file spanning ~40 lines of text and following the above two-step process, we
were able to share the dataset, along with a possible visualization of the result, reasonably
easily.
Once the DSDL file is uploaded, the dataset is immediately available for analysis. We can
access CoKo's search interface and analyze the results of our SPARQL query by entering
the keyword-based query "Rural vs Urban broadband internet use". The result interface
displays the result of the SPARQL query on a bar chart, along with the information about
the query and the dataset supplied by us in DSDL. We can easily provide an additional
visualization for the result by clicking the "Edit Visualization" button on the result
interface, which allows us to make the following changes related to the visualization:
i) Provide a new visualization
ii) Update the rating of an existing visualization
iii) Eliminate an existing visualization
We can also share the visualization through a website or a blog by copying the "Embed
visualization" URL available on the result interface. Query results are also available in
XML format and can be downloaded. We can transform this XML result back to RDF
format and upload it to CoKo, thus enriching the knowledge base with data derived from
data.
Listing 6.2 b)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state WHERE {
  ?s d:state ?state .
  FILTER (?state != "TOTAL HOUSEHOLDS") .
} ORDER BY ?state
Furthermore, if we reformulate the SPARQL query given in Listing 6.1 a) to use CoKo's
generic SPARQL query provision, we will be able to compare the rural and urban
broadband usage for each individual state rather than for all states at once.
CoKo doesn't allow us to modify an existing query, since this might render the query
unusable for its original publisher. For example, if we were allowed to update the original
query in this case, it could no longer be used as a remote query, due to the presence of
meta-variables. Therefore, we can only add new queries to the system.
We can upload the additional query as follows:
1. Identify the dataset id of the existing dataset using CoKo's data interface, which
provides an overview of all datasets available in CoKo's knowledge base. The dataset
id is a unique id generated by CoKo for each dataset or endpoint uploaded to its
knowledge base. This id can also be used in the GRAPH clause of SPARQL queries to
refer to the datasets.
2. Formulate a SPARQL query using meta-variables and a supporting metaquery to
retrieve values for meta-variables.
Listing 6.2 a)
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband WHERE {
  ?s d:state ?state .
  ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
  ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
  FILTER (?state != "TOTAL HOUSEHOLDS") .
  FILTER regex(?state, "{0}", "i") .
} ORDER BY ?state
3. Upload the query using a truncated version of the previous DSDL, containing
descriptions only in <presentation> element (refer to Appendix C.3). The data id
from step 1 is given in <import-data> element, which automatically links the query
to the existing dataset. Thus, we do not have to provide details about the dataset
again.
Now when we enter the query "Rural and Urban Broadband Internet Use in each state"
in CoKo's search interface, we are prompted to select the name of the state for which we
want to view the results. We can either select a particular state or select the value "All",
in which case the same result as that of the previous query (Listing 6.1 a) is displayed.
CoKo's search mechanism is keyword-sensitive and it doesn't support synonym search.
Therefore, the user query plays a central role in finding a keyword match in the
knowledge base. Although the search is lenient about misspelled words, one or more
keywords in the user query should exactly match those present in the knowledge base. In
case more than one match is found in the knowledge base for a user query, it displays the
results of the SPARQL query associated with the keyword having the highest score35, and
a ranked list of links to other query results is displayed at the bottom of the result
interface.
Case Study 2: Distributed Datasets
For this case study we will start by uploading a query to retrieve “female literacy rates”
(Listing 6.3 a) for countries represented in the CIA Factbook dataset36.
35 based on Lucene Scoring (http://lucene.apache.org/java/3_3_0/scoring.html)
36 http://www4.wiwiss.fu-berlin.de/factbook/
Listing 6.3 a)
PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
SELECT ?female_literacy_rate ?country WHERE {
  ?s ns:literacy_female ?female_literacy_rate ; ns:name ?country
} ORDER BY ?country
The steps for creating and uploading the DSDL file are the same as those described for the
previous case study, the only difference being in the <source> element (Listing 6.3 b). We
provide the value "endpoint" for the <type> element, and the <url> element contains the
URL of the CIA Factbook endpoint37. For the full version of the DSDL please refer to
Appendix C.4.
The SPARQL query given in Listing 6.3 a) can be expanded with the help of property
mappings (elucidated in Chapter 5) to retrieve values for other similar properties
described in the CIA Factbook (e.g. male literacy rate, total literacy rate, population etc.).
In order to expand the scope of the SPARQL query, we load it using the DSDL fragment
given in Appendix C.5. We can similarly provide property mappings for other properties
described in the dataset.
Next, we will upload the following query (Listing 6.4 a), which retrieves the female literacy
rates for each country given in CIA Factbook, along with the URL of that country’s
Wikipedia page.
37 http://www4.wiwiss.fu-berlin.de/factbook/sparql
Listing 6.4 a)
PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?country ?female_literacy_rate ?wikiPage WHERE {
  ?s ns:literacy_female ?female_literacy_rate .
  ?s ns:name ?country .
  ?DBcountry a dbpedia:Country .
  ?DBcountry owl:sameAs ?s .
  ?DBcountry foaf:page ?wikiPage
} ORDER BY ?country
Listing 6.3 b)
<source>
  <name>CIA factbook</name>
  <type>endpoint</type>
  <url>http://www4.wiwiss.fu-berlin.de/factbook/sparql</url>
</source>
This query executes over the union of the CIA Factbook and DBPedia38 graphs. Therefore,
we use the Linked Open Data Cloud Cache Endpoint39 to execute the query.
The current system mandates that a query be associated with at least one dataset, which
the system can identify as the master dataset for that query. The source information
associated with this master dataset is used to determine the data source against which
the query will be executed (CoKo or a remote endpoint). For the query given in
Listing 6.4 a) we can identify either CIA Factbook or DBPedia as the master dataset and
provide the dataset ID of the other dataset in the <import-data> element. Since we have
already loaded the CIA Factbook, let us use DBPedia as the master dataset in this case.
The DSDL file can be generated in a similar fashion to the previous datasets (for the full
version of the DSDL refer to Appendix C.6). The endpoint to which the query is sent for
execution is determined by the value of the source URL, which is linked with the dataset
description rather than the SPARQL query description in the DSDL. Therefore, we have to
give the Linked Open Data Cloud Cache Endpoint, and not the DBPedia endpoint, as the
source URL (Listing 6.4 b).
6.2 Overall Evaluation
The above two case studies enabled us to identify the following key strengths and
weaknesses of the current design of the application.
Strengths
Simplicity
Ease of uploading and sharing a dataset along with some useful queries to analyze
the dataset.
38 http://dbpedia.org
39 http://lod.openlinksw.com/sparql
Listing 6.4 b)
<source>
  <name>DBPedia</name>
  <type>endpoint</type>
  <url>http://lod.openlinksw.com/sparql</url>
</source>
Endpoint wrapping
Endpoint wrapping enables the system to seamlessly pull in and utilize data from huge
and dynamic data sources without having to replicate the data.
SPARQL query aggregation
Enables aggregation of useful SPARQL queries, which can be reused with the help
of the unique query URI.
Scope of queries expanded
With the help of generic queries and property mappings, curators are able to
expand the scope of their queries.
Weaknesses
Keyword sensitivity
In order to find a match, user queries must contain at least one term exactly as
contained in CoKo's knowledge base. Additionally, there is no support for synonym
search.
Visualization mapped only to query
Currently a visualization is only mapped to a query, and therefore the curator needs to
be careful about the output format while formulating a query.
Endpoint information mapped only to dataset
Potentially the URL of the endpoint should be associated not only with the dataset
but also with the SPARQL query. This would allow the query to be evaluated against
an endpoint other than that of the dataset.
No support for remote dataset refresh
Due to the lack of support for remote dataset refresh, data publishers with dynamic
datasets and no public endpoint have to manually push a fresh dataset into CoKo's
knowledge base each time the dataset is updated.
6.3 Contender for Semantic Web Challenge
In order to be accepted as a valid entry to the Semantic Web Challenge, an application
needs to at least meet the minimal requirements defined by the organizers. It is evaluated
by the judges of the competition for fulfillment of the requirements before being
accepted as a contender for the Challenge, and therefore CoKo was designed around
these requirements. It is still a prototype and therefore has not been specifically
evaluated against the competition criteria. With the help of Table 6 and Table 7, we can
theoretically analyze its progress towards accomplishing the requirements for the
competition, as given in Section 2.1 of this thesis.
Minimal requirements

Requirement 1: End-user application.
Progress: It is an end-user application which provides practical value not only to casual Web users but also to domain experts. Casual Web users can query the large, user-curated knowledge base to obtain concise answers accompanied by rich, intuitive and meaningful visualizations. Domain experts can upload datasets from their domain of expertise in order to utilize the application's reasoning and inference capabilities to obtain answers from their domain of interest.

Requirement 2: Information sources.
Progress: The application's knowledge base is built and maintained in a collaborative environment. Users are the creators and curators of information; therefore it is under diverse ownership and control and is highly heterogeneous. The application will contain real-world data, assuming users upload a substantial amount of real-world data in order to obtain useful answers to their questions.

Requirement 3: Meaning of data. The meaning of data plays a central role in deriving answers to user questions about implicit knowledge.
Progress: Data is processed in order to answer user questions about implicit knowledge. Without semantic processing of information, certain user questions would be impossible to answer (as explained in the Introduction section). Semantic processing is also important to generate appropriate and meaningful visualizations.

Table 6 Progress towards fulfilling minimal requirements of the SWC
Additional desirable features
The design of the application doesn't meet all the additional desirable requirements at
present, but due to its extensible nature it can be extended to meet further requirements
as well.
Requirement 1: User interface.
Progress: The application is designed to have a simple search box for query input and a highly intuitive result interface integrated with rich, meaningful visualization of the result.

Requirement 2: Innovative use of semantic technology.
Progress: The application is designed to provide an end-to-end solution from Linked Data publishing to making it useful for a casual web user. It provides a suite of different uses which have previously been pursued individually, but not integrated together into a single application.

Requirement 3: Functionality.
Progress: The functionality of the application goes beyond information retrieval, by presenting the results in a useful format with the help of rich visualisation techniques.

Requirement 4: Use of dynamic and static data.
Progress: Users are allowed to publish their local static RDF dataset files as well as the URL of an endpoint, which allows interaction with a dynamic dataset.

Requirement 5: Contextual information.
Progress: All the answers are provided with their source; therefore, users can decide the trustworthiness of the sources. Users are also encouraged to rate the visualizations returned in response to their queries.

Table 7 Progress towards fulfilling additional requirements of the SWC
Chapter 7
Conclusion and Future Work
Although the design and approach described in this thesis are still far from realizing a completely automated collaborative QA system, they provide the groundwork for future extensions to the system. This chapter presents a reflection on the current state of CoKo, followed by a discussion of suggestions for a future attempt at the system.
7.1 Reflection
A tremendous amount of data is now being published as structured linked data on the web, but merely publishing data as linked data serves very little purpose in realizing its real worth. For general web users this data is far from usable and analyzable. CoKo is an end-to-end system for sharing and curating linked data in a collaborative QA setup. It is a step towards making linked data sources more accessible to general web users by offering an easy-to-use interface for querying these data sources and providing rich, intelligible visualization of the data. Figure 13 contrasts a typical linked dataset file with the visualization produced by CoKo, to assist the end user in analyzing the data in the dataset.
[Figure 13 From pile of triples to an intelligible interface: a raw dataset file (“pile of triples”) shown alongside the CoKo result interface]
CoKo is available online40 and at the time of writing this thesis it could answer over 20 questions, based on 5 datasets and 3 endpoints which were fed to the system during the evaluation and functional testing phase. This milestone was achieved with a setup and development cycle spanning three months. The major overhead of the setup and implementation phase was the difficulty of understanding the workings of the Virtuoso server, due to the lack of clarity in its documentation. After the initial challenges with setup and design, the development of the application was relatively straightforward. Java was chosen as the programming language because of the author’s familiarity with it, but any other server-side language would have worked equally well for the application. Semanticweb.com41 is an active forum of Semantic Web experts and proved to be a helpful resource.
Two objectives were defined in the introduction section of this thesis, and Table 8 describes to what extent these objectives have been achieved.

Technical objective: An end-to-end system for sharing and curating linked data in a collaborative environment.
Progress: Evaluation showed that our approach successfully supports publishers in sharing a small dataset along with some useful queries. It is also easy for other users to give the system feedback about the results and to add new queries for existing data.
Scope for improvement: Even for large datasets, the initial publishing of a dataset to CoKo’s KB is easy, but how to reduce the burden of curation and query generation on the curator is still an unsolved problem.

Functional objective: An automatic open-domain question answering system.
Progress: The QA system is not yet mature, but with a reasonable amount of data description provided by a curator it is able to provide answers for keyword-based user queries.
Scope for improvement: The QA system can be extended to improve the GUI and NLP support.

Table 8 Progress of CoKo towards meeting its objectives
40 http://fishdelish.cs.man.ac.uk:5001/CoKo
41 http://answers.semanticweb.com/
7.2 Problems which still need to be solved
How to handle duplicate data upload?
Although the developed prototype is strict about duplicate query uploads and ensures that there are no syntactically duplicate SPARQL queries, it is tolerant of duplicate data uploads: it does not check for duplicates and loads every dataset into a new named graph.
Duplicate data not only increases the cost of storing the data, it also increases query processing time and hampers the curation process. Duplicate data can enter the system either because two individuals upload the same dataset, or because two datasets describe identical entities with different identifiers. For example, DBpedia and the CIA Factbook use different URIs for the same country [53]. A standard way to reconcile such identical entries is the owl:sameAs property, but this has not been explored in the current implementation.
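As an illustration only (and not part of the current implementation), the following minimal sketch shows how such a reconciliation link could be asserted, assuming the Apache Jena library were available; the two resource URIs are hypothetical examples standing in for the same country as published by two different datasets.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.OWL;

public class SameAsLink {
    public static void main(String[] args) {
        Model links = ModelFactory.createDefaultModel();

        // Hypothetical URIs for the same country in two uploaded datasets
        Resource dbpediaIndia  = links.createResource("http://dbpedia.org/resource/India");
        Resource factbookIndia = links.createResource("http://example.org/factbook/India");

        // Assert that both URIs denote the same real-world entity
        links.add(dbpediaIndia, OWL.sameAs, factbookIndia);

        // The resulting triples could be kept in a dedicated "reconciliation"
        // named graph, so that queries may optionally follow owl:sameAs links
        links.write(System.out, "TURTLE");
    }
}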
Is the system scalable?
The scalability of the current system has not been evaluated and should be considered in any future attempt. The web application needs to be scalable not only in the number of concurrent connections it can support but also in the amount of data it can process and store.
Virtuoso seems to be a good choice as a triple store and has been shown to scale well with large datasets. The Data.gov SPARQL endpoint42, which stores over 6 billion triples on a single open-source Virtuoso instance [54], is a good illustration of its scalability. Setting up a Virtuoso server does, however, require considerable effort because of the lack of clarity in its documentation. The W3C Wiki43 provides a list of large-scale triple stores and also discusses their potential scalability.
42 http://services.data.gov/sparql
43 http://www.w3.org/wiki/LargeTripleStores
How to improve DSDL?
The current version of DSDL has a fairly minimalist schema, focused on collecting enough metadata to capture provenance and to support query retrieval and query result rendering. This schema is expected to evolve in future implementations to further support the task of curation. During the evaluation phase, two issues were identified with the current format of DSDL (refer to Section 6.2 of this thesis):
Endpoint information mapped only with dataset.
The endpoint information should potentially be associated not only with the dataset but also with the SPARQL query.
Visualization mapped only with query.
Curators should be able to specify an alternative ordering of SPARQL query result variables for different visualizations.
7.3 Suggestions for the future
The design, development and evaluation of this first prototype of CoKo has uncovered some interesting points about the application which are worth recording in this thesis. These are features which were envisioned during the development and evaluation process but could not be implemented due to the limitation of time. Some are essential features that should be considered in any future attempt, while others are good-to-have extensions which would help to improve the user experience.
7.3.1 Critical
Minimize curation effort
Meta queries and generic queries were implemented to support the curators in the task of describing large datasets, and this has indeed lowered the burden on the curators. Some other designs were also considered before generic queries were implemented.
Additional Dispatchers: A dispatcher is essentially a function which takes user input and returns an appropriate query result and visualization. Currently, the system uses a single bespoke dispatcher, keyword search, which has been enhanced with stemming, metaqueries and generic queries. Another design which was considered was to allow curators to associate the main SPARQL query with another SPARQL query, a keyword-SPARQL query, instead of textual keywords. These keyword-SPARQL queries, together with the main SPARQL query, form a new dispatcher. The keyword-SPARQL queries would be syntactically similar to the generic queries of the current design and would accept key terms from the user query as input for the placeholders. If the execution of a keyword-SPARQL query returns a result, the associated main SPARQL query is retrieved and executed, as illustrated in Figure 14. Once this is implemented, a keyword search in CoKo’s KB would be followed by the execution of a list of dispatchers to check whether any of them return results. Such dispatchers could also be built into the system for key sites like DBpedia, Data.gov etc. Care needs to be taken that a badly behaved dispatcher does not break the application, so a time-out should be set for each dispatcher.
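The following is a minimal sketch of how such a list of dispatchers could be executed with per-dispatcher time-outs, assuming plain Java (8 or later); the Dispatcher interface and DispatchResult class are hypothetical names and are not part of the current CoKo code base.

import java.util.List;
import java.util.concurrent.*;

interface Dispatcher {
    // Returns a result (query output plus visualisation hint) or null if no match
    DispatchResult dispatch(String userQuery) throws Exception;
}

class DispatchResult { /* SPARQL result set and chosen visualisation */ }

class DispatcherRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    DispatchResult run(List<Dispatcher> dispatchers, String userQuery, long timeoutMillis) {
        for (Dispatcher d : dispatchers) {
            Future<DispatchResult> f = pool.submit(() -> d.dispatch(userQuery));
            try {
                DispatchResult r = f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                if (r != null) {
                    return r;           // first dispatcher that answers wins
                }
            } catch (TimeoutException e) {
                f.cancel(true);         // a slow dispatcher must not block the application
            } catch (Exception e) {
                // a failing dispatcher is skipped rather than propagated to the user
            }
        }
        return null;                    // no dispatcher could answer the query
    }
}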
Extract properties from SPARQL query. A SPARQL query pattern typically contains references to property terms. If a user query contains a keyword which matches one of these property terms, then that SPARQL query becomes a potential match for the user query. This would reduce the burden on curators, as they would no longer have to add as many keywords.
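A rough sketch of this idea is given below, assuming Apache Jena ARQ were used to parse the stored queries; PropertyTermExtractor is a hypothetical helper written for illustration, not existing CoKo code.

import java.util.HashSet;
import java.util.Set;

import org.apache.jena.graph.Node;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.syntax.ElementPathBlock;
import org.apache.jena.sparql.syntax.ElementVisitorBase;
import org.apache.jena.sparql.syntax.ElementWalker;

public class PropertyTermExtractor {

    /** Collects the local names of predicate URIs used in the query's graph pattern. */
    public static Set<String> propertyTerms(String sparql) {
        Query query = QueryFactory.create(sparql);
        Set<String> terms = new HashSet<>();
        ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
            @Override
            public void visit(ElementPathBlock block) {
                block.patternElts().forEachRemaining(tp -> {
                    Node p = tp.getPredicate();          // null for complex property paths
                    if (p != null && p.isURI()) {
                        terms.add(p.getLocalName());     // e.g. "literacy_female"
                    }
                });
            }
        });
        return terms;
    }
}

The extracted terms (for example "literacy_female") could then be matched against stemmed keywords from the user query in the same way as curator-supplied keywords.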
Automatic query generation. The current implementation only allows hand-built SPARQL queries. An additional direction this research could take is ontology-based automatic query generation. Systems like Quelo [55] and Ginseng [29] (discussed in the background research section of this thesis) exemplify this approach; they support guided construction of queries based on feedback. It is, however, unclear how well this approach would perform over an arbitrary ontology.
[Figure 14 Additional dispatcher: the current dispatcher matches a user query against keywords, metaqueries and generic queries, while the additional dispatcher matches it against keyword-SPARQL queries; each path produces a SPARQL result and a visualization]
Chain of transformations
Abstractly, we can consider the query workflow from an end-user query to an output view as given in Figure 15. An end-user query is mapped to a SPARQL query (or a set of queries), which in turn is mapped to an output view. To bridge the gap between the SPARQL query and the output view, the current system employs only a single transformation, a built-in transformation to the Google JSON format.
This bridge between the SPARQL query and the target output view could be extended to allow curators to specify multiple transformations, the mashing-up of results from an additional query execution, or direct rendering of results in tabular format. Such a user-insertable chain of transformations and massaging of SPARQL query results would allow curators to create more sophisticated and compelling views of the query results.
[Figure 15 Abstract view of query workflow: a user query is mapped to a SPARQL query, whose result may pass through multiple transformations, direct tabular rendering, or an additional SPARQL query before being mapped to the output view]
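A minimal sketch of what such a chain could look like is given below; the Transformation interface and ResultTable class are hypothetical names used purely for illustration of the design, not an existing part of the system.

import java.util.Arrays;
import java.util.List;

interface Transformation {
    ResultTable apply(ResultTable input);
}

class ResultTable { /* tabular SPARQL result (variable bindings) */ }

class TransformationChain {
    private final List<Transformation> steps;

    TransformationChain(Transformation... steps) {
        this.steps = Arrays.asList(steps);
    }

    /** Applies each curator-specified step in order, e.g. filter the result,
     *  join it with an additional query result, then serialise it for the view. */
    ResultTable run(ResultTable sparqlResult) {
        ResultTable current = sparqlResult;
        for (Transformation step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}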
Check for data updates and refresh datasets
As discussed in the evaluation section, because there is no support for remote dataset refresh, data publishers with dynamic datasets and no public endpoint have to manually push a fresh dataset into CoKo’s knowledge base each time the dataset is updated. To avoid this constant manual effort, a generic data refresh for remote datasets could be implemented. A generic refresh would pull fresh data into the system after a definite period of time. This can either be a constant period (such as every 24 hours) or can be based on the HTTP header information, which can be used to estimate a plausible expiration time for the data, after which it should be refreshed. If the data changes too much, however, the stored queries may stop working, so an elaborate verification procedure needs to be in place. Data could also be stored as versions. If a user uploads a local file, it is reasonable to assume that they want that particular version of the file.
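The sketch below illustrates one way the refresh delay could be estimated from HTTP headers, assuming plain Java; the class and the dataset URL are hypothetical, and the verification step described above is deliberately omitted.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class RefreshPlanner {

    static final long DEFAULT_REFRESH_MILLIS = TimeUnit.HOURS.toMillis(24);

    /** Returns the delay in milliseconds after which the dataset should be re-fetched. */
    public static long refreshDelay(String datasetUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(datasetUrl).openConnection();
        conn.setRequestMethod("HEAD");              // only the headers are needed
        conn.connect();

        long expires = conn.getExpiration();        // from the Expires header, 0 if absent
        long now = System.currentTimeMillis();
        conn.disconnect();

        if (expires > now) {
            return expires - now;                   // trust the publisher's expiry hint
        }
        return DEFAULT_REFRESH_MILLIS;              // fall back to a fixed 24-hour period
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical remote dataset URL
        long delay = refreshDelay("http://example.org/data/dataset.rdf");
        System.out.println("Next refresh in " + delay + " ms");
    }
}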
Trust and Authentication
An application based on an open, community-curated knowledge base is constantly under threat from spoofing, poor data quality and lack of trust. One way to minimize these threats is to integrate a trust-based authentication system into the application. A trust metric like Advogato44 could be used for this purpose. Advogato’s trust metric [56] is based on network flow and automatically calculates trust for an individual in a network from the ratings provided by other individuals who have a rating higher than the individual being rated. This metric has been proven to be attack resistant and is easy to integrate and implement.
44 http://www.advogato.org/
7.3.2 Other extensions
Public API for search and transformation
The search and transformation API implemented for the system can be extended to provide external access, so that it can be used in other applications and to create mashups with data from CoKo’s knowledge base. This would further increase the usability and accessibility of the existing data.
Richer user queries
The current system only supports keyword-based user queries, with no semantic analysis. It could be extended to support richer queries such as phrase queries, and to expand user queries to search for synonyms. Ontology-based keyword expansion could also be used to improve the search mechanism.
Internal Curation
As described in the introduction section of this thesis, TrueKnowledge employs an internal curation mechanism which assesses any new entity with the help of an inferencing system and rejects it if it contradicts an existing entity. A similar mechanism could be pursued in future extensions of CoKo.
Remote endpoint caching
Once a SPARQL query has been sent to a remote endpoint for evaluation, the retrieved results should be cached in memory to maximize the performance of the system. Maintaining a cache would reduce redundant calls to the remote endpoint on each subsequent execution of a query and would thus decrease the response time for the end user.
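A minimal sketch of such a cache is given below, assuming plain Java; EndpointResultCache and CachedResult are hypothetical names, and the size-bound LRU eviction policy is only one of several reasonable choices (an expiry time per entry would be another).

import java.util.LinkedHashMap;
import java.util.Map;

class CachedResult { /* serialised SPARQL result, e.g. the Google JSON table */ }

class EndpointResultCache {
    private static final int MAX_ENTRIES = 500;

    // an access-ordered LinkedHashMap gives simple LRU behaviour
    private final Map<String, CachedResult> cache =
            new LinkedHashMap<String, CachedResult>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, CachedResult> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    private String key(String endpointUrl, String sparql) {
        return endpointUrl + "\n" + sparql;
    }

    public synchronized CachedResult get(String endpointUrl, String sparql) {
        return cache.get(key(endpointUrl, sparql));
    }

    public synchronized void put(String endpointUrl, String sparql, CachedResult result) {
        cache.put(key(endpointUrl, sparql), result);
    }
}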
Keyword-Operation mapping
The primary focus of the current implementation is on fixed uploads, unlike WolframAlpha, which allows users to query for “Population of China + Population of India” or “male population of China divided by the total population of China”. WolframAlpha understands the operators “+” and “divided” and provides appropriate results accordingly. Keywords could be associated with different kinds of operations, rather than merely with the queries built into the dataset description.
Keyword-Visualization mapping and automatic generation of visualization
When the system receives a user query such as “Map of earthquakes”, it should not only search for queries tagged with the keyword “earthquake” but should also understand the term “Map” and present the query result on a map-based visualization.
In [57] the author proposes an ontology-based approach to the automatic generation of charts from SPARQL queries. This approach could be pursued in future implementations to suggest alternative visualizations to the user for a query result.
References
[1] F. Van Harmelen and G. Antoniou, A Semantic Web Primer, 2nd ed. Cambridge, MA,
USA: The MIT Press.
[2] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker, Searching and
Browsing Linked Data with SWSE: the Semantic Web Search Engine. Technical Report
DERI-TR-2010-07-23, 2010.
[3] J. Howe, Crowdsourcing: How the Power of the Crowd is Driving the Future of
Business. Random House Business, 2009.
[4] L. Hirschman and R. Gaizauskas, “Natural language question answering: the view
from here,” Natural Language Engineering, vol. 7, no. 4, Feb. 2002.
[5] S. J. Athenikos and H. Han, “Biomedical question answering: A survey,” Computer
methods and programs in biomedicine, vol. 99, no. 1, pp. 1–24, 2010.
[6] C. Kwok, O. Etzioni, and D. S. Weld, “Scaling question answering to the web,” ACM
Transactions on Information Systems (TOIS), vol. 19, pp. 242–262, Jul. 2001.
[7] M. R. Kangavari, S. Ghandchi, and M. Golpour, “Information Retrieval: Improving
Question Answering Systems by Query Reformulation and Answer Validation.”
[8] H. Xu, “Interview on Wolfram|Alpha, a Computational Knowledge Engine,” InfoQ, 30-
Jul-2009. [Online]. Available: http://www.infoq.com/. [Accessed: 02-Mar-2011].
[9] N. Spivack, “Wolfram Alpha Computes Answers To Factual Questions. This Is Going To
Be Big.,” TechCrunch, 08-Mar-2009. [Online]. Available: http://techcrunch.com/.
[Accessed: 02-Mar-2011].
[10] “True Knowledge,” Wikipedia. [Online]. Available:
http://en.wikipedia.org/wiki/True_Knowledge. [Accessed: 29-Mar-2011].
[11] m c schraefel, N. R. Shadbolt, N. Gibbins, S. Harris, and H. Glaser, “CS AKTive
space: representing computer science in the semantic web,” in Proceedings of the
13th international conference on World Wide Web, New York, NY, USA, 2004, pp.
384–392.
[12] G. Schreiber et al., “Semantic annotation and search of cultural-heritage
collections: The MultimediaN E-Culture demonstrator,” Web Semantics: Science,
Services and Agents on the World Wide Web, vol. 6, no. 4, pp. 243-249, Nov. 2008.
[13] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “ArnetMiner: extraction and
mining of academic social networks,” in Proceeding of the 14th ACM SIGKDD
international conference on Knowledge discovery and data mining, New York, NY,
USA, 2008, pp. 990–998.
[14] C. Torniai, “Semantic Web for the masses - Part II,” 25-Jul-2009. [Online].
Available: http://blog.carlotorniai.net/semantic-web-for-the-masses-part-ii/.
[Accessed: 19-Apr-2011].
[15] A. Harth and P. Buitelaar, “Exploring Semantic Web Datasets with VisiNav,”
presented at the The 6th Annual European Semantic Web Conference (ESWC2009),
Heraklion, Greece, 2009.
[16] V. Lopez, A. Nikolov, M. Sabou, V. Uren, E. Motta, and M. d’ Aquin, “Scaling Up
Question-Answering to Linked Data,” in Knowledge Engineering and Management by
the Masses, vol. 6317, P. Cimiano and H. S. Pinto, Eds. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2010, pp. 193-210.
[17] E. Rajabi and M. Kahani, “Designing a Step-by-Step User Interface for Finding
Provenance Information over Linked Data,” Web Engineering, pp. 403–406, 2011.
[18] M. Hearst, A. Elliott, J. English, R. Sinha, K. Swearingen, and K.-P. Yee, “Finding the
flow in web site search,” Commun. ACM, vol. 45, no. 9, pp. 42–49, Sep. 2002.
[19] M. C. Schraefel, M. Wilson, A. Russell, and D. A. Smith, “mSpace: Improving
information access to multimedia domains with multimodal exploratory search,”
Commun. ACM, vol. 49, pp. 47-49, 2006.
[20] K.-P. Yee, K. Swearingen, K. Li, and M. Hearst, “Faceted metadata for image
search and browsing,” in Proceedings of the conference on Human factors in
computing systems - CHI ’03, Ft. Lauderdale, Florida, USA, 2003, p. 401.
[21] D. Huynh and D. Karger, “Parallax and companion: Set-based browsing for the
data web,” in Proceedings of 18th International World Wide Web Conference, 2009.
[22] G. Kobilarov and I. Dickinson, “Humboldt: exploring linked data,” context, vol. 6,
p. 7, 2008.
[23] T. Berners-Lee et al., “Tabulator: Exploring and analyzing linked data on the
semantic web,” in Proceedings of the 3rd International Semantic Web User Interaction
Workshop, 2006, vol. 2006.
[24] R. García, J. M. Brunetti, A. López-Muzás, J. M. Gimeno, and R. Gil, “Publishing and
interacting with linked data,” in Proceedings of the International Conference on Web
Intelligence, Mining and Semantics, 2011, p. 18.
[25] E. Turner, A. Hinze, and S. Jones, “A review of user interface adaption in current
semantic web browsers,” 2011.
[26] J. Lehmann and L. Bühmann, “AutoSPARQL: Let Users Query Your Knowledge
Base,” in The Semantic Web: Research and Applications, vol. 6643, G. Antoniou et al.,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 63-79.
[27] S. Auer and J. Lehmann, “What Have Innsbruck and Leipzig in Common?
Extracting Semantics from Wiki Content,” in The Semantic Web: Research and
Applications, vol. 4519, E. Franconi, M. Kifer, and W. May, Eds. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2007, pp. 503-517.
[28] V. Lopez, E. Motta, and V. Uren, “Poweraqua: Fishing the semantic web,” The
Semantic Web: Research and Applications, pp. 393–410, 2006.
[29] A. Bernstein, E. Kaufmann, and C. Kaiser, “Querying the semantic web with
ginseng: A guided input natural language search engine,” in 15th Workshop on
Information Technologies and Systems, Las Vegas, NV, 2005, pp. 112–126.
[30] T. Gruber, “Collective knowledge systems: Where the Social Web meets the
Semantic Web,” Web Semantics: Science, Services and Agents on the World Wide
Web, vol. 6, no. 1, pp. 4-13, Feb. 2008.
[31] M. Richardson and P. Domingos, “Building large knowledge bases by mass
collaboration,” in Proceedings of the 2nd international conference on Knowledge
capture, New York, NY, USA, 2003, pp. 129–137.
[32] Y. L. Simmhan, B. Plale, and D. Gannon, “A Survey of Data Provenance
Techniques,” 2005.
[33] C. Bizer and R. Oldakowski, “Using context-and content-based trust policies on
the semantic web,” in Proceedings of the 13th international World Wide Web
conference on Alternate track papers & posters, 2004, pp. 228–229.
[34] A. Gil and V. Ratnakar, “Trusting information sources one citizen at a time,”
in Proceedings of the First International Semantic Web Conference (ISWC), Sardinia, 2002, pp. 162-176.
[35] J. Cheney, L. Chiticariu, and W.-C. Tan, “Provenance in Databases: Why, How, and
Where,” Foundations and Trends in Databases, vol. 1, no. 4, pp. 379-474, 2007.
[36] P. Buneman, S. Khanna, and T. Wang-Chiew, “Why and where: A characterization
of data provenance,” Database Theory—ICDT 2001, pp. 316–330, 2001.
[37] T. J. Green, G. Karvounarakis, and V. Tannen, “Provenance semirings,” in
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on
Principles of database systems, New York, NY, USA, 2007, pp. 31–40.
[38] P. P. Silva, D. L. McGuinness, and R. McCool, “Knowledge provenance
infrastructure,” IEEE Data Eng. Bull., vol. 26, no. 4, pp. 26–32, 2003.
[39] T. Omitola, C. Gutteridge, I. Millard, H. Glaser, N. Gibbins, and N. Shadbolt,
“Tracing the Provenance of Linked Data using voiD,” 2011.
[40] J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler, “Named graphs, provenance and
trust,” in Proceedings of the 14th international conference on World Wide Web, New
York, NY, USA, 2005, pp. 613–622.
[41] E. R. Watkins and D. A. Nicole, “Named Graphs as a Mechanism for Reasoning
about Provenance,” 18-Jan-2006. [Online]. Available:
http://eprints.ecs.soton.ac.uk/11935/. [Accessed: 08-May-2011].
[42] E. Dumbill, Tracking Provenance of RDF Data. 2003.
[43] R. Macgregor and I.-young Ko, “Representing Contextualized Data using Semantic
Web Tools,” in Practical and Scalable Semantic Systems (Workshop at the 2nd ISWC), 2003.
[44] G. Flouris, I. Fundulaki, P. Pediaditis, Y. Theoharis, and V. Christophides,
“Capturing Provenance of RDF Triples through Colors.”
[45] O. Hartig and J. Zhao, “Publishing and consuming provenance metadata on the
web of linked data,” Provenance and Annotation of Data and Processes, pp. 78–90,
2010.
[46] M. Leida, A. Afzal, and B. Majeed, “Outlines for dynamic visualization of semantic
web data,” in Proceedings of the 2010 international conference on On the move to
meaningful internet systems, Berlin, Heidelberg, 2010, pp. 170–179.
[47] J. G. Zheng and L. Ding, “How to render SPARQL results using Google Visualization
API - Data-gov Wiki,” 07-Oct-2009. [Online]. Available: http://data-gov.tw.rpi.edu/wiki/How_to_render_SPARQL_results_using_Google_Visualization_API. [Accessed: 15-Mar-2011].
[48] J. Tennison, “Creating Google Visualisations of Linked Data,” Jeni’s Musings, 23-
Jul-2009. [Online]. Available: http://www.jenitennison.com/blog/. [Accessed: 15-Mar-
2011].
[49] “Lucene-java Wiki.” [Online]. Available: http://wiki.apache.org/lucene-
java/PoweredBy. [Accessed: 17-Aug-2011].
[50] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files
for text indexing,” ACM Trans. Database Syst., vol. 23, no. 4, pp. 453–490, Dec. 1998.
[51] O. Erling, “Implementing a SPARQL compliant RDF triplestore using SQL-ORDBMS.”
[Online]. Available:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSRDFWP. [Accessed:
17-Aug-2011].
[52] O. Erling and I. Mikhailov, “RDF Support in the Virtuoso DBMS,” Networked
Knowledge-Networked Media, pp. 7–24, 2009.
[53] A. Jaffri, H. Glaser, and I. Millard, “URI Identity Management for Semantic Web
Data Integration and Linkage,” 2007. [Online]. Available:
http://eprints.ecs.soton.ac.uk/14361/. [Accessed: 05-Sep-2011].
[54] L. Ding et al., “TWC LOGD: A portal for linked open government data ecosystems,”
Web Semantics: Science, Services and Agents on the World Wide Web, vol. In Press,
Corrected Proof.
[55] E. Franconi, P. Guagliardo, and M. Trevisan, “An intelligent query interface based
on ontology navigation,” in Proceedings of the Workshop on Visual Interfaces to the
Social and Semantic Web (VISSW 2010), 2010.
[56] R. Levien, “Attack-resistant trust metrics,” Computing with Social Trust, pp. 121–
132, 2009.
[57] M. Leida, “Toward Automatic Generation of SPARQL result set Visualizations,”
presented at the 8th International Joint Conference on e-Business and
Telecommunications, Seville, Spain.
Appendices
Appendix A –Technologies Used
Description Software
Operating System Windows 7
Web Browser Google Chrome
Automated UI testing tool Selenium IDE
AJAX Framework Google Web Toolkit
Development Platform JDK 1.6
Java IDE Netbeans IDE 6.9.1
JavaEE API Servlet, JavaServer Pages, JSP Standard Tag Library
Database Virtuoso Open-Source Edition 6.1.3
Triplestore Virtuoso Open-Source Edition 6.1.3
Visualization API Google Visualization API
Search Engine Apache Lucene 3.3.0
Appendix B – Data Set Description Language (DSDL) Schema
<!--DTD definition for DSDL-->
<!ELEMENT data (general, presentation*)>
<!ELEMENT general (owner, source, creator, lastEditor, date, licenceInfo, description)>
<!ELEMENT owner (name, email?, url?)>
<!ELEMENT source (name, type, url)>
<!ELEMENT creator (name, email?, url?)>
<!ELEMENT lastEditor (name, email, url?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT date (from, to?)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT licenceInfo (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT presentation (description, import-data?, query, keyQuery?, keywords+, visualisation+)>
<!ELEMENT import-data (#PCDATA)>
<!ELEMENT query (#PCDATA)>
<!ELEMENT keyQuery (#PCDATA)>
<!ELEMENT keywords (property?, keyword)>
<!ELEMENT property (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
<!ELEMENT visualisation (#PCDATA)>
<!ATTLIST visualisation rating (1|2|3|4|5|6|7|8|9|10) #REQUIRED>
Appendix C - Case Studies
C.1 SPARQL query using GRAPH Clause
Listing C.1
PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
SELECT ?state ?urban_home_broadband ?rural_home_broadband
WHERE {
  GRAPH <http://data-gov.tw.rpi.edu/vocab/Dataset_10040> {
    ?s d:state ?state .
    ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
    ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
    FILTER (?state != "TOTAL HOUSEHOLDS")
  }
}
ORDER BY ?state
C.2 Full version of DSDL for case study 1
<data>
  <general>
    <owner>
      <name>National Telecommunications and Information Administration</name>
      <url>http://www.ntia.doc.gov/</url>
    </owner>
    <source>
      <name>National Telecommunications and Information Administration survey</name>
      <type>file</type>
      <url>http://data-gov.tw.rpi.edu/raw2/10040/data-10040.rdf</url>
    </source>
    <creator>
      <name>National Telecommunications and Information Administration</name>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
    </lastEditor>
    <date>
      <from>02/03/2010</from>
    </date>
    <licenceInfo>Open Data</licenceInfo>
    <description>Households using the Internet in and outside the home, by selected
    characteristics: Total, Urban, Rural, Principal City, 2009</description>
  </general>
  <presentation>
    <description>Compare rural and urban internet usage for various states in US</description>
    <query>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state ?urban_home_broadband ?rural_home_broadband
      WHERE { ?s d:state ?state .
        ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
        ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
        FILTER (?state != "TOTAL HOUSEHOLDS")} ORDER BY ?state
    </query>
    <keywords>
      <keyword>rural vs urban Broadband Internet Use</keyword>
    </keywords>
    <visualisation rating="10">ColumnChart</visualisation>
  </presentation>
</data>
C.3 DSDL for additional query upload
<data>
  <presentation>
    <import-data>data1</import-data>
    <description>Compare rural and urban internet usage for various states in US</description>
    <query>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state ?urban_home_broadband ?rural_home_broadband
      WHERE {
        ?s d:state ?state .
        ?s d:urban_internet_use_in_the_home_broadband_no ?urban_home_broadband .
        ?s d:rural_internet_use_in_the_home_broadband_no ?rural_home_broadband .
        FILTER (?state != "TOTAL HOUSEHOLDS").
        FILTER regex(?state, "{0}", "i").
      } ORDER BY ?state
    </query>
    <keyQuery>PREFIX d: <http://data-gov.tw.rpi.edu/vocab/p/10040/>
      SELECT ?state
      WHERE {
        ?s d:state ?state .
        FILTER (?state != "TOTAL HOUSEHOLDS").
      } ORDER BY ?state
    </keyQuery>
    <keywords>
      <keyword>Rural and Urban Broadband Internet Use in {0}</keyword>
    </keywords>
    <visualisation rating="10">ColumnChart</visualisation>
  </presentation>
</data>
C.4 DSDL describing CIA Factbook dataset
<data>
  <general>
    <owner>
      <name>Central Intelligence Agency</name>
      <url>https://www.cia.gov/library/publications/the-world-factbook/</url>
    </owner>
    <source>
      <name>CIA World factbook</name>
      <type>endpoint</type>
      <url>http://www4.wiwiss.fu-berlin.de/factbook/sparql</url>
    </source>
    <creator>
      <name>Central Intelligence Agency</name>
      <url>https://www.cia.gov/</url>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
      <url>http://www.linkedin.com/in/priyammaheshwari</url>
    </lastEditor>
    <date>
      <from>27/08/2011</from>
    </date>
    <licenceInfo>Public Domain</licenceInfo>
    <description>The World Factbook provides information on the history, people, government,
    economy, geography, communications, transportation, military, and transnational issues
    for 267 world entities</description>
  </general>
  <presentation>
    <description>Literacy rates of females</description>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      SELECT ?female_literacy_rate ?country
      WHERE {
        ?s ns:literacy_female ?female_literacy_rate;
           ns:name ?country
      } ORDER BY ?country
    </query>
    <keywords>
      <keyword>Female literacy rates around the world</keyword>
    </keywords>
    <visualisation rating="6">Table</visualisation>
  </presentation>
</data>
C.5 DSDL describing property mappings
<data>
  <presentation>
    <import-data>data2</import-data>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      SELECT ?value ?country
      WHERE {
        ?s ns:{%property} ?value;
           ns:name ?country
      } ORDER BY ?country
    </query>
    <keywords>
      <property>literacy_male</property>
      <keyword>Male literacy rates around the world</keyword>
    </keywords>
    <keywords>
      <property>population_total</property>
      <keyword>Population of countries around the world</keyword>
    </keywords>
    <keywords>
      <property>infantmortalityrate_total</property>
      <keyword>Infant Mortality rates around the world</keyword>
    </keywords>
    <visualisation rating="6">Table</visualisation>
  </presentation>
</data>
C.6 DSDL describing DBpedia dataset and a query against multiple endpoints
<data>
  <general>
    <owner>
      <name>DBPedia Project</name>
      <url>http://dbpedia.org/About</url>
    </owner>
    <source>
      <name>DBPedia</name>
      <type>endpoint</type>
      <url>http://lod.openlinksw.com/sparql</url>
    </source>
    <creator>
      <name>Priyam</name>
      <email>[email protected]</email>
    </creator>
    <lastEditor>
      <name>Priyam</name>
      <email>[email protected]</email>
    </lastEditor>
    <date>
      <from>27/07/2011</from>
    </date>
    <licenceInfo>GNU Free Documentation License</licenceInfo>
    <description>DBpedia is a community effort to extract structured information from Wikipedia
    and to make this information available on the Web</description>
  </general>
  <presentation>
    <import-data>data2</import-data>
    <description>Land Area covered by each country and their Wikipedia page</description>
    <query>PREFIX ns: <http://www4.wiwiss.fu-berlin.de/factbook/ns#>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX dbpedia: <http://dbpedia.org/ontology/>
      SELECT DISTINCT ?country ?land_area ?wikiPage
      WHERE {
        ?s ns:area_land ?land_area.
        ?s ns:name ?country.
        ?DBcountry a dbpedia:Country .
        ?DBcountry owl:sameAs ?s .
        ?DBcountry foaf:page ?wikiPage
      } ORDER BY ?country
    </query>
    <keywords>
      <keyword>Geographic information about countries</keyword>
    </keywords>
    <visualisation rating="10">Map</visualisation>
    <visualisation rating="8">Table</visualisation>
  </presentation>
</data>
Appendix D – Google Visualization Data Format
The following table provides details of the data format for various visualizations, as recommended by the Google Visualization API. When SPARQL query results are transformed into the Google JSON table format, the data type and format of each column should follow the format specified for the corresponding visualization.
Visualization: Data Format

Bar Chart: The first column should be a string and represent the label of that group of bars. Any number of columns can follow, all numeric, each representing the bars with the same color and relative position in each group.

Pie Chart: Two columns. The first column should be of type string and contain the slice label. The second column should be a number and contain the slice value.

Line Chart: The first column should be a string and contain the category label. Any number of columns can follow, all of which must be numeric. Each column is displayed as a separate line.

Geo Map: The location is entered in the first column, plus two optional columns:
1. [String, Required] A map location. The following formats are accepted:
   a. A specific address (for example, "1600 Pennsylvania Ave").
   b. A country name as a string (for example, "England"), or an uppercase ISO-3166 code or its English text equivalent (for example, "GB" or "United Kingdom").
   c. An uppercase ISO-3166-2 region code name or its English text equivalent (for example, "US-NJ" or "New Jersey").
   d. A metropolitan area code. These are three-digit metro codes used to designate various regions; US codes only supported.
2. [Number, Optional] A numeric value displayed when the user hovers over this region. If column 3 is used, this column is required.
3. [String, Optional] Additional string text displayed when the user hovers over this region.
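For illustration, the sketch below (plain Java, with hypothetical example countries and values) builds the kind of DataTable JSON that a two-column Pie Chart or Table visualization expects: a string label column followed by a numeric value column.

public class PieChartJsonExample {
    public static void main(String[] args) {
        // Hypothetical example values; a real transformation would fill these
        // from the variable bindings of a SPARQL result set.
        String dataTableJson =
            "{\"cols\": ["
          + "  {\"id\": \"country\", \"label\": \"Country\", \"type\": \"string\"},"
          + "  {\"id\": \"rate\", \"label\": \"Female literacy rate\", \"type\": \"number\"}"
          + "],"
          + "\"rows\": ["
          + "  {\"c\": [{\"v\": \"Austria\"}, {\"v\": 98.0}]},"
          + "  {\"c\": [{\"v\": \"Brazil\"}, {\"v\": 88.8}]}"
          + "]}";
        System.out.println(dataTableJson);
    }
}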