Semantic web design for - Presentation


• Basics of Linked Data
• Purpose of this project
• Migrational Framework
• Eight Steps
• Conclusion

What is Linked Data?

• Linked Data is an alternative data representation format.

• Actually, it's just a repackaging of Semantic Web elements

• It is different from relational database concepts such as tables, rows, columns…

RDF

Subject-Predicate-Object

Jurong belongs to the West Zone

Linked Data Representation Format

http://data.gov.sg/resource/area/Jurong_West

http://data.gov.sg/ontology/property/has_zone

http://data.gov.sg/resource/zone/West

Subject

Predicate

Object

http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://www.w3.org/2003/01/geo/wgs84_pos#long

1°20'040.2"N 103°42'24.54"E
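The Jurong West statement above can be written down as a single subject-predicate-object triple. The following is a minimal sketch in plain Python, assuming a URI-only triple; the `Triple` type and the `to_ntriples` helper are illustrative names, not part of any RDF toolkit, while the three URIs are the ones shown on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate and object (all URIs here)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line (illustrative helper)."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# URIs copied from the slide
jurong = Triple(
    "http://data.gov.sg/resource/area/Jurong_West",
    "http://data.gov.sg/ontology/property/has_zone",
    "http://data.gov.sg/resource/zone/West",
)
line = to_ntriples(jurong)
print(line)
```

Literal values such as the latitude and longitude would need a quoted literal instead of a URI in the object position; the sketch leaves that out.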

Traditional representation - Tables

Linked Data Components

• Data talks about itself. Humans and machines both understand data - How?

• URIs - lots of them (http://data.gov.sg/PlanningArea/Kallang)

• RDF - Data model (Jurong Point is a location)

• Ontologies - Enforce a structure on data (Land Hierarchy) – represented in RDF

• SPARQL - Does the same job as SQL and a bit more...
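To make the SQL comparison concrete: a SPARQL query matches triple patterns that contain variables, much as a SQL query selects rows that satisfy a WHERE clause. The sketch below is a toy pattern matcher in plain Python, not a SPARQL engine. The `match` function is an assumed name, and the second sample triple uses made-up URIs for the "Jurong Point is a location" example.

```python
from typing import Optional

# Sample triples: the first comes from the slides, the second uses made-up URIs
TRIPLES = [
    ("http://data.gov.sg/resource/area/Jurong_West",
     "http://data.gov.sg/ontology/property/has_zone",
     "http://data.gov.sg/resource/zone/West"),
    ("http://data.gov.sg/resource/place/Jurong_Point",      # hypothetical URI
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://data.gov.sg/ontology/Location"),               # hypothetical URI
]


def match(s: Optional[str], p: Optional[str], o: Optional[str]):
    """Return the triples that fit the pattern; None acts like a SPARQL variable."""
    return [
        t for t in TRIPLES
        if all(want is None or want == got for want, got in zip((s, p, o), t))
    ]


# Roughly: SELECT ?area WHERE { ?area <.../has_zone> <.../zone/West> }
areas = [
    t[0]
    for t in match(
        None,
        "http://data.gov.sg/ontology/property/has_zone",
        "http://data.gov.sg/resource/zone/West",
    )
]
print(areas)
```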

See the Difference?

Linked Data Cloud (Web of Data)

Linked Data becomes Linked Open Data (LOD) when it is published under an “appropriate” license

Provides opportunity to link with other useful data sets

Provides variety of information about the same resource

Linked Data and Government Data - a natural compatibility

• Why?

• Govt data is used by all

• Govt data needs to be transparent and easily understandable

• Govt data is mainly factual – a direct fit!

• Standardized representation of Govt data across the globe can facilitate comparison without hassles.

• Best way to propagate a useful agenda to the private arena...

Who has implemented Linked Data?

• UK, US, Brazil Governments

• Private Corporations? Yes
– BBC
– Nature
– World Bank
– New York Times
– FAO
– CIA Factbook

?Provide Links?

http://wheredoesmymoneygo.org/bubbletree-map.html#/~/grand-total--2010-

Sample Linked Data Usecase in UK

DGS Eco System

SG DATA: TEXTUAL, SPATIAL, API (UNSTRUCTURED DATA and STRUCTURED DATA)

TEXTUAL
• Sources: Agency Websites, Singstat publications
• Formats: XLS, HTML, PDF
• MINISTRIES: Ministry of Community Development, Youth & Sports; Ministry of Education; Ministry of Foreign Affairs; Ministry of Health; Ministry of Law – Community Mediation Unit; Ministry of Manpower; Ministry of Transport
• STATUTORY BOARDS: Accountant-General's Department; Accounting and Corporate Regulatory Authority; Agency For Science, Technology & Research; Attorney-General’s Chambers; Building & Construction Authority; Central Narcotics Bureau; Central Provident Fund Board; Civil Aviation Authority of Singapore; Department of Statistics; Economic Development Board; Energy Market Authority; Health Sciences Authority; Housing & Development Board; Immigration & Checkpoints Authority; Infocomm Development Authority of Singapore; Inland Revenue Authority of Singapore; Institute of Technical Education; Intellectual Property Office of Singapore; JTC Corporation; Judiciary, Subordinate Courts; Judiciary, Supreme Court; Land Transport Authority; Majlis Ugama Islam Singapura; Maritime & Port Authority of Singapore; Monetary Authority of Singapore; Nanyang Polytechnic; National Environment Agency; National Heritage Board; National Library Board; National Parks Board; Ngee Ann Polytechnic; People's Association; Public Service Division; Public Transport Council; Public Utilities Board; Republic Polytechnic; Sentosa Development Corporation; Singapore Civil Defence Force; Singapore Customs; Singapore Land Authority; Singapore Police Force; Singapore Polytechnic; Singapore Sports Council; Singapore Workforce Development Agency; Spring Singapore; Temasek Polytechnic; Urban Redevelopment Authority; Media Development Authority

SPATIAL
• CATEGORIES: C - Community; Cul - Culture; E - Environment; Emp - Employment; Edu - Education; H - Health; F - Family; R - Recreation; S - Sports
• THEMES: ABC Water Proj (R); BFA Buildings (C); Green Building (E); Breast Screen (H); Cervical Screen (H); Healthier Dining (H); Quit Centers (H); Infocomm Access (C); Silver infocomm (C); Wireless Hotspots (R); Child care (F); Disability (F); Elder care (F); Family (F); Family Friendly Estab (F); Student Care (F); Comm Mediation Center (C); After Death Facilities (E); Funeral Parlours (E); Dengue Cluster (H); Hawker Center (E); NEA Offices (E); Recycling Bins (E); Waste Disposal Site (E); Waste Treatment (E); Heritage sites (Cul); Monuments (Cul); Museums (Cul); Libraries (Cul); Streets and Places (Cul); CD Councils (C); Community Clubs (C); Constituency offices (C); Other facilities (C); Other Pan networks (C); PA headquarters (C); Residents Committee (C); Water Venture (C); National Parks (R); Skyrise greenery (E); Sports clubs (S); CET Centers (Emp); WDA Service points (Emp); Kindergartens (Edu)

API
• OPERATIONS: Get Token; Address Search; Agency Data Search; Static Map; Get Layer Info; Mashup; Get Related Data; Get Directions; Public Transportation; Reverse Geocode
• Map-related APIs from various agencies
• Traffic-related APIs from Land Transport Authority
• Tourism-related APIs from the Singapore Tourism Board
• Environment-related APIs from the National Environment Agency
• Library-related data feeds & web services from National Library Board

SG Government Data Eco System

Different levels of granularity

Multiple End points

Meta data only at data set levels

Data already cooked !!

Hierarchies not captured

Vocabulary Conflict in spatial and textual data

Few design issues spotted through the Linked Data lens

Benefits of using Linked Data for iDA Singapore

• An opportunity to standardize common terms across agencies

• Re-use of resources (through URIs) ex: http://data.gov.sg/zone/central

• Centralized control?

• Single endpoint for all govt data - Linked Data API

• Very convenient for developers to join data from different agencies. eg: combining data from SLA and URA

• URA Sites for Sales dataset (Urban Planning)
• DOS Population and Household Characteristics dataset (Population Demographics)

Age Pyramid of Resident Population

Old Age Support Ratio

Datasets Used for Framework Evaluation

Framework Formulation Process

• Work was split into three phases – Analysis, Design and Evaluation

• Based on a study of Linked Data migration research papers and cookbooks published by the World Wide Web Consortium (W3C)

• Analysis of Linked Data implementations in the UK, US and Brazil

• Evaluation of Linked Data tools with Singapore data sets for recommendation in each step of the framework

• Consideration of probable issues that could be faced during implementation

Proposed Linked Data Migrational Framework for DGS

Steps and personnel involved:
1. Specification (Identification, Analysis): Govt Agencies and IDA
2. Object Modeling: Govt Agencies Domain Matter Experts
3. Ontology Modeling (Re-use, Create): Ontology Modelers
4. URI Naming: IDA and Web Architects
5. RDF Creation (S2R, D2R, A2R): Developers
6. External Linking: Developers and Domain Experts
7. Datasets Publication: Developers
8. Discovery & Exploitation: Web Architects

Inputs, processes and outputs listed for the steps:
Step 1: Objectives; Specifications; Project Duration; Dataset Prioritization; Dataset License Setting; Impln Mode Selection; Roadmap; Architecture Overview
Step 2: Relational Model; Dataset Overview; Drawing Objects in Whiteboard; Conceptual View
Step 3: Conceptual View; Public Vocabularies; Re-use of Existing Vocabularies; Creation of New Vocabularies; OWL, RDFS, RDF Vocabulary files
Step 4: Resources Class and Properties; Visualization of URI mining process; URI Administration; URI Lifecycle
Step 5: ER Model; Spreadsheets, DBMS, API; Conversion to RDF triples using Mapping files; RDF Triples
Step 6: Government and external data sets; Linking based on Similarity Algorithms; Outbound Links
Step 7: RDF Triples; Ontologies; SPARQL, API; Data Insertion; VOID Modeling; Data Retrieval; API to SPARQL conversion; VOID Triples; JSON data
Step 8: Actual Data; Existing Apps; Gamification; Crowdsourcing; Catalog Registration; External Reference; New Apps

INPUT, PROCESS and OUTPUT are specified for each of the eight steps.

Resource Allocation (%) by step:
Step 1: 10
Step 2: 15
Step 3: 15
Step 4: 5
Step 5: 20
Step 6: 5
Step 7: 15
Step 8: 15

Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8

Specification

1) Design the High Level Architecture

2) Set the “Migration Potential” for data sets

3) Decide the “Perspective” – Vertical vs Horizontal -> Agency vs Application (We recommend Agency perspective)

Data set | URL | Data Type | Agency | Utility Level | Interlinking Possibility | Potential Level
Annual Vehicle Population by Type of Fuel Use | URL | Textual (PDF) | LTA | H | L | M
Administrative Data - Employment Statistic | URL | Textual (HTML) | MOM | H | M | H

Specification

4) Setting up of License for data sets

5) Implementation Method – “Linked Data + RDF”
Other options - 1) Just URIs 2) URI for data sets only

Analysis of Data sets
• Study of system specifications, design & integration documents (including database) of the selected data sets
• Understand Metadata, Schema design and Entity Relationship (ER) models

Data Set | URL | Data Type | Agency | License | Access Rights | Data Access Modes
Annual Vehicle Population by Type of Fuel Use | URL | Textual (PDF) | LTA | PDDL | R | API, SPARQL, RDF Dump
Administrative Data - Employment Statistic | URL | Textual (HTML) | MOM | PDDL | R | API, SPARQL, RDF Dump

Object Modeling

This is modeling without usage context.
* Requires the database model to be normalized to third normal form (3NF)

Issues
Possibility of applying high abstraction and high granularity to objects

Key Learning
Ease in identifying the use of common objects across data sets
Facilitates brainstorming of relationships between objects

Ontology Modeling

Takes the conceptual diagram from Object Modeling as input.

Design Ontologies
1. Identify classes and subclasses
2. Identify hierarchy structure
3. Connect classes through relationship
4. Create rules for inference (optional)
5. Output OWL vocabulary files

Ontology modelling is carried out in two ways:
1) Using and extending public ontologies
2) Designing a local ontology from scratch
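The first three design steps (classes, hierarchy, relationships) can be sketched as a handful of schema triples. This is only an illustration under stated assumptions: the class names `Region`, `Area` and `Zone` are invented for a land hierarchy, the `http://data.gov.sg/ontology/` base and the `has_zone` property come from earlier slides, and the RDF, RDFS and OWL terms are the standard W3C ones. A real vocabulary would be built in a tool such as Protégé and exported as OWL.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
ONT = "http://data.gov.sg/ontology/"  # base from the slides; class names are assumed


def build_ontology():
    """Return schema triples for a tiny, hypothetical land hierarchy."""
    region, area, zone = ONT + "Region", ONT + "Area", ONT + "Zone"
    has_zone = ONT + "property/has_zone"
    return [
        # 1. Identify classes and subclasses
        (region, RDF_TYPE, OWL + "Class"),
        (area, RDF_TYPE, OWL + "Class"),
        (zone, RDF_TYPE, OWL + "Class"),
        # 2. Identify hierarchy structure
        (area, RDFS + "subClassOf", region),
        (zone, RDFS + "subClassOf", region),
        # 3. Connect classes through a relationship
        (has_zone, RDF_TYPE, OWL + "ObjectProperty"),
        (has_zone, RDFS + "domain", area),
        (has_zone, RDFS + "range", zone),
    ]


schema = build_ontology()
for s, p, o in schema:
    print(f"<{s}> <{p}> <{o}> .")
```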


Ontology Modeling

Date fields, location fields and fields related to measurements in DGS have scope for vocabulary re-use

Vocabulary for the identified data sets (developed using Protege) with screenshots

List of vocabularies required for LOGD implementation

List of tools used for ontology modeling

OUTPUT? ALLOCATION PERCENTAGE? PERSONNEL INVOLVED

URI Naming

TBOX (ontology terms) | ABOX (instances)
http://data.gov.sg/ontology/Ministry/ | http://data.gov.sg/ministry/MOH
http://data.gov.sg/ontology/Agency/ | http://data.gov.sg/agency/SLA
http://data.gov.sg/ontology/SiteLocation | http://data.gov.sg/location/pioneer_road_north
http://data.gov.sg/ontology/Race | http://data.gov.sg/race/chinese

Dataset URIs

Dataset ID | URAstaticfile001
Dataset | http://data.gov.sg/dataset/URAstaticfile001/
Class | http://data.gov.sg/terms/class/URAstaticfile001/sitesforsale
Property | http://data.gov.sg/terms/property/URAstaticfile001/time
Row 1 | http://data.gov.sg/dataset/URAstaticfile001/1
Row 1 - A generic column | http://data.gov.sg/dataset/URAstaticfile001/1/columnName
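The dataset URI pattern above is regular enough to generate from a dataset ID. The sketch below reproduces the URIs listed on the slide; the function names `dataset_uris` and `row_uri` are assumptions made for illustration.

```python
from typing import Optional

BASE = "http://data.gov.sg"


def dataset_uris(dataset_id: str, class_name: str, property_name: str) -> dict:
    """Build the dataset, class and property URIs for one dataset."""
    return {
        "dataset": f"{BASE}/dataset/{dataset_id}/",
        "class": f"{BASE}/terms/class/{dataset_id}/{class_name}",
        "property": f"{BASE}/terms/property/{dataset_id}/{property_name}",
    }


def row_uri(dataset_id: str, row: int, column: Optional[str] = None) -> str:
    """Build the URI of a row, or of one column value within that row."""
    uri = f"{BASE}/dataset/{dataset_id}/{row}"
    return f"{uri}/{column}" if column else uri


uris = dataset_uris("URAstaticfile001", "sitesforsale", "time")
print(uris["class"])
print(row_uri("URAstaticfile001", 1))
print(row_uri("URAstaticfile001", 1, "columnName"))
```

Generating URIs from one function keeps the taxonomy consistent, which matters if URI administration is centralized in the DGS platform.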

1) “URI Administration” Mode
• Maintained centrally in the DGS platform (resultant URIs will start with http://data.gov.sg/) -> RECOMMENDED
vs
• Maintained by individual agencies (resultant URIs will start with http://ura.gov.sg or http://sla.gov.sg)
vs
• Maintained externally by third party platforms such as Kasabi (resultant URIs will start with http://data.kasabi.org) – No longer valid as the Kasabi service has been shut down

2) Setup of URI Taxonomy

RDF Creation

RDF triples are generated by converting data from the source format with the necessary transformations

Type | Nature | Example of Singapore data sets | Source format
S2R (Static Files) | Static | URA Site for Sales, Singstat’s Population Household Characteristics | XLS, CSV, TXT files and other static files
D2R (RDBMS) | Dynamic | DGS tables | RDBMS
A2R (APIs/Web Services) | Dynamic | OneMap API, myTransport API, NLB web services | Application Programming Interface (API) and Web Services (SOAP, REST)
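As a rough picture of what an S2R conversion does, the sketch below turns each cell of a small CSV file into one triple, using the row and property URI patterns from the URI Naming step. It is a hand-rolled illustration, not how Google Refine or any other tool mentioned here works internally; the sample CSV content and its column names are made up.

```python
import csv
import io

BASE = "http://data.gov.sg"
DATASET_ID = "URAstaticfile001"  # dataset ID from the URI Naming slide

# Made-up sample content; real URA files will have different columns
SAMPLE_CSV = "location,time\npioneer_road_north,2012\n"


def csv_to_triples(text: str, dataset_id: str):
    """Turn each CSV cell into a (row URI, property URI, literal value) triple."""
    triples = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"{BASE}/dataset/{dataset_id}/{row_number}"
        for column, value in row.items():
            predicate = f"{BASE}/terms/property/{dataset_id}/{column}"
            triples.append((subject, predicate, value))
    return triples


triples = csv_to_triples(SAMPLE_CSV, DATASET_ID)
for s, p, o in triples:
    print(f'<{s}> <{p}> "{o}" .')
```

A production mapping would also replace literals such as the location name with shared resource URIs, which is what makes later interlinking possible.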

RDF Creation

Evaluated three tools, one for each mode of conversion:
• Google Refine - S2R
• RDF Views - D2R
• RDF Sponger - A2R

Google Refine Demo for S2R!

For the D2R process, ER models from the RDBMS are to be converted into corresponding vocabularies/ontologies using the StdTrip methodology

For A2R, External Cartridges (mapping files) are to be created for mapping API parameters to vocabularies. This can be done in RDF Sponger

“We feel that Linked Data is best suited for data from Static files and not for data that is real-time and dynamic in nature unless conformity to structure can be trusted”

External Linking

External Linking is connecting with other data sets in the web of data

Data.gov.sg links out to: World Bank, CIA World Factbook, DBpedia, FAO, Geonames, Supreme Court, Flickr

<http://data.gov.sg/location/bugis> <owl:sameAs> <http://www.dbpedia.org/resource/Bugis>
<http://data.gov.sg/race/malay> <owl:sameAs> <http://www.dbpedia.org/resource/Malay_race>

Issues
• The outbound links made to data sets outside of IDA’s purview can be risky
• Dead links are a real possibility when resource URIs change or during system downtime
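Similarity-based linking can be pictured as comparing resource names and proposing an owl:sameAs link when the score passes a threshold. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the algorithm used by SILK or LIMES, and the function names and the threshold values are assumptions. The URIs are the ones from the example above.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

LOCAL = ["http://data.gov.sg/location/bugis", "http://data.gov.sg/race/malay"]
EXTERNAL = [
    "http://www.dbpedia.org/resource/Bugis",
    "http://www.dbpedia.org/resource/Malay_race",
]


def local_name(uri: str) -> str:
    """Return the lower-cased last path segment of a URI."""
    return uri.rstrip("/").rsplit("/", 1)[-1].lower()


def link_candidates(local_uris, external_uris, threshold=0.9):
    """Propose owl:sameAs triples for name pairs scoring at or above threshold."""
    links = []
    for local in local_uris:
        for external in external_uris:
            score = SequenceMatcher(
                None, local_name(local), local_name(external)
            ).ratio()
            if score >= threshold:
                links.append((local, OWL_SAME_AS, external))
    return links


strict_links = link_candidates(LOCAL, EXTERNAL)
loose_links = link_candidates(LOCAL, EXTERNAL, threshold=0.6)
print(len(strict_links), len(loose_links))
```

With the strict threshold only the exact name match (bugis) is linked; lowering it also links malay to Malay_race. A lower threshold finds more links but raises the risk of wrong ones, so candidate links usually deserve review by domain experts.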

Datasets Publication

A Triple Store (or RDF Store) is the database used to store Linked Data.
• We used Virtuoso Universal Server’s built-in triple store for evaluation
• It is envisaged that the triple store will be centralized at iDA

SPARQL (pronounced “sparkle”) will be the main output terminal for Linked Data
• SPARQL can be used to SELECT, INSERT, DELETE and UPDATE data
• SPARQL is the gateway to any operation on Linked Data. APIs and applications are built on top of it
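For a sense of what a client sends to a SPARQL endpoint, the sketch below builds a SPARQL Protocol GET request, where the query text travels in the `query` parameter. The endpoint URL is hypothetical; the predicate and object URIs come from the RDF slide. No request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://data.gov.sg/sparql"  # hypothetical endpoint URL

# Find every area whose zone is West (URIs taken from the RDF example)
QUERY = """
SELECT ?area WHERE {
  ?area <http://data.gov.sg/ontology/property/has_zone>
        <http://data.gov.sg/resource/zone/West> .
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query passed as `query=`."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```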

Triple Store and SPARQL Demo!

We had some information about External Linked Data Hosting, but we had to remove it as the major provider Talis has closed its own hosting service Kasabi!

Datasets Publication

Linked Data API is the common API endpoint that will be used by developers and public users to access government data.
- This solves the problem of maintaining multiple end points!

ex: http://gov.tso.co.uk/transport/api/transport/doc/bus-stop-point.xml
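Behind a single API endpoint, each request has to be translated into a SPARQL query. The sketch below shows the idea in its simplest form, mapping a resource path to a DESCRIBE query. The function name and the path layout are assumptions; an actual Linked Data API deployment would drive this mapping from configuration and not from a one-line rule.

```python
def api_path_to_sparql(path: str, base: str = "http://data.gov.sg") -> str:
    """Translate an API path such as '/area/jurong' into a SPARQL DESCRIBE query."""
    resource = base + "/" + path.strip("/")
    return f"DESCRIBE <{resource}>"


query = api_path_to_sparql("/area/jurong")
print(query)
```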

Discovery & Exploitation

Key Theme
1) Internal discovery within Singapore for local citizens – idea4apps (link)
2) External discovery for attracting usage of Singapore government data in international economic & political research and global issues (water scarcity, carbon footprint etc.)
• Entry in CKAN registry -> http://thedatahub.org/tag/registry

Gamification? Promoted by LinkedGov.org

Original data provided by URA

Possible because of the re-use of the common resource URI Pasir Ris across data sets

Similarly, location based data from OneMap API is retrieved for Pasir Ris

Interlinked Datasets Post-Migration

Other Interesting Use Cases

Definitely not Science Fiction!

Q & A Engine that works on top of government linked data. Inspired by www.trueknowledge.com

Sense-Making
Question: Which recent year had a growth rate close to 50% for the majority of Singapore-based SMEs?

Step 1: Spot the resources in this query
DBpedia Spotlight does just that! – Semantic Information Extraction
Which recent year had a growth rate close to 50% for majority of Singapore based SME

Step 2: Identify the relationship between the resources
• SME is an instance of the Organization class
• Organization class comes under Singapore country
• Growth rate is a property of the Sales class
• Year is a class by itself
• Majority is a subset of the Group class

Step 3: Use NLP technique – Syntactic Analysis (Stanford Parser) followed by Focus Extraction for understanding the question
A syntactic parse tree is generated, followed by the Access Pattern

Step 4: Look for RDF triples that meet the criteria
2010 is returned as the result!

Key Challenges

• Dense data - a lot of additional RDF triples get created along with the required RDF triples, as a resource belongs to multiple ontologies
  Demographics dataset stats: Rows: ~300, Columns: 16 in the Excel file; resultant triple count in the RDF/triple store: 13711
  Reason: the majority of the generated triples are for machine understanding.

• URI administration could be an intense activity, as dead URIs can cause damage to applications, e.g. what will happen if http://data.gov.sg/area/jurong doesn't work?

• Changes to the structure of static files and RDBMS tables require changes in RDF mapping files - this might be a long process if not properly regulated

• Not readily suitable for real-time data
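The dense-data figures in the first challenge can be put in proportion with a quick calculation. This is a rough estimate: it assumes exactly 300 rows (the source says "~300") and one triple per spreadsheet cell, neither of which is stated in the original.

```python
rows, columns = 300, 16        # "~300" rows is approximate in the source
total_triples = 13711          # reported triple-store count

cell_triples = rows * columns  # one triple per cell, a simplifying assumption
extra_triples = total_triples - cell_triples
ratio = round(total_triples / cell_triples, 2)

print(cell_triples, extra_triples, ratio)
```

Under these assumptions about 4,800 triples carry cell values and roughly 8,900 are additional, so the store holds close to three times as many triples as there are cells.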

Summary

Four in-person discussion sessions with IDA, NIIT and SLA

Analysis of Five data.gov.sg system specifications

Evaluation of Four existing Migration Frameworks

Prototyping with Six core Linked Data Tools

• Dataset Publication: Virtuoso Universal Server, Linked Data API
• External Linking: SILK, LIMES
• RDF Creation: Google Refine, RDF Views, RDF Sponger
• URI Naming: Pubby
• Ontology Modeling: Protégé
• Object Modeling: Concept Map

Summary

• Applicability of the framework to Singapore Government Data

• Issues identified in existing Data Eco System

• Recommended tools and best practices for each step

• Launchpad for SG Linked Data implementation

Final Thoughts…

• ROI is not a key metric for Linked Data implementation
• Benefits of moving to Linked Data are intangible and may not be immediately realizable
• Volume of work is huge compared to traditional systems

We are thankful to Prof Chris Khoo for his supervision and to iDA staff Soy Boon Lim for providing an overview of data.gov.sg and for furnishing DGS design documents...

Why are we doing this project?

To prescribe a Linked Data migrational framework for data.gov.sg (DGS) data sets

First hand view of the required migration activities

Issues anticipated at each step

Evaluation & Recommendation on Linked Data tools

To help IDA realize what more can be done with existing data: a closer look at Government counterparts – UK and US!

In totality, iDA can use this report as a guide for the various aspects related to Linked Data implementation

Basic Thought Process of Linked Data Publishing

• Select data sets that appear apt for Linked Data

• Identify the data sources for the data sets

• Find out what type of transformations are needed

• Publish it!

iDA Singapore launched Data.gov.sg portal and mGov@SG public services during June 2011

Data.gov.sg provides 5000+ public data sets from 50 government agencies

Purpose: research and building applications using the data

Data.Gov.Sg