DataGraft: Data-as-a-Service for Open Data


Dumitru [email protected]

https://datagraft.net

About me

• Education

– Eng (2003), Technical University of Cluj-Napoca, Romania

– PhD (2008), University of Innsbruck, Austria

• Current positions

– Senior Research Scientist, SINTEF, Norway

– Associate Professor, University of Oslo, Norway

• Expertise and responsibilities

– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics

– Involved with over 20 large-scale R&D projects at the European level during the past 12 years


“Technology for a better society”

• Public and private companies

• Data owners

• Data publishers

• Data integrators and aggregators

• Developers

• Improved data access

• Data-driven decision making

• Cost reduction when working with data

• Reduced dependency on generic infrastructure providers (e.g. generic cloud)

• Increase in the speed of making data available

• Increase in the reuse of data

• Data cleaning

• Data transformation

• Data publication

• Data-as-a-Service

• Open data

• Linked data (RDF, SPARQL)

DataGraft

Outline

Session #1: Open Data

• Open Data

• (Open) Data Quality Issues

• Linked (Open) Data

– RDF, RDFS, SPARQL

Session #2: DataGraft

• Data-as-a-Service: DataGraft

• Examples and Demo

• Big Data and DataGraft

• Open Data in Malaysian context (by Dennis Gan)

• (Optional: Hands on)


What is Open Data?

What is Linked Data?

Challenges in (Linked Open) Data?

How to publish Linked Open Data?

Linked Open Data Use Cases?

(Linked) Open Data and Big Data?

Open Data

What can open data do for you? (Source: The ODI, https://vimeo.com/110800848)


Open Data

…is changing the nature of business

...reflects a cultural shift to a more open society


Example: Personalized and Localized Urban Quality Index (PLUQI)

The index includes data from various domains:

Daily life satisfaction: weather, transportation, community, …

Healthcare level: number of doctors, hospitals, suicide statistics, …

Safety and security: number of police stations, fire stations, crimes per capita, …

Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …

Level of opportunity: jobs, unemployment, education, re-education, …

Environmental needs and efficiency: green space, air quality, …


PLUQI – potential usage

• Place recommendation for travel agencies or travelers

• Policy analysis and optimization for (local) government

• Understanding the citizen’s voice and demands regarding environmental conservation

• Commercial impact analysis for retailers and franchises

• Location recommendation and understanding local issues for real estate

• Risk analysis and management for insurance and financial companies

• Local marketing and sales force optimization for marketers


Open Data

• Businesses can develop new ideas, services and applications; improve decision making, cost savings

• Can increase government transparency and accountability, quality of public services

• Citizens get better and timely access to public services

Source: McKinsey http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information

Gartner:

By 2016, the use of "open data" will continue to increase – but slowly, and predominantly limited to Type A enterprises.

By 2017, over 60% of government open data programs that do not effectively use open data internally, will be scaled back or discontinued.

By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it.

Source: Gartner http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf

Lots of open datasets on the Web…

• A large number of datasets have been published as open data in recent years

• Many kinds of data: cultural, science, finance, statistics, transport, environment, …

• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …


…but few actually used

• Few applications utilizing open and distributed datasets at present

• Challenges for data consumers

– Data quality issues

– Difficult or unreliable data access

– Licensing issues

• Challenges for data publishers

– Lack of expertise & resources: not easy to publish & maintain high-quality data

– Unclear monetization & sustainability


Open Data Portal   Datasets     Applications
data.gov           ~ 200 000    ~ 80
publicdata.eu      ~ 48 000     ~ 85
data.gov.uk        ~ 31 000     ~ 390
data.norge.no      ~ 620        ~ 60
data.gov.my        ~ 1065       ~ 10
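To make the "few actually used" point concrete, the approximate figures from the portal table can be turned into a ratio. This is only an illustrative calculation on the rounded numbers shown on the slide; the function name is made up for this sketch.

```python
# Approximate (datasets, applications) figures copied from the portal table.
portals = {
    "data.gov": (200_000, 80),
    "publicdata.eu": (48_000, 85),
    "data.gov.uk": (31_000, 390),
    "data.norge.no": (620, 60),
    "data.gov.my": (1_065, 10),
}


def apps_per_thousand(datasets: int, applications: int) -> float:
    """Number of listed applications per 1,000 published datasets."""
    return 1000 * applications / datasets


for name, (datasets, applications) in portals.items():
    print(f"{name:15s} {apps_per_thousand(datasets, applications):6.1f}")
```

On these numbers, the largest portals list well under a handful of applications per thousand datasets.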

Lots of datasets are in tabular format

– Records organized in silos of collections

– Very few links within and/or across collections

– Difficult to understand the nature of the data

– Difficult to integrate / query


europeandataportal.eu

Tim Berners-Lee's 5-star open data rating system

1 star: Openly available on the web as a document

2 stars: Available in a structured format (XLS)

3 stars: Available in a non-proprietary format (CSV)

4 stars: Uses URIs to denote things

5 stars: Linked to other data to provide context
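The rating is cumulative: each star presupposes the ones before it. A minimal sketch of that rule, assuming the five criteria are supplied as yes/no answers (the function and parameter names are hypothetical, not part of any official tool):

```python
def star_rating(on_web_open, structured, non_proprietary, uses_uris, linked):
    """Return 0-5 stars; a star only counts if all earlier criteria are met."""
    criteria = [on_web_open, structured, non_proprietary, uses_uris, linked]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# An openly published Excel file: structured, but in a proprietary format.
print(star_rating(True, True, False, False, False))
# An openly published CSV file.
print(star_rating(True, True, True, False, False))
```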


1-Star Benefits

Consumers:

Ability to look at, print, store, modify and share data

Ability to use data as input to a system

Publishers:

Easily publish data

Ensure transparency

5-Star Benefits

Consumers:

Discover more (related) data while consuming the data

Directly learn about the data schema

Cost: Have to deal with broken data links

Cost: Trust issues

Publishers:

Make data discoverable

Increase the value of data

Gain the same benefits from the links as the consumers

Cost: Need to invest resources to link data

Cost: May need to clean data


Tabular Data

• Lots of open datasets are in tabular format

• CSV, Excel, TSV, etc.

• Records organized in silos of collections

• Very few links within and/or across collections

• Difficult to understand the nature of the data

• Difficult to integrate / query

Graph Data

Based on Linked Data

• Method for publishing data on the Web

• Self-describing data and relations

• Interlinking

• Accessed using semantic queries

• Open standards by W3C

− Data format: RDF

− Knowledge representation: RDFS/OWL

− Query language: SPARQL

http://www.w3.org/standards/semanticweb/data

europeandataportal.eu


Tabular Data

Graph Data

(Open) Data Quality Issues

Tabular data

Tabular data is data that is structured into rows and columns

Correspondence with reality:

1) Each row represents an entity

2) Each column header represents an attribute of the entity

3) Each column value represents a value of the attribute

4) Each table represents a collection of entities
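The correspondence above can be sketched with Python's standard csv module. The sample columns and values below are hypothetical, chosen only to illustrate rows as entities and headers as attributes.

```python
import csv
import io

# Hypothetical sample table; names and values are illustrative only.
raw = """make,year,number
Volvo,2014,120
Toyota,2014,95
"""

# The table is a collection of entities: each row becomes one entity,
# each column header an attribute name, each cell an attribute value.
entities = list(csv.DictReader(io.StringIO(raw)))

for entity in entities:
    print(entity["make"], entity["year"], entity["number"])
```

Note that every value is read as a string; interpreting "2014" as a year or "120" as a count is already a data-cleaning decision.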


Tabular data files

Tabular data can be stored in different formats:

Tabular text formats (pure tabular data), delimiter-separated values:

- CSV – comma-separated values

- Less common, including TSV – tab-separated values, colon-separated values, etc.

Spreadsheet formats (metadata about the document, tabular data, formulas):

- XLS (Excel spreadsheet)

- XLSX (Excel 2007 format)


Tabular data quality issues

When a dataset does not satisfy specified data quality criteria, it means that it contains data quality issues.

In order to provide higher data quality, these quality issues should be detected and removed.


Types of quality issues

What types of data quality issues can occur?

[Figures: example tables illustrating data quality issues. Recoverable labels: actual information model with "order", "street", "house"; actual information model with "order has address" and "address"; data model with "observation has make" and "make"; data model with "observation", "make", "year", "number".]

Summary of data quality issues


How to resolve data quality issues?

Workflow:

1) Identify data quality issues

2) Define transformation functions to resolve them

3) Execute transformation and verify the result
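The three workflow steps can be sketched in plain Python as follows. The sample rows, the two checks, and all function names are assumptions made for illustration; they are not DataGraft's implementation.

```python
# Hypothetical rows: one duplicate row and one missing value.
rows = [
    {"place": "Town A", "population": "100"},
    {"place": "Town A", "population": "100"},
    {"place": "Town B", "population": ""},
]


def find_issues(rows):
    """Step 1: report duplicate rows and missing values."""
    issues, seen = [], set()
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((index, "duplicate row"))
        seen.add(key)
        for column, value in row.items():
            if value == "":
                issues.append((index, f"missing value in {column}"))
    return issues


def remove_duplicates(rows):
    """Step 2a: keep only the first occurrence of each row."""
    seen, result = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def fill_missing(rows, column, default):
    """Step 2b: replace empty values in one column with a default."""
    return [{**row, column: row[column] or default} for row in rows]


# Step 3: execute the transformations and verify that no issues remain.
cleaned = fill_missing(remove_duplicates(rows), "population", "unknown")
remaining = find_issues(cleaned)
print(find_issues(rows), remaining)
```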


Transformation function types

By scope:

Functions on rows

Functions on columns

Functions transforming the entire dataset

By effect:

Data reordering functions

Data extraction functions

Data manipulation functions

Data enrichment functions


Transformation functions

Scope | Name | Description | Effect
Rows | Add Row | Create a new record in a dataset | Data enrichment
Rows | Take/Drop Rows | Extract only relevant rows by index | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Rows | Shift Row | Change a row's position inside a dataset | Data reordering; simplifies detection of quality issues
Rows | Filter Rows | Extract only relevant rows by condition | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Entire dataset | Remove Duplicates | Remove similar rows | Data extraction. Resolves issues: “Duplicate rows”
Entire dataset | Sort Dataset | Sort the dataset by given column names in a given order | Data reordering; simplifies detection of quality issues
Entire dataset | Reshape Dataset (Melt) | Move columns to rows | Data manipulation. Resolves issues: “Column headers containing attribute values”
Entire dataset | Reshape Dataset (Cast) | Move rows to columns by categorizing and aggregating | Data enrichment; simplifies detection of quality issues
Entire dataset | Group and Aggregate | Group values by one or more columns and perform aggregation | Data enrichment; simplifies detection of quality issues
Columns | Add Column | Add a column with a manually specified value | Data enrichment
Columns | Derive Column | Add a column with values computed from other columns | Data enrichment
Columns | Take/Drop Columns | Take or drop selected column(s) | Data extraction. Resolves issues: “Columns not related to model”
Columns | Shift Column | Arbitrarily change a column's position | Data reordering; simplifies detection of quality issues
Columns | Merge Columns | Merge columns using a custom separator | Data manipulation. Resolves issues: “Single value split across multiple columns”
Columns | Split Column | Split a column using a custom separator | Data manipulation. Resolves issues: “Multiple values stored in one column”
Columns | Rename Columns | Change column headers | Data manipulation. Resolves issues: “Incorrect column headers”
Columns | Map Columns | Apply a function to all values in a column | Data manipulation. Resolves issues: “Illegal values”, “Missing values”, “Inconsistent values”
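Two of the functions in the table, Melt and Split Column, are sketched below on lists of dictionaries. This is a plain-Python illustration under assumed signatures and sample data; it is not the API of Grafter or any other tool named in this deck.

```python
def melt(rows, id_columns, variable_name, value_name):
    """Move the non-id columns into rows: one output row per original cell."""
    melted = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            record = {key: row[key] for key in id_columns}
            record[variable_name] = column   # former column header
            record[value_name] = value       # former cell value
            melted.append(record)
    return melted


def split_column(rows, column, new_columns, separator):
    """Split one column into several columns using a separator."""
    result = []
    for row in rows:
        parts = row[column].split(separator)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_columns, parts))
        result.append(new_row)
    return result


# Column headers "2013" and "2014" are really attribute values (years).
wide = [{"make": "Volvo", "2013": 10, "2014": 12}]
long_rows = melt(wide, ["make"], "year", "number")

# One column holds two values (street and house number).
orders = [{"order": "1", "address": "Main Street;12"}]
split_rows = split_column(orders, "address", ["street", "house"], ";")
print(long_rows, split_rows)
```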

Tabular data cleaning tools

CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack a convenient user interface

Programming languages and libraries for data analysis (R, agate for Python) – users need programming knowledge

Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) – not originally created for data cleaning, hard to debug, code is mixed up with data

Frameworks/tools designed for interactive data cleaning and transformation in an ETL process


Example: vehicle registration data

https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true


Example: vehicle registration data (continued)

* Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken

Map columns – applying a function to all values in a column

Effect: data manipulation

Resolves anomalies: Illegal values, Missing values, Inconsistent values

Required parameters:

For each column that should be mapped:

1) Name of column to manipulate

2) Name of function to apply
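A minimal sketch of Map columns, assuming rows are dictionaries and the parameters are given as a column-to-function mapping. The signature and sample values are hypothetical.

```python
def map_columns(rows, functions):
    """Apply functions[column] to every value in that column.

    Columns without an entry in `functions` are copied unchanged.
    """
    return [
        {col: functions.get(col, lambda v: v)(val) for col, val in row.items()}
        for row in rows
    ]


# Hypothetical dirty values: stray whitespace, inconsistent case,
# and a number written with a thousands separator.
rows = [{"make": " volvo ", "number": "1 200"}]
cleaned = map_columns(rows, {
    "make": lambda v: v.strip().title(),
    "number": lambda v: int(v.replace(" ", "")),
})
print(cleaned)
```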

[Figures: the example table before and after applying Map columns]

Derive column – add a column with values computed from others

Effect: data enrichment

Adds new information to data

Required parameters:

1) Name of derived column

2) Column(s) to derive from

3) Function to derive with
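The three parameters map directly onto a small function. The sketch below assumes rows are dictionaries; the column names and the summing function are illustrative only.

```python
def derive_column(rows, new_column, source_columns, function):
    """Add new_column, computed by function from the source column values."""
    return [
        {**row, new_column: function(*(row[c] for c in source_columns))}
        for row in rows
    ]


# Hypothetical counts per fuel type; the derived column is their sum.
rows = [{"petrol": 70, "diesel": 30}]
derived = derive_column(rows, "total", ["petrol", "diesel"], lambda p, d: p + d)
print(derived)
```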

[Figures: the example table before and after applying Derive column]

Cast dataset – move rows to columns by categorizing and aggregating

Effect: data enrichment

Adds new information to data, simplifies anomaly detection

Required parameters:

1) Column name for variable (what to categorize and put to headers)

2) Column name for value (on what to perform aggregations)
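A sketch of Cast on lists of dictionaries. Besides the two parameters named above, it assumes an explicit list of identifying columns and a default aggregation of sum; both are assumptions of this example rather than something stated on the slide.

```python
from collections import defaultdict


def cast(rows, id_columns, variable_column, value_column, aggregate=sum):
    """Move rows to columns: one output row per combination of id columns,
    one new column per distinct variable, values combined by `aggregate`."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[c] for c in id_columns)
        groups[key][row[variable_column]].append(row[value_column])
    result = []
    for key, variables in groups.items():
        record = dict(zip(id_columns, key))
        for variable, values in variables.items():
            record[variable] = aggregate(values)
        result.append(record)
    return result


# Hypothetical long-format data with two rows for the same make and year.
long_rows = [
    {"make": "Volvo", "year": "2013", "number": 10},
    {"make": "Volvo", "year": "2014", "number": 12},
    {"make": "Volvo", "year": "2014", "number": 3},
]
wide = cast(long_rows, ["make"], "year", "number")
print(wide)
```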

[Figures: the example table before and after applying Cast dataset]

RDF mapping

Reusing existing vocabularies is encouraged; it helps to interlink data.


RDF mapping

http://vocabs.datagraft.net/vehicles


Linked (Open) Data: RDF, RDFS, SPARQL

Linked Data

• Method for publishing data on the Web

• Self-describing data and relations

• Interlinking

• Accessed using semantic queries




Linked open data cloud

By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792


Linked Data principles

• Every thing is represented by a URI

• URIs of things can be dereferenced

• Things are linked to other things by relating their URIs


Linked Data technology

• Data format: RDF

• Knowledge representation: RDFS/OWL

• Query language: SPARQL

• Linking medium: HTTP


Graph data structure

Alice

Jim

Peter


RDF in reality: using URLs to identify things


Resource Description Framework (RDF) Basics

• RDF makes statements about resources (entities)

o Triple data model: subject -> predicate -> object (Alice's age is 34)

• Subjects and objects:

o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)

o Literals – constant values ("female", "3.14159"); cannot be subjects

o Blank nodes – used to specify composite properties (e.g., an address composed of a country, city, street name, house number, zip code, etc.)

• Relationships (a.k.a. predicates) – relate one subject to one object
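The triple model can be sketched with two small Python types. The vocabulary namespace below is hypothetical, and blank nodes are left out to keep the example short; a real application would use an RDF library rather than these hand-rolled classes.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A constant value such as "female" or "34"."""
    value: str


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the relationship
    obj: object     # URI string of another resource, or a Literal


ALICE = "http://my-domain.com/Alice"
EX = "http://example.org/vocab#"  # hypothetical vocabulary namespace

# "Alice's age is 34" as subject -> predicate -> object.
graph = {
    Triple(ALICE, EX + "age", Literal("34")),
    Triple(ALICE, EX + "gender", Literal("female")),
}


def add(graph, subject, predicate, obj):
    """Add a statement, enforcing that literals cannot be subjects."""
    if isinstance(subject, Literal):
        raise ValueError("a literal cannot be the subject of a triple")
    graph.add(Triple(subject, predicate, obj))


add(graph, ALICE, EX + "knows", "http://my-domain.com/Bob")
print(len(graph))
```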


RDF serialisation formats

• Turtle family of RDF languages (N-Triples, Turtle, TriG and N-Quads)

<http://example.org/bob#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .

<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice#me> .

<http://example.org/bob#me> <http://schema.org/birthDate> "1990-07-04"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/topic_interest> <http://www.wikidata.org/entity/Q12418> .

<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/title> "Mona Lisa" .

<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Leonardo_da_Vinci> .

<http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619> <http://purl.org/dc/terms/subject> <http://www.wikidata.org/entity/Q12418> .

• JSON-LD (JSON-based RDF syntax)

{
  "@context": "example-context.json",
  "@id": "http://example.org/bob#me",
  "@type": "Person",
  "birthdate": "1990-07-04",
  "knows": "http://example.org/alice#me",
  "interest": {
    "@id": "http://www.wikidata.org/entity/Q12418",
    "title": "Mona Lisa",
    "subject_of": "http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619",
    "creator": "http://dbpedia.org/resource/Leonardo_da_Vinci"
  }
}

RDF serialisation formats (continued)

• RDFa (for HTML and XML embedding)


<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">

<div resource="http://example.org/bob#me" typeof="foaf:Person">

<p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>

and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>

<p>Bob is interested in <span property="foaf:topic_interest"

resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>

</div>

<div resource="http://www.wikidata.org/entity/Q12418">

<p>The <span property="dcterms:title">Mona Lisa</span> was painted by

<a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>

and is the subject of the video

<a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>. </p>

</div>

<div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">

<link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>

</div>

</body>

RDF serialisation formats (continued)

• RDF/XML (XML syntax for RDF)


<?xml version="1.0" encoding="utf-8"?>

<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"

xmlns:foaf="http://xmlns.com/foaf/0.1/"

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:schema="http://schema.org/">

<rdf:Description rdf:about="http://example.org/bob#me">

<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>

<schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>

<foaf:knows rdf:resource="http://example.org/alice#me"/>

<foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>

</rdf:Description>

<rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">

<dcterms:title>Mona Lisa</dcterms:title>

<dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>

</rdf:Description>

<rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">

<dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>

</rdf:Description>

</rdf:RDF>

RDF Schema (RDFS)

• Basic capabilities for describing RDF vocabularies

• Includes concepts to describe:

o classes, class hierarchies (sub-classes) and instances (typing)

o non-standard literal data types

o property hierarchies (sub-properties)

o predicate domain and range

o utility properties (labels, comments, additional information about things, definitions of resources)

o …
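One consequence of class hierarchies is that an instance of a sub-class is also an instance of its super-classes. The sketch below applies that single rule to a set of triples; the ex: vocabulary is hypothetical and the prefixed names are kept as plain strings for brevity.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical vocabulary (two sub-class statements) and one typed instance.
triples = {
    ("ex:Car", SUBCLASS_OF, "ex:Vehicle"),
    ("ex:Vehicle", SUBCLASS_OF, "ex:Thing"),
    ("ex:myCar", RDF_TYPE, "ex:Car"),
}


def infer_types(triples):
    """Repeat until nothing changes:
    (x rdf:type C) and (C rdfs:subClassOf D) imply (x rdf:type D)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, p2, d in list(inferred):
                if p2 == SUBCLASS_OF and c == o and (s, RDF_TYPE, d) not in inferred:
                    inferred.add((s, RDF_TYPE, d))
                    changed = True
    return inferred


closure = infer_types(triples)
print(sorted(t for t in closure if t[1] == RDF_TYPE))
```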


Linked data vocabulary sources


Querying RDF: SPARQL

• RDF query language

– Based on graph matching

• Uses SQL-like syntax

• Query types:

– SELECT – table of raw values

– CONSTRUCT, DESCRIBE – RDF graph

– ASK – boolean


SPARQL querying – example graph

[Figure: example graph with the nodes a:Alice, b:Peter and c:Jim, connected by foaf:knows edges, typed as foaf:Person via rdf:type, and carrying the foaf:nick literals "Lissy", "Pety" and "Jimbo"]

SPARQL querying – query

Question: What are the nicknames of people that Alice knows?

Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

Graph pattern: a:Alice -> foaf:knows -> ?someone -> foaf:nick -> ?nickname

SPARQL querying – matching to the graph


[Figure: the query's graph pattern matched against the example graph]

SPARQL querying – result

Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

nickname

"Pety"

"Jimbo"

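The graph matching behind this result can be imitated in a few lines of Python. This is a toy matcher, not a SPARQL engine: prefixed names and literals are plain strings, and the example graph contains only the edges that can be recovered from the slides (Alice knows Peter and Jim, plus the three nicknames).

```python
graph = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:knows", "c:Jim"),
    ("a:Alice", "foaf:nick", "Lissy"),
    ("b:Peter", "foaf:nick", "Pety"),
    ("c:Jim", "foaf:nick", "Jimbo"),
}


def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple.

    Terms starting with "?" are variables. Returns the extended bindings,
    or None if the triple does not fit the pattern."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(patterns, graph):
    """Match the patterns one after another, carrying variable bindings along."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in graph:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


results = query(
    [("a:Alice", "foaf:knows", "?someone"), ("?someone", "foaf:nick", "?nickname")],
    graph,
)
nicknames = sorted(solution["?nickname"] for solution in results)
print(nicknames)
```

The shared variable ?someone is what joins the two patterns: a binding made by the first pattern restricts which triples the second one may match.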

Data integration using Linked Data: using URIs

Example: Relational DB or spreadsheet – dataset about scientific publications:

ID   Name    Home page
1    Alice   http://alice.org/
2    Tim     https://www.w3.org/People/Berners-Lee/

ID author   ISBN                Publication topic
1           978-3-16-148410-0   "On the frictional coefficient of bananas"
1           534-1-22-663975-1   "Do woodpeckers get headaches?"
2           1-933019-33-6       "The Semantic Web"

Data integration using Linked Data: using URIs (continued)

Graph representation of new dataset:

[Figure: graph in which a:Alice and t:Tim are linked to publication URIs (http://.../978-3-16-148410-0, http://.../534-1-22-663975-1, http://.../1-933019-33-6) via foaf:publications, each publication carrying a foaf:topic literal such as "On the frictional coefficient of bananas" or "The Semantic Web"]

Data integration using Linked Data: Using URIs (continued)

Same URI!
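Because both datasets identify Alice with the same URI, integrating them amounts to taking the union of their triples. A minimal sketch, with prefixed names as plain strings and a hypothetical pub: prefix standing in for the publication URIs that are elided on the slide:

```python
# First dataset: the social graph.
social = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:nick", "Lissy"),
}

# Second dataset: publications ("pub:" is a placeholder prefix).
publications = {
    ("a:Alice", "foaf:publications", "pub:978-3-16-148410-0"),
    ("pub:978-3-16-148410-0", "foaf:topic", "On the frictional coefficient of bananas"),
}

# Same URI for Alice in both sets, so a set union links the two graphs.
merged = social | publications


def describe(graph, subject):
    """All (predicate, object) pairs stated about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


print(describe(merged, "a:Alice"))
```

After the merge, a single lookup on a:Alice returns facts from both sources without any join keys or schema mapping.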


Data integration using Linked Data: Using URIs (continued)


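Resulting graph:

[Figure: the social graph (foaf:Person, rdf:type, foaf:knows) joined with the publication graph (…978-3-16-148410-0 and …534-1-22-663975-1 with their foaf:topic literals, e.g. "On the frictional coefficient of bananas") through the shared URI]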

Query federation using SPARQL


Linked Data is great for Open Data

• Linked Data is a great means to represent data

– Semantics are part of the data

– Naturally linked to other data

– Querying language

• How Linked Data can improve Open Data:

– Easier integration, free data from silos

– Seamless interlinking of data

– Understand the data

– New ways to query and interact with data


… but has been ignored by the mainstream

• Difficult to make it accessible to people

– Publishers

– Developers

– Data workers

• Challenges with using Linked Data

– Lack of tooling and expertise to publish high quality Linked Data

– Lack of resources to host LOD endpoints / unreliable data access

• DataGraft: packaging Linked Data to make it more approachable to the open data community


Data-as-a-Service: DataGraft


“Data is the new oil”

…but many of us just need gasoline

Data-as-a-Service

…is the new filling station

Data-as-a-Service

• Outsourcing of various data operations to the cloud

• Eliminates

– upfront costs on data infrastructure

– ongoing investment of time and resources in managing the data infrastructure

• Complete package for

– transformation of raw data into meaningful data assets

– reliable delivery of data assets


DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way

Powerful data transformation and reliable data access capabilities

Data Transformation and RDF Publication Process

• Interactive design of transformations?

• Repeatable transformations?

• Reuse/share transformations (user-based access)?

• Cloud-based deployment of transformations?

• Self-serviced process?

• Data and Transformation as-a-Service?

[Figure: pipeline from Raw Data (Tabular Data) via Transform to Prepared Data, then via Map (ontology mapping against Ontology X) and Generate RDF to an RDF Graph (Graph Data) stored in an RDF Triple Store]

DataGraft: Data-as-a-Service for the Data Transformation and RDF Publication Process

https://www.ssb.no/statistikkbanken

Example: Using statistical data


Data records (rows): Add row, Take row(s), Drop row(s), Shift row, Filter rows (grep), Remove duplicate rows

Entire dataset: Sort, Reshape dataset, Group (categorize) and aggregate

Columns: Add column(s), Take column(s), Drop column(s), Move column, Merge columns, Split column, Rename column(s), Apply function to all values in a column

Data pages and federated querying


What is the population of locations and total number of persons employed in Human health and social work activities?

Configuring data visualizations


APIs

DataGraft key feature: Flexible management and sharing of data and transformations

Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalog

Interactively build, modify and share data transformations

Share transformations privately or publicly

Reuse transformations to repeatably clean and transform spreadsheet data

Programmatically access transformations and the transformation catalogue

Reuse of transformations in environmental data publishing

TRAGSA Pilot

• Number of transformations: 42

– Created via reuse: 25

• Number of triples:

– ~ 7.7M

ARPA Pilot

• Number of transformations: 5

– Created via reuse: 2

• Number of triples:

– ~ 14K


Forking/reusing transformations helped us spend less time on creating new transformations

DataGraft key feature: Reliable data hosting and querying services

Host data on DataGraft’s reliable, cloud-based semantic graph database

Share data privately or publicly

Query data through your own SPARQL endpoint

Programmatically access the data catalogue

Operations & maintenance performed on behalf of users

DataGraft Enablers

• Grafter

• Grafterizer

• Semantic Graph DBaaS

• Data Portal

DataGraft – 1 package 2 audiences

Data Publisher: helping with integrating and publishing data

Application Developer: giving better, easier tools

Examples and Demo

The context: Statsbygg


• A public sector administration company

• Norwegian government's key advisor in construction and property affairs

• Building commissioner

• Property manager

• Property developer

• Interest: Exploit/Share property data in novel ways

• For efficiency and sustainability of the property included in the government's civil estate

Example: Reporting state-owned real estate properties in Norway

Example: Reporting state-owned real estate properties in Norway (cont’)

Report

• A hard copy of 314 pages and as a PDF file

• 6 person-months

• Data collection with spreadsheets

• Quality assurance through e-mails and phone correspondence

• Pains: time consuming, poor data quality, static report without live updating

Reporting Service

• Live service

• Efficient sharing of data

• Simplified integration with external datasets

• Live updating

• Reliable access

• …

3rd party services

• Risk and vulnerability analysis, e.g. buildings affected by flooding

• Analysis of leasing prices

Sample data

Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access

Demo Scenario

• Interactively create tabular data transformations

• Reuse/extend data transformations (incl. data annotations)

• RDF data publication and querying

• Integrating and visualising data from different sources

• (Using 3rd party tools with DataGraft)


Demo sample data


Benefits of DataGraft in use cases

• Simplified data publishing process

• Integration with external data sources using established web standards

• Data that was not publicly available – now published (e.g. air quality data in Oslo)

• Time-efficient publishing

• Repeatable data transformation process


DataGraft and Big Data

• Desired features:

– real-time interactivity

– batch transformation capability for large datasets

We are developing a hybrid solution to work with both batch and real-time processing.


DataGraft and Big Data: High-level architecture


DataGraft – targeted impacts

Reduction in costs for organisations which lack sufficient expertise and resources to make their data available

Reduction in the dependency of data owners on generic cloud platforms to build, deploy and maintain their linked data from scratch

Increase in the speed of publishing new datasets and updating existing datasets

Reduction in the cost and complexity of developing applications that use data

Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net


Example: The benefit of DataGraft in PLUQI

1. 23% reduction in development cost

• Reducing cost for implementing transformations

• Integrating the process is simpler

2. Able to focus on service quality

• Gathering enough good datasets

• Designing/implementing

Before: Datasets gathering, Data transformation, Data provisioning/access, Implementing app

After (with DataGraft): Datasets gathering, Data transformation, Data provisioning/access, Implementing app

DataGraft in numbers (as of end of Jan 2016)

238 registered users

607 (208 public) registered data transformations

1828 uploaded files

192 public data pages

DataGraft in the wild

• Investigating crime data in small geographies

• Used DataGraft to transform data and publish RDF

http://benproctor.co.uk/investigating-crime-data-at-small-geographies/

Data Science and DataGraft

Greater Data Science:

1. Data Exploration and Preparation

2. Data Representation and Transformation

3. Computing with Data

4. Data Visualization and Presentation

5. Data Modeling

6. Science about Data Science

“50 years of Data Science” by David Donoho

http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

DataGraft

https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/

Summary

• DataGraft – emerging Data-as-a-Service solution for making (linked) data more accessible

– Platform, portal, methodology, APIs

– Online service, functional and documented

– Validated through several use cases

• Key features:

– Support for Sharable/Repeatable/Reusable Data Transformations

– Reliable RDF Database-as-a-Service


https://datagraft.net

Thank you!

Contact: [email protected]
