DataGraft: Data-as-a-Service for Open Data
About me
• Education
– Eng (2003), Technical University of Cluj-Napoca, Romania
– PhD (2008), University of Innsbruck, Austria
• Current positions
– Senior Research Scientist, SINTEF, Norway
– Associate Professor, University of Oslo, Norway
• Expertise and responsibilities
– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics
– Involved with over 20 large-scale R&D projects at the European level during the past 12 years
“Technology for a better society”
• Public and private companies
• Data owners
• Data publishers
• Data integrators and aggregators
• Developers
• Improved data access
• Data-driven decision making
• Cost reduction when working with data
• Reduced dependency on generic infrastructure providers (e.g. generic cloud)
• Increase in the speed of making data available
• Increase in the reuse of data
• Data cleaning
• Data transformation
• Data publication
• Data-as-a-Service
• Open data
• Linked data (RDF, SPARQL)
DataGraft
Outline
Session #1: Open Data
• Open Data
• (Open) Data Quality Issues
• Linked (Open) Data: RDF, RDFS, SPARQL
Session #2: DataGraft
• Data-as-a-Service: DataGraft
• Examples and Demo
• Big Data and DataGraft
• Open Data in Malaysian context (by Dennis Gan)
• (Optional: Hands on)
What is Open Data?
What is Linked Data?
Challenges in (Linked Open) Data?
How to publish Linked Open Data?
Linked Open Data Use Cases?
(Linked) Open Data and Big Data?
Open Data
What can open data do for you? (Source: The ODI, https://vimeo.com/110800848)
Open Data
…is changing the nature of business
...reflects a cultural shift to a more open society
Example: Personalized and Localized Urban Quality Index (PLUQI)
The index includes data from various domains:
• Daily life satisfaction: weather, transportation, community, …
• Healthcare level: number of doctors, hospitals, suicide statistics, …
• Safety and security: number of police stations, fire stations, crimes per capita, …
• Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …
• Level of opportunity: jobs, unemployment, education, re-education, …
• Environmental needs and efficiency: green space, air quality, …
PLUQI – potential usage
• Place recommendation for travel agencies or travelers
• Policy analysis and optimization for (local) government
• Understanding the citizen’s voice and demands regarding environmental conservation
• Commercial impact analysis for retailer and franchises
• Location recommendation and understanding local issues for real estate
• Risk analysis and management for insurance and financial companies
• Local marketing and sales force optimization for marketers
Open Data
• Businesses can develop new ideas, services and applications; improve decision making, cost savings
• Can increase government transparency and accountability, quality of public services
• Citizens get better and timely access to public services
Source: McKinsey http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
By 2016, the use of "open data" will continue to
increase — but slowly, and predominantly limited to
Type A enterprises.
By 2017, over 60% of government open data
programs that do not effectively use open data
internally, will be scaled back or discontinued.
By 2020, enterprises and governments will fail to
protect 75% of sensitive data and will declassify and
grant broad/public access to it.
Source: Gartner http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
Lots of open datasets on the Web…
• A large number of datasets have been published as open data in the recent years
• Many kinds of data: cultural, science, finance, statistics, transport, environment, …
• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …
…but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
– Data quality issues
– Difficult or unreliable data access
– Licensing issues
• Challenges for data publishers
– Lack of expertise & resources: not easy to publish & maintain high-quality data
– Unclear monetization & sustainability
Open Data Portal    Datasets     Applications
data.gov            ~ 200 000    ~ 80
publicdata.eu       ~ 48 000     ~ 85
data.gov.uk         ~ 31 000     ~ 390
data.norge.no       ~ 620        ~ 60
data.gov.my         ~ 1065       ~ 10
Lots of datasets are in tabular format
– Records organized in silos of collections
– Very few links within and/or across collections
– Difficult to understand the nature of the data
– Difficult to integrate / query
Tim Berners-Lee's 5-star open data rating system
1 star: Openly available on the web as a document
2 stars: Available in a structured format (XLS)
3 stars: Available in a non-proprietary format (CSV)
4 stars: Uses URIs to denote things
5 stars: Linked to other data to provide context
europeandataportal.eu
1-Star Benefits
Consumers:
Ability to look at, print, store, modify and share data
Ability to use data as input to a system
Publishers:
Easily publish data
Ensure transparency
5-Star Benefits
Consumers:
Discover more (related) data while consuming the data
Directly learn about the data schema
⚠ Have to deal with broken data links
⚠ Trust issues
Publishers:
Make data discoverable
Increase the value of data
Gain the same benefits from the links as the consumers
⚠ Need to invest resources to link data
⚠ May need to clean data
Tabular Data
• Lots of open datasets are in tabular format
• CSV, Excel, TSV, etc.
• Records organized in silos of collections
• Very few links within and/or across collections
• Difficult to understand the nature of the data
• Difficult to integrate / query
Graph Data
Based on Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
• Open standards by W3C
– Data format: RDF
– Knowledge representation: RDFS/OWL
– Query language: SPARQL
http://www.w3.org/standards/semanticweb/data
europeandataportal.eu
Tabular Data
Graph Data
(Open) Data Quality Issues
Tabular data
Tabular data is data that is structured into rows and columns
Correspondence with reality:
1) Each row represents an entity
2) Each column header represents an attribute of entity
3) Each column value represents a value of attribute
4) Each table represents a collection of entities
Tabular data files
Tabular data can be stored in different formats:
Tabular text formats (pure tabular data), delimiter-separated values:
- CSV – comma-separated values
- Less common: TSV (tab-separated values), colon-separated values, etc.
Spreadsheet formats (metadata about the document, tabular data, formulas):
- XLS (Excel spreadsheet)
- XLSX (Excel 2007 format)
Tabular data quality issues
When a dataset does not satisfy specified data quality criteria, it means that it contains data quality issues.
In order to provide higher data quality, these quality issues should be detected and removed.
Types of quality issues
What types of data quality issues can occur?
Actual information model: order, street, house
Actual information model: order -> has address -> address
Data model: observation -> has make -> make
Data model: observation, make, year, number
Summary of data quality issues
33
How to resolve data quality issues?
Workflow:
1) Identify data quality issues
2) Define transformation functions to resolve them
3) Execute transformation and verify the result
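The three workflow steps can be sketched in a few lines of Python. This is only an illustration, not DataGraft's implementation: the sample data, the column names (make, year, number) and the policy of replacing a missing value with "0" are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample with two issues: a duplicate row and a missing value.
RAW = """make,year,number
Volvo,2014,120
Volvo,2014,120
Toyota,2015,
"""

def load(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def find_issues(rows):
    """Step 1: report duplicate rows and missing values."""
    issues, seen = [], set()
    for index, row in enumerate(rows):
        key = tuple(row.items())
        if key in seen:
            issues.append((index, "duplicate row"))
        seen.add(key)
        for column, value in row.items():
            if value == "":
                issues.append((index, f"missing value in '{column}'"))
    return issues

def clean(rows):
    """Step 2: fill missing values with '0' (illustrative policy) and drop duplicates."""
    result, seen = [], set()
    for row in rows:
        fixed = {c: (v if v != "" else "0") for c, v in row.items()}
        key = tuple(fixed.items())
        if key not in seen:
            seen.add(key)
            result.append(fixed)
    return result

rows = load(RAW)
issues_before = find_issues(rows)
cleaned = clean(rows)
issues_after = find_issues(cleaned)  # Step 3: verify that no issues remain
```

Running the same issue detector before and after the transformation is what makes the verification step repeatable.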
Transformation function types
By scope:
Functions on rows
Functions on columns
Functions transforming entire
dataset
By caused effect:
Data reordering functions
Data extraction functions
Data manipulation functions
Data enrichment functions
Transformation functions (by scope: name, description, effect)
Rows
– Add Row: create a new record in a dataset. Effect: data enrichment
– Take/Drop Rows: extract only relevant rows by index. Effect: data extraction; resolves “Rows describing entities not belonging to a collection”
– Shift Row: change a row's position inside a dataset. Effect: data reordering; simplifies quality issue detection
– Filter Rows: extract only relevant rows by condition. Effect: data extraction; resolves “Rows describing entities not belonging to a collection”
Entire dataset
– Remove Duplicates: remove similar rows. Effect: data extraction; resolves “Duplicate rows”
– Sort Dataset: sort the dataset by given column names in a given order. Effect: data reordering; simplifies quality issue detection
– Reshape Dataset (Melt): move columns to rows. Effect: data manipulation; resolves “Column headers containing attribute values”
– Reshape Dataset (Cast): move rows to columns by categorizing and aggregating. Effect: data enrichment; simplifies quality issue detection
– Group and Aggregate: group values by one or more columns and perform aggregation. Effect: data enrichment; simplifies quality issue detection
Columns
– Add Column: add a column with a manually specified value. Effect: data enrichment
– Derive Column: add a column with values computed from other columns. Effect: data enrichment
– Take/Drop Columns: take or drop selected column(s). Effect: data extraction; resolves “Columns not related to model”
– Shift Column: arbitrarily change a column's position. Effect: data reordering; simplifies quality issue detection
– Merge Columns: merge columns using a custom separator. Effect: data manipulation; resolves “Single value split across multiple columns”
– Split Column: split a column using a custom separator. Effect: data manipulation; resolves “Multiple values stored in one column”
– Rename Columns: change column headers. Effect: data manipulation; resolves “Incorrect column headers”
– Map Columns: apply a function to all values in a column. Effect: data manipulation; resolves “Illegal values”, “Missing values”, “Inconsistent values”
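As an illustration of one dataset-level function, the melt reshape (move columns to rows) might look as follows in plain Python. The function name, its parameters and the sample rows are assumptions for this sketch and do not reflect DataGraft's actual API.

```python
def melt(rows, id_columns, variable_name, value_name):
    """Move every non-id column into its own row: one output row per (input row, column)."""
    melted = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            new_row = {c: row[c] for c in id_columns}
            new_row[variable_name] = column   # the former column header
            new_row[value_name] = value       # the former cell value
            melted.append(new_row)
    return melted

# Hypothetical "wide" table whose headers 2014 and 2015 are really attribute values.
wide = [
    {"make": "Volvo", "2014": 120, "2015": 135},
    {"make": "Toyota", "2014": 200, "2015": 210},
]
long_rows = melt(wide, ["make"], "year", "number")
```

After melting, the year is an ordinary column value, which resolves the "column headers containing attribute values" issue.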
Tabular data cleaning tools
CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack of convenient user interface
Programming languages and libraries for data analysis (R, agate for Python) – users need knowledge in programming
Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) - were not initially created for data cleaning, hard to debug, code is mixed up with data
Frameworks/tools designed to be used for interactive data cleaning and transformation in ETL process
Example: vehicle registration data
https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true
Example: vehicle registration data (continued)
* Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken
Map columns – applying a function to all values in a column
Effect: data manipulation
Resolves anomalies: Illegal values, Missing values, Inconsistent values
Required parameters:
For all columns that should be mapped
1) Name of column to manipulate
2) Name of function to apply
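A minimal Python sketch of this function, taking a mapping from column name to the function to apply. The helper to_int, the sample rows and the choice to turn unparseable values into 0 are assumptions for illustration only.

```python
def map_columns(rows, mapping):
    """Return new rows where mapping[column] has been applied to every value in that column."""
    return [
        {c: (mapping[c](v) if c in mapping else v) for c, v in row.items()}
        for row in rows
    ]

def to_int(value):
    """Parse an integer, ignoring spaces; anything that is not a number becomes 0."""
    text = str(value).replace(" ", "")
    return int(text) if text.isdigit() else 0

# Hypothetical rows with inconsistent, illegal and missing values.
rows = [
    {"region": " oslo ", "vehicles": "1 200"},
    {"region": "Bergen", "vehicles": ".."},
]
mapped = map_columns(
    rows,
    {"region": lambda s: s.strip().title(), "vehicles": to_int},
)
```

Columns that are not named in the mapping pass through unchanged.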
[Figure: Map columns – apply function to all values in a column, before and after]
Derive column – add a column with values computed from others
Effect: data enrichment
Adds new information to data
Required parameters:
1) Name of derived column
2) Column(s) to derive from
3) Function to derive with
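The three parameters map directly onto a small function. The sketch below is an assumption-based illustration (the function name, column names and figures are invented for the example), not DataGraft's own code.

```python
def derive_column(rows, new_name, source_columns, function):
    """Add column new_name, computed by function from the values of source_columns."""
    return [
        {**row, new_name: function(*(row[c] for c in source_columns))}
        for row in rows
    ]

# Hypothetical input: derive vehicles per capita from two existing columns.
rows = [{"region": "Oslo", "vehicles": 300000, "population": 600000}]
derived = derive_column(
    rows,
    "vehicles_per_capita",
    ["vehicles", "population"],
    lambda vehicles, population: vehicles / population,
)
```

The original columns are kept, so the derived column only adds information.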
[Figure: Derive column – add a column with values computed from others, before and after]
Cast dataset – move rows to columns by categorizing and aggregating
Effect: data enrichment
Adds new information to data, simplifies anomaly detection
Required parameters:
1) Column name for variable (what to categorize and put to headers)
2) Column name for value (on what to perform aggregations)
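A possible Python sketch of the cast reshape. Besides the two parameters named on the slide, it takes the identifying columns and an aggregation function (defaulting to sum); both are assumptions added for this example, as are the sample rows.

```python
from collections import defaultdict

def cast(rows, id_columns, variable_column, value_column, aggregate=sum):
    """Turn each distinct value of variable_column into a column, aggregating value_column."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[c] for c in id_columns)
        groups[key][row[variable_column]].append(row[value_column])
    result = []
    for key, by_variable in groups.items():
        new_row = dict(zip(id_columns, key))
        for variable, values in by_variable.items():
            new_row[variable] = aggregate(values)
        result.append(new_row)
    return result

# Hypothetical "long" table: one row per registration record.
long_rows = [
    {"region": "Oslo", "fuel": "petrol", "number": 10},
    {"region": "Oslo", "fuel": "diesel", "number": 7},
    {"region": "Oslo", "fuel": "petrol", "number": 5},
    {"region": "Bergen", "fuel": "diesel", "number": 3},
]
wide = cast(long_rows, ["region"], "fuel", "number")
```

Categories missing for a group (petrol in Bergen here) simply do not appear in that row; a real tool would probably fill them with an explicit empty value.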
[Figure: Cast dataset – move rows to columns by categorizing and aggregating, before and after]
RDF mapping
Reuse of existing vocabularies is encouraged; it helps to interlink data.
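To make the mapping step concrete, here is a small Python sketch that turns one tabular row into RDF triples and serialises them as N-Triples. It reuses terms from the existing FOAF vocabulary; the base URI http://example.org/person/ and the column names are assumptions chosen for the example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/person/"  # hypothetical base URI for minted identifiers

def literal(value):
    """Format a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def row_to_triples(row):
    """Map one row (id, name, homepage) to triples using FOAF terms."""
    subject = f"<{BASE}{row['id']}>"
    return [
        (subject, f"<{RDF_TYPE}>", f"<{FOAF}Person>"),
        (subject, f"<{FOAF}name>", literal(row["name"])),
        (subject, f"<{FOAF}homepage>", f"<{row['homepage']}>"),
    ]

def to_ntriples(triples):
    """Serialise triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

triples = row_to_triples({"id": "1", "name": "Alice", "homepage": "http://alice.org/"})
document = to_ntriples(triples)
```

Because foaf:Person, foaf:name and foaf:homepage are widely used, data mapped this way can be interlinked with other datasets that use the same terms.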
Linked (Open) Data: RDF, RDFS, SPARQL
Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
http://www.w3.org/standards/semanticweb/data
Linked open data cloud
By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792
Linked Data principles
• Every thing is represented by a URI
• URIs of things can be dereferenced
• Things are linked to other things by relating their URIs
55
Linked Data technology
• Data format: RDF
• Knowledge representation: RDFS/OWL
• Query language: SPARQL
• Linking medium: HTTP
Graph data structure
Alice
Jim
Peter
57
RDF in reality: using URLs to identify things
58
Resource Description Framework (RDF) Basics
• RDF makes statements about resources (entities)
o Triple data model: subject -> predicate -> object (Alice's age is 34)
• Subjects and objects:
o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)
o Literals – constant values ("female", "3.14159"); cannot be subjects
o Blank nodes – used to specify composite properties (e.g. an address composed of a country, city, street name, house number, zip code, etc.)
• Relationships (a.k.a. predicates) – relate one subject to one object
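The triple model can be sketched in Python as follows. The term encoding (angle brackets for URIs, quotes for literals, "_:" for blank nodes) follows N-Triples conventions; the predicate URIs under http://my-domain.com/ are hypothetical.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

def is_literal(term):
    """Literals are written in double quotes, e.g. '"34"'."""
    return term.startswith('"')

def is_blank_node(term):
    """Blank nodes use the '_:' prefix, e.g. '_:address1'."""
    return term.startswith("_:")

def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals in the subject position."""
    if is_literal(subject):
        raise ValueError("a literal cannot be the subject of a triple")
    return Triple(subject, predicate, obj)

ALICE = "<http://my-domain.com/Alice>"
statements = [
    # "Alice's age is 34": resource -> predicate -> literal
    make_triple(ALICE, "<http://my-domain.com/age>", '"34"'),
    # A composite address, grouped under a blank node.
    make_triple(ALICE, "<http://my-domain.com/address>", "_:address1"),
    make_triple("_:address1", "<http://my-domain.com/city>", '"Oslo"'),
]
```

The blank node _:address1 has no global identifier; it only ties the parts of the address together.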
RDF serialisation formats
• Turtle family of RDF languages (N-Triples, Turtle, TriG and N-Quads)
<http://example.org/bob#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice#me> .
<http://example.org/bob#me> <http://schema.org/birthDate> "1990-07-04"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/topic_interest> <http://www.wikidata.org/entity/Q12418> .
<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/title> "Mona Lisa" .
<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Leonardo_da_Vinci> .
<http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619> <http://purl.org/dc/terms/subject> <http://www.wikidata.org/entity/Q12418> .
• JSON-LD (JSON-based RDF syntax)
{
  "@context": "example-context.json",
  "@id": "http://example.org/bob#me",
  "@type": "Person",
  "birthdate": "1990-07-04",
  "knows": "http://example.org/alice#me",
  "interest": {
    "@id": "http://www.wikidata.org/entity/Q12418",
    "title": "Mona Lisa",
    "subject_of": "http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619",
    "creator": "http://dbpedia.org/resource/Leonardo_da_Vinci"
  }
}
RDF serialisation formats (continued)
• RDFa (for HTML and XML embedding)
<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
<div resource="http://example.org/bob#me" typeof="foaf:Person">
<p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>
and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
<p>Bob is interested in <span property="foaf:topic_interest"
resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
</div>
<div resource="http://www.wikidata.org/entity/Q12418">
<p>The <span property="dcterms:title">Mona Lisa</span> was painted by
<a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>
and is the subject of the video
<a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>. </p>
</div>
<div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
<link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
</div>
</body>
RDF serialisation formats (continued)
• RDF/XML (XML syntax for RDF)
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:schema="http://schema.org/">
<rdf:Description rdf:about="http://example.org/bob#me">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>
<foaf:knows rdf:resource="http://example.org/alice#me"/>
<foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>
</rdf:Description>
<rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">
<dcterms:title>Mona Lisa</dcterms:title>
<dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>
</rdf:Description>
<rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
<dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>
</rdf:Description>
</rdf:RDF>
RDF Schema (RDFS)
• basic capabilities for describing RDF vocabularies
• includes concepts to describe:
o classes, class hierarchies (sub-classes) and instances (typing)
o non-standard literal data types
o property hierarchies (sub-properties)
o predicate domain and range
o utility properties (labels, comments, additional information about things, definitions of resources)
o …
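One thing a class hierarchy enables is type inference: an instance of a sub-class is also an instance of its super-classes. The Python sketch below applies that single rule until nothing new is derived. The class ex:Student is hypothetical; foaf:Person being a sub-class of foaf:Agent comes from the FOAF vocabulary.

```python
def infer_types(triples):
    """Apply 'x rdf:type C' and 'C rdfs:subClassOf D' => 'x rdf:type D' until no new triples appear."""
    inferred = set(triples)
    while True:
        new = {
            (s, "rdf:type", parent)
            for (s, p, cls) in inferred if p == "rdf:type"
            for (c, q, parent) in inferred if q == "rdfs:subClassOf" and c == cls
        } - inferred
        if not new:
            return inferred
        inferred |= new

schema_and_data = {
    ("ex:Student", "rdfs:subClassOf", "foaf:Person"),   # hypothetical class
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    ("a:Alice", "rdf:type", "ex:Student"),
}
closure = infer_types(schema_and_data)
```

Only this one rule is implemented here; RDFS defines further rules, for example for sub-properties, domain and range.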
Linked data vocabulary sources
64
Querying RDF: SPARQL
• RDF query language
– Based on graph matching
• Uses SQL-like syntax
• Query types:
– SELECT – table of raw values
– CONSTRUCT, DESCRIBE – RDF graph
– ASK – boolean
SPARQL querying – example graph
[Figure: example graph with the nodes a:Alice, b:Peter and c:Jim, foaf:knows edges between them, an rdf:type edge to foaf:Person, and the foaf:nick literals "Lissy", "Pety" and "Jimbo"]
SPARQL querying – query
Question: What are the nicknames of people that Alice knows?
Query:
PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}
Graph pattern: a:Alice -> foaf:knows -> ?someone -> foaf:nick -> ?nickname
SPARQL querying – matching to the graph
[Figure: the query pattern matched against the example graph of a:Alice, b:Peter and c:Jim]
SPARQL querying – result
Query:
PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}
nickname
"Pety"
"Jimbo"
69
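The graph matching behind this query can be imitated in plain Python. The sketch assumes that Alice knows Peter and Jim and that the nicknames belong to Alice, Peter and Jim in that order; the transcript does not preserve the exact edges, so the triples below are an assumption consistent with the result shown.

```python
# Example graph as (subject, predicate, object) triples; the edges are assumed.
GRAPH = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:knows", "c:Jim"),
    ("b:Peter", "foaf:knows", "c:Jim"),
    ("a:Alice", "foaf:nick", "Lissy"),
    ("b:Peter", "foaf:nick", "Pety"),
    ("c:Jim", "foaf:nick", "Jimbo"),
}

def nicknames_of_people_known_by(graph, person):
    """Match 'person foaf:knows ?someone . ?someone foaf:nick ?nickname'."""
    # First triple pattern: bind ?someone.
    someone = {o for s, p, o in graph if s == person and p == "foaf:knows"}
    # Second triple pattern: join on ?someone and bind ?nickname.
    return sorted(o for s, p, o in graph if s in someone and p == "foaf:nick")

nicknames = nicknames_of_people_known_by(GRAPH, "a:Alice")
```

A SPARQL engine performs the same kind of join, but for arbitrary patterns and with query optimisation.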
Data integration using Linked Data: using URIs
Example: Relational DB or spreadsheet – dataset about scientific publications:
ID   Name    Home page
1    Alice   http://alice.org/
2    Tim     https://www.w3.org/People/Berners-Lee/
ID author   ISBN                Publication topic
1           978-3-16-148410-0   "On the frictional coefficient of bananas"
1           534-1-22-663975-1   "Do woodpeckers get headaches?"
2           1-933019-33-6       "The Semantic Web"
Data integration using Linked Data: using URIs (continued)
Graph representation of new dataset:
[Figure: a:Alice is linked to the publication URIs http://.../978-3-16-148410-0 and http://.../534-1-22-663975-1, and t:Tim to http://.../1-933019-33-6 (foaf:publications); each publication has a foaf:topic literal: "On the frictional coefficient of bananas", "Do woodpeckers get headaches?" and "The Semantic Web"]
Data integration using Linked Data: Using URIs (continued)
Same URI!
72
Data integration using Linked Data: Using URIs (continued)
Resulting graph:
[Figure: the example graph of a:Alice, b:Peter and c:Jim (foaf:knows, foaf:nick, rdf:type foaf:Person) extended with the publications …978-3-16-148410-0 and …534-1-22-663975-1 and their foaf:topic values "On the frictional coefficient of bananas" and "Do woodpeckers get headaches?"]
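Because both datasets identify Alice with the same URI, integrating them amounts to taking the union of their triples. The Python sketch below illustrates this; the "isbn:" prefix stands in for the elided publication URIs and the exact edges are assumptions for the example.

```python
SOCIAL = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:nick", "Lissy"),
    ("b:Peter", "foaf:nick", "Pety"),
}
PUBLICATIONS = {
    ("a:Alice", "foaf:publications", "isbn:978-3-16-148410-0"),
    ("a:Alice", "foaf:publications", "isbn:534-1-22-663975-1"),
    ("isbn:978-3-16-148410-0", "foaf:topic", "On the frictional coefficient of bananas"),
    ("isbn:534-1-22-663975-1", "foaf:topic", "Do woodpeckers get headaches?"),
}

# Integration step: a plain set union, since a:Alice is the same URI in both graphs.
MERGED = SOCIAL | PUBLICATIONS

def topics_by_nickname(graph, nickname):
    """Find publication topics of the person with the given nickname."""
    people = {s for s, p, o in graph if p == "foaf:nick" and o == nickname}
    pubs = {o for s, p, o in graph if p == "foaf:publications" and s in people}
    return sorted(o for s, p, o in graph if p == "foaf:topic" and s in pubs)

topics = topics_by_nickname(MERGED, "Lissy")
```

The question spans both sources: the nickname comes from one dataset and the publications from the other, so it can only be answered on the merged graph.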
Query federation using SPARQL
74
Linked Data is great for Open Data
• Linked Data is a great means to represent data
– Semantics are part of the data
– Naturally linked to other data
– Query language
• How Linked Data can improve Open Data:
– Easier integration, free data from silos
– Seamless interlinking of data
– Understand the data
– New ways to query and interact with data
… but has been ignored by the mainstream
• Difficult to make it accessible to people
– Publishers
– Developers
– Data workers
• Challenges with using Linked Data
– Lack of tooling and expertise to publish high quality Linked Data
– Lack of resources to host LOD endpoints / unreliable data access
• DataGraft: packaging Linked Data to make it more approachable to the open data community
76
Data-as-a-Service: DataGraft
“Data is the new oil” …but many of us just need gasoline
Data-as-a-Service …is the new filling station
Data-as-a-Service
• Outsourcing of various data operations to the cloud
• Eliminates
– upfront costs on data infrastructure
– ongoing investment of time and resources in managing the data infrastructure
• Complete package for
– transformation of raw data into meaningful data assets
– reliable delivery of data assets
DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way
Powerful data transformation and reliable data access capabilities
Data Transformation and RDF Publication Process
• Interactive design of transformations?
• Repeatable transformations?
• Reuse/share transformations (user-based access)?
• Cloud-based deployment of transformations?
• Self-serviced process?
• Data and Transformation as-a-Service?
DataGraft: Data-as-a-Service for the Data Transformation and RDF Publication Process
[Figure: tabular data (raw data) is transformed into prepared data, mapped to ontologies (ontology mapping) to generate an RDF graph, and stored as graph data in an RDF triple store]
https://www.ssb.no/statistikkbanken
Example: Using statistical data
Data records (rows)
• Add row
• Take row(s)
• Drop row(s)
• Shift row
• Filter rows (grep)
• Remove duplicate rows
Entire dataset
• Sort
• Reshape dataset
• Group (categorize) and aggregate
Columns
• Add column(s)
• Take column(s)
• Drop column(s)
• Move column
• Merge columns
• Split column
• Rename column(s)
• Apply function to all values in a column
Data pages and federated querying
What is the population of locations and total number of persons employed in Human health and social work activities?
Configuring data visualizations
APIs
DataGraft key feature: Flexible management and sharing of data and transformations
• Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalogue
• Interactively build, modify and share data transformations
• Share transformations privately or publicly
• Reuse transformations to repeatably clean and transform spreadsheet data
• Programmatically access transformations and the transformation catalogue
Reuse of transformations in environmental data publishing
TRAGSA Pilot
• Number of transformations: 42
– Created via reuse: 25
• Number of triples:
– ~ 7.7M
ARPA Pilot
• Number of transformations: 5
– Created via reuse: 2
• Number of triples:
– ~ 14K
Forking/reusing transformations helped us spend less time on creating new transformations
DataGraft key feature: Reliable data hosting and querying services
• Host data on DataGraft’s reliable, cloud-based semantic graph database
• Share data privately or publicly
• Query data through your own SPARQL endpoint
• Programmatically access the data catalogue
Operations & maintenance performed on behalf of users
DataGraft Enablers
• Grafter
• Grafterizer
• Semantic Graph DBaaS
• Data Portal
DataGraft – 1 package, 2 audiences
• Data Publisher
• Application Developer
Helping to integrate and publish data
Giving better, easier tools
Examples and Demo
The context: Statsbygg
• A public sector administration company
• Norwegian government's key advisor in construction and property affairs
• Building commissioner
• Property manager
• Property developer
• Interest: Exploit/Share property data in novel ways
• For efficiency and sustainability of the property included in the government's civil estate
Example: Reporting state-owned real estate properties in Norway
Report
• A hard copy of 314 pages and as a PDF file
• 6 person-months
• Data collection with spreadsheets
• Quality assurance through e-mails and phone correspondence
Pains
• Time consuming
• Poor data quality
• Static report without live updating
Reporting Service
• Live service
• Efficient sharing of data
• Simplified integration with external datasets
• Live updating
• Reliable access
• …
3rd party services
• Risk and vulnerability analysis, e.g. buildings affected by flooding
• Analysis of leasing prices
Sample data
Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access
Demo Scenario
• Interactively create tabular data transformations
• Reuse/extend data transformations (incl. data annotations)
• RDF data publication and querying
• Integrating and visualising data from different sources
• (Using 3rd party tools with DataGraft)
Demo sample data
Benefits of DataGraft in use cases
• Simplified data publishing process
• Integration with external data sources using established web standards
• Data that was not publicly available – now published (e.g. air quality data in Oslo)
• Time-efficient publishing
• Repeatable data transformation process
DataGraft and Big Data
• Desired features:
– real-time interactivity
– large datasets batch transformation capability
We are developing a hybrid solution to work with both batch and real-time processing.
DataGraft and Big Data: High-level architecture
DataGraft – targeted impacts
Reduction in costs for organisations which lack sufficient expertise and resources to make their data available
Reduced dependency of data owners on generic Cloud platforms to build, deploy and maintain their linked data from scratch
Increase in the speed of publishing new datasets and updating existing datasets
Reduction in the cost and complexity of developing applications that use data
Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net
Example: The benefit of DataGraft in PLUQI
Before: Datasets gathering -> Data transformation -> Data provisioning/access -> Implementing app
After (with DataGraft): Datasets gathering -> Data transformation -> Data provisioning/access -> Implementing app
1. 23% development cost reduction
• Reducing cost for implementing transformations
• Integrating the process is simpler
2. Able to focus on service quality
• Gathering enough good datasets
• Designing/implementing
DataGraft in numbers (as of end of Jan 2016)
• 238 registered users
• 607 registered data transformations (208 public)
• 1828 uploaded files
• 192 public data pages
DataGraft in the wild
• Investigating crime data in small geographies
• Used DataGraft to transform data and publish RDF
http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
Data Science and DataGraft
Greater Data Science:
1. Data Exploration and Preparation
2. Data Representation and Transformation
3. Computing with Data
4. Data Visualization and Presentation
5. Data Modeling
6. Science about Data Science
“50 years of Data Science” by David Donoho http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
DataGraft
https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/
Summary
• DataGraft – emerging Data-as-a-Service solution for making (linked) data more accessible
– Platform, portal, methodology, APIs
– Online service, functional and documented
– Validated through several use cases
• Key features:
– Support for Sharable/Repeatable/Reusable Data Transformations
– Reliable RDF Database-as-a-Service
https://datagraft.net
Thank you!
Contact: [email protected]