The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for...

23
The BigIaS Platform Simplifying Big Data Integration A SoftwareasaService Approach – ~ Preliminary Analysis and Design ~ September 5 th , 2013 Sogndal, Norway Dumitru Roman Claudia Daniela Pop Roxana Ioana Roman Bjørn Magnus Mathisen Contact: [email protected] 1

Transcript of The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for...

Page 1: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

The BigIaS PlatformSimplifying Big Data Integration

‐A Software‐as‐a‐Service Approach –~ Preliminary Analysis and Design ~  

September 5th, 2013Sogndal, Norway

Dumitru RomanClaudia Daniela PopRoxana Ioana Roman

Bjørn Magnus Mathisen

Contact: [email protected]

1

Page 2: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Context: Big Data and Our Primary Focus

• Addresses things that can be done at a large scale but cannot be done at a smaller one

o Extract new insights or create new forms of value in ways that change stakeholders and relationships between them

• Causality vs. correlations: not knowing why but only whato Challenges the basic traditional understanding of how to make decisions

2

Heterogeneity Change

Page 3: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Overview• The problem

o Data integration - a complex, unsolved problemo Tools addressing various aspects of data integration process can hardly be used

together for more complex, interesting integration tasks=> High cost of data integration at large scale, rather complicated and time consuming process

• The goal: Simplify data integration at large scale!o Enable users with limited technical data integration skills to get from raw data to

insightful data with minimal effort

• The approacho Semantic-based data integrationo Flexible and customizable workflows of data integration tools (application

integration)=> A Software-as-a-Service for data integration at large scale

3

Page 4: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

The BigIaS Platform

4

Talend ESB

Sesame OWLIM

Streaming

Talend Data Integration Karma Open 

Refine Tabula UIMA/Gate Silk

Data Layer• Storage services• Streaming services

Platform Layer• Data operations• Entity extraction• Data crawling• Data linking• Visualization• Streaming

Map4Rdf LodLive

C‐SPARQL LdSpider

Data Sets3rd party Open Data LOD3rd party3rd party

UX ‐ Portal and widgets

Grok/NUPIC

Trident‐ML

Other

Page 5: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Evaluated tools/approaches1. Application Integration

o Talend ESB

2. Data Processingo Talend Data Integrationo Tabulao Karmao Open Refineo UIMAo GATEo Silko LdSpider

5

3. Storageo Sesame o OWLIM

4. Visualizationo Map4Rdfo LodLive

5. Real-time Machine Learningo Trident‐MLo Grok/NUPIC

5. Streamingo C-SPARQL

Page 6: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Proposed Integration Workflows

1. Data Contextualization2. Entity Discovery3. Data Linking4. Data Visualization5. Real-time Machine Learning6. RDF Streaming

6

Page 7: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Data Contextualization

7

Integration with external 

data (Karma,

Open Refine)

Export RDF 

(Karma,Open Refine)

Store RDF

(Sesame + OWLIM)

Data Cleaning

(Open refine)

X2CSV(Tabula –PDF,Open Refine, Karma)

PDF

Excel

Relational DB

Excel, CSV, XML, etc.

*OWL 

* OWL file can be imported or configured

SPARQLendpoint

Open Refine: files: TSV, CSV, *SV, Excel (.xls and .xlsx), 

JSON, XML, RDF as XML and Google Docs

Web Services Karma: 

databases: MySQL, SQL Server, ORACLE, PostGISfiles: Excel, CSV,  XML, JSON, KMLWeb Services 

Tabula:  PDF

Page 8: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Entity Discovery

8

Entity Extraction

(UIMA, Gate)

Export RDF

(Karma, Open Refine)

Entity Contextualization

Open Refine(Reconcile Plug‐In)

Store RDF

(Sesame + OWLIM)

*OWL 

Reconcile against:

• Sparql Endpoint• RDF file• Sindice service(Twitter, Freebase, DBpedia, and other datasets, relevant for the data)

* OWL file can be imported or configured

CSV

Excel…

SPARQLendpoint

Gate:files: text files

UIMA: files: text,  audio, video

Page 9: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Data Linking

9

X2RDF

(Open Refine,Karma)

Store RDF

(Sesame + OWLIM)

Link Discovery

(SILK)

Store RDF

(Sesame + OWLIM)

SPARQL endpoint

SPARQLendpoint

CSV

Excel…

Relational DB

Open Refine: files: TSV, CSV, *SV, Excel (.xls and .xlsx), 

JSON, XML, RDF as XML and Google DocsWeb Services 

Karma: databases: MySQL, SQL Server, ORACLE, PostGISfiles: Excel, CSV,  XML, JSON, KMLWeb Services 

Page 10: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Data Visualization

10

DataContextualization

EntityDiscovery

DataLinking

Data Visualization(Map4RDF,LodLive)

GUICSV

Excel…

Relational DB

Sparql Endpoint

Page 11: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Real‐time Machine Learning

11

DataContextualization

EntityDiscovery

DataLinking

Real‐time Machine Learning

(GROK/NUPIC,Trident‐ML)

PatternsCSV

Excel…

Relational DB

Sparql Endpoint

Real Time RDFizationService

Streaming Service(DB, e.g. JSON)

C‐SparqlQuery 

Generator

C‐Sparqlqueries

Page 12: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

RDF Streaming

12

Real Time RDFizationService

StreamingEngine(C‐Sparql)

Streaming Service(DB, e.g. JSON)

RDF

Real‐time Machine Learning(GROK/NUPIC, Trident‐ML)

C‐Sparql queries

Page 13: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Core Technological Prerequisites

Tool PrerequisitesOpen Refine GREL Functions

GREL ExamplesKarma Create ontologies

Ontologies from Karma Web InterfaceUIMA Regular ExpressionsSilk Silk Link Specification Language (Silk‐

LSL)C‐Sparql C‐Sparql language

13

Page 14: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant upcoming research projects (currently under negotiations)

• Data Publishing through the Cloud: A Data- and Platform-as-a-Service Approach for Efficient Data Publication and Consumption (DaPaaS)

o The DaPaaS project aims to deliver an integrated DaaS and PaaS environment for open data–the DaPaaS platform–together with supporting activities for effective and efficient publication and consumption of data and creation of applications using the data.

o Expected to start Nov 2013 o Budget ~2.1M € (~1.5M €) for 2 years (EC funded)

14

Page 15: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

15

Page 16: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant upcoming research projects (currently under negotiations – cont’)

• ProaSense – The Proactive Sensing Enterpriseo The goal is to provide a very scalable, distributed architecture for the management

and processing of big-data that will enable continuous monitoring of the need for the service adaptation and propose corresponding changes in an (semi-) automatic way.

o Expected to start Nov 2013 o Budget ~4.2M € (~3.2M €) for 3 years (EC funded)

16

Page 17: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

17

Page 18: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant upcoming research projects (currently under negotiations – cont’)

• SmartOpenData – Open Linked Data for environment protection in Smart Regions

o SmartOpenData aims to define mechanisms for acquiring, adapting and using Open Data provided by existing sources for environment protection in European protected areas

o Expected to start Nov 2013 o Budget ~3.4M € (~2.5M €) for 2 years (EC funded)

18

Page 19: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant upcoming research projects (currently under negotiations – cont’)

• INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards

o Develop reliable stress tests on European critical infrastructure using integrated modelling tools for decision-support. It will lead to higher infrastructure networks resilience to rare and low probability extreme events, known as “black swans”.

o Expected to start Oct 2013 o Budget ~3.6M € (~2.8M €) for 2 years (EC funded)

19

Page 20: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant ongoing research projects 

• BigFut – Analyzing Big Data: Preparing for the Future of Intelligent Information Management

o SINTEF Internal projecto Goals:

• Analyze and integrate a suit of technological approaches and techniques • Advance demo/prototype implementation to penetrate the market in the short term

o Jan-Dec 2013

20

Page 21: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Relevant ongoing research projects (cont’)

• CITI-SENSE – develop ”Citizen’s Observatories” to empowercitizens to:

o Contribute to and participate in environmental governanceo Support and influence community and policy priorities and associated decision

makingo Contribute to Global Earth Observation System of Systems (GEOSS)

• 27 partners (EC fuded)

21

http://www.citi‐sense.eu/

Page 22: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Summary and Outlook• What’s new here

o The platform itself, implementing flexible data integration workflowso A set of components (e.g. Real-time RDFization of streams, C-Sparql Query

Generator)

• What’s challengingo Application integrationo Consistent scalability throughout workflowso Platform deployment on cloud environmentso Use of new, unproven technologieso …and many others

• Short-term plan (end of September)o Get the first prototype implementing the proposed workflowso Experiment with some data / simple use cases (e.g. CITI-SENSE data)

22

Page 23: The BigIaS Platform Simplifying Big Data Integration€¦ · • INFRARISK— Novel Indicators for identifying critical INFRAstructure at RISK from natural Hazards o Develop reliable

Thank you!Q&A

23