TrustwOrthy model-awaRE Analytics Data platfORm
Beniamino Di Martino
CINI, Università della Campania "Luigi Vanvitelli"
Big Data: programming analytics for deployment on multiple BD platforms -
The Toreador Code-Based Approach
Toreador EC H2020 project
Call: ICT-BIGDATA-2015
Project Coordinator:
Prof. Ernesto Damiani
WP2 leader:
Prof. Beniamino Di Martino
Objectives
Specification of a fully declarative framework and a model set supporting Big Data analytics.
MBDAaaS (model-based Big Data Analytics-as-a-service): models of the entire Big Data analysis process and of its artefacts.
Automation and commoditization of Big Data analytics, enabling it to be easily tailored to domain-specific customer requirements.
SLA and assurance approaches to guarantee contractual quality, performance, and security of BDA.
Design and development of automatic deployment of TOREADOR analytic solutions on multiple Big Data platforms.
Pilots
SAP: the Application Log Analysis Pilot (security logs).
LIGHTSOURCE: the Energy Production Data Analysis Pilot (sensor-based management of photovoltaic plants).
JoT: the Clickstream Analysis Pilot (e.g., fraud control).
DTA: the Aerospace Products Manufacturing Analysis Pilot (data related to the manufacturing process).
Model-driven approach
Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks
Develop model transformations to translate modelling decisions into actual provisioning on multiple Big Data Platforms
TOREADOR Overview (Recall)
[Figure: the three TOREADOR model layers. Declarative Models capture (non-)functional goals, i.e., the service goals of the Big Data pipeline. Procedural Models capture what the BDA should achieve and how to achieve those objectives. Deployment Models capture how the BDA process should work on the target platforms.]
Specify goals as Service Level Objectives (SLOs).
Five areas: representation, preparation, analytics, processing, and display and reporting.
A single model is used because aspects of different areas may impact the same procedural model template.
Based on a controlled vocabulary: controlled names (e.g., anonymization) with values on an ordinal scale, expressed as strings or numbers (e.g., obfuscation, hashing, k-anonymity).
Insufficient on its own to run a Big Data analytics campaign: the declarative model is intentionally incomplete.
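As a small illustration of such a model, the SLOs could be serialized as JSON, with names and ordinal values drawn from the controlled vocabulary. The field names below are hypothetical, not the actual TOREADOR vocabulary:

```python
import json

# Hypothetical declarative model: one entry per area, with SLO names
# and ordinal values that are illustrative assumptions.
declarative_model = {
    "representation": {"data_format": "tabular"},
    "preparation":    {"anonymization": "k-anonymity"},  # ordinal scale: obfuscation < hashing < k-anonymity
    "analytics":      {"task": "clustering"},
    "processing":     {"mode": "batch"},
    "display":        {"reporting": "dashboard"},
}

serialized = json.dumps(declarative_model, indent=2)
print(serialized)
```

Note that nothing here says how the analytics is executed: that is exactly why the declarative model alone is insufficient to run a campaign.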
TOREADOR Overview: Declarative Model
Completes the declarative model with all the information needed to run the analytics. Simple to map SLOs onto procedures; platform independent.
Specified as a function that takes SLOs as input and returns procedural templates as output. Procedural templates are pre-calculated based on the defined SLOs.
Templates express the competences of data scientists and data technologists; declarative models are used to select the proper template(s).
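A minimal sketch of that selection function follows; the template names and the matching rule are assumptions for illustration, not the project's actual mapping:

```python
# Pre-calculated procedural templates, indexed by the SLOs they satisfy.
# Keys and template names are illustrative assumptions.
TEMPLATES = {
    ("clustering", "batch"):  "batch-clustering-pipeline",
    ("clustering", "stream"): "stream-clustering-pipeline",
    ("regression", "batch"):  "batch-regression-pipeline",
}

def select_templates(slos):
    """Map declarative SLOs to the pre-calculated, platform-independent
    procedural template(s) that satisfy them."""
    key = (slos["analytics"], slos["processing"])
    return [TEMPLATES[key]] if key in TEMPLATES else []

print(select_templates({"analytics": "clustering", "processing": "batch"}))
```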
TOREADOR Overview: Procedural Model
Specifies how procedural models are incarnated in a ready-to-be-deployed architecture. Drives analytics execution in real scenarios. To be defined for each TOREADOR application scenario. Platform dependent.
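As an illustration only (platform names and configuration fields are hypothetical), a deployment model could bind a platform-independent procedural template to one concrete, platform-dependent runtime configuration:

```python
# Hypothetical deployment model: binds a procedural template to a
# platform-specific runtime configuration. Field values are assumptions.
def make_deployment_model(template, platform):
    bindings = {
        "spark":  {"runner": "spark-submit", "master": "yarn"},
        "hadoop": {"runner": "hadoop jar",   "master": "yarn"},
    }
    if platform not in bindings:
        raise ValueError("unsupported platform: " + platform)
    return {"template": template, "platform": platform, **bindings[platform]}

model = make_deployment_model("batch-clustering-pipeline", "spark")
print(model)
```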
TOREADOR Overview: Deployment Model
The Toreador Dev & Deploy Process
[Figure: the TOREADOR Dev & Deploy process. A Service Catalog, described by an OWL-S Service Ontology, feeds Service Definition/Implementation (code of the services/jobs, APIs). The Declarative Model (JSON) drives Service Selection (OWL-S Selector) and Service Composition (OWL-S Editor, producing an OWL-S Composition Model and Service Model). The Deployment/Compiler stage (Code Compiler, code-base model, JSON) produces the artifacts handed to Platform/Execution.]
Two development approaches
Service-based: no coding, for basic users. Analytics services are provided by the target TOREADOR platform; the Big Data campaign is built by composing existing services, based on model transformations.
Code-based: for advanced users. Analytics algorithms are developed by the users; parallel computations are configured by the users.
Both are driven by the same declarative model: code once, deploy everywhere.
Support for batch and stream processing (batch first).
The Code-Based Approach: Motivation
Code Once/Deploy Everywhere
The TOREADOR user is an expert programmer, aware of the potentialities (flexibility and controllability) and purposes (analytics developed from scratch, or migration of legacy code) of a code-based approach. She expresses the parallel computation of a coded algorithm in terms of parallel primitives. TOREADOR distributes it among computational nodes hosted by different platforms/technologies in multi-platform Big Data and Cloud environments, using state-of-the-art orchestrators.
[Figure: the three phases: I. Code, II. Transform (Skeleton-Based Code Compiler, driven by the Procedural and Deployment Models), III. Deploy.]
First phase: Code
iteration = 0
while iteration == 0 or previousCentroids != centroids:
    previousCentroids = centroids
    assigns = data_parallel_region(data, kmeans_assign, centroids)
    print("ASSIGN:", assigns)
    clusters = [[] for x in centroids]
    for x in assigns:
        clusters[x[1]].append(x[0])
    # (centroid re-computation and iteration update omitted on the slide)
Definition of a set of Parallel Primitives expressing Parallel Computational Patterns in a given programming language.
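The data_parallel_region primitive used in the k-means driver above can be given a straightforward sequential semantics (this matches the reference definition shown in the agnostic skeleton later in the deck):

```python
def data_parallel_region(distr, func, *repl):
    """Sequential semantics of the data-parallel primitive: apply func
    to every element of the distributed collection distr, passing the
    replicated arguments repl unchanged. The compiler can later replace
    this with a truly parallel, platform-specific implementation."""
    return [func(x, *repl) for x in distr]

# Example: squaring every element of a distributed list
print(data_parallel_region([1, 2, 3], lambda x: x * x))  # [1, 4, 9]
```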
Second Phase: Transform
Definition of Code Skeletons that realize the computational logic, workflow, and interactions characterizing a Parallel Computational Pattern.
Realization of a Skeleton-Based Code Compiler (a source-to-source transformer) which, incarnating the Skeletons, transforms the sequential code augmented with the parallel primitives into parallel versions specific to different platforms/technologies, according to a four-step workflow:
Second Phase: Transform
Skeleton-Based Code Compiler: first step
1. Parallel Pattern Selection: selection of the parallel paradigm, based on the Declarative Model and the primitives used.
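This step can be sketched as a simple rule-based choice; the rule set and primitive names below are illustrative assumptions, not the compiler's actual logic:

```python
# Illustrative sketch of step 1: choose the parallel paradigm from the
# parallel primitives found in the user's code.
def select_paradigm(primitives_used):
    if "data_parallel_region" in primitives_used:
        return "data-parallel"   # e.g., map/reduce-style execution
    if "task_parallel_region" in primitives_used:  # hypothetical primitive
        return "task-parallel"
    return "sequential"

print(select_paradigm({"data_parallel_region"}))  # data-parallel
```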
Second Phase: Transform
Skeleton-Based Code Compiler: second step
2. Incarnation of agnostic Skeletons: transformation of the algorithm coded with the aforementioned primitives into a parallel, agnostic version incarnating predefined code Skeletons; technology independent (agnostic with respect to the target platforms/technologies).
import math
import random

def data_parallel_region(distr, func, *repl):
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes the Euclidean distance between two vectors"""
    return math.sqrt(sum([(x[1] - x[0])**2 for x in zip(a, b)]))

def kmeans_init(data, k):
    """Returns the initial centroids configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns (point, index of the nearest centroid)"""
    # (body truncated on the slide; a natural completion:)
    dists = [distance(p, c) for c in centroids]
    return (p, dists.index(min(dists)))
Second Phase: Transform
Skeleton-Based Code Compiler: third step
3. Production of technology-dependent Skeletons: specialization of the agnostic Skeletons into multiple versions of the code to be executed, each specific to a different target platform/technology.
Amazon EMR example:
./elastic-mapreduce --create --stream --input s3n://ngramstest/input_gz --output s3n://ngramstest/output-streaming-fullrun --mapper s3n://ngramstest/code/mapper.py --reducer s3n://ngramstest/code/reducer.py --nu
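As one concrete (purely illustrative) specialization, the agnostic data_parallel_region skeleton could be rewritten against a thread pool, keeping the same signature and result; an actual TOREADOR backend would instead target platforms such as Spark, Hadoop streaming, or Amazon EMR as in the command above:

```python
from concurrent.futures import ThreadPoolExecutor

def data_parallel_region(distr, func, *repl):
    """Thread-pool specialization of the agnostic skeleton: same
    signature and result as the sequential version, but the elements
    of distr are processed concurrently. Illustrative sketch only."""
    with ThreadPoolExecutor() as ex:
        return list(ex.map(lambda x: func(x, *repl), distr))

print(data_parallel_region([1, 2, 3], lambda x: x + 1))  # [2, 3, 4]
```

Because every specialization preserves the primitive's semantics, the user's k-means code runs unchanged on each target: code once, deploy everywhere.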