Lambdoop, a framework for easy development of big data applications

A framework for easy development of Big Data applications Rubén Casado [email protected] @ruben_casado

description

Most existing Big Data technologies focus on managing large amounts of static data (e.g. Hadoop, Hive, Pig). More recent approaches deal with real-time processing of dynamic data (e.g. Storm, S4). Batch processing of massive static data gives strong results because it can take more information into account and, for example, train predictive models better. But batch processing takes time and is not feasible in domains where response time is critical. Real-time processing solves this issue, but it does so by limiting the information analyzed in order to achieve low latency. Many domains need the benefits of both approaches. Building a scalable architecture that delivers both is not easy: it means selecting and tailoring suitable technologies, software layers, data sources, data storage solutions, smart algorithms and so on. That is where Lambdoop comes in. Lambdoop is a software framework that eases the development of Big Data applications by combining real-time and batch processing. It implements a Lambda-based architecture that provides an abstraction layer to developers, who do not have to deal with different technologies, configurations, data formats and so on; they use the Lambdoop framework as the only API they need. Lambdoop also includes extra tools such as input/output drivers, visualization tools, cluster management tools and widely accepted AI algorithms. To evaluate the effectiveness of Lambdoop we have applied the framework to different real scenarios: 1) analysis and prediction of air quality data; 2) social-network-based identification of emergency situations; and 3) quantum chemistry molecular dynamics simulations. The conclusions of these evaluations provide good feedback for improving the framework.

Transcript of Lambdoop, a framework for easy development of big data applications

Page 1: Lambdoop, a framework for easy development of big data applications

A framework for easy development of

Big Data applications

Rubén Casado

[email protected]

@ruben_casado

Page 2: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 3: Lambdoop, a framework for easy development of big data applications

About me :-)

Page 4: Lambdoop, a framework for easy development of big data applications

PhD in Software Engineering MSc in Computer Science BSc in Computer Science

Academics

Work

Experience

Page 5: Lambdoop, a framework for easy development of big data applications

About Treelogic

Page 6: Lambdoop, a framework for easy development of big data applications

Treelogic is an R&D

intensive company with

the mission of creating,

boosting, developing and

adapting scientific and

technological

knowledge to improve

quality standards in our

daily life

Page 7: Lambdoop, a framework for easy development of big data applications

TREELOGIC – Distributor and Sales

Page 8: Lambdoop, a framework for easy development of big data applications

International Projects

National Projects

Regional Projects

R&D Manag. System

Internal Projects

Research Lines

Computer Vision

Big Data

Terahertz technology

Data science

Social Media Analysis

Semantics

Security & Safety

Justice

Health

Transport

Financial services

ICT tailored solutions

Solutions

R&D

Page 9: Lambdoop, a framework for easy development of big data applications

7 ongoing FP7 projects

ICT, SEC, OCEAN

Coordinating 5 of them

3 ongoing Eurostars projects

Coordinating all of them

Page 10: Lambdoop, a framework for easy development of big data applications

Research

INNOVATION

&

7 years’ experience in R&D projects

More than 40

projects with

budget over 120 MEUR

More than

300 partners

in last 3

years

Project coordinator in 7 European projects

Overall participation

in 11 European

projects

Page 11: Lambdoop, a framework for easy development of big data applications

www.datadopter.com

Page 12: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 13: Lambdoop, a framework for easy development of big data applications

A massive volume of both

structured and unstructured data

that is too large to process with

traditional database and software

techniques

What is Big Data?

Page 14: Lambdoop, a framework for easy development of big data applications

Big Data are high-volume, high-velocity,

and/or high-variety information assets that

require new forms of processing to enable

enhanced decision making, insight

discovery and process optimization

How is Big Data?

- Gartner IT Glossary -

Page 15: Lambdoop, a framework for easy development of big data applications

3 problems

Volume

Variety Velocity

Page 16: Lambdoop, a framework for easy development of big data applications

3 solutions

Batch processing

NoSQL

Real-time processing

Page 17: Lambdoop, a framework for easy development of big data applications

3 solutions

Batch processing

NoSQL

Real-time processing

Page 18: Lambdoop, a framework for easy development of big data applications

• Scalable

• Large amount of static data

• Distributed

• Parallel

• Fault tolerant

• High latency

Batch processing

Volume

Page 19: Lambdoop, a framework for easy development of big data applications

• Low latency

• Continuous unbounded

streams of data

• Distributed

• Parallel

• Fault-tolerant

Real-time processing

Velocity

Page 20: Lambdoop, a framework for easy development of big data applications

• Low latency

• Massive data + Streaming data

• Scalable

• Combine batch and real-time results

Hybrid computation model

Volume Velocity

Page 21: Lambdoop, a framework for easy development of big data applications

All data

New data

Batch processing

Real-time processing

Batch results

Stream results

Combination

Final results

Hybrid computation model
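The combination step can be illustrated with a running average. The batch layer contributes a sum and a count over all the data it has seen, the real-time layer contributes the same for data that arrived since the last batch run, and the two partial results are merged when the final result is requested. The following is a minimal sketch in plain Java, not Lambdoop code; the class and method names (`PartialAverage`, `merge`) are hypothetical.

```java
// Hypothetical illustration of the "combination" step; not part of the Lambdoop API.
final class PartialAverage {
    final double sum;
    final long count;

    PartialAverage(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Build a partial result from a set of raw values.
    static PartialAverage of(double... values) {
        double s = 0;
        for (double v : values) {
            s += v;
        }
        return new PartialAverage(s, values.length);
    }

    // Merge the batch view with the real-time view.
    PartialAverage merge(PartialAverage other) {
        return new PartialAverage(sum + other.sum, count + other.count);
    }

    // Final result; NaN when no data has been seen at all.
    double value() {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        PartialAverage batch = PartialAverage.of(7.0, 9.0); // all data seen by the last batch run
        PartialAverage stream = PartialAverage.of(8.0);     // new data since then
        System.out.println(batch.merge(stream).value());    // prints 8.0
    }
}
```

Keeping the sum and the count, rather than two finished averages, is what makes the merge exact: two averages cannot be combined correctly without knowing how many values each one covers.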

Page 22: Lambdoop, a framework for easy development of big data applications

Batch processing Large amount of static data Scalable solution Volume

Real-time processing Computing streaming data Low latency Velocity

Hybrid computation Lambda Architecture Volume + Velocity

2006

2010

2014

1st Generation

2nd Generation

3rd Generation

Inception

2003

Processing Paradigms

Page 23: Lambdoop, a framework for easy development of big data applications

Processing Pipeline

DATA ACQUISITION

DATA STORAGE

DATA ANALYSIS

RESULTS

Page 24: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 25: Lambdoop, a framework for easy development of big data applications

Open source framework

Software abstraction layer over Open Source technologies

o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis

Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process

Same single API for the three processing paradigms

o Batch processing similar to Pig / Cascading

o Real time processing using built-in functions easier than Trident

o Hybrid computation model transparent for the developer

What is Lambdoop?

Page 26: Lambdoop, a framework for easy development of big data applications

• Building a batch processing application requires

o MapReduce development

o Use of other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog …)

o Storage systems (HBase, MongoDB, HDFS, Cassandra…)

Why Lambdoop?

• Real-time processing requires

o Streaming computing (S4, Storm, Samza)

o Unbounded input (Flume, Scribe)

o Temporal data stores (In-memory, Kafka, Kestrel)

Page 27: Lambdoop, a framework for easy development of big data applications

Why Lambdoop?

• Building a hybrid computation system (Lambda Architecture) requires

o Application logic has to be defined in two different systems using different frameworks

o Data must be serialized consistently and kept in sync between each system

o Developer is responsible for reading, writing and managing two data storage systems, performing a final combination and serving the final updated results

Page 28: Lambdoop, a framework for easy development of big data applications
Page 29: Lambdoop, a framework for easy development of big data applications
Page 30: Lambdoop, a framework for easy development of big data applications

“One of the most interesting areas of future work is

high level abstractions that map to a batch

processing component and a real-time processing

component. There's no reason why you shouldn't have

the conciseness of a declarative language with the

robustness of the batch/real-time architecture”.

Why Lambdoop?

Nathan Marz

“Lambda Architecture is an implementation challenge.

In many real-world situations a stumbling block for

switching to a Lambda Architecture lies with a scalable

batch processing layer. Technologies like Hadoop (…)

are there but there is a shortage of people with the

expertise to leverage them.”

Rajat Jain

Page 31: Lambdoop, a framework for easy development of big data applications

Lambdoop

Data Operation Data

Workflow

Streaming data

Static data

Page 32: Lambdoop, a framework for easy development of big data applications

Lambdoop

Batch

Real-Time

Hybrid

Page 33: Lambdoop, a framework for easy development of big data applications

Information represented as Data objects

o Types:

o StaticData

o StreamingData

o Every Data object has a Schema that describes its fields (types, nullability, keys…)

o A Data object is composed of Datasets.

Data Input

Page 34: Lambdoop, a framework for easy development of big data applications

Dataset

o A Data object is formed by one or more Datasets.

o All Datasets of a Data object share the same Schema

o Datasets are formed by Register objects.

o A Register is composed of RegisterFields.

Data Input

Page 35: Lambdoop, a framework for easy development of big data applications

Schema

o Very similar to Avro definition schemas.

o Allows defining the structure of the input data: fields, types, nullability…

o JSON format

Data Input

Station  Title   Lat.    Lon.   Date        SO2  NO   CO    PM10  O3  dd   vv    TMP   HR  PRB
23       street  43.529  5.673  2011-01-04  7    8    0.35  13    67  158  3.87  18.8  34  982
32       road    44.5    5.72   2011-01-04  7    8.6  0.4   12    68  158  3.87  19    33  975

{
  "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}
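To make the role of the schema concrete, the sketch below shows how field entries of this kind (name, type, index, nullable) could drive the parsing of one delimited line into typed values. It is a plain-Java illustration under stated assumptions, not the Lambdoop implementation: the `SchemaSketch` and `Field` classes are hypothetical, and only the `string` and `double` types are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of schema-driven parsing; not the Lambdoop implementation.
final class SchemaSketch {

    // One entry of the "fields" array: name, type, column index, nullable flag.
    static final class Field {
        final String name;
        final String type;
        final int index;
        final boolean nullable;

        Field(String name, String type, int index, boolean nullable) {
            this.name = name;
            this.type = type;
            this.index = index;
            this.nullable = nullable;
        }
    }

    // Parse one delimited line into a name -> typed value map, following the schema.
    static Map<String, Object> parse(String line, String separator, List<Field> fields) {
        String[] cells = line.split(Pattern.quote(separator), -1);
        Map<String, Object> record = new LinkedHashMap<>();
        for (Field f : fields) {
            String raw = f.index < cells.length ? cells[f.index].trim() : "";
            if (raw.isEmpty()) {
                if (!f.nullable) {
                    throw new IllegalArgumentException("Missing value for non-nullable field " + f.name);
                }
                record.put(f.name, null);
            } else if (f.type.equals("double")) {
                record.put(f.name, Double.parseDouble(raw));
            } else {
                record.put(f.name, raw); // treat everything else as a string
            }
        }
        return record;
    }

    public static void main(String[] args) {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("Station", "string", 0, false));
        fields.add(new Field("Title", "string", 1, true));
        fields.add(new Field("Lat.", "double", 2, true));
        System.out.println(parse("23;street;43.529", ";", fields));
    }
}
```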

Page 36: Lambdoop, a framework for easy development of big data applications

Importing data into Lambdoop

o Loaders: Import information from multiple sources and store it into the HDFS as Data objects

o Producers: Get streaming data and represent it as Data objects

o Heterogeneous sources.

o Serialize information into Avro format

Data Input

Page 37: Lambdoop, a framework for easy development of big data applications

• Static Data example: Importing an Air Quality dataset from local logs to HDFS

o Loader

o Schema’s path is files/csv/Air_quality_schema

Data Input

//Read schema from a file

String schema = readSchemaFile(schema_file);

Loader loader = new CSVLoader("AQ.avro", uri, schema);

Data input = new StaticData(loader);

Page 38: Lambdoop, a framework for easy development of big data applications

• Streaming Data example: Reading streaming sensor data from TCP port

o Producer

o Weather stations emit messages to port 8080

o Schema’s path is files/csv/Air_quality_schema

Data Input

int port = 8080;

//Read schema

String schema = readSchemaFile (schema_file);

Producer producer = new TCPProducer ("AirQualityListener",

refresh, port, schema);

// Create Data object

Data data = new StreamingData(producer);

Page 39: Lambdoop, a framework for easy development of big data applications

Extensibility

o Users can implement their own data loaders/producers

1) Extend Loader/Producer interface

2) Read data from original source

3) Get and serialize information (Avro format) considering Schemas

Data Input
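The first two extensibility steps can be sketched in plain Java. The slides do not show the real Loader/Producer interfaces, so `RecordLoader` below is a hypothetical stand-in, and the Avro serialization of step 3 is left out; the sketch only reads delimited text from a source and turns it into rows of fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Lambdoop Loader; the real interface is not shown in the slides.
interface RecordLoader {
    List<String[]> load() throws IOException;
}

// Step 1: implement the loader interface. Step 2: read data from the original source.
final class DelimitedTextLoader implements RecordLoader {
    private final Reader source;
    private final String separator;
    private final boolean hasHeader;

    DelimitedTextLoader(Reader source, String separator, boolean hasHeader) {
        this.source = source;
        this.separator = separator;
        this.hasHeader = hasHeader;
    }

    @Override
    public List<String[]> load() throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                boolean skip = first && hasHeader; // drop the header row if there is one
                first = false;
                if (!skip && !line.trim().isEmpty()) {
                    rows.add(line.split(Pattern.quote(separator), -1));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "Station;Title;SO2\n23;street;7\n32;road;7\n";
        RecordLoader loader = new DelimitedTextLoader(new StringReader(csv), ";", true);
        System.out.println(loader.load().size() + " records loaded");
    }
}
```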

Page 40: Lambdoop, a framework for easy development of big data applications

Unitary actions to process data

An Operation takes Data as input, processes the Data and produces another Data as output

Types of operations:

Aggregation: Produces a single value per DataSet

Filter: Output data has the same schema as input data

Group: Produces several DataSets, grouping registers together

Projection: Changes the Data schema, but preserves the records and their values

Join: Combines different Data objects

Operations
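The difference between the operation types is easier to see on a small in-memory example. The sketch below is a plain-Java analogy using the standard streams API, not Lambdoop code; the `Reading` class and the method names are hypothetical, and Join is omitted for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogy for the operation types; not Lambdoop code.
final class OperationTypesSketch {

    // A register reduced to three fields for the example.
    static final class Reading {
        final String station;
        final String title;
        final double so2;

        Reading(String station, String title, double so2) {
            this.station = station;
            this.title = title;
            this.so2 = so2;
        }
    }

    // Filter: same schema, fewer registers.
    static List<Reading> filterByTitle(List<Reading> in, String title) {
        return in.stream().filter(r -> r.title.equals(title)).collect(Collectors.toList());
    }

    // Aggregation: a single value per dataset.
    static double averageSo2(List<Reading> in) {
        return in.stream().mapToDouble(r -> r.so2).average().orElse(Double.NaN);
    }

    // Group: several datasets, one per key.
    static Map<String, List<Reading>> groupByStation(List<Reading> in) {
        return in.stream().collect(Collectors.groupingBy(r -> r.station));
    }

    // Projection: same registers, narrower schema.
    static List<Double> selectSo2(List<Reading> in) {
        return in.stream().map(r -> r.so2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Reading> data = Arrays.asList(
            new Reading("23", "street", 7.0),
            new Reading("23", "street", 9.0),
            new Reading("32", "road", 8.0));
        System.out.println(averageSo2(filterByTitle(data, "street"))); // prints 8.0
        System.out.println(groupByStation(data).keySet().size());      // prints 2
        System.out.println(selectSo2(data));                           // prints [7.0, 9.0, 8.0]
    }
}
```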

Page 41: Lambdoop, a framework for easy development of big data applications

Operations

Aggregation (1): Count, Average, Sum, MinValue, MaxValue, Mode

Aggregation (2): Skewness, Z-Test, StdError, Variance, Covariance

Filter: Filter, Limit, TopN, BottomN, Max, Min

Group: Group, RollUp, Cube, N-Tile

Projection: Select, Frequency, Variation

Join: Inner Join, Left Join, Right Join, Outer Join

Operations

Page 42: Lambdoop, a framework for easy development of big data applications

Extensibility (User Defined Operations): New operations can be defined implementing a set of interfaces:

OperationFactory: Factory used by the framework in order to get batch, streaming and hybrid operation implementations when needed

BatchOperation: Provides MapReduce logic to process the input Data

StreamingOperation: Provides Storm/Trident based functions to process streaming registers

HybridOperation: Provides merging logic between streaming and batch results

Operations
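As an illustration of how the four roles fit together, here is a user-defined Count written against simplified, hypothetical interfaces. The real Lambdoop signatures are not shown in the slides, and the real batch and streaming parts would carry MapReduce and Storm/Trident logic; this sketch uses in-memory collections and the invented names `BatchOp`, `StreamingOp`, `HybridOp` and `OpFactory` only to show the division of responsibilities.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified shapes for the roles; the real Lambdoop signatures are not shown in the slides.
interface BatchOp<T, R> {
    R process(List<T> allData);            // whole static dataset in, result out
}

interface StreamingOp<T, R> {
    R update(R current, T newRegister);    // fold one new register into the running result
}

interface HybridOp<R> {
    R merge(R batchResult, R streamResult); // combine the two partial results
}

interface OpFactory<T, R> {
    BatchOp<T, R> batch();
    StreamingOp<T, R> streaming();
    HybridOp<R> hybrid();
}

// A user-defined Count expressed through the three processing roles.
final class CountFactory<T> implements OpFactory<T, Long> {
    @Override
    public BatchOp<T, Long> batch() {
        return all -> (long) all.size();
    }

    @Override
    public StreamingOp<T, Long> streaming() {
        return (current, register) -> current + 1;
    }

    @Override
    public HybridOp<Long> hybrid() {
        return Long::sum;
    }

    public static void main(String[] args) {
        CountFactory<String> count = new CountFactory<>();
        long batch = count.batch().process(Arrays.asList("t1", "t2", "t3"));
        long stream = count.streaming().update(0L, "t4");
        System.out.println(count.hybrid().merge(batch, stream)); // prints 4
    }
}
```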

Page 43: Lambdoop, a framework for easy development of big data applications

User Defined Operation interfaces

Operations

Page 44: Lambdoop, a framework for easy development of big data applications

Sequence of connected Operations. Manages tasks and resources (check-points) in order to produce an output using input data and a set of Operations

o BatchWorkflow: Runs a set of operations on StaticData input and produces a new StaticData as output

o StreamingWorkflow: Operates on a StreamingData to produce another StreamingData

o HybridWorkflow: Combines Static and Streaming data to produce completed and updated results (StreamingData)

Workflow connections

Workflows

[Diagram: Data objects passed between connected Workflows]
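The idea of a workflow as an ordered chain of operations, each one consuming the output of the previous one, can be sketched in a few lines of plain Java. This is an analogy rather than Lambdoop code: `WorkflowSketch` is a hypothetical class, the "Data" is just a list of numbers, and task and resource management (check-points) is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Plain-Java analogy of a workflow as an ordered chain of operations; not Lambdoop code.
final class WorkflowSketch {
    private final List<Double> input;
    private final List<UnaryOperator<List<Double>>> operations = new ArrayList<>();
    private List<Double> results;

    WorkflowSketch(List<Double> input) {
        this.input = input;
    }

    void addOperation(UnaryOperator<List<Double>> op) {
        operations.add(op);
    }

    // Feed the output of each operation into the next one, in insertion order.
    void run() {
        List<Double> current = input;
        for (UnaryOperator<List<Double>> op : operations) {
            current = op.apply(current);
        }
        results = current;
    }

    List<Double> getResults() {
        if (results == null) {
            throw new IllegalStateException("run() has not been called");
        }
        return results;
    }

    public static void main(String[] args) {
        WorkflowSketch wf = new WorkflowSketch(Arrays.asList(5.0, 12.0, 20.0));
        // Filter: keep values above 10
        wf.addOperation(xs -> xs.stream().filter(x -> x > 10.0).collect(Collectors.toList()));
        // Aggregation: average of what is left, wrapped as a one-element result
        wf.addOperation(xs -> Collections.singletonList(
            xs.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN)));
        wf.run();
        System.out.println(wf.getResults()); // prints [16.0]
    }
}
```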

Page 45: Lambdoop, a framework for easy development of big data applications

// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
    ConditionType.EQUAL, new StaticValue("street 45"));

// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the results
Data output = wf.getResults();

Workflows

Page 46: Lambdoop, a framework for easy development of big data applications

// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
    ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));

// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}

Workflows

Page 47: Lambdoop, a framework for easy development of big data applications

// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);

Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
    ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);

// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the results
while (!stop) {
    Data output = wf.getResults();
}

Workflows

Page 48: Lambdoop, a framework for easy development of big data applications

Data

CSV, JSON, …

VISUALIZATION

EXPORT

ALARM SYSTEM

Filter

RollUp

StdError

Avg

Select

Cube

Variance

join

Results exploitation

Page 49: Lambdoop, a framework for easy development of big data applications

/* Produce from Twitter */

TwitterProducer producer = new TwitterProducer(…);

Data data = new StreamingData(producer);

StreamingWorkflow wf = new StreamingWorkflow(data);

/* Add operations to workflow*/

wf.addOperation(new Count());

/* Get results from workflow*/

Data results = wf.getResults();

/* Show results. Set dashboard refresh*/

Dashboard d = new Dashboard(config);

d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));

Results exploitation Visualization

Page 50: Lambdoop, a framework for easy development of big data applications

Results exploitation Visualization

Page 51: Lambdoop, a framework for easy development of big data applications

Results exploitation Visualization

Page 52: Lambdoop, a framework for easy development of big data applications

Data data = new StaticData(loader);

Workflow wf = new BatchWorkflow(data);

/* Add operations to workflow*/

wf.addOperation(new Count());

/* Get results from workflow*/

Data results = wf.getResults();

/* Export results */

Exporter.asCSV(results, File);

MongoExport(results, Map<String, String> conf);

PostgresExport(results, Map<String, String> conf);

CSV, JSON, …

Results exploitation Export

Page 53: Lambdoop, a framework for easy development of big data applications

Data data = new StreamingData(producer);

StreamingWorkflow wf = new StreamingWorkflow(data);

/* Add operations to workflow*/

wf.addOperation(new Count());

/* Get results from workflow*/

Data results = wf.getResults();

/* Set alarm

condition: T/F (e.g time or certain value)

action: execution (e.g. show results, send an email)*/

AlarmFactory.setAlert(results, condition, action);

Results exploitation Alarms
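The condition/action pairing behind an alarm can be sketched with the standard `Predicate` and `Consumer` interfaces. This is a hypothetical, plain-Java illustration and not the Lambdoop `AlarmFactory`: the `AlarmSketch` class is invented for the example, and the threshold of 100 is an arbitrary value.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of a condition/action alarm; not the Lambdoop AlarmFactory.
final class AlarmSketch<T> {
    private final Predicate<T> condition; // true/false test on a result
    private final Consumer<T> action;     // what to execute when the test holds

    AlarmSketch(Predicate<T> condition, Consumer<T> action) {
        this.condition = condition;
        this.action = action;
    }

    // Evaluate the condition on a new result; run the action and return true if it holds.
    boolean check(T result) {
        if (condition.test(result)) {
            action.accept(result);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AlarmSketch<Long> alarm = new AlarmSketch<Long>(
            count -> count > 100,                                   // arbitrary threshold
            count -> System.out.println("ALERT: count = " + count));
        alarm.check(50L);  // below the threshold, nothing happens
        alarm.check(150L); // prints "ALERT: count = 150"
    }
}
```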

Page 54: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 55: Lambdoop, a framework for easy development of big data applications

Change configurations and easily manage the cluster

Friendly tools for monitoring the health of the cluster

Wizard-driven Lambdoop installation of new nodes

Page 56: Lambdoop, a framework for easy development of big data applications

Visual editor for defining workflows and scheduling tasks

o Plugin for Eclipse

o Visual elements for:

– Input Sources

– Loader

– Operations

– Operation parameters

o RegisterFields

o Static values

– Visualization elements

o Generates workflow code

o XML Import/Export

o Scheduling of workflows

Page 57: Lambdoop, a framework for easy development of big data applications

• Tool for working with messy big data, cleaning it and transforming it.

• Import data in different formats

• Explore datasets

• Apply advanced cell transformations

• Refine inconsistencies

• Filter and partition your big data

Page 58: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 59: Lambdoop, a framework for easy development of big data applications

Objective: To create event assessment and decision-making

support tools that improve speed and efficiency when facing

emergency situations.

Exploit the information available in Social Networks to complement

data about emergency situations

Real-time processing

Social Awareness Based Emergency Situation Solver

Page 60: Lambdoop, a framework for easy development of big data applications

Alert detection

Locations

Information

“Attached” resources (photo,

video, links,…)

Page 61: Lambdoop, a framework for easy development of big data applications

Static stations and mobile sensors in Asturias sending streaming data

Historical data of > 10 years

Monitoring, trend identification, predictions

Batch processing + Real-time processing + Hybrid computation

Page 62: Lambdoop, a framework for easy development of big data applications

Quantum Mechanics Molecular Dynamics

Computer simulation of physical movements of microscopic elements

Large amounts of streaming data at each time step

Real-time interaction (query, visual exploration) during the simulation

Data analytics on the whole dataset

Real time processing + Batch processing + Hybrid computation

Page 63: Lambdoop, a framework for easy development of big data applications

1. Big Data processing

2. Lambdoop framework

3. Lambdoop ecosystem

4. Case studies

5. Conclusions

Agenda

Page 64: Lambdoop, a framework for easy development of big data applications

Conclusions

• Big Data is not only batch processing

• Implementing a Lambda Architecture is not trivial

• Lambdoop: Big Data made easy

• High abstraction layer for all processing models

• All steps in the data processing pipeline

• Same Java API for all programming paradigms

• Extensible

Page 65: Lambdoop, a framework for easy development of big data applications

Conclusions

• Roadmap

– Now

• Release an early version of Lambdoop Framework as Open Source

• Get feedback from the community

• Increase the set of built-in functions

– Next

o Move all components to YARN

o Stable versions of Lambdoop ecosystem

o Models (Mahout, Jubatus, Samoa, R)

– Beyond

• Configurable processing engines (Spark, S4, Samza …)

• Configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB …)

Page 66: Lambdoop, a framework for easy development of big data applications

If you want to stay tuned about Lambdoop, register at

@ruben_casado @datadopter @treelogic

www.lambdoop.com

[email protected] [email protected]

www.lambdoop.com www.datadopter.com www.treelogic.com