Lambdoop, a framework for easy development of big data applications
A framework for easy development of
Big Data applications
Rubén Casado
@ruben_casado
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
About me :-)
PhD in Software Engineering
MSc in Computer Science
BSc in Computer Science
Academics
Work
Experience
About Treelogic
Treelogic is an R&D
intensive company with
the mission of creating,
boosting, developing and
adapting scientific and
technological
knowledge to improve
quality standards in our
daily life
TREELOGIC – Distributor and Sales
International Projects
National Projects
Regional Projects
R&D Manag. System
Internal Projects
Research Lines
Computer Vision
Big Data
Teraherzt technology
Data science
Social Media Analysis
Semantics
Security & Safety
Justice
Health
Transport
Financial services
ICT tailored solutions
Solutions
R&D
7 ongoing FP7 projects
ICT, SEC, OCEAN
Coordinating 5 of them
3 ongoing Eurostars projects
Coordinating all of them
Research
INNOVATION
&
7 years’ experience in R&D projects
More than 40
projects with
budget over 120 MEUR
More than
300 partners
in last 3
years
Project coordinator in 7 European projects
Overall participation
in 11 European
projects
www.datadopter.com
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
A massive volume of both structured and
unstructured data, so large that it is difficult
to process with traditional database and
software techniques
What is Big Data?
“Big Data are high-volume, high-velocity,
and/or high-variety information assets that
require new forms of processing to enable
enhanced decision making, insight
discovery and process optimization”
How is Big Data?
- Gartner IT Glossary -
3 problems
Volume
Variety Velocity
3 solutions
Batch processing
NoSQL
Real-time processing
• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency
Batch processing
Volume
• Low latency
• Continuous unbounded
streams of data
• Distributed
• Parallel
• Fault-tolerant
Real-time processing
Velocity
• Low latency
• Massive data + Streaming data
• Scalable
• Combine batch and real-time results
Hybrid computation model
Volume Velocity
All data
New data
Batch processing
Real-time processing
Batch results
Stream results
Combination → Final results
Hybrid computation model
Batch processing: large amounts of static data, scalable solution (Volume)
Real-time processing: computing streaming data, low latency (Velocity)
Hybrid computation: Lambda Architecture (Volume + Velocity)
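The combination step of the hybrid model can be sketched in plain Java. This is a minimal illustration of the Lambda Architecture idea, not Lambdoop code: the class and method names (`HybridCount`, `batchView`, `streamView`, `combined`) are invented for the example. The batch layer serves a precomputed view over all historical data, the speed layer an incremental view over data that arrived since the last batch run, and the final answer merges both.

```java
import java.util.Arrays;
import java.util.List;

public class HybridCount {

    // Batch layer: recomputes the view over the full historical dataset (high latency).
    static long batchView(List<String> allData) {
        return allData.size();
    }

    // Speed layer: incremental view over records that arrived
    // after the last batch run (low latency).
    static long streamView(List<String> newData) {
        return newData.size();
    }

    // Serving layer: merge both views into one complete, up-to-date result.
    static long combined(List<String> allData, List<String> newData) {
        return batchView(allData) + streamView(newData);
    }

    public static void main(String[] args) {
        List<String> historical = Arrays.asList("r1", "r2", "r3");
        List<String> fresh = Arrays.asList("r4", "r5");
        System.out.println(combined(historical, fresh)); // prints 5
    }
}
```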
Processing Paradigms
Inception (2003)
1st Generation (2006)
2nd Generation (2010)
3rd Generation (2014)
Processing Pipeline
DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Open source framework
Software abstraction layer over Open Source technologies: Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis
Common patterns and operations (aggregation, filtering, statistics…) already implemented. No need to write MapReduce-like code
Same single API for the three processing paradigms
o Batch processing similar to Pig / Cascading
o Real time processing using built-in functions easier than Trident
o Hybrid computation model transparent for the developer
What is Lambdoop?
• Building a batch processing application requires
o Developing MapReduce jobs
o Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…)
o Storage systems (HBase, MongoDB, HDFS, Cassandra…)
Why Lambdoop?
• Real-time processing requires
o Streaming computing (S4, Storm, Samza)
o Unbounded input (Flume, Scribe)
o Temporal data stores (In-memory, Kafka, Kestrel)
Why Lambdoop?
• Building a hybrid computation system (Lambda Architecture) requires
o Application logic has to be defined in two different systems using different frameworks
o Data must be serialized consistently and kept in sync between each system
o Developer is responsible for reading, writing and managing two data storage systems, performing a final combination and serving the final updated results
“One of the most interesting areas of future work is
high level abstractions that map to a batch
processing component and a real-time processing
component. There's no reason why you shouldn't have
the conciseness of a declarative language with the
robustness of the batch/real-time architecture”.
Why Lambdoop?
Nathan Marz
“Lambda Architecture is an implementation challenge.
In many real-world situations a stumbling block for
switching to a Lambda Architecture lies with a scalable
batch processing layer. Technologies like Hadoop (…)
are there, but there is a shortage of people with the
expertise to leverage them.”
Rajat Jain
Lambdoop
Data Operation Data
Workflow
Streaming data
Static data
Lambdoop
Batch
Real-Time
Hybrid
Information represented as Data objects
o Types:
o StaticData
o StreamingData
o Every Data object has a Schema describing the Data fields (types, nullable fields, keys…)
o A Data object is composed of Datasets.
Data Input
Dataset
o A Data object is formed by one or more Datasets.
o All Datasets of a Data object share the same Schema
o Datasets are formed by Register objects.
o A Register is composed of RegisterFields.
Data Input
Schema
o Very similar to Avro schema definitions.
o Allows defining the input data’s structure: fields, types, nullable fields…
o JSON format
Data Input
Station  Title   Lat.    Lon.   Date        SO2  NO   CO    PM10  O3  dd   vv    TMP   HR  PRB
23       street  43.529  5.673  2011-01-04  7    8    0.35  13    67  158  3.87  18.8  34  982
32       road    44.5    5.72   2011-01-04  7    8.6  0.4   12    68  158  3.87  19    33  975

{ "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title",   "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.",    "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.",   "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB",     "type": "double", "index": 20, "nullable": "true"}
  ]}
Importing data into Lambdoop
o Loaders: Import information from multiple sources and store it into the HDFS as Data objects
o Producers: Get streaming data and represent it as Data objects
o Heterogeneous sources.
o Serialize information into Avro format
Data Input
• Static Data example: Importing an Air Quality dataset from local logs to HDFS
o Loader
o Schema’s path is files/csv/Air_quality_schema
Data Input
//Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
• Streaming Data example: Reading streaming sensor data from TCP port
o Producer
o Weather stations emit messages to port 8080
o Schema’s path is files/csv/Air_quality_schema
Data Input
int port = 8080;
//Read schema
String schema = readSchemaFile (schema_file);
Producer producer = new TCPProducer ("AirQualityListener",
refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);
Extensibility
o Users can implement their own data loaders/producers
1) Extend Loader/Producer interface
2) Read data from original source
3) Get and serialize information (Avro format) considering Schemas
Data Input
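As a sketch of what a custom data loader might look like, the following assumes a simplified `SimpleLoader` interface invented for illustration; the real Lambdoop Loader/Producer interfaces, their Schema handling and Avro serialization details will differ. A real implementation would read from the original source (files, databases, ports…) and serialize registers to Avro according to the configured Schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative simplification, not the actual Lambdoop Loader contract.
interface SimpleLoader {
    List<String[]> load();   // each String[] is one register, split into fields
}

// A custom loader that reads CSV lines from an in-memory source.
class InMemoryCsvLoader implements SimpleLoader {
    private final List<String> lines;
    private final String fieldSeparator;

    InMemoryCsvLoader(List<String> lines, String fieldSeparator) {
        this.lines = lines;
        this.fieldSeparator = fieldSeparator;
    }

    @Override
    public List<String[]> load() {
        List<String[]> registers = new ArrayList<>();
        for (String line : lines) {
            // Split each input line into RegisterField-like values.
            registers.add(line.split(fieldSeparator));
        }
        return registers;
    }
}

public class LoaderSketch {
    public static void main(String[] args) {
        SimpleLoader loader = new InMemoryCsvLoader(
                Arrays.asList("23;street;43.529", "32;road;44.5"), ";");
        System.out.println(loader.load().get(0)[1]); // prints street
    }
}
```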
Unitary actions to process data
An Operation takes Data as input, processes the Data and produces another Data as output
Types of operations:
Aggregation: Produces a single value per DataSet
Filter: Output data has the same schema as input data
Group: Produces several DataSets, grouping registers together
Projection: Changes the Data schema, but preserves the records and their values
Join: Combines different Data objects
Operations
Operations
Aggregation(1)
Count
Average
Sum
MinValue
MaxValue
Mode
Aggregation(2)
Skewness
Z-Test
Stderror
Variance
Covariance
Filter
Filter
Limit
TopN
BottomN
Max
Min
Group
Group
RollUp
Cube
N-Til
Projection
Select
Frequency
Variation
Join
Inner Join
Left Join
Right Join
Outer Join
Operations
Extensibility (User Defined Operations): New operations can be defined implementing a set of interfaces:
OperationFactory: Factory used by the framework in order to get batch, streaming and hybrid operation implementations when needed
BatchOperation: Provides MapReduce logic to process the input Data
StreamingOperation: Provides Storm/Trident based functions to process streaming registers
HybridOperation: Provides merging logic between streaming and batch results
Operations
User Defined Operation interfaces
Operations
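Conceptually, a user-defined operation provides one implementation per processing paradigm. The sketch below mirrors the roles of the BatchOperation, StreamingOperation and HybridOperation interfaces named above for a simple sum, but the method signatures are simplified assumptions, not the actual Lambdoop interfaces (where the batch role would carry MapReduce logic and the streaming role Storm/Trident functions).

```java
import java.util.Arrays;
import java.util.List;

public class UserDefinedSum {

    // Batch role: full recomputation over the static dataset.
    static double batch(List<Double> staticData) {
        double sum = 0;
        for (double v : staticData) sum += v;
        return sum;
    }

    // Streaming role: incremental update for each new streaming register.
    static double streaming(double runningSum, double newRegister) {
        return runningSum + newRegister;
    }

    // Hybrid role: merge the batch result with the streaming result.
    static double hybrid(double batchResult, double streamingResult) {
        return batchResult + streamingResult;
    }

    public static void main(String[] args) {
        double b = batch(Arrays.asList(1.0, 2.0, 3.0));  // 6.0 over static data
        double s = streaming(streaming(0.0, 4.0), 5.0);  // 9.0 over new registers
        System.out.println(hybrid(b, s)); // prints 15.0
    }
}
```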
Sequence of connected Operations. Manages tasks and resources (check-points) in order to produce an output using input data and a set of Operations
o BatchWorkflow: Runs a set of operations on StaticData input and produces a new StaticData as output
o StreamingWorkflow: Operates on a StreamingData to produce another StreamingData
o HybridWorkflow: Combines Static and Streaming data to produce complete and updated results (StreamingData)
Workflow connections
Workflows
Data Workflow Data
Data Workflow
Workflow
WorkflowWorkflow
Workflow
Data
Data
Data
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("street 45"));

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
Data output = wf.getResults();
Workflows
//Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}
Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
while (!stop) {
    Data output = wf.getResults();
}
Workflows
Data
CSV, JSON, …
VISUALIZATION
EXPORT
ALARM SYSTEM
Filter
RollUp
StdError
Avg
Select
Cube
Variance
Join
…
Results exploitation
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Show results. Set dashboard refresh*/
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));
Results exploitation Visualization
Results exploitation Visualization
Results exploitation Visualization
Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Export results */
Exporter.asCSV(results, file);
MongoExport(results, conf);      // conf: Map<String, String>
PostgresExport(results, conf);   // conf: Map<String, String>
CSV, JSON, …
Results exploitation Export
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Set alarm
   condition: T/F (e.g. time or a certain value)
   action: execution (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);
Results exploitation Alarms
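One way to picture the condition/action pair is with standard Java functional interfaces: the condition is a predicate over the latest result value, the action a runnable side effect. The `checkAlert` helper below is a hypothetical stand-in for `AlarmFactory.setAlert`, written for illustration only, not the Lambdoop API.

```java
import java.util.function.Predicate;

public class AlarmSketch {

    // Illustrative stand-in for AlarmFactory.setAlert: runs the action
    // whenever the condition holds for the latest result value,
    // and reports whether the alarm fired.
    static boolean checkAlert(double latestValue,
                              Predicate<Double> condition,
                              Runnable action) {
        if (condition.test(latestValue)) {
            action.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Condition: SO2 average above a threshold; action: notify.
        Predicate<Double> tooHigh = v -> v > 10.0;
        Runnable notify = () -> System.out.println("ALERT: SO2 average too high");
        checkAlert(12.5, tooHigh, notify); // prints the alert line
    }
}
```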
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Change configurations and easily manage the cluster
Friendly tools for monitoring the health of the cluster
Wizard-driven Lambdoop installation of new nodes
Visual editor for defining workflows and scheduling tasks
o Plugin for Eclipse
o Visual elements for:
– Input Sources
– Loader
– Operations
– Operation parameters
o RegisterFields
o Static values
– Visualization elements
o Generates workflow code
o XML Import/Export
o Scheduling of workflows
• Tool for working with messy big data, cleaning it and transforming it
• Import data in different formats
• Explore datasets
• Apply advanced cell transformations
• Refine inconsistencies
• Filter and partition your big data
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Objective: To create event assessment and decision-making support
tools that improve speed and efficiency when facing emergency
situations.
Exploit the information available in Social Networks to complement
data about emergency situations
Real-time processing
Social Awareness Based Emergency Situation Solver
Alert detection
Locations
Information
“Attached” resources (photo,
video, links,…)
Static stations and mobile sensors in Asturias sending streaming data
Historical data of > 10 years
Monitoring, trends identification, predictions
Batch processing + Real-time processing + Hybrid computation
Quantum Mechanics Molecular Dynamics
Computer simulation of physical movements of microscopic elements
Large amount of data as streaming in each time-step
Real-time interaction (query, visual exploration) during the simulation
Data analytics on the whole dataset
Real time processing + Batch processing + Hybrid computation
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Conclusions
• Big Data is not only batch processing
• To implement a Lambda Architecture is not trivial
• Lambdoop: Big Data made easy
• High abstraction layer for all processing models
• All steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible
Conclusions
• Roadmap
– Now
• Release an early version of the Lambdoop Framework as Open Source
• Get feedback from the community
• Increase the set of built-in functions
– Next
o Move all components to YARN
o Stable versions of Lambdoop ecosystem
o Models (Mahout, Jubatus, Samoa, R)
– Beyond
• Configurable processing engines (Spark, S4, Samza …)
• Configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB …)
If you want to stay tuned about Lambdoop, register at
@ruben_casado @datadopter @treelogic
www.lambdoop.com
[email protected] [email protected]
www.lambdoop.com www.datadopter.com www.treelogic.com