Lambdoop, a framework for easy development of big data applications
A framework for easy development of
Big Data applications
Rubén Casado
@ruben_casado
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
About me :-)
PhD in Software Engineering
MSc in Computer Science
BSc in Computer Science
Academics
Work
Experience
About Treelogic
Treelogic is an R&D
intensive company with
the mission of creating,
boosting, developing and
adapting scientific and
technological
knowledge to improve
quality standards in our
daily life
TREELOGIC – Distributor and Sales
International Projects
National Projects
Regional Projects
R&D Manag. System
Internal Projects
Research Lines
Computer Vision
Big Data
Teraherzt technology
Data science
Social Media Analysis
Semantics
Security & Safety
Justice
Health
Transport
Financial services
ICT tailored solutions
Solutions
R&D
7 ongoing FP7 projects
ICT, SEC, OCEAN
Coordinating 5 of them
3 ongoing Eurostars projects
Coordinating all of them
Research
INNOVATION
&
7 years’ experience in R&D projects
More than 40
projects with
budget over 120 MEUR
More than
300 partners
in last 3
years
Project coordinator in 7 European projects
Overall participation
in 11 European
projects
www.datadopter.com
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
A massive volume of both structured and
unstructured data, so large that it is difficult
to process with traditional database and
software techniques
What is Big Data?
“Big Data are high-volume, high-velocity,
and/or high-variety information assets that
require new forms of processing to enable
enhanced decision making, insight
discovery and process optimization”
How is Big Data?
- Gartner IT Glossary -
3 problems
Volume
Variety Velocity
3 solutions
Batch processing
NoSQL
Real-time processing
• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency
Batch processing
Volume
• Low latency
• Continuous unbounded
streams of data
• Distributed
• Parallel
• Fault-tolerant
Real-time processing
Velocity
• Low latency
• Massive data + Streaming data
• Scalable
• Combine batch and real-time results
Hybrid computation model
Volume Velocity
All data
New data
Batch processing
Real-time processing
Batch results
Stream results
Combination → Final results
Hybrid computation model
Batch processing: large amounts of static data, scalable solution (Volume)
Real-time processing: computing streaming data, low latency (Velocity)
Hybrid computation: Lambda Architecture (Volume + Velocity)
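The combination step of the hybrid model can be sketched in plain Java. This is a minimal illustration of the Lambda Architecture idea, not Lambdoop code: the class and method names (`HybridCount`, `batchView`, `streamView`, `combined`) are invented for the example. The batch layer serves a precomputed view over all historical data, the speed layer an incremental view over data that arrived since the last batch run, and the final answer merges both.

```java
import java.util.Arrays;
import java.util.List;

public class HybridCount {

    // Batch layer: recomputes the view over the full historical dataset (high latency).
    static long batchView(List<String> allData) {
        return allData.size();
    }

    // Speed layer: incremental view over records that arrived
    // after the last batch run (low latency).
    static long streamView(List<String> newData) {
        return newData.size();
    }

    // Serving layer: merge both views into one complete, up-to-date result.
    static long combined(List<String> allData, List<String> newData) {
        return batchView(allData) + streamView(newData);
    }

    public static void main(String[] args) {
        List<String> historical = Arrays.asList("r1", "r2", "r3");
        List<String> fresh = Arrays.asList("r4", "r5");
        System.out.println(combined(historical, fresh)); // prints 5
    }
}
```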
Processing Paradigms
Inception (2003)
1st Generation (2006)
2nd Generation (2010)
3rd Generation (2014)
Processing Pipeline
DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Open source framework
Software abstraction layer over Open Source technologies: Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis
Common patterns and operations (aggregation, filtering, statistics…) already implemented. No need to write MapReduce-like code
Same single API for the three processing paradigms
o Batch processing similar to Pig / Cascading
o Real time processing using built-in functions easier than Trident
o Hybrid computation model transparent for the developer
What is Lambdoop?
• Building a batch processing application requires
o Developing MapReduce jobs
o Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…)
o Storage systems (HBase, MongoDB, HDFS, Cassandra…)
Why Lambdoop?
• Real-time processing requires
o Streaming computing (S4, Storm, Samza)
o Unbounded input (Flume, Scribe)
o Temporal data stores (In-memory, Kafka, Kestrel)
Why Lambdoop?
• Building a hybrid computation system (Lambda Architecture) requires
o Application logic has to be defined in two different systems using different frameworks
o Data must be serialized consistently and kept in sync between each system
o Developer is responsible for reading, writing and managing two data storage systems, performing a final combination and serving the final updated results
“One of the most interesting areas of future work is
high level abstractions that map to a batch
processing component and a real-time processing
component. There's no reason why you shouldn't have
the conciseness of a declarative language with the
robustness of the batch/real-time architecture”.
Why Lambdoop?
Nathan Marz
“Lambda Architecture is an implementation challenge.
In many real-world situations a stumbling block for
switching to a Lambda Architecture lies with a scalable
batch processing layer. Technologies like Hadoop (…)
are there, but there is a shortage of people with the
expertise to leverage them.”
Rajat Jain
Lambdoop
Data Operation Data
Workflow
Streaming data
Static data
Lambdoop
Batch
Real-Time
Hybrid
Information represented as Data objects
o Types:
o StaticData
o StreamingData
o Every Data object has a Schema describing the Data fields (types, nullable fields, keys…)
o A Data object is composed of Datasets.
Data Input
Dataset
o A Data object is formed by one or more Datasets.
o All Datasets of a Data object share the same Schema
o Datasets are formed by Register objects.
o A Register is composed of RegisterFields.
Data Input
Schema
o Very similar to Avro schema definitions.
o Allows defining the input data’s structure: fields, types, nullable fields…
o JSON format
Data Input
Station  Title   Lat.    Lon.   Date        SO2  NO   CO    PM10  O3  dd   vv    TMP   HR  PRB
23       street  43.529  5.673  2011-01-04  7    8    0.35  13    67  158  3.87  18.8  34  982
32       road    44.5    5.72   2011-01-04  7    8.6  0.4   12    68  158  3.87  19    33  975

{ "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title",   "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.",    "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.",   "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB",     "type": "double", "index": 20, "nullable": "true"}
  ]}
Importing data into Lambdoop
o Loaders: Import information from multiple sources and store it into the HDFS as Data objects
o Producers: Get streaming data and represent it as Data objects
o Heterogeneous sources.
o Serialize information into Avro format
Data Input
• Static Data example: Importing an Air Quality dataset from local logs to HDFS
o Loader
o Schema’s path is files/csv/Air_quality_schema
Data Input
//Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
• Streaming Data example: Reading streaming sensor data from TCP port
o Producer
o Weather stations emit messages to port 8080
o Schema’s path is files/csv/Air_quality_schema
Data Input
int port = 8080;
//Read schema
String schema = readSchemaFile (schema_file);
Producer producer = new TCPProducer ("AirQualityListener",
refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);
Extensibility
o Users can implement their own data loaders/producers
1) Extend Loader/Producer interface
2) Read data from original source
3) Get and serialize information (Avro format) considering Schemas
Data Input
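As a sketch of what a custom data loader might look like, the following assumes a simplified `SimpleLoader` interface invented for illustration; the real Lambdoop Loader/Producer interfaces, their Schema handling and Avro serialization details will differ. A real implementation would read from the original source (files, databases, ports…) and serialize registers to Avro according to the configured Schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative simplification, not the actual Lambdoop Loader contract.
interface SimpleLoader {
    List<String[]> load();   // each String[] is one register, split into fields
}

// A custom loader that reads CSV lines from an in-memory source.
class InMemoryCsvLoader implements SimpleLoader {
    private final List<String> lines;
    private final String fieldSeparator;

    InMemoryCsvLoader(List<String> lines, String fieldSeparator) {
        this.lines = lines;
        this.fieldSeparator = fieldSeparator;
    }

    @Override
    public List<String[]> load() {
        List<String[]> registers = new ArrayList<>();
        for (String line : lines) {
            // Split each input line into RegisterField-like values.
            registers.add(line.split(fieldSeparator));
        }
        return registers;
    }
}

public class LoaderSketch {
    public static void main(String[] args) {
        SimpleLoader loader = new InMemoryCsvLoader(
                Arrays.asList("23;street;43.529", "32;road;44.5"), ";");
        System.out.println(loader.load().get(0)[1]); // prints street
    }
}
```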
Unitary actions to process data
An Operation takes Data as input, processes the Data and produces another Data as output
Types of operations:
Aggregation: Produces a single value per DataSet
Filter: Output data has the same schema as input data
Group: Produces several DataSets, grouping registers together
Projection: Changes the Data schema, but preserves the records and their values
Join: Combines different Data objects
Operations
Operations
Aggregation(1)
Count
Average
Sum
MinValue
MaxValue
Mode
Aggregation(2)
Skewness
Z-Test
Stderror
Variance
Covariance
Filter
Filter
Limit
TopN
BottomN
Max
Min
Group
Group
RollUp
Cube
N-Til
Projection
Select
Frequency
Variation
Join
Inner Join
Left Join
Right Join
Outer Join
Operations
Extensibility (User Defined Operations): New operations can be defined implementing a set of interfaces:
OperationFactory: Factory used by the framework in order to get batch, streaming and hybrid operation implementations when needed
BatchOperation: Provides MapReduce logic to process the input Data
StreamingOperation: Provides Storm/Trident based functions to process streaming registers
HybridOperation: Provides merging logic between streaming and batch results
Operations
User Defined Operation interfaces
Operations
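Conceptually, a user-defined operation provides one implementation per processing paradigm. The sketch below mirrors the roles of the BatchOperation, StreamingOperation and HybridOperation interfaces named above for a simple sum, but the method signatures are simplified assumptions, not the actual Lambdoop interfaces (where the batch role would carry MapReduce logic and the streaming role Storm/Trident functions).

```java
import java.util.Arrays;
import java.util.List;

public class UserDefinedSum {

    // Batch role: full recomputation over the static dataset.
    static double batch(List<Double> staticData) {
        double sum = 0;
        for (double v : staticData) sum += v;
        return sum;
    }

    // Streaming role: incremental update for each new streaming register.
    static double streaming(double runningSum, double newRegister) {
        return runningSum + newRegister;
    }

    // Hybrid role: merge the batch result with the streaming result.
    static double hybrid(double batchResult, double streamingResult) {
        return batchResult + streamingResult;
    }

    public static void main(String[] args) {
        double b = batch(Arrays.asList(1.0, 2.0, 3.0));  // 6.0 over static data
        double s = streaming(streaming(0.0, 4.0), 5.0);  // 9.0 over new registers
        System.out.println(hybrid(b, s)); // prints 15.0
    }
}
```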
Sequence of connected Operations. Manages tasks and resources (check-points) in order to produce an output using input data and a set of Operations
o BatchWorkflow: Runs a set of operations on StaticData input and produces a new StaticData as output
o StreamingWorkflow: Operates on a StreamingData to produce another StreamingData
o HybridWorkflow: Combines Static and Streaming data to produce complete and updated results (StreamingData)
Workflow connections
Workflows
Data Workflow Data
Data Workflow
Workflow
WorkflowWorkflow
Workflow
Data
Data
Data
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("street 45"));

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
Data output = wf.getResults();
Workflows
//Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));

wf.addOperation(filter);
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}
Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);

//Add a filter operation
Filter filter = new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);

//Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);

//Run the workflow
wf.run();

//Get the results
while (!stop) {
    Data output = wf.getResults();
}
Workflows
Data
CSV, JSON, …
VISUALIZATION
EXPORT
ALARM SYSTEM
Filter
RollUp
StdError
Avg
Select
Cube
Variance
Join
…
Results exploitation
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Show results. Set dashboard refresh*/
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));
Results exploitation Visualization
Results exploitation Visualization
Results exploitation Visualization
Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Export results */
Exporter.asCSV(results, file);
MongoExport(results, conf);      // conf: Map<String, String>
PostgresExport(results, conf);   // conf: Map<String, String>
CSV, JSON, …
Results exploitation Export
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow*/
wf.addOperation(new Count());
…
/* Get results from workflow*/
Data results = wf.getResults();
/* Set alarm
   condition: T/F (e.g. time or a certain value)
   action: execution (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);
Results exploitation Alarms
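One way to picture the condition/action pair is with standard Java functional interfaces: the condition is a predicate over the latest result value, the action a runnable side effect. The `checkAlert` helper below is a hypothetical stand-in for `AlarmFactory.setAlert`, written for illustration only, not the Lambdoop API.

```java
import java.util.function.Predicate;

public class AlarmSketch {

    // Illustrative stand-in for AlarmFactory.setAlert: runs the action
    // whenever the condition holds for the latest result value,
    // and reports whether the alarm fired.
    static boolean checkAlert(double latestValue,
                              Predicate<Double> condition,
                              Runnable action) {
        if (condition.test(latestValue)) {
            action.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Condition: SO2 average above a threshold; action: notify.
        Predicate<Double> tooHigh = v -> v > 10.0;
        Runnable notify = () -> System.out.println("ALERT: SO2 average too high");
        checkAlert(12.5, tooHigh, notify); // prints the alert line
    }
}
```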
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Change configurations and easily manage the cluster
Friendly tools for monitoring the health of the cluster
Wizard-driven Lambdoop installation of new nodes
Visual editor for defining workflows and scheduling tasks
o Plugin for Eclipse
o Visual elements for:
– Input Sources
– Loader
– Operations
– Operation parameters
o RegisterFields
o Static values
– Visualization elements
o Generates workflow code
o XML Import/Export
o Scheduling of workflows
• Tool for working with messy big data, cleaning it and transforming it
• Import data in different formats
• Explore datasets
• Apply advanced cell transformations
• Refine inconsistencies
• Filter and partition your big data
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Objective: To create event assessment and decision-making support
tools that improve speed and efficiency when facing emergency
situations.
Exploit the information available in Social Networks to complement
data about emergency situations
Real-time processing
Social Awareness Based Emergency Situation Solver
Alert detection
Locations
Information
“Attached” resources (photo,
video, links,…)
Static stations and mobile sensors in Asturias sending streaming data
Historical data of > 10 years
Monitoring, trends identification, predictions
Batch processing + Real-time processing + Hybrid computation
Quantum Mechanics Molecular Dynamics
Computer simulation of physical movements of microscopic elements
Large amount of data as streaming in each time-step
Real-time interaction (query, visual exploration) during the simulation
Data analytics on the whole dataset
Real time processing + Batch processing + Hybrid computation
1. Big Data processing
2. Lambdoop framework
3. Lambdoop ecosystem
4. Case studies
5. Conclusions
Agenda
Conclusions
• Big Data is not only batch processing
• To implement a Lambda Architecture is not trivial
• Lambdoop: Big Data made easy
• High abstraction layer for all processing models
• All steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible
Conclusions
• Roadmap
– Now
• Release an early version of the Lambdoop Framework as Open Source
• Get feedback from the community
• Increase the set of built-in functions
– Next
o Move all components to YARN
o Stable versions of Lambdoop ecosystem
o Models (Mahout, Jubatus, Samoa, R)
– Beyond
• Configurable processing engines (Spark, S4, Samza …)
• Configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB …)
If you want to stay tuned about Lambdoop, register at
@ruben_casado @datadopter @treelogic
www.lambdoop.com
[email protected] [email protected]
www.lambdoop.com www.datadopter.com www.treelogic.com