Profiterole

LESSONS LEARNT FROM DEVELOPING MAP-REDUCE FOR ANDROID


Boris Farber


Plan

1. Introduction and Motivation
2. Problem Domain – Big Data for Android Devices
3. Solution Domain
   1. Solution Architecture
   2. Programming Paradigms
   3. Map Reduce
   4. Design Patterns Used
   5. Implementation
4. Summary and Discussion

Introduction

Big Data is a new programming paradigm that supports data-flow programming where traditional RDBMS and SQL-based systems fail. SQL systems fail not only to scale up but also to provide the desired functionality.

Big Data is, for example, the backbone of Facebook, Twitter, LinkedIn and Google. The data pattern common to these companies is a huge amount of unstructured data.

Unstructured Data vs. Structured Data

SQL handles structured data, i.e. the types (mostly primitive) and the fields are known in advance, and there is little deviation from the flat-table norm.

However, most data is unstructured or semi-structured: think of tweets, likes, profile updates …

Android World

• Android smartphones are getting smarter
• They handle more and more data
• Big data patterns are trickling down to the smartphone market
• Current big data solutions such as Hadoop are not appropriate, because they solve the wrong problems:
  • File system
  • Multi-machine load balancing

Constraints for Android based applications

• The application sleeps most of the time and does not run
• A fault-tolerant file system is impossible
• Save battery and CPU power
• Reuse containers
• Share resources – ashmem, string pool

Problem

Single-threaded approaches to data handling (sort/search/analyze) are naïve:
• Getting slower
• Awkward to develop and maintain
• No multi-core/multi-threading utilization

Problem Domain Definition

The Problem Domain is the world in which the software product's requirements exist. Technically speaking, it is a conceptual model that describes:
• The various entities
• Their attributes and relationships
• Scope and constraints

Problem Domain Case Study

• Semi-structured, text-based data
• Functionality needed:
  • Word Counting
  • Inverted Index: a mapping between words and the documents where the words appear
  • Distributed Grep

Word Count Execution

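Input → Map → Shuffle & Sort → Reduce → Output

Input (one split per Map task):
  split 1: the quick brown fox
  split 2: the fox ate the mouse
  split 3: how now brown cow

Map output:
  split 1: the, 1 / brown, 1 / fox, 1 / quick, 1
  split 2: the, 1 / fox, 1 / the, 1 / ate, 1 / mouse, 1
  split 3: how, 1 / now, 1 / brown, 1 / cow, 1

Reduce output (after Shuffle & Sort):
  reducer 1: brown, 2 / fox, 2 / how, 1 / now, 1 / the, 3
  reducer 2: ate, 1 / cow, 1 / mouse, 1 / quick, 1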
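The word-count flow above (Map, then Shuffle & Sort, then Reduce) can be sketched in a few lines of plain Java. This is only an illustration of the three stages, run sequentially in one thread. The class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are assumptions made for this example and are not taken from the Profiterole SDK.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration; not part of the Profiterole API.
class WordCountSketch {

    // Map step: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & Sort step: group emitted values by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> pairs : mapped) {
            for (Map.Entry<String, Integer> pair : pairs) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce step: sum all values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps one after another over all input splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.add(map(split));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
    }
}
```

Run on the three input splits of the slide, the sketch yields the same totals as the diagram (for instance 3 for "the" and 2 for "brown").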

Inverted Index Example

Input:
  hamlet.txt: to be or not to be
  12th.txt: be not afraid of greatness

Map output:
  hamlet.txt: to, hamlet.txt / be, hamlet.txt / or, hamlet.txt / not, hamlet.txt
  12th.txt: be, 12th.txt / not, 12th.txt / afraid, 12th.txt / of, 12th.txt / greatness, 12th.txt

Output (inverted index):
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
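The inverted index example above can be reproduced with a small Java sketch. It is a single-threaded illustration of the idea, not Profiterole code: the class name InvertedIndexSketch and the method build are hypothetical, and the documents are passed in as plain strings rather than read from files.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class for illustration; not part of the Profiterole API.
class InvertedIndexSketch {

    // Builds a mapping from each word to the sorted set of documents containing it.
    // The input maps a document name to that document's text.
    static SortedMap<String, SortedSet<String>> build(Map<String, String> documents) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // Map step: every word yields a (word, documentName) pair.
            for (String word : doc.getValue().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Reduce step: collect document names per word; the set drops repeats.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("hamlet.txt", "to be or not to be");
        docs.put("12th.txt", "be not afraid of greatness");
        System.out.println(build(docs));
    }
}
```

A sorted set is used for the document names so that a word repeated within one document (such as "to" in hamlet.txt) is listed only once.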

Solution Domain


A conceptual model that covers the use cases of the Problem Domain. It describes:
• All entities and relationships related to the "implementation" of the problem
• Analysis and architectural patterns
• Design patterns

The Solution Domain is larger than the Problem Domain, because it adds entities that are taken for granted in the Problem Domain (such as a factory).

Welcome “Profiterole”

• Open-source Java/Android Big Data solution that implements the Map-Reduce algorithm
• Operates on large text files by breaking them into chunks
• Template based: works not only for strings and integers but for any comparable objects
• Fully concurrent
• Optional Lua-based post-processing

Example of profiterole output

High Level View

Map Reduce

Map-Reduce is a framework for processing highly distributable problems across huge datasets using a large number of computers, threads or CPUs. The framework contains both Map and Reduce functions.

The motivation is a large volume of input data combined with many available machines, which therefore need to be used effectively.

Map-Reduce

1. "Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

2. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
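On a single device the "master node" and "worker nodes" can be played by a calling thread and a thread pool. The sketch below applies the two steps to the Distributed Grep use case: the caller partitions the lines into chunks, workers filter each chunk, and the caller concatenates the partial answers. All names (GrepSketch, grep, chunkSize, numThreads) are assumptions for this example; only the java.util.concurrent calls are standard library API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class for illustration; not part of the Profiterole API.
class GrepSketch {

    // Returns the lines that contain needle, in their original order.
    static List<String> grep(List<String> lines, String needle, int chunkSize, int numThreads) {
        if (chunkSize <= 0 || numThreads <= 0) {
            throw new IllegalArgumentException("chunkSize and numThreads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // "Map" step: the master partitions the input and creates one task per chunk.
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                final List<String> chunk =
                        lines.subList(start, Math.min(start + chunkSize, lines.size()));
                tasks.add(() -> {
                    // Worker: solve the smaller problem and hand back the answer.
                    List<String> hits = new ArrayList<>();
                    for (String line : chunk) {
                        if (line.contains(needle)) {
                            hits.add(line);
                        }
                    }
                    return hits;
                });
            }
            // "Reduce" step: invokeAll waits for every task and returns the futures
            // in task order, so concatenating them preserves the input order.
            List<String> result = new ArrayList<>();
            for (Future<List<String>> partial : pool.invokeAll(tasks)) {
                result.addAll(partial.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The multi-level tree mentioned in the "Map" step is not modeled here; the sketch uses a single level of workers.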

Map-Reduce

Map:
• Accepts an input key/value pair
• Emits an intermediate key/value pair

Reduce:
• Accepts an intermediate key/value pair
• Emits an output key/value pair

[Diagram: very big data → MAP → REDUCE → result]
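The key/value contract of Map and Reduce can be written down as generic Java types. The interfaces below (Emitter, Mapper, Reducer) and the sequential driver MapReduceSketch.run are hypothetical and only illustrate the shape of the contract; they are not the Profiterole callbacks. One simplifying assumption: the reducer receives a key together with all intermediate values grouped under it, and its return value is paired with that key to form the output pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interfaces for illustration; not part of the Profiterole API.
interface Emitter<K, V> {
    void emit(K key, V value);
}

interface Mapper<K1, V1, K2, V2> {
    // Accepts one input pair and emits any number of intermediate pairs.
    void map(K1 key, V1 value, Emitter<K2, V2> out);
}

interface Reducer<K2, V2, V3> {
    // Accepts one intermediate key with all its values and returns the output value.
    V3 reduce(K2 key, List<V2> values);
}

class MapReduceSketch {

    // Sequential driver: map every input pair, group by intermediate key, reduce each group.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input, Mapper<K1, V1, K2, V2> mapper, Reducer<K2, V2, V3> reducer) {
        final Map<K2, List<V2>> intermediate = new TreeMap<>();
        Emitter<K2, V2> collector =
                (k, v) -> intermediate.computeIfAbsent(k, x -> new ArrayList<>()).add(v);
        for (Map.Entry<K1, V1> e : input.entrySet()) {
            mapper.map(e.getKey(), e.getValue(), collector);
        }
        Map<K2, V3> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : intermediate.entrySet()) {
            output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return output;
    }
}
```

Requiring the intermediate key to be Comparable lets the driver keep groups in sorted order, which plays the role of the Shuffle & Sort stage.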

Design Paradigms

A paradigm is a working convention, a way to think about a program.

The goal of a paradigm is to guide that thinking until it yields a program.

Programming Paradigms

A programming paradigm is a conceptual model for creating programs, supported by a programming language.

Paradigms differ in the concepts and abstractions used to:
• Represent the elements of a program, such as objects, functions, variables, constraints, etc.
• Represent the steps that compose a computation, such as assignment, evaluation, continuations, data flows, etc.

Programming Paradigms in Profiterole

• Map-Reduce problems are functional in nature: map and reduce are first-class citizens
• All development is done in Java, which is an object-oriented language
• A few parts are generic
• Results are table based
• A tradeoff has to be found, for example between what can be solved by generics and what can be solved by inheritance

Design Patterns

Architectural solutions needed to solve problems in context.

All in all, patterns are no more than structures for connecting classes.

But this is a mechanical definition. The real value of a pattern is that it is a structure or sub-part recognized immediately not only by whoever writes the code but also by whoever reads or uses it.

Design Patterns

• Command
• Mediator
• Strategy
• Template Method
• Observer

Problem

• Results of Map-Reduce are very difficult to process
• Processing them must be simple and generic
• Solution: add another level of indirection

Waffle Dataset

• Batched data handling is a major component
• Modeled as a hash table of keys and values
• Took ideas from Lua
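To make the "hash table of keys and values" idea concrete, here is a minimal sketch of a result table that hides a HashMap behind a small API. It is NOT the actual Waffle implementation: the class ResultTableSketch, its methods, and the "missing value" default are assumptions chosen to illustrate the extra level of indirection between raw Map-Reduce output and the code that consumes it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; NOT the actual Waffle API.
class ResultTableSketch<K extends Comparable<K>, V> {

    private final Map<K, V> rows = new HashMap<>();
    private final V missing; // returned for absent keys, so get() never returns null

    ResultTableSketch(V missing) {
        if (missing == null) {
            throw new IllegalArgumentException("missing must not be null");
        }
        this.missing = missing;
    }

    // Stores one key/value row, replacing any previous value for the key.
    void put(K key, V value) {
        if (key == null || value == null) {
            throw new IllegalArgumentException("key and value must not be null");
        }
        rows.put(key, value);
    }

    // Looks up a key; absent keys yield the configured default instead of null.
    V get(K key) {
        return rows.getOrDefault(key, missing);
    }

    // Returns a fresh, sorted copy of the keys; the table itself stays unordered.
    List<K> sortedKeys() {
        List<K> keys = new ArrayList<>(rows.keySet());
        Collections.sort(keys);
        return keys;
    }

    int size() {
        return rows.size();
    }
}
```

A word-count result could be stored with `new ResultTableSketch<String, Integer>(0)`, so that looking up a word that never occurred simply reports a count of 0.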

Coding – SDK Structure

• API – user-level APIs and callbacks
• MapReduce – implementation
• Samples – code samples with a sample implementation of the callbacks
• Tests – all the development tests
• Waffle – result set implementation

SDK Logical View(by packages)

User Level
• API
• Samples

Core
• Map Reduce Implementation
• Tests

Result
• Waffle Database

Implementation

Android APIs
• UI
• Files from SD card

Java APIs
• Concurrent libraries
• Java generics

Key Code (async thread pool)

    // Needs: import java.util.*; import java.util.concurrent.*;
    MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
    List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();

    // Fixed-size pool of worker threads
    int numThreads = 25;
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);

    // The completion service hands back results in the order the tasks finish
    CompletionService<OutputUnit> futurePool =
            new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
    Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();

    // linear addition of jobs, parallel execution
    for (TMapInput m : input) {
        futureSet.add(futurePool.submit(mapper.makeWorker(m)));
    }

    // tasks running; shutdown() only rejects new jobs, submitted jobs still complete
    pool.shutdown();
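The snippet above stops at the point where the tasks are running. A possible next step, collecting the partial results as they complete, is sketched below in self-contained form. MapCallback and OutputUnit are replaced by a stand-in worker that counts words in one chunk and returns a plain map, so the names CompletionServiceSketch, makeWorker and countAll are assumptions and not the Profiterole classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical class for illustration; stands in for the Profiterole mapper classes.
class CompletionServiceSketch {

    // Stand-in worker: counts the words of one chunk.
    static Callable<Map<String, Integer>> makeWorker(final String chunk) {
        return () -> {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : chunk.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            return counts;
        };
    }

    static Map<String, Integer> countAll(List<String> input, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CompletionService<Map<String, Integer>> futurePool =
                new ExecutorCompletionService<>(pool);

        // Linear addition of jobs, parallel execution.
        for (String chunk : input) {
            futurePool.submit(makeWorker(chunk));
        }
        // No more jobs will be added; the submitted ones keep running.
        pool.shutdown();

        Map<String, Integer> total = new HashMap<>();
        try {
            // One take() per submitted job; take() blocks until some job has finished.
            for (int i = 0; i < input.size(); i++) {
                Map<String, Integer> partial = futurePool.take().get();
                for (Map.Entry<String, Integer> e : partial.entrySet()) {
                    total.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return total;
    }
}
```

Because take() returns whichever job finishes first, merging starts before the slowest chunk is done, which is the main reason to prefer a CompletionService over iterating a list of futures in submission order.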

Testing

• Cornerstone component
• Testing types:
  • Functionality
  • Nullity and parameters
• Testing utilities such as sorting

API Practices

API Decisions
• Use referential transparency in APIs
• Follow API patterns from the Java collections
• Use Java generics
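As an illustration of these API decisions, the sketch below shows a small generic, referentially transparent helper in the style of java.util.Collections: its result depends only on its arguments, it does not modify its input, it validates parameters, and it never returns null. The class ApiPracticeSketch and the method topN are hypothetical and not taken from the Profiterole API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class for illustration; not part of the Profiterole API.
class ApiPracticeSketch {

    // Returns the n largest items in descending order as an unmodifiable list.
    // The input list is left untouched and the result is never null.
    static <T extends Comparable<T>> List<T> topN(List<T> items, int n) {
        if (items == null) {
            throw new IllegalArgumentException("items must not be null");
        }
        if (n < 0) {
            throw new IllegalArgumentException("n must be >= 0");
        }
        // Work on a copy so that the caller's list is not reordered.
        List<T> copy = new ArrayList<>(items);
        Collections.sort(copy, Collections.reverseOrder());
        List<T> top = new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
        return Collections.unmodifiableList(top);
    }
}
```

Calling topN twice with the same arguments always gives equal results, which makes such helpers easy to test and safe to call from several threads.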

Java

• Lingua franca of Android development
• General-purpose, concurrent, class-based, object-oriented language
• Android has the major Java language libraries (io, net, util, lang)
• Compiles to class format, is then transformed to dex format, and runs on the Dalvik virtual machine

Lua Backend

Provides a REPL for working with results interactively

Summary

Effective design decisions:
• Very simple API
• Never return nulls
• Checks for validity

Runs fast on mega-size databases

Interested

http://code.google.com/p/profiterole/

http://code.google.com/p/profiterole/downloads/list

THANK YOU!
