Profiterole

LESSONS LEARNT FROM DEVELOPING MAP-REDUCE FOR ANDROID [email protected] Boris Farber


Page 1: Profiterole

LESSONS LEARNT FROM DEVELOPING MAP-REDUCE FOR ANDROID

[email protected]

Boris Farber

Page 2: Profiterole

Plan

1. Introduction and Motivation
2. Problem Domain – Big Data for Android Devices
3. Solution Domain
   1. Solution Architecture
   2. Programming Paradigms
   3. Map Reduce
   4. Design Patterns Used
   5. Implementation
4. Summary and Discussion

Page 3: Profiterole

Introduction

Big Data is a new programming paradigm that supports data flow programming where traditional RDBMS and SQL-based systems fail. SQL systems fail not only to scale up but also to provide the desired functionality, for example as the backbone of Facebook, Twitter, LinkedIn, or Google.

The common data pattern for these companies is a huge amount of unstructured data.

Page 4: Profiterole

Unstructured Data vs. Structured Data

However, most data is unstructured or semi-structured: think of tweets, likes, profile updates …

SQL works with structured data, i.e. the types (mostly primitive) and the fields are known in advance, and there is little deviation from the flat table norm.

Page 5: Profiterole

Android World

Android smartphones are getting smarter
They handle more and more data
Big data patterns are moving down to the smartphone market
Current big data solutions such as Hadoop are not appropriate, because they solve the wrong problems:
  File system
  Multi-machine load balancing

Page 6: Profiterole

Constraints for Android based applications

The application sleeps most of the time and doesn't run
Impossible to have fault-tolerant file systems
Save battery and CPU power
Reuse of containers
Sharing resources: ashmem, strings pool

Page 7: Profiterole

Problem

Single-thread approaches for data handling (sort/search/analyze) are naïve:
  Getting slower
  Awkward to develop and maintain
  No multi-core/threading utilization

Page 8: Profiterole

Problem Domain Definition

The Problem Domain is the world in which the software product requirements exist. Technically speaking, it is a conceptual model which describes:
  The various entities
  Their attributes and relationships
  Scope and constraints

Page 9: Profiterole

Problem Domain Case Study

Semi-structured, text-based data. Functionalities needed:
  Word Counting
  Inverted Index: a mapping between words and the documents where the words appear
  Distributed Grep

Page 10: Profiterole

Word Count Execution

Input (one line per map task):
  1. the quick brown fox
  2. the fox ate the mouse
  3. how now brown cow

Map output:
  Map 1: the, 1; brown, 1; fox, 1; quick, 1
  Map 2: the, 1; fox, 1; the, 1; ate, 1; mouse, 1
  Map 3: how, 1; now, 1; brown, 1; cow, 1

Reduce output:
  brown, 2; fox, 2; how, 1; now, 1; the, 3
  ate, 1; cow, 1; mouse, 1; quick, 1

Stages: Input → Map → Shuffle & Sort → Reduce → Output
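The word count flow on this slide can be sketched in plain Java as follows. This is a minimal, single-threaded illustration of the Map, Shuffle & Sort, and Reduce stages; the class and method names (WordCountSketch, map, shuffle, reduce, count) are hypothetical and are not taken from the Profiterole SDK.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the Profiterole SDK.
class WordCountSketch {

    // Map step: emit one (word, 1) pair per word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort step: group the emitted values by key, with keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages in order over all input lines.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
    }
}
```

In a concurrent implementation each call to map would run on its own worker; the sketch keeps everything sequential so the stages stay visible.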

Page 11: Profiterole

Inverted Index Example

Input documents:
  hamlet.txt: to be or not to be
  12th.txt: be not afraid of greatness

Map output (word, document):
  hamlet.txt: to, hamlet.txt; be, hamlet.txt; or, hamlet.txt; not, hamlet.txt
  12th.txt: be, 12th.txt; not, 12th.txt; afraid, 12th.txt; of, 12th.txt; greatness, 12th.txt

Resulting index:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
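A minimal sketch of how such an index could be built in Java is shown here. It assumes whitespace-separated words and folds the reduce step into the same loop for brevity; InvertedIndexSketch and build are hypothetical names, not Profiterole classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the Profiterole SDK.
class InvertedIndexSketch {

    // documents: file name -> text. Returns word -> sorted set of file names.
    static SortedMap<String, SortedSet<String>> build(Map<String, String> documents) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<String, SortedSet<String>>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // Map step: every word in the document yields a (word, fileName) pair.
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Reduce step (folded in): collect the file names per word.
                SortedSet<String> files = index.get(word);
                if (files == null) {
                    files = new TreeSet<String>();
                    index.put(word, files);
                }
                files.add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("hamlet.txt", "to be or not to be");
        docs.put("12th.txt", "be not afraid of greatness");
        System.out.println(build(docs));
    }
}
```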

Page 12: Profiterole

Solution Domain

Page 13: Profiterole

Solution Domain

A conceptual model that covers the use cases of the Problem Domain. It describes:
  All entities and relationships related to the "implementation" of the problem
  Analysis and Architectural Patterns
  Design Patterns

The Solution Domain is larger than the Problem Domain, because it adds entities that are taken for granted in the Problem Domain (such as a factory).

Page 14: Profiterole

Welcome “Profiterole”

Open source Java/Android Big Data solution that implements the Map Reduce algorithm
Operates on large text files by breaking them into chunks
Template based: works not only for strings and integers but for any comparable objects
Fully concurrent
Optional Lua-based post-processing
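The chunking idea can be sketched as below. The slides do not say how Profiterole splits its input, so this sketch assumes chunks of a fixed number of lines; ChunkerSketch, chunk and chunkSize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; the real chunking strategy is not shown in the slides.
class ChunkerSketch {

    // Splits the lines into consecutive chunks of at most chunkSize lines each.
    static List<List<String>> chunk(List<String> lines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<List<String>>();
        for (int start = 0; start < lines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, lines.size());
            // Copy the view so each chunk is independent of the input list.
            chunks.add(new ArrayList<String>(lines.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Each chunk can then be handed to a separate map worker.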

Page 15: Profiterole

Example of profiterole output

Page 16: Profiterole

High Level View

Page 17: Profiterole

Map Reduce

A framework for processing highly distributable problems across huge datasets using a large number of computers, threads, or CPUs. The framework contains both Map and Reduce functions.

The motivation is the large size of the input data combined with the many machines available (which therefore need to be used effectively).

Page 18: Profiterole

Map-Reduce

1. "Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

2. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
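The multi-level tree of masters and workers described in these two steps resembles Java's fork/join model, so it can be illustrated with it. This is an analogy on threads within one process, not a description of how Profiterole is built; TreeCountSketch, THRESHOLD and countWords are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration: counts words by recursively splitting the input.
class TreeCountSketch extends RecursiveTask<Integer> {

    private static final int THRESHOLD = 2; // assumed cut-off below which a worker stops splitting
    private final List<String> lines;

    TreeCountSketch(List<String> lines) {
        this.lines = lines;
    }

    @Override
    protected Integer compute() {
        if (lines.size() <= THRESHOLD) {
            // Leaf worker: solve the small sub-problem directly.
            int words = 0;
            for (String line : lines) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    words += trimmed.split("\\s+").length;
                }
            }
            return words;
        }
        // "Map": partition the input and hand the halves to sub-workers.
        int mid = lines.size() / 2;
        TreeCountSketch left = new TreeCountSketch(lines.subList(0, mid));
        TreeCountSketch right = new TreeCountSketch(lines.subList(mid, lines.size()));
        left.fork();
        int rightResult = right.compute();
        // "Reduce": combine the answers passed back by the sub-workers.
        return left.join() + rightResult;
    }

    static int countWords(List<String> lines) {
        return new ForkJoinPool().invoke(new TreeCountSketch(lines));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
    }
}
```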

Page 19: Profiterole

Map-Reduce

Map:
  Accepts an input key/value pair
  Emits an intermediate key/value pair

Reduce:
  Accepts an intermediate key/value pair
  Emits an output key/value pair

Very big data → MAP → REDUCE → Result
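The Map and Reduce contracts on this slide can be written down as generic Java interfaces. The sketch below is an assumption about how such contracts might look, not the Profiterole API: Pair, Mapper, Reducer, wordMapper and sumReducer are hypothetical names, and the reducer is assumed to receive all values collected for one intermediate key, as is usual in Map Reduce frameworks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; not the Profiterole API.
class MapReduceContractSketch {

    // Immutable key/value pair.
    static final class Pair<K, V> {
        final K key;
        final V value;

        Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Map: one input pair in, zero or more intermediate pairs out.
    interface Mapper<K1, V1, K2, V2> {
        List<Pair<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce: one intermediate key with all its values in, one output pair out.
    interface Reducer<K2, V2, K3, V3> {
        Pair<K3, V3> reduce(K2 key, List<V2> values);
    }

    // Example mapper: (file name, line of text) -> (word, 1) for each word.
    static Mapper<String, String, String, Integer> wordMapper() {
        return new Mapper<String, String, String, Integer>() {
            @Override
            public List<Pair<String, Integer>> map(String fileName, String line) {
                List<Pair<String, Integer>> out = new ArrayList<Pair<String, Integer>>();
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.add(new Pair<String, Integer>(word, 1));
                    }
                }
                return out;
            }
        };
    }

    // Example reducer: (word, [1, 1, ...]) -> (word, total).
    static Reducer<String, Integer, String, Integer> sumReducer() {
        return new Reducer<String, Integer, String, Integer>() {
            @Override
            public Pair<String, Integer> reduce(String word, List<Integer> ones) {
                int sum = 0;
                for (int one : ones) {
                    sum += one;
                }
                return new Pair<String, Integer>(word, sum);
            }
        };
    }
}
```

Keeping the four type parameters separate lets the same contracts serve word counting, inverted indexing, and grep-like tasks with different key and value types.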

Page 20: Profiterole

Design Paradigms

A working convention, a way to think about a program.

The goal of a paradigm is to guide the thinking that leads to a program.

Page 21: Profiterole

Programming Paradigms

A Programming Paradigm is a conceptual model for creating programs, supported by a programming language.

Paradigms differ in the concepts and abstractions used to:
  Represent the elements of a program, such as objects, functions, variables, constraints, etc.
  Represent the steps that compose a computation, such as assignment, evaluation, continuations, data flows, etc.

Page 22: Profiterole

Programming Paradigms in Profiterole

Map-Reduce problems are functional in nature: map and reduce are first-class citizens
All the development is done in Java, which is an object-oriented language
A few parts are generic
Results are table based
Need to find a tradeoff, for example what can be solved by generics and what can be solved by inheritance

Page 23: Profiterole

Design Patterns

Architectural solutions needed to solve problems in context.

All in all, patterns are no more than structures for connecting classes.

But this is a mechanical definition. The real value of a pattern is that it is a structure or sub-part recognized immediately, not only by whoever writes the code but also by whoever reads or uses it.

Page 24: Profiterole

Design Patterns

Command
Mediator
Strategy
Template Method
Observer

Page 25: Profiterole

Problem

Results of map reduce are very difficult to process
Result handling must be simple and generic to use
Solution: add another level of indirection

Page 26: Profiterole

Waffle Dataset

Batched data handling is a major component
Modeled as a hash table of keys and values
Took ideas from Lua
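A hash-table-backed result set in the spirit of this slide might look like the sketch below. The actual Waffle classes are not shown in the slides, so every name here (ResultTableSketch, put, getOrDefault, sortedKeys) is hypothetical; the only ideas taken from the deck are the key/value table model and support for any comparable key type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a table-like result set; not the Waffle implementation.
class ResultTableSketch<K extends Comparable<K>, V> {

    private final Map<K, V> rows = new HashMap<K, V>();

    // Stores a value under a key, rejecting nulls so lookups stay unambiguous.
    void put(K key, V value) {
        if (key == null || value == null) {
            throw new IllegalArgumentException("key and value must not be null");
        }
        rows.put(key, value);
    }

    // Returns the stored value, or the caller's fallback when the key is absent.
    V getOrDefault(K key, V fallback) {
        V value = rows.get(key);
        return value == null ? fallback : value;
    }

    // Returns a fresh, sorted copy of the keys; the table itself is not modified.
    List<K> sortedKeys() {
        List<K> keys = new ArrayList<K>(rows.keySet());
        Collections.sort(keys);
        return keys;
    }

    int size() {
        return rows.size();
    }
}
```

A word count result could then be queried with calls such as table.getOrDefault("the", 0) without the caller touching the map reduce internals.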

Page 27: Profiterole

Coding – SDK Structure

API: user-level APIs and callbacks
MapReduce: implementation
Samples: code samples with a sample implementation of the callbacks
Tests: all the development tests
Waffle: result set implementation

Page 28: Profiterole

SDK Logical View (by packages)

User Level
• API
• Samples

Core
• Map Reduce Implementation
• Tests

Result
• Waffle Database

Page 29: Profiterole

Implementation

Android APIs
  UI
  Files from SD card

Java APIs
  Concurrent libraries
  Use Java generics

Page 30: Profiterole

Key Code (async thread pool)

    // Needs java.util.* and java.util.concurrent.*.
    // MapCallback, TMapInput and OutputUnit are defined elsewhere in the project (not shown on the slide).
    MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
    List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();

    int numThreads = 25;
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    CompletionService<OutputUnit> futurePool =
            new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
    Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();

    // linear addition of jobs, parallel execution
    for (TMapInput m : input) {
        futureSet.add(futurePool.submit(mapper.makeWorker(m)));
    }

    // stop accepting new tasks; the tasks already submitted keep running
    pool.shutdown();
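The slide stops once the jobs are submitted. One way the results could then be gathered is sketched below as a self-contained example. It swaps the project classes for a plain Callable that counts words in one chunk, so CompletionSketch, countWords and countInParallel are hypothetical stand-ins; only ExecutorService, ExecutorCompletionService and their take()/get() calls are standard java.util.concurrent APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the slide's code; not the Profiterole implementation.
class CompletionSketch {

    // Map worker body: count the words of one chunk.
    static Map<String, Integer> countWords(String chunk) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String word : chunk.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    static Map<String, Integer> countInParallel(List<String> chunks, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CompletionService<Map<String, Integer>> completion =
                new ExecutorCompletionService<Map<String, Integer>>(pool);
        try {
            // Linear addition of jobs, parallel execution.
            for (final String chunk : chunks) {
                completion.submit(new Callable<Map<String, Integer>>() {
                    @Override
                    public Map<String, Integer> call() {
                        return countWords(chunk);
                    }
                });
            }
            // Reduce: take() hands back each finished job in completion order.
            Map<String, Integer> total = new TreeMap<String, Integer>();
            for (int i = 0; i < chunks.size(); i++) {
                Map<String, Integer> partial = completion.take().get();
                for (Map.Entry<String, Integer> e : partial.entrySet()) {
                    Integer old = total.get(e.getKey());
                    total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
                }
            }
            return total;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("a worker failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow"), 3));
    }
}
```

Because the completion service returns results as they finish, the merge loop never waits on a slow job while faster ones sit idle.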

Page 31: Profiterole

Testing

Cornerstone component
Testing types:
  Functionality
  Nullity and parameters
Testing utilities such as sorting

Page 32: Profiterole

API Practices

API decisions:
  Use referential transparency in APIs
  API patterns from the Java collections
  Use Java generics
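These API decisions can be illustrated with a small generic method. The sketch is an assumption about what such an API might look like, not code from the SDK: ApiSketch and mostFrequent are hypothetical names. The method is referentially transparent (the same input always gives the same output and the input map is not modified), uses generics for the key type, and validates its arguments, returning an empty list instead of null in line with the "never return nulls" rule from the deck's summary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the API practices; not part of the Profiterole SDK.
class ApiSketch {

    // Returns up to 'limit' keys ordered by descending count, ties broken by key order.
    static <K extends Comparable<K>> List<K> mostFrequent(Map<K, Integer> counts, int limit) {
        if (counts == null) {
            throw new IllegalArgumentException("counts must not be null");
        }
        if (limit < 0) {
            throw new IllegalArgumentException("limit must not be negative");
        }
        // Work on a copy so the caller's map is left untouched.
        List<Map.Entry<K, Integer>> entries = new ArrayList<Map.Entry<K, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, Integer>>() {
            @Override
            public int compare(Map.Entry<K, Integer> a, Map.Entry<K, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<K> keys = new ArrayList<K>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            keys.add(entries.get(i).getKey());
        }
        // Empty input yields an empty list, never null.
        return Collections.unmodifiableList(keys);
    }
}
```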

Page 33: Profiterole

Java

Lingua franca of Android development
General-purpose, concurrent, class-based, object-oriented language
Android has the major Java language libraries (io, net, util, lang)
Compiles to class format, is then transformed to dex format, and runs on the Dalvik virtual machine

Page 34: Profiterole

Lua Backend

Provides a REPL for working with the results interactively

Page 35: Profiterole

Summary

Effective design decisions:
  Very simple API
  Never return nulls
  Checks for validity

Runs fast on mega-size databases

Page 36: Profiterole

Interested?

http://code.google.com/p/profiterole/
http://code.google.com/p/profiterole/downloads/list