Profiterole
LESSONS LEARNT FROM DEVELOPING MAP-REDUCE FOR ANDROID
Boris Farber
Plan
1. Introduction and Motivation
2. Problem Domain – Big Data for Android Devices
3. Solution Domain
   1. Solution Architecture
   2. Programming Paradigms
   3. Map Reduce
   4. Design Patterns Used
   5. Implementation
4. Summary and Discussion
Introduction
Big Data is a new programming paradigm that supports data-flow programming where traditional RDBMS and SQL-based systems fail. SQL systems fail not only to scale up but also to provide the desired functionality. Examples include the backbones of Facebook, Twitter, LinkedIn, and Google. The common data pattern for these companies is a huge amount of unstructured data.
Unstructured Data vs. Structured Data
SQL handles structured data: the types (mostly primitive) and the fields are known in advance, and there is little deviation from the flat-table norm.
However, most data is unstructured or semi-structured: think of tweets, likes, profile updates …
Android World
• Android smartphones are getting smarter
• They handle more and more data
• Big data patterns are trickling down to the smartphone market
• Current big data solutions such as Hadoop are not appropriate, because they solve the wrong problem:
  • File system
  • Multi-machine load balancing
Constraints for Android based applications
• The application sleeps most of the time and does not run
• It is impossible to have fault-tolerant file systems
• Save battery and CPU power
• Reuse of containers
• Sharing resources: ashmem, string pool
Problem
Single-threaded approaches to data handling (sort/search/analyze) are naïve:
• Getting slower
• Awkward to develop and maintain
• No multi-core/threading utilization
Problem Domain Definition
The Problem Domain is the world in which the software product requirements exist. Technically speaking, it is a conceptual model that describes:
• Various entities
• Attributes and relationships
• Scope and constraints
Problem Domain Case Study
Semi-structured, text-based data. Functionalities needed:
• Word Counting
• Inverted Index: a mapping between words and the documents where the words appear
• Distributed Grep
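Word counting and the inverted index are illustrated on the following slides; distributed grep is not, so here is a minimal sketch of how it fits the same model. It assumes the input has already been broken into text chunks, that the map step emits every line matching a regular expression, and that the reduce step simply concatenates the per-chunk matches. The class and method names (`DistributedGrepSketch`, `map`, `grep`) are hypothetical and are not taken from Profiterole.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of grep expressed as map + reduce; not Profiterole's API.
class DistributedGrepSketch {

    // Map: emit every line of one chunk that contains a match for the pattern.
    static List<String> map(String chunk, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        for (String line : chunk.split("\n")) {
            if (pattern.matcher(line).find()) {
                matches.add(line);
            }
        }
        return matches;
    }

    // Reduce: identity step that concatenates the per-chunk matches in chunk order.
    // The chunks are processed sequentially here; each map call is independent,
    // so they could equally be handed to worker threads.
    static List<String> grep(List<String> chunks, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String chunk : chunks) {
            result.addAll(map(chunk, pattern));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("the quick\nbrown fox", "how now\nbrown cow");
        System.out.println(grep(chunks, "brown")); // [brown fox, brown cow]
    }
}
```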
Word Count Execution
Input → Map → Shuffle & Sort → Reduce → Output

Input (one split per mapper):
  1. the quick brown fox
  2. the fox ate the mouse
  3. how now brown cow

Map output:
  1. the, 1; brown, 1; fox, 1; quick, 1
  2. the, 1; fox, 1; the, 1; ate, 1; mouse, 1
  3. how, 1; now, 1; brown, 1; cow, 1

Reduce output:
  brown, 2; fox, 2; how, 1; now, 1; the, 3
  ate, 1; cow, 1; mouse, 1; quick, 1
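The word count flow above can be sketched in a few lines of Java. This is a single-threaded illustration under stated assumptions, not Profiterole's implementation: the names `WordCountSketch`, `map`, `shuffle`, `reduce`, and `wordCount` are hypothetical, words are split on whitespace, and a `TreeMap` stands in for the sort step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the word count flow; not Profiterole's API.
class WordCountSketch {

    // Map: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort: group the emitted values by key, in sorted key order.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map on every split, shuffle, then reduce every key.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutputs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
        // {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=3}
    }
}
```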
Inverted Index Example
Input:
  hamlet.txt: to be or not to be
  12th.txt: be not afraid of greatness

Map output:
  hamlet.txt: to, hamlet.txt; be, hamlet.txt; or, hamlet.txt; not, hamlet.txt
  12th.txt: be, 12th.txt; not, 12th.txt; afraid, 12th.txt; of, 12th.txt; greatness, 12th.txt

Output:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
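A compact way to reproduce the inverted index above is sketched below. It is an assumption-laden illustration: the class name `InvertedIndexSketch` and the method `build` are hypothetical, documents are passed in as a file-name-to-text map rather than read from storage, and the map and reduce steps are folded into one loop for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index; not Profiterole's API.
class InvertedIndexSketch {

    // documents: file name -> file text. Returns word -> sorted set of file names.
    static Map<String, Set<String>> build(Map<String, String> documents) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // Map: every word in the document yields a (word, fileName) pair.
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Reduce: merge file names per word; the set drops duplicates
                // such as the second "to" and "be" in hamlet.txt.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("hamlet.txt", "to be or not to be");
        docs.put("12th.txt", "be not afraid of greatness");
        for (Map.Entry<String, Set<String>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
    }
}
```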
Solution Domain
A conceptual model that covers the use cases of the Problem Domain. It describes:
• All entities and relationships related to the “implementation” of the problem
• Analysis and architectural patterns
• Design patterns
The Solution Domain is larger than the Problem Domain, because it adds entities that the Problem Domain takes for granted (such as a factory).
Welcome “Profiterole”
• Open source Java/Android Big Data solution that implements the Map Reduce algorithm
• Operates on large text files by breaking them into chunks
• Template based: works not only for strings and integers but for any comparable objects
• Fully concurrent
• Optional Lua-based post-processing
Example of profiterole output
High Level View
Map Reduce
Map Reduce is a framework for processing highly distributable problems across huge datasets using a large number of computers, threads, or CPUs. The framework contains both Map and Reduce functions.
The motivation is the large size of the input data combined with the many machines available, which need to be used effectively.
Map-Reduce
1. "Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
2. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
Map-Reduce
Map:
• Accepts an input key/value pair
• Emits an intermediate key/value pair
Reduce:
• Accepts an intermediate key/value pair
• Emits an output key/value pair
Very big data → MAP → REDUCE → Result
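The key/value contract above can be written down as a pair of generic Java interfaces plus a small sequential driver. This is only a sketch of the idea: `MapReduceContract`, `Mapper`, `Reducer`, and `run` are hypothetical names chosen for illustration, and the reducer here receives all intermediate values for one key at once, which is one common reading of "accepts intermediate key/value pair".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical typed view of the map/reduce contract; not Profiterole's API.
class MapReduceContract {

    // Map: accepts an input key/value pair, emits intermediate key/value pairs.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    // Reduce: accepts an intermediate key with its values, returns the output value.
    interface Reducer<K2, V2, V3> {
        V3 reduce(K2 key, List<V2> values);
    }

    // Sequential driver: map every input pair, group by intermediate key, reduce.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input, Mapper<K1, V1, K2, V2> mapper, Reducer<K2, V2, V3> reducer) {
        Map<K2, List<V2>> intermediate = new TreeMap<>();
        for (Map.Entry<K1, V1> e : input.entrySet()) {
            mapper.map(e.getKey(), e.getValue(),
                    (k, v) -> intermediate.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<K2, V3> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : intermediate.entrySet()) {
            output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("a.txt", "how now brown cow");
        // Count words by length: map emits (length, 1), reduce counts the ones.
        Map<Integer, Integer> byLength =
                MapReduceContract.<String, String, Integer, Integer, Integer>run(
                        files,
                        (name, text, emit) -> {
                            for (String w : text.split("\\s+")) {
                                emit.accept(w.length(), 1);
                            }
                        },
                        (length, ones) -> ones.size());
        System.out.println(byLength); // {3=3, 5=1}
    }
}
```

Keeping the key and value types as type parameters is what lets one driver serve different jobs; only the two callbacks change.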
Design Paradigms
A working convention: a way to think about a program.
The goal of a paradigm is to guide the thinking that leads to a program.
Programming Paradigms
A programming paradigm is a conceptual model for creating programs, supported by a programming language.
Paradigms differ in the concepts and abstractions used to:
• Represent the elements of a program, such as objects, functions, variables, constraints, etc.
• Represent the steps that compose a computation, such as assignment, evaluation, continuations, data flows, etc.
Programming Paradigms in Profiterole
• Map-Reduce problems are functional in nature: map and reduce are first-class citizens
• All the development is done in Java, which is an object-oriented language
• A few parts are generic
• Results are table based
• Need to find a tradeoff, for example what can be solved by generics and what can be solved by inheritance
Design Patterns
Architectural solutions needed to solve problems in context.
All in all, patterns are no more than structures for connecting classes.
But this is a mechanical definition. The real value of a pattern is that it is a structure or sub-part recognized immediately, not only by whoever writes the code but also by whoever reads or uses it.
Design Patterns
• Command
• Mediator
• Strategy
• Template Method
• Observer
Problem
• Results of map reduce are very difficult to process
• Processing them must be simple and generic
• Solution: add another level of indirection
Waffle Dataset
• Batched data handling is a major component
• Modeled as a hash table of keys and values
• Took ideas from Lua
Coding – SDK Structure
• API: user-level APIs and callbacks
• MapReduce: implementation
• Samples: code samples with callbacks, sample implementation
• Tests: all the development tests
• Waffle: result set implementation
SDK Logical View(by packages)
User Level
• API
• Samples
Core
• Map Reduce Implementation
• Tests
Result
• Waffle Database
Implementation
Android APIs
• UI
• Files from the SD card
Java APIs
• Concurrent libraries
• Java generics
Key Code (async thread pool)
    import java.util.*;
    import java.util.concurrent.*;

    MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
    List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();

    int numThreads = 25;
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    CompletionService<OutputUnit> futurePool =
            new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
    Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();

    // Jobs are added one by one but execute in parallel on the pool.
    for (TMapInput m : input) {
        futureSet.add(futurePool.submit(mapper.makeWorker(m)));
    }

    // Stop accepting new jobs; the jobs already submitted keep running.
    pool.shutdown();
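The snippet above stops once the jobs are submitted. A self-contained sketch of the full round trip, including collecting results as they complete, might look like the following. It uses only standard `java.util.concurrent` classes; `CompletionServiceSketch`, `countWords`, and `mapChunk` are hypothetical stand-ins for the project's `MapCallback` workers, and word counting is assumed as the job.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: parallel map with a CompletionService, then a merging reduce.
class CompletionServiceSketch {

    static Map<String, Integer> countWords(List<String> chunks, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CompletionService<Map<String, Integer>> completion =
                new ExecutorCompletionService<>(pool);
        try {
            // Submit one map job per chunk; the jobs run in parallel on the pool.
            for (final String chunk : chunks) {
                completion.submit(() -> mapChunk(chunk));
            }
            // take() blocks until the next job finishes, in completion order.
            Map<String, Integer> total = new TreeMap<>();
            for (int i = 0; i < chunks.size(); i++) {
                Map<String, Integer> partial = completion.take().get();
                for (Map.Entry<String, Integer> e : partial.entrySet()) {
                    total.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for map jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map job failed", e.getCause());
        } finally {
            // Stop accepting new jobs and let the worker threads exit.
            pool.shutdown();
        }
    }

    // Map job: count the words of a single chunk.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow"), 3));
    }
}
```

The thread count is a parameter here instead of a fixed 25; sizing it from `Runtime.getRuntime().availableProcessors()` is one option to consider on a phone, though the right value depends on the workload.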
Testing
• Cornerstone component
• Testing types
  • Functionality
  • Nullity and parameters
• Testing utilities such as sorting
API Practices
API Decisions
• Use referential transparency in APIs
• API patterns from Java collections
• Use Java generics
Java
• Lingua franca of Android development
• General-purpose, concurrent, class-based, object-oriented language
• Android has the major Java language libraries (io, net, util, lang)
• Compiles to class format, is then transformed to dex format, and runs on the Dalvik virtual machine
Lua Backend
Provides a REPL for working with results interactively
Summary
Effective Design Decisions
• Very simple API
• Never return nulls
• Checks for validity
• Runs fast on mega-size databases
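The "never return nulls" and "checks for validity" decisions can be illustrated with a small query method over a word count result. This is a hypothetical example, not code from Profiterole: `NullFreeApiSketch` and `topWords` are invented names, and the result is assumed to be a plain word-to-count map with no null values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a null-free, validating API method; not Profiterole's API.
class NullFreeApiSketch {

    // Returns up to `limit` words ordered by descending count, ties alphabetically.
    // Invalid arguments are rejected up front; "no results" is an empty list, never null.
    static List<String> topWords(Map<String, Integer> counts, int limit) {
        if (counts == null) {
            throw new IllegalArgumentException("counts must not be null");
        }
        if (limit < 0) {
            throw new IllegalArgumentException("limit must not be negative");
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return Collections.unmodifiableList(top);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("fox", 2);
        counts.put("brown", 2);
        counts.put("cow", 1);
        System.out.println(topWords(counts, 2)); // [the, brown]
    }
}
```

Callers can iterate over the returned list without a null check, which keeps user code as simple as the API itself.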
Interested
http://code.google.com/p/profiterole/
http://code.google.com/p/profiterole/downloads/list
THANK YOU!