An Insight into Map Reduce and related technology
Renjith Peediackal, 09BM8040
• ET Brand Equity of 9th March explains the future of analytics
• Some of us will be champions of analytics within our respective organizations
• Some of us will be selling analytics products
• Some will have to talk to analytics professionals and understand the latest jargon
• And analytics is moving to churn web data to give us more insights. So we move to MR and data in flight
• We are IITians!
Importance of understanding MR
The case for Map Reduce
Recommendation System
• Customer Y buys product X5 from an e-commerce site after going through a number of products X1, X2, X3, X4
• Student Y goes through sites A1, A2, A3 and finally settles down and reads the content from A5
• Thousands of people behave in the same way.
• Can we drive more traffic to our site, or design a new site, based on the insight derived from the above pattern?
A lot more questions
• Based on an ET interview of Avinash Kaushik, analytics expert:
What pages are my customers reading?
A lot more questions contd..
• What kind of content do I need to develop on my site so as to attract the right set of people?
• On what kind of sites should your URL be present so that you get the maximum number of referrals?
• How many of them quit after seeing the homepage?
• What different kinds of design are possible to make them go forward?
• Are the users clicking on the right links in the right fashion on your website? (Site overlay)
• What is the bounce rate?
• How to save money on PPC schemes?
And the typical problems with recommendation systems
Problems with popularity
• Customers need not be satisfied perpetually by the same products
• A popularity-based system ruins these possibilities of exploration!
• Companies have to create niche products and up-sell and cross-sell them to customers – to satisfy them, retain them, and thus be successful in the market. The opportunity of selling a product is lost!
• Lack of personalization leads to broken relations
• Think beyond POS data!!
Mixing expert opinion
• To avoid popularity bias and to have more meaningful recommendations, mix in expert opinion
• A mix of art with science – nobody knows the right blend
• Think beyond POS data and experts' wisdom
Pearls of wisdom in the net
But internet data is unfriendly
• To statistical techniques and DBMS technology
– Dynamic
– Sparse
– Unstructured
• Growth of data
– Published content: 3-4 GB/day
– Professional web content: 2 GB/day
– User generated content: 5-10 GB/day
– Private text content: ~2 TB/day (200x more)
(Ref: Raghu Ramakrishnan, http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf)
• Questions to this data
– Can we do analytics over web data / user generated content?
– TB of text data / GB of new data each day?
– Structured queries, search queries?
– At "Google speed"?
The case for a new technique
• That gives us a strong case for adopting the new technology of data in flight.
• 'Map Reduce' is a technology developed by Google for similar purposes.
What is Data in flight?
• Earlier, data was at 'rest'!
– The normal concept of DBMS, where data is at rest and queries hit that static data and fetch results
• Now data is just flying in!
– The new concept of 'data in flight' envisages the already prepared query as static, collecting dynamic data as and when it is produced and consumed.
– Systems are needed to handle such data.
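The 'data in flight' idea can be sketched in a few lines of Python (a hypothetical illustration, not from the original deck): the query is fixed up front, and results are emitted as matching records fly in.

```python
def standing_query(stream, predicate):
    """A 'data in flight' query: the query is static,
    while the data arrives dynamically."""
    for record in stream:
        if predicate(record):
            yield record

# Hypothetical click-log stream; in practice this would be a live feed.
clicks = [
    {"user": "Y", "page": "X1"},
    {"user": "Y", "page": "X5"},
    {"user": "Z", "page": "X5"},
]

# The static, already prepared query: who viewed product X5?
hits = list(standing_query(iter(clicks), lambda r: r["page"] == "X5"))
```

Contrast with DBMS thinking: there the data sits still and each new query scans it; here the predicate sits still and each new record flows past it.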
Map and reduce
• A map operation is needed to translate the scarce information available in numerous formats to some forms which can be processed easily by an analytical technique.
• Once the information is in simpler and structured form, it can be reduced to the required results.
Terminology explained..
• A standard example:
– Word count!
• Given a document, how many of each word are there?
• But in the real world it can be:
– Given our search logs, how many people click on result 1?
– Given our Flickr photos, how many cat photos are there by users in each geographic region?
– Given our web crawl, what are the 10 most popular words?
How does a MapReduce programme work?
Programmer has to specify two methods: Map and Reduce
map (k, v) -> <k', v'>*
• Specify a map function that takes a key (k) / value (v) pair.
– key = document URL, value = document contents
– "document1", "to be or not to be"
• Output of map is (potentially many) key/value pairs: <k', v'>*
• In our case, output (word, "1") once per word in the document
– "to", "1"
– "be", "1"
– "or", "1"
– "to", "1"
– "not", "1"
– "be", "1"
Shuffle or sort
• (shuffle/sort)
– "to", "1"
– "to", "1"
– "be", "1"
– "be", "1"
– "not", "1"
– "or", "1"
reduce (k', <v'>*) -> <k', v'>*
• The reduce function combines the values for a key
– "be", "2"
– "not", "1"
– "or", "1"
– "to", "2"
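The word-count walkthrough above can be sketched end to end in Python. This is a minimal single-machine illustration; real MapReduce distributes each phase across many machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: emit (word, 1) once per word in the document."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: combine all counts emitted for one word."""
    yield (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: run the mapper over every (key, value) input pair.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle/sort phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: combine each key's values.
    results = []
    for k in sorted(groups):
        results.extend(reduce_fn(k, groups[k]))
    return results

counts = map_reduce([("document1", "to be or not to be")], map_fn, reduce_fn)
# counts == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```

The driver function is the fixed platform; only `map_fn` and `reduce_fn` change per use case, which is exactly the point made in the next bullet.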
• For different use cases the functions within map and reduce differ, but the architecture and the supporting platform remain the same
How is this new way helpful for our recommendation system?
• Brute power
– Uses the brute power of many machines to map the huge chunk of sparse data into a small table of dense data
– The complex and time-consuming part of the "task" is done on the new, small and dense data in the reduce part
– This means it separates the huge data from the time-consuming part of the algorithm, albeit a lot of disk space is utilized.
Maps into a denser, smaller table
Fault tolerance, two different types – the database school of thought
Fault tolerance, two different types – the MR school of thought
Hierarchy of Parallelism: Cycle of brute force fault tolerance
Criticisms
• A giant step backward in the programming paradigm for large-scale data-intensive applications
• A sub-optimal implementation, in that it uses brute force instead of indexing
• Not novel at all; it represents a specific implementation of well-known techniques developed 25 years ago
• Missing most features in current DBMSs
• Incompatible with all of the tools DBMS users have come to depend on
Why is it still valuable?
• Permanent writing magically enables two wonderful features
– It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done.
– It brings dynamism and load balancing, needed since we don't know the nature of the data. It helps the programmers to logically manage the complexity of the data.
Why can’t parallel DB deliver the same?
• At large scales, super-fancy reliable hardware still fails, albeit less often. Brute-force fault tolerance is more practical.
• Software still needs to be fault-tolerant
• Commodity machines without fancy hardware give better perf/$
• Using more memory to speed up querying has its own implications for tolerance and cost
• An execution-plan-based system does not work with dynamic, sparse and unstructured data
An example to invite you to the complexity: the sequential web access-based recommendation system
• It goes through web server logs, mines the patterns in the sequences and then creates a pattern tree. The pattern tree is continuously modified taking the data from different servers. [Zhou et al]
Recommendation
• And when a particular user has to be catered with a suggestion:
– his access pattern tree is compared with the entire tree of patterns,
– the most suitable portions of the tree in comparison with the user's pattern are selected, and
– its branches are suggested.
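As a toy illustration of the matching step (hypothetical; Zhou et al.'s actual tree structure and matching are more elaborate), walk the pattern tree along the user's recent accesses and suggest the branches found at the deepest matching node:

```python
# A toy pattern tree of frequent access sequences: each key is an
# access event, each value is the subtree of likely next events.
# The structure and contents are illustrative, not from the paper.
pattern_tree = {
    "a": {"a": {"c": {}}, "b": {"a": {"c": {}}, "c": {}}},
    "b": {"a": {"c": {}}, "c": {}},
}

def suggest(tree, user_sequence):
    """Walk the tree along the user's access sequence and suggest
    the branches (children) of the deepest matching node."""
    node = tree
    for event in user_sequence:
        if event not in node:
            break
        node = node[event]
    return sorted(node)

# A user who has accessed a then b lands on the "ab" node,
# whose branches are the candidate recommendations.
suggestions = suggest(pattern_tree, "ab")
```

A real system would weight branches by support and handle partial matches; this sketch only shows the tree-walk idea.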
Some details
• Let E be a set of unique access events, which represent web resources accessed by users, i.e. web pages, URLs, topics or categories
• A web access sequence S = e1e2... is an ordered collection (sequence) of access events
• Suppose we have a set of web access sequences over the event set E = {a, b, c, d, e, f}; a sample database will look like:
Session ID   Web access sequence
1            abdac
2            eaebcac
3            babfae
4            abbacfc
Details
Length of sequence   Sequential web access patterns with support
1                    a:4, b:4, c:3
2                    aa:4, ab:4, ac:3, ba:4, bc:3
3                    aac:3, aba:4, abc:3, bac:3
4                    abac:3
Access events can be classified into frequent and infrequent based on whether their frequency crosses a threshold level,
and a tree consisting of the frequent access events can be created.
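The support counts in the table above can be checked with a short Python sketch. This is illustrative only; the names `contains` and `support` are chosen here, not taken from the paper. A pattern's support is the number of sessions whose sequence contains it as an ordered (not necessarily contiguous) subsequence.

```python
# Sample database from the slides (session id -> access sequence).
sessions = {1: "abdac", 2: "eaebcac", 3: "babfae", 4: "abbacfc"}

def contains(seq, pattern):
    """True if pattern occurs in seq as an ordered
    (not necessarily contiguous) subsequence of events."""
    it = iter(seq)
    return all(event in it for event in pattern)

def support(pattern):
    """Number of sessions whose sequence contains the pattern."""
    return sum(contains(s, pattern) for s in sessions.values())

# Matches the table: support("a") == 4, support("ac") == 3,
# support("aba") == 4, support("abac") == 3
```

With a frequency threshold of 3, events a, b and c are frequent, while d, e and f are not; only the frequent events enter the pattern tree.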
The Map and reduce
• So a map job can be designed to process the logs and create the pattern tree.
• The task is divided among thousands of cheap machines using the MapReduce platform.
• The dynamic-data, static-query model of data in flight will be very helpful in modifying the main tree
• The tree structure can be efficiently stored by altering the physical storage through sorting and partitioning.
• Then, based on the user's access pattern, we have to select a few parts of the tree. This can be designed as a reduce job which runs across the tree data.
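A hedged sketch of how the support-counting part of such a job might look, using the sample sessions from the earlier slides. This is a single-machine stand-in for the cluster; pattern-tree construction itself is omitted, and the constants and function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

MIN_SUPPORT = 3   # frequency threshold from the slides
MAX_LEN = 4       # longest pattern considered

def map_session(session_id, sequence):
    """Map: emit (pattern, 1) for each distinct ordered subsequence
    (up to MAX_LEN events) occurring in one session's log."""
    patterns = set()
    for length in range(1, MAX_LEN + 1):
        for idxs in combinations(range(len(sequence)), length):
            patterns.add("".join(sequence[i] for i in idxs))
    for p in patterns:
        yield (p, 1)

def reduce_pattern(pattern, counts):
    """Reduce: total support across sessions; keep frequent patterns."""
    total = sum(counts)
    if total >= MIN_SUPPORT:
        yield (pattern, total)

# Driver: a local stand-in for the shuffle across thousands of machines.
groups = defaultdict(list)
for sid, seq in {1: "abdac", 2: "eaebcac", 3: "babfae", 4: "abbacfc"}.items():
    for k, v in map_session(sid, seq):
        groups[k].append(v)

frequent = dict(kv for k, vs in groups.items() for kv in reduce_pattern(k, vs))
```

Each mapper sees only its own shard of logs, and each reducer sees only one pattern's counts, which is what lets the work spread over cheap machines.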
DBMS for the same case?
• Map
– A huge database of access logs should be uploaded to a DB, and it should then be updated at regular intervals to reflect the changes in site usage.
– Then a query has to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously!
– An execution plan, which is simplistic and non-dynamic in nature, has to be made. Ineffective.
– It should be divided among many parallel engines
– And this requires expertise in parallel programming.
• Reduce
– During the reduce phase the entire tree has to be searched for the existence of resembling patterns.
– This also will be ineffective in an execution-plan-driven model, as explained above.
• And with the explosion of data, and the increased need for personalization in recommendations, MapReduce becomes the most suitable pattern.
Parallel DB vs MapReduce
• RDBMS is good when
– the application is query-intensive,
– whether semi-structured or rigidly structured
• MR is effective for
– ETL and "read once" data sets
– Complex analytics
– Semi-structured and unstructured data
– Quick-and-dirty analyses
– Limited-budget operations
Summary of advantages of MR
• Storage system independence
• Automatic parallelization
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Robustness
• Improvements to the core library benefit all users of the library!
• Ease for programmers!
Is MapReduce the final word?
What is Hadoop
• Based on the MapReduce paradigm, the Apache Foundation has given rise to a programme for developing tools and techniques on an open source platform.
• This programme and the resultant technology is termed Hadoop
Pig
• Can we use MR for repetitive jobs effectively?
• How can one control the execution of the Hadoop program, just like creating an execution plan in a normal DB operation?
• The answer leads to Pig. Pig allows one to control the flow of data by creating execution plans easily.
• Suitable when the tasks are repetitive and the plans can be envisaged early on.
What does hive do?
• Users of databases are not often technology masters.
• They might be familiar with existing platforms, and these platforms tend to generate SQL-like queries.
• We need a program to convert these traditional SQL queries into MapReduce jobs.
• And the one created by the Hadoop movement is Hive.
Hive architecture
Many more tools But