Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce. Csaba Toth, GDG Fresno Meeting. Date: February 6th, 2014. Location: The Hashtag, Fresno

description

This talk was given at a GDG Fresno meeting. The demo used Google Compute Engine and Google Cloud Storage. The actual talk differed from the slides: the audience asked a lot of good questions, and the discussion diverted to side topics many times.

Transcript of Introduction to Hadoop and MapReduce

Page 1: Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce

Csaba Toth
GDG Fresno Meeting

Date: February 6th, 2014
Location: The Hashtag, Fresno

Page 2: Introduction to Hadoop and MapReduce

Agenda

• Big Data
• A little history
• Hadoop
• MapReduce
• Demo: Hadoop with Google Compute Engine and Google Cloud Storage

Page 3: Introduction to Hadoop and MapReduce

Big Data

• Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
• Examples (Wikibon - A Comprehensive List of Big Data Statistics):
– 100 terabytes of data are uploaded to Facebook every day
– Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
– Twitter generates 12 terabytes of data every day
– LinkedIn processes and mines petabytes of user data to power the "People You May Know" feature
– YouTube users upload 48 hours of new video content every minute of the day
– Decoding the human genome used to take 10 years. Now it can be done in 7 days

Page 4: Introduction to Hadoop and MapReduce

Big Data characteristics

• Three Vs: Volume, Velocity, Variety
• Sources:
– Science, sensors, social networks, log files
– Public data stores, data warehouse appliances
– Network and in-stream monitoring technologies
– Legacy documents
• Main problems:
– Storage problem
– Money problem
– Consuming and processing the data

Page 5: Introduction to Hadoop and MapReduce

A Little History

Two seminal papers:

• “The Google File System” - October 2003, http://labs.google.com/papers/gfs.html
– Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, which runs on inexpensive commodity hardware and delivers high aggregate performance
• “MapReduce: Simplified Data Processing on Large Clusters” - April 2004, http://queue.acm.org/detail.cfm?id=988408
– Describes a programming model and an implementation for processing large data sets:
1. A map function that processes a key/value pair to generate a set of intermediate key/value pairs
2. A reduce function that merges all intermediate values associated with the same intermediate key
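The two functions described above can be sketched in a few lines of Python. This is only a single-process illustration of the programming model, not how Hadoop or Google's implementation works internally: the driver `run_mapreduce`, the sensor-reading job, and the sample data are all hypothetical names invented for this example.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every input pair, group the
    intermediate pairs by key, then reduce each group to one value."""
    groups = defaultdict(list)
    for key, value in inputs:
        # map: one input key/value pair -> zero or more intermediate pairs
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # reduce: merge all intermediate values that share the same key
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


# Illustrative job (made-up data): highest reading per sensor.
def map_reading(line_no, line):
    sensor, reading = line.split(",")
    yield sensor, float(reading)


def reduce_max(sensor, readings):
    return max(readings)


lines = [(0, "a,1.5"), (1, "b,4.0"), (2, "a,3.0")]
result = run_mapreduce(lines, map_reading, reduce_max)
print(result)  # {'a': 3.0, 'b': 4.0}
```

The user of the model only writes `map_reading` and `reduce_max`; everything else (grouping, and in a real cluster also distribution and fault tolerance) belongs to the framework.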

Page 6: Introduction to Hadoop and MapReduce

Hadoop

• Hadoop is an open-source software framework that supports data-intensive distributed applications.

• It is written in Java and runs on JVMs
• Named after a toy elephant belonging to the son of its creator (Doug Cutting, Yahoo)
• Hadoop manages a cluster of commodity hardware computers. The cluster is composed of a single master node and multiple worker nodes

Page 7: Introduction to Hadoop and MapReduce

Hadoop vs RDBMS

Property | Hadoop / MapReduce | RDBMS
Size of data | Petabytes | Gigabytes
Integrity of data | Low | High (referential, typed)
Data schema | Dynamic | Static
Access method | Batch | Interactive and batch
Scaling | Linear | Nonlinear (worse than linear)
Data structure | Unstructured | Structured
Normalization of data | Not required | Required
Query response time | Has latency (due to batch processing) | Can be near immediate

Page 8: Introduction to Hadoop and MapReduce

MapReduce

• Hadoop leverages the programming model of map/reduce. It is optimized for processing large data sets.

• MapReduce is an essential technique to do distributed computing on clusters of computers/nodes.

• The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel.

• Hadoop leverages a distributed file system to store the data on various nodes.

Page 9: Introduction to Hadoop and MapReduce

MapReduce

• It is about two functions: map and reduce
1. Map step:
– It divides the problem into smaller sub-problems. A master node has the job of distributing the work to worker nodes. Each worker node does just one thing and returns its result to the master node.
2. Reduce step:
– Once the master gets the results from the worker nodes, the reduce step takes over and combines them. Combining the partial results forms the answer and, ultimately, the output.

Page 10: Introduction to Hadoop and MapReduce

MapReduce – Map step

• There is a master node and many slave nodes.
• The master node takes the input, divides it into smaller sub-problems, and distributes them to worker (slave) nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
• Each worker node processes its smaller problem and passes the answer back to its master node.
• Each mapping operation is independent of the others, so all maps can be performed in parallel.
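Because the map tasks share no state, they can be handed to a pool of workers. The sketch below imitates this on a single machine with a thread pool standing in for worker nodes. The helper names (`split`, `map_chunk`) and the sample log lines are assumptions made for illustration, and real Hadoop splits input by file blocks rather than by list slices.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Divide items into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(chunk):
    """One map task: emit (status_code, 1) for every log line in the chunk."""
    return [(line.split()[-1], 1) for line in chunk]


# Made-up access-log lines: method, path, HTTP status.
log_lines = ["GET /a 200", "GET /b 404", "GET /a 200", "POST /c 500", "GET /d 200"]

chunks = split(log_lines, 3)          # the "master" divides the input
with ThreadPoolExecutor(max_workers=3) as pool:
    # each chunk is mapped independently; results come back in chunk order
    mapped = list(pool.map(map_chunk, chunks))

print(mapped)
```

No map task reads another task's chunk or output, which is what makes it safe to run them all at once.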

Page 11: Introduction to Hadoop and MapReduce

MapReduce – Reduce step

• The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.

• Reducers can also perform the reduction phase in parallel. That is how the system can process petabytes in a matter of hours.
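Parallel reduction works when every key is sent to exactly one reducer, so that the reducers never need to coordinate. A minimal sketch, assuming a hash-based assignment of keys to reducers (the function names and data are hypothetical, and `zlib.crc32` is used here only because it gives the same value on every run):

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(key, n_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def reduce_partition(pairs):
    """One reduce task: sum the values of each key it was assigned."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up intermediate pairs, as a map phase might have produced them.
pairs = [("200", 1), ("404", 1), ("200", 1), ("500", 1), ("200", 1)]
n_reducers = 2

buckets = [[] for _ in range(n_reducers)]
for key, value in pairs:
    buckets[partition(key, n_reducers)].append((key, value))

with ThreadPoolExecutor(max_workers=n_reducers) as pool:
    partial = list(pool.map(reduce_partition, buckets))

# The partial results have disjoint keys, so merging them is a plain union.
merged = {}
for part in partial:
    merged.update(part)
print(merged)
```

Since the buckets never share a key, each reducer's output is already final for its keys, and adding reducers spreads the work without changing the result.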

Page 12: Introduction to Hadoop and MapReduce

Map, Shuffle, and Reduce

https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
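The shuffle step between map and reduce brings together all intermediate values that share a key. A minimal single-machine sketch of that grouping, using sort plus `itertools.groupby` (the function name `shuffle` and the sample pairs are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter


def shuffle(mapped_pairs):
    """Sort intermediate (key, value) pairs by key and group the values per key."""
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


mapped = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
grouped = shuffle(mapped)
print(grouped)  # [('a', [1, 1]), ('b', [1, 1]), ('c', [1])]
```

Each `(key, values)` entry is what a reduce function would then receive as its input.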

Page 13: Introduction to Hadoop and MapReduce

Word count

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
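Word count is the customary first MapReduce example. The sketch below runs the three phases in one Python process. The tokenizing rule and the three input lines are assumptions chosen for the illustration, not taken from the figure.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: add up all the 1s emitted for a word."""
    return word, sum(counts)


def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for word, one in map_words(line):
            groups[word].append(one)        # shuffle: group values by word
    return dict(reduce_counts(w, c) for w, c in groups.items())  # reduce phase


counts = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```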

Page 14: Introduction to Hadoop and MapReduce

Hadoop architecture

• Job Tracker
• Task Tracker
• Name Node
• Data Node

Page 15: Introduction to Hadoop and MapReduce

Figures

• Following: some figures from the book Hadoop: The Definitive Guide, 3rd Edition

Page 16: Introduction to Hadoop and MapReduce

A client reading data from HDFS

Page 17: Introduction to Hadoop and MapReduce

A client writing data to HDFS

Page 18: Introduction to Hadoop and MapReduce

Network distance in Hadoop
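Hadoop is commonly described as modelling the network as a tree (data center, rack, node) and taking the distance between two nodes as the number of hops from each of them up to their closest common ancestor. A small sketch under that assumption follows. The path format `/d1/r1/n1` and the function name are illustrative.

```python
def network_distance(a, b):
    """Distance between two locations given as /datacenter/rack/node paths:
    hops from each location up to their closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


print(network_distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(network_distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(network_distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(network_distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```

A smaller distance means a cheaper transfer, which is why reading from a replica on the same node or rack is preferred.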

Page 19: Introduction to Hadoop and MapReduce
Page 20: Introduction to Hadoop and MapReduce

MapReduce data flow with a single reduce task

Page 21: Introduction to Hadoop and MapReduce

MapReduce data flow with multiple reduce tasks

Page 22: Introduction to Hadoop and MapReduce

Hadoop architecture

(Layer diagram, listed from data sources up to presentation:)

• Data sources: Log Data, RDBMS
• Data Integration Layer: Flume, Sqoop
• Storage Layer (HDFS)
• Computing Layer (MapReduce)
• Advanced Query Engine (Hive, Pig); Data Mining (Pegasus, Mahout); Index, Searches (Lucene)
• DB drivers (Hive driver)
• Presentation Layer: Web Browser (JS)

Page 23: Introduction to Hadoop and MapReduce

Demo

• Google Compute Engine + Google Cloud Storage
• Using Ubuntu as a remote control host
• Following the tutorial of:
– https://github.com/GoogleCloudPlatform/solutions-google-compute-engine-cluster-for-hadoop
– Hadoop on Google Compute Engine for Processing Big Data: https://www.youtube.com/watch?v=se9vV8eIZME
• The example Hadoop job is an advanced version of word count in Perl or Python: the words are sorted by length and then alphabetically
• Also showing the Google Developer Tool web interface
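The demo job's scripts are not reproduced in the slides, so the following is only a guess at their general shape, written in the line-oriented style used by Hadoop Streaming: the mapper emits tab-separated `word<TAB>1` lines, and the reducer receives those lines sorted by key. All function names and the sample text are assumptions. The final ordering by length and then alphabetically follows the slide's description.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that carry the same word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def sort_by_length_then_alphabet(counts):
    """Order (word, count) pairs by word length, then alphabetically."""
    return sorted(counts, key=lambda wc: (len(wc[0]), wc[0]))


text = ["the quick brown fox", "the lazy dog", "a fox"]
mapped = sorted(mapper(text))  # stands in for the sort between map and reduce
result = sort_by_length_then_alphabet(reducer(mapped))
print(result)
# [('a', 1), ('dog', 1), ('fox', 2), ('the', 2), ('lazy', 1), ('brown', 1), ('quick', 1)]
```

In an actual streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output. Here they are plain functions so the whole flow can be run in one process.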

Page 24: Introduction to Hadoop and MapReduce

References

• Google’s tutorial (see the GitHub and YouTube links of the Demo)
• Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press
• Lynn Langit’s various presentations and YouTube videos
• Dattatrey Sindol: Big Data Basics - Part 1 - Introduction to Big Data
• Bruno Terkaly’s presentations (for example Hadoop on Azure: Introduction)
• Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know

Page 25: Introduction to Hadoop and MapReduce

Thanks for your attention!