Hadoop, HDFS and MapReduce

Transcript
Page 1: Hadoop, HDFS and MapReduce

Hadoop and MapReduce

The workings of the elephant

Friso van Vollenhoven, [email protected]

Page 2: Hadoop, HDFS and MapReduce

Data everywhere

‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days

‣ Need means of economically storing and processing large data sets

Page 3: Hadoop, HDFS and MapReduce

Opportunity

‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper

Page 4: Hadoop, HDFS and MapReduce

Traditional solution

‣ Store data in a (relational) database
‣ Run batch jobs for processing

Page 5: Hadoop, HDFS and MapReduce

Problems with existing solutions

‣ Databases are seek heavy; a B-tree requires log(n) random accesses per update
‣ Seeks are wasted time; nothing of value happens during a seek
‣ Databases do not play well with commodity hardware (SANs and 16-CPU machines are not in the price sweet spot of performance / $)
‣ Databases were not built with horizontal scaling in mind

Page 6: Hadoop, HDFS and MapReduce

Solution: sort/merge vs. updating the B-tree

‣ Eliminate the seeks; only sequential reading / writing
‣ Work with batches for efficiency
‣ Parallelize the workload
‣ Distribute processing and storage
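A rough back-of-the-envelope comparison shows why this matters (illustrative 2011-era disk numbers): a seek costs on the order of 10 ms, while sequential transfer runs at roughly 100 MB/s, so a single seek costs about as much time as reading 1 MB sequentially. Millions of scattered B-tree updates are therefore dominated by seek time, while a sort/merge pass over the same data streams at full disk bandwidth.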

Page 7: Hadoop, HDFS and MapReduce

History

‣ 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; the sort/merge optimization applies
‣ 2004: Google publishes the GFS and MapReduce papers
‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch's problem; later becomes a standalone project
‣ 2011: We're here learning about it!

Page 8: Hadoop, HDFS and MapReduce

Hadoop foundations

‣ Commodity hardware ($3K - $7K machines)
‣ Only sequential reads / writes
‣ Distribution of data and processing across the cluster
‣ Built-in reliability / fault tolerance / redundancy
‣ Disk based; does not require data or indexes to fit in RAM
‣ Apache licensed, Open Source Software

Page 9: Hadoop, HDFS and MapReduce
Page 10: Hadoop, HDFS and MapReduce

The US government builds its fingerprint search index using Hadoop.

Page 11: Hadoop, HDFS and MapReduce
Page 12: Hadoop, HDFS and MapReduce

The content for LinkedIn's People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.

Page 13: Hadoop, HDFS and MapReduce
Page 14: Hadoop, HDFS and MapReduce

Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a commonly seen use case.

Page 15: Hadoop, HDFS and MapReduce
Page 16: Hadoop, HDFS and MapReduce

Top Charts are generated daily based on millions of users’ listening behavior.

Page 17: Hadoop, HDFS and MapReduce
Page 18: Hadoop, HDFS and MapReduce

Top searches used for auto-completion are regenerated daily by a MapReduce job using all searches from the past couple of days. Popularity of search terms can be based on counts, but also on trending and on correlation with other datasets (e.g. trending on social media, news, charts in the case of music and movies, best-seller lists, etc.).

Page 19: Hadoop, HDFS and MapReduce

What is Hadoop

Page 21: Hadoop, HDFS and MapReduce

HDFS overview

‣ Distributed filesystem
‣ Consists of a single master node and multiple (many) data nodes
‣ Files are split up into blocks (typically 64MB)
‣ Blocks are spread across data nodes in the cluster
‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)
‣ Master node keeps track of which blocks belong to a file
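Using the typical numbers above as a worked example: a 1 GB file splits into sixteen 64 MB blocks; with 3× replication the cluster stores 48 block replicas, roughly 3 GB of raw disk for 1 GB of data.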

Page 22: Hadoop, HDFS and MapReduce

HDFS interaction

‣ Accessible through a Java API (see the sketch below)
‣ FUSE (filesystem in user space) driver available to mount as a regular FS
‣ C API available
‣ Basic command line tools in the Hadoop distribution
‣ Web interface
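A minimal sketch of using the Java API, assuming a cluster reachable through the default Configuration on the classpath (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name / fs.defaultFS from the classpath configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // List the root directory, similar to 'hadoop fs -ls /'
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}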

Page 23: Hadoop, HDFS and MapReduce

HDFS interaction

‣ File creation, directory listing and other metadata actions go through the master node (e.g. ls, du, fsck, create file)
‣ Data goes directly to and from data nodes (read, write, append)
‣ Local read path optimization: clients located on the same machine as a data node will always access the local replica when possible

Page 24: Hadoop, HDFS and MapReduce

Hadoop FileSystem (HDFS)

[Diagram: an HDFS cluster. The Name Node tracks which blocks make up each file (/some/file, /foo/bar); several Data Nodes each manage multiple disks. An HDFS client creates files through the Name Node, but writes and reads block data directly to and from the Data Nodes, which replicate blocks among themselves; a node-local HDFS client reads from its local replica.]

Page 25: Hadoop, HDFS and MapReduce

HDFS daemons: NameNode

‣ Filesystem master node
‣ Keeps track of directories, files and block locations
‣ Assigns blocks to data nodes
‣ Keeps track of live nodes (through heartbeats)
‣ Initiates re-replication in case of data node loss
‣ Block metadata is held in memory
• Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
• Some solutions exist

Page 26: Hadoop, HDFS and MapReduce

HDFS daemons: DataNode

‣ Filesystem worker node / “Block server”
‣ Uses an underlying regular FS for storage (e.g. ext3)
• Takes care of distribution of blocks across disks
• Don’t use RAID
• More disks means more IO throughput
‣ Sends heartbeats to the NameNode
‣ Reports its blocks to the NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)

Page 27: Hadoop, HDFS and MapReduce

Things to know about HDFS

‣ HDFS is write once, read many
• But has append support in newer versions
‣ Has built-in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp; see the example below)
‣ Provides a convenient file format to gather lots of small files into a single large one
• Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations
• Optimized for sequential reads, not random access
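For instance, a cluster-to-cluster copy with distcp looks like this (host names, ports and paths are illustrative):

hadoop distcp hdfs://namenode1:8020/logs hdfs://namenode2:8020/logs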

Page 28: Hadoop, HDFS and MapReduce

Hadoop Sequence Files

‣ Special type of file to store Key-Value pairs
‣ Stores keys and values as byte arrays
‣ Uses length-encoded bytes as its format
‣ Often used as input or output format for MapReduce jobs (see the sketch below)
‣ Has built-in compression on values
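A minimal sketch of writing a sequence file through the Java API, assuming the default configuration (the path and the key/value contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("example.seq");
        // Keys and values are Writables, serialized as length-encoded byte arrays
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("hadoop"), new IntWritable(1));
        } finally {
            writer.close();
        }
    }
}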

Page 29: Hadoop, HDFS and MapReduce

Example: command line directory listing

friso@fvv:~/java$ hadoop fs -ls /
Found 3 items
drwxr-xr-x   - friso supergroup          0 2011-03-31 17:06 /Users
drwxr-xr-x   - friso supergroup          0 2011-03-16 14:16 /hbase
drwxr-xr-x   - friso supergroup          0 2011-04-18 11:33 /user
friso@fvv:~/java$

Page 30: Hadoop, HDFS and MapReduce

Example: NameNode web interface

Page 31: Hadoop, HDFS and MapReduce

Example: copy local file to HDFS

friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json

Page 32: Hadoop, HDFS and MapReduce

MapReduce

Massively parallelizable computing

Friso van Vollenhoven, [email protected]

Page 33: Hadoop, HDFS and MapReduce

MapReduce, the algorithm

Input data: [diagram of records, each holding one colorful shape and several gray shapes]. Required output: [for each colorful shape, all of its gray shapes collected together].

Page 34: Hadoop, HDFS and MapReduce

Map: extract something useful from each record

[Diagram: eight parallel map calls, each turning one input record into a key-value pair under the KEYS / VALUES columns]

void map(recordNumber, record) {
    key = record.findColorfulShape();
    value = record.findGrayShapes();
    emit(key, value);
}

Page 35: Hadoop, HDFS and MapReduce

Framework sorts all KeyValue pairs by Key

[Diagram: key-value pairs before sorting on the left, the same pairs sorted by key on the right]

Page 36: Hadoop, HDFS and MapReduce

Reduce: process values for each key

void reduce(key, values) {
    allGrayShapes = [];
    foreach (value in values) {
        allGrayShapes.push(value);
    }
    emit(key, allGrayShapes);
}

[Diagram: three parallel reduce calls, each processing the sorted values for one key]

Page 37: Hadoop, HDFS and MapReduce

MapReduce, the algorithm

[Diagram: the full pipeline: parallel map calls emit key-value pairs, the framework sorts them by key, and parallel reduce calls process the grouped values]
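To make the pattern concrete, here is a minimal sketch of the same map/reduce idea in Hadoop's Java API: the classic word count, where the map emits (word, 1) pairs and the reduce sums the counts per word. The class names are illustrative, not the presenter's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: extract something useful from each record; here every word becomes a key
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit(key, value)
            }
        }
    }

    // Reduce: process the sorted values for each key; here, sum the counts
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // emit(key, total)
        }
    }
}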

Page 38: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

‣ Job input comes from files on HDFS (see the driver sketch below)
• Typically sequence files
• Other formats are possible; requires a specialized InputFormat implementation
• Built-in support for text files (convenient for logs, csv, etc.)
• Files must be splittable for parallelization to work
- Not all compression formats have this property (e.g. gzip)
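A minimal driver sketch wiring the word count classes above to HDFS input and output. The paths and job name are illustrative, and it assumes a Hadoop 2-style API (Job.getInstance):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Text input: each mapper gets line-oriented splits of splittable HDFS files
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}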

Page 39: Hadoop, HDFS and MapReduce

MapReduce daemons: JobTracker

‣ MapReduce master node
‣ Takes care of scheduling and job submission
‣ Splits jobs into tasks (Mappers and Reducers)
‣ Assigns tasks to worker nodes
‣ Reassigns tasks in case of failure
‣ Keeps track of job progress
‣ Keeps track of worker nodes through heartbeats

Page 40: Hadoop, HDFS and MapReduce

MapReduce daemons: TaskTracker

‣ MapReduce worker process
‣ Starts Mappers and Reducers assigned by the JobTracker
‣ Sends heartbeats to the JobTracker
‣ Sends task progress to the JobTracker
‣ Does not know about the rest of the cluster (shared nothing)

Page 41: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

[Figure: MapReduce data flow with combiner functions, reproduced from a book chapter on MapReduce]
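The figure's subject, combiner functions, pre-aggregate map output on the mapper side before it crosses the network. When the reduce function is associative and commutative, as with the summing reducer sketched earlier, the same class can double as the combiner. A minimal assumed wiring for the driver above (not necessarily the presenter's setup):

// Run the summing reducer on each mapper's local output to shrink shuffle traffic
job.setCombinerClass(WordCount.IntSumReducer.class);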

Page 42: Hadoop, HDFS and MapReduce

Hadoop MapReduce: Mapper side

‣ Each mapper processes a piece of the total input
• Typically blocks that reside on the same machine as the mapper (local datanode)
‣ Mappers sort output by key and store it on the local disk
• If the mapper output does not fit in RAM, an on-disk merge sort happens

Page 43: Hadoop, HDFS and MapReduce

Hadoop MapReduce: Reducer side

‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
• The Reducer performs an (on disk) merge on the inputs from different mappers
‣ The Reducer calls the reduce method for each unique key
• The list of values for each key is read from local disk (the result of the merge)
• Values do not need to fit in RAM
- Reduce methods that need a global view need enough RAM to fit all values for a key
‣ The Reducer writes output KeyValue pairs to HDFS
• Typically blocks go to the local data node

Page 44: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

[Figure: repeat of the MapReduce data flow / combiner functions figure from page 41]

Page 45: Hadoop, HDFS and MapReduce

<PLUG>

Summer Classes
Big data crunching using Hadoop and other NoSQL tools
• Write Hadoop MapReduce jobs in Java
• Run on an actual cluster pre-loaded with several datasets
• Create a simple application or visualization with the result
• Learn about Hadoop without the hassle of building a production cluster first
• Have lots of fun!

Dates: July 12, August 10
Only €295 for a full-day course

http://www.xebia.com/summerclasses/bigdata

</PLUG>