Hadoop, HDFS and MapReduce

Transcript
Page 1: Hadoop, HDFS and MapReduce

Hadoop and MapReduce

The workings of the elephant

Friso van Vollenhoven, [email protected]

Page 2: Hadoop, HDFS and MapReduce

Data everywhere

‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days

‣ Need means of economically storing and processing large data sets

Page 3: Hadoop, HDFS and MapReduce

Opportunity

‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper

Page 4: Hadoop, HDFS and MapReduce

Traditional solution

‣ Store data in a (relational) database
‣ Run batch jobs for processing

Page 5: Hadoop, HDFS and MapReduce

Problems with existing solutions

‣ Databases are seek heavy; a B-tree requires log(n) random accesses per update
‣ Seeks are wasted time; nothing of value happens during a seek
‣ Databases do not play well with commodity hardware (SANs and 16-CPU machines are not in the price sweet spot of performance / $)
‣ Databases were not built with horizontal scaling in mind

Page 6: Hadoop, HDFS and MapReduce

Solution: sort/merge vs. updating the B-tree

‣ Eliminate the seeks; only sequential reading / writing
‣ Work with batches for efficiency
‣ Parallelize the workload
‣ Distribute processing and storage
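A rough back-of-the-envelope comparison shows why this matters (illustrative 2011-era disk numbers): a seek costs on the order of 10 ms, while sequential transfer runs at roughly 100 MB/s, so a single seek costs about as much time as reading 1 MB sequentially. Millions of scattered B-tree updates are therefore dominated by seek time, while a sort/merge pass over the same data streams at full disk bandwidth.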

Page 7: Hadoop, HDFS and MapReduce

History

‣ 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; the sort/merge optimization applies
‣ 2004: Google publishes the GFS and MapReduce papers
‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch's problem; later becomes a standalone project
‣ 2011: We're here learning about it!

Page 8: Hadoop, HDFS and MapReduce

Hadoop foundations

‣ Commodity hardware ($3K - $7K machines)
‣ Only sequential reads / writes
‣ Distribution of data and processing across the cluster
‣ Built-in reliability / fault tolerance / redundancy
‣ Disk based; does not require data or indexes to fit in RAM
‣ Apache licensed, Open Source Software

Page 9: Hadoop, HDFS and MapReduce
Page 10: Hadoop, HDFS and MapReduce

The US government builds its fingerprint search index using Hadoop.

Page 11: Hadoop, HDFS and MapReduce
Page 12: Hadoop, HDFS and MapReduce

The content for LinkedIn's People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.

Page 13: Hadoop, HDFS and MapReduce
Page 14: Hadoop, HDFS and MapReduce

Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a commonly seen use case.

Page 15: Hadoop, HDFS and MapReduce
Page 16: Hadoop, HDFS and MapReduce

Top Charts are generated daily based on millions of users’ listening behavior.

Page 17: Hadoop, HDFS and MapReduce
Page 18: Hadoop, HDFS and MapReduce

Top searches used for auto-completion are regenerated daily by a MapReduce job using all searches from the past couple of days. Popularity of search terms can be based on counts, but also on trending and on correlation with other datasets (e.g. trending on social media, news, charts in the case of music and movies, best-seller lists, etc.).

Page 19: Hadoop, HDFS and MapReduce

What is Hadoop

Page 21: Hadoop, HDFS and MapReduce

HDFS overview

‣ Distributed filesystem
‣ Consists of a single master node and multiple (many) data nodes
‣ Files are split up into blocks (typically 64MB)
‣ Blocks are spread across data nodes in the cluster
‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)
‣ Master node keeps track of which blocks belong to a file
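Using the typical numbers above as a worked example: a 1 GB file splits into sixteen 64 MB blocks; with 3× replication the cluster stores 48 block replicas, roughly 3 GB of raw disk for 1 GB of data.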

Page 22: Hadoop, HDFS and MapReduce

HDFS interaction

‣ Accessible through a Java API (see the sketch below)
‣ FUSE (filesystem in user space) driver available to mount as a regular FS
‣ C API available
‣ Basic command line tools in the Hadoop distribution
‣ Web interface
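A minimal sketch of using the Java API, assuming a cluster reachable through the default Configuration on the classpath (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name / fs.defaultFS from the classpath configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // List the root directory, similar to 'hadoop fs -ls /'
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}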

Page 23: Hadoop, HDFS and MapReduce

HDFS interaction

‣ File creation, directory listing and other metadata actions go through the master node (e.g. ls, du, fsck, create file)
‣ Data goes directly to and from data nodes (read, write, append)
‣ Local read path optimization: clients located on the same machine as a data node will always access the local replica when possible

Page 24: Hadoop, HDFS and MapReduce

Hadoop FileSystem (HDFS)

[Diagram: an HDFS cluster. The Name Node tracks which blocks make up each file (/some/file, /foo/bar); several Data Nodes each manage multiple disks. An HDFS client creates files through the Name Node, but writes and reads block data directly to and from the Data Nodes, which replicate blocks among themselves; a node-local HDFS client reads from its local replica.]

Page 25: Hadoop, HDFS and MapReduce

HDFS daemons: NameNode

‣ Filesystem master node
‣ Keeps track of directories, files and block locations
‣ Assigns blocks to data nodes
‣ Keeps track of live nodes (through heartbeats)
‣ Initiates re-replication in case of data node loss
‣ Block metadata is held in memory
• Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
• Some solutions exist

Page 26: Hadoop, HDFS and MapReduce

HDFS daemons: DataNode

‣ Filesystem worker node / “Block server”
‣ Uses an underlying regular FS for storage (e.g. ext3)
• Takes care of distribution of blocks across disks
• Don’t use RAID
• More disks means more IO throughput
‣ Sends heartbeats to the NameNode
‣ Reports its blocks to the NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)

Page 27: Hadoop, HDFS and MapReduce

Things to know about HDFS

‣ HDFS is write once, read many
• But has append support in newer versions
‣ Has built-in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp; see the example below)
‣ Provides a convenient file format to gather lots of small files into a single large one
• Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations
• Optimized for sequential reads, not random access
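For instance, a cluster-to-cluster copy with distcp looks like this (host names, ports and paths are illustrative):

hadoop distcp hdfs://namenode1:8020/logs hdfs://namenode2:8020/logs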

Page 28: Hadoop, HDFS and MapReduce

Hadoop Sequence Files

‣ Special type of file to store Key-Value pairs
‣ Stores keys and values as byte arrays
‣ Uses length-encoded bytes as its format
‣ Often used as input or output format for MapReduce jobs (see the sketch below)
‣ Has built-in compression on values
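A minimal sketch of writing a sequence file through the Java API, assuming the default configuration (the path and the key/value contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("example.seq");
        // Keys and values are Writables, serialized as length-encoded byte arrays
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("hadoop"), new IntWritable(1));
        } finally {
            writer.close();
        }
    }
}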

Page 29: Hadoop, HDFS and MapReduce

Example: command line directory listing

friso@fvv:~/java$ hadoop fs -ls /
Found 3 items
drwxr-xr-x   - friso supergroup          0 2011-03-31 17:06 /Users
drwxr-xr-x   - friso supergroup          0 2011-03-16 14:16 /hbase
drwxr-xr-x   - friso supergroup          0 2011-04-18 11:33 /user
friso@fvv:~/java$

Page 30: Hadoop, HDFS and MapReduce

Example: NameNode web interface

Page 31: Hadoop, HDFS and MapReduce

Example: copy local file to HDFS

friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json

Page 32: Hadoop, HDFS and MapReduce

MapReduce

Massively parallelizable computing

Friso van Vollenhoven, [email protected]

Page 33: Hadoop, HDFS and MapReduce

MapReduce, the algorithm

Input data: [diagram of records, each holding one colorful shape and several gray shapes]. Required output: [for each colorful shape, all of its gray shapes collected together].

Page 34: Hadoop, HDFS and MapReduce

Map: extract something useful from each record

[Diagram: eight parallel map calls, each turning one input record into a key-value pair under the KEYS / VALUES columns]

void map(recordNumber, record) {
    key = record.findColorfulShape();
    value = record.findGrayShapes();
    emit(key, value);
}

Page 35: Hadoop, HDFS and MapReduce

Framework sorts all KeyValue pairs by Key

[Diagram: key-value pairs before sorting on the left, the same pairs sorted by key on the right]

Page 36: Hadoop, HDFS and MapReduce

Reduce: process values for each key

void reduce(key, values) {
    allGrayShapes = [];
    foreach (value in values) {
        allGrayShapes.push(value);
    }
    emit(key, allGrayShapes);
}

[Diagram: three parallel reduce calls, each processing the sorted values for one key]

Page 37: Hadoop, HDFS and MapReduce

MapReduce, the algorithm

[Diagram: the full pipeline: parallel map calls emit key-value pairs, the framework sorts them by key, and parallel reduce calls process the grouped values]
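To make the pattern concrete, here is a minimal sketch of the same map/reduce idea in Hadoop's Java API: the classic word count, where the map emits (word, 1) pairs and the reduce sums the counts per word. The class names are illustrative, not the presenter's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: extract something useful from each record; here every word becomes a key
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit(key, value)
            }
        }
    }

    // Reduce: process the sorted values for each key; here, sum the counts
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // emit(key, total)
        }
    }
}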

Page 38: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

‣ Job input comes from files on HDFS (see the driver sketch below)
• Typically sequence files
• Other formats are possible; requires a specialized InputFormat implementation
• Built-in support for text files (convenient for logs, csv, etc.)
• Files must be splittable for parallelization to work
- Not all compression formats have this property (e.g. gzip)
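A minimal driver sketch wiring the word count classes above to HDFS input and output. The paths and job name are illustrative, and it assumes a Hadoop 2-style API (Job.getInstance):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Text input: each mapper gets line-oriented splits of splittable HDFS files
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}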

Page 39: Hadoop, HDFS and MapReduce

MapReduce daemons: JobTracker

‣ MapReduce master node
‣ Takes care of scheduling and job submission
‣ Splits jobs into tasks (Mappers and Reducers)
‣ Assigns tasks to worker nodes
‣ Reassigns tasks in case of failure
‣ Keeps track of job progress
‣ Keeps track of worker nodes through heartbeats

Page 40: Hadoop, HDFS and MapReduce

MapReduce daemons: TaskTracker

‣ MapReduce worker process
‣ Starts Mappers and Reducers assigned by the JobTracker
‣ Sends heartbeats to the JobTracker
‣ Sends task progress to the JobTracker
‣ Does not know about the rest of the cluster (shared nothing)

Page 41: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

[Figure: MapReduce data flow with combiner functions, reproduced from a book chapter on MapReduce]
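The figure's subject, combiner functions, pre-aggregate map output on the mapper side before it crosses the network. When the reduce function is associative and commutative, as with the summing reducer sketched earlier, the same class can double as the combiner. A minimal assumed wiring for the driver above (not necessarily the presenter's setup):

// Run the summing reducer on each mapper's local output to shrink shuffle traffic
job.setCombinerClass(WordCount.IntSumReducer.class);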

Page 42: Hadoop, HDFS and MapReduce

Hadoop MapReduce: Mapper side

‣ Each mapper processes a piece of the total input
• Typically blocks that reside on the same machine as the mapper (local datanode)
‣ Mappers sort output by key and store it on the local disk
• If the mapper output does not fit in RAM, an on-disk merge sort happens

Page 43: Hadoop, HDFS and MapReduce

Hadoop MapReduce: Reducer side

‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
• The Reducer performs an (on disk) merge on the inputs from different mappers
‣ The Reducer calls the reduce method for each unique key
• The list of values for each key is read from local disk (the result of the merge)
• Values do not need to fit in RAM
- Reduce methods that need a global view need enough RAM to fit all values for a key
‣ The Reducer writes output KeyValue pairs to HDFS
• Typically blocks go to the local data node

Page 44: Hadoop, HDFS and MapReduce

Hadoop MapReduce: parallelized on top of HDFS

[Figure: repeat of the MapReduce data flow / combiner functions figure from page 41]

Page 45: Hadoop, HDFS and MapReduce

<PLUG>

Summer Classes
Big data crunching using Hadoop and other NoSQL tools
• Write Hadoop MapReduce jobs in Java
• Run on an actual cluster pre-loaded with several datasets
• Create a simple application or visualization with the result
• Learn about Hadoop without the hassle of building a production cluster first
• Have lots of fun!

Dates: July 12, August 10
Only €295 for a full-day course

http://www.xebia.com/summerclasses/bigdata

</PLUG>