Introduction to Hadoop - O'Reilly Media assets.en.oreilly.com/1/event/61/Introduction to...

Introduction to Hadoop

Copyright 2011 Cloudera Inc. All rights reserved

Today’s speaker – Tom Hanlon

• Senior Instructor at Cloudera



Agenda

• What is Hadoop?
• Some Internals
• The ecosystem


Hadoop Distributed File System (HDFS)

MapReduce

Apache Hadoop

• Consolidates Mixed Data
  • Move complex and relational data into a single repository

• Stores Inexpensively
  • Keep raw data always available
  • Use industry standard hardware

• Processes at the Source
  • Eliminate ETL bottlenecks
  • Mine data first, govern later

Open Source Distributed Storage and Processing Engine


What is Hadoop

• Hadoop is a system of distributed, fault-tolerant, scalable storage and processing
  • Modeled after papers published by Google describing their architecture

• Open Source, Apache Licensed
• Java


What is Hadoop

• More info: http://hadoop.apache.org


Who uses Hadoop ?

• Twitter, Facebook, Yahoo, StumbleUpon and others


Should you use Hadoop ?

• Is your data too big for a database?
• Is the main bottleneck in processing your data the time that it takes to read the data from disk?
• We can store more data than we can process
  • The ratio of IOPS/TB is going backwards
  • Disks get a little bit faster, but they get a lot bigger


Limitations of Hadoop

• All systems have limitations; we just get used to them and ignore them over time. Think about how hard it was to squeeze your data into a relational model back in the day.


Limitations of Hadoop

• Batch Processing
  • Hadoop is currently more or less about batch processing of huge amounts of data, and it excels at that.

• Hadoop is NOT about random access *
• Hadoop is NOT about low latency
• Hadoop is NOT about real time lookup
• Hadoop is NOT about caching
• Hadoop is NOT about SQL

* HBase is for quick random access with caching


Now look at some internals



Hadoop Internals

• Hadoop at the core consists of HDFS and MapReduce
• HDFS, the Hadoop Distributed File System
  • Responsible for storing files and handling node failure

• MapReduce
  • A distributed system for processing data stored in HDFS


HDFS

• Presents the aggregate storage of the disks in your datanodes as more or less one big file system


HDFS Fundamentals

• Data is replicated
  • Default is 3 times

• Files are immutable
  • No updates, no appends

• Disk access is optimized for sequential reads
  • Store data in large “blocks”, 64MB default


HDFS Fundamentals continued

• Avoid Corruption
  • “Blocks” are verified with a checksum when stored and read

• High throughput
  • Avoid contention; have the system share as little information and resources as possible

• Fault Tolerant
  • Loss of a disk, or machine, or rack of machines should not lead to data loss
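The checksum bullet above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HDFS is implemented: the function names are made up for this example, and `zlib.crc32` stands in for whatever checksum the real system uses.

```python
import zlib


def store_block(data: bytes) -> tuple[bytes, int]:
    """Return the block together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def read_block(data: bytes, expected_checksum: int) -> bytes:
    """Recompute the checksum on read and refuse to return corrupt data."""
    if zlib.crc32(data) != expected_checksum:
        raise IOError("block checksum mismatch: this copy is corrupt")
    return data


# A block that is stored and read back unchanged passes verification.
block, checksum = store_block(b"127.0.0.1 GET /index.html\n")
assert read_block(block, checksum) == block
```

Because blocks are replicated, a reader that hits a corrupt copy can simply try another one.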

NameNode

DataNodes


HDFS Architecture


Focus on the NameNode

• Only one per cluster, the “master node”
• Stores meta information of the “filesystem”
  • Filename, permissions, directories, blocks

• Kept in RAM for fast access
• Persisted to disk


Focus on the DataNodes

• Many per cluster, “slave nodes”
• Stores individual file “blocks” but knows nothing about them, except the block name.
  • The NameNode is the brains of the outfit

• Reports regularly to NameNode
  • “Hey, I am alive, and I have these blocks”
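A rough Python sketch of the division of labor described above: the NameNode keeps the file-to-block mapping in memory, and each DataNode's regular report tells it where the copies live. The class, method, and block names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy in-memory metadata store; not the real NameNode API."""

    def __init__(self):
        self.file_blocks = {}                    # filename -> ordered block names
        self.block_locations = defaultdict(set)  # block name -> datanodes with a copy

    def add_file(self, filename, block_names):
        self.file_blocks[filename] = list(block_names)

    def block_report(self, datanode, block_names):
        """'Hey, I am alive, and I have these blocks.'"""
        # Forget what this datanode reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for name in block_names:
            self.block_locations[name].add(datanode)

    def locate(self, filename):
        """Return each block of the file with the datanodes that hold it."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.file_blocks[filename]]


nn = NameNodeSketch()
nn.add_file("/logs/access.log", ["blk_1", "blk_2"])
nn.block_report("datanode1", ["blk_1", "blk_2"])
nn.block_report("datanode2", ["blk_1"])
print(nn.locate("/logs/access.log"))
```

Note that the DataNode side only ever deals in block names; everything about files lives in the NameNode.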

NameNode

DataNodes

Client


HDFS: Anatomy of a write



First Client connects to NameNode for permission check and to record metadata



NameNode creates metadata and confirms permission to write. Returns datanode that Client can stream to.



Client begins streaming the file to a datanode


Datanode receiving data, streams to second and third datanode
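The last two steps of the write can be sketched as a pipeline: the client hands the block to the first datanode only, and each datanode forwards it to the next. In this toy Python version a datanode is just a dict; the function and block names are assumptions made for the example.

```python
def stream_block(block_name, data, pipeline):
    """Store the block on the first datanode, which forwards it down the pipeline.

    Returns the number of copies written.
    """
    if not pipeline:
        return 0
    first, rest = pipeline[0], pipeline[1:]
    first[block_name] = data                           # this datanode keeps a copy
    return 1 + stream_block(block_name, data, rest)    # and passes the data along


datanode1, datanode2, datanode3 = {}, {}, {}
copies = stream_block("blk_1", b"log lines...", [datanode1, datanode2, datanode3])
print(copies)  # 3
```

The client sends the data once; replication is the datanodes' job.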


HDFS blocks revisited

• If a file is larger than 64MB then it will consist of multiple blocks.
  • 70MB file = 1 64MB block, 1 6MB block
• Blocks will be replicated 3 times
• We can read the file by accessing any of the 3 copies of the blocks.
• Immutable blocks mean any copy is as good as any other
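The block arithmetic above, as a small Python sketch. The function names are made up for this example; the 64MB block size and replication factor of 3 are the defaults quoted on the slides.

```python
BLOCK_SIZE_MB = 64
REPLICATION = 3


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size is stored as."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


print(split_into_blocks(70))  # [64, 6]
print(raw_storage_mb(70))     # 210
```

The last block only takes up as much space as it needs, so the 70MB file costs 210MB of raw disk, not 3 x 128MB.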


MapReduce

• HDFS handles the distributed filesystem layer
• MapReduce is how we process the data
• MapReduce Daemons
  • JobTracker
  • TaskTracker

• Goals
  • Distribute the reading and processing of data
  • Localize the processing when possible
  • Share as little data as possible while processing


MapReduce


Focus on the JobTracker

• One per cluster, the “master node”
• Takes jobs from clients
• Splits work into “tasks”
• Distributes “tasks” to TaskTrackers
• Monitors progress, deals with failures


Focus on the TaskTrackers

• Many per cluster, “slave nodes”
• Does the actual work, executes the code for the job
• Talks regularly with JobTracker
• Launches child process when given a task
• Reports progress of running “task” back to JobTracker


Anatomy of a MapReduce Job

• Client submits job
  • I want to process all the Apache log files and compare hits to our website from the top 50 cities in the US with census data on income. *

* We will assume the census data is already in the Data Warehouse, so the Hadoop piece is just the weblog report from 50 cities.


Anatomy of a MapReduce Job continued...

• JobTracker receives Job
• Queries NameNode for number of blocks in file
• The Job is split into tasks
  • One map task for each block
  • As many Reduce tasks as specified in the Job


Anatomy of a MapReduce Job continued...

• TaskTracker checks in regularly with JobTracker
  • Is there any work for me?

• If the JobTracker has a MapTask for which the TaskTracker holds a local block of the file being processed, then the TaskTracker will be given the “task”
  • In this case the TaskTracker will search the log for lines with activity from any of the 50 cities we care about
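The locality rule above can be sketched in Python as follows. The function, task, and node names are hypothetical. The sketch only models the case described on the slide, where a local block exists; it does not model the fallback a real scheduler needs when no pending task is local to the node asking for work.

```python
def assign_task(tasktracker, pending_map_tasks, block_locations):
    """Hand the checking-in TaskTracker a map task whose block it stores locally.

    pending_map_tasks: list of task ids still waiting to run (mutated on success)
    block_locations:   task id -> set of nodes holding that task's block
    """
    for task in pending_map_tasks:
        if tasktracker in block_locations[task]:
            pending_map_tasks.remove(task)
            return task
    return None  # nothing local for this node right now


pending = ["map_0", "map_1"]
locations = {"map_0": {"node_a", "node_b"}, "map_1": {"node_c"}}
print(assign_task("node_c", pending, locations))  # map_1
```

Moving the task to the data is cheap; moving a 64MB block to the task is not.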


MapReduce workflow Big Picture


Client Submits job to JobTracker


JobTracker queries NameNode for file/block info


JobTracker defines job as collection of Tasks

1 file, 100 blocks = 100 Map Tasks
10 reduce Tasks = 10 Reduce Tasks
waiting to assign 110 tasks for jobid 12422424_01
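The task count above follows directly from the rule "one map task per block, plus as many reduce tasks as the job asks for". A tiny Python sketch, with a made-up function name:

```python
def count_tasks(blocks_per_file, num_reduce_tasks):
    """Return (map tasks, reduce tasks, total) for a job over the given files."""
    map_tasks = sum(blocks_per_file)  # one map task for each block
    return map_tasks, num_reduce_tasks, map_tasks + num_reduce_tasks


# 1 file of 100 blocks, 10 reduce tasks requested by the job
print(count_tasks([100], 10))  # (100, 10, 110)
```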


TaskTrackers checking in are assigned Tasks


When all Tasks for a particular Job complete, the job is complete.

110 tasks completed successfully, Update web interface..

Job 12422424_01 is done


Focus on the Workflow

• Five Steps
  • Input
  • Map
  • Shuffle and Sort
  • Reduce
  • Output


Input

• MapReduce processes “records”
  • In our case, lines of text in an Apache log

• Records are presented as Key/Value pairs
  • InputFormat determines Input Keys/Values

• TextInputFormat is the default
  • Key is numeric equivalent of line number
  • Value is content of the line

• Given a 64MB block, the mapper will get however many lines are contained in the block
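A Python sketch of how a block of text turns into key/value records. As a later slide admits, the key is really the byte offset of the line rather than a line number, and that is what this sketch produces. It mimics the behavior described for TextInputFormat; it is not the Hadoop class itself, and the function name is made up.

```python
def text_records(block: bytes):
    """Yield (byte offset of line, line content) for each line in the block."""
    offset = 0
    for raw_line in block.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)  # the next line starts where this one ended


sample = b"GET /index.html\nGET /about.html\n"
for key, value in text_records(sample):
    print(key, value)
```

Each map task calls the map function once per record produced this way.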


Map

• For each K/V processed, the Mapper can emit 0 or more intermediate Keys and Values
• The code you write determines these
• In our case, if the log entry came from the 50 cities we are analyzing, then we emit; if not, we ignore


Intermediate Data.. what key ?......

Key points... Identical keys will go to the same Reducer, so a key is chosen based on what type of aggregation you intend to do.


Intermediate Data.. what key ?

• In our case we want to aggregate by City, and probably do some calculation based on time.
  • Key of City might be good
  • Key of City:timestamp of record might be better
  • Value will be the line from the log

• So input ~linenumber* as Key and line as value; emit City:timestamp as Key and line as value

*not really linenumber, byte offset of line.. close enough
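A Python sketch of the map step for the running example. Two loud assumptions: a real Apache log line does not contain a city, so this sketch pretends each line has already been reduced to the hypothetical tab-separated form `timestamp<TAB>city<TAB>request`, and the city set below is a made-up stand-in for the 50 cities.

```python
# Hypothetical stand-in for the 50 cities we care about.
CITIES_OF_INTEREST = {"New York", "Chicago", "Houston"}


def map_record(key, line):
    """Emit (City:timestamp, line) for lines from a city of interest, else nothing.

    key is the byte offset supplied by the input format; the mapper ignores it.
    """
    parts = line.split("\t", 2)
    if len(parts) != 3:
        return  # malformed line: emit nothing
    timestamp, city, _request = parts
    if city in CITIES_OF_INTEREST:
        yield f"{city}:{timestamp}", line


line = "2011-03-01T10:00:01\tChicago\tGET /index.html"
print(list(map_record(0, line)))
```

One input record can produce zero output pairs, which is exactly how the filtering happens.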


Shuffle and Sort

• For each K/V processed, the Mapper can emit 0 or more intermediate Keys and Values
• The code you write determines the intermediate Keys and Values. In our case, if the log entry came from the 50 cities we are analyzing, then we emit; if not, we ignore.

• Keys will be City:time, values will be line from log
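What happens between map and reduce can be sketched in Python on a single machine: sort the intermediate pairs by key, then group the values that share a key. The real framework does this across the network and on disk; the function name and sample data here are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(intermediate_pairs):
    """Sort (key, value) pairs by key and group the values of identical keys."""
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


pairs = [
    ("Chicago:10:05", "line C"),
    ("Boston:09:30", "line A"),
    ("Chicago:10:05", "line D"),
    ("Boston:11:00", "line B"),
]
for key, values in shuffle_and_sort(pairs):
    print(key, values)
```

Because keys are City:time strings, sorting them puts each city's records together and in time order.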


Reduce

• The contract for the Reduce is
  • All like keys go to same reducer
  • Keys arrive in order
  • Reducer gets key and list of values


Reduce: our example

• In our example the values for a particular city will end up in time order at the same reducer

• The Reducer will have a reasonably sized chunk of work to do, yet the work will be distributed across machines

• Each Reducer getting a set of Cities.*

* Without going into details, the City:timestamp Key takes a little optimization to manage to partition only by City.
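The footnote's "little optimization" amounts to choosing the reducer from the city part of the key only, so that every timestamp for one city lands on the same reducer. A Python sketch of that idea follows; the function name is made up, and `zlib.crc32` is used simply as a hash that is stable between runs, not because Hadoop uses it.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer from the City part of a City:timestamp key."""
    city = key.split(":", 1)[0]  # drop everything after the first colon
    return zlib.crc32(city.encode("utf-8")) % num_reducers


morning = partition("Chicago:2011-03-01T08:00:00", 10)
evening = partition("Chicago:2011-03-01T20:15:42", 10)
assert morning == evening  # same city, same reducer, whatever the timestamp
```

Partitioning on the full key instead would scatter one city's records across reducers and break the aggregation.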


Output

• For each K/V processed, the Reducer can emit 0 or more final Keys and Values

• These are written into HDFS
  • Format specified per job


Focus on the Task

• TaskTracker, when given a task, launches a child process
  • Information from the running process is channeled up through to the JobTracker for monitoring purposes

• Any output from a MapTask is collected locally for later secondary processing by a Reduce task if there is a Reduce phase

• When finished the child process is allowed to die


Getting Started

Cloudera provides a Virtual Machine with Hadoop and the assorted tools installed, ready to go. https://ccp.cloudera.com/display/SUPPORT/Downloads


The Hadoop Ecosystem



Hadoop Ecosystem

• HBase
• Hive
• Pig
• Streaming
• Sqoop

In the next few slides we will get a brief introduction to some of the tools in the Hadoop related toolbox


HBase

• Column Oriented
• Scales
• High throughput
• Distributed
• No enforced schema, up to application
• Access methods Get, Put, and Scan

HBase is a layer on top of HDFS (not MapReduce) that allows for fast random access with caching to a distributed sorted map.
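To make "sorted map with Get, Put, and Scan" concrete, here is a single-machine Python sketch. It is not the HBase client API: the class and method signatures are invented for illustration, and nothing here is distributed or cached.

```python
import bisect


class SortedMapSketch:
    """Rows kept sorted by row key; each row is a free-form dict of columns."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value  # no schema: any column name is fine

    def get(self, row_key):
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= row key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedMapSketch()
table.put("chicago:1002", "hits", 7)
table.put("boston:1001", "hits", 3)
table.put("chicago:1001", "hits", 5)
print(table.get("boston:1001"))
print(table.scan("chicago:", "chicago;"))  # every row whose key starts with "chicago:"
```

Keeping the keys sorted is what makes a range scan cheap, which is why row key design matters so much.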


HBase

New book on HBase is in the process of being released. Check O’Reilly for release date.


Hive

• Not an RDBMS
• More than enough SELECT functionality to be extremely useful. (Who needs all that transactional stuff anyhow?)

MapReduce is written in Java; the people demand SQL. Hive lets users have their SQL and translates it into MapReduce.


Pig

Example
-- load two tab-delimited inputs, each with an integer id column
data = LOAD 'logs.txt' AS (id:int, name:chararray);
data2 = LOAD 'stuff.txt' AS (id:int, data:chararray);
-- join the two relations on id and store the result
data3 = JOIN data BY id, data2 BY id;
STORE data3 INTO 'out';

Pig provides a language to express MapReduce in a Declarative (?) language. You state what you want to happen, Pig makes it happen.


Streaming

How does it work?
Streaming passes the split as STDIN to your mapper code, gathers the output, shuffles and sorts it, then sends it to your Reducer code.

Streaming allows you to write your MapReduce logic in any available language.
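A Python sketch of what a Streaming mapper and reducer might look like for the city example. Assumptions: input lines use the hypothetical `timestamp<TAB>city<TAB>request` layout, and key and value are separated by a tab on output. The functions take any iterable of lines and any writable stream so they can be tried without a cluster; in a real job each would live in its own script and be called with `sys.stdin` and `sys.stdout`.

```python
import io
from itertools import groupby


def run_mapper(lines, out):
    """Read raw lines, write one 'city<TAB>1' line per hit."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            out.write(f"{fields[1]}\t1\n")


def run_reducer(sorted_lines, out):
    """Read 'city<TAB>count' lines already sorted by city, write totals per city."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for city, group in groupby(pairs, key=lambda kv: kv[0]):
        out.write(f"{city}\t{sum(int(count) for _, count in group)}\n")


# Local stand-in for the framework: map, sort, then reduce.
log = ["t1\tChicago\t/a\n", "t2\tBoston\t/b\n", "t3\tChicago\t/c\n"]
mapped = io.StringIO()
run_mapper(log, mapped)
reduced = io.StringIO()
run_reducer(sorted(mapped.getvalue().splitlines(keepends=True)), reduced)
print(reduced.getvalue())
```

The reducer has to detect key changes itself, since Streaming hands it a flat sorted stream of lines rather than a key with a list of values.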


Sqoop

Oracle, MySQL, Postgres on one side; HDFS, Hive or Pig formatted files in Hadoop on the other.

Sqoop is a prebuilt toolkit to aid in the work of exporting data from a Relational Database into Hadoop.

Questions?

(Thank you for your time)


Appendix A: Resources

• Cloudera
  • http://www.cloudera.com/blog
  • http://wiki.cloudera.com/
  • http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/

• Hadoop
  • http://hadoop.apache.org/

• HBase
  • http://hbase.apache.org/

• Hive
  • http://hive.apache.org/

• Pig
  • http://pig.apache.org/

• Sqoop
  • https://github.com/cloudera/sqoop/wiki/apache.org
