Introduction to Hadoop - O'Reilly Media
Copyright 2011 Cloudera Inc. All rights reserved
Today’s speaker – Tom Hanlon
• [email protected]• Senior Instructor at Cloudera
Agenda
• What is Hadoop?
• Some Internals
• The ecosystem
Hadoop Distributed File System (HDFS)
MapReduce
Apache Hadoop
• Consolidates Mixed Data
  • Move complex and relational data into a single repository
• Stores Inexpensively
  • Keep raw data always available
  • Use industry standard hardware
• Processes at the Source
  • Eliminate ETL bottlenecks
  • Mine data first, govern later
Open Source Distributed Storage and Processing Engine
What is Hadoop
• Hadoop is a system of distributed, fault-tolerant, scalable storage and processing
• Modeled after papers published by Google describing their architecture
• Open Source, Apache Licensed
• Java
What is Hadoop
• More info:
  • http://hadoop.apache.org
Who uses Hadoop ?
• Twitter, Facebook, Yahoo, StumbleUpon and others
Should you use Hadoop ?
• Is your data too big for a database?
• Is the main bottleneck in processing your data the time it takes to read the data from disk?
• We can store more data than we can process
• The ratio of IOPS/TB is going backwards
• Disks get a little bit faster, but they get a lot bigger
Limitations of Hadoop
• All systems have limitations, we just get used to them and ignore them over time. Think about how hard it was to squeeze your data into a relational model back in the day.
Limitations of Hadoop
• Batch Processing
  • Hadoop is currently more or less about batch processing of huge amounts of data, and it excels at that.
• Hadoop is NOT about random access *
• Hadoop is NOT about low latency
• Hadoop is NOT about real-time lookup
• Hadoop is NOT about caching
• Hadoop is NOT about SQL
* HBase is for quick random access with caching
Now look at some internals
• What is Hadoop?
• Some Internals
• The ecosystem
Hadoop Internals
• Hadoop at the core consists of HDFS and MapReduce
• HDFS, the Hadoop Distributed File System
  • Responsible for storing files and handling node failure
• MapReduce
  • A distributed system for processing data stored in HDFS
HDFS
• Presents the aggregate storage of the disks in your datanodes as more or less one big file system
HDFS Fundamentals
• Data is replicated
  • Default is 3 times
• Files are immutable
  • No updates, no appends
• Disk access is optimized for sequential reads
  • Store data in large "blocks", 64MB default
HDFS Fundamentals continued
• Avoid corruption
  • "Blocks" are verified with a checksum when stored and read
• High throughput
  • Avoid contention, have the system share as little information and resources as possible
• Fault tolerant
  • Loss of a disk, or machine, or rack of machines should not lead to data loss
Focus on the NameNode
• Only one per cluster, the "master node"
• Stores meta information of the "filesystem"
  • Filename, permissions, directories, blocks
• Kept in RAM for fast access
• Persisted to disk
Focus on the DataNodes
• Many per cluster, "slave nodes"
• Stores individual file "blocks" but knows nothing about them, except the block name.
• The NameNode is the brains of the outfit
• Reports regularly to the NameNode
  • "Hey, I am alive, and I have these blocks"
NameNode
DataNodes
Client
First, the Client connects to the NameNode for a permission check and to record metadata
NameNode
DataNodes
Client
The NameNode creates the metadata and confirms permission to write. It returns the DataNode that the Client can stream to.
NameNode
DataNodes
Client
The Client begins streaming the file to a DataNode
The DataNode receiving the data streams it to the second and third DataNodes
HDFS blocks revisited
• If a file is larger than 64MB then it will consist of multiple blocks.
  • 70MB file = 1 64MB block, 1 6MB block
• Blocks will be replicated 3 times
• We can read the file by accessing any of the 3 copies of the blocks.
• Immutable blocks mean any copy is as good as any other
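The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not HDFS code; 64MB and 3 are simply the defaults the slides describe.

```python
# Sketch: how HDFS splits a file into fixed-size blocks (64 MB default)
# and how many physical block copies exist with 3x replication.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block a file of `file_size` bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

MB = 1024 * 1024
blocks = split_into_blocks(70 * MB)
print(len(blocks))                   # 2 blocks: one 64 MB, one 6 MB
print([b // MB for b in blocks])     # [64, 6]
print(len(blocks) * REPLICATION)     # 6 physical block copies on the cluster
```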
MapReduce
• HDFS handles the distributed filesystem layer
• MapReduce is how we process the data
• MapReduce daemons
  • JobTracker
  • TaskTracker
• Goals
  • Distribute the reading and processing of data
  • Localize the processing when possible
  • Share as little data as possible while processing
Focus on the JobTracker
• One per cluster, the "master node"
• Takes jobs from clients
• Splits work into "tasks"
• Distributes "tasks" to TaskTrackers
• Monitors progress, deals with failures
Focus on the TaskTrackers
• Many per cluster, "slave nodes"
• Does the actual work, executes the code for the job
• Talks regularly with the JobTracker
• Launches a child process when given a task
• Reports progress of the running "task" back to the JobTracker
Anatomy of a MapReduce Job
• Client submits a job
  • I want to process all the Apache log files, compare hits to our website from the top 50 cities in the US, and compare with census data on income. *
* We will assume the census data is already in the Data Warehouse, so the Hadoop piece is just the weblog report for the 50 cities.
Anatomy of a MapReduce Job continued...
• JobTracker receives the Job
  • Queries the NameNode for the number of blocks in the file
• The Job is split into tasks
  • One map task for each block
  • As many reduce tasks as specified in the Job
Anatomy of a MapReduce Job continued...
• The TaskTracker checks in regularly with the JobTracker
  • "Is there any work for me?"
• If the JobTracker has a MapTask for which the TaskTracker has a local block of the file being processed, then the TaskTracker will be given the "task"
  • In this case the TaskTracker will search the log for lines with activity from any of the 50 cities we care about
JobTracker defines a job as a collection of Tasks
1 file, 100 blocks = 100 Map Tasks
10 reduce tasks specified = 10 Reduce Tasks
Waiting to assign 110 tasks for jobid 12422424_01
When all Tasks for a particular Job complete, the job is complete.
110 tasks completed successfully, Update web interface..
Job 12422424_01 is done
Focus on the Workflow
• Five Steps
  • Input
  • Map
  • Shuffle and Sort
  • Reduce
  • Output
Input
• MapReduce processes "records"
  • In our case, lines of text in an Apache log
• Records are presented as Key/Value pairs
  • The InputFormat determines the input Keys/Values
• TextInputFormat is the default
  • Key is the numeric equivalent of the line number
  • Value is the content of the line
• Given a 64MB block, the mapper will get however many lines are contained in the block
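What TextInputFormat hands the mapper can be sketched as a generator: for each line, the key is the byte offset of the line within the file (the "numeric equivalent of the line number") and the value is the line text. A minimal sketch, not Hadoop's actual InputFormat API:

```python
# Sketch of TextInputFormat: yield (byte_offset, line) pairs for each line.
def text_input_format(data):
    """Yield (byte_offset, line_without_newline), like TextInputFormat does."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)  # next key is the byte offset of the next line

log = "1.2.3.4 GET /index\n5.6.7.8 GET /about\n"
for key, value in text_input_format(log):
    print(key, value)
# 0 1.2.3.4 GET /index
# 19 5.6.7.8 GET /about
```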
Map
• For each K/V processed, the Mapper can emit 0 or more intermediate Keys and Values
  • The code you write determines these
  • In our case, if the log entry came from one of the 50 cities we are analyzing, then we emit; if not, we ignore it
Intermediate Data... what key?
Key points... Identical keys will go to the same Reducer, so a key is chosen based on what type of aggregation you intend to do.
Intermediate Data... what key?
• In our case we want to aggregate by City, and probably do some calculation based on time.
  • A Key of City might be good
  • A Key of City:timestamp of the record might be better
  • The Value will be the line from the log
• So input: ~linenumber* as Key, the line as Value; emit City:timestamp as Key, the line as Value
* Not really the line number; the byte offset of the line... close enough
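The mapper for this example can be sketched in plain Python. Everything here is hypothetical scaffolding: the comma-separated log format, the field layout, and the two-city stand-in set all stand in for the real 50-city list and real Apache log parsing. The point is the shape: zero pairs out for an uninteresting record, one City:timestamp-keyed pair otherwise.

```python
# Sketch of the example's map step: filter on city, emit City:timestamp keys.
CITIES = {"Chicago", "Boston"}  # hypothetical stand-in for the 50 cities

def mapper(offset, line):
    # Hypothetical record layout: "city,timestamp,rest-of-line".
    city, timestamp, rest = line.split(",", 2)
    if city in CITIES:
        yield f"{city}:{timestamp}", line   # emit: key = City:timestamp, value = whole line
    # else: emit nothing; the record is simply dropped

record = "Chicago,2011-06-01T10:00,GET /index"
print(list(mapper(0, record)))
# [('Chicago:2011-06-01T10:00', 'Chicago,2011-06-01T10:00,GET /index')]
print(list(mapper(19, "Miami,2011-06-01T10:01,GET /about")))
# [] -- not one of our cities, nothing emitted
```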
Shuffle and Sort
• Between the Map and Reduce phases, the framework gathers the intermediate Keys and Values from all the Mappers, groups identical keys together, and sorts by key before handing them to the Reducers
• Keys will be City:time, values will be the line from the log
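A minimal in-memory sketch of what shuffle and sort accomplishes (the real implementation is distributed and spills to disk; this only shows the resulting contract):

```python
# Sketch of shuffle-and-sort: group all values under their key, keys in order.
from collections import defaultdict

def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)   # identical keys end up together
    return sorted(groups.items())   # keys arrive at the reducers in order

intermediate = [("Boston:t2", "line2"), ("Chicago:t1", "line1"),
                ("Boston:t1", "line0")]
print(shuffle_and_sort(intermediate))
# [('Boston:t1', ['line0']), ('Boston:t2', ['line2']), ('Chicago:t1', ['line1'])]
```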
Reduce
• The contract for the Reduce is
  • All like keys go to the same reducer
  • Keys arrive in order
  • The Reducer gets a key and a list of values
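The contract above means a reducer is just a function called once per key with that key's full list of values. A sketch (counting hits per key is an illustrative choice; any aggregation fits this shape):

```python
# Sketch of a reducer: one call per key, with all of that key's values.
def reducer(key, values):
    yield key, len(values)  # e.g. number of log lines for this key

grouped = [("Boston:t1", ["line0", "line3"]), ("Chicago:t1", ["line1"])]
results = [pair for key, values in grouped for pair in reducer(key, values)]
print(results)  # [('Boston:t1', 2), ('Chicago:t1', 1)]
```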
Reduce: our example
• In our example, the values for a particular city will end up in time order at the same reducer
• The Reducer will have a reasonably sized chunk of work to do, yet the work will be distributed across machines
• Each Reducer gets a set of cities.*
* Without going into details, the City:timestamp Key takes a little optimization to manage to partition only by City.
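The "little optimization" the footnote alludes to is a custom partitioner. A sketch of the idea only (Hadoop's real secondary-sort setup also involves grouping comparators, not shown here): hash just the City part of the City:timestamp key, so all of a city's records land on one reducer while the timestamp still drives the sort order.

```python
# Sketch of a city-only partitioner for "City:timestamp" keys.
def partition(key, num_reducers):
    city = key.split(":", 1)[0]      # ignore the timestamp part of the key
    return hash(city) % num_reducers # same city -> same reducer, always

keys = ["Boston:t1", "Boston:t9", "Chicago:t2"]
parts = [partition(k, 10) for k in keys]
print(parts[0] == parts[1])  # True: both Boston keys go to the same reducer
```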
Output
• For each K/V processed, the Reducer can emit 0 or more final Keys and Values
• These are written into HDFS
  • Format specified per job
Focus on the Task
• The TaskTracker, when given a task, launches a child process
• Information from the running process is channeled up through to the JobTracker for monitoring purposes
• Any output from a MapTask is collected locally for later secondary processing by a Reduce task, if there is a Reduce phase
• When finished, the child process is allowed to die
Getting Started
Cloudera provides a Virtual Machine with Hadoop and the assorted tools installed, ready to go. https://ccp.cloudera.com/display/SUPPORT/Downloads
The Hadoop Ecosystem
• What is Hadoop?
• Some Internals
• The ecosystem
Hadoop Ecosystem
• HBase
• Hive
• Pig
• Streaming
• Sqoop
In the next few slides we will get a brief introduction to some of the tools in the Hadoop-related toolbox
HBase
• Column oriented
• Scales
• High throughput
• Distributed
• No enforced schema, up to the application
• Access methods: Get, Put, and Scan
HBase is a layer on top of HDFS (not MapReduce) that allows for fast random access, with caching, to a distributed sorted map.
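The "distributed sorted map with Get, Put, and Scan" data model can be illustrated with a toy in-memory class. This is purely a model of the access pattern, not HBase's API: no distribution, no column families beyond a naming convention, and the "cf:a" column names are hypothetical.

```python
# Toy model of HBase's data model: a sorted map of rows, each row a dict of
# columns (no enforced schema), accessed only via get, put, and scan.
import bisect

class ToySortedMap:
    def __init__(self):
        self._rows = {}   # row key -> {column: value}
        self._keys = []   # row keys kept sorted so scan is a range read

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        return self._rows.get(row)

    def scan(self, start, stop):
        """Yield (row, columns) for rows in [start, stop), in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for k in self._keys[lo:hi]:
            yield k, self._rows[k]

t = ToySortedMap()
t.put("row2", "cf:a", "x")
t.put("row1", "cf:a", "y")
print(list(t.scan("row1", "row3")))  # rows come back in sorted key order
```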
HBase
A new book on HBase is in the process of being released. Check O'Reilly for the release date.
Hive
• Not an RDBMS
• More than enough SELECT functionality to be extremely useful. (Who needs all that transactional stuff anyhow?)
MapReduce is written in Java; the people demand SQL. Hive allows the users to have their SQL and translates it into MapReduce.
Pig
Example:
data = LOAD 'logs.txt' AS (id:int, name:chararray);
data2 = LOAD 'stuff.txt' AS (id:int, data:chararray);
data3 = JOIN data BY id, data2 BY id;
STORE data3 INTO 'out';
Pig provides a higher-level language for expressing MapReduce jobs: you state what you want to happen, and Pig makes it happen.
Streaming
How does it work? Streaming passes the split as STDIN to your mapper code, gathers the output, shuffles and sorts it, then sends it to your Reducer code.
Streaming allows you to write your MapReduce logic in any available language.
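A Streaming mapper and reducer really are just scripts that read lines on STDIN and write tab-separated key/value lines on STDOUT. The sketch below has that exact shape (a word-count, the classic example), written as functions over line iterators so the same logic can be exercised without a cluster; on a real cluster the iterators would be `sys.stdin`.

```python
# Sketch of Hadoop Streaming scripts: line-in, "key<TAB>value"-out.
def streaming_mapper(lines):
    for line in lines:               # Streaming feeds the input split on STDIN
        for word in line.split():
            yield f"{word}\t1"       # emit key <TAB> value on STDOUT

def streaming_reducer(lines):
    # The framework has already shuffled and sorted, so input is key-ordered;
    # sum consecutive runs of the same key.
    current, count = None, 0
    for line in lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = key, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

mapped = sorted(streaming_mapper(["b a", "a c"]))  # stand-in for shuffle/sort
print(list(streaming_reducer(mapped)))  # ['a\t2', 'b\t1', 'c\t1']
```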
Sqoop
Oracle, MySQL, or Postgres on one side; HDFS, Hive, or Pig formatted files in Hadoop on the other.
Sqoop is a prebuilt toolkit to aid in the work of exporting data from a relational database into Hadoop.
Appendix A: Resources
• Cloudera
  • http://www.cloudera.com/blog
  • http://wiki.cloudera.com/
  • http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
• Hadoop
  • http://hadoop.apache.org/
• HBase
  • http://hbase.apache.org/
• Hive
  • http://hive.apache.org/
• Pig
  • http://pig.apache.org/
• Sqoop
  • https://github.com/cloudera/sqoop/wiki/apache.org