Practical Hadoop using Pig


Description

So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed. Thursday, May 8th, 2:00pm-2:50pm.

Transcript of Practical Hadoop using Pig

Page 1: Practical Hadoop using Pig

Practical Hadoop with Pig

Dave Wellman

#openwest @dwellman

Page 2: Practical Hadoop using Pig

How does it all work?

HDFS

Hadoop Shell

MR Data Structures

Pig Commands

Pig Example

Page 3: Practical Hadoop using Pig

HDFS

Page 4: Practical Hadoop using Pig

HDFS has 3 main actors

Page 5: Practical Hadoop using Pig

The Name Node

The Name Node is “The Conductor”.

It directs the performance of the cluster.

Page 6: Practical Hadoop using Pig

The Data Nodes:

A Data Node stores blocks of data.

Clusters can contain thousands of Data Nodes.

*Yahoo has a 40,000 node cluster.

Page 7: Practical Hadoop using Pig

The Client

The client is a window to the cluster.

Page 8: Practical Hadoop using Pig

The Name Node

Pages 9-13: Practical Hadoop using Pig

The heart of the System.

Maintains a virtual File Directory.

Tracks all the nodes.

Listens for “heartbeats” and “Block Reports” (more on this later).

If the NameNode is down, the cluster is offline.

Page 14: Practical Hadoop using Pig

Storing Data

Page 15: Practical Hadoop using Pig

The Data Nodes

Pages 16-21: Practical Hadoop using Pig

Add a Data Node:

The Data Node says “Hello” to the Name Node.

The Name Node offers the Data Node a handshake with version requirements.

The Data Node replies “Okay” to the Name Node, or shuts down.

The Name Node hands the Data Node a NodeId that it remembers.

The Data Node is now part of the cluster, and it checks in with the Name Node every 3 seconds.

Pages 22-28: Practical Hadoop using Pig

Data Node Heartbeat:

The “check-in” is a simple HTTP Request/Response.

This “check-in” is a very important communication protocol that guarantees the health of the cluster.

Block Reports say “here is the data I have, and it is okay.”

The Name Node controls the Data Nodes by issuing orders when they check in and report their status:

Replicate Data, Delete Data, Verify Data.

The same process applies to all nodes within a cluster.
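You can watch this machinery from the client with the cluster report (a sketch; on Hadoop 1.x the command is hadoop dfsadmin -report, the host names and numbers below are made up, and the exact output varies by version):

> hadoop dfsadmin -report
Datanodes available: 3 (3 total, 0 dead)
Name: 10.0.0.12:50010
Last contact: Thu May 08 14:02:41 MDT 2014
...

The “Last contact” timestamp is the heartbeat described above.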

Page 29: Practical Hadoop using Pig

Writing Data

Pages 30-35: Practical Hadoop using Pig

(Diagram: a file split into blocks A64, B64, and C28, streamed to Data Nodes and then replicated.)

The client “tells” the NameNode the virtual directory location for the file.

The client breaks the file into 64MB “blocks”.

The client “asks” the NameNode where the blocks go.

The client “streams” the blocks, in parallel, to the DataNodes.

The DataNodes tell the NameNode they have the data via the block report.

The NameNode tells the DataNodes where to replicate the blocks.
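A single shell command kicks off this whole dance (a sketch; the file name is made up):

> hadoop fs -put weblog.tsv /user/hadoop/weblog.tsv

A 156MB file would be cut into blocks of 64MB, 64MB, and 28MB, just like A64, B64, and C28 in the diagram; the block size comes from the dfs.block.size setting (64MB by default in Hadoop 1.x).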

Page 36: Practical Hadoop using Pig

Reading Data

Pages 37-41: Practical Hadoop using Pig

(Diagram: blocks A64, B64, and C28 read back from the Data Nodes in sequence.)

The client tells the NameNode it would like to read a file.

The NameNode replies with the list of blocks and the nodes the blocks are on.

The client requests the first block from a DataNode.

The client compares the checksum of the block against the manifest from the NameNode.

The client moves on to the next block in the sequence until the file has been read.
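You can inspect the block-to-node manifest the NameNode hands out with fsck (a sketch; the path is made up and the output is abbreviated):

> hadoop fsck /user/hadoop/weblog.tsv -files -blocks -locations

This prints each block of the file along with the DataNodes holding its replicas.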

Page 42: Practical Hadoop using Pig

Failure Recovery

Pages 43-46: Practical Hadoop using Pig

(Diagram: block A64 on the failed node is re-replicated to a healthy Data Node.)

A Data Node fails to “check-in”.

After 10 minutes the Name Node gives up on that Data Node.

When another node that has blocks originally assigned to the lost node checks in, the Name Node sends a block replication command.

The Data Node replicates that block of data (just like a write).

Page 47: Practical Hadoop using Pig

Interacting with Hadoop: HDFS Shell Commands

Page 48: Practical Hadoop using Pig

HDFS Shell Commands.

> hadoop fs -ls <args>

Same as the Unix or OS X ls command.

/user/hadoop/file1
/user/hadoop/file2
...

Page 49: Practical Hadoop using Pig

HDFS Shell Commands.

> hadoop fs -mkdir <path>

Creates a directory in HDFS at the given path.

Page 50: Practical Hadoop using Pig

HDFS Shell Commands.

> hadoop fs -copyFromLocal <localsrc> URI

Copy a file from your client to HDFS.

Similar to the put command, except that the source is restricted to a local file reference.

Page 51: Practical Hadoop using Pig

HDFS Shell Commands.

> hadoop fs -cat <path>

Copies source paths to stdout.

Page 52: Practical Hadoop using Pig

HDFS Shell Commands.

> hadoop fs -copyToLocal URI <localdst>

Copy a file from HDFS to your client.

Similar to the get command, except that the destination is restricted to a local file reference.

Page 53: Practical Hadoop using Pig

HDFS Shell Commands.

cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
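Putting a few of these together, a typical first session looks something like this (a sketch; the paths and file names are made up):

> hadoop fs -mkdir /user/hadoop/logs
> hadoop fs -copyFromLocal weblog.tsv /user/hadoop/logs/weblog.tsv
> hadoop fs -ls /user/hadoop/logs
> hadoop fs -cat /user/hadoop/logs/weblog.tsv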

Page 54: Practical Hadoop using Pig

Map Reduce Data Structures: Basics, Tuples & Bags

Page 55: Practical Hadoop using Pig

Basic Data Types:

Strings, Integers, Doubles, Longs, Bytes, Booleans, etc.

Advanced Data Types:

Tuples and Bags

Page 56: Practical Hadoop using Pig

Tuples are JSON-like and simple.

raw_data: {
    date_time: bytearray,
    seconds: bytearray
}

Page 57: Practical Hadoop using Pig

Bags hold Tuples and Bags.

element: {
    date_time: bytearray,
    seconds: bytearray,
    group: chararray,
    ordered_list: {
        date: chararray,
        hour: chararray,
        score: long
    }
}
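Nested structures like this usually come from grouping. A sketch of how a bag appears (the relation and field names here are hypothetical):

scores = LOAD 'scores.tsv' USING PigStorage('\t') AS (date: chararray, hour: chararray, score: long);
grouped = GROUP scores BY date;
-- describe grouped now shows a bag of the original tuples:
-- grouped: {group: chararray, scores: {(date: chararray, hour: chararray, score: long)}}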

Page 58: Practical Hadoop using Pig

Expert Advice:

Always know your data structures. They are the foundation for all Map Reduce operations.

Complex (deep) data structures will kill -9 performance.

Keep them simple!

Page 59: Practical Hadoop using Pig

Processing DataInteracting with Pig using Grunt

Page 60: Practical Hadoop using Pig

GRUNT

Grunt is a command line interface used to debug Pig jobs, similar to Ruby's IRB or the Groovy CLI.

Grunt is your best weapon against bad pigs.

> pig -x local
grunt> |
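A first session might look like this (a sketch; the sample file name is made up, and untyped fields default to bytearray):

> pig -x local
grunt> raw_data = LOAD 'sample.tsv' USING PigStorage('\t') AS (date_time, seconds);
grunt> describe raw_data;
raw_data: {date_time: bytearray, seconds: bytearray}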

Page 61: Practical Hadoop using Pig

GRUNT

grunt> describe Element

Describe will display the data structure of an Element.

grunt> dump Element

Dump will display the data represented by an Element.

Page 62: Practical Hadoop using Pig

GRUNT

grunt> describe raw_data

Produces the output:

raw_data: { date_time: bytearray, items: bytearray }

Or in a more human-readable form:

raw_data: {
    date_time: bytearray,
    items: bytearray
}

Page 63: Practical Hadoop using Pig

GRUNT

grunt> dump raw_data

You can dump terabytes of data to your screen, so be careful.

(05/10/2011 20:30:00.0,0)
(05/10/2011 20:45:00.0,0)
(05/10/2011 21:00:00.0,0)
(05/10/2011 21:15:00.0,0)
...

Page 64: Practical Hadoop using Pig

Pig Programs: Map Reduce Made Simple

Page 65: Practical Hadoop using Pig

Most Pig commands are assignments:

Element = Operation;

• The element names the collection of records that exist out in the cluster.
• It's not a traditional programming variable.
• It describes the data from the operation.
• It does not change.

Page 66: Practical Hadoop using Pig

The SET command

Used to set a Hadoop job variable, like the name of your Pig job.

SET job.name 'Day over Day - [$input]';
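The $input above is a parameter substituted when the job is launched (a sketch; the script and path names are made up):

> pig -param input=/user/hadoop/logs/weblog.tsv day_over_day.pig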

Page 67: Practical Hadoop using Pig

The REGISTER and DEFINE commands

-- Setup UDF jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;

DEFINE BUCKET_FORMAT_DATE com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');

Page 68: Practical Hadoop using Pig

The LOAD USING command

-- load in the data from HDFS

raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);
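Given a tab-separated input line like 05/10/2011 20:30:00.0<TAB>17 (made-up data), raw_data would then contain the tuple:

(05/10/2011 20:30:00.0,17)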

Page 69: Practical Hadoop using Pig

The FILTER BY command

Selects tuples from a relation based on some condition.

-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');

Page 70: Practical Hadoop using Pig

The GROUP BY command

Groups the data in one or multiple relations.

daily_stats = GROUP broadcast_week BY (date, hour);
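Grouping produces the group/bag shape from the data structures section. A sketch of the resulting schema (assuming broadcast_week has fields date, hour, and items):

grunt> describe daily_stats
daily_stats: {group: (date: chararray, hour: chararray), broadcast_week: {(date: chararray, hour: chararray, items: long)}}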

Page 71: Practical Hadoop using Pig

The FOREACH command

Generates data transformations based on columns of

data.

bucket_list = FOREACH raw_data GENERATE
    FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
    MINUTE_BUCKET(date_time) AS hour,
    MAX_ITEMS(items) AS items;

*DATE_FORMAT_DATE is a user defined function, an advanced topic we'll come to in a minute.

Page 72: Practical Hadoop using Pig

The GENERATE command

Use the FOREACH GENERATE operation to work with

columns of data.

bucket_list = FOREACH raw_data GENERATE
    FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
    MINUTE_BUCKET(date_time) AS hour,
    MAX_ITEMS(items) AS items;

Page 73: Practical Hadoop using Pig

The FLATTEN command

FLATTEN substitutes the fields of a tuple in place of the tuple.

traffic_stats = FOREACH daily_stats GENERATE
    FLATTEN(group),
    COUNT(broadcast_week) AS cnt,
    SUM(broadcast_week.items) AS total;
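Here FLATTEN(group) splits the (date, hour) group tuple back into two plain columns, so traffic_stats ends up with a flat schema (a sketch, given the grouping above):

-- traffic_stats: {date: chararray, hour: chararray, cnt: long, total: long}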

Page 74: Practical Hadoop using Pig

The STORE INTO USING command

A store function determines how data is stored after a Pig job.

-- All done, now store it
STORE final_results INTO '$output' USING PigStorage();
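PigStorage writes tab-separated part files under the output directory, so you can read the results straight back with the shell commands from earlier (a sketch; the output path is made up):

> hadoop fs -cat /user/hadoop/day_over_day/part-*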

Page 75: Practical Hadoop using Pig

Demo Time!

“Because it’s all a big lie until someone demos the code.” - Genghis Khan

Page 76: Practical Hadoop using Pig

Thank You.

- Genghis Khan