Apache Pig on Amazon AWS - Swine Not?


A basic introduction to Apache Pig, focused on understanding what it is as well as quickly getting started using it through Amazon's Elastic MapReduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729


Apache Pig on Amazon AWS

Swine Not?

What is Apache Pig?

Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.

(disturbing Pig logo not shown)

Pig is a tool that...

● creates complex jobs that efficiently process large volumes of data

● supports many relational features, making it easy to join, group, and aggregate data (see the sketch after this list)

● performs ETL tasks quickly, on many servers simultaneously
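As a taste of those relational features, here is a minimal, hypothetical sketch (file paths and fields are invented for illustration) that joins two datasets, then groups and aggregates:

-- hypothetical: join page hits to page metadata, then count views per title
hits    = LOAD 'hits.tsv'  USING PigStorage('\t') AS (url:chararray, ip:chararray);
pages   = LOAD 'pages.tsv' USING PigStorage('\t') AS (url:chararray, title:chararray);
joined  = JOIN hits BY url, pages BY url;
grouped = GROUP joined BY pages::title;
counts  = FOREACH grouped GENERATE group AS title, COUNT(joined) AS views;
DUMP counts;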

What is Pig Latin?

It is a high-level data transformation language that:

● allows you to concentrate on the data transformations you require

Rather than:

● forcing you to be concerned with individual map and reduce functions
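To make that concrete: the classic word count, which takes pages of Java as a hand-written MapReduce job, is a handful of Pig Latin statements. A minimal sketch (the input path is a placeholder):

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
by_word = GROUP words BY word;
counts  = FOREACH by_word GENERATE group AS word, COUNT(words) AS total;
DUMP counts;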

Walkthrough - Create a Job Flow

* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
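The tutorial creates the job flow by clicking through the EMR console. For reference, a Pig cluster can also be launched from the command line; a rough sketch using today's AWS CLI (the deck predates this tooling, and the name, release label, instance type, and count are placeholders):

$ aws emr create-cluster \
    --name "pig-walkthrough" \
    --applications Name=Pig \
    --release-label emr-5.36.0 \
    --instance-type m4.large \
    --instance-count 3 \
    --ec2-attributes KeyName=crocs \
    --use-default-roles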

And now we wait...

SSH into master instance:

$ ssh -i ~/keys/crocs.pem -l hadoop \
    ec2-54-215-107-197.us-west-1.compute.amazonaws.com

Type "pig" to enter the grunt shell

$ pig
grunt> _

It's a freakin' shell!

grunt> pwd
hdfs://10.174.115.214:9000/

You can enter the HDFS file system:

grunt> cd hdfs:///

grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>

Even enter an S3 bucket:

grunt> cd s3://elasticmapreduce/samples/pig-apache/input/

grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171

Load Piggybank - an open-source library of user-contributed functions

grunt> register file:/home/hadoop/lib/pig/piggybank.jar

DEFINE the EXTRACT alias from piggybank:

grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;

LOAD

Use TextLoader (a built-in Pig load function) to load each line of the source file:

grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
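TextLoader hands each line over whole. For delimited data you would normally let the loader split fields instead; a hypothetical sketch using Pig's default PigStorage loader (path and schema invented for illustration):

grunt> TSV_DATA = LOAD 's3://my-bucket/data.tsv' USING PigStorage('\t') AS (id:int, name:chararray);

These access logs are not cleanly delimited, though, so each line is loaded whole and split later with a regular expression.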

ILLUSTRATE

Shows, step by step, how Pig would transform a small sample of the data

grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------

Now let's:

● split each line into fields
● store everything in a bag

grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
         FLATTEN(
           EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
         ) as (
           remoteAddr: chararray, remoteLogname: chararray, user: chararray,
           time: chararray, request: chararray, status: int,
           bytes_string: chararray, referrer: chararray, browser: chararray
         );
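For reference, the capture groups (shown with Pig's doubled-backslash escaping) line up with the declared fields like so; the pattern is essentially the Apache combined log format:

(\\S+)                           -> remoteAddr     (client IP)
(\\S+)                           -> remoteLogname  (identd name, usually "-")
(\\S+)                           -> user           (authenticated user, usually "-")
\\[([\\w:/]+\\s[+\\-]\\d{4})\\]  -> time           (e.g. 21/Jul/2009:02:29:56 -0700)
"(.+?)"                          -> request        (method, path, and protocol)
(\\S+)                           -> status         (HTTP status code)
(\\S+)                           -> bytes_string   (response size in bytes)
"([^"]*)"                        -> referrer
"([^"]*)"                        -> browser        (user-agent string)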

ILLUSTRATE an example of our work:

grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray    | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray          | - |
| time:chararray          | 20/Jul/2009:20:30:55 -0700 |
| request:chararray       | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int              | 200 |
| bytes_string:chararray  | 2969 |
| referrer:chararray      | - |
| browser:chararray       | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |

Create a bag containing tuples with just the referrer element (limit 10 items):

grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;

Output the contents of the bag:

grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...

More log output before we get our results (cleaned up here)

...

Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"

Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"

Counters:
Total records written : 10

...

Voila! Our exciting results:

(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)

First 10 referrers (the dashes represent no referrer)

Now let's filter, keeping only referrals from bing.com*

grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)

* We all use Bing, am I right?
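DUMP only prints to the console. To actually keep the filtered results, a STORE statement writes the bag out instead; a minimal sketch (the output path is a placeholder):

grunt> STORE FILTERED INTO 's3://my-bucket/bing-referrers' USING PigStorage();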

Don't forget to terminate your Job Flow

Amazon will charge you even if it's idle!
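From the console that means hitting Terminate; with today's AWS CLI the equivalent would be something like this (the cluster id is a placeholder):

$ aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX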