Apache Pig on Amazon AWS - Swine Not?
Uploaded by drake-emko
Apache Pig on Amazon AWS
Swine Not?
What is Apache Pig?
Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
(Disturbing logo)
Pig is a tool that...
● creates complex jobs that efficiently process large volumes of data
● supports many relational features, making it easy to join, group, and aggregate data
● performs ETL tasks quickly, on many servers simultaneously
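As a sketch of what those relational features look like in practice (the datasets, file names, and fields below are made up for illustration; they are not part of the tutorial):

```pig
-- Hypothetical example: count orders per customer name.
customers = LOAD 'customers.tsv' AS (id:int, name:chararray);
orders    = LOAD 'orders.tsv'    AS (cust_id:int, amount:double);

joined  = JOIN customers BY id, orders BY cust_id;   -- relational join
grouped = GROUP joined BY name;                      -- group by customer
counts  = FOREACH grouped GENERATE                   -- aggregate per group
              group AS name, COUNT(joined) AS n_orders;
DUMP counts;
```

Each statement builds a new relation from the previous one; Pig only compiles the chain into MapReduce jobs when output is requested (DUMP or STORE).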
What is Pig Latin?
It is a high-level data transformation language that:
● allows you to concentrate on the data transformations you require
rather than one that:
● forces you to be concerned with individual map and reduce functions
Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
And now we wait...
SSH into the master instance:
$ ssh -i ~/keys/crocs.pem -l hadoop \
    ec2-54-215-107-197.us-west-1.compute.amazonaws.com
Type "pig" to enter the grunt shell:
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1 <r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2 <r 1> 8902171
Load Piggybank - an open-source library of user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from Piggybank:
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
LOAD
Use TextLoader (a built-in Pig load function) to load each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
ILLUSTRATE
Shows, step by step, how Pig would transform a small sample of data:
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------
Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
    ) as (
        remoteAddr:    chararray,
        remoteLogname: chararray,
        user:          chararray,
        time:          chararray,
        request:       chararray,
        status:        int,
        bytes_string:  chararray,
        referrer:      chararray,
        browser:       chararray
    );
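Pig strings double each backslash, so `\\S` reaches the regex engine as `\S`. The same expression can be sanity-checked outside Pig; a minimal Python sketch, using the sample line from the ILLUSTRATE output above:

```python
import re

# The same regex Pig passes to EXTRACT (Pig's '\\S' reaches the engine as \S).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Sample line from the ILLUSTRATE output above.
sample = ('65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
          '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
          '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')

m = LOG_PATTERN.match(sample)
(remoteAddr, remoteLogname, user, time_str, request,
 status, bytes_string, referrer, browser) = m.groups()
print(remoteAddr, status, request)
```

Note that here every group comes back as a string; in the Pig script it is the `as (...)` schema that casts `status` to int.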
ILLUSTRATE an example of our work:
grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray    | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray          | - |
| time:chararray          | 20/Jul/2009:20:30:55 -0700 |
| request:chararray       | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int              | 200 |
| bytes_string:chararray  | 2969 |
| referrer:chararray      | - |
| browser:chararray       | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |
Create a bag containing tuples with just the referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
More log output before we get our results (cleaned up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no referrer)
Now let's filter, keeping only referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
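Why `.*bing.*` and not just `bing`? Pig's `matches` operator follows Java's `String.matches` semantics: the pattern must cover the entire string, so the leading and trailing `.*` are required. A quick Python analogue using `re.fullmatch`:

```python
import re

referrer = "http://www.bing.com/search?q=login"

# Pig's `matches` follows Java's String.matches: the pattern must
# cover the ENTIRE string, hence the leading/trailing .* in the script.
whole  = re.fullmatch(r'.*bing.*', referrer) is not None
substr = re.fullmatch(r'bing', referrer) is not None

print(whole, substr)  # the bare pattern 'bing' does not match the full URL
```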
* We all use Bing, am I right?
Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!