Apache Pig on Amazon AWS - Swine Not?
Uploaded by drake-emko
Apache Pig on Amazon AWS
Swine Not?
What is Apache Pig?
Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
(Disturbing logo)
Pig is a tool that...
● creates complex jobs that efficiently process large volumes of data
● supports many relational features, making it easy to join, group, and aggregate data
● performs ETL tasks quickly, on many servers simultaneously
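As a sketch of what those relational features look like in practice (the datasets, file names, and fields below are made up for illustration; they are not part of the tutorial):

```pig
-- Hypothetical example: count orders per customer name.
customers = LOAD 'customers.tsv' AS (id:int, name:chararray);
orders    = LOAD 'orders.tsv'    AS (cust_id:int, amount:double);

joined  = JOIN customers BY id, orders BY cust_id;   -- relational join
grouped = GROUP joined BY name;                      -- group by customer
counts  = FOREACH grouped GENERATE                   -- aggregate per group
              group AS name, COUNT(joined) AS n_orders;
DUMP counts;
```

Each statement builds a new relation from the previous one; Pig only compiles the chain into MapReduce jobs when output is requested (DUMP or STORE).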
What is Pig Latin?
It is a high-level data transformation language that:
● allows you to concentrate on the data transformations you require
rather than one that:
● forces you to be concerned with individual map and reduce functions
Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
And now we wait...
SSH into the master instance:
$ ssh -i ~/keys/crocs.pem -l hadoop \
    ec2-54-215-107-197.us-west-1.compute.amazonaws.com
Type "pig" to enter the grunt shell:
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1 <r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2 <r 1> 8902171
Load Piggybank - an open-source library of user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from Piggybank:
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
LOAD
Use TextLoader (a built-in Pig load function) to load each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
ILLUSTRATE
Shows, step by step, how Pig would transform a small sample of data:
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------
Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
    ) as (
        remoteAddr:    chararray,
        remoteLogname: chararray,
        user:          chararray,
        time:          chararray,
        request:       chararray,
        status:        int,
        bytes_string:  chararray,
        referrer:      chararray,
        browser:       chararray
    );
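Pig strings double each backslash, so `\\S` reaches the regex engine as `\S`. The same expression can be sanity-checked outside Pig; a minimal Python sketch, using the sample line from the ILLUSTRATE output above:

```python
import re

# The same regex Pig passes to EXTRACT (Pig's '\\S' reaches the engine as \S).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Sample line from the ILLUSTRATE output above.
sample = ('65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
          '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
          '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')

m = LOG_PATTERN.match(sample)
(remoteAddr, remoteLogname, user, time_str, request,
 status, bytes_string, referrer, browser) = m.groups()
print(remoteAddr, status, request)
```

Note that here every group comes back as a string; in the Pig script it is the `as (...)` schema that casts `status` to int.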
ILLUSTRATE an example of our work:
grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray    | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray          | - |
| time:chararray          | 20/Jul/2009:20:30:55 -0700 |
| request:chararray       | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int              | 200 |
| bytes_string:chararray  | 2969 |
| referrer:chararray      | - |
| browser:chararray       | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |
Create a bag containing tuples with just the referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
More log output before we get our results (cleaned up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no referrer)
Now let's filter, keeping only referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
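Why `.*bing.*` and not just `bing`? Pig's `matches` operator follows Java's `String.matches` semantics: the pattern must cover the entire string, so the leading and trailing `.*` are required. A quick Python analogue using `re.fullmatch`:

```python
import re

referrer = "http://www.bing.com/search?q=login"

# Pig's `matches` follows Java's String.matches: the pattern must
# cover the ENTIRE string, hence the leading/trailing .* in the script.
whole  = re.fullmatch(r'.*bing.*', referrer) is not None
substr = re.fullmatch(r'bing', referrer) is not None

print(whole, substr)  # the bare pattern 'bing' does not match the full URL
```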
* We all use Bing, am I right?
Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!