Apache Pig on Amazon AWS - Swine Not?


A basic introduction to Apache Pig, focused on understanding what it is as well as quickly getting started using it through Amazon's Elastic MapReduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729


Apache Pig on Amazon AWS

Swine Not?

What is Apache Pig?

Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.

(disturbing Pig logo not shown)

Pig is a tool that...

● creates complex jobs that efficiently process large volumes of data

● supports many relational features, making it easy to join, group, and aggregate data (see the sketch after this list)

● performs ETL tasks quickly, on many servers simultaneously
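As a taste of those relational features, here is a minimal, hypothetical sketch (file paths and fields are invented for illustration) that joins two datasets, then groups and aggregates:

-- hypothetical: join page hits to page metadata, then count views per title
hits    = LOAD 'hits.tsv'  USING PigStorage('\t') AS (url:chararray, ip:chararray);
pages   = LOAD 'pages.tsv' USING PigStorage('\t') AS (url:chararray, title:chararray);
joined  = JOIN hits BY url, pages BY url;
grouped = GROUP joined BY pages::title;
counts  = FOREACH grouped GENERATE group AS title, COUNT(joined) AS views;
DUMP counts;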

What is Pig Latin?

It is a high-level data transformation language that:

● allows you to concentrate on the data transformations you require

Rather than:

● forcing you to be concerned with individual map and reduce functions
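To make that concrete: the classic word count, which takes pages of Java as a hand-written MapReduce job, is a handful of Pig Latin statements. A minimal sketch (the input path is a placeholder):

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
by_word = GROUP words BY word;
counts  = FOREACH by_word GENERATE group AS word, COUNT(words) AS total;
DUMP counts;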

Walkthrough - Create a Job Flow

* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
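The tutorial creates the job flow by clicking through the EMR console. For reference, a Pig cluster can also be launched from the command line; a rough sketch using today's AWS CLI (the deck predates this tooling, and the name, release label, instance type, and count are placeholders):

$ aws emr create-cluster \
    --name "pig-walkthrough" \
    --applications Name=Pig \
    --release-label emr-5.36.0 \
    --instance-type m4.large \
    --instance-count 3 \
    --ec2-attributes KeyName=crocs \
    --use-default-roles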

And now we wait...

SSH into master instance:

$ ssh -i ~/keys/crocs.pem -l hadoop \
    ec2-54-215-107-197.us-west-1.compute.amazonaws.com

Type "pig" to enter the grunt shell

$ pig
grunt> _

It's a freakin' shell!

grunt> pwd
hdfs://10.174.115.214:9000/

You can enter the HDFS file system:

grunt> cd hdfs:///

grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>

Even enter an S3 bucket:

grunt> cd s3://elasticmapreduce/samples/pig-apache/input/

grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171

Load Piggybank - an open-source library of user-contributed functions

grunt> register file:/home/hadoop/lib/pig/piggybank.jar

DEFINE the EXTRACT alias from piggybank:

grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;

LOAD

Use TextLoader (a built-in Pig load function) to load each line of the source file:

grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
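TextLoader hands each line over whole. For delimited data you would normally let the loader split fields instead; a hypothetical sketch using Pig's default PigStorage loader (path and schema invented for illustration):

grunt> TSV_DATA = LOAD 's3://my-bucket/data.tsv' USING PigStorage('\t') AS (id:int, name:chararray);

These access logs are not cleanly delimited, though, so each line is loaded whole and split later with a regular expression.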

ILLUSTRATE

Shows, step by step, how Pig would transform a small sample of the data

grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------

Now let's:

● split each line into fields
● store everything in a bag

grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
         FLATTEN(
           EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
         ) as (
           remoteAddr: chararray, remoteLogname: chararray, user: chararray,
           time: chararray, request: chararray, status: int,
           bytes_string: chararray, referrer: chararray, browser: chararray
         );
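For reference, the capture groups (shown with Pig's doubled-backslash escaping) line up with the declared fields like so; the pattern is essentially the Apache combined log format:

(\\S+)                           -> remoteAddr     (client IP)
(\\S+)                           -> remoteLogname  (identd name, usually "-")
(\\S+)                           -> user           (authenticated user, usually "-")
\\[([\\w:/]+\\s[+\\-]\\d{4})\\]  -> time           (e.g. 21/Jul/2009:02:29:56 -0700)
"(.+?)"                          -> request        (method, path, and protocol)
(\\S+)                           -> status         (HTTP status code)
(\\S+)                           -> bytes_string   (response size in bytes)
"([^"]*)"                        -> referrer
"([^"]*)"                        -> browser        (user-agent string)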

ILLUSTRATE an example of our work:

grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray    | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray          | - |
| time:chararray          | 20/Jul/2009:20:30:55 -0700 |
| request:chararray       | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int              | 200 |
| bytes_string:chararray  | 2969 |
| referrer:chararray      | - |
| browser:chararray       | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |

Create a bag containing tuples with just the referrer element (limit 10 items):

grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;

Output the contents of the bag:

grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...

More log output before we get our results (cleaned up here)

...

Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"

Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"

Counters:
Total records written : 10

...

Voila! Our exciting results:

(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)

First 10 referrers (the dashes represent no referrer)

Now let's filter, keeping only referrals from bing.com*

grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)

* We all use Bing, am I right?
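DUMP only prints to the console. To actually keep the filtered results, a STORE statement writes the bag out instead; a minimal sketch (the output path is a placeholder):

grunt> STORE FILTERED INTO 's3://my-bucket/bing-referrers' USING PigStorage();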

Don't forget to terminate your Job Flow

Amazon will charge you even if it's idle!
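From the console that means hitting Terminate; with today's AWS CLI the equivalent would be something like this (the cluster id is a placeholder):

$ aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX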