CX4242: Scaling Up Pig

Mahdi Roozbahani

Lecturer, Computational Science and Engineering, Georgia Tech

High-level language

• instead of writing low-level map and reduce functions

Easier to program, understand and maintain

Created at Yahoo!

Produces sequences of Map-Reduce programs

(Lets you do “joins” much more easily)

http://pig.apache.org

Your data analysis task becomes a data flow sequence (i.e., data transformations)

Input ➡ data flow ➡ output

You specify the data flow in Pig Latin (Pig’s language). Pig then turns the data flow into a sequence of MapReduce jobs automatically!


Pig: 1st Benefit

Write only a few lines of Pig Latin

Typically, the MapReduce development cycle is long:

• Write mappers and reducers

• Compile code

• Submit jobs

• ...
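For contrast, here is a rough plain-Python sketch of the low-level style that Pig lets you avoid: a hand-written mapper and reducer for a "maximum value per key" task. It is not Hadoop code. The names `mapper`, `reducer`, and `run_job` are illustrative assumptions, and the in-memory "shuffle" only imitates what a MapReduce framework does between the two phases.

```python
from collections import defaultdict

# A few sample (year, temperature) readings, for illustration only.
readings = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]


def mapper(record):
    """Emit a (key, value) pair for one input record."""
    year, temperature = record
    yield year, temperature


def reducer(year, temperatures):
    """Collapse all values collected for one key into a single result."""
    return year, max(temperatures)


def run_job(records):
    # Shuffle phase: collect mapper output by key, as the framework would.
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


max_by_year = run_job(readings)
print(max_by_year)
```

In a real MapReduce job you would also compile, package, and submit this logic; in Pig Latin the same task takes a handful of statements.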

Pig: 2nd Benefit

Pig can perform a sample run on a representative subset of your input data automatically!

Helps you debug your code at a smaller scale (much faster!) before applying it to the full data

What is Pig good for?

Batch processing

• Since it’s built on top of MapReduce

• Not for random query/read/write

May be slower than MapReduce programs coded from scratch

• You give up some execution speed in exchange for ease of use and shorter coding time

How to run Pig

Pig is a client-side application (runs on your computer)

Nothing to install on the Hadoop cluster

How to run Pig: 2 modes

Local Mode

• Run on your computer (e.g., laptop)

• Great for trying out Pig on small datasets

MapReduce Mode

• Pig translates your commands into MapReduce jobs

• Remember you can have a single-machine cluster set up on your computer

Difference between PIG local and mapreduce mode: http://stackoverflow.com/questions/11669394/difference

Pig program: 3 ways to write

Script

Grunt (interactive shell)

• Great for debugging

Embedded (into Java program)

• Use PigServer class (like JDBC for SQL)

• Use PigRunner to access Grunt

Grunt (interactive shell)

Provides code completion

Press the Tab key to complete Pig Latin keywords and functions

Let’s see an example Pig program run with Grunt

• Find highest temperature by year

Example Pig program

Find highest temperature by year

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

filtered_records =
    FILTER records BY temperature != 9999
    AND (quality == 0 OR quality == 1 OR
         quality == 4 OR quality == 5 OR
         quality == 9);

grouped_records = GROUP filtered_records BY year;

max_temp = FOREACH grouped_records GENERATE
    group, MAX(filtered_records.temperature);

DUMP max_temp;

Example Pig program

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Each line of this output is called a “tuple”

grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}
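As a rough illustration of what the LOAD statement with its AS clause does, the plain-Python sketch below splits tab-separated lines into typed tuples. The contents of `SAMPLE_TEXT` are an assumption reconstructed from the DUMP output (the real file is not shown in the slides), and `Record` and `load` are hypothetical names, not Pig APIs.

```python
from typing import NamedTuple


class Record(NamedTuple):
    year: str          # chararray in the Pig schema
    temperature: int   # int in the Pig schema
    quality: int       # int in the Pig schema


# Assumed file contents: tab-separated fields, one reading per line.
SAMPLE_TEXT = "1950\t0\t1\n1950\t22\t1\n1950\t-11\t1\n1949\t111\t1\n1949\t78\t1\n"


def load(text):
    """Split each line on tabs and cast the fields to the declared types."""
    records = []
    for line in text.splitlines():
        year, temperature, quality = line.split("\t")
        records.append(Record(year, int(temperature), int(quality)))
    return records


records = load(SAMPLE_TEXT)
for record in records:
    print(tuple(record))
```

Each `Record` plays the role of one Pig tuple, and the field names and types mirror what DESCRIBE reports.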

Example Pig program

grunt> filtered_records =
    FILTER records BY temperature != 9999
    AND (quality == 0 OR quality == 1 OR
         quality == 4 OR quality == 5 OR
         quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

In this example, no tuple is filtered out
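To see the FILTER condition reject something, the plain-Python sketch below applies the same test to the five sample tuples plus two extra tuples. The extra tuples are hypothetical (they are not in the sample file), and `keep` is an illustrative name for the predicate.

```python
GOOD_QUALITY = {0, 1, 4, 5, 9}
MISSING = 9999  # value the example's filter excludes


def keep(record):
    """Same test as the FILTER condition: temperature present and quality accepted."""
    year, temperature, quality = record
    return temperature != MISSING and quality in GOOD_QUALITY


records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]
# Two hypothetical extra tuples (not in the sample file) that the predicate rejects.
extra = [("1951", 9999, 1), ("1951", 15, 2)]

filtered_records = [r for r in records + extra if keep(r)]
print(filtered_records)
```

Only the two hypothetical tuples are dropped: one for the 9999 temperature, one for the quality code 2.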

Example Pig program

grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

Each {...} is called a “bag” = unordered collection of tuples

grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}

“group” is an alias that Pig created
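The GROUP step can be pictured with a small plain-Python sketch: every tuple is placed into a collection keyed by its year. This is only a model of the result, not how Pig executes it. A Python list stands in for the bag (a real bag is unordered), and `group_by_year` is a hypothetical helper name.

```python
from collections import defaultdict

filtered_records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def group_by_year(records):
    """Return (group, bag) pairs, one per distinct year."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)  # the whole tuple goes into the bag
    # Sorted only to make the printout stable; Pig does not promise an order.
    return sorted(groups.items())


grouped_records = group_by_year(filtered_records)
for group, bag in grouped_records:
    print(group, bag)
```

Note that each bag keeps the full original tuples, including the year, which matches the nested schema that DESCRIBE reports.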

Example Pig program

Recall the contents and schema of grouped_records:

(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}

grunt> max_temp = FOREACH grouped_records GENERATE
    group, MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)
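The FOREACH ... GENERATE step does two things per group: it projects the temperature field out of the bag (`filtered_records.temperature`) and then applies MAX to those values. A minimal plain-Python sketch of that logic, with `max_temperature` as a hypothetical helper name:

```python
grouped_records = [
    ("1949", [("1949", 111, 1), ("1949", 78, 1)]),
    ("1950", [("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1)]),
]


def max_temperature(grouped):
    """For each (group, bag) pair, emit (group, maximum temperature in the bag)."""
    result = []
    for group, bag in grouped:
        # Projection: keep only the temperature field of each tuple in the bag.
        temperatures = [temperature for _, temperature, _ in bag]
        result.append((group, max(temperatures)))
    return result


max_temp = max_temperature(grouped_records)
print(max_temp)
```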

Run Pig program on a subset of your data

You saw an example run on a tiny dataset

How do you do that for a larger dataset?

• Use the ILLUSTRATE command to generate a sample dataset

Run Pig program on a subset of your data

grunt> ILLUSTRATE max_temp;
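The idea behind a sample run can be sketched in plain Python: try the whole pipeline on a few records before paying for the full dataset. This is not how ILLUSTRATE is implemented. Pig's own sample selection is more involved than the random draw used here. The names `pipeline`, `sample_run`, and the synthetic `full_data` are all assumptions made for illustration.

```python
import random


def pipeline(records):
    """Filter, group by year, and take the maximum temperature per year."""
    best = {}
    for year, temperature, quality in records:
        if temperature != 9999 and quality in {0, 1, 4, 5, 9}:
            best[year] = max(temperature, best.get(year, temperature))
    return sorted(best.items())


def sample_run(records, size, seed=0):
    """Run the pipeline on a small random subset instead of the full input."""
    rng = random.Random(seed)
    subset = rng.sample(records, min(size, len(records)))
    return subset, pipeline(subset)


# Hypothetical larger input: synthetic readings, generated only for this sketch.
full_data = [(str(1901 + i % 50), (i * 37) % 400 - 100, 1) for i in range(10_000)]

subset, preview = sample_run(full_data, size=5)
print(subset)
print(preview)
```

If the preview looks wrong, you can fix the logic in seconds rather than after a full batch run.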

How does Pig compare to SQL?

SQL: “fixed” schema

Pig: loosely defined schema, as in

records = LOAD 'input/ncdc/micro-tab/sample.txt'

How does Pig compare to SQL?

SQL: supports fast, random access (e.g., <10ms, but of course depends on hardware, data size, and query complexity too)

Pig: batch processing

Pig vs SQL

http://yahoohadoop.tumblr.com/post/98294444546/comparing-pig-latin-and-sql-for-constructing-data

1. Pig Latin is procedural, where SQL is declarative.

2. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.

3. Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.

4. Pig Latin supports splits in the pipeline.

5. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.
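Points 4 and 5 can be pictured together with a small plain-Python sketch: a user-written check decides which branch of a forked pipeline each record follows. This only models the concept, not Pig's syntax or APIs. `is_plausible` (with its arbitrary range) and `split` are hypothetical names, and the last input tuple is made up for the example.

```python
def is_plausible(record):
    """Hypothetical user-defined check: temperature within an arbitrary range."""
    _, temperature, _ = record
    return -1000 <= temperature <= 1000


def split(records, predicate):
    """Route each record to one of two outputs, like a fork in the pipeline."""
    accepted, rejected = [], []
    for record in records:
        (accepted if predicate(record) else rejected).append(record)
    return accepted, rejected


# The last tuple is hypothetical, added so that the rejected branch is not empty.
records = [("1950", 0, 1), ("1950", 22, 1), ("1949", 9999, 1)]

good, bad = split(records, is_plausible)
print(good)
print(bad)
```

Both branches remain available downstream, so one could go on to the analysis while the other is stored for inspection.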

Much more to learn about Pig

Relational Operators, Diagnostic Operators (e.g., describe, explain, illustrate), utility commands (cat, cd, kill, exec), etc.
