Apache Pig
Pittsburgh Hadoop User Group 11/3/2009
Dmitriy Ryaboy, Ashutosh Chauhan
In This Talk
• What is Pig and why it’s needed
  – This part is going to be brief.
  – It’s a Hadoop User Group, after all.
• Examples
  – As well as some extra motivation
• Advanced Features
• Improvements currently in development
• Interesting research problems
  – Want to get involved?
What is Pig
or, “duality of Pig”

Pig compiles data analysis tasks into Map-Reduce jobs and runs them on Hadoop.

Pig Latin is a language for expressing data transformation flows.
Pig can be made to understand other languages, too.
There is an SQL prototype; it is just a question of compiling the language into Pig’s internal operator tree.
See: Smith, Agent. The Matrix Trilogy, Warner Bros.
Who Uses Pig
• Yahoo – “40% of all Hadoop jobs are run with Pig”
• Twitter
  – “Some Java-based MapReduce, some Hadoop Streaming”
  – “Most analysis, and most interesting analysis, done in Pig”
• LinkedIn, AOL, CoolIris, Ning, eBuddy …
Why use Pig
• Express data transformation tasks in a few lines.
• “Reach out and touch a Petabyte.”
Image credit: Michelangelo, with apologies.
Pig Latin Example
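The example script did not survive extraction; an illustrative sketch in the same spirit (relation, field, and file names are assumed, not from the original deck):

```pig
-- Load visit logs, keep recent visits, count visits per URL,
-- and rank URLs by visit count.
visits  = LOAD 'visits' AS (user, url, time);
recent  = FILTER visits BY time >= '20091001';
grouped = GROUP recent BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(recent) AS n;
ranked  = ORDER counts BY n DESC;
STORE ranked INTO 'top_urls';
```

Each statement names an intermediate relation, so the script reads as a step-by-step data flow rather than a single nested expression.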
Why not plain Map/Reduce?
Brevity.
Corresponding Pig Script
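As a sketch of the brevity argument: the classic word count, which takes pages of Java MapReduce, is a handful of Pig Latin lines (file and relation names are assumed):

```pig
-- Word count: tokenize lines, group by word, count occurrences.
lines  = LOAD 'input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount';
```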
Why Not SQL
• Pig Latin allows expressing transformations as a sequence of steps. SQL describes the desired outcome.
• Writing down a sequence of steps is intuitive to developers.
  – Allows expressing more complex data flows
  – But still no loops or conditionals.
• Support for nested structures (arrays, maps)
• Much easier to work with groups
  – Let me show you…
Top 5 scores for each player
• Data: < playerId, score, date >
• For each player
  – Best 5 scores
  – Date each of these scores was achieved
• Common task; this particular one lifted from http://stackoverflow.com/questions/1467898
• Classically painful in SQL
SQL
• Self-join
• Data explosion for same values of tblAbc.aa
• Readability?
• I like SQL. But this is far from straightforward.
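A sketch of the self-join approach (table and column names are assumed for illustration): each row is joined against every row for the same player with an equal or higher score, which is where the data explosion on repeated values comes from.

```sql
-- Keep rows that at most 4 other rows for the same player outrank.
-- Repeated scores inflate the join output and the HAVING count.
SELECT a.playerId, a.score, a.date
FROM   scores a
JOIN   scores b
  ON   b.playerId = a.playerId AND b.score >= a.score
GROUP  BY a.playerId, a.score, a.date
HAVING COUNT(*) <= 5;
```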
SQL
We can fix these problems, by going procedural.
There goes the brevity and declarativity.
Pig Latin
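A sketch of the same task in Pig Latin (names assumed), using a nested ORDER and LIMIT inside FOREACH to pick each player’s best scores from the group:

```pig
-- Top 5 scores per player, with the date each was achieved.
scores   = LOAD 'scores' AS (playerId, score, date);
byplayer = GROUP scores BY playerId;
top5     = FOREACH byplayer {
    sorted = ORDER scores BY score DESC;
    best   = LIMIT sorted 5;
    GENERATE group AS playerId, FLATTEN(best);
};
STORE top5 INTO 'top5';
```

The grouping makes the per-player bag explicit, so “best 5 within each group” is a direct statement rather than a self-join trick.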
Diving Deeper
http://www.flickr.com/photos/lollaping/2573664394/
User-Defined Functions
• Java interfaces
  – Reading / writing data
    • Allows reading from DBs, HBase, custom file formats
    • Significant API overhaul underway
  – Transformation / evaluation of tuples
  – Group aggregation
• See Piggybank for examples:
  http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/
• Support for other languages in the works:
  https://issues.apache.org/jira/browse/PIG-928
Streaming
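Pig’s STREAM operator pipes each tuple through an external program as a tab-delimited line on stdin and reads results back from stdout. A minimal sketch of such a script (the field layout and threshold are assumed for illustration):

```python
import sys

def filter_rows(lines, min_score=100):
    """Keep tab-delimited rows whose second field (a score) is >= min_score."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) >= min_score:
            kept.append("\t".join(fields))
    return kept

if __name__ == "__main__":
    # Pig feeds tuples on stdin and reads the surviving tuples from stdout.
    for row in filter_rows(sys.stdin):
        print(row)
```

From a Pig script this would be invoked along the lines of: filtered = STREAM users THROUGH `python filter_rows.py`;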
• IO cost of scanning the data dominates most jobs
• Share data scans
• This example requires separate pipelines for state and demographics
• Or does it?
Multiquery Optimization

load users → filter out bots
  → group by state → apply UDFs → store into ‘bystate’
  → group by demographic → apply UDFs → store into ‘bydemo’
Slide credit: Alan Gates et al, VLDB 2009
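The flow above corresponds to a single script with two STORE statements; a sketch (the bot filter and the COUNT stand-in for the UDFs are illustrative):

```pig
-- One scan of 'users' feeds both the by-state and by-demographic pipelines.
users   = LOAD 'users' AS (name, state, demographic);
clean   = FILTER users BY name != 'bot';
bystate = GROUP clean BY state;
state_r = FOREACH bystate GENERATE group, COUNT(clean);
bydemo  = GROUP clean BY demographic;
demo_r  = FOREACH bydemo GENERATE group, COUNT(clean);
STORE state_r INTO 'bystate';
STORE demo_r  INTO 'bydemo';
```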
Multiquery Optimization
• Tag each record with the pipeline it belongs to
• Regular Hadoop grouping on [pipeline #, key]
• Multiplexer in reduce stage sends records to appropriate pipeline
• Use a single script to compute many things from shared sources.
map: filter → split → local rearrange (one per pipeline)
reduce: multiplex → package → foreach (one per pipeline)
Slide credit: Alan Gates et al, VLDB 2009
Hash Join
• Default algorithm
• Mapper emits ([key, relid], tuple)
• Reducer sees all tuples from each relation with the same key, performs join
  – Easy, due to sorted order
• Optimization: entries in the last relation can be streamed through (not accumulated in memory for each key)
• Can join multiple tables, supports outer joins
• If one relation has many duplicate keys, put it last!
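In Pig Latin the default (hash) join needs no annotation; per the tip above, the relation with many duplicate keys goes last (relation names are illustrative):

```pig
small = LOAD 'small' AS (k, v1);
big   = LOAD 'big'   AS (k, v2);
-- 'big' is last, so its tuples are streamed rather than buffered per key.
j = JOIN small BY k, big BY k;
```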
Fragment-Replicate Join
• Join large table A with 1 or more small tables (B, C)
• Fragment large table across mappers
• Replicate smaller tables in their entirety to all mappers
• Supports multiple tables
• Use when all small tables together can fit in memory on a single map task
[Diagram: table A fragmented across Map 1–3, with tables B and C replicated in full to each mapper]
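Fragment-replicate join is requested with a USING clause; the fragmented (large) relation comes first and the replicated (small) relations follow (names are illustrative):

```pig
big   = LOAD 'big'   AS (k, v1);
small = LOAD 'small' AS (k, v2);
-- 'small' is loaded into memory on every map task; 'big' is scanned in fragments.
j = JOIN big BY k, small BY k USING 'replicated';
```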
Merge Join
• 2 inputs, both sorted on join key
• Build sparse index into B
• Partition A across mappers
• Use index to read B from corresponding block on each mapper
• Bounded buffering on both sides – least memory intensive
• Significant speed gains, especially on skewed data
[Diagram: sparse index into B, with A partitioned across Map 1–3]
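Merge join is requested explicitly; both inputs must already be sorted on the join key (names are illustrative):

```pig
a = LOAD 'sorted_a' AS (k, v1);
b = LOAD 'sorted_b' AS (k, v2);
-- Inputs are pre-sorted on k, so Pig can index B and merge without a reduce.
j = JOIN a BY k, b BY k USING 'merge';
```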
Skewed Join
• 1+ keys with more tuples per key than fit in reducer memory, in both relations
• Build histogram based on sample
• Round-robin skewed keys into multiple reducers
• Stream other table, sending each record to all reducers responsible for key
http://wiki.apache.org/pig/PigSkewedJoinSpec
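Skewed join is likewise selected with a USING clause (names are illustrative); Pig samples the data, spreads the hot keys across reducers, and replicates the matching tuples from the other relation:

```pig
a = LOAD 'a' AS (k, v1);
b = LOAD 'b' AS (k, v2);
j = JOIN a BY k, b BY k USING 'skewed';
```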
In Development
• Release 0.5: Hadoop 0.20 compatibility
  – Or apply a patch, or get the Cloudera distro
  – Imminent
• Release 0.6 (or later)
  – Load/Store redesign, see PIG-966
  – UDFs in other languages, see PIG-923
  – Columnar storage, see Zebra in /contrib
  – Stored schemas, see PIG-760
  – Metadata service (“Owl”), see PIG-823
  – SQL compiler, see PIG-824
  – Auto-selection of joins when stats are known – see us
Research Problems
aka, when pigs fly
• Extending the language
  – Functions
  – Conditional logic
  – Loops
• Smarter data storage
  – Learn from workload
    • auto-selected column families
    • different sort orders
    • overlapping projections
  – Go faster
    • pushing queries to storage
    • lazy decompression
Research Problems
aka, when pigs fly
• Adaptive optimization
  – Change plan mid-flight based on observation of the current environment
  – Can we do this without waiting for an MR stage to finish?
• Continuous queries, approximate early results
  – Initial work at Berkeley: Map-Reduce Online *
• Cost-based optimization
  – Appropriate cost model?
  – Balance of disk IO, network traffic, task overhead, memory limitations on individual tasks…
  – Different from, but similar to, distributed DBs
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html
Further Reading
• Gates et al, “Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience,” VLDB 2009
  http://infolab.stanford.edu/~olston/publications/vldb09.pdf
• Olston et al, “Generating example data for dataflow programs,” SIGMOD 2009 (best paper)
  http://infolab.stanford.edu/~olston/publications/sigmod09.pdf
• Olston et al, “Pig Latin: A not-so-foreign language for data processing,” SIGMOD 2008
  http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
• Olston et al, “Automatic optimization of parallel dataflow programs,” USENIX 2008
  http://infolab.stanford.edu/~olston/publications/usenix08.pdf
• More: http://wiki.apache.org/pig/PigTalksPapers
Performance
average over 12 benchmark queries

Pig speed as a factor of Hadoop time (average runtime):
alpha: 7.6
11/21/2008: 2.5
1/20/2009: 1.8
2/23/2009: 1.6
5/28/2009: 1.5
7/28/2009: 1.4
8/27/2009: 1.2
10/18/2009: 1
Performance
weighted average over 12 benchmark queries

Pig speed as a factor of Hadoop time (weighted average):
alpha: 11.20
11/21/2008: 3.26
1/20/2009: 2.20
2/23/2009: 1.97
5/28/2009: 1.83
7/28/2009: 1.68
8/27/2009: 1.53
10/18/2009: 1.04
Queries?
dvryaboy@cmu.edu
achauha1@cs.cmu.edu
http://hadoop.apache.org/pig
pig-user@apache.org
@squarecog