A Smarter Pig: Building a SQL interface to Pig using …...A Smarter Pig: Building a SQL interface...

A Smarter Pig: Building a SQL interface to Pig using Apache Calcite

Eli Levine & Julian Hyde

Apache: Big Data, Miami2017/05/16

Julian Hyde @julianhyde

Original developer of CalcitePMC member of Calcite, Drill, Eagle, KylinASF member

About us

Eli Levine @teleturn

PMC member of PhoenixASF member

Apache Calcite

Apache top-level project since October, 2015

Query planning framework➢ Relational algebra, rewrite rules➢ Cost model & statistics➢ Federation via adapters➢ Extensible

Packaging➢ Library➢ Optional SQL parser, JDBC server➢ Community-authored rules, adapters

Apache Pig

Apache top-level project

Platform for Analyzing Large Datasets➢ Uses Pig Latin language

○ Relational operators (join, filter)○ Functional operators (map, reduce)

➢ Runs as MapReduce (also Tez)➢ ETL➢ Extensible

○ LOAD/STORE○ UDFs

Outline

Batch compute on Force.com Platform (Eli Levine)

Apache Calcite deep dive (Julian Hyde)

Building Pig adapter for Calcite (Eli Levine)

Salesforce Platform

Object-relational data model in the cloud

Contains standard objects that users can customize or add their own

SQL-like query language SOQL- Real-time- Batch/ETL

Federated data store: Oracle, HBase, external

User queries span data sources (federated joins)

SELECT DEPT.NAMEFROM EMPLOYEEWHERE FIRST_NAME = ‘Eli’

Salesforce Platform - Current Batch/ETL

Called Async SOQL

- REST API- Users supply SOQL and info about where to deposit results

SOQL -> Pig Latin script

Pig loaders move data/computation to HDFS for federated query execution

Own SOQL parsing, no Calcite

Query Planning in Async SOQL

Current

Next generation

Apache Calcite for Next-Gen Optimizer

- Strong relational algebra foundation

- Support for different physical engines

- Pluggable cost model

- Optimization rules

- Federation-aware

[Julian talks about Calcite]

- Building blocks for building query engine or DB- Federation-aware- Represent and optimize logical plan- Convert to physical plan

Planning queries

Splunk

Key: productId

Key: productNameAgg: count

filter

Condition:action = 'purchase'

Key: c desc

Table: products

select p.productName, count(*) as cfrom splunk.splunk as s join mysql.products as p on s.productId = p.productIdwhere s.action = 'purchase'group by p.productNameorder by c desc

Table: splunk

Optimized query

Splunk

Key: productId

Key: productNameAgg: count

filter

Condition:action = 'purchase'

Key: c desc

Table: splunk

Table: products

select p.productName, count(*) as cfrom splunk.splunk as s join mysql.products as p on s.productId = p.productIdwhere s.action = 'purchase'group by p.productNameorder by c desc

Adapters

Connect to a data source

How to push down logic to the data source?

A set of planner rules

Maybe also a calling convention

Maybe also a query model & query generator

Calcite Pig Adapter

EMPLOYEE = LOAD 'EMPLOYEE' ... ;

EMPLOYEE = GROUP EMPLOYEE BY (DEPT_ID);

EMPLOYEE = FOREACH EMPLOYEE GENERATE COUNT(EMPLOYEE.DEPT_ID) as DEPT_ID__COUNT_,

group as DEPT_ID;

EMPLOYEE = FILTER EMPLOYEE BY (DEPT_ID__COUNT_ > 10);

SELECT DEPT_ID FROM EMPLOYEE GROUP BY DEPT_ID HAVING COUNT(DEPT_ID) > 10

Building the Pig Adapter

1. Implement Pig-specific RelNodes. e.g. PigFilter

2. RelNode Factories

3. Various RelOptRule types for converting Abstract RelNodes to PigRelNodes

4. Schema implementation (several iterations)

5. Unit tests run local Pig

Lessons Learned

Calcite is very flexible (both good and bad)

Lots available out of the box

Little info on writing optimization and conversion rules

Dynamic code generation using Janino -- cryptic errors

SQL maps well to Pig. Inverse not always true.

Thank you!

Eli Levine @teleturnJulian Hyde @julianhydehttp://calcite.apache.orghttp://pig.apache.org

A Smarter Pig: Building a SQL interface to Pig using …...A Smarter Pig: Building a SQL interface...

Documents

Transcript of A Smarter Pig: Building a SQL interface to Pig using …...A Smarter Pig: Building a SQL interface...

Introduction To Apache Pig at WHUG

Integrating Apache Sqoop And Apache Pig With Apache Hadoop · 2020-01-07 · Integrating Apache Sqoop And Apache Pig With Apache Hadoop 3 6789 SecondaryNameNode 7057 TaskTracker If

Apache PIG introduction

Introduction to Apache Hadoop & Pig - Indiana University

Practical Problem Solving with Apache Hadoop & Pig

Introduction to Apache Hadoop & Pig - SALSAHPCsalsahpc.indiana.edu/CloudCom2010/slides/PDF/tutorials/Yahoo... · Hadoop & Pig Milind Bhandarkar ... (hadoop, pig) (apache, pig) (hadoop,

Apache pig as a researcher’s stepping stone

Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

COSC 416 Apache Pig - People | UBC's Okanagan campus · 2013. 1. 27. · Apache Pig Apache Pig is a high-level language for expressing Map-Reduce programs. Pig defines a language

Apache Pig for Data Science · 2017. 12. 14. · Apache Pig for Data Science Casey Stella April9,2014 Casey Stella (Hortonworks) Apache Pig for Data Science April 9, 2014

Apache Pig: A big data processor

Apache PIG - User Defined Functions

Introduction to Apache Pig - ut€¦ · Introduction to Apache Pig Pelle Jakovits 28 September 2016, Tartu. Outline •MapReduce recollection •Apache Pig –How to run Pig –Pig

MAPREDUCE & PIG TUTORIALvvtesh.co.in/teaching/bigdata-2020/slides/Tut5-MR-Pig.pdf · Apache PIG • PIG is an abstraction over MapReduce. • It uses high level language - Pig Latin.

Performance Analysis and Optimization of Apache Pig...Apache Pig is a language, compiler, and run-time library for simplifying the de-velopment of data-analytics applications on Apache

Faster ETL Workflows using Apache Pig & Spark€¦ · Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics @praveenr019. About me Apache Pig committer

Apache hadoop pig overview and introduction

Apache Pig on Amazon AWS - Swine Not?

APACHE PIG Presented by Priagung Khusumanegara Prof. Kyungbaek Kim.

High-level Programming Languages: Apache Pig and Pig Latin