PIG ver_2.0
Transcript of PIG ver_2.0
-
Copyright 2012 Tata Consultancy Services Limited
Introduction To PIG
August 16, 2012
INTERNAL
Only for TCS Internal Training TCS NextGen Solutions, Kochi
-
MapReduce?
MapReduce is very powerful, but:
It requires a Java programmer.
The user has to re-invent common functionality (join, filter, etc.).
-
Don't know Java?
Don't worry.
Just Hire
-
PIG Engine
An execution engine on top of Hadoop.
Removes the need for users to tune Hadoop for their needs.
-
PIG Latin
PIG provides PIG Latin, a high-level language.
Increases productivity.
5 lines of PIG can replace about 50 lines of Java.
Usable by non-Java programmers.
Provides common operators:
JOIN
GROUP
FILTER
ORDER BY (sort), etc.
-
New Language....
-
Background
Yahoo! adopted Hadoop.
Hadoop usage increased extensively.
Yahoo! Research developed Pig as a high-level language for Hadoop.
Roughly 30% of the Hadoop jobs run at Yahoo! are Pig jobs.
-
PIG Use Cases
Web log processing
Data processing for web search
Ad hoc queries across large data sets
-
Installation
-
Downloading & Installation
Download PIG
ftp://ftp.nextgen.com/pub/Hadoop%20Ecosystem/pig-0.8.1-cdh3u2.tar.gz
Untar PIG
$> tar -xzvf pig-0.8.1-cdh3u2.tar.gz
Set Environment Variables
Open the .profile file and add:
export PIG_HOME=/home/empId/pig-0.8.1-cdh3u2
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME:$HADOOP_HOME/conf
Load the .profile file
$> . .profile
-
Running PIG
PIG has two run modes:
Local Mode (uses the local filesystem)
MapReduce Mode (uses HDFS)
Ways to use PIG Latin:
Grunt Shell
PIG Script
Embedded Programs
* All three ways can run in both modes.
-
Grunt Shell
Interactive shell for PIG.
Local Mode:
$> pig -x local
MapReduce Mode:
$> pig
or
$> pig -x mapreduce
NOTE: Hadoop must be up and running for MapReduce Mode.
In either mode, the Grunt shell is invoked and you can enter commands at the prompt. The
results are displayed on your terminal screen (if DUMP is used) or written to a file (if STORE is used).
-
Script File
Runs PIG commands as batch jobs.
Local Mode
$> pig -x local testScript.pig
MapReduce Mode
$> pig testScript.pig
or
$> pig -x mapreduce testScript.pig
In either mode, the Pig Latin statements are executed and the results are displayed on your
terminal screen (if DUMP is used) or written to a file (if STORE is used).
-
Embedded Programs
Embed PIG commands in a host language (Java in this example).
Run the program.
Local Mode
Compile Program
$> javac -cp pig.jar testLocal.java
Note: testLocal.class is written to your current working directory.
Include . in the class path when you run the program.
Run the Program
$> java -cp pig.jar:. testLocal
-
Embedded Programs (Contd..)
MapReduce Mode
Point $HADOOPDIR to the directory that contains the hadoop-site.xml file.
$> export HADOOPDIR=$HADOOP_HOME/conf
Compile the program:
$> javac -cp pig.jar testMapreduce.java
Note: testMapreduce.class is written to your current working directory. Include . in the class path
when you run the program.
Run the program:
$> java -cp pig.jar:.:$HADOOPDIR testMapreduce
-
Data Model
-
Data Types
Scalar Types
int, long, float, double, chararray, bytearray
-
Data Types (Contd..)
Complex Types
Can contain data of any type, including other complex types.
Map
A mapping from chararray keys to data elements.
Element: any PIG type (scalar or complex)
Key: chararray
Tuple
An ordered collection of PIG data elements.
Analogous to a row in SQL.
General Tuple :: t : (field1, field2)
A field can be any PIG type.
Schema definition is not necessary.
-
Data Types (Contd..)
Bag
An unordered collection of tuples.
Schema definition is not necessary.
General Bag :: b : {('China', '01'), ('India', '02'), ('Brazil', '03')}
The maximum size of a bag is limited by the available local disk space (bags that do not fit in memory spill to disk).
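As a rough analogy only (plain Python, not PIG code), the three complex types can be pictured with built-in Python containers. The variable names and the sample map below are illustrative assumptions, not part of PIG.

```python
# Plain-Python analogy for PIG's complex types (illustrative only, not PIG code).

# Tuple: an ordered collection of fields, like a row in SQL.
t = ("India", "02")

# Bag: a collection of tuples. A list is used here, but its order carries no meaning.
b = [("China", "01"), ("India", "02"), ("Brazil", "03")]

# Map: string (chararray) keys mapped to values of any type, including complex ones.
m = {"name": "India", "code": "02", "neighbours": [("China", "01")]}

# Fields of a tuple are addressed by position, map values by key.
first_field = t[0]
code = m["code"]
```

The nesting in `m` mirrors the point that a complex type can hold other complex types.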
-
Null Concept
The concept is the same as in SQL.
NULL represents an unknown value.
-
Schemas
PIG is flexible in terms of schema.
Defining a schema is not necessary.
If no schema is given, PIG makes a best guess at the types.
Example
stockData = LOAD '/Stock' AS (stockName : chararray, stockCode : chararray,
currentValue : float, high : float, low : float);
-
Schemas Syntax
-
Cast
The syntax is the same as in Java: the target type in parentheses before the expression, e.g. (int)$0.
-
PIG Latin
-
Basics
Relations are referred to by an alias.
A relation can be re-used.
Relation and field names start with an alphabetic character.
Both kinds of name are composed of alphabetic, numeric and _ characters.
PIG keywords are not case-sensitive.
Relation and field names are case-sensitive.
Comments
Single-line: -- .....
Multi-line: /* ..... */
-
LOAD Data
Uses the LOAD keyword.
Different load functions:
PigStorage();
HBaseStorage();
DbStorage();
File = LOAD 'testData' USING PigStorage();
File = LOAD 'testData' USING HBaseStorage();
File = LOAD 'testData' USING DbStorage();
-
STORE Data
Uses the STORE keyword.
Different store functions:
PigStorage();
HBaseStorage();
DbStorage();
STORE File INTO 'testData' USING PigStorage();
STORE File INTO 'testData' USING HBaseStorage();
STORE File INTO 'testData' USING DbStorage();
DUMP: used to print output to the terminal.
-
Relational Operations
These are the operators used to relate data, extract information, etc.
Relational Operators in PIG
Foreach Filter
Group Order by
Distinct Join
Limit Sample
Parallel Flatten
-
Foreach
Applies expression(s) to each record.
A = load 'input' as (username : chararray, userid : chararray, location : chararray);
B = foreach A generate userid, location;
DUMP B;
For data with no schema, fields are referenced by position:
A = load 'input';
B = foreach A generate $0;
DUMP B;
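To make the projection concrete, here is a minimal plain-Python sketch of what the two foreach statements do to each record. It is not PIG code, and the sample records are made up for illustration.

```python
# Plain-Python sketch of: B = foreach A generate userid, location;
# The sample records are hypothetical.
A = [
    ("alice", "u1", "Kochi"),
    ("bob", "u2", "Pune"),
]

# With a schema, fields are picked by name (here unpacked into named variables).
B = [(userid, location) for (username, userid, location) in A]

# Without a schema, fields are picked by position, like $0 in PIG Latin.
B_positional = [(record[0],) for record in A]
```

Each input record yields exactly one output record; only the set of fields changes.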
-
Group
Collects records with the same value of a given key into a bag.
A = load 'input' as (username : chararray, userid : chararray, location : chararray);
B = group A by location;
C = group A all;
DUMP C;
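The effect of grouping by a key, and of grouping all records together, can be sketched in plain Python as follows. This is only an illustration of the output shape (one record per key, each holding a bag), with hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical sample records: (username, userid, location)
A = [
    ("alice", "u1", "Kochi"),
    ("bob", "u2", "Pune"),
    ("carol", "u3", "Kochi"),
]

# Grouping by location: one output record per key, holding a bag of matching records.
groups = defaultdict(list)
for record in A:
    groups[record[2]].append(record)
B = [(key, bag) for key, bag in groups.items()]

# Grouping all records: a single output record whose bag holds every input record.
C = [("all", list(A))]
```

Note that the grouped records are kept whole inside each bag; nothing is aggregated until a later foreach does so.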
-
Filter
Filters records in the data pipeline.
If the predicate evaluates to true, the record is passed further down the pipeline.
A predicate can contain comparison operators:
==, !=, >, >=, <,
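The slide gives no example statement, so the following plain-Python sketch assumes a hypothetical predicate (location equal to 'Kochi') purely to show how a predicate decides which records continue down the pipeline.

```python
# Hypothetical sample records: (username, userid, location)
A = [
    ("alice", "u1", "Kochi"),
    ("bob", "u2", "Pune"),
    ("carol", "u3", "Kochi"),
]

# Assumed predicate for illustration: keep records whose location equals 'Kochi'.
def predicate(record):
    return record[2] == "Kochi"

# Records for which the predicate is true move on; the rest are dropped.
B = [record for record in A if predicate(record)]
```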
-
Order by
Sorts the data.
A = load 'input' as (username : chararray, userid : chararray, location : chararray) ;
B = order A by location ;
C = order A by username, userid ;
DUMP C;
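As a plain-Python illustration (not PIG code, hypothetical sample data), ordering by one field and by two fields corresponds to sorting with a single key and with a composite key.

```python
# Hypothetical sample records: (username, userid, location)
A = [
    ("bob", "u2", "Pune"),
    ("alice", "u9", "Kochi"),
    ("alice", "u1", "Delhi"),
]

# Like: B = order A by location;
B = sorted(A, key=lambda r: r[2])

# Like: C = order A by username, userid;  (userid breaks ties on username)
C = sorted(A, key=lambda r: (r[0], r[1]))
```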
-
Distinct
Removes duplicate records.
A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;
B = distinct A ;
DUMP B;
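A minimal plain-Python sketch of the same idea, with hypothetical sample data: a record is a duplicate only when every field matches.

```python
# Hypothetical sample records: (username, userid, location)
A = [
    ("alice", "u1", "Kochi"),
    ("bob", "u2", "Pune"),
    ("alice", "u1", "Kochi"),   # exact duplicate of the first record
    ("alice", "u7", "Kochi"),   # differs in userid, so it is not a duplicate
]

# Keep one copy of each whole record. dict.fromkeys preserves first-seen order here,
# although the order of the result should not be relied on in PIG.
B = list(dict.fromkeys(A))
```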
-
Join
Joins records from different inputs according to keys provided.
A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;
B = load 'input1' as (name : chararray , age : int);
C = join A by username , B by name ;
DUMP C;
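The join above is an inner join on username and name. A plain-Python sketch with hypothetical sample data shows the shape of the result: each output record carries the fields of the matching record from both inputs.

```python
# Hypothetical sample inputs.
A = [("alice", "u1", "Kochi"), ("bob", "u2", "Pune")]   # (username, userid, location)
B = [("alice", 30), ("dave", 41)]                       # (name, age)

# Inner join: pair up records whose keys are equal and concatenate their fields.
# Records with no match on the other side (bob, dave) are dropped.
C = [a + b for a in A for b in B if a[0] == b[0]]
```

The nested loop is only for clarity; it says nothing about how PIG executes the join on Hadoop.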
-
Limit
Limits the number of records returned.
A = load 'input' as (username : chararray, userid : chararray, location : chararray);
B = limit A 7;
DUMP B;
-
Sample
Provides a random sample of the input data.
You supply the fraction of the data you want to see.
Fraction value: 0 to 1 (for 0% to 100%).
A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;
B = sample A 0.2 ;
DUMP B;
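One way to picture sampling with a fraction of 0.2 is that each record is kept independently with probability 0.2, so the result is about 20% of the input rather than exactly 20%. The plain-Python sketch below assumes that model; the sample data and the fixed seed are illustrative only.

```python
import random

# Hypothetical input of 100 records: (username, userid, location)
A = [("user%d" % i, "u%d" % i, "Kochi") for i in range(100)]

# Fixed seed only so that this sketch is repeatable.
rng = random.Random(42)

# Keep each record with probability 0.2; the size of B varies from run to run
# when no seed is fixed.
B = [record for record in A if rng.random() < 0.2]
```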
-
Parallel
Controls reduce-side parallelism.
A = load 'input' as (username : chararray, userid : chararray, location : chararray);
B = group A by location parallel 5;
DUMP B;
Here, parallel 5 causes the MapReduce job for the group to run with 5 reducers.
-
User Defined Functions - UDFs
PIG allows user-written code to be plugged in.
Code can be written in Java or Python.
Piggybank is a package of UDFs that is shipped with PIG.
To use UDFs:
Register your UDF.
register '/path/piggybank.jar';
Call your UDF.
A = load 'input' as (userid : chararray, location : chararray);
B = foreach A generate org.apache.pig.piggybank.evaluation.string.Reverse(location);
DUMP B;
-
Thank You