PIG ver_2.0


  • 1 Copyright 2012 Tata Consultancy Services Limited

    Introduction To PIG

    August 16, 2012

    INTERNAL

    Only for TCS Internal Training TCS NextGen Solutions, Kochi

  • 2

    MapReduce????

    MapReduce is very powerful, but:

    It requires a Java programmer.

    The user has to re-invent common functionality (join, filter, etc.).


  • 3

    Don't Know JAVA...........

    Don't Worry

    Just Hire


  • 4

    PIG Engine

    Execution engine on top of Hadoop.

    Removes the need for users to tune Hadoop for their needs.


  • 5

    PIG Latin

    PIG provides PIG Latin, a high-level language.

    Increases productivity.

    5 lines of PIG can do the work of about 50 lines of Java.

    Usable by non-Java programmers.

    Provides common operators:

    JOIN

    GROUP

    FILTER

    ORDER (sort), etc.


  • 6

    New Language....


  • 7

    Background

    Yahoo! adopted Hadoop.

    Hadoop usage increased extensively

    Yahoo! Research developed Pig as a high-level language for Hadoop.

    Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs


  • 8

    PIG Use Cases

    Web log processing

    Data processing for web search

    Ad hoc queries across large data sets


  • Installation


  • 10

    Downloading & Installation

    Download PIG

    ftp://ftp.nextgen.com/pub/Hadoop%20Ecosystem/pig-0.8.1-cdh3u2.tar.gz

    Untar PIG

    $> tar -xzvf pig-0.8.1-cdh3u2.tar.gz

    Set Environment Variables

    Open .profile file

    export PIG_HOME=/home/empId/pig-0.8.1-cdh3u2

    export PATH=$PATH:$PIG_HOME/bin

    export PIG_CLASSPATH=$HADOOP_HOME:$HADOOP_HOME/conf

    Load .profile File

    $> . .profile


  • 11

    Running PIG

    PIG has two run modes:

    Local Mode (uses the local filesystem)

    MapReduce Mode (uses HDFS)

    Ways to use PIG Latin:

    Grunt Shell

    PIG Script

    Embedded Programs

    *All three ways can run in both modes.


  • 12

    Grunt Shell

    Interactive Shell for PIG.

    Local Mode -

    $> pig -x local

    MapReduce Mode -

    $> pig

    OR

    $> pig -x mapreduce

    NOTE - Hadoop should be UP for MapReduce Mode.

    For either mode, the Grunt shell is invoked and you can enter commands at the prompt. The results are displayed on your terminal screen (if DUMP is used) or written to a file (if STORE is used).


  • 13

    Script File

    To run PIG Commands as batch Jobs.

    Local Mode

    $> pig -x local testScript.pig

    Mapreduce Mode

    $> pig testScript.pig

    OR

    $> pig -x mapreduce testScript.pig

    For either mode, the Pig Latin statements are executed and the results are displayed on your terminal screen (if DUMP is used) or written to a file (if STORE is used).
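
    As a rough illustration, a script file such as testScript.pig might contain a few statements like the ones below. The contents are hypothetical: the 'input' and 'output' paths and the field names are assumptions borrowed from the later examples in this deck, not the actual training script.

    ```pig
    -- testScript.pig (hypothetical contents, for illustration only)
    -- Load tab-separated records with an assumed three-field schema.
    A = load 'input' as (username : chararray, userid : chararray, location : chararray);

    -- Keep only two of the fields.
    B = foreach A generate userid, location;

    -- STORE writes the result to a directory instead of the terminal.
    store B into 'output';
    ```

    Because the script ends with STORE rather than DUMP, nothing is printed to the terminal; the result lands in the 'output' directory of the local filesystem or HDFS, depending on the run mode.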


  • 14

    Embedded Programs

    Embed Pig commands in a host language such as Java.

    Run the program.

    Local Mode

    Compile Program

    $> javac -cp pig.jar testLocal.java

    Note: testLocal.class is written to your current working directory. Include . in the class path when you run the program.

    Run the Program

    $> java -cp pig.jar:. testLocal


  • 15

    Embedded Programs (Contd..)

    MapReduce Mode

    Point $HADOOPDIR to the directory that contains the hadoop-site.xml file.

    $> export HADOOPDIR=$HADOOP_HOME/conf

    Compile the program:

    $> javac -cp pig.jar testMapreduce.java

    Note: testMapreduce.class is written to your current working directory. Include . in the class path when you run the program.

    Run the program:

    $> java -cp pig.jar:.:$HADOOPDIR testMapreduce


  • Data Model


  • 17

    Data Types

    Scalar Types

    int

    long

    float

    double

    chararray

    bytearray


  • 18

    Data Types (Contd..)

    Complex Types

    Can contain data of any type, including other complex types.

    Map

    A mapping from chararray keys to data elements.

    Key : chararray

    Element : any Pig type (scalar or complex)

    Tuple

    Ordered collection of Pig data elements.

    Analogous to a row in SQL.

    General Tuple :: t : (field1 , field2)

    A field can be any Pig type.

    Schema definition is not necessary.


  • 19

    Data Types (Contd..)

    Bag

    Unordered collection of tuples.

    Schema definition is not necessary.

    General Bag :: b : { ('China', '01'), ('India', '02'), ('Brazil', '03') }

    A bag can spill to disk, so its maximum size is bounded by the available local disk space.
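
    To make the complex types more concrete, the sketch below reads a tuple and a map and pulls one value out of each. The file name, the field names, and the sample input line are assumptions made up for this illustration.

    ```pig
    -- Assumed input line (PigStorage text format, fields separated by a tab):
    --   (alice,u1)	[city#Kochi]
    A = load 'input' as (t : tuple(username : chararray, userid : chararray), m : map[]);

    -- t.username projects a field out of the tuple;
    -- m#'city' looks up the value stored under the key 'city' in the map.
    B = foreach A generate t.username, m#'city';

    DUMP B;
    ```

    A map lookup for a key that is not present yields null rather than an error.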


  • 20

    Null Concept

    Concept same as in SQL.

    NULL refers to an UNKNOWN value.


  • 21

    Schemas

    Pig is flexible in terms of schema.

    Defining a schema is not necessary.

    If no schema is given, PIG makes a best guess at the types.

    Example

    stockData = LOAD '/Stock' AS (stockName : chararray, stockCode : chararray, currentValue : float, high : float, low : float);


  • 22

    Schemas Syntax
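
    The original content of this slide is not present in the transcript. As a stand-in, the sketch below shows the usual forms a schema declaration can take in a LOAD statement. The file names and field names are assumptions chosen for illustration.

    ```pig
    -- Field names with types.
    A = load 'input' as (name : chararray, age : int, gpa : float);

    -- Field names only; each field defaults to bytearray.
    B = load 'input' as (name, age, gpa);

    -- No schema at all; fields are referenced by position ($0, $1, ...).
    C = load 'input';

    -- Complex types: a bag of (subject, mark) tuples and an untyped map.
    D = load 'students' as (name : chararray,
                            marks : bag{t : (subject : chararray, mark : int)},
                            info : map[]);
    ```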


  • 23

    Cast

    Cast syntax is the same as in Java: the target type in parentheses before the value.
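
    A short sketch of casts in practice, reusing the 'input1' relation from the Join slide; the choice of target types here is only for illustration.

    ```pig
    -- Cast a typed field: age is declared int and is widened to long.
    A = load 'input1' as (name : chararray, age : int);
    B = foreach A generate name, (long)age;

    -- Cast untyped fields: without a schema every field is a bytearray,
    -- so a cast gives each positional field a concrete type.
    C = load 'input1';
    D = foreach C generate (chararray)$0, (int)$1;

    DUMP D;
    ```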


  • PIG Latin


  • 25

    Basics

    A relation is referred to by its alias.

    A relation can be re-used.

    Relation and field names start with an alphabetic character.

    Both kinds of name are composed of alphabetic, numeric, and _ characters.

    PIG keywords are not case-sensitive.

    Relation and field names are case-sensitive.

    Comments

    Single-line, using ' -- ..... '

    Multi-line, using ' /* .... */ '


  • 26

    LOAD Data

    Using LOAD keyword.

    Different load functions :

    PigStorage();

    HBaseStorage();

    DBStorage();

    File = LOAD 'testData' USING PigStorage();

    File = LOAD 'testData' USING HBaseStorage();

    File = LOAD 'testData' USING DBStorage();


  • 27

    STORE Data

    Using STORE keyword.

    Different store functions :

    PigStorage();

    HBaseStorage();

    DBStorage();

    STORE File INTO 'testData' USING PigStorage();

    STORE File INTO 'testData' USING HBaseStorage();

    STORE File INTO 'testData' USING DBStorage();

    DUMP - Used to print output to terminal.


  • 28

    Relation Operations

    These are the operators used to relate data, extract information, etc.

    Relational Operators in PIG

    Foreach

    Filter

    Group

    Order by

    Distinct

    Join

    Limit

    Sample

    Parallel

    Flatten


  • 29

    Foreach

    Applies expression(s) to each record.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = foreach A generate userid , location ;

    DUMP B;

    When no schema is declared, refer to fields by position:

    A = load 'input' ;

    B = foreach A generate $0 ;

    DUMP B;


  • 30

    Group

    Collects records with the same value of a given key into a bag.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = group A by location ;

    C = group A all ;

    DUMP C;
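
    Grouping is usually followed by an aggregate over each bag. The sketch below counts the records per location, using the built-in COUNT function and the same assumed 'input' schema as the slide.

    ```pig
    A = load 'input' as (username : chararray, userid : chararray, location : chararray);

    -- Each record of B has two fields: 'group' (the location value)
    -- and 'A' (a bag holding every input record with that location).
    B = group A by location;

    -- Count the records in each bag.
    C = foreach B generate group, COUNT(A);

    DUMP C;
    ```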


  • 31

    Filter

    Filters records in the data pipeline.

    If the predicate evaluates to true, the record is passed down the pipeline.

    A predicate can contain comparison operators:

    == , != , > , >= , < , <=
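
    The transcript has no example for this slide, so here is a minimal sketch in the style of the other operator slides. The 'input' schema is the one used throughout the deck; the literal values 'Kochi' and 'a.*' are assumptions for illustration.

    ```pig
    A = load 'input' as (username : chararray, userid : chararray, location : chararray);

    -- Keep only the records whose location equals 'Kochi'.
    B = filter A by location == 'Kochi';

    -- 'matches' tests a chararray against a regular expression;
    -- the pattern must match the whole value, here any username starting with 'a'.
    C = filter A by username matches 'a.*';

    DUMP B;
    ```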

  • 32

    Order by

    Sorts the data.

    A = load 'input' as (username : chararray, userid : chararray, location : chararray) ;

    B = order A by location ;

    C = order A by username, userid ;

    DUMP C;


  • 33

    Distinct

    Removes duplicate records.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = distinct A ;

    DUMP B;


  • 34

    Join

    Joins records from different inputs according to keys provided.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = load 'input1' as (name : chararray , age : int);

    C = join A by username , B by name ;

    DUMP C;


  • 35

    Limit

    Limits the number of records returned.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = limit A 7 ;

    DUMP B;


  • 36

    Sample

    Provides sample of input data.

    You can supply what percentage of data you want to see.

    The sample size is given as a fraction : 0 to 1 (for 0% to 100%)

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = sample A 0.2 ;

    DUMP B;


  • 37

    Parallel

    Controls reduce-side parallelism.

    A = load 'input' as (username : chararray , userid : chararray , location : chararray) ;

    B = group A by location parallel 5;

    DUMP B;

    Parallel will cause the MapReduce job to have 5 reducers.


  • 38

    User Defined Functions - UDFs

    PIG allows user-written code to be plugged in as functions.

    Code can be written in Java or Python.

    Piggybank is a package of UDFs that is shipped with PIG.

    To use UDFs :

    Register your UDF.

    register '/path/piggybank.jar' ;

    Call your UDF.

    A = load 'input' as (userid : chararray , location : chararray) ;

    B = foreach A generate org.apache.pig.piggybank.evaluation.string.Reverse(location);

    DUMP B;
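
    Typing the fully qualified class name on every call gets tedious. One common alternative, sketched here with the same placeholder jar path, is to give the UDF a short alias with DEFINE; the alias name Reverse is an arbitrary choice.

    ```pig
    register '/path/piggybank.jar';

    -- Bind a short alias to the fully qualified UDF class.
    define Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

    A = load 'input' as (userid : chararray, location : chararray);

    -- Call the UDF through its alias.
    B = foreach A generate Reverse(location);

    DUMP B;
    ```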


  • Thank You
