04 pig data operations

Apache Pig Data Operations

Transcript of 04 pig data operations

Page 1: 04 pig data operations

Apache Pig Data Operations

Page 2: 04 pig data operations

An Example

• Let’s look at a simple example by writing a program in Pig Latin to calculate the maximum recorded temperature by year for the weather dataset.

• Data (for simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields):

YEAR  TEMPERATURE  QUALITY
1950  0            1
1950  22           1
1950  -11          1
1949  111          1
1949  78           1

• Start up Grunt in local mode, then enter the first line of the Pig script:

records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);

Page 3: 04 pig data operations

An Example

• This line describes the input data we want to process.

• The year:chararray notation describes the field’s name and type; a chararray is like a Java String, and an int is like a Java int.

• The LOAD operator takes a URI argument; here we are just using a local file, but we could refer to an HDFS URI.

• The AS clause (which is optional) gives the fields names to make it convenient to refer to them in subsequent statements.

• The result of the LOAD operator, indeed any operator in Pig Latin, is a relation, which is just a set of tuples.

• A tuple is just like a row of data in a database table, with multiple fields in a particular order.

• In this example, the LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file.

• We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses: (1950,0,1)

records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);

Page 4: 04 pig data operations

An Example

• Relations are given names, or aliases, so they can be referred to.

• This relation is given the records alias.

• We can examine the contents of an alias using the DUMP operator:

DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Page 5: 04 pig data operations

An Example

• We can also see the structure of a relation (the relation’s schema) using the DESCRIBE operator on the relation’s alias:

DESCRIBE records;
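For the schema declared above, this prints something like:

records: {year: chararray,temperature: int,quality: int}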

• The second statement, FILTER (shown at the bottom of this page), removes records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality reading.

• For this small dataset, no records are filtered out.

• The third statement uses the GROUP function to group the records relation by the year field.

• Let’s use DUMP grouped_records; to see what it produces.

• Let’s use DESCRIBE grouped_records; to see the structure of grouped_records.

filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

grouped_records = GROUP filtered_records BY year;
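For the sample data, DUMP grouped_records; and DESCRIBE grouped_records; produce output along these lines (group order may vary):

(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}}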

Page 6: 04 pig data operations

An Example

• We now have two rows, or tuples, one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year.

• A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

• So now all that remains is to find the maximum temperature for the tuples in each bag.

• FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row.

• In this example, the first field is group, which is just the year.

• The second field, filtered_records.temperature, refers to the temperature field of the filtered_records bag in the grouped_records relation.

• MAX is a built-in function for calculating the maximum value of fields in a bag.

max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
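Running DUMP max_temp; on the sample data yields one (year, maximum temperature) tuple per year:

(1949,111)
(1950,22)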

Page 7: 04 pig data operations

Pig Latin

• Supports read-only data analysis workloads that are scan-centric; no transactions!

• Fully nested data model.

– Does not satisfy First normal form!

– By definition will violate the other normal forms.

• Extensive support for user-defined functions.

– UDFs are first-class citizens.

• Manages plain input files without any schema information.

• A novel debugging environment.

Page 8: 04 pig data operations

Nested data/set model

• The nested set model is a particular technique for representing nested sets (also known as trees or hierarchies) in relational databases.

Page 9: 04 pig data operations

Why Nested Data Model?

• Closer to how programmers think and more natural to them.

– E.g., to capture information about the positional occurrences of terms in a collection of documents, a programmer may create a structure of the form Idx<documentId, Set<positions>> for each term.

– Normalization of the data creates two tables:

Term_info: (TermId, termString, ...)
Pos_info: (TermId, documentId, position)

– Obtain positional occurrences by joining these two tables on TermId and grouping on <TermId, documentId>.

Page 10: 04 pig data operations

Why Nested Data Model?

• Data is often stored on disk in an inherently nested fashion.

– A web crawler might output, for each url, the set of outlinks from that url.

• A nested data model justifies a new algebraic language!

• Easier adoption by programmers, because it is easier to write user-defined functions over nested data.

Page 11: 04 pig data operations

Dataflow Language

• The user specifies a sequence of steps, where each step performs only a single, high-level data transformation. This style is similar to relational algebra, and its procedural flavor is desirable for programmers.

• With SQL, the user instead specifies a set of declarative constraints. This non-procedural style is desirable for non-programmers.

Page 12: 04 pig data operations

Dataflow Language: Example

• A high-level program that specifies a query execution plan.

• Example: Suppose we have a table urls: (url, category, pagerank). Consider a simple query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category, first in SQL and then in Pig Latin.
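A sketch of both versions, following the example in the original Pig Latin paper (the 0.2 pagerank cutoff and 1,000,000 group-size threshold are the paper's; aliases are renamed slightly to avoid Pig reserved words). In SQL:

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000;

In Pig Latin (after GROUP, the grouping field is named group and holds the category value):

good_urls = FILTER urls BY pagerank > 0.2;
grouped = GROUP good_urls BY category;
big_groups = FILTER grouped BY COUNT(good_urls) > 1000000;
result = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank);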

Page 13: 04 pig data operations

Lazy Execution

• Database-style optimization through lazy evaluation of expressions.

• Example

Recall urls: (url, category, pagerank)

Set of urls of pages that are classified as spam and have a high pagerank score.

1. Spam_urls = FILTER urls BY isSpam(url);

2. Culprit_urls = FILTER Spam_urls BY pagerank > 0.8;

Optimized execution (since isSpam is presumably an expensive UDF, the cheap pagerank filter is applied first so the UDF runs on far fewer tuples):

1. HighRank_urls = FILTER urls BY pagerank > 0.8;

2. Culprit_urls = FILTER HighRank_urls BY isSpam(url);


Page 15: 04 pig data operations

Quick Start/Interoperability

• To process a file, the user provides a function that gives Pig the ability to parse the content of the file into records.

• Output of a Pig program is formatted based on a user-defined function.

• Why don’t conventional DBMSs do the same? (They require importing data into system-managed tables.) Several reasons:

– To enable transactional consistency guarantees,

– To enable efficient point lookups (RIDs),

– To curate data on behalf of the user, and record the schema so that other users can make sense of the data.

Page 16: 04 pig data operations

Pig Latin - Simple Data Types

• Pig Latin statements work with relations:

– A relation is a bag (an outer bag).
– A bag is a collection of tuples.
– A tuple is an ordered set of fields.
– A field can be any simple or complex data type (this is what supports the nested data model).

• Simple data types:

– int => signed 32-bit integer => 10
– long => signed 64-bit integer => 10L
– float => 32-bit floating point => 10.5f, 10.5e2f
– double => 64-bit floating point => 10.5, 10.5e2
– Arrays:
• chararray => string in UTF-8 => 'Hello World'
• bytearray => byte array (blob)

Page 17: 04 pig data operations

Data Model

• Consists of four types:

– Atom: Contains a simple atomic value such as a string or a number, e.g., ‘Joe’.

– Tuple: Sequence of fields, each of which might be any data type, e.g., (‘Joe’, ‘lakers’)

– Bag: A collection of tuples with possible duplicates. Schema of a bag is flexible.

– Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms. Flexibility enables data to change without re-writing programs.
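As a quick illustration (made-up values, written in the literal syntax used elsewhere in these slides):

Atom:  'Joe'
Tuple: ('Joe', 'lakers')
Bag:   {('Joe', 'lakers'), ('Joe', 'kings')}
Map:   ['fanOf'#'lakers', 'age'#20]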

Page 18: 04 pig data operations

A Comparison with Relational Algebra

• Pig Latin:
– Everything is a bag.
– Dataflow language.

• Relational Algebra:
– Everything is a table.
– Dataflow language.

Page 19: 04 pig data operations

Pig Latin – NULL support

• Same as the SQL definition: unknown or non-existent.

• NULL can be used as a constant expression in place of an expression of any type.

• If certain fields in the data are missing, it is the load/store function’s responsibility to insert NULLs.
– E.g., the text loader returns NULL in place of empty strings in the data.

• Operations that produce NULL:
– Divide by zero.
– Dereferencing a field or map key that does not exist.
– UDFs can return NULL.

• NULLs and operators:
– Comparison, matches, cast, and dereferencing return NULL if one of the inputs is NULL.
– The AVG, MIN, MAX, and SUM functions ignore NULLs.
– The COUNT function counts values including NULLs.
– If a FILTER expression evaluates to NULL, the record is rejected.

Page 20: 04 pig data operations

Expressions in Pig Latin

Page 21: 04 pig data operations

Expressions

Suppose a relation A (the name of an outer bag) is loaded as follows (note: the bag and tuple keywords in the schema are optional):

A = LOAD 'data.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

and contains the single tuple:

(1, {(2,3), (4,6)}, ['yahoo'#'mail'])

• Field referred to by position: A.$0 = 1
• Field referred to by name: A.f1 = 1; likewise f2 can be referred to as A.f2 or A.$1, and f3 as A.f3 or A.$2
• A.f2 = A.$1 = {(2,3), (4,6)}
• Projection of a data item: A.f2.$0 = {(2), (4)}
• Map lookup: A.f3#'yahoo' = 'mail'
• Function application: SUM(A.f2.$0) = 6 and COUNT(A.f2) = 2L

Page 22: 04 pig data operations

Comparison Operators

The examples below use the relation from the previous slide, with fields f1 (or $0), f2 (or $1), and f3 (or $2) and the tuple (1, {(2,3), (4,6)}, ['yahoo'#'mail']).

• Numerical comparison (==, !=, >, >=, <, <=):
f1 > 5

• Map lookup combined with comparison:
f3#'yahoo' == 'mail'

• Regular expression matching (matches):
f3#'yahoo' matches '(?i)MAIL'

• Logical operators (AND, OR, NOT):
f1 == 1 AND f3#'yahoo' eq 'mail'

• Conditional expression (aka bincond), (condition ? exp1 : exp2):
f3#'yahoo' matches '(?i)MAIL' ? 'matched' : 'notmatched'

Page 23: 04 pig data operations

Pig Built-in Functions

• Pig has a variety of built-in functions of each kind:

– Storage:
• TextLoader: for loading unstructured text files; each line is loaded as a tuple with a single field, which is the entire line.

– Filter:
• isEmpty: tests whether a bag is empty.

– Eval functions:
• COUNT: computes the number of elements in a bag.
• SUM: computes the sum of the numeric values in a single-column bag.
• AVG: computes the average of the numeric values in a single-column bag.
• MIN/MAX: compute the min/max of the numeric values in a single-column bag.
• SIZE: returns the size of any datum, e.g., of a map.
• CONCAT: concatenates two chararrays or two bytearrays.
• TOKENIZE: splits a string and outputs a bag of words.
• DIFF: compares the fields of a tuple of size 2.

Page 24: 04 pig data operations

Specifying Input Data

• Use the LOAD command to specify the input data file (see the statement after this list).
• The input file here is query_log.txt.
• The input file is converted into tuples using the myLoad deserializer.
• The loaded tuples have 3 fields.
• The USING and AS clauses are optional.
– Without USING, a default serializer that expects a plain-text, tab-delimited file is used.
– Without AS there is no schema, and fields are referenced by position ($0, $1, ...).
• The return value, assigned to queries, is a handle to a bag.
– queries can be used as input to subsequent Pig Latin expressions.
– Handles such as queries are logical: no data is actually read and no processing is carried out until an instruction explicitly asks for output (STORE).
– Think of it as a “logical view”.
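A sketch of the statement these bullets describe, following the Pig Latin paper (myLoad is the user-supplied deserializer):

queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);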

Page 25: 04 pig data operations

FOREACH

• Once input data file(s) have been specified through LOAD, one can specify the processing that needs to be carried out on the data.

• One of the basic operations is that of applying some processing to every tuple of a data set.

• This is achieved through the FOREACH command. For example:
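A sketch of such a statement, following the Pig Latin paper (expandQuery is a UDF, discussed on the next page):

expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);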

• The above command specifies that each tuple of the bag queries (loaded by previous command) should be processed independently to produce an output tuple.

• The first field of the output tuple is the userId field of the input tuple, and the second field of the output tuple is the result of applying the UDF expandQuery to the queryString field of the input tuple.

Page 26: 04 pig data operations

Per-tuple Processing with FOREACH

• Suppose the UDF expandQuery generates a bag of likely expansions of a given query string.

• Then the FOREACH statement on the previous page transforms each input tuple into an output tuple whose second field is a bag of likely expansions of that tuple’s query string.

• Semantics:

– There is no dependence between the processing of different tuples of the input, which allows parallelism!

– GENERATE can be followed by a list of arbitrary expressions.

Page 27: 04 pig data operations

FOREACH & Flattening

• To eliminate nesting in data, use FLATTEN.

• FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and makes them fields of the tuple being output by GENERATE, removing one level of nesting.
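A sketch following the paper, flattening the query expansions from the previous pages; the before/after comments use hypothetical data:

expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));

-- before flattening: (alice, {(lakers rumors), (lakers news)})
-- after flattening:  (alice, lakers rumors) and (alice, lakers news)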

Page 28: 04 pig data operations

Discarding Unwanted Data: FILTER

• Identical to the select operator of relational algebra.

• Syntax:
– FILTER bag-id BY expression

• An expression is one of:

field-name op constant
field-name op UDF

where op is one of ==, eq, !=, neq, <, >, <=, >=.

• A comparison may combine several expressions with the boolean operators (AND, OR, NOT).

• For example, to get rid of bot traffic in the bag queries:
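A sketch following the paper, assuming bots identify themselves with the literal userId 'bot':

real_queries = FILTER queries BY userId neq 'bot';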

• Since arbitrary expressions are allowed, it follows that we can use UDFs while filtering.

• Thus, in our less ideal world, where bots don’t identify themselves, we can use a sophisticated UDF (isBot) to perform the filtering, e.g.:
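A sketch following the paper, with isBot a user-defined function:

real_queries = FILTER queries BY NOT isBot(userId);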

Page 29: 04 pig data operations

A Comparison with Relational Algebra

• Pig Latin:
– Everything is a bag.
– Dataflow language.
– FILTER is the same as the select operator.

• Relational Algebra:
– Everything is a table.
– Dataflow language.
– The select operator is the same as the FILTER command.

Page 30: 04 pig data operations

Grouping related data

• COGROUP groups together tuples from one or more data sets that are related in some way.

• Example:

– For example, suppose we have two data sets that we have specified through a LOAD command:

– Results contains, for different query strings, the urls shown as search results and the position at which they are shown.

– Revenue contains, for different query strings and different ad slots, the average amount of revenue made by the ad for that query string at that slot.

– Then to group together all search result data and revenue data for the same query string, we can write:
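A sketch of these statements, following the Pig Latin paper:

results = LOAD 'results.txt' USING myLoad() AS (queryString, url, position);
revenue = LOAD 'revenue.txt' USING myLoad() AS (queryString, adSlot, amount);
grouped_data = COGROUP results BY queryString, revenue BY queryString;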

Page 31: 04 pig data operations

COGROUP

• The output of a COGROUP contains one tuple for each group:
– The first field of the tuple, named group, is the group identifier.
– Each of the next fields is a bag, one for each input being cogrouped, and is named the same as the alias of that input.
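With hypothetical data, one output tuple of that COGROUP might look like (one results bag and one revenue bag per query string):

(lakers, {(lakers, nba.com, 1), (lakers, espn.com, 2)},
         {(lakers, top, 50), (lakers, side, 20)})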

Page 32: 04 pig data operations

COGROUP is not JOIN

• Grouping can be performed according to arbitrary expressions which may include UDFs.

• Grouping is different from JOIN.

• It is evident that JOIN is equivalent to COGROUP, followed by taking a cross product of the tuples in the nested bags. While joins are widely applicable, certain custom processing might require access to the tuples of the groups before the cross-product is taken.

Page 33: 04 pig data operations

Example

• Suppose we were trying to attribute search revenue to search-result urls, to figure out the monetary worth of each url. We might have a sophisticated model for doing so. To accomplish this task in Pig Latin, we can follow the COGROUP with the following statement:
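A sketch of the statement, following the paper:

url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));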

• Here distributeRevenue is a UDF that accepts search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

• For example, distributeRevenue might attribute revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.

Page 34: 04 pig data operations

Example…

• Assign search revenue to search-result urls to figure out the monetary worth of each url. A UDF, distributeRevenue attributes revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.

Page 35: 04 pig data operations

WITH JOIN

• To specify the same operation in SQL, one would have to join by queryString, then group by queryString, and then apply a custom aggregation function.

• But while doing the join, the system would compute the cross product of the search and revenue information, which the custom aggregation function would then have to undo.

• Thus, the whole process becomes quite inefficient, and the query becomes hard to read and understand.

Page 36: 04 pig data operations

Special Case of COGROUP: GROUP

• A special case of COGROUP when there is only one data set involved.

• Example: Find the total revenue for each query string.
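A sketch of the two statements, following the paper (the grouping field produced by GROUP is named group, so it is renamed here with AS so the script runs in actual Pig):

grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue GENERATE group AS queryString, SUM(revenue.amount) AS totalRevenue;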

• In the second statement above, revenue.amount refers to a projection of the nested bag in the tuples of grouped_revenue.

• Also, as in SQL, the AS clause is used to assign names to fields on the fly.

• To group all tuples of a data set together (e.g., to compute the overall total revenue), one uses the syntax GROUP revenue ALL.

Page 37: 04 pig data operations

JOIN

• Pig Latin supports equi-joins.

• It is easy to verify that JOIN is only a syntactic shortcut for COGROUP followed by flattening.

• For example, the following JOIN command is equivalent to the COGROUP-plus-FLATTEN sequence shown after it:
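A sketch of both forms, following the paper:

join_result = JOIN results BY queryString, revenue BY queryString;

-- equivalent to:
temp_var = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);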

Page 38: 04 pig data operations

MapReduce in Pig Latin

• With the GROUP and FOREACH statements, it is trivial to express a map-reduce program in Pig Latin.

• Converting to our data-model terminology, a map function operates on one input tuple at a time, and outputs a bag of key-value pairs.

• The reduce function then operates on all values for a key at a time to produce the final result.
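A sketch following the paper, where map and reduce are UDFs and inputs is the input relation (the paper names the aliases input and output, which are reserved words in Pig and are renamed here):

map_result = FOREACH inputs GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
final = FOREACH key_groups GENERATE reduce(*);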

• The first line applies the map UDF to every tuple of the input, and flattens the bag of key-value pairs that it produces.

• We use the shorthand * as in SQL to denote that all the fields of the input tuples are passed to the map UDF.

• Assuming the first field of the map output to be the key, the second statement groups by key.

• The third statement then passes the bag of values for every key to the reduce UDF to obtain the final result.

Page 39: 04 pig data operations

Other Commands

• Pig Latin has a number of other commands that are very similar to their SQL counterparts. These are:

– UNION: Returns the union of two or more bags.

– CROSS: Returns the cross product of two or more bags.

– ORDER: Orders a bag by the specified field(s).

– DISTINCT: Eliminates duplicate tuples in a bag. This command is just a shortcut for grouping the bag by all fields, and then projecting out the groups.

Page 40: 04 pig data operations

Asking for Output: STORE

• The user can ask for the result of a Pig Latin expression sequence to be materialized to a file by issuing the STORE command, e.g.:
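A sketch of the command, following the paper (myStore is a custom serializer):

STORE query_revenues INTO 'myoutput' USING myStore();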

• The above command specifies that bag query_revenues should be serialized to the file myoutput using the custom serializer myStore.

• As with LOAD, the USING clause may be omitted, in which case a default serializer that writes plain-text, tab-delimited files is used.

• Pig also comes with a built-in serializer/deserializer that can load/store arbitrarily nested data.

Page 41: 04 pig data operations

Word Count using Pig

myinput = LOAD 'input.txt' USING TextLoader() AS (text_line:chararray);
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(text_line));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'pigoutput' USING PigStorage();

The output is written to HDFS as pigoutput/part-* files.

Page 42: 04 pig data operations

Build Inverted Index

• Load each file as lines of text (string:chararray).

• Associate filenames with their string representation.

• Union all the entries <filename, string>.

• For each entry, tokenize the string to generate <filename, word> tuples.

• Group by word:

– <word1, {(filename1, word1), (filename2, word1), ...}>

– For each group, take the records with distinct filenames from the associated bag.

– Generate <word1, {(filename1), (filename2), ...}>.

• Store it.

Page 43: 04 pig data operations

Build Inverted Index

t1 = LOAD 'input1.txt' USING TextLoader() AS (string:chararray);
t2 = FOREACH t1 GENERATE 'input1.txt' AS fname, string;
t3 = LOAD 'input2.txt' USING TextLoader() AS (string:chararray);
t4 = FOREACH t3 GENERATE 'input2.txt' AS fname, string;
text = UNION t2, t4;
words = FOREACH text GENERATE fname, FLATTEN(TOKENIZE(string));
word_groups = GROUP words BY $1;

-- Nested FOREACH: the block below runs once per word group.
index = FOREACH word_groups {
    files = DISTINCT $1.$0;                  -- distinct filenames within the group's bag
    GENERATE $0, COUNT(files) AS cnt, files; -- <word, count, {(filename), ...}>
};

STORE index INTO 'inverted_index' USING PigStorage();

Page 44: 04 pig data operations

End of session

Day – 3: Apache Pig Data Operations