IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop


Crunch Big Data in the Cloud with IBM BigInsights and Hadoop IBD-3475

Leons Petrazickis, IBM Canada

@leonsp

© 2013 IBM Corporation

Please note

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

First step

Request a lab environment

http://bit.ly/requestLab

BigDataUniversity.com

Hadoop Architecture

Agenda

• Terminology review

• Hadoop architecture

– HDFS

– Blocks

– MapReduce

– Types of nodes

– Topology awareness

– Writing a file to HDFS

6

7

Terminology review

(Diagram: a Hadoop cluster consists of racks, Rack 1 through Rack n; each rack contains multiple nodes, Node 1 through Node n.)

Hadoop architecture

8

• Two main components:

– Hadoop Distributed File System (HDFS)

– MapReduce engine

Hadoop distributed file system (HDFS)

9

• The Hadoop file system, which runs on top of the existing OS file system

• Designed to handle very large files with streaming data access patterns

• Uses blocks to store a file or parts of a file

HDFS - Blocks

10

• File Blocks

– 64MB (default), 128MB (recommended) – compare to 4KB in UNIX

– Behind the scenes, 1 HDFS block is supported by multiple operating system (OS) blocks

• Advantages of blocks:

– Fixed size – easy to calculate how many fit on a disk

– A file can be larger than any single disk in the network

– If a file or a chunk of the file is smaller than the block size, only the needed space is used. Eg: a 420MB file is split as 128MB + 128MB + 128MB + 36MB

• Fits well with replication to provide fault tolerance and availability

(Diagram: behind the scenes, one 128MB HDFS block is backed by multiple OS blocks.)

HDFS - Replication

• Blocks with data are replicated to multiple nodes

• Allows for node failure without data loss

11

(Diagram: each block is replicated across Node 1, Node 2, and Node 3.)

MapReduce engine

12

• Technology from Google

• A MapReduce program consists of map and reduce functions

• A MapReduce job is broken into tasks that run in parallel

Types of nodes - Overview

13

• HDFS nodes

– NameNode

– DataNode

• MapReduce nodes

– JobTracker

– TaskTracker

• There are other nodes not discussed in this course

Types of nodes - Overview

14

Types of nodes - NameNode

15

• NameNode

– Only one per Hadoop cluster

– Manages the filesystem namespace and metadata

– Single point of failure, but mitigated by writing state to multiple filesystems

– Because it is a single point of failure, don't use inexpensive commodity hardware for this node; it also has large memory requirements

Types of nodes - DataNode

16

• DataNode

– Many per Hadoop cluster

– Manages blocks with data and serves them to clients

– Periodically reports to the NameNode the list of blocks it stores

– Use inexpensive commodity hardware for this node

Types of nodes - JobTracker

17

• JobTracker node

– One per Hadoop cluster

– Receives job requests submitted by clients

– Schedules and monitors MapReduce jobs on TaskTrackers

Types of nodes - TaskTracker

18

• TaskTracker node

– Many per Hadoop cluster

– Executes MapReduce operations

– Reads blocks from DataNodes


Topology awareness

20

Bandwidth becomes progressively smaller in the following scenarios:

1. Process on the same node

2. Different nodes on the same rack

3. Nodes on different racks in the same data center

4. Nodes in different data centers

Writing a file to HDFS

25

(Slides 25-35: step-by-step animation of writing a file to HDFS.)

Thank You

What is Hadoop?

Agenda

38

• What is Hadoop?

• What is Big Data?

• Hadoop-related open source projects

• Examples of Hadoop in action

• Big Data solutions and the Cloud

What is Hadoop?

39

(Slides 39-46: diagrams contrasting a relational database handling 1GB, 10GB, 100GB of data with data volumes growing to 1TB, 10TB, 100TB and new sources such as RFIDs, sensors, Facebook, and Twitter.)

What is Hadoop?

47

• Open source project

• Written in Java

• Optimized to handle:

– Massive amounts of data through parallelism

– A variety of data (structured, unstructured, semi-structured)

– Using inexpensive commodity hardware

• Reliability provided through replication

• Not for OLTP, not for OLAP/DSS; good for Big Data

• Great performance

• Current version: 0.20.2

What is Big Data?

48

RFID Readers

What is Big Data?

49

2 Billion internet users

What is Big Data?

50

4.6 Billion mobile phones

What is Big Data?

51

7TB of data processed by Twitter every day


What is Big Data?

52

10TB of data processed by Facebook every day


What is Big Data?

53

About 80% of this data is unstructured

Examples of Hadoop in action – IBM Watson

55

Examples of Hadoop in action

56

• In the telecommunication industry

• In the media

• In the technology industry

Hadoop is not for all types of work

57

• Not to process transactions (random access)

• Not good when work cannot be parallelized

• Not good for low latency data access

• Not good for processing lots of small files

• Not good for intensive calculations with little data

Big Data solutions and the Cloud

58

• Big Data solutions are more than just Hadoop

– Add business intelligence/analytics functionality

– Derive information from data in motion

• Big Data solutions and the Cloud are a perfect fit.

– The Cloud allows you to set up a cluster of systems in minutes and it’s relatively inexpensive.

Thank You

HDFS – Command Line

Agenda

• HDFS Command Line Interface

• Examples

61

HDFS Command line interface

62

• File System Shell (fs)

• Invoked as follows:

hadoop fs <args>

• Example:

Listing the current directory in HDFS:

hadoop fs -ls .

HDFS Command line interface

63

• FS shell commands take path URIs as arguments

• URI format:

scheme://authority/path

• Scheme:

• For the local filesystem, the scheme is file

• For HDFS, the scheme is hdfs

hadoop fs -copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt

• Scheme and authority are optional

• Defaults are taken from configuration file core-site.xml

HDFS Command line interface

64

• Many POSIX-like commands

• cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail

• Some HDFS-specific commands

• copyFromLocal, copyToLocal, get, getmerge, put, setrep

HDFS – Specific commands

65

• copyFromLocal / put

• Copy files from the local file system into fs

hadoop fs -copyFromLocal <localsrc> ... <dst>

or

hadoop fs -put <localsrc> ... <dst>

HDFS – Specific commands

66

• copyToLocal / get

• Copy files from fs into the local file system

hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst>

or

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

HDFS – Specific commands

67

• getMerge

• Get all the files in the directories that match the source file pattern

• Merge and sort them to only one file on local fs

• <src> is kept

hadoop fs -getmerge <src> <localdst>

HDFS – Specific commands

68

• setRep

• Set the replication level of a file.

• The -R flag requests a recursive change of replication level for an entire tree.

• If -w is specified, waits until new replication level is achieved.

hadoop fs -setrep [-R] [-w] <rep> <path/file>
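These shell commands also have programmatic equivalents in the Java FileSystem API. The following is a minimal, illustrative sketch (not from the original slides); the paths reuse the /user/keith example from earlier, and the configuration is picked up from core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name and other settings from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // hadoop fs -copyFromLocal myfile.txt /user/keith/myfile.txt
    fs.copyFromLocalFile(new Path("myfile.txt"), new Path("/user/keith/myfile.txt"));

    // hadoop fs -ls .
    for (FileStatus status : fs.listStatus(fs.getWorkingDirectory())) {
      System.out.println(status.getPath());
    }

    // hadoop fs -setrep 2 /user/keith/myfile.txt
    fs.setReplication(new Path("/user/keith/myfile.txt"), (short) 2);
  }
}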

Thank You

Hadoop MapReduce

Agenda

71

• Map operations

• Reduce operations

• Submitting a MapReduce job

• Distributed Mergesort Engine

• Two fundamental data types

• Fault tolerance

• Scheduling

• Task execution

What is a Map operation?

72

• Doing something to every element in an array is a common operation:

var a = [1,2,3];

for (i = 0; i < a.length; i++)

a[i] = a[i] * 2;

What is a Map operation?

73

• Doing something to every element in an array is a common operation:

var a = [1,2,3];

for (i = 0; i < a.length; i++)

a[i] = a[i] * 2;

• New value for variable a would be:

var a = [2,4,6];

What is a Map operation?

74

• Doing something to every element in an array is a common operation:

var a = [1,2,3];

for (i = 0; i < a.length; i++)

a[i] = a[i] * 2;

• New value for variable a would be:

var a = [2,4,6];

The statement a[i] = a[i] * 2; can be written as a function.

What is a Map operation?

75

• Doing something to every element in an array is a common operation:

var a = [1,2,3];

for (i = 0; i < a.length; i++)

a[i] = fn(a[i]);

• New value for variable a would be:

var a = [2,4,6];

Here the loop body a[i] = a[i] * 2; has been replaced with a[i] = fn(a[i]);, where fn is a function defined as:

function fn(x) { return x*2; }

What is a Map operation?

76

• Doing something to every element in an array is a common operation:

var a = [1,2,3];

for (i = 0; i < a.length; i++)

a[i] = fn(a[i]);

Now, all of this can also be converted into a “map” function…

What is a Map operation?

77

• …like this, where fn is a function passed as an argument:

function map(fn, a) {

for (i = 0; i < a.length; i++)

a[i] = fn(a[i]);

}

What is a Map operation?

78

• …like this, where fn is a function passed as an argument:

function map(fn, a) {

for (i = 0; i < a.length; i++)

a[i] = fn(a[i]);

}

• You can invoke this map function like this:

map(function(x){return x*2;}, a);

What is a Map operation?

79

• …like this, where fn is a function passed as an argument:

function map(fn, a) {

for (i = 0; i < a.length; i++)

a[i] = fn(a[i]);

}

• You can invoke this map function like this:

map(function(x){return x*2;}, a);

This is function fn whose definition is included in the call

What is a Map operation?

80

• In summary, now you can rewrite:

for (i = 0; i < a.length; i++)

a[i] = a[i] * 2;

as a map operation:

map(function(x){return x*2;}, a);

What is a Reduce operation?

81

• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;

for (i = 0; i < a.length; i++)

s += a[i];

return s;

}

What is a Reduce operation?

82

• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;

for (i = 0; i < a.length; i++)

s += a[i];

return s;

}

This can be written as a function.

What is a Reduce operation?

83

• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;

for (i = 0; i < a.length; i++)

s = fn(s,a[i]);

return s;

}

Like this, where function fn is defined so it adds its arguments: function fn(a,b){ return a+b; }

What is a Reduce operation?

84

• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;

for (i = 0; i < a.length; i++)

s = fn(s, a[i]);

return s;

}

The whole function sum can also be rewritten so that fn is passed as an argument.

What is a Reduce operation?

85

• Another common operation on arrays is to combine all their values:

function reduce(fn, a, init) {

var s = init;

for (i = 0; i < a.length; i++)

s = fn(s, a[i]);

return s;

}

Like this… The function name was changed to reduce, and now it takes three arguments: a function, an array, and an initial value.

What is a Reduce operation?

86

• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;

for (i = 0; i < a.length; i++)

s += a[i];

return s;

}

This can be rewritten as a reduce operation:

reduce(function(a,b){return a+b;},a,0);


Submitting a MapReduce job

88

(Slides 88-97: step-by-step animation of submitting a MapReduce job.)
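The diagrams are not reproduced here, but the client side of job submission is compact. A rough sketch of a classic Hadoop 0.20-era driver (illustrative only; WordCountMapper and WordCountReducer are the hypothetical classes sketched later in this section):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");        // Job.getInstance(conf, ...) in later releases
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);    // hypothetical map class
    job.setReducerClass(WordCountReducer.class);  // hypothetical reduce class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submits the job to the JobTracker and waits for it to finish
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}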


MapReduce – Distributed Mergesort Engine

99

(Slides 99-109: step-by-step animation of MapReduce as a distributed mergesort engine.)


Two fundamental data types

111

• Key/value pairs

• Lists

Input and output types:

map: <k1, v1> -> list(<k2, v2>)

reduce: <k2, list(v2)> -> list(<k3, v3>)
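To make these types concrete, here is a minimal, illustrative word-count sketch (not from the original slides). The map receives <k1, v1> = <byte offset, line of text> and emits list(<k2, v2>) = list(<word, 1>); the reduce receives <k2, list(v2)> = <word, list of counts> and emits list(<k3, v3>) = list(<word, total>):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: <byte offset, line> -> list(<word, 1>)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// reduce: <word, list of counts> -> list(<word, total>)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}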

Simple data flow example

116

(Slides 116-120: step-by-step animation of a simple MapReduce data flow.)
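As an illustrative walk-through (not from the original slides), counting words in two input lines would flow like this:

map(0, "the quick fox") emits (the,1), (quick,1), (fox,1)
map(14, "the lazy dog") emits (the,1), (lazy,1), (dog,1)

The framework then sorts and groups the intermediate pairs by key:

(dog,[1]) (fox,[1]) (lazy,[1]) (quick,[1]) (the,[1,1])

and each reduce call sums one group: reduce(the, [1,1]) emits (the,2), and so on for every word.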


Fault tolerance

122

• Task failure

– If a child task fails, the child JVM reports to the TaskTracker before it exits. The attempt is marked as failed, freeing up a slot for another task.

– If the child task hangs, it is killed. The JobTracker reschedules the task on another machine.

– If a task continues to fail, the job is failed.

• TaskTracker failure

– The JobTracker receives no heartbeat.

– It removes the TaskTracker from the pool of TaskTrackers to schedule tasks on.

• JobTracker failure

– Single point of failure. The job fails.


Scheduling

133

• FIFO scheduler (with priorities)

– Each job uses the whole cluster, so jobs wait their turn.

• Fair scheduler

– Jobs are placed in pools. A user who submits more jobs than another user will not, on average, get more cluster resources than the other user. Custom pools with a guaranteed minimum capacity can be defined.

• Capacity scheduler

– Allows Hadoop to simulate, for each user, a separate MapReduce cluster with FIFO scheduling.

Task execution

140

• Speculative execution

– Job execution time is sensitive to slow-running tasks. Hadoop detects slow-running tasks and launches another, equivalent task as a backup. The output from whichever of the two finishes first is used.

• Task JVM reuse

– Tasks run in their own JVMs for isolation. Jobs that have a large number of short-lived tasks, or tasks with lengthy initialization, can benefit from sequential JVM reuse through configuration (see the sketch below).
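Both behaviors are controlled per job through configuration properties. A hedged sketch using the Hadoop 1.x / 0.20-era property names (later releases renamed them):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskExecutionTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Speculative execution can be toggled separately for map and reduce tasks
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

    // Task JVM reuse: -1 reuses one JVM for an unlimited number of a job's tasks
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    Job job = new Job(conf, "tuned job");
    // ... set mapper, reducer, and input/output paths as usual ...
  }
}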

Thank You

Pig, Hive, and JAQL

Agenda

147

• Overview

• Pig

• Hive

• Jaql


Similarities of Pig, Hive and Jaql

149

All translate their respective high-level languages to MapReduce jobs

All offer significant reductions in program size over Java

All provide points of extension to cover gaps in functionality

All provide interoperability with other languages

None support random reads/writes or low-latency queries

Comparing Pig, Hive, and Jaql

150

Pig: developed by Yahoo!; language: Pig Latin (data flow); operates on complex data structures; schema optional: yes; Turing complete when extended with Java UDFs.

Hive: developed by Facebook; language: HiveQL (declarative, SQL dialect); geared towards structured data; schema optional: no, but data can have many schemas; Turing complete when extended with Java UDFs.

Jaql: developed by IBM; language: Jaql (data flow); operates on loosely structured data, JSON; schema optional: yes; Turing complete: yes.

Agenda

151

• Overview

• Pig

• Hive

• Jaql

Pig components

• Two Components

Language (called Pig Latin)

Compiler

• Two execution environments

Local (Single JVM)

pig -x local

Distributed (Hadoop cluster)

pig -x mapreduce, or simply pig

152

Running Pig

Script

pig scriptfile.pig

Grunt (command line)

pig (to launch command line tool)

Embedded

Call in to Pig from Java

153

Pig Latin sample code

154

#pig

grunt> records = LOAD 'econ_assist.csv'
       USING PigStorage(',')
       AS (country:chararray, sum:long);

grunt> grouped = GROUP records BY country;

grunt> thesum = FOREACH grouped
       GENERATE group, SUM(records.sum);

grunt> DUMP thesum;

Pig Latin – Statements, operations & commands

155

(Diagram: a Pig Latin program is a sequence of statements. An operation such as LOAD 'input.txt' or DUMP is a statement, and a command such as ls *.txt is also a statement. The program is compiled into a logical plan, then a physical plan, and then executed.)

Pig Latin statements

• UDF statements: REGISTER, DEFINE

• Commands:

– Hadoop filesystem (cat, ls, etc.)

– Hadoop MapReduce (kill)

– Utility (exec, help, quit, run, set)

• Operators:

– Diagnostic: DESCRIBE, EXPLAIN, ILLUSTRATE

– Relational: LOAD, STORE, DUMP, FILTER, etc.

156

Pig Latin – Relational operators

Loading and storing

Eg: LOAD (into a program), STORE (to disk), DUMP (to the screen)

Filtering Eg: FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE

Grouping and joining Eg: JOIN, COGROUP, GROUP, CROSS

Sorting Eg: ORDER, LIMIT

Combining and splitting Eg: UNION, SPLIT

157

Pig Latin – Relations and schema

Result of a relational operator is a relation

A relation is a set of tuples

Relations can be named using an alias (Eg: “x”)

158

x = LOAD 'sample.txt' AS (id: int, year: int);

DUMP x;

Output is a tuple. Eg: (1,1987)

Pig Latin – Relations and schema

Structure of a relation is a schema

Use the DESCRIBE operator to see the schema. Eg:

DESCRIBE x;

The output is the schema:

x: {id: int, year: int}

159

Pig Latin expressions

Statements that contain relational operators may also contain expressions.

Kinds of expressions:

Constant, Field, Projection, Map lookup, Cast, Arithmetic, Conditional, Boolean, Comparison, Functional, Flatten

160

Pig Latin – Data types

• Simple types: int, long, float, double, chararray, bytearray

• Complex types:

– Tuple: sequence of fields of any type

– Bag: unordered collection of tuples

– Map: set of key-value pairs; keys must be chararray

161

Pig Latin – Function types

• Eval

– Input: one or more expressions

– Output: an expression

– Example: MAX

• Filter

– Input: bag or map

– Output: boolean

– Example: IsEmpty

162

Pig Latin – Function types

163

• Load

– Input: data from external storage

– Output: a relation

– Example: PigStorage

• Store

– Input: a relation

– Output: data to external storage

– Example: PigStorage

Pig Latin – User-Defined Functions

• Written in Java

Packaged in a JAR file

Register JAR file using the REGISTER statement

Optionally, alias it with DEFINE statement

164
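As a minimal, illustrative sketch (hypothetical class name, using the org.apache.pig.EvalFunc API), an eval UDF that upper-cases a chararray might look like this:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical eval UDF: returns its chararray argument in upper case
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

Packaged in a hypothetical myudfs.jar, it could then be used from Grunt roughly as: REGISTER myudfs.jar; followed by upper = FOREACH records GENERATE ToUpper(country);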

Agenda

165

• Overview

• Pig

• Hive

• Jaql

Hive architecture

166

(Diagram: DDL and queries arrive via JDBC/ODBC, the CLI, or the web interface; they pass through the parser, planner, and optimizer; metadata lives in the metastore, a relational database; execution runs on Hadoop.)

Running Hive

Hive shell

– Interactive: hive

– Script: hive -f myscript

– Inline: hive -e 'SELECT * FROM mytable'

167

Hive services

hive --service servicename

where servicename can be:

– hiveserver: server for Thrift, JDBC, ODBC clients

– hwi: web interface

– jar: hadoop jar with Hive JARs in the classpath

– metastore: out-of-process metastore

168

Hive - Metastore

Stores Hive metadata

Configurations:

– Embedded: in-process metastore, in-process database

– Local: in-process metastore, out-of-process database

– Remote: out-of-process metastore, out-of-process database

169

Hive – Schema-On-Read

Faster loads into the database (simply copy or move)

Slower queries

Flexibility – multiple schemas for the same data

170

Hive - Configuration

• Three ways to configure Hive:

– hive-site.xml

• fs.default.name

• mapred.job.tracker

• Metastore configuration settings

– hive --hiveconf on the command line

– SET command in the Hive shell

171

Hive Query Language (HiveQL)

SQL dialect

Does not support full SQL92 specification

No support for:

HAVING clause in SELECT

Correlated subqueries

Subqueries outside FROM clauses

Updateable or materialized views

Stored procedures

172

Sample code

173

#hive

hive> CREATE TABLE foreign_aid (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

hive> SHOW TABLES;

hive> DESCRIBE foreign_aid;

hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;

hive> SELECT * FROM foreign_aid LIMIT 10;

hive> SELECT country, SUM(sum) FROM foreign_aid GROUP BY country;

Hive Query Language (HiveQL)

Extensions

– MySQL-like extensions

– MapReduce extensions: multi-table insert; MAP, REDUCE, TRANSFORM clauses

Data types

– Simple: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING

– Complex: ARRAY, MAP, STRUCT

174

Hive Query Language (HiveQL)

Built-in functions

– SHOW FUNCTIONS

– DESCRIBE FUNCTION

175

Hive – User-Defined Functions

Written in Java

Three UDF types:

UDF

Input: single row, output: single row

UDAF

Input: multiple rows, output: single row

UDTF

Input: single row, output: multiple rows

Register UDF using ADD JAR

Create alias using CREATE TEMPORARY FUNCTION

176
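As a minimal, illustrative sketch (hypothetical class name, using the org.apache.hadoop.hive.ql.exec.UDF base class of that era), a simple UDF of the first kind might look like this:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: single row in, single row out (upper-cases a string)
public class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toUpperCase());
  }
}

Packaged in a hypothetical myudfs.jar, it would be registered and used roughly as: ADD JAR myudfs.jar; CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper'; SELECT to_upper(country) FROM foreign_aid;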

Agenda

177

• Overview

• Pig

• Hive

• Jaql

Jaql architecture

178

(Diagram: the interactive shell, applications, and scripts feed the Jaql compiler/parser/rewriter; below that sit the I/O layer and the storage layer: file systems (HDFS, GPFS, local), databases (DBMS, HBase), and streams (web, pipes).)

Jaql data model: JSON

JSON = JavaScript Object Notation

Flexible (Schema is optional)

Powerful modeling for semi-structured data

Popular exchange format

179

JSON example

180

[
  {ACCT_NUM: 18, AUTH_DATE: "2011-01-29",
   AUTH_AMT: "111.11", ZIP: 98765, MERCH_NAME: "Acme"},
  {ACCT_NUM: 19, AUTH_DATE: "2011-01-29",
   AUTH_AMT: "222.22", ZIP: 98765, MERCH_NAME: "Exxme",
   NICKNAME: "Xyz"},
  {ACCT_NUM: 20, AUTH_DATE: "2011-01-30",
   AUTH_AMT: "3.33", ZIP: 12345, MERCH_NAME: "Acme",
   ROUTE: ["68.86.85.188", "64.215.26.111"]},
  …
]

Running Jaql

Jaql shell

– Interactive: jaqlshell

– Batch: jaqlshell -b myscript.jaql

– Inline: jaqlshell -e jaqlstatement

Modes

– Cluster: jaqlshell -c

– Minicluster: jaqlshell

181

Jaql query language

• Sources and sinks

Eg: copy data from a local file (source) to a new file on HDFS (sink):

read(file("input.json")) -> write(hdfs("output"))

(Diagram: source -> operator -> operator -> … -> sink)

• Core operators: Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top

182

Jaql query language

• Variables

Equal operator (=) binds source output to a variable

e.g. $tweets = read(hdfs("twitterfeed"))

• Pipes, streams, and consumers

Pipe operator (->) streams data to a consumer

Pipe expects an array as input

e.g. $tweets -> filter $.from_src == 'tweetdeck';

$ – implicit variable referencing the current array value

183

Jaql query language

• Categories of built-in functions:

system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record

184

Jaql – Data Storage

Data store examples: Amazon S3, DB2, HBase, HDFS, HTTP, JDBC, local FS

Data format examples: JSON, Avro, CSV, XML

185

Jaql sample code

186

#jaqlshell -c

jaql> $foreignaid = read(del("econ_assist.csv",
          {schema: schema {country: string, sum: long}}));

jaql> $foreignaid
      -> group by $country = ($.country)
         into {$country.country, sum($[*].sum)};

Hadoop core lab – Part 3

BigDataUniversity.com

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, InfoSphere and BigInsights, Streams, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.

Communities

• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more

o Find the community that interests you …

• Information Management bit.ly/InfoMgmtCommunity

• Business Analytics bit.ly/AnalyticsCommunity

• Enterprise Content Management bit.ly/ECMCommunity

• IBM Champions

o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities

• ibm.com/champion

Thank You – Your feedback is important!

• Access the Conference Agenda Builder to complete your session surveys

o Any web or mobile browser at http://iod13surveys.com/surveys.html

o Any Agenda Builder kiosk onsite