Hive

Global Big Data Conference - 2014 Hive BALA KRISHNA G Global Big Data Bootcamp Jan 2014 (http://globalbigdataconference.com)

description

This presentation is one of my talks at the "Global Big Data Conference" held at the end of January 2014. It is mainly intended to give the audience an overview of Hive and hands-on experience with the Hive Query Language. The overview covers the need for Hive, Hive architecture, Hive components, the Hive Query Language, and more.

Transcript of Hive

Page 1: Hive

Global Big Data Conference - 2014

Hive

BALA KRISHNA G Global Big Data Bootcamp – Jan 2014 (http://globalbigdataconference.com)

Page 2: Hive


My introduction

Senior Software and Research Engineer

Big data trainer

More than 1.5 years of experience with Hadoop and Storm

Worked at several large companies, including Sun/Oracle and IBM

www.linkedin.com/in/gbalakrishna/


Page 3: Hive


Agenda

Class structure

– 1-hour lecture and 1½-hour lab

Lecture

– Need for Hive

– Hive history

– Hive powered by

– What is Hive?

– Hive Architecture

– Hive Query Life cycle

– Hive Query Language (HiveQL)

Lab:

– Extensive hands-on experience with Hive

– Derive various insights from a real-world dataset using Hive

Page 4: Hive


Need for Hive

Do I need to learn Java?

Don’t worry! I am here to rescue you

Page 5: Hive


Need for Hive contd.,

In general, a single MR job does not suffice to derive BI (Business Intelligence)

Often, a series of complex MR jobs must be chained together (advanced data processing)

[Figure: six chained MapReduce jobs, MR 1 to MR 6. Legend: MR = MapReduce; mapper task; reducer task]

Page 6: Hive


Need for Hive contd.,

20 lines of Hive code can correspond to ~200 lines of Java code

Lowers development time significantly (~16 times)

[Figure: two bar charts comparing Hadoop and Pig, one for lines of code and one for development time in minutes; both axes run from 0 to 300]

Page 7: Hive


Need for Hive contd.,

Focus only on the “WHAT” part of your data analysis

The “HOW” part is taken care of by the framework

Page 8: Hive


Hive powered by

And many more…

https://cwiki.apache.org/confluence/display/Hive/PoweredBy

Used for processing large amounts of user data and central to meeting company reporting needs

Ad hoc queries, reporting, and analytics

Data analytics and data cleaning

Page 9: Hive


What is Hive?

Data warehouse built on top of Hadoop

Provides an SQL-like interface to analyze data

An open-source project under Apache

Works on the high-throughput, high-latency principle (same as Hadoop)

Ability to plug in custom MapReduce programs

Mainly targeted at structured data

Hides the complexity of MapReduce programs from the end user

Page 10: Hive


Hive Architecture

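[Figure: Hive architecture. Components shown within HIVE: CLI, Web Interface, Hive Thrift Server (with Python, Perl, and ODBC clients), Driver (Compiler, Optimizer, Plan executor), and Meta Store. Components shown within HADOOP: HDFS and MapReduce]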

Page 11: Hive


Metastore

Stores table metadata such as database location, owner, creation time, access attributes, and table schema

Comprises two components: 1) service 2) data storage

[Figure: three metastore configurations]

Embedded Metastore: Driver and Metastore Service backed by Derby

Local Metastore: Driver and Metastore Service backed by MySQL

Remote Metastore: Driver talking to a separate Metastore Server backed by MySQL

Page 12: Hive


Hive Query Life cycle Insight

Page 13: Hive


Hive Query Life cycle contd.,

[Figure: query life cycle in 14 numbered steps across the Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, and Hadoop MapReduce]

Page 14: Hive


Data Models

Database: namespace that holds tables

Table: container of the actual data

Example table: sample (columns: Id, Name, Age, Sex, State)

In the Hive warehouse, the table is stored as a folder: /user/$USER/warehouse/sample

Page 15: Hive


Data Models contd.,

Partition: horizontal slice of a table by a partition key

Let's say the sample table is partitioned by the State column

Each partition is stored as a subfolder under the sample directory:

/user/$USER/warehouse/sample/State=AL/

/user/$USER/warehouse/sample/State=GA/

/user/$USER/warehouse/sample/State=NC/

/user/$USER/warehouse/sample/State=ND/

[Figure: sample table (Id, Name, Age, Sex, State) sliced horizontally into Partition 1 and Partition 2]

Page 16: Hive


Data Models contd.,

Bucket: divides a table or partition into further chunks by another column, for example for sampling

Let's say the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets

In the warehouse, the data is stored as

/user/$USER/warehouse/sample/State=AL/part-00000

/user/$USER/warehouse/sample/State=AL/part-00001

/user/$USER/warehouse/sample/State=GA/part-00000

/user/$USER/warehouse/sample/State=GA/part-00001

.

.

/user/$USER/warehouse/sample/State=ND/part-00000

/user/$USER/warehouse/sample/State=ND/part-00001
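
The partitioning and bucketing described on this slide can be sketched in HiveQL roughly as follows. The table name sample_bucketed is hypothetical, the column list is assumed from the sample table, and the note about hive.enforce.bucketing is an assumption about older Hive releases rather than something stated in the talk.

```sql
-- Hypothetical table: partitioned by state, clustered by age into 2 buckets.
-- The partition column is declared only in PARTITIONED BY, not in the column list.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Assumption: on older Hive releases, run "SET hive.enforce.bucketing = true;"
-- first so that the insert below honors the declared bucket count.

-- Populate one partition from the existing sample table.
INSERT OVERWRITE TABLE sample_bucketed PARTITION (state = 'AL')
SELECT id, name, age, sex FROM sample WHERE state = 'AL';

-- Sampling: read only the first of the two age buckets.
SELECT * FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age) s
WHERE s.state = 'AL';
```

Filtering on the partition column lets Hive read only the matching subfolder, while the bucket clause narrows the read to one file within it.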

Page 17: Hive


Data Loading Techniques

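Managed table: table managed by the Hive warehouse

– Copy a file from the local file system into the Hive warehouse

– Move a file from HDFS into the Hive warehouse (the file is relocated within HDFS, not duplicated)

[Figure: 1) a local FS file is copied into the Hive warehouse on HDFS; 2) an HDFS file is transferred into the Hive warehouse]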

Page 18: Hive


Data Loading Techniques contd.,

External table: table that is only referenced by the Hive warehouse

– The file is managed directly in HDFS without copying it into the Hive warehouse

[Figure: 3) an HDFS file is referenced by the Hive warehouse]

Page 19: Hive


Data Loading Techniques contd.,

Explain when to use an external table and when to use a managed table

Page 20: Hive


Question - 01

In which scenario would you use Hive?

1. Completely unstructured nasty data

2. Structured data

3. Any kind of data

4. None of the above

Page 21: Hive


Question – 01 answer

2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.

Page 22: Hive


Question - 02

Which option is not correct about Metastore?

1. It stores the table location

2. It has information about number of partitions and number of buckets

3. It can give you the time at which the table was created

4. It stores the actual data

Page 23: Hive


Question – 02 answer

4. Metastore stores only the metadata. Actual data is stored in HDFS.

Page 24: Hive


Question – 03 (last question)

What is incorrect about Hive?

1. Hive internally generates MapReduce jobs to serve your query

2. Hive runs on top of HDFS

3. Hive is proprietary software

4. Hive supports multiple interfaces to interact with

Page 25: Hive


Question – 03 answer

3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.

Page 26: Hive


Hive Query Language (Hive QL)

Data types – provide types for columns and values

DDL – provides a way to define databases, tables, etc.

DML – provides a way to modify content

Query statements – provide a way to retrieve content

Page 27: Hive


Data types

Primitive Types

Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)

Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)

Booleans: BOOLEAN (TRUE or FALSE)

String: STRING (sequence of characters)

Usage: variable_name <Data Type>

ex: name STRING

Page 28: Hive


Data types contd.,

Complex Types

ARRAY: collection of multiple values of the same data type

Usage: name ARRAY<data type>

ex: marks ARRAY<INT>

STRUCT: collection of multiple named values of different data types

Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>

ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

MAP: collection of (key, value) pairs

Usage: name MAP<key type, value type>

ex: score MAP<STRING, INT>

Page 29: Hive


Data types contd.,

The key in a MAP must be a primitive type

Referencing complex types

Previous example:

– marks ARRAY<INT>

– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

– score MAP<STRING, INT>

SELECT marks[0], record.name, score['joe']

A complex type inside a complex type is allowed

– array inside a struct (as seen before)
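
As a rough end-to-end illustration of the three complex columns above, the following HiveQL sketch declares them in a table and reads from each. The table name student_scores is hypothetical; note that Hive's STRUCT syntax separates each field name from its type with a colon.

```sql
-- Hypothetical table holding the three complex columns from the slides.
CREATE TABLE student_scores (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

-- Index into the array, dot into the struct, and look up a map key.
-- size() and map_keys() are built-in collection functions.
SELECT marks[0],
       record.name,
       record.marks[0],
       score['joe'],
       size(marks)     AS num_marks,
       map_keys(score) AS score_names
FROM student_scores;
```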

Page 30: Hive


DDL

CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)  -- schema (the partition column is declared below, not here)
COMMENT 'This is a sample table'                               -- comment for readability
PARTITIONED BY (state STRING)                                  -- partition data by state column
ROW FORMAT DELIMITED                                           -- rows are delimited by '\n'
FIELDS TERMINATED BY ','                                       -- fields are terminated by ','
STORED AS TEXTFILE;                                            -- store file as a text file

The table is created in the warehouse directory and is completely managed by Hive

A specific row format and file format can be expressed with a custom SerDe

Page 31: Hive


SerDe

SerDe stands for Serializer and Deserializer

Read path (Deserializer): HDFS File -> InputFileFormat -> <Key, Value> -> Deserializer -> Row

Write path (Serializer): Row -> Serializer -> <Key, Value> -> OutputFileFormat -> HDFS File

Page 32: Hive


DDL contd.,

CREATE EXTERNAL TABLE external_sample(id INT, name STRING, age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';

The table is not created in the warehouse directory; it is only referenced by Hive

The referenced data is in HDFS (/user/department/sample)

Page 33: Hive


DDL contd.,

DROP TABLE sample;

Since the sample table is managed by Hive, this deletes all of its data along with the metadata

DROP TABLE external_sample;

Since the external_sample table is *not* managed by Hive, this deletes only the metadata and leaves the actual data untouched
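
Before removing a table it can help to check which kind it is. The sketch below uses Hive's DROP TABLE statement together with DESCRIBE FORMATTED on the two tables from these slides; the comments summarize the behavior described above rather than captured output.

```sql
-- Inspect each table first; the "Table Type" row in the output
-- reads MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED sample;
DESCRIBE FORMATTED external_sample;

-- Managed table: removes the metadata and the data under the warehouse directory.
DROP TABLE IF EXISTS sample;

-- External table: removes only the metadata; the files at LOCATION stay in HDFS.
DROP TABLE IF EXISTS external_sample;
```

This difference is one practical guide for choosing between the two: data that other tools also read is safer behind an external table, since dropping the table cannot delete it.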

Page 34: Hive


DML

Load data into a managed table from the local file system

LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;

The file '/home/hive/sample.txt' is in the local file system

It is copied into the Hive warehouse folder

Load data into a managed table from HDFS

LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;

The file '/user/hive/sample.txt' is in HDFS

It is moved (not copied) into the Hive warehouse folder
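
When the target table is partitioned, as the sample table is in the DDL slide, a load can name the partition it goes into. The following is a minimal sketch; the file name sample_AL.txt and its contents are hypothetical.

```sql
-- Hypothetical local file holding only Alabama rows (id,name,age,sex per line).
-- OVERWRITE replaces whatever the partition already contains.
LOAD DATA LOCAL INPATH '/home/hive/sample_AL.txt'
OVERWRITE INTO TABLE sample
PARTITION (state = 'AL');

-- Confirm that the partition is now registered in the metastore.
SHOW PARTITIONS sample;
```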

Page 35: Hive


DML contd.,

Insert results into a new table

INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;

The newsample table must be created beforehand

The select query results are loaded into newsample, overwriting its contents

Create a new table with an automatically derived schema

CREATE TABLE newsample
AS SELECT * FROM sample;

Creates the newsample table with an automatically derived schema

The query results are populated into it
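
The same INSERT ... SELECT pattern can also fill a partitioned table, letting Hive create one partition per distinct value. This is a sketch under assumptions: sample_by_state is a hypothetical target table, and the two SET properties enable dynamic partitioning for the session.

```sql
-- Hypothetical target table, partitioned by state.
CREATE TABLE sample_by_state (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING);

-- Allow Hive to derive partition values from the query result.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE sample_by_state PARTITION (state)
SELECT id, name, age, sex, state FROM sample;
```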

Page 36: Hive


Query statements

To list available databases

SHOW DATABASES;

To use a particular database

USE <databasename>;

To list all tables available in a database

SHOW TABLES;

Page 37: Hive


Query statements contd.,

Select

SELECT * FROM sample;

Aggregation functions

SELECT COUNT(DISTINCT state) FROM sample;

Group by, Sort by, Order by

SELECT COUNT(*) FROM sample GROUP BY state;

SELECT * FROM sample SORT BY id DESC;

FROM sample SELECT * ORDER BY id ASC;
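
A few variations on the statements above may make the differences clearer. This is an illustrative sketch against the same sample table; the column aliases are invented for readability.

```sql
-- Include the grouping column so each count is labelled with its state.
SELECT state, COUNT(*) AS num_people, AVG(age) AS avg_age
FROM sample
GROUP BY state;

-- ORDER BY: a single total ordering over the whole result.
SELECT id, name FROM sample ORDER BY id ASC LIMIT 10;

-- DISTRIBUTE BY + SORT BY: rows with the same state go to the same reducer,
-- and each reducer sorts only its own rows.
SELECT id, name, state
FROM sample
DISTRIBUTE BY state
SORT BY state, id DESC;
```

SORT BY is cheaper on large inputs because no single reducer has to see every row, at the cost of giving up a global order.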

Page 38: Hive


Query statements contd.,

Joins

SELECT s.*, o.*
FROM sample s
JOIN orders o
ON (s.id = o.id);

Left and right joins are also supported

Multiple joins are accepted
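
The two remarks about outer joins and multiple joins can be sketched as follows. The orders table comes from the slide; addresses is a hypothetical third table assumed to share the same id key.

```sql
-- Keep every row of sample, with NULLs where no order matches.
SELECT s.id, s.name, o.*
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);

-- Chain a second join; addresses is a hypothetical table keyed by the same id.
SELECT s.name, o.*, a.*
FROM sample s
JOIN orders o ON (s.id = o.id)
JOIN addresses a ON (s.id = a.id);
```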

Page 39: Hive


Custom Functions

UDF:

– User-defined function

– Complex or additional logic can be expressed

– Operates row by row

UDAF:

– User-defined aggregate function

– Custom aggregation logic can be written

– Operates on the groups produced by the GROUP BY clause

UDTF:

– User-defined table-generating function

– Takes a single input row and can produce multiple output rows
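
Writing a custom function is beyond a short example, but Hive's built-in functions follow the same three shapes. The sketch below uses upper(), avg(), and explode() on the sample table purely as stand-ins for the three categories.

```sql
-- UDF shape: one value in, one value out, applied row by row.
SELECT id, upper(name) AS name_upper FROM sample;

-- UDAF shape: one value per group.
SELECT state, avg(age) AS avg_age FROM sample GROUP BY state;

-- UDTF shape: explode() turns one array value into one row per element;
-- LATERAL VIEW joins those generated rows back to the source row.
SELECT s.id, w.word
FROM sample s
LATERAL VIEW explode(split(s.name, ' ')) w AS word;
```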

Page 40: Hive


Hive Limitations

Not suitable for unstructured data

Suited to OLAP (analysis) workloads rather than transactional (OLTP) use

Representing machine learning algorithms can be challenging

Performance trade-off compared with hand-written MR programs in various scenarios

– The gap is narrowing from release to release

Page 41: Hive


Important practical tips

Hive logs: /tmp/$USER/hive.log

To list available functions: SHOW FUNCTIONS

To get help on a specific function: DESCRIBE FUNCTION <function_name>

Explain the config files in the /usr/lib/hive/conf folder

– hive-site.xml, hive-default.xml, or specify a custom file using the -f option?

Setting parameters in the Hive session (SET)
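
These tips translate into a handful of statements that can be tried in any Hive session. Hive lists its functions with SHOW FUNCTIONS; the function upper and the property hive.cli.print.header are chosen here only as examples.

```sql
-- List every function known to this session.
SHOW FUNCTIONS;

-- Short and extended help for one function.
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Print the current value of a session parameter.
SET hive.cli.print.header;

-- Change it for the rest of the session.
SET hive.cli.print.header = true;
```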

Page 42: Hive


References

White, Tom. Hadoop: The Definitive Guide

https://cwiki.apache.org/confluence/display/Hive/Home

http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf

Venner, Jason (2009). Pro Hadoop

http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/

Page 43: Hive


Q/A


Page 45: Hive


Backup slides

Page 46: Hive


Schema on Read (?)

[To do] where to put this slide?

Explain what is schema on read

Explain what is schema on write

Advantages of using schema on read

– Faster load time

– Impacts query time
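
Schema on read can be illustrated with an external table laid over files that already exist. This is a sketch under assumptions: the table raw_people and its directory are hypothetical, and the files are assumed to be comma-separated text.

```sql
-- Hypothetical directory of comma-separated files already sitting in HDFS.
-- Creating the table reads no data, which is why "loading" is fast.
CREATE EXTERNAL TABLE raw_people (id INT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/department/raw_people';

-- The schema is applied only now, while reading: a field that does not parse
-- as its declared type (for example 'abc' in the age column) comes back as NULL.
SELECT id, name, age FROM raw_people WHERE age IS NULL;
```

With schema on write, by contrast, such a row would be rejected or converted when the data is loaded, which slows the load but keeps parsing work out of each query.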