Hive

Global Big Data Conference - 2014


BALA KRISHNA G Global Big Data Bootcamp – Jan 2014 (http://globalbigdataconference.com)

Speaker: Bala

My introduction

Senior Software and Research Engineer

Big data trainer

More than 1.5 years of experience with Hadoop and Storm

Worked at various large companies, including Sun/Oracle and IBM

www.linkedin.com/in/gbalakrishna/



Agenda

Class structure

– 1 hour lecture and 1 ½ hour lab

Lecture

– Need for Hive

– Hive history

– Hive powered by

– What is Hive?

– Hive Architecture

– Hive Query Life cycle

– Hive Query Language (HiveQL)

Lab:

– Extensive hands-on experience with Hive

– Derive various insights from a real-world dataset using Hive


Need for Hive

"Do I need to learn Java?"

"Don't worry! I am here to rescue you."


Need for Hive contd.,

In general, a single MR job does not suffice to derive BI (Business Intelligence)

Advanced data processing often requires a series of complex MR jobs chained together

[Figure: six MapReduce jobs (MR 1 to MR 6) chained together. Legend: MR = MapReduce, with separate symbols for mapper tasks and reducer tasks]



20 lines of Hive code can correspond to roughly 200 lines of Java code

Lowers development time significantly (~16 times)

[Charts: amount of code and development time in minutes, Hadoop vs. Pig; both axes run from 0 to 300]



Hive lets you focus on the "WHAT" part of your data analysis

The "HOW" part is taken care of by the framework
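A minimal sketch of the "WHAT, not HOW" idea: the query below only states the desired result, and Hive decides how to turn it into MapReduce work. It assumes the sample table (id, name, age, sex, state) defined in the DDL slides; the aliases people and avg_age are illustrative.

```sql
-- Number of people and average age per state, largest states first.
-- No mapper, reducer, or job-chaining code is written by the user.
SELECT state,
       COUNT(*) AS people,
       AVG(age) AS avg_age
FROM sample
GROUP BY state
ORDER BY people DESC;
```

Hive would typically plan a statement like this as more than one MapReduce job (one for the grouping, one for the global ordering), which is the chaining the previous slide describes.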


Hive powered by

And many more…

https://cwiki.apache.org/confluence/display/Hive/PoweredBy

Used for processing large amounts of user data and central to meeting the company's reporting needs

Ad hoc queries, reporting and analytics

Data analytics and data cleaning


What is Hive?

A data warehouse built on top of Hadoop

Provides an SQL-like interface to analyze data

An open source project under Apache

Works on the high throughput, high latency principle (same as Hadoop)

Ability to plug in custom MapReduce programs

Mainly targeted at structured data

Hides the complexity of MapReduce programs from the end user


Hive Architecture

[Architecture diagram]

HIVE

– Interfaces: CLI, Web Interface, Hive Thrift Server (clients: Python, Perl, ODBC)

– Driver: Compiler, Optimizer, Plan executor

– Metastore

HADOOP

– HDFS

– MapReduce


Metastore

Stores table metadata such as database location, owner, creation time, access attributes and table schema

Comprises two components: 1) Service 2) Data storage

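Embedded Metastore: Driver, Metastore Service and Derby all run inside the Hive Service

Local Metastore: Driver and Metastore Service run inside the Hive Service; MySQL runs separately

Remote Metastore: Driver runs inside the Hive Service; Metastore Server and MySQL run separately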
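To see what the metastore holds for a table, the statements below can be run from any Hive interface. This is a sketch that assumes a table named sample already exists (it is defined in the DDL slides); the exact output layout varies between Hive versions.

```sql
-- Print the metadata Hive keeps for a table: columns, owner,
-- creation time, HDFS location and table type (managed or external).
DESCRIBE FORMATTED sample;

-- List the partitions recorded in the metastore for the table.
SHOW PARTITIONS sample;
```

Neither statement reads the data files themselves; both are answered from the metadata.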


Hive Query Life cycle Insight



Hive Query Life cycle contd.,

[Diagram: query life cycle, steps 1 to 14]

Components shown: Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, Hadoop (MapReduce)
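The plan that the compiler produces for a query can be inspected with EXPLAIN. A small sketch, assuming the sample table from the DDL slides:

```sql
-- Show the stages Hive would run for this query without executing it.
EXPLAIN
SELECT state, COUNT(*)
FROM sample
GROUP BY state;
```

The output lists the stage dependencies and the operators inside each stage; the exact text differs between Hive versions.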


Data Models

Database: Namespace that holds tables

Table: Container of the actual data

Example: table sample with columns Id, Name, Age, Sex, State

In the Hive warehouse, a table is stored as a folder: /user/$USER/warehouse/sample


Data Models contd.,

Partition: Horizontal slice of a table by a partition key

Let's say the sample table (Id, Name, Age, Sex, State) is partitioned by the State column

Each partition is stored as a subfolder under the sample directory:

/user/$USER/warehouse/sample/State=AL/

/user/$USER/warehouse/sample/State=GA/

/user/$USER/warehouse/sample/State=NC/

/user/$USER/warehouse/sample/State=ND/
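A short sketch of why partitioning helps, assuming the sample table is partitioned by state as described above:

```sql
-- The filter is on the partition column, so Hive only needs to read
-- the State=AL subfolder instead of scanning the whole table.
SELECT id, name, age
FROM sample
WHERE state = 'AL';

-- Register an additional (initially empty) partition explicitly.
ALTER TABLE sample ADD IF NOT EXISTS PARTITION (state = 'NC');
```

A query without a filter on state still works; it simply reads every partition subfolder.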


Data Models contd.,

Bucket: Divides the data of a table or partition into further chunks by another column, which is useful for sampling

Let's say the sample table is partitioned by the 'State' column and clustered by the 'Age' column into 2 buckets

In the warehouse, the data is stored as

/user/$USER/warehouse/sample/State=AL/part-00000

/user/$USER/warehouse/sample/State=AL/part-00001

/user/$USER/warehouse/sample/State=GA/part-00000

/user/$USER/warehouse/sample/State=GA/part-00001

.

.

/user/$USER/warehouse/sample/State=ND/part-00000

/user/$USER/warehouse/sample/State=ND/part-00001
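The layout above could be declared roughly as follows. This is a sketch: the table name sample_bucketed is hypothetical, and the hive.enforce.bucketing setting only applies to older Hive releases.

```sql
-- Partition by state, then split each partition into 2 buckets by age.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- On older Hive releases, inserts only honor the bucket count when this is set.
SET hive.enforce.bucketing=true;

-- Sample roughly half of the data by reading one of the two buckets.
SELECT *
FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age) s;
```

Rows are assigned to a bucket by hashing the clustering column, so rows with the same age always land in the same bucket file.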


Data Loading Techniques

Managed Table: Tables managed by the Hive warehouse

– Copy a file from the local file system into the Hive warehouse

– Move a file from HDFS into the Hive warehouse

[Figure: 1) a local FS file is copied into the Hive warehouse; 2) an HDFS file is moved into the Hive warehouse]

Data Loading Techniques contd.,

External Table: Tables are only referenced by the Hive warehouse

– The file is managed directly in HDFS without copying it into the Hive warehouse

[Figure: 3) an HDFS file is referenced by the Hive warehouse]


Data Loading Techniques contd.,

When should you use an external table, and when a managed table?


Question - 01

In which scenario would you use Hive?

1. Completely unstructured nasty data

2. Structured data

3. Any kind of data

4. None of the above


Question – 01 answer

2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.


Question - 02

Which option is not correct about Metastore?

1. It stores the table location

2. It has information about number of partitions and number of buckets

3. It can give you time at which the table is created

4. It stores the actual data



Question – 02 answer

4. Metastore stores only the metadata. The actual data is stored in HDFS.


Question – 03 (last question)

What is incorrect about Hive?

1. Hive internally generates MapReduce jobs to serve your query

2. Hive runs on top of HDFS

3. Hive is proprietary software

4. Hive supports multiple interfaces to interact with



Question – 03 answer

3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.


Hive Query Language (Hive QL)

Data types – provides types for variables

DDL – provides a way to define databases, tables, etc.,

DML – provides a way to modify content

Query statements – provides a way to retrieve the content


Data types

Primitive Types

Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)

Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)

Booleans: BOOLEAN (TRUE or FALSE)

String: STRING (sequence of characters)

Usage: variable_name <Data Type>

ex: name STRING


Data types contd.,

Complex Types

ARRAY: collection of multiple values of the same data type

Usage: name ARRAY<type>

ex: marks ARRAY<INT>

STRUCT: collection of multiple values of different data types

Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>

ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

MAP: collection of (key, value) pairs

Usage: name MAP<key type, value type>

ex: score MAP<STRING, INT>


Data types contd.,

The key of a MAP must be a primitive type

Referencing complex types

Previous example:

– marks ARRAY<INT>

– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

– score MAP<STRING, INT>

SELECT marks[0], record.name, score['joe'] FROM <tablename>;

A complex type inside a complex type is allowed

– an array inside a struct (as seen before)
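Put together, the three example columns could live in one table. This is a sketch: the table name students is hypothetical, and the default text delimiters are assumed.

```sql
-- Hypothetical table using the three complex types from the slides.
CREATE TABLE students (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

-- Array elements by zero-based index, struct fields by dot notation,
-- map values by key.
SELECT marks[0], record.name, score['joe']
FROM students;

-- Built-in helpers for collections: element count and the list of map keys.
SELECT size(marks), map_keys(score)
FROM students;
```

Note that struct fields are declared as name:type pairs inside the angle brackets.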


DDL

CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)

COMMENT 'This is a sample table'

PARTITIONED BY (state STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

Column list: the schema (the partition column state is declared only in PARTITIONED BY; Hive rejects a column that is repeated in both lists)

COMMENT: comment for readability

PARTITIONED BY: partition data by the state column

ROW FORMAT DELIMITED: rows are delimited by '\n'

FIELDS TERMINATED BY: fields are terminated by ','

STORED AS TEXTFILE: store the file as a text file

The table is created in the warehouse directory and completely managed by Hive

A specific row format and file format can be expressed by a custom SerDe


SerDe

SerDe stands for Serializer and Deserializer

Deserializer (read path): HDFS File -> InputFileFormat -> <Key, Value> -> Deserializer -> Row

Serializer (write path): Row -> Serializer -> <Key, Value> -> OutputFileFormat -> HDFS File
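A SerDe is selected per table in the DDL. The sketch below uses the regex-based SerDe that ships with Hive to split each line into three space-separated fields. The table and column names are hypothetical, and the class name assumes a release where RegexSerDe lives in the serde2 package (older releases shipped it in a contrib package).

```sql
-- Each capture group of the regex is mapped to one column, in order.
CREATE TABLE space_separated (
  first_field  STRING,
  second_field STRING,
  third_field  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;
```

Here the file format (TEXTFILE) supplies the lines, and the SerDe turns each line into a row.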


DDL contd.,

CREATE EXTERNAL TABLE external_sample(id INT, name STRING,

age INT, sex STRING, state STRING)

LOCATION '/user/department/sample';

The table is not created in the warehouse directory; it is only referenced by Hive

The referenced data stays in HDFS under /user/department/sample


DDL contd.,

DROP TABLE sample;

Since the sample table is managed by Hive, this deletes the data along with the metadata

DROP TABLE external_sample;

Since the external_sample table is *not* managed by Hive, this deletes only the metadata, leaving the actual data untouched
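Hive's statement for removing a table is DROP TABLE. The sketch below illustrates the metadata-only behavior for an external table, assuming the external_sample table and its location from the previous slide.

```sql
-- Table Type in the output reads MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED external_sample;

-- Removes only the metastore entry because the table is external.
DROP TABLE IF EXISTS external_sample;

-- Re-registering the same location makes the untouched files queryable again.
CREATE EXTERNAL TABLE external_sample(id INT, name STRING, age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';

SELECT COUNT(*) FROM external_sample;
```

Running the same sequence against a managed table would leave the re-created table empty, because dropping it removes the files as well.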


DML

Load data into a managed table from the local file system

LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;

The file '/home/hive/sample.txt' is in the local file system

It is copied into the Hive warehouse folder

Load data into a managed table from HDFS

LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;

The file '/user/hive/sample.txt' is in HDFS

It is moved (not copied) into the Hive warehouse folder
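If the target table is partitioned, as sample is in the DDL slide, LOAD DATA needs to be told which partition the file belongs to. A sketch, with a hypothetical file name:

```sql
-- The partition value comes from the PARTITION clause, so the file itself
-- holds only the non-partition columns (id, name, age, sex).
-- OVERWRITE replaces whatever the partition contained before.
LOAD DATA LOCAL INPATH '/home/hive/sample_al.txt'
OVERWRITE INTO TABLE sample
PARTITION (state = 'AL');
```

LOAD DATA does not parse or validate the file; it only places it in the table's (or partition's) folder.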


DML contd.,

Insert results into an existing table

INSERT OVERWRITE TABLE newsample

SELECT * FROM sample;

The newsample table must be created beforehand

The SELECT query results are loaded into newsample, overwriting its existing content

Create a new table with an automatically derived schema

CREATE TABLE newsample

AS SELECT * FROM sample;

Creates the newsample table with an automatically derived schema

The query results are populated into it
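When the target of an insert is a partitioned table, Hive can create the partitions from the query result. A sketch, assuming a hypothetical unpartitioned staging table sample_staging with columns id, name, age, sex, state:

```sql
-- Allow partitions to be created from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (state) comes last in the SELECT list.
-- Each distinct state value ends up in its own partition folder.
INSERT OVERWRITE TABLE sample PARTITION (state)
SELECT id, name, age, sex, state
FROM sample_staging;
```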


Query statements

To list available databases

SHOW DATABASES;

To use a particular database

USE <databasename>;

To list all tables available in a database

SHOW TABLES;


Query statements contd.,

Select

SELECT * FROM sample;

Aggregation functions

SELECT COUNT(DISTINCT state) FROM sample;

Group by, Sort by, Order by

SELECT COUNT(*) FROM sample GROUP BY state;

SELECT * FROM sample SORT BY id DESC;

FROM sample SELECT * ORDER BY id ASC;
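SORT BY and ORDER BY look alike but behave differently once more than one reducer is involved. A sketch on the sample table:

```sql
-- ORDER BY gives a total order over the whole result.
SELECT id, name, state FROM sample ORDER BY id ASC;

-- SORT BY only sorts the rows within each reducer, so the combined
-- output is not globally ordered when several reducers run.
SELECT id, name, state FROM sample SORT BY id DESC;

-- DISTRIBUTE BY sends all rows of a state to the same reducer,
-- and SORT BY then orders them there.
SELECT id, name, state FROM sample DISTRIBUTE BY state SORT BY state, id;

-- Including the grouping key in the SELECT list labels each count.
SELECT state, COUNT(*) FROM sample GROUP BY state;
```

ORDER BY is the safer choice for small results; SORT BY scales better when a global order is not required.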


Query statements contd.,

Joins

SELECT s.* , o.*

FROM sample s

JOIN orders o

ON (s.id = o.id);

Left and right outer joins are also supported

Multiple joins in one query are accepted
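A sketch of the two variations mentioned above. The orders table comes from the slide; its amount column, the payments table and its status column are hypothetical.

```sql
-- LEFT OUTER JOIN keeps every row of sample and fills in NULLs
-- where no matching order exists.
SELECT s.id, s.name, o.amount
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);

-- Several joins can be chained in one statement.
SELECT s.name, o.amount, p.status
FROM sample s
JOIN orders o ON (s.id = o.id)
JOIN payments p ON (o.id = p.id);
```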


Custom Functions

UDF:

– User-defined function

– Complex/additional logic can be expressed

– Operates row by row

UDAF:

– User-defined aggregate function

– Custom aggregation logic can be written

– Operates on the groups produced by a GROUP BY clause

UDTF:

– User-defined table-generating function

– Operates on a single row and can produce multiple output rows (a table)
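Hive's built-in functions fall into the same three families, which makes the difference easy to see before writing a custom one. In the sketch below the jar path, function name and class name in the last part are hypothetical placeholders.

```sql
-- Same family as a UDF: one value in, one value out, for every row.
SELECT upper(name) FROM sample;

-- Same family as a UDAF: one value per group.
SELECT state, avg(age) FROM sample GROUP BY state;

-- Same family as a UDTF: one input can expand into several output rows.
SELECT explode(array('a', 'b', 'c')) AS item FROM sample LIMIT 3;

-- Registering a custom UDF packaged in a jar (hypothetical names).
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLower';
SELECT my_lower(name) FROM sample;
```

A temporary function lasts only for the current session.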


Hive Limitations

Not suitable for unstructured data

Suited to OLAP (analysis) workloads, not to OLTP (online transaction processing)

Expressing machine learning algorithms can be challenging

Performance trade-off compared with hand-written MR programs in various scenarios

– The gap is narrowing from release to release


Important practical tips

Hive logs: /tmp/$USER/hive.log

To list available functions: SHOW FUNCTIONS;

To get help on a specific function: DESCRIBE FUNCTION <function_name>;

Config files live in the /usr/lib/hive/conf folder

– hive-site.xml, hive-default.xml, (or) point Hive at a custom configuration directory using the --config option

Set parameters within a Hive session using SET <property>=<value>;
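The tips above as they might be typed in a Hive session. This is a sketch; hive.cli.print.header is used only as an example property.

```sql
-- List every function known to this Hive installation.
SHOW FUNCTIONS;

-- Short help for one function, then the longer version with examples.
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Print the current value of a property.
SET hive.cli.print.header;

-- Change it for the current session only.
SET hive.cli.print.header=true;
```

Values set this way are not written back to hive-site.xml; they are gone when the session ends.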







Q/A


Backup slides


Schema on Read (?)

[To do] where to put this slide?

Explain what is schema on read

Explain what is schema on write

Advantages of using schema on read

– Faster load time

– Impacts query time
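A small sketch of schema on read, assuming comma-separated files already sit in a hypothetical HDFS folder /user/department/raw_people:

```sql
-- Creating the table does not read or validate the files.
-- It only records the schema, which is why loading is fast.
CREATE EXTERNAL TABLE raw_people (id INT, name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/department/raw_people';

-- The schema is applied when the query runs. With the default text format,
-- a field that cannot be parsed as the declared type comes back as NULL.
SELECT id, name, age
FROM raw_people
WHERE age IS NULL;
```

With schema on write, by contrast, malformed rows would be rejected at load time rather than surfacing at query time.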
