Hive.pptx_ver_2.0

1 Copyright © 2012 Tata Consultancy Services Limited Hive August 16, 2012 INTERNAL Only for TCS Internal Training – TCS NextGen Solutions, Kochi


Transcript of Hive.pptx_ver_2.0

Page 1: Hive.pptx_ver_2.0

Copyright © 2012 Tata Consultancy Services Limited

Hive

August 16, 2012

INTERNAL

Only for TCS Internal Training – TCS NextGen Solutions, Kochi

Page 2: Hive.pptx_ver_2.0

Contents

Introduction
Why Hive?
Configuring Hive
The Hive Shell
Hive Architecture
HiveQL
Data Types and Table Types
Managed Table
External Table
Storage Formats
Queries
View
Hive Data Model
The Metastore
User Defined Functions
What Hive is not?

Page 3: Hive.pptx_ver_2.0

Introduction

Hive is a data warehouse infrastructure built on top of Apache Hadoop.

Hive is designed to enable:
  Easy data summarization
  Ad-hoc querying
  Analysis of large volumes of data

Hive provides a simple query language called HiveQL.

HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers.

Page 4: Hive.pptx_ver_2.0

Why Hive?

Need for a multi-petabyte warehouse

Files are insufficient data abstractions
  Need tables, schemas, partitions, indices

Need for an open data format
  RDBMSs have a closed data format
  Flexible schema

Page 5: Hive.pptx_ver_2.0

Configuring Hive

Download a release at ftp://ftp.nextgen.com

Unpack the tarball in a suitable place on your workstation:
% tar xzf hive-x.y.z-dev.tar.gz

Put Hive on your path:
% export HIVE_HOME=/home/EmpID/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_HOME/bin

Type hive to launch the shell:
% hive
hive>

Page 6: Hive.pptx_ver_2.0

The Hive Shell

The Hive shell is the primary way that we will interact with Hive.

HiveQL is Hive's query language, a dialect of SQL.

HiveQL is generally case insensitive (except for string comparisons).

The Hive shell can also be run in non-interactive mode.

The -f option runs the commands in the specified script file.

Page 7: Hive.pptx_ver_2.0

Hive Architecture

[Architecture diagram: Hive layered on Hadoop (MapReduce + HDFS). Components shown: Command Line Interface, Web Interface, Thrift Server (JDBC, ODBC), Metastore, Libraries, Driver (Compiler, Optimizer, Executor), HiveQL.]

Page 8: Hive.pptx_ver_2.0

Hive Architecture (Contd..)

UI: The user interface for users to submit queries and other operations to the system.

CLI: The command line interface to Hive (the shell). This is the default service.

HWI: The Hive web interface can be used as an alternative to the shell. It can be started using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi

Driver: The component which receives the queries.

Page 9: Hive.pptx_ver_2.0

Hive Architecture (Contd..)

Metastore: The component that stores all the structure information of the various tables and partitions in the warehouse.

Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan.

Execution Engine: The component which executes the execution plan created by the compiler.

Thrift Client: The Thrift client makes it easy to run Hive commands from a wide range of programming languages.

Page 10: Hive.pptx_ver_2.0

Hive Architecture (Contd..)

JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port.

ODBC Driver: The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. The ODBC driver uses Thrift to communicate with the Hive server.

Map Reduce: Hive internally runs the query as a map/reduce job.
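To make the map/reduce point concrete, here is a minimal Python sketch (not Hive code) of how a query such as SELECT dept, COUNT(*) FROM emp GROUP BY dept could be evaluated in map/reduce style. The table emp, its rows, and the function names are hypothetical and used only for illustration; the plans Hive actually generates are more elaborate.

```python
from collections import defaultdict

# Hypothetical rows of a table emp(name, dept); not taken from the slides.
ROWS = [("asha", "hr"), ("binu", "it"), ("chitra", "it"), ("dev", "hr"), ("esha", "ops")]

def map_phase(rows):
    """Map: emit one (key, 1) pair per input row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(ROWS)))
print(sorted(counts.items()))
```

Each phase sees only keys and values, which is why a HiveQL query has to be compiled into a plan before Hadoop can run it.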

Page 11: Hive.pptx_ver_2.0

HiveQL

HiveQL is Hive's SQL dialect.

It does not provide the full features of the SQL-92 language constructs.

The main differences between HiveQL and SQL are:

Feature              SQL                              HiveQL
Updates              INSERT, UPDATE and DELETE        INSERT OVERWRITE TABLE
Indexes              Supported                        Not supported
Functions            Hundreds of built-in functions   Dozens of built-in functions
Views                Updatable                        Read-only
Multitable inserts   Not supported                    Supported

Page 12: Hive.pptx_ver_2.0

Data Types and Table Types

Hive Data Types
Hive supports both primitive and complex data types.

Primitive Data Types
  Signed Integer - TINYINT, SMALLINT, INT, BIGINT
  Floating Point - FLOAT, DOUBLE
  BOOLEAN
  STRING

Complex Data Types
  ARRAY, MAP and STRUCT

Hive Table Types
Hive tables are of two types:
  Managed Tables
  External Tables

Page 13: Hive.pptx_ver_2.0

Managed Table

Managed Table - Hive moves the data into its warehouse directory.

hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/txt' INTO TABLE managed_table;

When a managed table is dropped, the table, including its data and metadata, is deleted.

Page 14: Hive.pptx_ver_2.0

External Table

External Table - Hive refers to data at an existing location outside the warehouse directory.

The keyword EXTERNAL specifies an external table.

hive> CREATE EXTERNAL TABLE ext_table (dummy STRING) LOCATION '/user/tom/ext_table';
hive> LOAD DATA INPATH '/user/text' INTO TABLE ext_table;

When an external table is dropped, Hive leaves the data untouched and deletes only the metadata.
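The difference in DROP behaviour between managed and external tables can be sketched with a toy model in Python. This is not how Hive is implemented: the class ToyWarehouse, a dict standing in for the metastore, and a temporary directory standing in for HDFS are all assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

class ToyWarehouse:
    """Toy model: a dict stands in for the metastore, a temp directory for HDFS."""

    def __init__(self, root):
        self.warehouse_dir = Path(root) / "warehouse"
        self.warehouse_dir.mkdir(parents=True)
        self.metastore = {}  # table name -> (data directory, is_external)

    def create_table(self, name, external_location=None):
        if external_location is None:
            location = self.warehouse_dir / name  # managed: inside the warehouse
        else:
            location = Path(external_location)    # external: outside the warehouse
        location.mkdir(parents=True, exist_ok=True)
        self.metastore[name] = (location, external_location is not None)

    def load(self, name, source_file):
        """Model LOAD DATA INPATH: move the file into the table's directory."""
        location, _ = self.metastore[name]
        source = Path(source_file)
        source.rename(location / source.name)

    def drop_table(self, name):
        location, is_external = self.metastore.pop(name)  # metadata is always removed
        if not is_external:
            shutil.rmtree(location)                        # data removed only if managed

root = Path(tempfile.mkdtemp())
wh = ToyWarehouse(root)
for table, external in (("managed_table", None), ("ext_table", root / "ext_table")):
    src = root / f"{table}.txt"
    src.write_text("dummy\n")
    wh.create_table(table, external)
    wh.load(table, src)

managed_dir = wh.metastore["managed_table"][0]
external_dir = wh.metastore["ext_table"][0]
wh.drop_table("managed_table")
wh.drop_table("ext_table")
print(managed_dir.exists(), external_dir.exists())
```

After both drops the toy metastore is empty, the managed table's directory is gone, and the external table's directory still holds its file.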

Page 15: Hive.pptx_ver_2.0

Queries

Table Creation
hive> CREATE TABLE <table name> (<column name> <data type>, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<character>';

Alter a Table
hive> ALTER TABLE <table name> ADD COLUMNS (<column name> <data type>);

Drop a Table
hive> DROP TABLE <table name>;

Page 16: Hive.pptx_ver_2.0

Queries (Contd..)

Describe table structure
hive> DESCRIBE <table name>;

To show all tables in the database
hive> SHOW TABLES;

To load data into Hive tables
hive> LOAD DATA INPATH '<file path>' INTO TABLE <table name>;

To retrieve data from Hive tables
hive> SELECT * FROM <table name>;

Page 17: Hive.pptx_ver_2.0

Subquery

Hive supports subqueries only in the FROM clause.

The columns in the subquery select list are available in the outer query just like columns of a table.

Example
SELECT col FROM (
  SELECT col1+col2 AS col FROM table1
) table2;
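The subquery shape above can be tried without a Hive installation. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption of this example: only the query shape carries over, not Hive's execution. The sample rows are hypothetical, and an ORDER BY is added so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Same shape as the slide's example: the inner SELECT defines `col`,
# and the outer SELECT reads it as if table2 were a real table.
query = """
SELECT col FROM (
    SELECT col1 + col2 AS col FROM table1
) table2
ORDER BY col
"""
result = [row[0] for row in conn.execute(query)]
print(result)
conn.close()
```

The alias table2 gives the subquery a name, so the outer query can refer to its columns like those of any other table.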

Page 18: Hive.pptx_ver_2.0

Join in Hive

Hive supports only equality joins, outer joins, and left semi joins.

Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job.

More than two tables can be joined in Hive.

Example
hive> SELECT table1.*, table2.*
    > FROM table1 JOIN table2 ON (table1.col1 = table2.col1);
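Why equality conditions map well onto map/reduce can be seen in a small reduce-side join, sketched here in plain Python rather than Hive or Hadoop code. The rows and function names are hypothetical; the first field of each row plays the role of col1.

```python
from collections import defaultdict

# Hypothetical rows; the first field of each row plays the role of col1.
TABLE1 = [(1, "a"), (2, "b"), (3, "c")]
TABLE2 = [(1, "x"), (1, "y"), (3, "z")]

def map_phase(table1, table2):
    """Map: tag each row with its source table and emit it under its join key."""
    for row in table1:
        yield row[0], ("table1", row)
    for row in table2:
        yield row[0], ("table2", row)

def reduce_phase(pairs):
    """Reduce: rows with equal keys meet in one group; pair left rows with right rows."""
    groups = defaultdict(lambda: {"table1": [], "table2": []})
    for key, (tag, row) in pairs:
        groups[key][tag].append(row)
    joined = []
    for key in sorted(groups):
        for left in groups[key]["table1"]:
            for right in groups[key]["table2"]:
                joined.append(left + right)
    return joined

joined = reduce_phase(map_phase(TABLE1, TABLE2))
print(joined)
```

Rows that satisfy table1.col1 = table2.col1 share a key, so they arrive in the same reduce group. A condition such as table1.col1 < table2.col1 gives matching rows no common key to be grouped under.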

Page 19: Hive.pptx_ver_2.0

View

A view is a sort of "virtual table" that is defined by a SELECT statement.

Views can be used to present data to users in a way that differs from how it is actually stored on disk.

Syntax
CREATE VIEW <ViewName>
AS SELECT * FROM <TableName>
WHERE <Condition>;
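The "virtual table" idea can be demonstrated with Python's built-in sqlite3 module standing in for Hive (an assumption of this sketch; the CREATE VIEW shape is the same, the engine is not). The table employees, the view it_employees, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("asha", "hr"), ("binu", "it"), ("chitra", "it")])

# The view stores only the SELECT statement, not a copy of the rows.
conn.execute("CREATE VIEW it_employees AS SELECT * FROM employees WHERE dept = 'it'")

before = conn.execute("SELECT COUNT(*) FROM it_employees").fetchone()[0]
conn.execute("INSERT INTO employees VALUES ('dev', 'it')")
after = conn.execute("SELECT COUNT(*) FROM it_employees").fetchone()[0]
print(before, after)
conn.close()
```

The second count includes the newly inserted row because the view's SELECT is evaluated when the view is queried, not when it is created.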

Page 20: Hive.pptx_ver_2.0

Hive Data Model

Data in Hive is organized into:

Tables
These are analogous to tables in relational databases. Tables can be filtered, projected, joined and unioned. Additionally, all the data of a table is stored in a directory in HDFS.

Partitions
Each table can have one or more partition keys, which determine how the data is stored.

Buckets
Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
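How partitions and buckets translate into storage can be sketched in a few lines of Python. The table name logs, the partition keys dt and country, and the user IDs are hypothetical. The sketch assumes Hive's default warehouse directory and assumes that an INT column hashes to its own value; other column types go through Hive's own hash function, which is not modelled here.

```python
WAREHOUSE_DIR = "/user/hive/warehouse"  # Hive's default warehouse location; assumed here

def partition_dir(table, partition_values):
    """Build the directory for one partition: one key=value level per partition key."""
    levels = [f"{key}={value}" for key, value in partition_values]
    return "/".join([WAREHOUSE_DIR, table] + levels)

def bucket_for(int_value, num_buckets):
    """Pick a bucket as hash modulo bucket count; an INT is assumed to hash to itself."""
    return (int_value & 0x7FFFFFFF) % num_buckets

path = partition_dir("logs", [("dt", "2012-08-16"), ("country", "IN")])
buckets = {user_id: bucket_for(user_id, 4) for user_id in (0, 1, 4, 7, 10)}
print(path)
print(buckets)
```

Each partition is a nested directory under the table's directory, and each bucket number corresponds to one file inside a partition directory.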

Page 21: Hive.pptx_ver_2.0

Partitions

Page 22: Hive.pptx_ver_2.0

Buckets

Page 23: Hive.pptx_ver_2.0

Buckets (Contd..)

Page 24: Hive.pptx_ver_2.0

The Metastore

There are three configurations:

Embedded metastore
It contains an embedded Derby database instance backed by the local disk. This does not support multiple sessions.

Local metastore
It uses a standalone database. MySQL is a popular choice for the standalone metastore.

Remote metastore
One or more metastore servers run in separate processes from the Hive service.

Page 25: Hive.pptx_ver_2.0

Configuring Hive to have MySQL as Metastore DB

Get the MySQL JDBC connector JAR and copy it to the hive/lib directory.

Copy the hive-schema-0.7.0.mysql.sql file, found at hive-0.7.1-cdh3u2/src/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql, to the machine where the MySQL DB is installed and keep it in a directory for later use.

Connect to the DB with the ID and password:
$> mysql -u username -p"password"

Create the database hive_db_metastore and load the schema:
mysql> create database hive_db_metastore;
mysql> use hive_db_metastore;
mysql> SOURCE /home/Emp_Id/hive-schema-0.7.0.mysql.sql;

Page 26: Hive.pptx_ver_2.0

Configuring Hive to have MySQL as Metastore DB

You also need a MySQL user account that Hive can use to access the metastore.

Steps
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON hive_db_metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON hive_db_metastore.* FROM 'hiveuser'@'%';

Page 27: Hive.pptx_ver_2.0

User Defined Functions

There are three types of UDF in Hive:

1. UDF (User Defined Function) - Operates on a single row and produces a single row as output.
2. UDAF (User Defined Aggregate Function) - Works on multiple input rows and creates a single output row.
3. UDTF (User Defined Table Generating Function) - Operates on a single row and produces multiple rows as output.

A UDF must satisfy the following two properties:

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
2. A UDF must implement at least one evaluate() method

Page 28: Hive.pptx_ver_2.0

User Defined Functions (Contd..)

To use the UDF in Hive:

hive> ADD JAR /path/to/hive-examples.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.hive.Strip';
hive> SELECT strip('banana', 'ab') FROM dummy;

Output: nan
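The slides do not show the source of com.hive.Strip. Assuming it removes any of the given characters from both ends of the string, which is consistent with the output above, the row-in, row-out behaviour of such a UDF can be sketched in Python. This is only a model of the semantics: a real Hive UDF is a Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements evaluate().

```python
def strip(text, strip_chars=None):
    """Remove any of strip_chars (whitespace if omitted) from both ends of text."""
    if text is None:
        return None  # a NULL input yields NULL in this sketch
    return text.strip(strip_chars)

print(strip("banana", "ab"))
```

The function is applied once per input row and returns one value per row, which is what distinguishes a plain UDF from a UDAF or a UDTF.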

Page 29: Hive.pptx_ver_2.0

What Hive is not?

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.

Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes).

Page 30: Hive.pptx_ver_2.0

References

https://cwiki.apache.org/Hive/tutorial.html

https://cwiki.apache.org/Hive/languagemanual-cli.html

Hadoop: The Definitive Guide

Hadoop in Action

Page 31: Hive.pptx_ver_2.0

Thank You
