Hive.pptx_ver_2.0
Copyright © 2012 Tata Consultancy Services Limited
Hive
August 16, 2012
INTERNAL
Only for TCS Internal Training – TCS NextGen Solutions, Kochi
Contents
Introduction
Why Hive?
Configuring Hive
The Hive Shell
Hive Architecture
HiveQL
Data Types and Table Types
Managed Table
External Table
Storage Formats
Queries
View
Hive Data Model
The Metastore
User Defined Functions
What Hive Is Not
Introduction
Hive is a data warehouse infrastructure built on top of Apache Hadoop
Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data
Hive provides a simple query language called HiveQL
HiveQL also lets traditional map/reduce programmers plug in their custom mappers and reducers
Why Hive?
Need a multi-petabyte warehouse
Files are insufficient data abstractions: need tables, schemas, partitions, and indices
Need for an open data format with a flexible schema (RDBMSs have a closed data format)
Configuring Hive
Download a release at ftp://ftp.nextgen.com
Unpack the tarball in a suitable place on your workstation
% tar xzf hive-x.y.z-dev.tar.gz
Put Hive on your path
% export HIVE_HOME=/home/EmpID/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_HOME/bin
Type hive to launch the shell
% hive
hive>
The Hive Shell
The Hive shell is the primary way that we will interact with Hive.
HiveQL is Hive's query language, a dialect of SQL.
HiveQL is generally case insensitive (except for string comparisons).
The Hive shell can also be run in non-interactive mode.
The -f option runs the commands in the specified script file.
Hive Architecture
[Figure: Hive architecture. Components shown: HiveQL; Command Line Interface; Web Interface; Thrift Server (JDBC, ODBC); Metastore; Libraries; Driver (Compiler, Optimizer, Executor); Hive on top of Hadoop (MapReduce + HDFS).]
Hive Architecture (Contd..)
UI: The user interface for users to submit queries and other operations to the system.
CLI: The command line interface to Hive (the shell). This is the default service.
HWI: The Hive web interface can be used as an alternative to the shell. It can be started using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
Driver: The component which receives the queries.
Hive Architecture (Contd..)
Metastore: The component that stores all the structure information of the various tables and partitions in the warehouse.
Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan.
Execution Engine: The component which executes the execution plan created by the compiler.
Thrift Client: The Thrift client makes it easy to run Hive commands from a wide range of programming languages.
Hive Architecture (Contd..)
JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port.
ODBC Driver: The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. The ODBC driver uses Thrift to communicate with the Hive server.
MapReduce: Hive internally runs each query as one or more MapReduce jobs.
HiveQL
HiveQL is Hive's SQL dialect
It does not provide the full features of SQL-92 language constructs
The main differences between HiveQL and SQL are:

Feature             SQL                             HiveQL
Updates             INSERT, UPDATE, and DELETE      INSERT OVERWRITE TABLE
Indexes             Supported                       Not supported
Functions           Hundreds of built-in functions  Dozens of built-in functions
Views               Updatable                       Read-only
Multitable inserts  Not supported                   Supported
Data Types and Table Types
Hive Data Types
Hive supports both primitive and complex data types.
Primitive data types:
Signed integer - TINYINT, SMALLINT, INT, BIGINT
Floating point - FLOAT, DOUBLE
BOOLEAN
STRING
Complex data types: ARRAY, MAP, and STRUCT
Hive Table Types
Hive tables are of two types: managed tables and external tables
Managed Table
Managed Table - Hive moves the data into its warehouse directory
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/txt' INTO TABLE managed_table;
When a managed table is dropped, the table, including its data and metadata, is deleted.
External Table
External Table - Hive refers to data at an existing location outside the warehouse directory
Use the keyword EXTERNAL to specify an external table.
hive> CREATE EXTERNAL TABLE ext_table (dummy STRING) LOCATION '/user/tom/ext_table';
hive> LOAD DATA INPATH '/user/text' INTO TABLE ext_table;
When an external table is dropped, Hive leaves the data untouched and deletes only the metadata.
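The difference in drop behaviour between managed and external tables can be sketched as a toy model in Python. This is not Hive code: the ToyMetastore class, its methods, and the temporary directories are hypothetical stand-ins for the metastore and the file system, chosen only to illustrate which parts survive a drop.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Hypothetical stand-in for the Hive metastore (not a real Hive API)."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = Path(warehouse_dir)
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, location=None):
        # A managed table lives under the warehouse directory;
        # an external table points at a location the user supplies.
        external = location is not None
        data_dir = Path(location) if external else self.warehouse_dir / name
        data_dir.mkdir(parents=True, exist_ok=True)
        self.tables[name] = (data_dir, external)
        return data_dir

    def drop_table(self, name):
        # The metadata entry is removed for both kinds of table.
        data_dir, external = self.tables.pop(name)
        if not external:
            shutil.rmtree(data_dir)  # managed: the data is deleted too
        return data_dir


def demo():
    root = Path(tempfile.mkdtemp())
    try:
        store = ToyMetastore(root / "warehouse")
        managed_dir = store.create_table("managed_table")
        external_dir = store.create_table("ext_table", location=root / "ext_table")
        (managed_dir / "data.txt").write_text("row\n")
        (external_dir / "data.txt").write_text("row\n")

        store.drop_table("managed_table")
        store.drop_table("ext_table")
        # Report whether each data directory survived, and what metadata is left.
        return managed_dir.exists(), external_dir.exists(), dict(store.tables)
    finally:
        shutil.rmtree(root, ignore_errors=True)  # clean up the scratch area


managed_exists, external_exists, remaining = demo()
print(managed_exists, external_exists, remaining)
```

After both drops the metadata is empty, the managed data directory is gone, and the external data directory is still present.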
Queries
Table Creation
hive> CREATE TABLE <table name> (<column name> <data type>, ...)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '<character>';
Alter a Table
hive> ALTER TABLE <table name> ADD COLUMNS (<column name> <data type>);
Drop a Table
hive> DROP TABLE <table name>;
Queries (Contd..)
Describe table structure
hive> DESCRIBE <table name>;
To show all tables in the database
hive> SHOW TABLES;
To load data into Hive tables
hive> LOAD DATA INPATH '<file path>' INTO TABLE <table name>;
To retrieve data from Hive tables
hive> SELECT * FROM <table name>;
Subquery
Hive supports subqueries only in the FROM clause.
The columns in the subquery select list are available in the outer query just like columns of a table
Example
SELECT col FROM (
  SELECT col1 + col2 AS col FROM table1
) table2;
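To see the shape of the result, the same FROM-clause subquery can be run against an in-memory SQLite database from Python. SQLite is only a convenient stand-in here, not Hive, and the three sample rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for Hive purely to show how the query behaves.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],  # made-up sample rows
)

# The inner SELECT defines the column "col"; the outer query reads it
# as if "table2" were an ordinary table.
subquery_rows = conn.execute(
    """
    SELECT col FROM (
        SELECT col1 + col2 AS col FROM table1
    ) table2
    ORDER BY col
    """
).fetchall()
conn.close()

print(subquery_rows)
```

The alias table2 names the derived table, and col is visible to the outer query only because the inner select list defines it.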
Join in Hive
Hive supports only equality joins, outer joins, and left semi joins.
Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.
More than two tables can be joined in Hive
Example
hive> SELECT table1.*, table2.*
    > FROM table1 JOIN table2 ON (table1.col1 = table2.col1);
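The equality join above, and an outer join for comparison, can be tried on a small in-memory SQLite database from Python. SQLite is used only as a stand-in for Hive; the extra columns (name, city) and all sample rows are assumptions made for the illustration.

```python
import sqlite3

# SQLite stands in for Hive; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, name TEXT)")
conn.execute("CREATE TABLE table2 (col1 INTEGER, city TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(2, "x"), (3, "y"), (4, "z")])

# Inner equality join: only keys present in both tables survive.
inner_rows = conn.execute(
    """
    SELECT table1.*, table2.*
    FROM table1 JOIN table2 ON (table1.col1 = table2.col1)
    ORDER BY table1.col1
    """
).fetchall()

# Left outer join: every table1 row is kept; missing matches become NULL.
outer_rows = conn.execute(
    """
    SELECT table1.col1, table2.city
    FROM table1 LEFT OUTER JOIN table2 ON (table1.col1 = table2.col1)
    ORDER BY table1.col1
    """
).fetchall()
conn.close()

print(inner_rows)
print(outer_rows)
```

Both queries use only an equality condition in the ON clause, which is the kind of join condition the slide says Hive accepts.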
View
A view is a sort of “virtual table” that is defined by a SELECT statement
Views can be used to present data to users in a way that differs from how it is actually stored on disk
Syntax
CREATE VIEW <ViewName> AS SELECT * FROM <TableName> WHERE <Condition>;
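The "virtual table" idea can be demonstrated with an in-memory SQLite database from Python. SQLite is a stand-in for Hive here, and the employees table, the sales_staff view, and the sample rows are hypothetical names and data chosen for the sketch.

```python
import sqlite3

# SQLite stands in for Hive; names and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("asha", "sales"), ("ravi", "hr"), ("meera", "sales")],
)

# The view is defined by a SELECT statement and stores no rows itself.
conn.execute(
    "CREATE VIEW sales_staff AS SELECT * FROM employees WHERE dept = 'sales'"
)
view_rows = conn.execute("SELECT name FROM sales_staff ORDER BY name").fetchall()

# A row added to the base table appears in the view on the next query,
# because the view's SELECT is evaluated when the view is read.
conn.execute("INSERT INTO employees VALUES ('john', 'sales')")
view_rows_after = conn.execute(
    "SELECT name FROM sales_staff ORDER BY name"
).fetchall()
conn.close()

print(view_rows)
print(view_rows_after)
```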
Hive Data Model
Data in Hive is organized into:
Tables
These are analogous to tables in relational databases. Tables can be filtered, projected, joined, and unioned. Additionally, all the data of a table is stored in a directory in HDFS.
Partitions
Each table can have one or more partition keys, which determine how the data is stored.
Buckets
Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
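How a row ends up in a table directory, a partition subdirectory, and a bucket file can be sketched in Python. This is a simplified model under stated assumptions: the hash of an integer column is taken to be the integer itself, the bucket file name bucket_N is invented for readability, and the table directory, partition keys, and values are hypothetical.

```python
from pathlib import PurePosixPath


def bucket_for(value: int, num_buckets: int) -> int:
    # Assumption: the hash of an integer column is the integer itself.
    # Masking keeps the hash non-negative before taking the modulus.
    return (value & 0x7FFFFFFF) % num_buckets


def storage_path(table_dir, partition, bucket_value, num_buckets):
    """Build table_dir/key=value/.../<bucket file> for a single row."""
    path = PurePosixPath(table_dir)
    for key, val in partition.items():
        path = path / f"{key}={val}"  # one subdirectory per partition key
    # "bucket_N" is a made-up file name used only for this illustration.
    return path / f"bucket_{bucket_for(bucket_value, num_buckets)}"


row_path = storage_path(
    "/user/hive/warehouse/logs",            # hypothetical table directory
    {"dt": "2012-08-16", "country": "IN"},  # hypothetical partition keys
    bucket_value=42,                        # hypothetical bucketing column value
    num_buckets=4,
)
print(row_path)
```

The partition keys choose the directory and the hash of the bucketing column chooses the file inside it, so rows with the same column value always land in the same bucket.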
[Figures: Partitions; Buckets; Buckets (Contd..)]
The Metastore
There are three configurations:
Embedded metastore: It contains an embedded Derby database instance backed by the local disk. This does not support multiple sessions.
Local metastore: It uses a standalone database. MySQL is a popular choice for the standalone metastore.
Remote metastore: One or more metastore servers run in separate processes from the Hive service.
Configuring Hive to have MySQL as Metastore DB
Get the MySQL JDBC connector JAR and copy it to the hive/lib directory
Copy the hive-schema-0.7.0.mysql.sql file, found at hive-0.7.1-cdh3u2/src/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql, to the machine where the MySQL DB is installed and keep it in a directory for later use.
Connect to the DB with the ID and password
$> mysql -u username -p"password"
Create the database hive_db_metastore and load the schema into it
mysql> CREATE DATABASE hive_db_metastore;
mysql> USE hive_db_metastore;
mysql> SOURCE /home/Emp_Id/hive-schema-0.7.0.mysql.sql;
Configuring Hive to have MySQL as Metastore DB (Contd..)
You also need a MySQL user account that Hive can use to access the metastore
Steps
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON hive_db_metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON hive_db_metastore.* FROM 'hiveuser'@'%';
User Defined Functions
There are three types of UDF in Hive
1. UDF (User Defined Function) - Operates on a single row and produces a single row as output.
2. UDAF (User Defined Aggregate Function) - Works on multiple input rows and creates a single output row.
3. UDTF (User Defined Table Generating Function) - Operates on a single row and produces multiple rows as output.
A UDF must satisfy the following two properties
1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
2. A UDF must implement at least one evaluate() method
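The three kinds of function differ mainly in how many rows go in and how many come out. The Python analogy below illustrates only that row-count contract; real Hive UDFs are Java classes, and the names toy_udf, toy_udaf, and toy_udtf, along with the sample rows, are hypothetical.

```python
def toy_udf(row: str) -> str:
    """UDF analogue: one row in, one row out."""
    return row.upper()


def toy_udaf(rows: list) -> int:
    """UDAF analogue: many rows in, one row out (total character count)."""
    return sum(len(row) for row in rows)


def toy_udtf(row: str) -> list:
    """UDTF analogue: one row in, many rows out (split on commas)."""
    return row.split(",")


rows = ["hive", "pig", "hbase"]  # made-up input rows

udf_out = [toy_udf(row) for row in rows]  # applied row by row: 3 in, 3 out
udaf_out = toy_udaf(rows)                 # applied to the group: 3 in, 1 out
udtf_out = toy_udtf("a,b,c")              # applied to one row: 1 in, 3 out

print(udf_out, udaf_out, udtf_out)
```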
User Defined Functions (Contd..)
To use the UDF in Hive
hive> ADD JAR /path/to/hive-examples.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.hive.Strip';
hive> SELECT strip('banana', 'ab') FROM dummy;
Output: nan
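The output nan follows if strip removes any of the given characters from both ends of the string. The deck does not show the source of com.hive.Strip, so the Python sketch below only models that assumed behaviour; the optional second argument mirrors the idea that a UDF may offer more than one evaluate() method.

```python
def strip(text, strip_chars=None):
    """Toy model of the strip function above (semantics are an assumption).

    With one argument, leading and trailing whitespace is removed.
    With two arguments, any character in strip_chars is removed from both ends.
    """
    if text is None:
        return None  # pass a missing value through unchanged
    return text.strip(strip_chars)


stripped = strip("banana", "ab")  # 'b', 'a' go from the left, 'a' from the right
trimmed = strip("  bee  ")        # whitespace-only variant

print(stripped, trimmed)
```

Stripping stops at the first character not in the set, which is why the inner "a" in banana is left in place.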
What Hive Is Not
Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates
Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes)
References
https://cwiki.apache.org/Hive/tutorial.html
https://cwiki.apache.org/Hive/languagemanual-cli.html
Hadoop-The Definitive Guide
Hadoop in Action
Thank You