Global Big Data Conference - 2014
Hive
BALA KRISHNA G Global Big Data Bootcamp – Jan 2014 (http://globalbigdataconference.com)
2 Global Big Data Conference - 2014 2 Global Big Data Conference - 2014 Speaker : Bala
My introduction
Senior Software and Research Engineer
Big data trainer
More than 1.5 years of experience with Hadoop and Storm
Worked at various large companies (Sun/Oracle, IBM, etc.)
www.linkedin.com/in/gbalakrishna/
Agenda
Class structure
– 1 hour lecture and 1 ½ hour lab
Lecture
– Need for Hive
– Hive history
– Hive powered by
– What is Hive?
– Hive Architecture
– Hive Query Life cycle
– Hive Query Language (HiveQL)
Lab:
– Extensive hands-on experience with Hive
– Derive various insights from a real-world dataset using Hive
Need for Hive
Do I need to learn Java?
Don’t worry! I am here to rescue you.
Need for Hive contd.,
In general, a single MR job does not suffice to derive BI (Business Intelligence)
Often, a series of complex MR jobs must be chained together (advanced data processing)
[Figure: six MapReduce jobs, MR 1 to MR 6, chained together. Legend: MR = MapReduce; mapper task; reducer task]
Need for Hive contd.,
About 20 lines of Hive code can replace roughly 200 lines of Java code
Lowers development time significantly (~16 times)
[Figure: two bar charts comparing Hadoop and Pig. One shows lines of code (axis 0 to 300); the other shows development time in minutes (axis 0 to 300)]
Need for Hive contd.,
You focus only on the “WHAT” part of your data analysis
The “HOW” part is taken care of by the framework
Hive powered by
And many more…
https://cwiki.apache.org/confluence/display/Hive/PoweredBy
Used for processing large amounts of user data; central to meeting the company's reporting needs
Ad hoc queries, reporting, and analytics
Data analytics and data cleaning
What is Hive?
Data warehouse built on top of Hadoop
Provides an SQL-like interface to analyze data
An open-source project under Apache
Works on the high-throughput, high-latency principle (same as Hadoop)
Ability to plug in custom MapReduce programs
Mainly targeted at structured data
Hides MapReduce program complexities from the end user
Hive Architecture
[Figure: Hive architecture. HIVE: CLI, Web Interface, Hive Thrift Server (Python, Perl, and ODBC clients), Driver, Compiler, Optimizer, Plan executor, Metastore. HADOOP: HDFS, MapReduce]
Metastore
Stores table metadata such as database location, owner, creation time, access attributes, table schema, etc.
Comprises two components: 1) service and 2) data storage
[Figure: three metastore configurations. Embedded Metastore: Driver and Metastore Service with Derby, inside the Hive Service. Local Metastore: Driver and Metastore Service with MySQL. Remote Metastore: Driver, a separate Metastore Server, and MySQL]
Hive Query Life cycle Insight
Hive Query Life cycle contd.,
[Figure: Hive query life cycle, steps 1 to 14. Components: Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, Hadoop MapReduce]
Data Models
Database: a namespace for tables
Table: container of the actual data
Example: table sample with columns Id, Name, Age, Sex, State
In the Hive warehouse, the table is stored as a folder: /user/$USER/warehouse/sample
Data Models contd.,
Partition: horizontal slice of a table by a partition key
Let's say the sample table (Id, Name, Age, Sex, State) is partitioned by the State column
Each partition is stored as a subfolder under the sample directory:
/user/$USER/warehouse/sample/State=AL/
/user/$USER/warehouse/sample/State=GA/
/user/$USER/warehouse/sample/State=NC/
/user/$USER/warehouse/sample/State=ND/
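A minimal sketch of how a row's partition-key value maps to a subfolder. It is written in Python purely for illustration; Hive does this internally. The helper name `partition_dir` and the three rows are made up for this example, and the warehouse root is taken from the slide's paths.

```python
import posixpath

def partition_dir(warehouse_root, table, key, value):
    """Return the folder holding rows whose partition key equals value (hypothetical helper)."""
    return posixpath.join(warehouse_root, table, f"{key}={value}")

# Made-up rows with the columns of the sample table.
rows = [
    {"Id": 1, "Name": "joe", "Age": 30, "Sex": "M", "State": "AL"},
    {"Id": 2, "Name": "ann", "Age": 25, "Sex": "F", "State": "GA"},
    {"Id": 3, "Name": "bob", "Age": 41, "Sex": "M", "State": "AL"},
]

# Group row ids by the partition folder they would land in.
layout = {}
for row in rows:
    folder = partition_dir("/user/$USER/warehouse", "sample", "State", row["State"])
    layout.setdefault(folder, []).append(row["Id"])

for folder, ids in sorted(layout.items()):
    print(folder, ids)
```

A query that filters on State only needs to read the matching subfolder, which is the point of partitioning.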
Data Models contd.,
Bucket: divides each partition into further chunks by another column (useful for sampling)
Let's say the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets
In the warehouse, the data is stored as
/user/$USER/warehouse/sample/State=AL/part-00000
/user/$USER/warehouse/sample/State=AL/part-00001
/user/$USER/warehouse/sample/State=GA/part-00000
/user/$USER/warehouse/sample/State=GA/part-00001
.
.
/user/$USER/warehouse/sample/State=ND/part-00000
/user/$USER/warehouse/sample/State=ND/part-00001
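A rough sketch of how a row could be assigned to one of the 2 buckets, in Python for illustration only. It assumes the bucket is the hash of the clustering column modulo the bucket count, and that the hash of an integer is the integer itself; the function name `bucket_file` and the sample ages are made up.

```python
def bucket_file(age, num_buckets=2):
    """Pick the bucket file for a row.

    Assumption: bucket = hash(column) mod num_buckets, with the hash of an
    integer being the integer itself.
    """
    return f"part-{age % num_buckets:05d}"

# Made-up (State, Age) pairs.
people = [("AL", 30), ("AL", 41), ("GA", 25), ("GA", 52)]

for state, age in people:
    print(f"State={state}/{bucket_file(age)}  <- Age {age}")
```

Because every row with the same Age lands in the same file, reading a single bucket gives a predictable sample of each partition.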
Data Loading Techniques
Managed table: a table managed by the Hive warehouse
– Copy a file from the local file system into the Hive warehouse
– Move a file from HDFS into the Hive warehouse (loading from HDFS moves the file rather than copying it)
[Figure: 1) a local FS file and 2) an HDFS file, each loaded into the Hive warehouse on HDFS]
Data Loading Techniques contd.,
External table: a table that is only referenced by the Hive warehouse
– The file is managed directly in HDFS, without copying it into the Hive warehouse
[Figure: 3) an HDFS file, referenced by the Hive warehouse]
Data Loading Techniques contd.,
Discussion: when should you use an external table, and when a managed table?
Question - 01
In which scenario would you use Hive?
1. Completely unstructured nasty data
2. Structured data
3. Any kind of data
4. None of the above
Question – 01 answer
2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.
Question - 02
Which option is not correct about Metastore?
1. It stores the table location
2. It has information about number of partitions and number of buckets
3. It can give you the time at which the table was created
4. It stores the actual data
Question – 02 answer
4. Metastore stores only the metadata. Actual data is stored in HDFS.
Question – 03 (last question)
What is incorrect about Hive?
1. Hive internally generates MapReduce jobs to serve your query
2. Hive runs on top of HDFS
3. Hive is proprietary software
4. Hive supports multiple interfaces to interact with
Question – 03 answer
3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.
Hive Query Language (Hive QL)
Data types – provide types for variables
DDL – provides a way to define databases, tables, etc.
DML – provides a way to modify content
Query statements – provide a way to retrieve content
Data types
Primitive Types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
Booleans: BOOLEAN (TRUE or FALSE)
String: STRING (sequence of characters)
Usage: variable_name <Data Type>
ex: name STRING
Data types contd.,
Complex Types
ARRAY: collection of multiple values of the same data type
Usage: name ARRAY<type>
ex: marks ARRAY<INT>
STRUCT: collection of multiple values of different data types
Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>
ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
MAP: collection of (key, value) pairs
Usage: name MAP<key type, value type>
ex: score MAP<STRING, INT>
Data types contd.,
In a MAP, the key must be a primitive type
Referencing complex types
Previous example:
– marks ARRAY<INT>
– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
– score MAP<STRING, INT>
SELECT marks[0], record.name, score['joe'] FROM <tablename>;
A complex type inside a complex type is allowed
– array inside a struct (as seen before)
DDL
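CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)
COMMENT 'This is a sample table'
PARTITIONED BY (state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
– Column list: the schema (the partition column state is declared only in PARTITIONED BY; Hive rejects a column that is repeated in both places)
– COMMENT: comment for readability
– PARTITIONED BY: partition data by the state column
– ROW FORMAT DELIMITED: rows are delimited by '\n'
– FIELDS TERMINATED BY: fields are terminated by ','
– STORED AS TEXTFILE: store the file as a text file
The table is created in the warehouse directory and is completely managed by Hive
A specific row format and file format can be expressed through a custom SerDe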
SerDe
SerDe stands for Serializer and Deserializer
Read path: HDFS File -> InputFileFormat -> <Key, Value> -> Deserializer -> Row
Write path: Row -> Serializer -> <Key, Value> -> OutputFileFormat -> HDFS File
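To make the two directions concrete, here is a toy delimited-text SerDe in Python. It is only a sketch of the idea, not Hive's SerDe interface: the names `deserialize`, `serialize`, and `COLUMNS`, and the three-column schema, are assumptions made for this example.

```python
FIELD_DELIM = ","
# Hypothetical schema: (column name, Python type used to parse the field).
COLUMNS = [("id", int), ("name", str), ("age", int)]

def deserialize(line):
    """Turn one stored text line into a row (read path)."""
    values = line.rstrip("\n").split(FIELD_DELIM)
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, values)}

def serialize(row):
    """Turn one row back into a stored text line (write path)."""
    return FIELD_DELIM.join(str(row[name]) for name, _ in COLUMNS) + "\n"

row = deserialize("1,joe,30\n")
print(row)
print(repr(serialize(row)))
```

A custom SerDe plays the same role for formats that a plain delimiter cannot describe.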
DDL contd.,
CREATE EXTERNAL TABLE external_sample(id INT, name STRING, age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';
The table is not created in the warehouse directory; it is only referenced by Hive
The referenced data is in HDFS, at /user/department/sample
DDL contd.,
DROP TABLE sample;
Since the sample table is managed by Hive, this deletes all of its data along with the metadata
DROP TABLE external_sample;
Since the external_sample table is *not* managed by Hive, this deletes only the metadata, leaving the actual data untouched
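The difference between dropping a managed table and dropping an external one can be modelled with a toy metastore in Python. Everything here is a stand-in for illustration: `ToyMetastore` is a hypothetical class, and a temporary local folder plays the role of HDFS.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Hypothetical stand-in for the metastore: table name -> (location, is_external)."""

    def __init__(self):
        self.tables = {}

    def create(self, name, location, external):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)

    def drop(self, name):
        # Metadata is always removed; data is removed only for managed tables.
        location, external = self.tables.pop(name)
        if not external:
            shutil.rmtree(location)

root = tempfile.mkdtemp()  # temporary folder standing in for HDFS
managed_path = os.path.join(root, "warehouse", "sample")
external_path = os.path.join(root, "department", "sample")

store = ToyMetastore()
store.create("sample", managed_path, external=False)
store.create("external_sample", external_path, external=True)

store.drop("sample")
store.drop("external_sample")

managed_left = os.path.exists(managed_path)
external_left = os.path.exists(external_path)
print("managed data still there:", managed_left)
print("external data still there:", external_left)

shutil.rmtree(root)  # clean up the temporary folder
```

This is also a practical rule of thumb: data that other tools still need is safer behind an external table.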
DML
Load data into a managed table from the local file system
LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
The file '/home/hive/sample.txt' is in the local file system
It is copied into the Hive warehouse folder
Load data into a managed table from HDFS
LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
The file '/user/hive/sample.txt' is in HDFS
It is moved (not copied) into the Hive warehouse folder
DML contd.,
Insert results into a new table
INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;
The newsample table must be created beforehand
The select query results are loaded into newsample, overwriting its contents
Create a new table with automatically derived schema
CREATE TABLE newsample
AS SELECT * FROM sample;
Creates the newsample table with an automatically derived schema
The query results are populated into it
Query statements
To list available databases
SHOW DATABASES;
To use a particular database
USE <databasename>;
To list all tables available in a database
SHOW TABLES;
Query statements contd.,
Select
SELECT * FROM sample;
Aggregation functions
SELECT COUNT(DISTINCT state) FROM sample;
Group by, Sort by, Order by
SELECT COUNT(*) FROM sample GROUP BY state;
SELECT * FROM sample SORT BY id DESC;
FROM sample SELECT * ORDER BY id ASC;
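SORT BY and ORDER BY look alike but differ in scope: ORDER BY produces one total ordering, while SORT BY only orders the rows within each reducer. The Python sketch below models that difference; the function names are made up, and the round-robin split across reducers is an assumption that stands in for however Hive actually distributes rows.

```python
def order_by(values):
    """ORDER BY: one globally sorted result."""
    return sorted(values)

def sort_by(values, num_reducers):
    """SORT BY: each reducer sorts only the rows it received.

    Assumption: rows are dealt out round-robin; real distribution differs.
    """
    per_reducer = [values[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in per_reducer]

ids = [5, 2, 9, 1, 7, 3]  # made-up id values
print(order_by(ids))      # a single sorted list
print(sort_by(ids, 2))    # two lists, each sorted on its own
```

With a single reducer the two behave the same; with several, only ORDER BY guarantees a fully sorted result.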
Query statements contd.,
Joins
SELECT s.*, o.*
FROM sample s
JOIN orders o
ON (s.id = o.id);
Left and right outer joins are also supported
Multiple joins in one query are accepted
Custom Functions
UDF:
– User-defined function
– Complex/additional logic can be expressed
– Operates row by row
UDAF:
– User-defined aggregate function
– Custom aggregate function logic can be written
– Operates on groups produced by the GROUP BY clause
UDTF:
– User-defined table-generating function
– Takes a single input row and can emit multiple output rows
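The three kinds of custom function differ in how many rows go in and come out. The Python sketch below shows only that shape; it is not Hive's Java UDF API, and the function names and sample values are made up for illustration.

```python
def to_upper(name):
    """UDF-style: one value in, one value out, applied row by row."""
    return name.upper()

def total(values):
    """UDAF-style: all values of a group in, one value out."""
    return sum(values)

def explode(marks):
    """UDTF-style: one input row in, zero or more output rows out."""
    for mark in marks:
        yield (mark,)

# Made-up data.
names = ["joe", "ann"]
marks_by_state = {"AL": [70, 80], "GA": [90]}

print([to_upper(n) for n in names])                      # row by row
print({s: total(m) for s, m in marks_by_state.items()})  # one result per group
print(list(explode([70, 80, 90])))                       # one row becomes three
```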
Hive Limitations
Not suitable for unstructured data
Well suited to OLAP systems (analysis), but not to OLTP
Representing machine learning algorithms can be a challenging task
Performance trade-off compared with hand-written MR programs in various scenarios
– The gap is narrowing from release to release
Important practical tips
Hive logs: /tmp/$USER/hive.log
To list the available functions: SHOW FUNCTIONS
To get help on a specific function: DESCRIBE FUNCTION <function_name>
Config files are in the /usr/lib/hive/conf folder
– hive-site.xml, hive-default.xml, (or) specify a custom file using the -f option (?)
Setting parameters in the Hive session: SET <property>=<value>
References
Hadoop: The Definitive Guide -Tom White
https://cwiki.apache.org/confluence/display/Hive/Home
http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf
Venner, Jason (2009). Pro Hadoop
http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/
Q/A
Backup slides
Schema on Read
Schema on read: the schema is applied when the data is queried, not when it is loaded
Schema on write: the schema is enforced when the data is loaded
Trade-offs of using schema on read
– Faster load time
– Slower query time, since the data is parsed when it is read
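A small Python sketch of the schema-on-read idea: loading keeps the raw lines untouched, and the schema is applied only when the data is queried. The names `load`, `query`, and `SCHEMA` are made up for this example, and returning None for a field that does not fit the schema is an assumption modelled on NULL.

```python
# Hypothetical schema: (column name, Python type used to parse the field).
SCHEMA = [("id", int), ("name", str), ("age", int)]

def load(lines):
    """Loading just stores the raw lines: no parsing, no validation."""
    return list(lines)

def query(raw_lines):
    """The schema is applied only now, while reading."""
    for line in raw_lines:
        fields = line.split(",")
        row = {}
        for index, (name, cast) in enumerate(SCHEMA):
            try:
                row[name] = cast(fields[index])
            except (IndexError, ValueError):
                row[name] = None  # assumption: a bad or missing field becomes NULL
        yield row

stored = load(["1,joe,30", "2,ann,unknown"])  # second line has a bad age
rows = list(query(stored))
for row in rows:
    print(row)
```

The load step is cheap because it does no work; the cost of parsing moves to every query that reads the data.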