Working with Hive Analytics


Transcript of Working with Hive Analytics

Page 1: Working with Hive Analytics

Working with Hive

Topics to Cover

- Introduction to Hive and its Architecture
- Different Modes of Executing Hive Queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises

Page 2: Working with Hive Analytics


Introduction to Hive and its Architecture

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy. This tutorial provides a brief introduction to using Apache Hive and HiveQL with the Hadoop Distributed File System, and it can be your first step towards becoming a successful Hadoop developer with Hive. Prior knowledge of Core Java, SQL and database concepts, the Hadoop file system, and any flavor of the Linux operating system is an added advantage if you want to speed up learning Hive.

Features of Hive

Here are the features of Hive:
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

It is important to understand that Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Page 3: Working with Hive Analytics


Hive Architecture

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. Each unit is described below:

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Metastore: Hive chooses a respective database server to store the schema, or metadata, of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying the schema information in the Metastore. It is one of the replacements for the traditional MapReduce programming approach: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store the data in the file system.
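Tying these units together: a single HiveQL statement such as the sketch below is parsed by the HiveQL Process Engine and run by the Execution Engine as a MapReduce job over data in HDFS, sparing the developer from writing the equivalent Java program. The sales table and its columns here are assumed purely for illustration and are not part of the lab datasets:

SELECT name, SUM(amount) AS total   -- the GROUP BY aggregation becomes the reduce phase
FROM sales                          -- hypothetical sales(name, amount) table
GROUP BY name;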

Page 4: Working with Hive Analytics


How Does Hive Work?

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

Step Operation

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (using a database connectivity interface such as JDBC or ODBC) for execution.

2. Get Plan: The Driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, that is, the requirement of the query.

3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

5. Send Plan: The compiler checks the requirement and resends the plan to the Driver. Up to this point, the parsing and compiling of the query is complete.

6. Execute Plan: The Driver sends the execution plan to the execution engine.

7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the Name node, and the JobTracker assigns the job to the TaskTracker, which resides on the Data node. Here, the query executes as a MapReduce job.

7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

8. Fetch Result: The execution engine receives the results from the Data nodes.

Page 5: Working with Hive Analytics


9. Send Results: The execution engine sends those resultant values to the Driver.

10. Send Results: The Driver sends the results to the Hive interfaces.

HiveQL (DDL & DML Operations)

All the data types in Hive are classified into four types, given as follows:
1. Column Types
2. Literals
3. Null Values
4. Complex Types
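As a brief illustration of column types and complex types, the following sketch shows a table definition that mixes both kinds; the table name and columns are assumptions for demonstration only and are not part of the lab datasets:

CREATE TABLE IF NOT EXISTS type_demo (
  id      INT,                              -- column type: integral
  name    STRING,                           -- column type: string
  scores  ARRAY<INT>,                       -- complex type: array
  contact MAP<STRING, STRING>,              -- complex type: map
  address STRUCT<city:STRING, zip:STRING>   -- complex type: struct
);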

Create Database Statement

Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;

Here, IF NOT EXISTS is an optional clause that prevents an error if a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:

hive> CREATE DATABASE IF NOT EXISTS userdb;

or hive> CREATE SCHEMA userdb;

The following query is used to verify the list of databases:

hive> SHOW DATABASES;

default

userdb
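To start working in the newly created database, the standard USE statement can be issued, for example:

hive> USE userdb;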

Create Table Statement

Create Table is a statement used to create a table in Hive. The syntax and example are as follows:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[ROW FORMAT row_format]

[STORED AS file_format]
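As an illustration of this syntax, the following sketch creates a hypothetical employee table; the table and column names are assumptions for demonstration only and are not used in the labs:

CREATE TABLE IF NOT EXISTS employee (
  eid    INT     COMMENT 'Employee id',
  name   STRING  COMMENT 'Employee name',
  salary FLOAT   COMMENT 'Employee salary'
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;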

Page 6: Working with Hive Analytics


LAB 1 GETTING STARTED WITH THE HIVE ENVIRONMENT

Hive is an open source project and can be downloaded from the Apache website: http://hive.apache.org. You can install it on the CentOS machine that was installed in the previous lab exercises. Hive comes preinstalled with the Cloudera CDH Virtual Machine, and may not require reinstallation.

1. Start the CDH VM, and log in as user cloudera.

Page 7: Working with Hive Analytics


2. In the web browser, click Hue and log in with the same credentials used for the VM login.

3. Click the Query Editors drop-down -> Hive. Run a basic query:

Page 8: Working with Hive Analytics


Hive can also be run from the command line. To do this, either open a terminal within your VM, or connect to it through the PuTTY SSH application.

Execute the commands as given below:

login as: cloudera
cloudera@<vm-address>'s password: cloudera

Page 9: Working with Hive Analytics


[cloudera@quickstart ~]$ hive

2016-12-04 22:15:02,688 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing

PrefixTreeCodec is not present. Continuing without it.

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive> show tables;

OK

canada_regions

sales

things

Time taken: 1.166 seconds, Fetched: 3 row(s)

hive>

hive> select * from sales;

OK

Joe 2

Hank 4

Ali 0

Eve 3

Hank 2

Time taken: 0.98 seconds, Fetched: 5 row(s)

hive>

LAB 2 USING HIVE TO MAP AN EXTERNAL TABLE OVER WEBLOG DATA IN HDFS

You will often want to create tables over existing data that does not live within the managed Hive warehouse in HDFS. Creating a Hive external table is one of the easiest ways to handle this scenario. Queries from the Hive client will execute as they normally do over internally managed tables. Make sure you have access to the Hadoop cluster with Hive installed. This lab depends on having the weblog_entries dataset loaded into an HDFS directory at the absolute path /input/weblog/weblog_entries.txt. Carry out the following steps to map an external table in HDFS:

1. Open a text editor, like vi or gedit.

2. Add the CREATE TABLE syntax, as follows:

DROP TABLE IF EXISTS weblog_entries;

CREATE EXTERNAL TABLE weblog_entries (

md5 STRING,

url STRING,

request_date STRING,

request_time STRING,

ip STRING

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/weblog/';

3. Save the script as weblog_create_external_table.hql in the working directory. Copy the web log file into HDFS:

[cloudera@localhost]$ hadoop fs -mkdir -p /input/weblog/
[cloudera@localhost]$ hadoop fs -put weblog_entries.txt /input/weblog/

4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:

hive -f weblog_create_external_table.hql

Page 10: Working with Hive Analytics


5. You should see two successful commands issued to the Hive client:

OK
Time taken: 3.036 seconds
OK
Time taken: 3.389 seconds

Open Hive in the terminal and explore the newly created table.

[cloudera@quickstart data]$ hive

hive> show tables;

OK

canada_regions

sales

things

weblog_entries

Time taken: 1.139 seconds, Fetched: 4 row(s)

hive> desc weblog_entries;

OK

md5 string

url string

request_date string

request_time string

ip string

Time taken: 0.254 seconds, Fetched: 5 row(s)

hive>

hive> exit;
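Because weblog_entries was declared EXTERNAL, its data stays outside the managed Hive warehouse: dropping the table would remove only its definition from the Metastore, while the files under /input/weblog/ would remain in HDFS. With a managed (internal) table, DROP TABLE deletes the warehouse data directory as well. A quick way to check which kind of table you have (shown here as an illustrative command, not part of the recorded session) is:

hive> DESCRIBE FORMATTED weblog_entries;
-- In the output, "Table Type: EXTERNAL_TABLE" confirms the table is external.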

LAB 3 USING HIVE TO DYNAMICALLY CREATE TABLES FROM THE RESULTS OF A WEBLOG QUERY

This lab will outline a shorthand technique for creating a table inline, at the time the query is executed. Having to create every table definition up front is impractical and does not scale for large ETL jobs. Being able to dynamically define intermediate tables is tremendously useful for complex analytics with multiple staging points. In this lab, we will create a new table that contains three fields from the weblog entry dataset, namely request_date, request_time, and url. In addition to this, we will define a new field called url_length. This lab depends on having the weblog_entries dataset loaded into a Hive table through the previous lab exercise. Issue the following command in Hive:

hive> desc weblog_entries;

Carry out the following steps to create an inline table definition using an alias:

1. Open a text editor, like vi or gedit.

2. Add the following inline creation syntax:

CREATE TABLE weblog_entries_with_url_length AS

SELECT url, request_date, request_time, length(url) as url_length

FROM weblog_entries;

3. Save the script as weblog_entries_create_table_as.hql in the active directory.

4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:

hive -f weblog_entries_create_table_as.hql

5. To verify that the table was created successfully, issue the following command, using the -e option:

hive -e "describe weblog_entries_with_url_length"

6. You should see a table with three string fields and a fourth int field holding the

Page 11: Working with Hive Analytics


URL length:

url string
request_date string
request_time string
url_length int
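As a quick usage check (a sketch only, not part of the recorded lab output), the derived column can be inspected directly from the Hive prompt:

hive> SELECT url, url_length FROM weblog_entries_with_url_length LIMIT 5;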

LAB 4 USING HIVE TO INTERSECT WEBLOG IPS AND DETERMINE THE COUNTRY

Hive does not directly support foreign keys. Nevertheless, it is still very common to join records on identically matching keys contained in one or more tables. This lab will show a very simple inner join over weblog data that links each request record in the weblog_entries table to a country, based on the request IP. For each record contained in the weblog_entries table, the query will print the record out with an additional trailing value showing the determined country. Make sure you have access to the Hadoop cluster with Hive installed. This lab depends on having the weblog_entries dataset loaded into a Hive table through lab exercise 2. Issue the following command in Hive:

describe weblog_entries

You should see the following response:

OK

md5 string

url string

request_date string

request_time string

ip string

Additionally, this lab requires that the ip-to-country dataset be loaded into a Hive table named ip_to_country, with the following fields mapped to the respective datatypes.

1. Copy the file ip_to_country.txt into HDFS:

[cloudera@localhost data]$ hadoop fs -mkdir -p /input/ip_to_country
[cloudera@localhost data]$ hadoop fs -put ip_to_country.txt /input/ip_to_country

2. Add the CREATE TABLE syntax, as follows:

[cloudera@localhost]$ vi ip-to-country.hsql

DROP TABLE IF EXISTS ip_to_country;

CREATE EXTERNAL TABLE ip_to_country (

ip string,

country string

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/ip_to_country';

[cloudera@localhost data]$ hive -f ip-to-country.hsql

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties

OK

Time taken: 0.812 seconds

OK

Time taken: 0.601 seconds

Page 12: Working with Hive Analytics


[cloudera@localhost]$ hive -e "describe ip_to_country"

Performing an inner join in Hive:

1. Open a text editor, like vi or gedit.

2. Add the following join query:

SELECT wle.*, itc.country FROM weblog_entries wle

JOIN ip_to_country itc ON wle.ip = itc.ip;

3. Save the script as weblog_simple_ip_join.hql in the active directory.

4. Run the script from the operating system shell by supplying the -f option to the Hive client. You should see the results of the SELECT statement printed out to the console. The following snippet is a printout containing only two sample rows; the full printout will contain all 3000 rows.
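If the joined output needs to be kept for further analysis rather than just printed to the console, the inline table creation technique from Lab 3 can be combined with this join. The sketch below is illustrative only; the table name weblog_entries_with_country is an assumption, not part of the lab script:

CREATE TABLE weblog_entries_with_country AS
SELECT wle.*, itc.country
FROM weblog_entries wle
JOIN ip_to_country itc ON wle.ip = itc.ip;

-- A quick aggregation over the saved result, e.g. requests per country:
SELECT country, COUNT(*) AS requests
FROM weblog_entries_with_country
GROUP BY country;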
