Hive Workshop
Peter Smyth
24th June 2016
An introduction to Hadoop, HDFS, Hive and HQL
Workshop
Admin
• Fire Alarm Test expected at 11:00am
• Real fire – out the way you came in (upstairs and out the main entrance)
• Toilets – on this floor
• Coffee breaks in the morning and afternoon
• Lunch (sandwiches) at about 1:00pm
Program session times
09.30 Registration and coffee
10.00 Introductions to Hadoop, Hive, the software and each other
11.15 Coffee break
11.30 Hive 1: Hive Queries (Lessons 1 & 2)
13.00 Lunch - sandwiches
13.30 Hive 2: Creating tables, table types and table storage (Lesson 3)
14.30 Coffee break
14.45 Hive 3: Accessing Hive using the command line and ODBC (Lessons 4 & 5)
16.00 Close
Overview of this workshop
1. High level explanation of the Hadoop ecosystem
2. Overview of HDFS and Hadoop processing
3. Develop Hive queries to slice, dice sample and join big datasets
4. Create and load data into tables
5. Demonstrate the use of Hive queries from external systems using ODBC
User experience & aims
• Experience of big data?
• Expectations?
• Next steps?
What is big data?
• Perhaps the most relevant definition is also the most obvious:
“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)
• Or to put it another way, too big to fit into your favourite desktop application.
Growing and shrinking data
[Diagram: how datasets grow and shrink along a scale from 1Kb to 10+Gb. The data in one sent tweet (~1Kb) grows through all tweets from a user to all tweets from a user & friends; smart meter data grows from a single reading towards all smart meter data, and shrinks again when aggregated by day, by month, or by month and geography. Desktop applications cope at the small end of the scale; beyond that a big data environment is needed.]
Hadoop
• Created by Doug Cutting and Mike Cafarella
• Based on the 2003-4 Google papers on MapReduce and GFS (Google File System)
• The ideas behind MapReduce had been around for over 40 years, first appearing in the programming language Lisp in 1961
• The name Hadoop comes from a cuddly toy elephant owned by Doug Cutting's son
Hadoop Infrastructure
• A Hadoop cluster can be formed from thousands of nodes – four would be a minimum
• A node is an individual computer many times more powerful than the average desktop.
• The strength of the Hadoop system is not its raw power, but its ability to break a processing task down and run all of the parts in parallel
• It is built to be resilient, i.e. cope with server failures
A picture of the Hadoop Eco-system
(http://hortonworks.com/)
A (very) minimal picture of Hadoop
[Diagram: Hadoop has two core components – HDFS and MapReduce]
Hadoop components
• HDFS (Hadoop Distributed File System)
• To the end user, just a file system
• Internally, files are segmented into blocks of 128MB and randomly distributed across the available datanodes
• A datanode is a server in the Hadoop cluster where actual processing takes place – i.e. where your programs are run
• A namenode (another server in the cluster) keeps track of the random distribution of the blocks
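From the Hive command line you can inspect HDFS directly. A minimal sketch (the warehouse path shown is the common default, but may differ on your cluster):

dfs -ls /user/hive/warehouse;    -- list Hive's warehouse directory in HDFS
dfs -du -h /user/hive/warehouse; -- show how much space each table's files use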
• Map Reduce
• The execution engine
• Enhanced by Tez
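Switching between the execution engines is a one-line session setting. A small sketch (hive.execution.engine is a standard Hive property; its default value varies by version and distribution):

SET hive.execution.engine;      -- show the engine currently in use
SET hive.execution.engine=tez;  -- run subsequent queries on Tez
SET hive.execution.engine=mr;   -- or fall back to classic MapReduce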
Data + Program = Processing
[Diagram: in a traditional client-server system the data is brought to the program (e.g. MS Word); in Hadoop the program is sent to where the big data already lives]
Sandbox v Cluster
• The Sandbox environment we will be using is not a cluster – it is one virtual machine running on one physical machine
• The Sandbox thinks it is a cluster; it is configured that way
• In reality, everything that a cluster does in parallel, the Sandbox does in series
• This makes it very slow
• But it can still process big datasets
• The code for the queries you write will be just the same
Sandbox limitations
• Speed of processing – already mentioned
• Capacity – a Hortonworks Sandbox has a size limit of 50Gb
• In practice this means a limit of 25-30Gb of storage space for your datasets in HDFS
• It is possible to expand the size of the VM
• Any space used by the VM as it expands must be available on the hosting physical machine
What is Hive?
This is the definition from Apache.org:
“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.”
Using Hive – What we will cover
• Select statements (i.e. queries)
• Table creation and loading data
• Sampling tables, joining tables
• Table storage and partitioned tables
• Various ways of accessing Hive
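As a flavour of what those lessons involve, a minimal HiveQL sketch (it assumes the edrp_elec table described later in these slides has already been created and loaded):

-- Total electricity recorded for a handful of meters
SELECT anon_id, SUM(eleckwh) AS total_kwh
FROM edrp_elec
GROUP BY anon_id
LIMIT 10;

-- Sampling: count rows in a roughly 1% random bucket of the table
SELECT COUNT(*)
FROM edrp_elec TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s;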
Finding Information on the Internet
• Relating to Apache Hadoop and Hive
Apache.org
Specific Projects
• Hive has its own entry in the project list
• HDFS is included as part of the Hadoop project
• All of the top pages of the projects have links to the official documentation
• The documentation can be a bit thin on examples and explanations
• Versions change regularly
Hive wiki page(s)
Getting Help
• Apache Wikis
• Hortonworks or other Hadoop providers
• Stackoverflow
• Just google it!
The Environment
• We are using laptops with:
• Windows 7
• Intel i5-6200U processors
• 16Gb of RAM
The amount of RAM is key, as the Hortonworks Sandbox requires 10Gb to run reasonably well
Installed on the Laptop (all Freeware)
• VMWare Player - for running the virtualisation environment which contains the Sandbox VM
• The Hortonworks Hadoop Sandbox v2.3.2 (just a set of files used by VMWare to build the Sandbox VM)
• PuTTY – A program which allows remote access to (typically Linux-based) systems – we need this to access the command line of the VM
• Filezilla – A program which allows the transfer of files between two different systems – we can use this to transfer our data files from the Windows system to the Linux VM system
Installed on the Laptop (all Freeware)
• Web browser - used to connect to the Web servers exposed by Hadoop and the Hortonworks Sandbox – we need this to communicate with Hive via Ambari or Hue
• Toad for Hadoop - used to connect directly to the Hive and HDFS environments on the Sandbox
The Sandbox environment
• The Sandbox is a complete Linux operating system which contains a complete Hadoop environment all running in a single VM (Virtual Machine)
• Which in turn is hosted on a Windows machine using VMware virtualisation software.
• Why a Virtual Machine? It gives a separate, isolated environment
The Hadoop Environment
• Available automatically after the Linux environment has started
• The Hadoop environment includes:
• The software component known as Hive
• A file system called HDFS
• Within HDFS:
• the data files we will be using to write Hive queries against will be stored
• Hive also uses HDFS to store tables (a sketch follows below)
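For example, a managed table keeps its files under Hive's warehouse directory in HDFS, while an external table can sit over files we have uploaded ourselves. A hedged sketch (the table names and HDFS location are illustrative):

-- Managed table: Hive owns the files, stored under /user/hive/warehouse
CREATE TABLE demo_managed (id INT, val STRING);

-- External table: Hive projects structure onto files we placed in HDFS ourselves
CREATE EXTERNAL TABLE demo_external (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/demo_external';   -- hypothetical HDFS directory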
Accessing Hive
• We can use a variety of applications to access the Hive environment
• Some are better than others
• They don't all provide the same functionality
• For some tasks there may be no alternatives
• All of the ones we will be using are either part of the Hadoop environment or are freely available for download
Hive accessing applications
VMWare Player
• Used to create the VM Sandbox
You only need to know the IP address from here
PuTTY
• Used to access Hive and HDFS from the command line
Filezilla
• Used to transfer files between the Windows and the Linux VM
Web Browser
• Any web browser will do
• Older versions of IE don't always render the layouts correctly
• There are two web interfaces available:
• Ambari – a Hortonworks-developed product which provides Hadoop management as well as access to Hive and HDFS
• Hue – an open source development by Cloudera. It has been removed from the most recent version of the Hortonworks Sandbox
• Both provide similar functionality for Hive and HDFS:
• Load files
• Create tables
• Run queries
Ambari front screen
Ambari - Hive interface
Hue – Front screen
Hue – Hive interface
Toad for Hadoop
• A Windows application, in beta, currently free.
ODBC
• The Hive ODBC driver is transparent once installed and allows access to Hive from a variety of programming and application environments.
The data to be used
We will be using the Smartmeter data which can be downloaded from UKDS (SN7591). There are four files.
The edrp_metadata.xlsx file is not loaded into Hive. It will be used to compare results from some of our queries.
File                      Size      Num. Records
edrp_elec.csv             12.07GB   413,836,038
edrp_gas.csv              6.83GB    246,482,700
edrp_geography_data.csv   1.3MB     14,617
edrp_metadata.xlsx        –         –
Edrp_elec, edrp_gas file layout
ANON_ID,ADVANCEDATETIME,HH,ELECKWH
5110,15FEB08:12:30:00,25,0.61
1617,15FEB08:12:30:00,25,0.25
4869,15FEB08:12:30:00,25,0.39
015,15FEB08:12:30:00,25,0.41
1628,15FEB08:12:30:00,25,0.85
In the gas file, the last column is GASKWH
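A hedged sketch of a matching Hive table definition (Lesson 3 covers this properly; ADVANCEDATETIME is kept as a string because its 15FEB08:12:30:00 format is not one Hive parses natively, and skip.header.line.count drops the CSV header row):

CREATE TABLE edrp_elec (
  anon_id         INT,
  advancedatetime STRING,
  hh              INT,
  eleckwh         DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');

-- The gas table is identical apart from the final column, gaskwh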
Edrp_geography_data file layout
anonID,eProfileClass,fuelTypes,ACORN_Category,ACORN_Group,ACORN_Type,ACORN_Code,ACORN_Description,NUTS4,LACode,NUTS1,gspGroup,LDZ,Elec_Tout,Gas_Tout
1,2,Dual,1,C,10,"1 ,C ,10",Well-off working families with mortgages,--,--,UKG,_B,WM,0,0
2,1,Dual,4,M,43,"4 ,M ,43","Older people, rented terraces",UKL1605,00PL,UKL,_K,WS,1,1
3,1,ElecOnly,3,I,32,"3 ,I ,32",Retired home owners,UKJ4210,29UN,UKJ,_J,SE,0,0
4,1,Dual,3,H,31,"3 ,H ,31",Home owning Asian family areas,--,--,UKI,--,--,0,0
5,1,ElecOnly,4,M,43,"4 ,M ,43","Older people, rented terraces",UKM3800,00RF,UKM,_N,SC,1,0
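The anonID column is what links this file to the meter readings, so a join like the following counts meters per fuel type (a sketch, assuming the tables are named edrp_elec and edrp_geography_data):

SELECT g.fueltypes, COUNT(DISTINCT e.anon_id) AS meters
FROM edrp_geography_data g
JOIN edrp_elec e ON e.anon_id = g.anonid
GROUP BY g.fueltypes;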
edrp_metadata.xlsx contents
• The highlighted columns are ones we will try to verify with queries
Objectives of the analysis
• To compare the data contained in the edrp_metadata.xlsx file with our query results
• Verify the hypothesis that electricity-only users use proportionally more electricity in winter than dual-fuel users (a sketch of such a query follows below)
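A hedged sketch of the shape such a query could take (month handling is simplified by slicing the ADVANCEDATETIME string; the workshop sessions build this up more carefully):

SELECT g.fueltypes,
       SUBSTR(e.advancedatetime, 3, 5) AS month_label,    -- e.g. 'FEB08'
       SUM(e.eleckwh) / COUNT(DISTINCT e.anon_id) AS kwh_per_meter
FROM edrp_elec e
JOIN edrp_geography_data g ON e.anon_id = g.anonid
GROUP BY g.fueltypes, SUBSTR(e.advancedatetime, 3, 5);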
Acorn codes
Acorn segments postcodes and neighbourhoods into 6 Categories, 18 Groups and 62 types, three of which are not private households. By analysing significant social factors and population behaviour, it provides precise information and in-depth understanding of the different types of people.
http://acorn.caci.co.uk/
NUTS1 Codes