Hive Workshop
Peter Smyth
24th June 2016
An introduction to Hadoop, HDFS, Hive and HQL
Workshop
Admin
• Fire Alarm Test expected at 11:00am
• Real fire – out the way you came in (upstairs and out the main entrance)
• Toilets – on this floor
• Coffee breaks in the morning and afternoon
• Lunch (sandwiches) at about 1:00pm
Program session times
09.30 Registration and coffee
10.00 Introductions to Hadoop, Hive, the software and each other
11.15 Coffee break
11.30 Hive 1: Hive Queries (Lessons 1 & 2)
13.00 Lunch - sandwiches
13.30 Hive 2: Creating tables, table types and table storage (Lesson 3)
14.30 Coffee break
14.45 Hive 3: Accessing Hive using the command line and ODBC (Lessons 4 & 5)
16.00 Close
Overview of this workshop
1. High level explanation of the Hadoop ecosystem
2. Overview of HDFS and Hadoop processing
3. Develop Hive queries to slice, dice sample and join big datasets
4. Create and load data into tables
5. Demonstrate the use of Hive queries from external systems using ODBC
User experience & aims
• Experience of big data?
• Expectations?
• Next steps?
What is big data?
• Perhaps the most relevant definition is also the most obvious:
“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)
• Or to put it another way, too big to fit into your favourite desktop application.
Growing and shrinking data
[Diagram: how datasets grow and shrink along a scale from 1Kb to 10+Gb. The data in one sent tweet (~1Kb) grows through all tweets from a user to all tweets from a user & friends; smart meter data grows from a single reading towards all smart meter data, and shrinks again when aggregated by day, by month, or by month and geography. Desktop applications cope at the small end of the scale; beyond that a big data environment is needed.]
Hadoop
• Created by Doug Cutting and Mike Cafarella
• Based on the 2003-4 Google papers on MapReduce and GFS (Google File System)
• The ideas behind MapReduce had been around for over 40 years, first appearing in the programming language Lisp in 1961
• The name Hadoop comes from a cuddly toy elephant owned by Doug Cutting's son
Hadoop Infrastructure
• A Hadoop cluster can be formed from thousands of nodes – four would be a minimum
• A node is an individual computer many times more powerful than the average desktop.
• The strength of the Hadoop system is not its raw power, but its ability to break a processing task down and run all of the parts in parallel
• It is built to be resilient, i.e. cope with server failures
A picture of the Hadoop Eco-system
(http://hortonworks.com/)
A (very) minimal picture of Hadoop
[Diagram: Hadoop has two core components – HDFS and MapReduce]
Hadoop components
• HDFS (Hadoop Distributed File System)
• To the end user, just a file system
• Internally, files are segmented into blocks of 128MB and randomly distributed across the available datanodes
• A datanode is a server in the Hadoop cluster where actual processing takes place – i.e. where your programs are run
• A namenode (another server in the cluster) keeps track of the random distribution of the blocks
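From the Hive command line you can inspect HDFS directly. A minimal sketch (the warehouse path shown is the common default, but may differ on your cluster):

dfs -ls /user/hive/warehouse;    -- list Hive's warehouse directory in HDFS
dfs -du -h /user/hive/warehouse; -- show how much space each table's files use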
• Map Reduce
• The execution engine
• Enhanced by Tez
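Switching between the execution engines is a one-line session setting. A small sketch (hive.execution.engine is a standard Hive property; its default value varies by version and distribution):

SET hive.execution.engine;      -- show the engine currently in use
SET hive.execution.engine=tez;  -- run subsequent queries on Tez
SET hive.execution.engine=mr;   -- or fall back to classic MapReduce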
Data + Program = Processing
[Diagram: in a traditional client-server system the data is brought to the program (e.g. MS Word); in Hadoop the program is sent to where the big data already lives]
Sandbox v Cluster
• The Sandbox environment we will be using is not a cluster – it is one virtual machine running on one physical machine
• The Sandbox thinks it is a cluster; it is configured that way
• In reality, everything that a cluster does in parallel, the Sandbox does in series
• This makes it very slow
• But it can still process big datasets
• The code for the queries you write will be just the same
Sandbox limitations
• Speed of processing – already mentioned
• Capacity – a Hortonworks Sandbox has a size limit of 50Gb
• In practice this means a limit of 25-30Gb of storage space for your datasets in HDFS
• It is possible to expand the size of the VM
• Any space used by the VM as it expands must be available on the hosting physical machine
What is Hive?
This is the definition from Apache.org:
“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.”
Using Hive – What we will cover
• Select statements (i.e. queries)
• Table creation and loading data
• Sampling tables, joining tables
• Table storage and partitioned tables
• Various ways of accessing Hive
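As a flavour of what those lessons involve, a minimal HiveQL sketch (it assumes the edrp_elec table described later in these slides has already been created and loaded):

-- Total electricity recorded for a handful of meters
SELECT anon_id, SUM(eleckwh) AS total_kwh
FROM edrp_elec
GROUP BY anon_id
LIMIT 10;

-- Sampling: count rows in a roughly 1% random bucket of the table
SELECT COUNT(*)
FROM edrp_elec TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s;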
Finding Information on the Internet
• Relating to Apache Hadoop and Hive
Apache.org
Specific Projects
• Hive has its own entry in the project list
• HDFS is included as part of the Hadoop project
• All of the top pages of the projects have links to the official documentation
• The documentation can be a bit thin on examples and explanations
• Versions change regularly
Hive wiki page(s)
Getting Help
• Apache Wikis
• Hortonworks or other Hadoop providers
• Stackoverflow
• Just google it!
The Environment
• We are using laptops with:
• Windows 7
• Intel i5-6200U processors
• 16Gb of RAM
The amount of RAM is key, as the Hortonworks Sandbox requires 10Gb to run reasonably well
Installed on the Laptop (all Freeware)
• VMWare Player - for running the virtualisation environment which contains the Sandbox VM
• The Hortonworks Hadoop Sandbox v2.3.2 (just a set of files used by VMWare to build the Sandbox VM)
• PuTTY – A program which allows remote access to (typically Linux-based) systems – we need this to access the command line of the VM
• Filezilla – A program which allows the transfer of files between two different systems – we can use this to transfer our data files from the Windows system to the Linux VM system
Installed on the Laptop (all Freeware)
• Web browser - used to connect to the Web servers exposed by Hadoop and the Hortonworks Sandbox – we need this to communicate with Hive via Ambari or Hue
• Toad for Hadoop - used to connect directly to the Hive and HDFS environments on the Sandbox
The Sandbox environment
• The Sandbox is a complete Linux operating system which contains a complete Hadoop environment all running in a single VM (Virtual Machine)
• Which in turn is hosted on a Windows machine using VMware virtualisation software.
• Why a Virtual Machine? It gives a separate, isolated environment
The Hadoop Environment
• Available automatically after the Linux environment has started
• The Hadoop environment includes:
• The software component known as Hive
• A file system called HDFS
• Within HDFS:
• the data files we will be using to write Hive queries against will be stored
• Hive also uses HDFS to store tables (a sketch follows below)
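For example, a managed table keeps its files under Hive's warehouse directory in HDFS, while an external table can sit over files we have uploaded ourselves. A hedged sketch (the table names and HDFS location are illustrative):

-- Managed table: Hive owns the files, stored under /user/hive/warehouse
CREATE TABLE demo_managed (id INT, val STRING);

-- External table: Hive projects structure onto files we placed in HDFS ourselves
CREATE EXTERNAL TABLE demo_external (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/demo_external';   -- hypothetical HDFS directory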
Accessing Hive
• We can use a variety of applications to access the Hive environment
• Some are better than others
• They don't all provide the same functionality
• For some tasks there may be no alternatives
• All of the ones we will be using are either part of the Hadoop environment or are freely available for download
Hive accessing applications
VMWare Player
• Used to create the VM Sandbox
You only need to know the IP address from here
PuTTY
• Used to access Hive and HDFS from the command line
Filezilla
• Used to transfer files between the Windows and the Linux VM
Web Browser
• Any web browser will do
• Older versions of IE don't always render the layouts correctly
• There are two web interfaces available:
• Ambari – a Hortonworks-developed product which provides Hadoop management as well as access to Hive and HDFS
• Hue – an open source development by Cloudera. It has been removed from the most recent version of the Hortonworks Sandbox
• Both provide similar functionality for Hive and HDFS:
• Load files
• Create tables
• Run queries
Ambari front screen
Ambari - Hive interface
Hue – Front screen
Hue – Hive interface
Toad for Hadoop
• A Windows application, in beta, currently free.
ODBC
• The Hive ODBC driver is transparent once installed and allows access to Hive from a variety of programming and application environments.
The data to be used
We will be using the Smartmeter data which can be downloaded from UKDS (SN7591). There are four files.
The edrp_metadata.xlsx file is not loaded into Hive. It will be used to compare results from some of our queries.
File                      Size      Num. Records
edrp_elec.csv             12.07GB   413,836,038
edrp_gas.csv              6.83GB    246,482,700
edrp_geography_data.csv   1.3MB     14,617
edrp_metadata.xlsx        –         –
Edrp_elec, edrp_gas file layout
ANON_ID,ADVANCEDATETIME,HH,ELECKWH
5110,15FEB08:12:30:00,25,0.61
1617,15FEB08:12:30:00,25,0.25
4869,15FEB08:12:30:00,25,0.39
015,15FEB08:12:30:00,25,0.41
1628,15FEB08:12:30:00,25,0.85
In the gas file, the last column is GASKWH
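A hedged sketch of a matching Hive table definition (Lesson 3 covers this properly; ADVANCEDATETIME is kept as a string because its 15FEB08:12:30:00 format is not one Hive parses natively, and skip.header.line.count drops the CSV header row):

CREATE TABLE edrp_elec (
  anon_id         INT,
  advancedatetime STRING,
  hh              INT,
  eleckwh         DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');

-- The gas table is identical apart from the final column, gaskwh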
Edrp_geography_data file layout
anonID,eProfileClass,fuelTypes,ACORN_Category,ACORN_Group,ACORN_Type,ACORN_Code,ACORN_Description,NUTS4,LACode,NUTS1,gspGroup,LDZ,Elec_Tout,Gas_Tout
1,2,Dual,1,C,10,"1 ,C ,10",Well-off working families with mortgages,--,--,UKG,_B,WM,0,0
2,1,Dual,4,M,43,"4 ,M ,43","Older people, rented terraces",UKL1605,00PL,UKL,_K,WS,1,1
3,1,ElecOnly,3,I,32,"3 ,I ,32",Retired home owners,UKJ4210,29UN,UKJ,_J,SE,0,0
4,1,Dual,3,H,31,"3 ,H ,31",Home owning Asian family areas,--,--,UKI,--,--,0,0
5,1,ElecOnly,4,M,43,"4 ,M ,43","Older people, rented terraces",UKM3800,00RF,UKM,_N,SC,1,0
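The anonID column is what links this file to the meter readings, so a join like the following counts meters per fuel type (a sketch, assuming the tables are named edrp_elec and edrp_geography_data):

SELECT g.fueltypes, COUNT(DISTINCT e.anon_id) AS meters
FROM edrp_geography_data g
JOIN edrp_elec e ON e.anon_id = g.anonid
GROUP BY g.fueltypes;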
edrp_metadata.xlsx contents
• The highlighted columns are ones we will try to verify with queries
Objectives of the analysis
• To compare the data contained in the edrp_metadata.xlsx file with our query results
• Verify the hypothesis that electricity-only users use proportionally more electricity in winter than dual-fuel users (a sketch of such a query follows below)
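A hedged sketch of the shape such a query could take (month handling is simplified by slicing the ADVANCEDATETIME string; the workshop sessions build this up more carefully):

SELECT g.fueltypes,
       SUBSTR(e.advancedatetime, 3, 5) AS month_label,    -- e.g. 'FEB08'
       SUM(e.eleckwh) / COUNT(DISTINCT e.anon_id) AS kwh_per_meter
FROM edrp_elec e
JOIN edrp_geography_data g ON e.anon_id = g.anonid
GROUP BY g.fueltypes, SUBSTR(e.advancedatetime, 3, 5);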
Acorn codes
Acorn segments postcodes and neighbourhoods into 6 Categories, 18 Groups and 62 types, three of which are not private households. By analysing significant social factors and population behaviour, it provides precise information and in-depth understanding of the different types of people.
http://acorn.caci.co.uk/
NUTS1 Codes