May 2013 HUG: HCatalog/Hive Data Out
Uploaded by: yahoo-developer-network
Category: Technology
HCatalog/Hive DataOut
Bay Area Hadoop User Group Meetup
May 15, 2013
Moving Data Out of Hadoop Clusters Today
2 Yahoo! Presentation, Confidential
[Diagram: paths for moving data out of Hadoop clusters, in three columns: Clients, Multi-tenant Hadoop Clusters, Managed Data-loading. Components shown: Client's Machine (HTTP Client), HTTP Server, Launcher/Gateway, HDFS Proxy¹, HTTP Proxy, Custom Proxy, Filers, M/R on YARN, HDFS, DistCp, SQLLDR. Connections are labeled Hadoop RPC, SSH, and HTTPS.]
¹ Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP
Typical Data Out Scenario
§ Data (to be pulled out) is stored in a predefined directory structure as files
§ Client determines (through a custom interface) whether a particular data feed of interest is committed
§ If committed, client first gets the list of files, then pulls them out file by file through HDFS Proxy
[Diagram labels: HDFS, HDFS Proxy, Custom Interface, cURL data copy, Filer, delimited files, Oracle DB, Ext. Table, INSERT, Temp Table, Main Table.]
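The pull sequence on this slide (check commit status, list the files, fetch them one by one) can be sketched as follows. This is a minimal illustration only: `FeedStatus`, `FileLister`, `FileFetcher`, and `FeedPuller` are hypothetical names, not the API of HDFS Proxy or of the custom interface, neither of which the slides show.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the custom interface that reports commit status. */
interface FeedStatus {
    boolean isCommitted(String feed);
}

/** Hypothetical stand-in for listing a feed's files in its predefined directory. */
interface FileLister {
    List<String> listFiles(String feed);
}

/** Hypothetical stand-in for pulling a single file through the proxy. */
interface FileFetcher {
    void fetch(String path);
}

final class FeedPuller {
    private final FeedStatus status;
    private final FileLister lister;
    private final FileFetcher fetcher;

    FeedPuller(FeedStatus status, FileLister lister, FileFetcher fetcher) {
        this.status = status;
        this.lister = lister;
        this.fetcher = fetcher;
    }

    /** Returns the paths that were pulled; empty if the feed is not committed yet. */
    List<String> pull(String feed) {
        List<String> pulled = new ArrayList<>();
        if (!status.isCommitted(feed)) {
            return pulled; // nothing to do; the caller polls again later
        }
        // Get the full file list first, then pull file by file.
        for (String path : lister.listFiles(feed)) {
            fetcher.fetch(path);
            pulled.add(path);
        }
        return pulled;
    }
}
```

The empty return for an uncommitted feed is what forces the client to poll, which is the data discovery cost the deck goes on to list.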
Pros and Cons of the Data Out Approach
Pros
§ Security of DB passwords – password not stored in the grid
§ Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers
§ Encryption – data out of the grids has to be encrypted as it may be cross-colo
§ ACLs – DB hosts are not accessible from grid nodes, hence the need for the proxy
Cons
§ Directory structure – has to be predefined and known to downstream consumers of data
§ Data discovery – availability of data for consumption requires polling or other hooks
§ Overhead – Use of DONE files
§ Maintenance – Separate schema files and schema file formats
The introduction of HCatalog and JMS notifications solves these problems
Hadoop – One Platform, Many Tools
§ MapReduce/Pig: pipelines, iterative processing, research
§ Data warehouse (Hive): BI tools, analysis
[Diagram labels: Metastore, HDFS, Hive (Metastore Client, SerDe, InputFormat/OutputFormat), MapReduce (InputFormat/OutputFormat), Pig (Load/Store), HCatLoader/HCatStorer.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
HCatalog – Opening Up the Hive Metastore
[Diagram labels: Metastore, HDFS, Hive (Metastore Client, SerDe, InputFormat/OutputFormat), MapReduce (HCatInputFormat/HCatOutputFormat), Pig, REST, External System.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
HCatalog Value Proposition
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
§ Centralized metadata service for Hadoop
§ Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows them to share data
§ Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
§ Abstracts out the file storage format and data location
HiveServer2 with HCatalog
[Diagram: components shown are HDFS, HiveServer2 (ODBC/JDBC), an ODBC client, the Data Out Client (JDBC), HCatalog Server (Metastore), Messaging Service (ActiveMQ), HiveServer2 Jobs, Hive Jobs (CLI), and HCat Jobs (Pig, M/R). Connection labels: doAs(user) (twice), JMS notification (Producer), Notification (Consumer).]
Issues Solved
✔ Directory structure – has to be predefined and known to downstream consumers of data
✔ Data discovery – availability of data for consumption requires polling or other hooks
✔ Overhead – Use of DONE files
✔ Maintenance – Separate schema files and schema file formats
DataOut Motivation
§ Many ways to load and manage data on the grid
  § HCatalog/Hive
  § Pig
  § Hadoop MR
  § Sqoop
  § GDM
§ Fewer ways of getting data off the cluster
  § Sqoop
  § HDFSProxy
  § HDFS copy to local file system
  § distcp between clusters
§ Challenges
  § Underlying file format
  § Size of data
  § SLA
DataOut Overview
§ What is DataOut?
  § Efficient method of moving data off the grid
  § API exposes a programmatic interface
§ What are the advantages of DataOut?
  § API based on the well-known JDBC API
  § Works with HCatalog/Hive
  § Agnostic to the underlying storage format
  § Parts of the whole data can be pulled in parallel
§ What are the limitations of DataOut?
  § Queries must be SELECT * FROM type queries
DataOut Deployment
[Diagram: a DataOut Client exchanges Query and Data with a row of HiveServer2 instances (HS2 HS2 … HS2 HS2), which sit on top of HDFS.]
How DataOut Works
[Diagram: a master (M) and HiveServer2, with the steps Execute Query, Prepare Splits, and Fetch Splits. Three HiveSplits are shown, each paired with a slave (S) and an FS/DB target.]
Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
Code to Prepare the HiveSplits
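import java.sql.ResultSet;
import java.sql.Statement;
// DataOut, HiveConnection, HiveSplit: DataOut client classes (package not shown on the slide).

DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);   // sql: the SELECT * FROM query to split
while (rs.next()) {
    // Each result row carries one split in its first column.
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}
/* Synchronize on fetch jobs. */
rs.close();
s.close();
c.close();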
Code to Retrieve the HiveSplits
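import java.sql.PreparedStatement;
import java.sql.ResultSet;
// DataOut, HiveConnection, HiveSplit: DataOut client classes (package not shown on the slide).

DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
// split: the HiveSplit handed to this process by the master.
PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    /* Process row data. */
}
rs.close();
ps.close();
c.close();
/* Communicate with master process. */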
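The two snippets leave "launch job to fetch the split data" and "synchronize on fetch jobs" as comments. One way to fill them in, sketched under assumptions: the fetch jobs run as threads in a single JVM rather than as separate processes, and `SplitFetcher` and `ParallelSplitFetch` are hypothetical names standing in for the per-split fetch code above, not part of the DataOut API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for "fetch one split and process its rows". */
interface SplitFetcher<S> {
    /** Returns the number of rows processed for the given split. */
    long fetch(S split) throws Exception;
}

final class ParallelSplitFetch {
    /** Fetches every split on a fixed-size thread pool and returns the total row count. */
    static <S> long fetchAll(List<S> splits, SplitFetcher<S> fetcher, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Launch one fetch job per split.
            List<Future<Long>> jobs = new ArrayList<>();
            for (S split : splits) {
                Callable<Long> job = () -> fetcher.fetch(split);
                jobs.add(pool.submit(job));
            }
            // Synchronize on the fetch jobs; get() rethrows a job's failure
            // wrapped in an ExecutionException.
            long total = 0;
            for (Future<Long> job : jobs) {
                total += job.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the master would collect the `HiveSplit` objects from the result set into a list and pass a fetcher that opens its own connection per split, as the retrieval snippet does.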
DataOut Demo
HS2 Performance – Single Client Connection
HS2 Performance – Five Concurrent Clients
HS2 Performance Summary
§ Transfer time scales roughly linearly with data size
  § Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
  § Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
§ Throughput is affected by fetch size
  § Sweet spot at ~200 rows
  § Average row size may affect this number (pending further testing)
§ HiveServer2 is capable of handling multiple clients
  § Throughput of 10GB in ~20 minutes with five client connections
  § Drop-off in throughput is expected and reasonable
  § 5x increase in concurrent connections = roughly 2x to 2.4x increase in transfer time
§ Goal of 50GB in 5min
  § Achievable with ~10 HiveServer2 instances streaming data
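The "~10 instances" figure can be checked against the single-client timings on this slide. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes 1 GB = 1024 MB, that each HiveServer2 instance sustains the single-client rate, and that instances add up linearly; `ThroughputEstimate` and its methods are illustrative names.

```java
final class ThroughputEstimate {
    static final long MB_PER_GB = 1024; // assumption: binary gigabytes

    /** Average rate in MB/s for moving `gb` gigabytes in `seconds`. */
    static double mbPerSecond(long gb, long seconds) {
        return (double) (gb * MB_PER_GB) / seconds;
    }

    /**
     * Instances needed to move targetGb within targetSeconds when one instance
     * moves baselineGb in baselineSeconds. Uses integer ceiling division so the
     * result is not disturbed by floating-point rounding.
     */
    static long instancesNeeded(long targetGb, long targetSeconds,
                                long baselineGb, long baselineSeconds) {
        long numerator = targetGb * baselineSeconds;
        long denominator = targetSeconds * baselineGb;
        return (numerator + denominator - 1) / denominator;
    }

    public static void main(String[] args) {
        System.out.printf("1GB in 60s:   %.1f MB/s%n", mbPerSecond(1, 60));
        System.out.printf("10GB in 500s: %.1f MB/s%n", mbPerSecond(10, 500));
        System.out.printf("50GB in 300s: %.1f MB/s%n", mbPerSecond(50, 300));
        System.out.println("instances at the 1GB rate:  " + instancesNeeded(50, 300, 1, 60));
        System.out.println("instances at the 10GB rate: " + instancesNeeded(50, 300, 10, 500));
    }
}
```

A single client runs at roughly 17 to 20 MB/s, while the goal needs about 171 MB/s in aggregate. That works out to 10 instances at the 1GB rate and 9 at the 10GB rate, which is consistent with the slide's estimate.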