Hive Evolution
Transcript of Hive Evolution
A Progress Report
November 2010
John Sichi (Facebook)
Agenda
• Hive Overview
• Version 0.6 (just released!)
• Version 0.7 (under development)
• Hive is now a TLP!
• Roadmaps
What is Hive?
• A Hadoop-based system for querying and managing structured data
  – Uses Map/Reduce for execution
  – Uses the Hadoop Distributed File System (HDFS) for storage
Hive Origins
• Data explosion at Facebook
• Traditional DBMS technology could not keep up with the growth
• Hadoop to the rescue!
• Incubation with ASF, then became a Hadoop sub-project
• Now a top-level ASF project
Hive Evolution
• Originally:
  – A way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
• Now, more and more:
  – A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
Intended Usage
• Web-scale Big Data
  – 100s of terabytes
• Large Hadoop cluster
  – 100s of nodes (heterogeneous OK)
• Data has a schema
• Batch jobs
  – For both loads and queries
So Don’t Use Hive If…
• Your data is measured in GB
• You don’t want to impose a schema
• You need responses in seconds
• A “conventional” analytic DBMS can already do the job
  – (and you can afford it)
• You don’t have a lot of time and smart people
Scaling Up
• Facebook warehouse, July 2010:
  – 2250 nodes
  – 36 petabytes of disk space
• Data access per day:
  – 80–90 terabytes added (uncompressed)
  – 25,000 map/reduce jobs
• 300–400 users/month
Facebook Deployment
[Diagram: web servers log via a Scribe mid-tier into Scribe-Hadoop clusters; together with sharded MySQL, data feeds the production Hive-Hadoop cluster, which Hive replication copies to an ad hoc Hive-Hadoop cluster]
Hive Architecture
[Diagram: the CLI, web management console, and JDBC/ODBC clients reach the query engine through the Hive Thrift API; the metastore is accessed via the Metastore Thrift API; execution runs on Hadoop Map/Reduce + HDFS clusters]
Physical Data Model
[Diagram: a table such as clicks is divided into partitions (possibly multi-level), e.g. ds='2010-10-28', ds='2010-10-29', ds='2010-10-30'; each partition is stored as HDFS files (possibly as hash buckets)]
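As a concrete sketch of this layout (table, column names, and bucket count are illustrative, not from the talk), such a table could be declared as:

```sql
-- Illustrative DDL: daily partitions plus hash buckets on userid
CREATE TABLE clicks (
  userid BIGINT,
  url    STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```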
Map/Reduce Plans
[Diagram: input files are carved into splits, each consumed by a map task; map output is shuffled to reduce tasks, which write the result files]
Query Translation Example
• SELECT url, count(*) FROM page_views GROUP BY url
• Map tasks compute partial counts for each URL in a hash table
  – “Map side” preaggregation
  – Map outputs are partitioned by URL and shipped to the corresponding reducers
• Reduce tasks tally up the partial counts to produce the final results
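The generated map/reduce plan for a statement like this can be inspected with Hive’s EXPLAIN command (plan output omitted here):

```sql
EXPLAIN
SELECT url, COUNT(*) FROM page_views GROUP BY url;
```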
It Gets Quite Complicated!
Behavior Extensibility
• TRANSFORM scripts (any language)
  – Serialization + IPC overhead
• User-defined functions (Java)
  – In-process, lazy object evaluation
• Pre/Post Hooks (Java)
  – Statement validation/execution
  – Example uses: auditing, replication, authorization
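A minimal TRANSFORM sketch (the script name and output columns are hypothetical):

```sql
-- Stream rows through an external script in any language;
-- each row is serialized to the script's stdin and parsed back from stdout
ADD FILE normalize_urls.py;
SELECT TRANSFORM (userid, page_url)
       USING 'python normalize_urls.py'
       AS (userid, normalized_url)
FROM page_views;
```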
UDF vs UDAF vs UDTF
• User Defined Function
  – One-to-one row mapping
  – concat('foo', 'bar')
• User Defined Aggregate Function
  – Many-to-one row mapping
  – sum(num_ads)
• User Defined Table Function
  – One-to-many row mapping
  – explode(array(1, 2, 3))
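The three flavors side by side, as a hedged sketch (table names are hypothetical; `onerow` stands for a single-row scratch table, since early Hive required a FROM clause):

```sql
SELECT concat('foo', 'bar') FROM onerow;          -- UDF: one row in, one row out
SELECT sum(num_ads) FROM ad_stats;                -- UDAF: many rows collapse to one
SELECT explode(array(1, 2, 3)) AS n FROM onerow;  -- UDTF: one row expands to many
```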
Storage Extensibility
• Input/OutputFormat: file formats
  – SequenceFile, RCFile, TextFile, …
• SerDe: row formats
  – Thrift, JSON, ProtocolBuffer, …
• Storage Handlers (new in 0.6)
  – Integrate foreign metadata, e.g. HBase
• Indexing
  – Under development in 0.7
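Row and file formats are chosen per table at DDL time; a sketch (table name is illustrative, and the SerDe class shown is Hive’s built-in default):

```sql
CREATE TABLE events (
  ts      STRING,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS SEQUENCEFILE;
```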
Release 0.6
• October 2010
  – Views
  – Multiple Databases
  – Dynamic Partitioning
  – Automatic Merge
  – New Join Strategies
  – Storage Handlers
Views: Syntax
CREATE VIEW [IF NOT EXISTS] view_name
  [ (column_name [COMMENT column_comment], … ) ]
  [COMMENT 'view_comment']
AS SELECT …
  [ ORDER BY … LIMIT … ]
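A concrete instance of this syntax (the view and underlying table are illustrative):

```sql
CREATE VIEW IF NOT EXISTS daily_url_counts
  COMMENT 'Clicks per URL per day'
AS SELECT ds, url, COUNT(*) AS cnt
   FROM clicks
   GROUP BY ds, url;
```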
Views: Usage
• Use cases
  – Column/table renaming
  – Encapsulating complex query logic
  – Security (future)
• Limitations
  – Read-only
  – Obscures partition metadata from underlying tables
  – No dependency management
Multiple Databases
• Follows MySQL convention
  – CREATE DATABASE [IF NOT EXISTS] db_name [COMMENT 'db_comment']
  – USE db_name
• Logical namespace for tables
• The ‘default’ database is still there
• Does not yet support queries across multiple databases
Dynamic Partitions: Syntax
• Example:
INSERT OVERWRITE TABLE page_view
  PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
       null, null, pvs.ip, pvs.country
FROM page_view_stg pvs
Dynamic Partitions: Usage
• Automatically create partitions based on distinct values in columns
• Works as rudimentary indexing
  – Prune partitions via the WHERE clause
• But be careful…
  – Don’t create too many partitions!
  – Configuration parameters can be used to prevent accidents
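The guard rails mentioned above are ordinary Hive settings; a hedged sketch (the values shown are illustrative, not recommendations):

```sql
-- Enable dynamic partitioning, but cap how many partitions one job may create
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
```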
Automatic merge
• Jobs can produce many files
• Why is this bad?
  – Namenode pressure
  – Downstream jobs have to deal with file-processing overhead
• So, clean up by merging results into a few large files (configurable)
  – Use a conditional map-only task to do this
Join Strategies Before 0.6
• Map/reduce join
  – Map tasks partition inputs on join keys and ship them to corresponding reducers
  – Reduce tasks perform a sort-merge join
• Map-join
  – Each mapper builds a lookup hashtable from a copy of the small table
  – Then hash-joins the splits of the big table
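A map-join can be requested explicitly with a query hint (table and column names are hypothetical):

```sql
-- Build the in-memory hash table from the small dimension table d
SELECT /*+ MAPJOIN(d) */ p.page_url, d.category
FROM page_views p
JOIN url_dim d ON (p.page_url = d.url);
```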
New Join Strategies
• Bucketed map-join
  – Each mapper filters its lookup table by the bucketing hash function
  – Allows the “small” table to be much bigger
• Sorted merge in map-join
  – Requires presorted input tables
• Deal with skew in map/reduce join
  – Conditional plan step for skew keys (after the main map/reduce join step)
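These strategies are toggled through configuration; a sketch (both tables must be bucketed, and for the second setting also sorted, on the join key for these to take effect):

```sql
set hive.optimize.bucketmapjoin = true;             -- bucketed map-join
set hive.optimize.bucketmapjoin.sortedmerge = true; -- sorted-merge variant
```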
Storage Handlers
[Diagram: Hive reads native tables directly from HDFS; through the storage handler interface, dedicated handlers (HBase handler, Cassandra handler, Hypertable handler) talk to the HBase, Cassandra, and Hypertable APIs and their tables]
Low Latency Warehouse
[Diagram: HBase tables receive continuous updates while other files/tables receive periodic loads; Hive queries run over both]
Storage Handler Syntax
• HBase example:
CREATE TABLE users(
  userid int, name string,
  email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
Release 0.7
• In development
  – Concurrency Control
  – Stats Collection
  – Stats Functions
  – Indexes
  – Local Mode
  – Faster map join
  – Multiple DISTINCT aggregates
  – Archiving
  – JDBC/ODBC improvements
Concurrency Control
• Pluggable distributed lock manager
  – Default is ZooKeeper-based
• Simple read/write locking
• Table-level and partition-level
• Implicit locking (statement level)
  – Deadlock-free via lock ordering
• Explicit LOCK TABLE (global)
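Explicit locking as it appears in HiveQL (a sketch of the 0.7 syntax; the table name is illustrative):

```sql
LOCK TABLE page_views EXCLUSIVE;
-- … perform the load …
UNLOCK TABLE page_views;
```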
Statistics Collection
• Implicit metastore update during load
  – Or explicit via ANALYZE TABLE
• Table/partition-level:
  – Number of rows
  – Number of files
  – Size in bytes
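The explicit path looks like this (table and partition spec are illustrative):

```sql
ANALYZE TABLE page_views PARTITION (ds='2010-10-30')
COMPUTE STATISTICS;
```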
Stats-driven Optimization
• Automatic map-side join
• Automatic map-side aggregation
• Need column-level stats for better estimates
  – Filter/join selectivity
  – Distinct value counts
  – Column correlation
Statistical Functions
• Stats 101
  – stddev, var, covar
  – percentile_approx
• Data mining
  – ngrams, sentences (text analysis)
  – histogram_numeric
• Example (histogram_numeric also takes a number of bins):
SELECT relationshipstatus, histogram_numeric(dob_year, 30)
FROM users GROUP BY relationshipstatus
Histogram query results
• “It’s complicated” peaks at 18–19, but lasts into the late 40s!
• “In a relationship” peaks at 20
• “Engaged” peaks at 25
• “Married” peaks in the early 30s
• More married than single at 28
• Only teenagers use “widowed”?
Pluggable Indexing
• Reference implementation
  – Index is stored in a normal Hive table
  – Compact: distinct block addresses
  – Partition-level rebuild
• Currently in R&D
  – Automatic use for WHERE, GROUP BY
  – New index types (e.g. bitmap, HBase)
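A sketch of the reference compact-index DDL as it is shaping up for 0.7 (the index name is hypothetical, and the syntax may change before release):

```sql
CREATE INDEX pv_url_idx ON TABLE page_views (url)
AS 'COMPACT' WITH DEFERRED REBUILD;

-- Rebuild one partition's slice of the index
ALTER INDEX pv_url_idx ON page_views
PARTITION (ds='2010-10-30') REBUILD;
```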
Local Mode Execution
• Avoids map/reduce cluster job latency
• Good for jobs which process small amounts of data
• Let Hive decide when to use it
  – set hive.exec.mode.local.auto=true;
• Or force its usage
  – set mapred.job.tracker=local;
Faster map join
• Make sure the small table can fit in memory
  – If it can’t, fall back to reduce join
• Optimize hash table data structures
• Use the distributed cache to push out the pre-filtered lookup table
  – Avoids swamping HDFS with reads from thousands of mappers
Multiple DISTINCT Aggs
• Example:
SELECT
  view_date,
  COUNT(DISTINCT userid),
  COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
Archiving
• Use HAR (Hadoop archive format) to combine many files into a few
• Relieves namenode memory
• Archived partition becomes read-only
• Syntax:
ALTER TABLE page_views
  {ARCHIVE|UNARCHIVE}
  PARTITION (ds='2010-10-30')
JDBC/ODBC Improvements
• JDBC: basic metadata calls
  – Good enough for use with UIs such as SQuirreL
• JDBC: some PreparedStatement support
  – Pentaho Data Integration
• ODBC: new driver under development (based on SQLite)
Hive is now a TLP
• PMC
  – Namit Jain (chair)
  – John Sichi
  – Zheng Shao
  – Edward Capriolo
  – Raghotham Murthy
  – Ning Zhang
  – Paul Yang
  – He Yongqiang
  – Prasad Chakka
  – Joydeep Sen Sarma
  – Ashish Thusoo
• Welcome to new committer Carl Steinbach!
Developer Diversity
• Recent contributors
  – Facebook, Yahoo, Cloudera
  – Netflix, Amazon, Media6Degrees, Intuit
  – Numerous research projects
  – Many, many more…
• Monthly San Francisco bay area contributor meetups
• East coast meetups?
Roadmap: Security
• Authentication
  – Upgrading to SASL-enabled Thrift
• Authorization
  – HDFS-level
    • Very limited (no ACLs)
    • Can’t support all Hive features (e.g. views)
  – Hive-level (GRANT/REVOKE)
    • Hive server deployment needed for full effectiveness
Roadmap: Hadoop API
• Dropping pre-0.20 support starting with Hive 0.7
  – But Hive is still using the old mapred.* API
• Moving to mapreduce.* will be required in order to support newer Hadoop versions
  – Need to resolve some complications with 0.7’s indexing feature
Roadmap: Howl
• Reuse the metastore across Hadoop tools
[Diagram: Howl sits as a shared metadata layer between HDFS and the tools above it: Hive, Pig, Oozie, Flume]
Roadmap: Heavy-Duty Tests
• Unit tests are insufficient
• What is needed:
  – Real-world schemas/queries
  – Non-toy data scales
  – Scripted setup; configuration matrix
  – Correctness/performance verification
  – Automatic reports: throughput, latency, profiles, coverage, perf counters…
Roadmap: Shared Test Site
• Nightly runs, regression alerting
• Performance trending
• Synthetic workload (e.g. TPC-H)
• Real-world workload (anonymized?)
• This is critical for
  – Non-subjective commit criteria
  – Release quality