Introduction to Hive Liyin Tang liyintan@usc.edu.


Outline

Motivation
Overview
Data Model / Metadata
Architecture
Performance
Cons and Pros
Application
Related Work

Introduction to Hive, 04/18/23

Motivation

[Figure: data pipeline diagram with the components Web Servers, Scribe Writers, Scribe MidTier, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, and MySQL]

http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

Motivation

Limitations of MapReduce
  Everything must be expressed in the Map/Reduce model
  Not reusable
  Error prone
  For complex jobs:
    Multiple stages of Map/Reduce functions
    Like asking a developer to hand-write the physical execution plan in a database

Overview

Intuitive
  Make unstructured data look like tables, regardless of how it is really laid out
  SQL-based queries can be run directly against these tables
  Generate a specific execution plan for each query
What's Hive
  A data warehousing system to store structured data on the Hadoop file system
  Provides an easy way to query this data by executing Hadoop MapReduce plans

Data Model

Tables
  Basic column types (int, float, boolean)
  Complex types: List / Map (associative array)
Partitions
Buckets

CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
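The bucketing and sampling statements on this slide can be pictured with a minimal Python sketch. It assumes rows are assigned to buckets by hashing the clustering column modulo the bucket count, and it uses the integer id itself as the hash value. The helper names (bucket_of, build_buckets, table_sample) and the sample rows are hypothetical and only illustrate the idea; they are not Hive APIs.

```python
from collections import defaultdict

NUM_BUCKETS = 32  # matches CLUSTERED BY (id) INTO 32 BUCKETS


def bucket_of(row_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return a 1-based bucket number (assumption: the hash of an id is the id)."""
    return row_id % num_buckets + 1


def build_buckets(rows):
    """Group rows by bucket, standing in for one stored file per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["id"])].append(row)
    return buckets


def table_sample(buckets, bucket: int):
    """Read a single bucket only, in the spirit of TABLESAMPLE (BUCKET x OUT OF 32)."""
    return [row["id"] for row in buckets.get(bucket, [])]


# Invented sample data: 100 rows with ids 0..99 and empty item lists.
rows = [{"id": i, "items": []} for i in range(100)]
buckets = build_buckets(rows)
sample_ids = table_sample(buckets, 1)
print(sample_ids)
```

Because the sample touches one bucket instead of the whole table, it reads roughly 1/32 of the data under these assumptions.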

Metadata

Database namespace
Table definitions
  Schema info, physical location in HDFS
Partition data
ORM framework
  All metadata is stored in Derby by default
  Any database with a JDBC driver can be configured instead

Architecture

[Figure: Hive architecture diagram. Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL). Hive QL: Parser, Planner, Optimizer, Execution. Data formats: CSV, Thrift, Regex. UDF/UDAF: substr, sum, average. FileFormat: TextFile, SequenceFile, RCFile. User-defined Map-reduce Scripts. All running on Map Reduce.]

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Performance

GROUP BY operation
  Efficient execution plans based on:
    Data skew: how evenly the data is distributed across the physical nodes (bottleneck vs. load balance)
    Partial aggregation: group rows with the same GROUP BY value as early as possible
      In-memory hash table in the mapper
      Runs earlier than the combiner
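A minimal Python sketch of map-side partial aggregation, assuming a COUNT per group. Each mapper keeps an in-memory hash table and emits one partial count per key rather than one record per row; the reducer then merges the partial counts. The function names and the sample splits are hypothetical and do not reflect Hive's internal implementation.

```python
from collections import Counter, defaultdict


def map_with_partial_aggregation(rows, key):
    """Mapper: count rows per key in an in-memory hash table before emitting."""
    partial = defaultdict(int)
    for row in rows:
        partial[row[key]] += 1
    return list(partial.items())


def reduce_counts(mapper_outputs):
    """Reducer: merge the partial counts produced by all mappers."""
    totals = Counter()
    for output in mapper_outputs:
        for group_key, count in output:
            totals[group_key] += count
    return dict(totals)


# Invented input: two splits, one per mapper.
split_1 = [{"word": "a"}, {"word": "b"}, {"word": "a"}]
split_2 = [{"word": "a"}, {"word": "c"}]
outputs = [map_with_partial_aggregation(s, "word") for s in (split_1, split_2)]
totals = reduce_counts(outputs)
print(outputs)
print(totals)
```

In this example the mappers emit four records for five input rows; the saving grows with the number of repeated keys per split.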

Performance

JOIN operation
  Traditional Map-Reduce join
  Early map-side join
    Very efficient for joining a small table with a large table
    Keep the smaller table's data in memory first
    Join with one chunk of the larger table's data at a time
    Trades space complexity for time complexity
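The map-side join described here can be sketched in Python as a hash join: the small table is loaded into an in-memory hash table once, and the large table is streamed through it chunk by chunk. This is a simplified illustration under the assumption that the small table fits in memory; map_side_join and the sample tables are hypothetical names and data.

```python
def map_side_join(small_table, large_chunks, key):
    """Join a small in-memory table against a large table read chunk by chunk."""
    # Build a hash table over the small table once; it must fit in memory.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large table; each chunk is joined directly in the mapper,
    # so this sketch has no shuffle or reduce step.
    for chunk in large_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}


# Invented sample data: a small users table and a large clicks table in two chunks.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
clicks = [
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}, {"uid": 1, "url": "/d"}],
]
joined = list(map_side_join(users, clicks, "uid"))
print(joined)
```

The memory spent on the lookup table is what buys the time saving: no rows of the large table need to be sorted or moved between nodes.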

Performance

Ser/De
  Describes how to load data from a file into a representation that makes it look like a table
Lazy load
  Create a field object only when necessary
  Reduces the overhead of creating unnecessary objects in Hive
  Creating objects is expensive in Java
  Increases performance
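Lazy loading can be illustrated with a small Python sketch: a row keeps its raw text and converts a field to a typed object only the first time that field is read. LazyRow and its objects_created counter are hypothetical and exist only for illustration. For simplicity the sketch splits the line eagerly and defers only the typed conversion, which is a weaker form of laziness than a real SerDe could apply.

```python
class LazyRow:
    """Hold the raw fields and convert a field only the first time it is read."""

    def __init__(self, raw_line, column_types, delimiter="\t"):
        self._raw_fields = raw_line.split(delimiter)
        self._types = column_types
        self._cache = {}
        self.objects_created = 0  # counts typed conversions actually performed

    def get(self, index):
        """Return the typed value of one column, converting it on first access."""
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw_fields[index])
            self.objects_created += 1
        return self._cache[index]


# Invented sample row with three columns: int, string, float.
row = LazyRow("42\thello\t3.5", [int, str, float])
first = row.get(0)
again = row.get(0)  # served from the cache, no new conversion
print(first, again, row.objects_created)
```

A query such as SELECT count(1) never calls get at all, so no field objects are created for it in this model.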

Hive – Performance

QueryA: SELECT count(1) FROM t;
QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
QueryC: SELECT * FROM t;
Map-side time only (incl. GzipCodec for compression/decompression)

Date       SVN Revision  Major Changes                Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec

* These two features need to be tested with other queries.

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Pros
  An easy way to process large-scale data
  Supports SQL-based queries
  Provides more user-defined interfaces to extend programmability
  Efficient execution plans for performance
  Interoperability with other database tools
Cons
  No easy way to append data
  Files in HDFS are immutable
Future work
  Views / Variables
  More operators
    IN / EXISTS semantics
  More future work is listed on the mailing list

Application

Log processing
  Daily reports
  User activity measurement
Data/Text mining
  Machine learning (training data)
Business intelligence
  Advertising delivery
  Spam detection

Related Work

Parallel databases: Gamma, Bubba, Volcano
Google: Sawzall
Yahoo: Pig
IBM: JAQL
Microsoft: DryadLINQ, SCOPE

Reference

[1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
[2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
[3] Cloudera: http://www.cloudera.com/videos/introduction_to_hive
[4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation

Q & A
Thank you

Back up

Hive Components

Shell interface: like the MySQL shell
Driver: session handles, fetch, execution
Compiler: parse, plan, optimize
Execution engine: DAG of stages; runs map or reduce

Motivation

MapReduce motivation
  Data processing: > 1 TB
  Massively parallel
  Locality
  Fault tolerant

Hive Usage

hive> show tables;

hive> create table shakespeare (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;

hive> load data inpath "shakespeare_freq" into table shakespeare;


hive> select * from shakespeare where freq>100 sort by freq asc limit 10;
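What the query above computes can be sketched in plain Python, assuming the tab-delimited (freq, word) layout from the table definition. In Hive, SORT BY orders rows within each reducer, so this sketch corresponds to the single-reducer case. The helper names and the sample lines are invented for illustration.

```python
def load_rows(lines):
    """Parse tab-delimited lines into (freq, word) tuples."""
    rows = []
    for line in lines:
        freq, word = line.rstrip("\n").split("\t")
        rows.append((int(freq), word))
    return rows


def frequent_words(rows, min_freq=100, limit=10):
    """Keep rows with freq > min_freq, sort ascending by freq, return the first `limit`."""
    kept = [r for r in rows if r[0] > min_freq]
    return sorted(kept, key=lambda r: r[0])[:limit]


# Invented sample data standing in for the shakespeare_freq file.
sample = ["250\tthe\n", "90\thive\n", "101\tking\n", "100\tlord\n"]
result = frequent_words(load_rows(sample))
print(result)
```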


Hive Usage @ Facebook

Statistics per day:
  4 TB of compressed new data added per day
  135 TB of compressed data scanned per day
  7500+ Hive jobs per day
Hive simplifies Hadoop:
  ~200 people/month run jobs on Hadoop/Hive
  Analysts (non-engineers) use Hadoop through Hive
  95% of jobs are Hive jobs

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
