H base vs hive srp vs analytics 2-14-2012

Post on 11-Jul-2015


HBase vs. Hive

Philip Wickline, Chief Technology Officer

Hadapt

Goals

Brief introduction to the differences between transactional/operational and analytical systems

Understand when to use Hive and when to use HBase and why

Databases

Datastores

Differences of Purpose : “Transaction Processing”

Operational systems

• Optimized for small, short, random-access reads and writes

• E.g. record that an employee invested $100 in an S&P 500 index fund in his 401(k) *or* record that a user posted something on another user's “wall”

Traditional DB examples

• Oracle

• MySQL

NoSQL Examples

• HBase

• MongoDB

• Cassandra

Differences of Purpose: Analytics

Analytics

• Optimized for read-only computations about large amounts of data

• E.g. compute the average amount invested in bond funds and stock funds for all employees at all employers over the last 5 years

DB Examples

• Netezza

• Vertica

NoSQL Examples

• Hive

• Pig

[Figure: sample analytics charts. Plan vs. Actual by month (Oct to Mar); Option 1 by company (Acme, GM, Newco, Oldco, Bigcorp).]

HBase Data Model : Conceptual

From the BigTable paper:

“a sparse, distributed, persistent multi-dimensional sorted map”

(row : bytestring, column family : bytestring, column : bytestring, time : int64) -> byte string

HBase Map

{ "key_1" : {
    "columnfamily_a" : {
      "column_i" : {
        15 : "y",
        4 : "m"
      },
      "column_ii" : {
        15 : "d"
      }},
    "columnfamily_b" : {
      "column_other" : {
        6 : "w",
        3 : "o",
        1 : "w"
      }}}}
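
The nested map above can be modelled with plain Python dictionaries to show how a lookup by row, column family, column, and time might resolve to a value. This is only a conceptual sketch: `get_cell` is a hypothetical helper, not part of any HBase client API, and the rule "return the newest version at or before the requested timestamp" is an assumption made for illustration.

```python
# Conceptual sketch only: plain dicts stand in for HBase's sorted map.
# Layout: row key -> column family -> column -> {timestamp: value}
table = {
    "key_1": {
        "columnfamily_a": {
            "column_i": {15: "y", 4: "m"},
            "column_ii": {15: "d"},
        },
        "columnfamily_b": {
            "column_other": {6: "w", 3: "o", 1: "w"},
        },
    },
}


def get_cell(table, row, family, column, timestamp=None):
    """Hypothetical helper: newest value at or before `timestamp`.

    With timestamp=None the newest version overall is returned.
    Returns None when no version is old enough.
    """
    versions = table[row][family][column]
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    if not candidates:
        return None
    return versions[max(candidates)]


print(get_cell(table, "key_1", "columnfamily_a", "column_i"))         # y
print(get_cell(table, "key_1", "columnfamily_a", "column_i", 10))     # m
print(get_cell(table, "key_1", "columnfamily_b", "column_other", 5))  # o
```

Each lookup touches a single row key and one column, which is the small random-access pattern that operational systems are optimized for.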

Hive Data Model : Conceptual

Traditional Relational Tables

CUSTKEY  NAME     ADDRESS       NATIONKEY  PHONE         ACCTBAL     COMMENT
451234   NEWCORP  196 Broadway  1          111-555-1212  $1,231,285  NULL
887765   ACME     1 Main st.    2          222-555-1212  $46,945     “Top customer”

HBase Data Model : Physical

Every cell stored with row, family, column and timestamp

Allows fast lookup with low copy overhead

BUT

Space inefficient (optional compression available) and inefficient to scan

“key_1” “cf_a” “c_i” 15 “foo”

“key_1” “cf_a” “c_ii” 15 “bar”

“key_2” “cf_a” “c_ii” 4 “baz”
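
To make the space cost concrete, the following sketch flattens a nested map into one record per cell, as in the three cells listed above, and compares the bytes spent on coordinates with the bytes spent on values. It is a simplified illustration: `flatten` is a hypothetical helper, the timestamp is counted as 8 bytes (int64), and the extra length fields of a real on-disk format are ignored.

```python
def flatten(table):
    """Turn row -> family -> column -> {timestamp: value} into a list of cells.

    Every cell repeats its full coordinates: (row, family, column, timestamp, value).
    """
    cells = []
    for row, families in table.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for ts, value in versions.items():
                    cells.append((row, family, column, ts, value))
    # Sort by row, family, column ascending, with the newest timestamp first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells


table = {
    "key_1": {"cf_a": {"c_i": {15: "foo"}, "c_ii": {15: "bar"}}},
    "key_2": {"cf_a": {"c_ii": {4: "baz"}}},
}

cells = flatten(table)
for cell in cells:
    print(cell)

TIMESTAMP_BYTES = 8  # int64, as in the conceptual model

# Bytes spent on coordinates versus bytes spent on the values themselves.
key_bytes = sum(len(r) + len(f) + len(c) + TIMESTAMP_BYTES for r, f, c, _, _ in cells)
value_bytes = sum(len(v) for *_, v in cells)
print(key_bytes, value_bytes)  # 62 9
```

In this toy example the coordinates take 62 bytes to carry 9 bytes of values, which illustrates why the layout is space inefficient without compression and why scanning many cells is costly.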

Hive Data Model : Physical

Depends on the underlying storage files

Can use flat text files, RCFiles, or even HBase for storage

Standard Row Storage

C_1 C_2 C_3 C_4

11 12 13 14

21 22 23 24

31 32 33 34

41 42 43 44

51 52 53 54

Hive Data Model : RCFile

Break into row groups, and then store as columns

Row Group 1

C_1 11 21 31

C_2 12 22 32

C_3 13 23 33

C_4 14 24 34

Row Group 2

C_1 41 51

C_2 42 52

C_3 43 53

C_4 44 54
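
The row-group layout above can be reproduced with a short sketch. `to_row_groups` is a hypothetical helper, not Hive code, and the group size of 3 rows is chosen only to match the slide.

```python
columns = ["C_1", "C_2", "C_3", "C_4"]
rows = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]


def to_row_groups(rows, columns, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append(
            {name: [row[i] for row in chunk] for i, name in enumerate(columns)}
        )
    return groups


groups = to_row_groups(rows, columns, group_size=3)
for number, group in enumerate(groups, start=1):
    print("Row Group", number)
    for name, values in group.items():
        print(name, *values)

# An aggregate over one column only needs that column's values in each group.
total_c2 = sum(sum(group["C_2"]) for group in groups)
print(total_c2)  # 160
```

Because each group keeps a column's values together, a query that reads only C_2 can skip the other columns, which suits the scan-heavy analytic workloads described earlier.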

Informal Performance Comparison

                        Hive                             HBase
Insert speed            Batch                            Fast!
Update speed            N/A                              Fast!
Lookup speed            MR lower bound (10s of seconds)  Fast!
Data warehouse queries  15x faster on one test           Uh oh

THANK YOU