Introduction to HBase

Introduction to HBase Byeongweon Moon / REDDUCK [email protected]


Page 1: Introduction to HBase

Introduction to HBase
Byeongweon Moon / [email protected]

Page 2: Introduction to HBase

HBase Key Point

- Clustered, commodity(-ish) hardware
- Mostly schema-less
- Dynamic distribution
- Spread writes out over the cluster

Page 4: Introduction to HBase

HBase (cont.)

- Column-oriented store
  - Wide table costs only the data stored
  - NULLs in row are ‘free’
  - Good compression: columns of similar type
  - Column name is arbitrary
- Rows stored in sorted order
- Can random read and write
- Goal of billions of rows X millions of cells
- Petabytes of data across thousands of servers
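
As a rough illustration of why NULLs are "free" in a wide table, the Python sketch below stores only the cells that were actually written. It is a toy model, not HBase code: the names `table`, `put`, and `stored_cells` and the sample rows are all assumptions made for this example.

```python
# Illustrative model only: a sparse table keeps just the cells that were written.
table = {}  # row key -> {column name -> value}

def put(row, column, value):
    """Store one cell; columns that are never written take no space."""
    table.setdefault(row, {})[column] = value

def stored_cells():
    """Count the cells that are physically present."""
    return sum(len(columns) for columns in table.values())

put("user1", "info:name", "Kim")
put("user1", "info:email", "kim@example.com")
put("user2", "info:name", "Lee")      # user2 has no email: nothing is stored for it
put("user3", "stats:logins", "42")    # column names are arbitrary per row

# What a fixed-width layout would reserve: every row times every known column.
all_columns = {c for columns in table.values() for c in columns}
dense_cells = len(table) * len(all_columns)

print(stored_cells(), "cells stored vs", dense_cells, "in a dense layout")
```

With three rows and three distinct column names, the sparse layout holds four cells where a dense layout would reserve nine.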

Page 5: Introduction to HBase

Column Oriented Storage

Page 6: Introduction to HBase

!HBase

- “NoSQL” Database
  - No joins
  - No sophisticated query engine
  - No transactions (sort of)
  - No column typing
  - No SQL, no ODBC/JDBC, etc.
- Not a replacement for RDBMS
- Matching Impedance

Page 7: Introduction to HBase

Why HBase?

- Datasets are reaching petabytes
- Traditional databases are expensive to scale and difficult to distribute
- Commodity hardware is cheap and powerful
- Need for both random access and batch processing (Hadoop alone offers only batch processing)

Page 8: Introduction to HBase

Tables

- Table is split into roughly equal sized “regions”
- Each region is a contiguous range of keys
- Regions split as they grow, thus dynamically adjusting to your data set
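
The region idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase's implementation: the class `Table`, its methods, and the row-count threshold `max_rows_per_region` are hypothetical (a real region would split on data size, not on a row count).

```python
import bisect

class Table:
    """Toy table: sorted row keys partitioned into contiguous key ranges."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region  # assumed split threshold
        self.start_keys = [""]               # start key of each region, kept sorted
        self.regions = [[]]                  # sorted row keys held by each region

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        rows = self.regions[i]
        if row_key not in rows:
            bisect.insort(rows, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        rows = self.regions[i]
        mid = len(rows) // 2
        # The upper half becomes a new region that starts at the middle key.
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])

t = Table()
for n in range(10):
    t.put(f"row{n:02d}")

print(t.start_keys)
print(t.region_index("row05"))
```

Because every region covers one contiguous key range, finding the region for a row is a binary search over the region start keys.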

Page 9: Introduction to HBase

Table (cont.)

- Tables are sorted by Row
- Table schema defines column families
  - Families consist of any number of columns
  - Columns consist of any number of versions
  - Everything except table name is byte[]
- (Table, Row, Family:Column, Timestamp) -> Value

Page 10: Introduction to HBase

Table (cont.)

As a data structure:

SortedMap(RowKey, List(
    SortedMap(Column, List(
        Value, Timestamp)
    ))
)
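
The nested-map view above can be made concrete with a small Python model. It is only a sketch of the data model, not the HBase client API: `ToyTable`, `put`, `get`, and `scan` are names invented for this example, and reading "the newest version at or before a timestamp" is a simplifying assumption.

```python
import time

class ToyTable:
    """Toy model of (Row, Family:Column, Timestamp) -> Value using byte strings."""

    def __init__(self):
        # row -> {b"family:qualifier" -> [(timestamp, value), ...] newest first}
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # keep newest first

    def get(self, row, column, timestamp=None):
        # Return the newest version, or the newest one at or before `timestamp`.
        for ts, value in self.rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self):
        # Rows come back in sorted key order, each with its latest cell values.
        for row in sorted(self.rows):
            yield row, {c: v[0][1] for c, v in sorted(self.rows[row].items())}

t = ToyTable()
t.put(b"row2", b"cf:a", b"old", timestamp=1)
t.put(b"row2", b"cf:a", b"new", timestamp=2)
t.put(b"row1", b"cf:b", b"x", timestamp=5)

print(t.get(b"row2", b"cf:a"))
print(t.get(b"row2", b"cf:a", timestamp=1))
print([row for row, _ in t.scan()])
```

Keys and values are plain byte strings here to mirror the "everything is byte[]" point from the previous slide.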

Page 11: Introduction to HBase

HBase Open Source Stack

- ZooKeeper: Small Data Coordination Service
- HBase: Database Storage Engine
- HDFS: Distributed File System
- Hadoop: Asynchronous Map-Reduce Jobs

Page 12: Introduction to HBase

Server Architecture

- Similar to HDFS
  - Master == Namenode
  - Regionserver == Datanode
  - Often run these alongside each other!
- Difference: HBase stores state in HDFS
  - HDFS provides robust data storage across machines, insulating against failure
  - Master and Regionserver fairly stateless and machine independent

Page 13: Introduction to HBase

Region Assignment

- Each region from every table is assigned to a Regionserver
- Master duties:
  - Responsible for assignment and handling regionserver problems (if any!)
  - When machines fail, move regions
  - When regions split, move regions to balance
  - Could move regions to respond to load
  - Can run multiple backup masters
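
The assignment duties listed above might be sketched as follows. This is a hypothetical least-loaded placement policy written for illustration, not the balancer HBase actually ships: the function `assign` and the region and server names are assumptions.

```python
def assign(regions, servers, current=None):
    """Return {region: server}. Assignments on live servers are kept; every
    other region goes to the live server that holds the fewest regions."""
    current = current or {}
    assignment = {r: s for r, s in current.items() if s in servers and r in regions}
    load = {s: 0 for s in servers}
    for s in assignment.values():
        load[s] += 1
    for region in sorted(regions):
        if region not in assignment:
            # Ties are broken by server name so the result is deterministic.
            target = min(servers, key=lambda s: (load[s], s))
            assignment[region] = target
            load[target] += 1
    return assignment

regions = ["r1", "r2", "r3", "r4", "r5", "r6"]
before = assign(regions, ["rs1", "rs2", "rs3"])
after = assign(regions, ["rs1", "rs2"], current=before)  # rs3 has failed

print(before)
print(after)
```

Only the regions that lived on the failed server move; the rest stay where they were, which keeps reassignment cheap.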

Page 14: Introduction to HBase

Master

- The master does NOT:
  - Handle any write request (not a DB master!)
  - Handle location finding requests
- Not involved in the read/write path
- Generally does very little most of the time

Page 15: Introduction to HBase

Distributed Coordination

- ZooKeeper is used to manage master election and server availability
- Set up as a cluster, provides distributed coordination primitives
- An excellent tool for building cluster management systems
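
One common way to build master election on a ZooKeeper-style service is for every candidate to try to create the same ephemeral node: the first to succeed is the active master, and the node disappears when that master's session ends. The Python sketch below simulates this in memory. It does not use the ZooKeeper API; `ToyCoordinator`, `elect`, and the path `/master` are assumptions made for this example.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        self.nodes = {}  # path -> owning session

    def create_ephemeral(self, path, session):
        # Succeeds only for the first caller, like an atomic create-if-absent.
        if path in self.nodes:
            return False
        self.nodes[path] = session
        return True

    def expire_session(self, session):
        # Ephemeral nodes vanish when their owner's session ends.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}

def elect(coordinator, candidates, path="/master"):
    """Every candidate races to create the node; the holder is the active master."""
    for candidate in candidates:
        coordinator.create_ephemeral(path, candidate)
    return coordinator.nodes.get(path)

zk = ToyCoordinator()
first = elect(zk, ["master-a", "master-b"])
zk.expire_session("master-a")        # the active master dies
second = elect(zk, ["master-b"])     # a backup master takes over

print(first, second)
```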

Page 16: Introduction to HBase

HBase Architecture

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Page 17: Introduction to HBase

How data is actually stored

Page 18: Introduction to HBase

Write-Ahead Log

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
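
The general write-ahead-log technique can be sketched as: record each edit in a durable log first, apply it in memory second, and replay the log after a crash. The Python below is a minimal model of that idea, not HBase's actual log format or code: `ToyRegionServer`, `put`, and `recover` are hypothetical names, and a plain list stands in for the log file.

```python
class ToyRegionServer:
    """Toy write path: append to the log first, then update the in-memory store."""

    def __init__(self, log):
        self.log = log        # append-only list standing in for a durable log file
        self.memstore = {}    # in-memory edits, lost if the process crashes

    def put(self, row, column, value):
        self.log.append((row, column, value))   # 1. record the edit in the log
        self.memstore[(row, column)] = value    # 2. then apply it in memory

    @classmethod
    def recover(cls, log):
        # Rebuild the in-memory state by replaying every logged edit in order.
        server = cls(log)
        for row, column, value in log:
            server.memstore[(row, column)] = value
        return server

log = []
rs = ToyRegionServer(log)
rs.put("row1", "cf:a", "1")
rs.put("row1", "cf:a", "2")
rs.put("row2", "cf:b", "x")

recovered = ToyRegionServer.recover(log)   # simulate a restart after a crash
print(recovered.memstore == rs.memstore)
```

Replaying in log order matters: the second write to `row1` must win, just as it did before the crash.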

Page 19: Introduction to HBase

HLog

Page 20: Introduction to HBase

Demo

Page 21: Introduction to HBase

HBase - Roadmap

- HBase 0.92.0
  - Coprocessors
  - Distributed Log Splitting
  - Running Tasks in UI
  - Performance Improvements
- HBase 0.94.0
  - Security
  - Secondary Indexes
  - Search Integration
  - HFile v2