Introduction to HBase

Byeongweon Moon / [email protected]


HBase Key Points

Clustered, commodity(-ish) hardware
Mostly schema-less
Dynamic distribution
Spread writes out over the cluster

HBase

Distributed database modeled on Bigtable
  Bigtable: A Distributed Storage System for Structured Data, by Chang et al.
  http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/papers/bigtable-osdi06.pdf
Runs on top of Hadoop Core
Layers on HDFS for storage
Native connections to MapReduce
Distributed, High Availability, High Performance, Strong Consistency




HBase (cont.)

Column-oriented store
  Wide table costs only the data stored
  NULLs in a row are ‘free’
  Good compression: columns of similar type
  Column name is arbitrary
Rows stored in sorted order
Can random read and write
Goal of billions of rows X millions of columns
  Petabytes of data across thousands of servers
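
The "NULLs are free" and "rows stored in sorted order" points can be illustrated with a small in-memory toy. This is only a sketch in plain Python, not HBase code: the class name `SparseSortedTable` and its methods are made up for illustration, and real HBase keeps this data in files on HDFS.

```python
import bisect


class SparseSortedTable:
    """Toy table: rows kept in key order, only non-empty cells are stored."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column name: value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)   # keep row keys sorted
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row, column):
        # A missing cell is simply absent, so it takes no space.
        return self._rows.get(row, {}).get(column)

    def scan(self, start, stop):
        """Yield (row, cells) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

    def stored_cells(self):
        return sum(len(cells) for cells in self._rows.values())


table = SparseSortedTable()
table.put(b"user3", b"info:email", b"c@example.com")
table.put(b"user1", b"info:name", b"Ann")
table.put(b"user2", b"stats:logins", b"7")
scanned = [row for row, _ in table.scan(b"user1", b"user3")]
```

Three rows use three different columns, yet only three cells are stored, and the scan returns rows in key order regardless of insertion order.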

Column Oriented Storage

!HBase

“NoSQL” Database
  No joins
  No sophisticated query engine
  No transactions (sort of)
  No column typing
  No SQL, no ODBC/JDBC, etc.
Not a replacement for RDBMS
Matching Impedance

Why HBase?

Datasets are reaching Petabytes
Traditional databases are expensive to scale and difficult to distribute
Commodity hardware is cheap and powerful
Need for random access as well as batch processing (Hadoop alone does not offer random access)

Tables

Table is split into roughly equal-sized “regions”
Each region is a contiguous range of keys
Regions split as they grow, thus dynamically adjusting to your data set
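
A rough sketch of how key-range regions and splitting might work, in plain Python. Everything here is an assumption for illustration: the names `Region` and `RegionedTable` are hypothetical, and the row-count threshold `MAX_ROWS` stands in for whatever size-based rule a real cluster uses.

```python
import bisect


class Region:
    def __init__(self, start_key, rows=None):
        self.start_key = start_key   # inclusive lower bound of the key range
        self.rows = rows or {}       # row key -> value


class RegionedTable:
    MAX_ROWS = 4  # hypothetical split threshold, chosen only for the demo

    def __init__(self):
        self.regions = [Region(b"")]  # starts as one region covering all keys

    def locate(self, key):
        """Return the index of the region whose key range contains key."""
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, key) - 1

    def put(self, key, value):
        idx = self.locate(key)
        region = self.regions[idx]
        region.rows[key] = value
        if len(region.rows) > self.MAX_ROWS:
            self._split(idx)

    def _split(self, idx):
        region = self.regions[idx]
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        # Move the upper half of the key range into a new region.
        upper = Region(mid, {k: region.rows.pop(k) for k in keys if k >= mid})
        self.regions.insert(idx + 1, upper)


regioned = RegionedTable()
for ch in b"abcdefghij":
    regioned.put(bytes([ch]), b"v")
region_starts = [r.start_key for r in regioned.regions]
```

After ten inserts the single initial region has split into four contiguous key ranges, and `locate` still finds the right region for any key with a binary search over the start keys.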

Table (cont.)

Tables are sorted by Row
Table schema defines column families
  Families consist of any number of columns
  Columns consist of any number of versions
  Everything except table name is byte[]
(Table, Row, Family:Column, Timestamp) -> Value

Table (cont.)

As a data structure

SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)
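
The nested-map view can be mimicked in a few lines of Python. This is a minimal sketch, assuming integer timestamps and a made-up class name `VersionedTable`; it only models the (Row, Family:Column, Timestamp) -> Value mapping and is not the HBase client API.

```python
from collections import defaultdict


class VersionedTable:
    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self._data = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        versions = self._data[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        for ts, value in self._data.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row(self, row):
        """Return the latest value of every column in a row, in column order."""
        cols = self._data.get(row, {})
        return {col: cols[col][0][1] for col in sorted(cols)}


versioned = VersionedTable()
versioned.put(b"row1", b"info:name", 1, b"Ann")
versioned.put(b"row1", b"info:name", 2, b"Anna")
versioned.put(b"row1", b"info:city", 1, b"Seoul")
latest_name = versioned.get(b"row1", b"info:name")
older_name = versioned.get(b"row1", b"info:name", timestamp=1)
```

Writing a cell twice does not overwrite it: both versions are kept, and a read either takes the newest one or asks for the value as of a given timestamp.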

HBase Open Source Stack

ZooKeeper : Small Data Coordination Service
HBase : Database Storage Engine
HDFS : Distributed File System
Hadoop : Asynchronous Map-Reduce Jobs

Server Architecture

Similar to HDFS
  Master == Namenode
  Regionserver == Datanode
Often run these alongside each other!
Difference: HBase stores state in HDFS
HDFS provides robust data storage across machines, insulating against failure
Master and Regionserver fairly stateless and machine independent

Region Assignment

Each region from every table is assigned to a Regionserver
Master Duties:
  Responsible for assignment and handling regionserver problems (if any!)
  When machines fail, move regions
  When regions split, move regions to balance
  Could move regions to respond to load
  Can run multiple backup masters
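
The "when machines fail, move regions" duty might look roughly like this. A toy sketch only: the `ToyMaster` class and the least-loaded placement rule are assumptions made for illustration, not the balancing policy HBase actually uses.

```python
class ToyMaster:
    def __init__(self, servers):
        # regionserver name -> set of regions it currently serves
        self.assignments = {server: set() for server in servers}

    def _least_loaded(self):
        # Ties are broken by server name so the demo is deterministic.
        return min(sorted(self.assignments),
                   key=lambda s: len(self.assignments[s]))

    def assign(self, region):
        self.assignments[self._least_loaded()].add(region)

    def server_failed(self, server):
        """Drop a dead regionserver and hand its regions to the survivors."""
        orphaned = self.assignments.pop(server)
        for region in sorted(orphaned):
            self.assign(region)


master = ToyMaster(["rs1", "rs2", "rs3"])
for i in range(6):
    master.assign(f"region-{i}")
master.server_failed("rs2")
loads = {server: len(regions) for server, regions in master.assignments.items()}
```

Because the region data itself lives in HDFS, moving a region in this model is only a bookkeeping change: no rows are copied between servers.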

Master

The master does NOT
  Handle any write request (not a DB master!)
  Handle location finding requests
Not involved in the read/write path
Generally does very little most of the time

Distributed Coordination

ZooKeeper is used to manage master election and server availability

Set up as a cluster, provides distributed coordination primitives

An excellent tool for building cluster management systems

HBase Architecture

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html


How data is actually stored

Write-Ahead Log

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
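
The general write-ahead-log idea is to record each edit durably before applying it in memory, so that a restarted server can replay the log. A minimal sketch in Python follows; the `TinyStore` class and the JSON-lines log format are assumptions for illustration and do not reflect HBase's actual log format.

```python
import json
import os
import tempfile


class TinyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}   # in-memory state, lost on a crash
        self._replay()       # rebuild state from the log, if one exists

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                self.memstore[entry["row"]] = entry["value"]

    def put(self, row, value):
        # 1. Append the edit to the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"row": row, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then apply it to the in-memory store.
        self.memstore[row] = value


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_file)
store.put("row1", "a")
store.put("row2", "b")
recovered = TinyStore(wal_file)  # simulates a restart after a crash
```

The second `TinyStore` never saw the original `put` calls, yet it ends up with the same contents because it replays the log on startup.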



HBase - Roadmap

HBase 0.92.0
  Coprocessors
  Distributed Log Splitting
  Running Tasks in UI
  Performance Improvements

HBase 0.94.0
  Security
  Secondary Indexes
  Search Integration
  HFile v2

Reference

http://ofps.oreilly.com/titles/9781449396107/index.html

http://hbase.apache.org/book.html#quickstart

http://www.larsgeorge.com/2010/02/fosdem-2010-nosql-talk.html










