Introduction to HBase
Byeongweon Moon / [email protected]
HBase Key Point
- Clustered, commodity(-ish) hardware
- Mostly schema-less
- Dynamic distribution
- Spread writes out over the cluster
HBase
- Distributed database modeled on Bigtable
  - Bigtable: A Distributed Storage System for Structured Data, by Chang et al.
- Runs on top of Hadoop Core
  - Layers on HDFS for storage
  - Native connections to MapReduce
- Distributed, High Availability, High Performance, Strong Consistency
HBase (cont.)
- Column-oriented store
  - Wide table costs only the data stored
  - NULLs in a row are ‘free’
  - Good compression: columns of similar type
  - Column name is arbitrary
- Rows stored in sorted order
- Supports random reads and writes
- Goal of billions of rows X millions of columns
  - Petabytes of data across thousands of servers
Column Oriented Storage
!HBase
- “NoSQL” Database
  - No joins
  - No sophisticated query engine
  - No transactions (sort of)
  - No column typing
  - No SQL, no ODBC/JDBC, etc.
- Not a replacement for RDBMS
- Matching Impedance
Why HBase?
- Datasets are reaching petabytes
- Traditional databases are expensive to scale and difficult to distribute
- Commodity hardware is cheap and powerful
- Need for random access as well as batch processing (Hadoop alone does not offer random access)
Tables
- Table is split into roughly equal sized “regions”
- Each region is a contiguous range of keys
- Regions split as they grow, thus dynamically adjusting to your data set
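The region idea above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not HBase code: the `Table` class, its `max_region_rows` threshold, and the split-at-the-middle-key rule are hypothetical (real HBase decides when to split based on stored data size, not row count).

```python
from bisect import bisect_right, insort


class Table:
    """Toy table: sorted row keys partitioned into contiguous regions."""

    def __init__(self, max_region_rows=4):
        # Hypothetical split threshold, counted in rows for simplicity.
        self.max_region_rows = max_region_rows
        self.start_keys = [""]   # start key of each region; "" = start of table
        self.regions = [[]]      # sorted row keys held by each region

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        rows = self.regions[i]
        if row_key not in rows:
            insort(rows, row_key)          # keep rows sorted inside the region
        if len(rows) > self.max_region_rows:
            self._split(i)

    def _split(self, i):
        rows = self.regions[i]
        mid = len(rows) // 2
        # The upper half becomes a new region that starts at the middle key.
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])


table = Table(max_region_rows=4)
for key in ["row-%02d" % n for n in range(10)]:
    table.put(key)

for start, rows in zip(table.start_keys, table.regions):
    print(repr(start), rows)
```

Each region stays a contiguous key range, so finding the region for a row is a binary search over the region start keys.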
Table (cont.)
- Tables are sorted by Row
- Table schema defines column families
  - Families consist of any number of columns
  - Columns consist of any number of versions
  - Everything except table name is byte[]
- (Table, Row, Family:Column, Timestamp) -> Value
Table (cont.)
As a data structure:
SortedMap(RowKey, List(
  SortedMap(Column, List(
    Value, Timestamp)
  ))
)
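As a rough illustration of this nested-map view, here is a minimal in-memory sketch in Python. The `ToyTable` class and its method names are hypothetical and only mimic the shape of the data model (row key, then family:column, then timestamped versions). Keys and values are `bytes` to echo the byte[] rule, and a plain dict sorted at scan time stands in for a real sorted map.

```python
import time


class ToyTable:
    """Toy model of (Row, Family:Column, Timestamp) -> Value."""

    def __init__(self):
        # row key -> {b"family:column" -> [(timestamp, value), ...], newest first}
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)   # newest version first

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self.rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row=b"", stop_row=None):
        """Yield (row, {column: newest value}) in row-key order."""
        for row in sorted(self.rows):
            if row < start_row:
                continue
            if stop_row is not None and row >= stop_row:
                break
            yield row, {c: v[0][1] for c, v in sorted(self.rows[row].items())}


t = ToyTable()
t.put(b"user1", b"info:name", b"Ann", timestamp=1)
t.put(b"user1", b"info:name", b"Anna", timestamp=2)
t.put(b"user2", b"info:email", b"bob@example.com", timestamp=1)

print(t.get(b"user1", b"info:name"))               # newest version
print(t.get(b"user1", b"info:name", timestamp=1))  # older version
print(list(t.scan()))
```

Note that `user2` has no `info:name` entry at all: a missing column simply does not exist in the map, which is why NULLs cost nothing.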
HBase Open Source Stack
- ZooKeeper: Small Data Coordination Service
- HBase: Database Storage Engine
- HDFS: Distributed File System
- Hadoop: Asynchronous Map-Reduce Jobs
Server Architecture
- Similar to HDFS
  - Master == Namenode
  - Regionserver == Datanode
- Often run these alongside each other!
- Difference: HBase stores state in HDFS
- HDFS provides robust data storage across machines, insulating against failure
- Master and Regionserver fairly stateless and machine independent
Region Assignment
- Each region from every table is assigned to a Regionserver
- Master Duties:
  - Responsible for assignment and handling regionserver problems (if any!)
  - When machines fail, move regions
  - When regions split, move regions to balance
  - Could move regions to respond to load
  - Can run multiple backup masters
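The assignment duties listed above might look roughly like the following toy model. Everything here is an assumption made for illustration: the `ToyMaster` class, the server and region names, and the "give it to the least loaded server" rule are hypothetical and far simpler than HBase's actual balancer.

```python
class ToyMaster:
    """Toy master: tracks which regionserver holds which regions."""

    def __init__(self, servers):
        # regionserver name -> list of region names assigned to it
        self.assignments = {server: [] for server in servers}

    def _least_loaded(self):
        # Fewest regions wins; ties are broken by server name for determinism.
        return min(self.assignments, key=lambda s: (len(self.assignments[s]), s))

    def assign(self, region):
        server = self._least_loaded()
        self.assignments[server].append(region)
        return server

    def handle_failure(self, dead_server):
        # When a machine fails, its regions are moved to the surviving servers.
        orphaned = self.assignments.pop(dead_server)
        for region in orphaned:
            self.assign(region)


master = ToyMaster(["rs1", "rs2", "rs3"])
for region in ["r1", "r2", "r3", "r4", "r5", "r6"]:
    master.assign(region)

master.handle_failure("rs2")
print(master.assignments)
```

The region data itself never moves in this sketch, which matches the point on the previous slide: state lives in HDFS, so reassigning a region is mostly bookkeeping.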
Master
- The master does NOT
  - Handle any write request (not a DB master!)
  - Handle location finding requests
- Not involved in the read/write path
- Generally does very little most of the time
Distributed Coordination
- ZooKeeper is used to manage master election and server availability
- Set up as a cluster, provides distributed coordination primitives
- An excellent tool for building cluster management systems
HBase Architecture
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
How Data Is Actually Stored
Write-ahead-Log
http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
HLog
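The general write-ahead-log idea behind the HLog can be sketched as follows: an edit is appended to a durable log before it is applied in memory, so a replacement server can replay the log after a crash. This is a toy under stated assumptions, not the real HLog: the `ToyRegionServer` class, the JSON-lines file format, and the local temp file are all hypothetical stand-ins (HBase keeps its log on HDFS in its own format).

```python
import json
import os
import tempfile


class ToyRegionServer:
    """Toy server: every edit is logged to disk before it is applied in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memstore = {}       # in-memory edits, lost if the process dies
        self._replay()           # rebuild the memstore from any existing log
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                edit = json.loads(line)
                self.memstore[(edit["row"], edit["column"])] = edit["value"]

    def put(self, row, column, value):
        # 1. Persist the edit to the log first ...
        self.log.write(json.dumps({"row": row, "column": column, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. ... and only then apply it in memory.
        self.memstore[(row, column)] = value

    def close(self):
        self.log.close()


log_path = os.path.join(tempfile.mkdtemp(), "toy-hlog.jsonl")

server = ToyRegionServer(log_path)
server.put("row1", "info:name", "Ann")
server.put("row1", "info:name", "Anna")
server.put("row2", "info:email", "bob@example.com")
server.close()
del server                               # simulate a crash: memory is gone

recovered = ToyRegionServer(log_path)    # a replacement server replays the log
print(recovered.memstore)
recovered.close()
```

Replaying edits in log order means the later write to `row1` wins, so the recovered state matches what the first server held before it went away.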
Demo
HBase - Roadmap
- HBase 0.92.0
  - Coprocessors
  - Distributed Log Splitting
  - Running Tasks in UI
  - Performance Improvements
- HBase 0.94.0
  - Security
  - Secondary Indexes
  - Search Integration
  - HFile v2
Reference
http://ofps.oreilly.com/titles/9781449396107/index.html
http://hbase.apache.org/book.html#quickstart
http://www.larsgeorge.com/2010/02/fosdem-2010-nosql-talk.html