HBase Lightning Talk

Scott Leberknight

description

Slides for a lightning talk on HBase that I gave at the Near Infinity (www.nearinfinity.com) Spring 2012 conference. The associated sample code is on GitHub at https://github.com/sleberknight/basic-hbase-examples

Transcript of HBase Lightning Talk

Page 1: HBase Lightning Talk

Scott Leberknight

Apache HBase

Page 2: HBase Lightning Talk

BACKGROUND

Page 3: HBase Lightning Talk

Bigtable

Google

Page 4: HBase Lightning Talk

"Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable including web indexing, Google Earth, and Google Finance."

- Bigtable: A Distributed Storage System for Structured Data

http://labs.google.com/papers/bigtable.html

Page 5: HBase Lightning Talk

"A Bigtable is a sparse, distributed, persistent multidimensional sorted map"

- Bigtable: A Distributed Storage System for Structured Data

http://labs.google.com/papers/bigtable.html

Page 6: HBase Lightning Talk

wtf?

Page 7: HBase Lightning Talk

distributed

sparse

column-oriented

versioned

Page 8: HBase Lightning Talk

(row key, column key, timestamp) => value

"The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."

- Bigtable: A Distributed Storage System for Structured Data

http://labs.google.com/papers/bigtable.html
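
The "(row key, column key, timestamp) => value" mapping can be pictured as one sorted map with a three-part key. The following is a minimal in-memory sketch of that idea in plain Java, not Bigtable or HBase code. The class name `SortedCellMapSketch`, the nested `CellKey` type, and the small numeric timestamps are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Illustration only: a hypothetical in-memory model, not Bigtable or HBase code.
final class SortedCellMapSketch {

    /** Composite key: (row key, column key, timestamp). */
    static final class CellKey {
        final String row;
        final String column;
        final long timestamp;

        CellKey(String row, String column, long timestamp) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    // Sort by row key, then column key, then timestamp with the newest first.
    private static final Comparator<CellKey> ORDER = (a, b) -> {
        int byRow = a.row.compareTo(b.row);
        if (byRow != 0) return byRow;
        int byColumn = a.column.compareTo(b.column);
        if (byColumn != 0) return byColumn;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Only cells that were written exist in the map, which is what makes it sparse.
    // Values are stored as uninterpreted bytes.
    private final TreeMap<CellKey, byte[]> cells = new TreeMap<>(ORDER);

    void put(String row, String column, long timestamp, String value) {
        cells.put(new CellKey(row, column, timestamp), value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the value at exactly this coordinate, or null if no such cell exists. */
    String get(String row, String column, long timestamp) {
        byte[] raw = cells.get(new CellKey(row, column, timestamp));
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    int size() {
        return cells.size();
    }

    public static void main(String[] args) {
        SortedCellMapSketch map = new SortedCellMapSketch();
        map.put("20120407145045", "info:author", 2L, "John");
        map.put("20120407145045", "info:author", 6L, "John Doe");
        System.out.println(map.get("20120407145045", "info:author", 6L));
        System.out.println(map.get("20120407145045", "info:summary", 6L));
    }
}
```

A single flat `TreeMap` is used here only because it matches the quoted definition most directly; a real store spreads the sorted key space across many servers.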

Page 9: HBase Lightning Talk

Key Concepts:

row key => 20120407152657

column family => "personal:"

column key => "personal:givenName", "personal:surname"

timestamp => 1239124584398

Page 10: HBase Lightning Talk

Row Key         Timestamp  Column Family "info:"                Column Family "content:"
20120407145045  t7         "info:summary"  "An intro to..."
20120407145045  t6         "info:author"   "John Doe"
20120407145045  t5                                              "Google's Bigtable is..."
20120407145045  t4                                              "Google Bigtable is..."
20120407145045  t3         "info:category" "Persistence"
20120407145045  t2         "info:author"   "John"
20120407145045  t1         "info:title"    "Intro to Bigtable"
20120320162535  t4         "info:category" "Persistence"
20120320162535  t3                                              "CouchDB is..."
20120320162535  t2         "info:author"   "Bob Smith"
20120320162535  t1         "info:title"    "Doc-oriented..."

Page 11: HBase Lightning Talk

Row Key         Timestamp  Column Family "info:"                Column Family "content:"
20120407145045  t7         "info:summary"  "An intro to..."
20120407145045  t6         "info:author"   "John Doe"
20120407145045  t5                                              "Google's Bigtable is..."
20120407145045  t4                                              "Google Bigtable is..."
20120407145045  t3         "info:category" "Persistence"
20120407145045  t2         "info:author"   "John"
20120407145045  t1         "info:title"    "Intro to Bigtable"
20120320162535  t4         "info:category" "Persistence"
20120320162535  t3                                              "CouchDB is..."
20120320162535  t2         "info:author"   "Bob Smith"
20120320162535  t1         "info:title"    "Doc-oriented..."

Get row 20120407145045...
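
As a rough illustration of what getting that row means when cells are versioned: by default a get returns only the newest version of each column, and older versions come back only when asked for. The sketch below models this in plain Java with nested sorted maps. It is not HBase code; the class `VersionedRowSketch`, its method names, and the use of 1 through 7 as stand-ins for t1 through t7 are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a hypothetical in-memory model of versioned cells, not HBase code.
final class VersionedRowSketch {

    // row key -> column key -> (timestamp, newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Newest value of each column in the row, in column-key order. */
    Map<String, String> getRow(String row) {
        Map<String, String> latest = new LinkedHashMap<>();
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns != null) {
            for (Map.Entry<String, TreeMap<Long, String>> column : columns.entrySet()) {
                latest.put(column.getKey(), column.getValue().firstEntry().getValue());
            }
        }
        return latest;
    }

    /** Up to maxVersions values of one column, newest first. */
    List<String> getVersions(String row, String column, int maxVersions) {
        List<String> values = new ArrayList<>();
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return values;
        }
        for (String value : columns.get(column).values()) {
            if (values.size() == maxVersions) break;
            values.add(value);
        }
        return values;
    }

    /** Loads the cells of row 20120407145045 from the table, with 1..7 standing in for t1..t7. */
    static VersionedRowSketch sampleTable() {
        VersionedRowSketch table = new VersionedRowSketch();
        String row = "20120407145045";
        table.put(row, "info:title", 1L, "Intro to Bigtable");
        table.put(row, "info:author", 2L, "John");
        table.put(row, "info:category", 3L, "Persistence");
        table.put(row, "content:", 4L, "Google Bigtable is...");
        table.put(row, "content:", 5L, "Google's Bigtable is...");
        table.put(row, "info:author", 6L, "John Doe");
        table.put(row, "info:summary", 7L, "An intro to...");
        return table;
    }

    public static void main(String[] args) {
        VersionedRowSketch table = sampleTable();
        System.out.println(table.getRow("20120407145045"));
        System.out.println(table.getVersions("20120407145045", "info:author", 3));
    }
}
```

Seven cells were written to the row, but the get returns five values, one per column, because "info:author" and "content:" each have an older version that is hidden by default.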

Page 12: HBase Lightning Talk

Use HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable.

- http://hbase.apache.org/

Page 13: HBase Lightning Talk

HBase Shell

    hbase(main):001:0> create 'blog', 'info', 'content'
    0 row(s) in 4.3640 seconds
    hbase(main):002:0> put 'blog', '20120320162535', 'info:title', 'Document-oriented storage using CouchDB'
    0 row(s) in 0.0330 seconds
    hbase(main):003:0> put 'blog', '20120320162535', 'info:author', 'Bob Smith'
    0 row(s) in 0.0030 seconds
    hbase(main):004:0> put 'blog', '20120320162535', 'content:', 'CouchDB is a document-oriented...'
    0 row(s) in 0.0030 seconds
    hbase(main):005:0> put 'blog', '20120320162535', 'info:category', 'Persistence'
    0 row(s) in 0.0030 seconds
    hbase(main):006:0> get 'blog', '20120320162535'
    COLUMN           CELL
     content:        timestamp=1239135042862, value=CouchDB is a doc...
     info:author     timestamp=1239135042755, value=Bob Smith
     info:category   timestamp=1239135042982, value=Persistence
     info:title      timestamp=1239135042623, value=Document-oriented...
    4 row(s) in 0.0140 seconds

Page 14: HBase Lightning Talk

HBase Shell

    hbase(main):015:0> get 'blog', '20120407145045', { COLUMN => 'info:author', VERSIONS => 3 }
    timestamp=1239135325074, value=John Doe
    timestamp=1239135324741, value=John
    2 row(s) in 0.0060 seconds
    hbase(main):016:0> scan 'blog', { STARTROW => '20120300', STOPROW => '20120400' }
    ROW              COLUMN+CELL
     20120320162535  column=content:, timestamp=1239135042862, value=CouchDB is...
     20120320162535  column=info:author, timestamp=1239135042755, value=Bob Smith
     20120320162535  column=info:category, timestamp=1239135042982, value=Persistence
     20120320162535  column=info:title, timestamp=1239135042623, value=Document...
    4 row(s) in 0.0230 seconds

Page 15: HBase Lightning Talk

Got byte[]?

Page 16: HBase Lightning Talk

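    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Create a new table
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    String tableName = "people";
    HTableDescriptor desc = new HTableDescriptor(tableName);
    // Column families are fixed when the table is defined; columns within them are not
    desc.addFamily(new HColumnDescriptor("personal"));
    desc.addFamily(new HColumnDescriptor("contactinfo"));
    desc.addFamily(new HColumnDescriptor("creditcard"));
    admin.createTable(desc);

    System.out.printf("%s is available? %b\n", tableName, admin.isTableAvailable(tableName));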

Page 17: HBase Lightning Talk

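    import static org.apache.hadoop.hbase.util.Bytes.toBytes;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    // Add some data into 'people' table
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "people");
    // One Put carries every cell written to a single row
    Put put = new Put(toBytes("connor-john-m-43299"));
    put.add(toBytes("personal"), toBytes("givenName"), toBytes("John"));
    put.add(toBytes("personal"), toBytes("mi"), toBytes("M"));
    put.add(toBytes("personal"), toBytes("surname"), toBytes("Connor"));
    put.add(toBytes("contactinfo"), toBytes("email"), toBytes("[email protected]"));
    table.put(put);
    table.flushCommits();
    table.close();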

Page 18: HBase Lightning Talk

Finding data:

get (by row key)

scan (by row key ranges, filtering)
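
Both access paths follow from rows being kept sorted by row key: a get is a lookup on one key, and a scan walks a contiguous key range. The sketch below models this in plain Java with a `TreeMap`, using the two blog row keys from the earlier pages. It is not HBase code, and the class `RowScanSketch` and its methods are hypothetical. It assumes ASCII row keys, for which Java string order matches byte order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustration only: a hypothetical in-memory model of get and scan, not HBase code.
final class RowScanSketch {

    // Rows are kept sorted by row key; the value here is just the post title.
    private final TreeMap<String, String> titlesByRowKey = new TreeMap<>();

    void put(String rowKey, String title) {
        titlesByRowKey.put(rowKey, title);
    }

    /** get: look up a single row by its exact key. */
    String get(String rowKey) {
        return titlesByRowKey.get(rowKey);
    }

    /** scan: row keys in [startRow, stopRow), start inclusive and stop exclusive. */
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(titlesByRowKey.subMap(startRow, true, stopRow, false).keySet());
    }

    static RowScanSketch sampleBlog() {
        RowScanSketch blog = new RowScanSketch();
        blog.put("20120320162535", "Doc-oriented...");
        blog.put("20120407145045", "Intro to Bigtable");
        return blog;
    }

    public static void main(String[] args) {
        RowScanSketch blog = sampleBlog();
        System.out.println(blog.get("20120407145045"));
        // Same range as the shell scan: everything posted in March 2012.
        System.out.println(blog.scan("20120300", "20120400"));
    }
}
```

Because the row keys are timestamps written as yyyyMMddHHmmss, a date range becomes a key range, which is why the March scan needs no filter.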

Page 19: HBase Lightning Talk

    // Get a row. Ask for only the data you need.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "people");
    Get get = new Get(toBytes("connor-john-m-43299"));
    get.setMaxVersions(2);                                    // up to 2 versions of each cell
    get.addFamily(toBytes("personal"));                       // every column in 'personal'
    get.addColumn(toBytes("contactinfo"), toBytes("email"));  // only 'email' from 'contactinfo'
    Result result = table.get(get);

Page 20: HBase Lightning Talk

    // Update existing values, and add a new one
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "people");
    Put put = new Put(toBytes("connor-john-m-43299"));
    put.add(toBytes("personal"), toBytes("surname"), toBytes("Smith"));
    put.add(toBytes("contactinfo"), toBytes("email"), toBytes("[email protected]"));
    put.add(toBytes("contactinfo"), toBytes("address"), toBytes("San Diego, CA"));
    table.put(put);
    table.flushCommits();
    table.close();

Page 21: HBase Lightning Talk

    // Scan rows...
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "people");
    Scan scan = new Scan(toBytes("smith-"));  // start row (inclusive)
    scan.addColumn(toBytes("personal"), toBytes("givenName"));
    scan.addColumn(toBytes("contactinfo"), toBytes("email"));
    scan.addColumn(toBytes("contactinfo"), toBytes("address"));
    scan.setFilter(new PageFilter(numRowsPerPage));  // numRowsPerPage is defined elsewhere
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        // process result...
    }
    scanner.close();

Page 22: HBase Lightning Talk

Data Modeling

Row key design

MATCH TO DATA ACCESS PATTERNS

WIDE VS. NARROW ROWS
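
One way to picture matching the row key to an access pattern, using the "connor-john-m-43299" key and the "smith-" scan from the earlier pages: if lookups are by surname, putting the surname first turns "everyone named Smith" into one contiguous key range. The sketch below is plain Java, not HBase code. The key layout (surname, given name, middle initial, numeric id), the class `RowKeySketch`, and its helper methods are assumptions for illustration, and the stop-key trick assumes ASCII keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Illustration only: a hypothetical row key layout, not taken from the sample code.
final class RowKeySketch {

    /** Builds a key such as "connor-john-m-43299", leading with the field that is searched on. */
    static String personRowKey(String surname, String givenName, String middleInitial, int id) {
        return (surname + "-" + givenName + "-" + middleInitial + "-" + id).toLowerCase(Locale.ROOT);
    }

    /** Smallest key that sorts after every key starting with the prefix (ASCII keys assumed). */
    static String stopKeyForPrefix(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    /** All keys that start with the prefix, found as one contiguous range of the sorted set. */
    static List<String> keysWithPrefix(TreeSet<String> sortedKeys, String prefix) {
        return new ArrayList<>(sortedKeys.subSet(prefix, stopKeyForPrefix(prefix)));
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList(
                personRowKey("Connor", "John", "M", 43299),
                personRowKey("Smith", "Jane", "A", 17),
                personRowKey("Smith", "John", "Q", 4),
                personRowKey("Smithson", "Bob", "X", 2)));
        System.out.println(keysWithPrefix(keys, "smith-"));
    }
}
```

The trailing "-" in the prefix matters: scanning from "smith-" skips "smithson-bob-x-2", whereas a bare "smith" prefix would include it.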

Page 23: HBase Lightning Talk

References

http://shop.oreilly.com/product/0636920014348.do

http://shop.oreilly.com/product/0636920021773.do

(3rd edition pub date is May 29, 2012)

http://hbase.apache.org/

Page 24: HBase Lightning Talk

(my info)

scott.leberknight at nearinfinity.com

www.nearinfinity.com/blogs/

twitter: sleberknight