Musings on Secondary Indexing in HBase

Secondary Indexing the discussion so far…. 9/11/12 HBase Pow-wow Jesse Yates Salesforce.com


Presentation on Secondary Indexes from the 9/11/12 HBase Contributor's Meetup. It discusses the current state of the discussion and some possible future directions.

Transcript of Musings on Secondary Indexing in HBase

Page 1: Musings on Secondary Indexing in HBase

Secondary Indexing

the discussion so far….

9/11/12 HBase Pow-wow

Jesse Yates, Salesforce.com

Page 2: Musings on Secondary Indexing in HBase

What is it?

Page 3: Musings on Secondary Indexing in HBase

Problem

• HBase rows are multi-dimensional
  – Only sorted on the row key

• How do you efficiently look up deeper into the row key?

Page 4: Musings on Secondary Indexing in HBase

Example

Row  Family  Qualifier  Timestamp  Value
1    Name    First      0          Babe
1    Name    Last       0          Ruth

How do we find all people with the last name ‘Ruth’?

Full table scan!

Page 5: Musings on Secondary Indexing in HBase

Indexing!

Row   Family  Qualifier  Timestamp  Value
Ruth  Name    Last       0          1

Store the property we need to search for as the primary key
• pointer back to the primary row
• fast lookup - O(lg(n))
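
A minimal sketch of the difference between the full scan and the index lookup, using a toy in-memory model rather than the HBase client API. The second person in the table, the function names, and the composite index key (indexed value, primary row) are assumptions added for illustration; the slide's index row uses the value alone as the key.

```python
import bisect

# Toy primary table: (row, family, qualifier) -> value.
# An in-memory stand-in, not an HBase table.
primary = {
    ("1", "Name", "First"): "Babe",
    ("1", "Name", "Last"): "Ruth",
    ("2", "Name", "First"): "Hank",
    ("2", "Name", "Last"): "Aaron",
}

def full_scan_by_last_name(table, last_name):
    """Visit every cell in the table: O(n)."""
    return sorted(row for (row, fam, qual), value in table.items()
                  if fam == "Name" and qual == "Last" and value == last_name)

def build_index(table):
    """Sorted index entries of the form (last name, primary row)."""
    return sorted((value, row) for (row, fam, qual), value in table.items()
                  if fam == "Name" and qual == "Last")

def index_lookup(index, last_name):
    """Binary search to the first matching key, then read adjacent entries."""
    i = bisect.bisect_left(index, (last_name, ""))
    rows = []
    while i < len(index) and index[i][0] == last_name:
        rows.append(index[i][1])  # pointer back to the primary row
        i += 1
    return rows

index = build_index(primary)
```

Both `full_scan_by_last_name(primary, "Ruth")` and `index_lookup(index, "Ruth")` return the same primary rows; only the second avoids touching every cell.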

Page 6: Musings on Secondary Indexing in HBase

Use Cases

• Point lookups
  – Volume of data influences usefulness of index
    • Let user decide if they need to use an index

• Scan lookup
  – WHERE age > 16
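
A sketch of how a scan lookup such as WHERE age > 16 could be served from a sorted index. It assumes index keys are byte strings compared lexicographically, so the age is encoded big-endian to make byte order match numeric order. The sample rows and function names are hypothetical.

```python
import bisect
import struct

def encode_age(age):
    """4-byte big-endian unsigned int: byte order matches numeric order."""
    return struct.pack(">I", age)

# Hypothetical data: primary row -> age.
people = {"r1": 15, "r2": 16, "r3": 17, "r4": 42}

# Index key = encoded age followed by the primary row key, kept sorted.
age_index = sorted(encode_age(age) + row.encode("utf-8")
                   for row, age in people.items())

def rows_with_age_greater_than(index, age):
    """Seek to the first key with age >= age + 1, then read to the end."""
    start = bisect.bisect_left(index, encode_age(age + 1))
    return [key[4:].decode("utf-8") for key in index[start:]]
```

The encoding matters: as decimal strings "2" sorts after "10", while `encode_age(2)` sorts before `encode_age(10)`.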

Page 7: Musings on Secondary Indexing in HBase

Implementations

Page 8: Musings on Secondary Indexing in HBase

Omid

Full transactional support
Centralized oracle

Page 9: Musings on Secondary Indexing in HBase

Lily

WAL implementation on top of HBase
100-500 writes/sec

Page 10: Musings on Secondary Indexing in HBase

Percolator

Full transactions
Distributed, optimistic locking

~10 sec latencies possible

Page 11: Musings on Secondary Indexing in HBase

Culvert

Async
Dead project, incomplete

Page 12: Musings on Secondary Indexing in HBase

http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html

Client-side coordinated index
Use timestamps to coordinate

Not yet implemented

Page 13: Musings on Secondary Indexing in HBase

Trend Micro Implementation

Still just POC???

Page 14: Musings on Secondary Indexing in HBase

Solr/Lucene

Standard Lucene library bolted on HBase
Not commonly used

Lots of formats/codecs already written

Page 15: Musings on Secondary Indexing in HBase

Considerations for HBase

What do we need to do?

Page 16: Musings on Secondary Indexing in HBase

Built-in vs. external library vs. semi-supported (e.g. security)

Page 17: Musings on Secondary Indexing in HBase

Which should I use??

• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make?
• How close to the core is the project?
  – Written in everywhere
  – hbase-index module
  – External library

Page 18: Musings on Secondary Indexing in HBase

Async vs. Synchronous vs. Transactional

Page 19: Musings on Secondary Indexing in HBase

Key Observation

“Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.”

- Lars Hofhansl
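
One way to read the idempotency point, sketched with a toy in-memory index (the cell layout and function names are assumptions): an index entry is fully determined by the value, the primary row, and the timestamp, so a client that retries the write after a timeout leaves the index unchanged. A counter increment, by contrast, is not safe to retry.

```python
def put_index_entry(index, value, primary_row, timestamp):
    """Write one index cell; writing the same cell again changes nothing."""
    index[(value, primary_row, timestamp)] = b""

def increment_counter(counters, key):
    """Not idempotent: every retry changes the stored state."""
    counters[key] = counters.get(key, 0) + 1

index = {}
put_index_entry(index, "Ruth", "1", 0)
put_index_entry(index, "Ruth", "1", 0)   # blind retry of the same update

counters = {}
increment_counter(counters, "Ruth")
increment_counter(counters, "Ruth")      # the same retry double-counts
```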

Page 20: Musings on Secondary Indexing in HBase

Async vs. Synchronous vs. Transactional

• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as number of servers increases
• Optionally async or sync
  – Async
    • Inherently ‘dirty’ index

• How does index cleanup work?
  – Inherently different for each type
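
The slides leave cleanup open. As one possible approach (an assumption, not something proposed in the deck), a reader can verify each index entry against the primary row and delete entries that no longer match. The data and names below are hypothetical.

```python
def lookup_with_cleanup(index, primary, value):
    """Return rows whose current value matches; drop stale index entries."""
    live = []
    # Iterate over a sorted copy so the index can be modified safely.
    for key in sorted(k for k in index if k[0] == value):
        row = key[1]
        if primary.get(row) == value:
            live.append(row)
        else:
            index.discard(key)   # stale pointer left behind by an async update
    return live

# Primary row "2" was renamed from Ruth to Aaron; the old index entry remains.
primary = {"1": "Ruth", "2": "Aaron"}
dirty_index = {("Ruth", "1"), ("Ruth", "2"), ("Aaron", "2")}

matches = lookup_with_cleanup(dirty_index, primary, "Ruth")
```

This keeps reads correct at the cost of one extra check per index hit, and the index converges as stale entries are read.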

Page 21: Musings on Secondary Indexing in HBase

Locality

Page 22: Musings on Secondary Indexing in HBase

Where’s my data?

• Extra columns vs. index table
• HBase Region-pinning
  – Has to be best-effort or will decrease availability
  – Helps minimize RPC overhead
  – Cross-table region-pinning
  – Needs a coprocessor hook to be useful

• HDFS block allocation
  – Keep index and data blocks on same HDFS node

Page 23: Musings on Secondary Indexing in HBase

Index Cardinality

Page 24: Musings on Secondary Indexing in HBase

How much data are we talking?

“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2. dense indexes (like eventType) where there are likely values of every index key on each region
3. very dense indexes (like male/female) where you should just be doing a table scan anyway”

- Matt Corgan (9/10/12)
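
A rough way to see why the very dense case favors a table scan, under the simplifying assumption that values are spread evenly across keys: the average fraction of the table that one index key points at is 1 / (number of distinct values). The sample data and function name are made up for illustration, and the quote gives no numeric thresholds, so none are used here.

```python
def fraction_matched_per_key(values):
    """Average fraction of rows returned by a lookup on a single index key."""
    return 1.0 / len(set(values))

ip_addresses = ["10.0.0.%d" % i for i in range(200)]   # sparse: all distinct
genders = ["male", "female"] * 100                      # very dense: two values

sparse_fraction = fraction_matched_per_key(ip_addresses)
dense_fraction = fraction_matched_per_key(genders)
```

With two distinct values, each index lookup points back at half of the table, so following the pointers costs more than scanning the table once.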

Page 25: Musings on Secondary Indexing in HBase

Impact on implementation

• Need a lot of knowledge of data to pick the right kind of index
  – User knows their data, let them do the hard work of picking indexes

Page 26: Musings on Secondary Indexing in HBase

Pluggability

Page 27: Musings on Secondary Indexing in HBase

Everyone’s got an impl already

• We need to make HBase flexible enough to support (most) current indexing formats with minimal overhead for switching
  – Lucene style Codec/CodecProvider?
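
A hypothetical sketch of what a codec/provider split could look like. None of these classes exist in HBase or Lucene; the names, the registry, and the zero-byte separator are all assumptions. The idea is that each index format supplies its own key encoding, and the provider resolves a configured name to an encoder.

```python
import struct

class StringCodec:
    """Encodes text as UTF-8, so keys sort lexicographically."""
    def encode(self, value):
        return value.encode("utf-8")

class UnsignedIntCodec:
    """Encodes non-negative ints big-endian, so keys sort numerically."""
    def encode(self, value):
        return struct.pack(">Q", value)

class CodecProvider:
    """Hypothetical registry: configured codec name -> codec instance."""
    def __init__(self):
        self._codecs = {}

    def register(self, name, codec):
        self._codecs[name] = codec

    def lookup(self, name):
        return self._codecs[name]

def index_key(provider, codec_name, value, primary_row):
    """Index row key: encoded value, a zero-byte separator, the primary row."""
    encoded = provider.lookup(codec_name).encode(value)
    return encoded + b"\x00" + primary_row.encode("utf-8")

provider = CodecProvider()
provider.register("string", StringCodec())
provider.register("uint64", UnsignedIntCodec())
```

Switching index formats then means registering a different codec under the same name. A real design would also need to handle values that themselves contain the separator byte.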

Page 28: Musings on Secondary Indexing in HBase

Client-interface

Page 29: Musings on Secondary Indexing in HBase

What should it look like?

• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?

• Enough that the user gets to be smart about what should be used
  – We can’t get all cases right – just provide building blocks
• Automatically use an index?
• Scanner/Filter style use?

Page 30: Musings on Secondary Indexing in HBase

Properties for the client

• Should the user even see the index lookups?

• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?

• Implications on current features
  – Replication
  – Splitting

Page 31: Musings on Secondary Indexing in HBase

Schema(less)

• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?

• Schema-less
  – Let the user apply whatever they think and use only what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the known indexes

Page 32: Musings on Secondary Indexing in HBase

My random thoughts….

• Client-side managed indexes are efficient
  – Minimal RPC overhead
    • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
    • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
    • Locality aware indexes to minimize RPCs

Page 33: Musings on Secondary Indexing in HBase

Discussion!