An overview and discussion on indexing data in Redis to facilitate fast and efficient data retrieval. Presented on September 22nd, 2014 to the Redis Tel Aviv Meetup.
Transcript of Redis Indices (#RedisTLV)
1. Redis Indices 127.0.0.1:6379> CREATE INDEX _email ON
user:*->email @itamarhaber / #RedisTLV / 22/9/2014
2. A Little About Myself A Redis Geek and Chief Developers
Advocate at Redis Labs. I write at http://redislabs.com/blog and edit the
Redis Watch newsletter at
http://redislabs.com/redis-watch-archive
3. Motivation
o Redis is a Key-Value datastore -> fetching (is always) by (primary) key is fast
o Searching for keys is expensive - SCAN (or, god forbid, the "evil" KEYS command)
o Searching for values in keys requires a full (hash) table scan & sending the data to the client for processing
5. antirez is Right Redis is a "database SDK" Indices imply
some kind of schema (and there's none in Redis) Redis wasn't made
for indexing ... But despite the Creator's humble opinion,
sometimes you still need a fast way to search :)
6. So What is an Index? "A database index is a data structure
that improves the speed of data retrieval operations" Wikipedia,
2014 Space-Time Tradeoff
7. What Can be Indexed? Data: Key -> Value. Index: Value ->
Key. Values can be numbers or strings. Can be derived from "opaque"
values: JSONs, data structures (e.g. Hash), functions,
8. Index Operations Checklist
   1. Create index from existing data
   2. Update the index on
      a. Addition of new values
      b. Updates of existing values
      c. Deletion of keys (and also RENAME/MIGRATE)
   3. Drop the index
   4. If needed do index housekeeping
   5. Access keys using the index
9. A Simple Example: Reverse Lookup Assume the following
database, where every user has a single unique email address: HMSET
user:1 id "1" email "dfucbitz@terah.net" How would you go about
efficiently fetching the user's ID given an email address?
10. Reverse Lookup (Pseudo) Recipe
def idxEmailAdd(email, id):  # 2.a
    # SETNX only sets the key if it doesn't exist yet, which enforces uniqueness
    if not r.setnx("_email:" + email, id):
        raise Exception("INDEX_EXISTS")

def idxEmailCreate():  # 1
    for u in r.scan_iter(match="user:*"):
        id, email = r.hmget(u, "id", "email")
        idxEmailAdd(email, id)
11. Reverse Lookup Recipe, more admin
def idxEmailDel(email):  # 2.c
    r.delete("_email:" + email)

def idxEmailUpdate(old, new, id):  # 2.b
    idxEmailDel(old)
    idxEmailAdd(new, id)

def idxEmailDrop(): ...  # similar to Create
14. Reverse Lookup Recipe, Analysis Asymptotic computational
complexity: o Creating the index: O(N), N is no. of values o Adding
a new value to the index: O(1) o Deleting a value from the index:
O(1) o Updating a value: O(1) + O(1) = O(1) o Deleting the index:
O(N), N is no. of values What about memory? Every key in Redis
takes up some extra space...
15. Hash Index _email = { "dfucbitz@terah.net": 1,
"foo@bar.baz": 2 ... } Small lookups (e.g. countries) single key
Big lookups partitioned to "buckets" (e.g. by email address hash
value) More info: http://redis.io/topics/memory-optimization
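The "buckets" idea can be sketched as follows. This is only an illustration: the bucket count of 1024, the CRC32 hash and the `bucket_key` helper are assumptions, not something the slides prescribe.

```python
import zlib


def bucket_key(email: str, buckets: int = 1024) -> str:
    """Return the name of the Hash key (bucket) that would hold this email's entry."""
    # CRC32 is stable across processes, unlike Python's built-in hash()
    bucket = zlib.crc32(email.encode("utf-8")) % buckets
    return f"_email:{bucket}"


key = bucket_key("dfucbitz@terah.net")
# With a redis-py client `r` (assumed), the index entry would then live at:
#   r.hset(key, "dfucbitz@terah.net", 1)   # add
#   r.hget(key, "dfucbitz@terah.net")      # lookup
```

Keeping each bucket small is what lets Redis use its compact Hash encoding, which is the point of the memory-optimization page linked above.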
16. Always Remember That You Are Absolutely Unique (Just Like
Everyone Else)
17. Uniqueness The lookup recipe makes the assumption that
every user has a single email address and that it's unique (i.e.
1:1 relationship). What happens if several keys (users) have the
same indexed value (email)?
18. Non-Uniqueness with Lists Use lists instead of using Redis'
strings/hashes. To add: r.lpush("_email:" + email, id) # 2.a
Simple. What about accessing the list for writes or reads?
Naturally, getting all of the list's members is O(N) but...
19. What?!? WTF do you mean O(N)?!? Because a Redis List is
essentially a linked list, traversing it requires up to N
operations (LINDEX, LRANGE). That means that updates & deletes
are O(N) Conclusion: suitable when N (i.e. number of duplicate
index entries) is smallish (e.g. < 10)
20. OT: A Tip for Traversing Lists Lists don't have LSCAN, but
with RPOPLPUSH you can easily do a circular list pattern and go
over all the members in O(N) w/o copying the entire list. More at:
http://redis.io/commands/rpoplpush
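A minimal sketch of the circular list pattern, assuming `r` is a redis-py style client that exposes `llen` and `rpoplpush`. The `visit_all` helper and its `visit` callback are hypothetical names used for illustration.

```python
def visit_all(r, key, visit):
    """Visit every member of the List at `key` once by rotating it in place."""
    n = r.llen(key)
    for _ in range(n):
        # Using the same key as source and destination pops the tail
        # and pushes it back onto the head, i.e. one rotation step.
        member = r.rpoplpush(key, key)
        visit(member)
```

After N rotations the list is back in its original order. The sketch is not atomic as a whole, so members pushed or popped by other clients during the loop may be missed or seen twice.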
21. Back to Non-Uniqueness - Hashes Use Hashes to store
multiple index values: r.hset("_email:" + email, id, "") # 2.a
Great - still O(1). How about deleting? r.hdel("_email:" + email,
id) # 2.c Another O(1). (The Hash field's value is unused.)
22. Non-Uniqueness, Sets Variant r.sadd("_email:" + email, id)
# 2.a Great - still O(1). How about deleting? r.srem("_email:" +
email, id) # 2.c Another O(1).
23. List vs. Hash vs. Set for NUIVs* * Non-Unique Index Value
Memory: List ~= Set ~= Hash (N < 100) Performance: List <
Set, Hash Unlike a List's elements, Set members and Hash fields
are: o Unique - meaning you can't index the same key more than once
(makes sense). o Unordered - a non-issue for this type of index. o
Are SCANable Forget Lists, use Sets or Hashes.
24. Forget Hashes, Sets are Better Because of the Set
operations: SUNION, SDIFF, SINTER Endless possibilities, including
matchmaking: SINTER _interest:devops _hair:blond _gender:...
25. [This Slide has No Title] NULL means no value and Redis is
all about values. When needed, arbitrarily decide on a value for
NULLs (e.g. "") and handle it appropriately in code.
26. Index Cardinality (~= unique values) High cardinality/no
duplicates -> use a Hash Some duplicates -> use Hash and
"pointers" to Sets _email = { "dfucbitz@terah.net": 1,
"foo@bar.baz": "*" ...} _email:foo@bar.baz = { 2, 3 } Low
cardinality is, however, another story...
27. Low Cardinality When an indexed attribute has a small
number of possible values (e.g. Boolean, gender...): If
distribution of values is 50:50, consider not indexing it at all If
distribution is heavily unbalanced (5:95), index only the smaller
subsets and full-scan the rest. Use a bitmap index if possible.
28. Bitmap Index Assumption: key names are ordered. How: a
Bitset where a bit's position maps to a key and the bit's value is
the indexed value:
_isLoggedIn = /100/
first bit (1) -> dfucbitz is online
second bit (0) -> foo isn't logged in
29. Bitmap Index, cont. More than 2 values? Use n Bitsets,
where n is the number of possible indexed values, e.g.:
_isFromTerah = /100.../
_isFromEarth = /010.../
Bonus: BITOP AND / OR / XOR / NOT
BITOP NOT _ET _isFromEarth
BITOP AND onlineET _isLoggedIn _ET
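To see what the two BITOP calls compute, here is a pure-Python simulation of the bitmaps. It does not talk to Redis: `setbit`, `getbit`, `bitop_not` and `bitop_and` are local stand-ins that mimic the Redis semantics (bit 0 is the most significant bit of the first byte), and the three-user data set is the one from the slides.

```python
def setbit(buf: bytearray, offset: int, value: int) -> None:
    """Mimic SETBIT: grow the buffer with zero bytes as needed, then set one bit."""
    byte, bit = divmod(offset, 8)
    if byte >= len(buf):
        buf.extend(bytes(byte + 1 - len(buf)))
    mask = 0x80 >> bit  # offset 0 is the most significant bit
    buf[byte] = (buf[byte] | mask) if value else (buf[byte] & (0xFF ^ mask))


def getbit(buf: bytes, offset: int) -> int:
    """Mimic GETBIT: bits beyond the end of the buffer read as 0."""
    byte, bit = divmod(offset, 8)
    return (buf[byte] >> (7 - bit)) & 1 if byte < len(buf) else 0


def bitop_not(a: bytes) -> bytearray:
    """Mimic BITOP NOT: flip every bit, including the padding bits."""
    return bytearray(~x & 0xFF for x in a)


def bitop_and(a: bytes, b: bytes) -> bytearray:
    """Mimic BITOP AND: the shorter operand is zero-padded."""
    n = max(len(a), len(b))
    a, b = bytes(a).ljust(n, b"\0"), bytes(b).ljust(n, b"\0")
    return bytearray(x & y for x, y in zip(a, b))


is_logged_in = bytearray()
is_from_earth = bytearray()
setbit(is_logged_in, 0, 1)   # _isLoggedIn  = /100/
setbit(is_from_earth, 1, 1)  # _isFromEarth = /010/

et = bitop_not(is_from_earth)            # BITOP NOT _ET _isFromEarth
online_et = bitop_and(is_logged_in, et)  # BITOP AND onlineET _isLoggedIn _ET
```

Note that NOT also turns the unused padding bits into 1s, so positions beyond the last real key should be ignored or masked by an AND with a bitmap of known keys.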
30. Interlude: Redis Indices Save Space Consider the following:
in a relational database you need "x2" space: for the indexed data
(stored in a table) and for the index itself. With most Redis
indices, you don't have to store the indexed data -> space saved
:)
31. Numerical Ranges with Sorted Sets Numerical values,
including timestamps (epoch), are trivially indexed with a Sorted
Set: ZADD _yearOfBirth 1972 "1" 1961 "2"... ZADD _lastLogin
1411245569 "1" Use ZRANGEBYSCORE and ZREVRANGEBYSCORE for range
queries
32. Ordered "Composite" Numerical Indices Use Sorted Set
scores that are constructed according to the sort (range) order. Store two
values in one score using the integer and fractional parts: user:1
= { "id": "1", "weightKg": "82", "heightCm": "218", ... } score =
weightKg + ( heightCm / 1000 )
33. "Composite" Numerical Indices, cont. For more "complex"
sorts (up to 53 bits of precision), you can construct the score
like so: user:1 = { "id": "1", "weightKg": "82", "heightCm": "218",
"IQ": "100", ... } score = weightKg * 1000000 + heightCm * 1000 +
IQ Adapted from:
http://www.dr-josiah.com/2013/10/multi-column-sql-like-sorting-in-redis.html
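The score arithmetic can be checked with a short sketch. The helper names and the three-digit limit per trailing field are assumptions that follow from the multipliers on the slide; Sorted Set scores are doubles, so the packed integer has to stay below 2**53 to remain exact.

```python
def composite_score(weight_kg: int, height_cm: int, iq: int) -> int:
    """Pack three non-negative integers into one sortable score."""
    # Each trailing field gets three decimal digits, matching the
    # * 1000000 and * 1000 multipliers.
    if weight_kg < 0 or not (0 <= height_cm < 1000 and 0 <= iq < 1000):
        raise ValueError("field out of range for this packing")
    score = weight_kg * 1_000_000 + height_cm * 1_000 + iq
    if score >= 2**53:
        raise ValueError("score would lose precision as a double")
    return score


def split_score(score: int) -> tuple:
    """Recover (weight_kg, height_cm, iq) from a packed score."""
    weight_kg, rest = divmod(score, 1_000_000)
    height_cm, iq = divmod(rest, 1_000)
    return weight_kg, height_cm, iq


score = composite_score(82, 218, 100)  # 82218100
```

Sorting by this score orders by weight first, then height, then IQ, which is what a range query such as ZRANGEBYSCORE would then follow.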
34. Full Text Search (Almost) (v2.8.9+) ZRANGEBYLEX on Sorted
Set members that have the same score is handy for suffix wildcard
searches, i.e. dfuc*, a-la autocomplete:
http://autocomplete.redis.io/ Tip: by storing the reversed string
(gnirts) you can also do prefix wildcard searches, i.e. *terah.net,
just as easily.
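The lexicographical range trick can be simulated without a server. The sketch below mimics how a ZRANGEBYLEX range from `[dfuc` to `[dfuc\xff` selects members, using a sorted list of byte strings (Redis compares members byte by byte). The helper names and the third email address are made up for illustration.

```python
import bisect


def lex_prefix(members, prefix):
    """Return the members that start with `prefix`, in byte order."""
    lo = prefix.encode("utf-8")
    hi = lo + b"\xff"  # sorts after any UTF-8 continuation of the prefix
    ordered = sorted(m.encode("utf-8") for m in members)
    i = bisect.bisect_left(ordered, lo)
    j = bisect.bisect_right(ordered, hi)
    return [m.decode("utf-8") for m in ordered[i:j]]


def lex_suffix(members, suffix):
    """Same search on reversed strings, which turns a suffix into a prefix."""
    hits = lex_prefix([m[::-1] for m in members], suffix[::-1])
    return [h[::-1] for h in hits]


emails = ["dfucbitz@terah.net", "foo@bar.baz", "dfuc@example.com"]
starts_with_dfuc = lex_prefix(emails, "dfuc")
ends_with_terah = lex_suffix(emails, "terah.net")
```

In Redis the reversed strings would live in a second Sorted Set, so the suffix search costs extra memory in exchange for the same range lookup.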
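The two lookups can be simulated in plain Python to see why they answer "what is 5?". ZREVRANGEBYSCORE takes its max argument before its min, so the first query asks for the largest lower bound that is <= 5 and the second for the smallest upper bound that is >= 5. The `classify` helper and the rule that both answers must agree are assumptions of this sketch.

```python
import bisect

# Same data as the two ZADD commands: (score, member), sorted by score
MIN = [(1, "low"), (4, "medium"), (7, "high")]
MAX = [(3, "low"), (6, "medium"), (9, "high")]


def classify(value):
    """Return the range name containing `value`, or None if it falls in a gap."""
    # Like: ZREVRANGEBYSCORE min <value> -inf LIMIT 0 1
    i = bisect.bisect_right([s for s, _ in MIN], value) - 1
    # Like: ZRANGEBYSCORE max <value> +inf LIMIT 0 1
    j = bisect.bisect_left([s for s, _ in MAX], value)
    if i < 0 or j >= len(MAX):
        return None
    lower, upper = MIN[i][1], MAX[j][1]
    # The value is inside a range only if both lookups name the same range
    return lower if lower == upper else None


answer = classify(5)
```

A value such as 3.5 sits between "low" and "medium", so the two lookups disagree and the sketch reports no match.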
36. Binary Trees Everybody knows that binary trees are really
useful for searching and other stuff. You can store a binary tree
as an array in a Sorted Set: (Happy 80th Birthday!)
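One possible reading of "a binary tree as an array" is the classic layout where node i has its children at positions 2i+1 and 2i+2, with the position used as the Sorted Set score and the node's value as the member. That mapping is an assumption, not something the slide spells out. The sketch below uses a dict of position -> value as a stand-in for the Sorted Set.

```python
def bst_insert(tree: dict, value) -> int:
    """Insert `value` into the array-encoded binary search tree; return its position."""
    i = 0
    while i in tree:
        if value == tree[i]:
            return i
        # Left child lives at 2i+1, right child at 2i+2
        i = 2 * i + 1 if value < tree[i] else 2 * i + 2
    tree[i] = value  # with Redis this could be: ZADD tree <i> <value>
    return i


def bst_find(tree: dict, value):
    """Return the position of `value`, or None if it is not in the tree."""
    i = 0
    while i in tree:
        if value == tree[i]:
            return i
        i = 2 * i + 1 if value < tree[i] else 2 * i + 2
    return None


tree = {}
for v in [8, 3, 10, 1, 6]:
    bst_insert(tree, v)
```

Positions grow exponentially with depth, so an unbalanced tree would quickly exceed the 53 bits of integer precision that a score can hold.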
37. Why stop at binary trees? BTrees! @thinkingfish from
Twitter explained that they took the BSD implementation of BTrees
and welded it into Redis (open source rulez!). This allows them to
do efficient (speed-wise, not memory) key and range lookups.
http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html
38. Index Atomicity & Consistency In a relational database
the index is (hopefully) always in sync with the data. You can
strive for that in Redis, but: Your code will be much more complex
Performance will suffer There will be bugs/edge cases/extreme
uses
39. The Opposite of Atomicity & Consistency On the other
extreme, you could consider implementing indexing with a:
Periodical process (lazy indexing) Producer/Consumer pattern (i.e.
queue) Keyspace notifications You won't have any guarantees, but
you'll be offloading the index creation from the app.
40. Indices, Lua & Clustering Server-side scripting is an
obvious consideration for implementing a lot (if not all) of the
indexing logic. But ... in a cluster setup, a script runs on a
single shard and can only access the keys there -> no guarantee
that a key and an index are on the same shard.
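To make the sharding concern concrete, the sketch below computes a Redis Cluster hash slot locally: CRC16 of the key modulo 16384, hashing only the `{...}` hash tag when the key contains a non-empty one. The `key_slot` helper is a local reimplementation for illustration, and the key names in the usage lines are hypothetical.

```python
import binascii


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot (0..16383) for `key`."""
    k = key.encode("utf-8")
    start = k.find(b"{")
    if start != -1:
        end = k.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is not empty
            k = k[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16 variant
    # (polynomial 0x1021) that Redis Cluster uses
    return binascii.crc_hqx(k, 0) % 16384


same_slot = key_slot("user:{1}") == key_slot("_lastLogin:{1}")
```

Hash tags only help when a key and its index entry can share a tag. An index keyed by the indexed value itself (e.g. `_email:<address>`) generally cannot, which is the limitation the slide points at.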
41. Don't Think Copy-Paste! For even more "inspiration" you can
review the source code of popular ORM libraries for Redis, for
example: https://github.com/josiahcarlson/rom
https://github.com/yohanboniface/redis-limpyd