COSC 6339
Big Data Analytics
NoSQL (II) –
Redis and Memcached
Edgar Gabriel
Spring 2017
Redis
• In-memory key/value store
• Support for various value types:
– lists
– sets and sorted sets
– hash tables
– append-able buffers
• Open source
• Sponsored by VMware
• Used in the real world: GitHub, Craigslist, Engine Yard, ...
• Used heavily as a front-end database
Slide based on tutorial by Dvir Volk https://www.slideshare.net/dvirsky/introduction-to-redis
Redis features
• All data is in memory
• All data is eventually persisted to disk (persistence can also be made immediate)
• Handles huge workloads easily
• Mostly O(1) behavior
• Ideal for write-heavy workloads
• Support for atomic operations
• Support for transactions
• Has publish / subscribe functionality
• Tons of client libraries for all major languages
Redis features (II)
• Master-slave replication out of the box
• Slaves can be promoted to masters on the fly
• Supports cluster mode (sharding); alternatively, data can be sharded manually on the client side
• If the database is too big for memory, Redis can handle swapping on its own (the virtual-memory feature of early Redis versions, deprecated since Redis 2.4):
– keys remain in memory and the least-used values are swapped to disk
– swapping I/O happens in separate threads
Usage examples
• Get/Sets - nothing fancy. Keys are strings, anything goes - just
quote spaces.
redis> SET foo "bar"
OK
redis> GET foo
"bar"
• You can atomically increment numbers
redis> SET bar 337
OK
redis> INCRBY bar 1000
(integer) 1337
• Getting multiple values at once
redis> MGET foo bar
1. "bar"
2. "1337"
Usage examples
• Keys are lazily expired
redis> EXPIRE foo 1
(integer) 1
redis> GET foo
(nil)
• Note: re-setting a value without re-expiring it will remove the
expiration
• GETSET puts a new value into a key and returns the old one
redis> SET foo bar
OK
redis> GETSET foo baz
"bar"
Usage examples
• SETNX sets a value only if the key does not exist yet
redis> SETNX foo bar
(integer) 1
redis> SETNX foo baz
(integer) 0
• SETNX + timestamp => named locks
redis> SETNX myLock <current_time>
(integer) 1
redis> SETNX myLock <new_time>
(integer) 0
Note: if the client holding the lock crashes, the lock is never released. Storing the lock's expiry time as the value lets other clients detect and take over a stale lock.
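The named-lock pattern, including recovery from a crashed lock holder, can be sketched in Python. This is a minimal in-process simulation, not code against a real server: `FakeStore` is a hypothetical stand-in that mimics the semantics of SETNX, GET, and GETSET on a dict, and `acquire_lock` with its `now` and `ttl` parameters is an illustrative name, not part of any Redis client library.

```python
class FakeStore:
    """Hypothetical in-memory stand-in mimicking SETNX / GET / GETSET."""

    def __init__(self):
        self._data = {}

    def setnx(self, key, value):
        # Store the value only if the key is absent; report whether we stored it.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def getset(self, key, value):
        # Store the new value and return the previous one (None if absent).
        old = self._data.get(key)
        self._data[key] = value
        return old


def acquire_lock(store, name, now, ttl):
    """Try to take the lock `name`; the stored value is the lock's expiry time."""
    expires_at = now + ttl
    if store.setnx(name, expires_at):
        return True  # nobody held the lock
    current = store.get(name)
    if current is not None and current < now:
        # The holder's expiry time is in the past: the lock is stale.
        # GETSET lets only one of several competing clients see the old value.
        previous = store.getset(name, expires_at)
        return previous == current
    return False


store = FakeStore()
first = acquire_lock(store, "myLock", now=100, ttl=10)   # lock is free
second = acquire_lock(store, "myLock", now=105, ttl=10)  # still held
third = acquire_lock(store, "myLock", now=120, ttl=10)   # holder's time expired
print(first, second, third)
```

Passing `now` explicitly keeps the sketch deterministic; a real client would read the clock instead.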
List operations
• Lists are ordinary linked lists: they support push, pop, extract range, resize, etc.
• Random access and ranges are O(N)!
redis> LPUSH foo bar
(integer) 1
redis> LPUSH foo baz
(integer) 2
redis> LRANGE foo 0 2
1. "baz"
2. "bar"
redis> LPOP foo
"baz"
Sets example
• Sets are collections of unique values
– can be intersected, diffed, and unioned server side
– useful as keys when building complex schemas
redis> SADD foo bar
(integer) 1
redis> SADD foo baz
(integer) 1
redis> SMEMBERS foo
1. "baz"
2. "bar"
redis> SADD foo2 baz // << another set
(integer) 1
redis> SADD foo2 raz
(integer) 1
redis> SINTER foo foo2 // << intersection computed server side
1. "baz"
Publish/Subscribe interface
• Clients can subscribe to channels or patterns and receive notifications when messages are sent to those channels.
• Subscribing is O(1), posting messages is O(n)
• Think chats and Comet applications: real-time analytics, Twitter
redis> subscribe feed:joe feed:moe feed:boe
//now we wait
....
1. "message"
2. "feed:joe"
3. "all your base are belong to me"
redis> publish feed:joe "all your base are belong to me" // issued from a second client
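How channel and pattern subscriptions route a published message can be sketched with a small in-process simulation. `FakePubSub` and its inbox lists are hypothetical stand-ins, not a Redis client API, and the sketch approximates pattern matching with Python's `fnmatchcase` glob rules.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class FakePubSub:
    """Hypothetical in-process stand-in for channel and pattern subscriptions."""

    def __init__(self):
        self._channels = defaultdict(list)  # channel name -> subscriber inboxes
        self._patterns = []                 # (glob pattern, subscriber inbox)

    def subscribe(self, inbox, *channels):
        for channel in channels:
            self._channels[channel].append(inbox)

    def psubscribe(self, inbox, pattern):
        self._patterns.append((pattern, inbox))

    def publish(self, channel, message):
        # Delivery work grows with the number of subscribers and patterns.
        receivers = 0
        for inbox in self._channels.get(channel, []):
            inbox.append(("message", channel, message))
            receivers += 1
        for pattern, inbox in self._patterns:
            if fnmatchcase(channel, pattern):
                inbox.append(("pmessage", pattern, channel, message))
                receivers += 1
        return receivers


joe_inbox, all_inbox = [], []
bus = FakePubSub()
bus.subscribe(joe_inbox, "feed:joe", "feed:moe", "feed:boe")
bus.psubscribe(all_inbox, "feed:*")
delivered = bus.publish("feed:joe", "all your base are belong to me")
print(delivered, joe_inbox)
```

The loop in `publish` is where the O(n) posting cost comes from, while adding a subscriber is a single append.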
Transactions
• A transaction is a sequence MULTI, ..., EXEC: the commands in between are queued and executed as one atomic block when EXEC is issued; EXEC returns the commands' replies as a list.
• Example:
redis> MULTI
OK
redis> SET "foo" "bar"
QUEUED
redis> INCRBY "num" 1
QUEUED
redis> EXEC
1) OK
2) (integer) 1
• Transactions can be discarded with DISCARD.
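The queue-then-execute behavior of MULTI, EXEC, and DISCARD can be illustrated with a toy model. `FakeTransactionClient` is a hypothetical in-process class written for this sketch; it only imitates the replies shown in the example and is not a real client library.

```python
class FakeTransactionClient:
    """Hypothetical stand-in that queues commands between MULTI and EXEC."""

    def __init__(self):
        self._data = {}
        self._queue = None  # None means "not inside a transaction"

    def multi(self):
        self._queue = []
        return "OK"

    def set(self, key, value):
        return self._run_or_queue(self._do_set, key, value)

    def incrby(self, key, amount):
        return self._run_or_queue(self._do_incrby, key, amount)

    def execute(self):
        # Run every queued command in order and return the replies as a list.
        if self._queue is None:
            raise RuntimeError("EXEC without MULTI")
        queued, self._queue = self._queue, None
        return [command(*args) for command, args in queued]

    def discard(self):
        # Drop the queued commands without running any of them.
        self._queue = None
        return "OK"

    def _run_or_queue(self, command, *args):
        if self._queue is None:
            return command(*args)
        self._queue.append((command, args))
        return "QUEUED"

    def _do_set(self, key, value):
        self._data[key] = value
        return "OK"

    def _do_incrby(self, key, amount):
        self._data[key] = int(self._data.get(key, 0)) + amount
        return self._data[key]


client = FakeTransactionClient()
client.multi()
replies_while_queued = [client.set("foo", "bar"), client.incrby("num", 1)]
results = client.execute()
print(replies_while_queued, results)
```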
Redis cluster mode
• All nodes are directly connected with a service channel
• Node to Node protocol is binary, optimized for
bandwidth and speed.
• Clients talk to nodes using the ASCII protocol, with minor
additions.
• Nodes don't proxy queries.
• Keyspace is divided into 16384 hash slots (4096 in the early design this slide is based on)
• Different nodes will hold a subset of hash slots.
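The mapping from a key to a hash slot, and from a slot to a node, can be sketched as follows. The sketch assumes the scheme of released Redis Cluster versions (CRC16 of the key modulo 16384 slots) and leaves out hash tags; the node names and slot ranges in `layout` are made up for illustration.

```python
import binascii

HASH_SLOTS = 16384  # slot count in released Redis Cluster versions


def key_hash_slot(key: str) -> int:
    """Map a key to a hash slot: CRC16 (XMODEM variant) modulo the slot count."""
    crc = binascii.crc_hqx(key.encode("utf-8"), 0)
    return crc % HASH_SLOTS


def node_for_key(key: str, slot_ranges: dict) -> str:
    """Find the node whose (first_slot, last_slot) range contains the key's slot."""
    slot = key_hash_slot(key)
    for node, (first, last) in slot_ranges.items():
        if first <= slot <= last:
            return node
    raise LookupError(f"no node serves slot {slot}")


# Hypothetical three-node layout; node names and ranges are made up.
layout = {"node-a": (0, 5460), "node-b": (5461, 10922), "node-c": (10923, 16383)}
print(key_hash_slot("foo"), node_for_key("foo", layout))
```

Because every client can compute the slot locally, no central directory has to be consulted to find the right node.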
Slide based on https://redis.io/presentation/Redis_Cluster.pdf
Redis cluster mode
• Nodes are all connected and functionally equivalent
• There are two kinds of nodes: slave and master nodes
• The cluster will continue to work as long as there is at least one node for every hash slot.
Redis cluster mode: client requests
• Dumb, single-connection clients will work with minimal modifications to the existing client code base: just try a random node from a list, then reissue the query if needed.
• Smart clients keep persistent connections to many nodes, cache the hash slot -> node mapping, and update the table when they receive a -MOVED error.
• This scheme scales horizontally
• Low latency if the clients are smart
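The difference a slot cache makes can be shown with a toy simulation. `FakeNode` and `SmartClient` are hypothetical classes written for this sketch, the reply tuples only imitate a -MOVED redirection, and the one-line `slot_of` function is a placeholder for the real key-to-slot hash.

```python
class FakeNode:
    """Hypothetical cluster node that owns a set of hash slots."""

    def __init__(self, name, slots, owner_of):
        self.name = name
        self.slots = set(slots)
        self.owner_of = owner_of  # slot -> name of the node that owns it
        self.data = {}

    def get(self, slot, key):
        if slot not in self.slots:
            # Nodes do not proxy queries; they point the client elsewhere.
            return ("MOVED", slot, self.owner_of[slot])
        return ("OK", self.data.get(key))


class SmartClient:
    def __init__(self, nodes, slot_of):
        self.nodes = nodes        # node name -> FakeNode
        self.slot_of = slot_of    # function mapping a key to its slot
        self.slot_cache = {}      # slot -> node name, learned over time
        self.redirects = 0

    def get(self, key):
        slot = self.slot_of(key)
        # Use the cached owner, or fall back to an arbitrary node.
        name = self.slot_cache.get(slot, next(iter(self.nodes)))
        reply = self.nodes[name].get(slot, key)
        if reply[0] == "MOVED":
            self.redirects += 1
            name = reply[2]
            reply = self.nodes[name].get(slot, key)
        self.slot_cache[slot] = name  # remember the owner for next time
        return reply[1]


owner_of = {0: "a", 1: "b"}
nodes = {"a": FakeNode("a", [0], owner_of), "b": FakeNode("b", [1], owner_of)}
nodes["b"].data["k"] = "v"
client = SmartClient(nodes, slot_of=lambda key: 1)  # toy slot function
first = client.get("k")   # wrong node first, then redirected
second = client.get("k")  # served directly from the cached mapping
print(first, second, client.redirects)
```

Only the first request pays for the extra round trip, which is why smart clients see low latency.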
Redis cluster mode fault-tolerance
• All nodes continuously ping other nodes...
• A node marks another node as possibly failing when that node has not answered for more than N seconds.
• Every PING and PONG packet contains a gossip section: information about other nodes' idle times, from the point of view of the sending node.
Redis cluster mode fault-tolerance
• A guesses B is failing, as the latest PING request timed out. A will not take any action without another hint.
• C sends a PONG to A, with the gossip section containing information about B: C also thinks B is failing
• At this point A marks B as failed, and notifies all other nodes in the cluster
• All other nodes will also mark B as failed
• If B returns, the first time it pings any node of the cluster it will be told to shut down
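The two-step decision (own suspicion first, confirmation through gossip second) can be sketched as a small simulation. `ClusterNode` and its methods are hypothetical names for illustration; the sketch reduces the gossip section to a set of suspected peers and does not model the notification of other nodes.

```python
class ClusterNode:
    """Hypothetical node tracking which peers it suspects or considers failed."""

    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout   # the N seconds after which a peer is suspect
        self.last_pong = {}      # peer name -> time of the last PONG received
        self.suspected = set()
        self.failed = set()

    def record_pong(self, peer, now):
        self.last_pong[peer] = now
        self.suspected.discard(peer)

    def check_timeouts(self, now):
        # A timed-out peer is only *possibly* failing; no action is taken yet.
        for peer, seen in self.last_pong.items():
            if now - seen > self.timeout:
                self.suspected.add(peer)

    def gossip(self):
        # Simplified gossip section: the peers this node currently suspects.
        return set(self.suspected)

    def receive_gossip(self, suspects):
        # A second opinion that matches our own suspicion marks the peer failed.
        for peer in suspects & self.suspected:
            self.failed.add(peer)


a = ClusterNode("A", timeout=5)
c = ClusterNode("C", timeout=5)
a.record_pong("B", now=0)
c.record_pong("B", now=0)
a.check_timeouts(now=10)
alone = set(a.failed)          # A suspects B but has no confirmation yet
c.check_timeouts(now=10)
a.receive_gossip(c.gossip())   # C's PONG carries its suspicion of B
print(alone, a.failed)
```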
Redis-trib – the redis cluster manager
• It is used to set up a new cluster once you have started N blank nodes.
• It is used to check whether the cluster is consistent, and to fix it if the cluster can't continue because there are hash slots without a single node.
• It is used to add new nodes to the cluster, either as slaves of an already existing master node, or as blank nodes to which a few hash slots can be re-sharded to lower other nodes' load.
memcached
• High-performance, distributed memory object caching
system
• Key-based cache daemon that stores data and objects
in main memory for very quick access
• Based on a distributed hash table
• Doesn’t provide redundancy, failover or authentication
• In use by Facebook, Twitter, …
• Open Source
Use cases
• Anything that is more expensive to fetch from elsewhere and has a sufficient hit rate can be placed in memcached
– How often will the object or data be used?
– How expensive is it to generate the data?
– What is the expected hit rate?
– Will the application invalidate the data itself, or will a TTL be used?
– How much development work has to be done to embed it?
Limitations
• Memcached data is held in RAM => finite resource
• If the system can respond within the requirements without it, leave it alone
• Keys can be no more than 250 bytes
• A stored item cannot exceed 1 MB (largest typical slab size)
• There are generally no limits to the number of nodes running memcached
• There are generally no limits to the amount of RAM used by memcached over all nodes
Starting memcached
• Memcached can run as a non-root user as long as it does not listen on a restricted port (<1024); the user must not have a memory limit restriction
• shell> memcached
• Default configuration: 64 MB memory, all network interfaces, max. 1024 simultaneous connections
• Changing default configuration options:
-u <user> : run as <user> if started as root
-m <num> : maximum <num> MB memory to use for items
-d : run as a daemon
-c <num> : max simultaneous connections
-v : verbose
-t <threads> : number of threads to use to process incoming requests
Memcached commands
• Storage - ask server to store data identified by a key
– set, add, replace, append, prepend and cas
• Retrieval - ask server to retrieve data corresponding to
a set of keys
– get, gets
• Other operations that don’t involve unstructured data
– Deletion: delete
– Increment/Decrement: incr, decr
– Statistics: stats
– flush_all: always succeeds; invalidates all existing items immediately (by default) or after the specified expiration time.
Example
Since memcached is a hash table, things are stored and retrieved using keys and values.
As the sole user of your own data stored on a memcached daemon, you might store things with keys such as:
{'userID:1','userID:2','userID:3', ... }
>>> import memcache
>>> mc = memcache.Client(['137.140.8.101:11211'])
>>> mc.set('user:19', '{name: "Lancelot", quest: "Grail"}') # set(key, picklable value)
True
>>> mc.get('user:19')
'{name: "Lancelot", quest: "Grail"}'
>>> mc.get('user:20') # unknown key: returns None, so nothing is printed
>>>
memcached keys
A key can be a string or a tuple of (hash_value, string).
More extensive spec of set():
set(key, value, time=0, min_compress_len=0, namespace=None)
key: key
value: value
time: optional expiration time, either a relative number of seconds from the current time (up to 1 month) or an absolute Unix epoch time
min_compress_len: ignored in Python
namespace: key “namespace”
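The two readings of the time argument can be made concrete with a short sketch. It assumes the memcached server convention that values up to 30 days (in seconds) are relative and larger values are absolute Unix timestamps; `absolute_expiry` is an illustrative helper, not a function of the memcache module.

```python
THIRTY_DAYS = 60 * 60 * 24 * 30  # cutoff between relative and absolute values


def absolute_expiry(time_arg, now):
    """Return the Unix time at which an item expires, or None for "never"."""
    if time_arg == 0:
        return None             # 0 means the item has no expiration
    if time_arg <= THIRTY_DAYS:
        return now + time_arg   # small values are seconds from now
    return time_arg             # large values are already absolute


now = 1_500_000_000  # fixed "current time" so the example is deterministic
never = absolute_expiry(0, now)
relative = absolute_expiry(60, now)
absolute = absolute_expiry(1_600_000_000, now)
print(never, relative, absolute)
```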
Basic Idea
A typical use case for memcached is to check whether you have previously determined and cached the output of a potentially expensive operation.
If so, recover the previously determined output.
If not, do the calculation and save the output in memcached.
#!/usr/bin/env python
import memcache, random, time, timeit

mc = memcache.Client(['127.0.0.1:11211'])

def compute_square(n):
    # look in the cache first; get() returns None on a miss
    value = mc.get('sq:%d' % n)
    if value is None:
        # cache miss: compute the result and store it for later calls
        value = n * n
        mc.set('sq:%d' % n, value)
    return value
What Data can you Cache?
• Low-level results like answers to database queries, keyed on the query.
• Higher-level constructs like entire web pages including dynamic data.
• Data might need to be purged and re-cached.
Keys
– have to be unique
– typically include user identifier + item distinguisher
– can't exceed 250 bytes
– can be shortened by applying a hash function to the desired key
Values can be up to 1 MB.
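Shortening an over-long key with a hash function might look like the following sketch. The `prefix:user:item` layout, the `safe_key` helper, and the choice of SHA-256 are assumptions made for illustration; the sketch also hashes keys containing whitespace or control characters, which the memcached protocol does not allow in keys.

```python
import hashlib

MAX_KEY_BYTES = 250


def safe_key(user_id, item, prefix="app"):
    """Build a unique key; fall back to a digest when it would be invalid."""
    raw = f"{prefix}:{user_id}:{item}"
    encoded = raw.encode("utf-8")
    # memcached keys must not contain whitespace or control characters
    has_bad_chars = any(byte <= 32 or byte == 127 for byte in encoded)
    if len(encoded) <= MAX_KEY_BYTES and not has_bad_chars:
        return raw
    digest = hashlib.sha256(encoded).hexdigest()  # 64 hex characters
    return f"{prefix}:{digest}"


short_key = safe_key(19, "profile")
long_key = safe_key(19, "q=" + "x" * 500)
print(short_key, len(long_key))
```

Hashing trades readability for a fixed length, so readable keys are kept whenever they already fit.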
Memcached is a cache…
How to replace an old value:
– Set an expiration time in the set() function.
– Invalidate items already cached.
– Replace old values with new values for the same key.
Values need to be serialized; pickling them works nicely. Values that are already strings can be cached as-is.
Memcached sharding
• Memcached does not have a built in notion of sharding
• One can execute multiple memcached servers on a
cluster, sharding implemented on the client side
• Memcached servers
– share no hash table with one another (each server only indexes its own items)
– are dumb and don’t know about each other
– use a slab allocator
• Servers are not coordinated => no protocol overhead or
consistency problems
• Least recently accessed items are cycled out
• One LRU per ‘slab class’
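Least-recently-used eviction can be sketched with an ordered dictionary. This `LRUCache` class is an illustrative model of a single LRU list with a fixed item count; memcached itself keeps one such list per slab class and bounds memory rather than item count.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache holding at most `capacity` items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" is now more recent than "b"
cache.set("c", 3)   # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```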
Memcached clients
• Client hashes key to server list
– Patterns in the characters that make up key values can
lead to uneven distribution of memcached entries across
various servers.
– Clients have their own sharding algorithm, so you don't really need to worry about this. All clients implement the same stable algorithm to turn a key into an integer, n, that selects one of the memcached servers.
• Serializes objects
• Provides authentication
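Client-side sharding can be sketched in a few lines. The sketch assumes a simple CRC32-modulo placement, which is one possible stable algorithm and not necessarily the one a given client library uses; the server addresses and the `pick_server` helper are made up for illustration.

```python
import zlib


def pick_server(key, servers):
    """Map a key to one server by hashing it to an integer n."""
    n = zlib.crc32(key.encode("utf-8"))
    return servers[n % len(servers)]


servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]  # made-up addresses
placement = {key: pick_server(key, servers) for key in ("user:19", "user:20", "sq:7")}
print(placement)
```

Every client that uses the same function and the same server list finds a key on the same server without any coordination. A known drawback of modulo placement is that changing the server list remaps most keys.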