
COSC 6339

Big Data Analytics

NoSQL (II) –

Redis and Memcached

Edgar Gabriel

Spring 2017

Redis

• In-memory key/value store

• Support for various types:

– Lists

– Sets and sorted sets

– hash tables

– append-able buffers

• Open source, sponsored by VMware

• Used in the real world: github, craigslist, engineyard, ...

• Used heavily as a front-end database

Slide based on tutorial by Dvir Volk https://www.slideshare.net/dvirsky/introduction-to-redis


Redis features

• All data is in memory

• All data is eventually persisted to disk (persistence can also be made immediate)

• Handles huge workloads easily

• Mostly O(1) behavior

• Ideal for write-heavy workloads

• Support for atomic operations

• Support for transactions

• Has publish / subscribe functionality

• Tons of client libraries for all major languages


Redis features (II)

• Master-slave replication out of the box

• Slaves can be made masters on the fly

• Supports cluster mode (sharding), or you can shard manually on the client side

• If your database is too big, Redis can handle swapping on its own

• Keys remain in memory and the least-used values are swapped to disk

• Swapping I/O happens in separate threads


Usage examples

• Get/Set - nothing fancy. Keys are strings; anything goes - just quote spaces.

redis> SET foo "bar"

OK

redis> GET foo

"bar"

• You can atomically increment numbers

redis> SET bar 337

OK

redis> INCRBY bar 1000

(integer) 1337

• Getting multiple values at once

redis> MGET foo bar

1. "bar"

2. "1337"

Usage examples

• Keys are lazily expired

redis> EXPIRE foo 1

(integer) 1

redis> GET foo

(nil)

• Note: re-setting a value without re-specifying an expiration will remove the expiration

• GETSET puts a new value into a key and returns the old one

redis> SET foo bar

OK

redis> GETSET foo baz

"bar"


Usage examples

• SETNX sets a value only if it does not exist

redis> SETNX foo bar

(integer) 1

redis> SETNX foo baz

(integer) 0

• SETNX + Timestamp => Named Locks! w00t!

redis> SETNX myLock <current_time>

(integer) 1

redis> SETNX myLock <new_time>

(integer) 0

Note that if the locking client crashes it leaves a stale lock behind, but this can be solved easily (e.g., by checking the stored timestamp).
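That timestamp-based recovery can be sketched as follows. This is a minimal illustration against an in-memory stand-in for the Redis commands used on this slide; the `FakeRedis` class and `acquire_lock` helper are my own names, not part of Redis or any client library:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the SETNX/GET/GETSET commands."""
    def __init__(self):
        self.store = {}
    def setnx(self, key, value):
        if key in self.store:
            return False
        self.store[key] = value
        return True
    def get(self, key):
        return self.store.get(key)
    def getset(self, key, value):
        old = self.store.get(key)
        self.store[key] = value
        return old

def acquire_lock(r, name, timeout=10.0):
    """SETNX + timestamp lock: take over the lock if its holder's
    timestamp is older than `timeout` seconds (handles crashed holders)."""
    now = time.time()
    if r.setnx(name, now):
        return True
    held_since = float(r.get(name))
    if now - held_since > timeout:
        # Stale lock: GETSET atomically swaps in our timestamp; we only
        # win if the value we got back is still the stale one.
        prev = float(r.getset(name, now))
        return prev == held_since
    return False

r = FakeRedis()
assert acquire_lock(r, "myLock")        # first caller wins
assert not acquire_lock(r, "myLock")    # second caller fails
r.store["myLock"] = time.time() - 100   # simulate a crashed holder
assert acquire_lock(r, "myLock")        # stale lock is taken over
```

The GETSET step is what makes the takeover safe when two clients notice the stale lock at the same time: only one of them gets the stale timestamp back.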


List operations

• Lists are ordinary linked lists: support push, pop, extracting ranges, trimming, etc.

• Random access and ranges are O(N)!

redis> LPUSH foo bar

(integer) 1

redis> LPUSH foo baz

(integer) 2

redis> LRANGE foo 0 2

1. "baz"

2. "bar"

redis> LPOP foo

"baz"


Sets example

• Sets are unique values

– can be intersected / diffed / unioned server side.

– useful as keys when building complex schema.

redis> SADD foo bar

(integer) 1

redis> SADD foo baz

(integer) 1

redis> SMEMBERS foo

["baz", "bar"]

redis> SADD foo2 baz // << another set

(integer) 1

redis> SADD foo2 raz

(integer) 1


Publish/Subscribe interface

• Clients can subscribe to channels or patterns and receive notifications when messages are sent to channels.

• Subscribing is O(1), posting messages is O(n)

• Think chats, Comet applications, real-time analytics, Twitter

redis> subscribe feed:joe feed:moe feed:boe

// now we wait ...

// meanwhile, in another client:
redis> publish feed:joe "all your base are belong to me"

// back in the subscriber:
1. "message"

2. "feed:joe"

3. "all your base are belong to me"
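The mechanics behind this example, and the stated complexities, can be mimicked with a small in-memory sketch. The `MiniPubSub` class and inbox lists are illustrative stand-ins, not the Redis implementation:

```python
from collections import defaultdict

class MiniPubSub:
    """Toy model of publish/subscribe: subscribing is O(1) per channel
    (one list append); publishing is O(n) in the number of subscribers."""
    def __init__(self):
        self.channels = defaultdict(list)   # channel -> subscriber inboxes

    def subscribe(self, inbox, *channels):
        for ch in channels:
            self.channels[ch].append(inbox)

    def publish(self, channel, message):
        # Deliver to every subscriber of the channel.
        for inbox in self.channels[channel]:
            inbox.append(("message", channel, message))
        return len(self.channels[channel])  # like Redis: receiver count

ps = MiniPubSub()
joe_inbox = []
ps.subscribe(joe_inbox, "feed:joe", "feed:moe")
n = ps.publish("feed:joe", "all your base are belong to me")
assert n == 1
assert joe_inbox[0] == ("message", "feed:joe", "all your base are belong to me")
```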


Transactions

• Sequence of MULTI, ..., EXEC instructions: all commands are queued and executed atomically at EXEC, which blocks and returns the values of the commands as a list.

• Example:

redis> MULTI

OK

redis> SET "foo" "bar"

QUEUED

redis> INCRBY "num" 1

QUEUED

redis> EXEC

1) OK

2) (integer) 1

• Transactions can be discarded with DISCARD.
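The queue-then-execute semantics above can be illustrated with a small in-memory sketch. The `MiniStore` class is my own illustration; a real client such as redis-py exposes this behavior through its pipeline/transaction API:

```python
class MiniStore:
    """Toy key/value store with MULTI/EXEC-style command queueing."""
    def __init__(self):
        self.data = {}
        self.queue = None

    def multi(self):
        self.queue = []          # start queueing instead of executing

    def set(self, key, value):
        if self.queue is not None:
            self.queue.append(('set', key, value))
            return 'QUEUED'
        self.data[key] = value
        return 'OK'

    def incrby(self, key, n):
        if self.queue is not None:
            self.queue.append(('incrby', key, n))
            return 'QUEUED'
        self.data[key] = int(self.data.get(key, 0)) + n
        return self.data[key]

    def discard(self):
        self.queue = None        # throw the queued commands away

    def exec(self):
        cmds, self.queue = self.queue, None   # leave queueing mode
        results = []
        for cmd, *args in cmds:
            results.append(getattr(self, cmd)(*args))
        return results

s = MiniStore()
s.multi()
assert s.set('foo', 'bar') == 'QUEUED'
assert s.incrby('num', 1) == 'QUEUED'
assert s.exec() == ['OK', 1]   # all commands ran at EXEC, results as a list
```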


Redis cluster mode

• All nodes are directly connected with a service channel

• Node to Node protocol is binary, optimized for

bandwidth and speed.

• Clients talk to nodes using an ASCII protocol, with minor additions.

• Nodes don't proxy queries.

• Keyspace is divided into hash slots (4096 in this early design; released Redis Cluster uses 16384)

• Different nodes will hold a subset of hash slots.
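The key-to-slot mapping can be sketched in a few lines. Per the Redis Cluster specification (an addition to these slides), the slot is CRC16 of the key, XModem variant, modulo 16384; the sketch below ignores hash tags (`{...}`), which real clusters honor:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem variant): poly 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: bytes, slots: int = 16384) -> int:
    """Map a key to one of the cluster's hash slots."""
    return crc16(key) % slots

assert crc16(b"123456789") == 0x31C3   # standard XModem check value
assert 0 <= hash_slot(b"foo") < 16384
```

A node then serves whichever subset of slot numbers it has been assigned, and a smart client keeps its own slot-to-node table.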

Slide based on https://redis.io/presentation/Redis_Cluster.pdf


Redis cluster mode

• Nodes are all connected and functionally equivalent

• There are two kinds of nodes: master and slave nodes

• The cluster will continue to work as long as there is at least one node serving every hash slot.


Redis cluster mode: client requests

• Dummy, single-connection clients will work with minimal modifications to an existing client code base: just try a random node from a list, then reissue the query if needed.

• Smart clients keep persistent connections to many nodes, cache the hashslot -> node mapping, and update the table when they receive a -MOVED error.

• Schema scales horizontally

• Low latency if the clients are smart


Redis cluster mode fault-tolerance

• All nodes continuously ping other nodes...

• A node marks another node as possibly failing when a PING goes unanswered for longer than N seconds.

• Every PING and PONG packet contains a gossip section: information about other nodes' idle times, from the point of view of the sending node.


Redis cluster mode fault-tolerance

• A guesses B is failing, as the latest PING request timed out. A will not take any action without any other hint.

• C sends a PONG to A, with the gossip section containing information about B: C also thinks B is failing.

• At this point A marks B as failed, and notifies all other nodes in the cluster.

• All other nodes will also mark B as failing.

• If B comes back, the first time it pings any node of the cluster it will be notified to shut down.
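The promotion from "possibly failing" to "failed" described above can be sketched as follows. The `Node` class and its method names are illustrative only; real Redis Cluster additionally requires agreement from a majority of masters before declaring a node failed:

```python
class Node:
    """Toy model of the two-step failure-marking logic."""
    def __init__(self, name):
        self.name = name
        self.possibly_failing = set()   # PING timed out (our own guess)
        self.failed = set()             # confirmed by gossip from others

    def ping_timeout(self, other):
        # A PING to `other` timed out: mark it as possibly failing,
        # but take no further action on this single hint.
        self.possibly_failing.add(other)

    def receive_gossip(self, reporter_suspects):
        # Gossip section of a received PONG: the reporter's own suspects.
        # A node we suspect that the reporter also suspects is promoted
        # to failed.
        for other in self.possibly_failing & reporter_suspects:
            self.failed.add(other)

a = Node("A")
a.ping_timeout("B")            # A guesses B is failing
assert "B" not in a.failed     # no action on a single hint
a.receive_gossip({"B"})        # C's gossip: C also thinks B is failing
assert "B" in a.failed         # now A marks B as failed
```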


Redis-trib – the redis cluster manager

• It is used to set up a new cluster, once you start N blank nodes.

• It is used to check whether the cluster is consistent, and to fix it when the cluster can't continue because there are hash slots without a single node.

• It is used to add new nodes to the cluster, either as slaves of an already existing master node, or as blank nodes to which a few hash slots can be re-sharded to lower other nodes' load.


memcached

• High-performance, distributed memory object caching

system

• Key-based cache daemon that stores data and objects

in main memory for very quick access

• Based on a distributed hash table

• Doesn’t provide redundancy, failover or authentication

• In use by Facebook, Twitter, …

• Open Source


Use cases

• Anything that is more expensive to fetch from elsewhere and has a sufficient hit rate can be placed in memcached

– How often will object or data be used?

– How expensive is it to generate the data?

– What is the expected hitrate?

– Will the application invalidate the data itself, or will TTL

be used?

– How much development work has to be done to embed it?

Limitations

• Memcache is held in RAM => finite resource

• If the system can respond within the requirements

without it - leave it alone

• Keys can be no more than 250 characters

• Stored data cannot exceed 1 MB (the largest typical slab size)

• There are generally no limits to the number of nodes

running memcached

• There are generally no limits to the amount of RAM

used by memcached over all nodes


Starting memcached

• Memcached can be run as a non-root user as long as it does not use a restricted port (<1024), though such a user cannot have a memory-limit restriction

• shell> memcached

• Default configuration - Memory: 64MB, all network interfaces, max simultaneous connections: 1024

• Changing default configuration options:

-u <user> : run as user if started as root

-m <num> : maximum <num> MB memory to use for items

-d : Run as a daemon

-c <num> : max simultaneous connections

-v : verbose

-t <threads> : number of threads to use to process incoming requests

Memcached commands

• Storage - ask server to store data identified by a key

– set, add, replace, append, prepend and cas

• Retrieval - ask server to retrieve data corresponding to

a set of keys

– get, gets

• Other operations that don’t involve unstructured data

– Deletion: delete

– Increment/Decrement: incr, decr

– Statistics: stats

– flush_all: always succeeds; invalidates all existing items immediately (by default) or after the specified expiration


Example

Since memcached is a hash table, we need to store and retrieve things using keys and values.

As the sole user of your own data stored on a memcached daemon, you might store things with keys like:

>>> import memcache

>>> mc = memcache.Client(['137.140.8.101:11211'])

>>> mc.set('user:19','{name: "Lancelot",quest: "Grail"}') # set(key, picklable value)

True

>>> mc.get('user:19')

'{name: "Lancelot", quest: "Grail"}'

>>> mc.get('user:20')

>>>

memcached keys

{'userID:1','userID:2','userID:3', ... }

Key can be a string or a tuple of (hash_value, string).

More extensive spec of set():

set(key, value, time=0, min_compress_len=0, namespace=None)

key: the key

value: the value

time: optional expiration time, either a relative number of seconds from the current time (up to 1 month) or an absolute Unix epoch time

min_compress_len: ignored in Python

namespace: key “namespace”


Basic Idea

The typical use case for memcached is to check whether you have previously computed and cached the output of a potentially expensive operation.

If so, recover the previously computed output.

If not, do the calculation and save the output in memcached:

#!/usr/bin/env python
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def compute_square(n):
    # Check the cache first ...
    value = mc.get('sq:%d' % n)
    if value is None:
        # ... and on a miss, compute the value and cache it.
        value = n * n
        mc.set('sq:%d' % n, value)
    return value

What Data can you Cache?

• Low-level results, like answers to database queries, keyed on the query.

• Higher-level constructs, like entire web pages including dynamic data.

• Data might need to be purged and re-cached.

keys

– have to be unique

– typically include a user identifier + an item distinguisher

– can't exceed 250 bytes

– can use a hash function on your desired key to produce a shorter one

values can be up to 1 MB
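Shortening an over-long key with a hash, as suggested above, might look like this. The 250-byte limit comes from the memcached protocol; the `safe_key` helper name is my own illustration:

```python
import hashlib

MAX_KEY_LEN = 250  # memcached's protocol limit on key length

def safe_key(key: str) -> str:
    """Return the key unchanged if it fits, else a fixed-length
    SHA-1 digest of it (40 hex characters)."""
    if len(key.encode('utf-8')) <= MAX_KEY_LEN:
        return key
    return hashlib.sha1(key.encode('utf-8')).hexdigest()

assert safe_key('user:19') == 'user:19'             # short keys pass through
long_key = safe_key('querycache:' + 'x' * 1000)
assert len(long_key) == 40 and len(long_key) <= MAX_KEY_LEN
```

The trade-off is that hashed keys are no longer human-readable, so debugging with tools like `stats` becomes harder.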


Memcached is a cache…

So how do you replace an old value?

– Set an expiration time in the set() function.

– Invalidate items already cached.

– Replace old values with new values for the same key.

– Values need to be serialized, so pickling them does quite nicely. Values that are already strings can probably be cached as-is.

Memcached sharding

• Memcached does not have a built-in notion of sharding

• One can execute multiple memcached servers on a cluster; sharding is implemented on the client side

• Memcached servers

– have no sharding hash table (the key-to-server mapping lives in the client)

– are dumb and don’t know about each other

– use a slab allocator

• Servers are not coordinated => no protocol overhead or consistency problems

• Least recently accessed items are cycled out

• One LRU per ‘slab class’


Memcached clients

• Client hashes key to server list

– Patterns in the characters that make up key values can lead to an uneven distribution of memcached entries across the various servers.

– Clients have their own sharding algorithm, so you don't really need to worry about this: all clients implement the same stable algorithm to turn a key into an integer, n, that selects one of the memcached servers.

• Serializes objects

• Provides authentication
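The key-to-server selection described above can be sketched as a stable hash modulo the server list. The server addresses and the `pick_server` helper are illustrative assumptions; real clients often use consistent hashing instead, so that adding a server moves fewer keys:

```python
import hashlib

SERVERS = ['10.0.0.1:11211', '10.0.0.2:11211', '10.0.0.3:11211']  # example list

def pick_server(key: str, servers=SERVERS) -> str:
    """Hash the key to a stable integer n, then select servers[n % len]."""
    n = int.from_bytes(hashlib.md5(key.encode('utf-8')).digest()[:4], 'big')
    return servers[n % len(servers)]

# The same key always maps to the same server...
assert pick_server('user:19') == pick_server('user:19')
# ...and the result is always one of the configured servers.
assert pick_server('user:19') in SERVERS
```

Because every client runs the same algorithm over the same server list, they all agree on where each key lives without the servers ever talking to each other.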