Scaling the Web: Databases & NoSQL

Post on 18-May-2015

This is an introduction to relational and non-relational databases and how their performance affects scaling a web application. This is a recording of a guest lecture I gave at the University of Texas School of Information. In this talk, I address the technologies and tools Gowalla (gowalla.com) uses, including Memcache, Redis, and Cassandra. Find more on my blog: http://schneems.com

Transcript of Scaling the Web: Databases & NoSQL

Scaling the Web: Databases & NoSQL

Richard Schneeman (@schneems) works for @Gowalla

Wed Nov 10 2011

whoami

• @Schneems

• BSME with Honors from Georgia Tech

• 5+ years experience with Ruby & Rails

• Work for @Gowalla

• Rails 3.1 contributor : )

• 3+ years technical teaching

Traffic

Compounding Traffic

ex. Wikipedia

Gowalla

Gowalla

• 50 best websites NYTimes 2010

• Founded 2009 @ SXSW

• 1 million+ Users

• Undisclosed Visitors

• Loves/highlights/comments/stories/guides

• Facebook/Foursquare/Twitter integration

• iPhone/Android/web apps

• Public API

Gowalla Backend

• Ruby on Rails

• Uses the Ruby Language

• Rails is the Framework

The Web is Data

• Username => String

• Birthday => Int/ Int/ Int

• Blog Post => Text

• Image => Binary-file/blob

Data needs to be stored to be useful

Database

Gowalla Database

• PostgreSQL

• Relational (RDBMS)

• Open Source

• Competitor to MySQL

• ACID compliant

• Running on a Dedicated Managed Server

Need for Speed

• Throughput:

• The number of operations per minute that can be performed

• Pure Speed:

• How long an individual operation takes.

Potential Problems

• Hardware

• Slow Network

• Slow hard drive

• Insufficient CPU

• Insufficient RAM

• Software

• too many Reads

• too many Writes

Scaling Up versus Out

• Scale Up:

• More CPU, Bigger HD, More RAM, etc.

• Scale Out:

• More machines

• More machines

• More machines

• ...

Scale Up

• Bigger, faster machine

• More RAM

• More CPU

• Bigger ethernet bus

• ...

• Moore's Law

• Diminishing returns

Scale Out

• Forget Moore's Law...

• Add more nodes

• Master/Slave Database

• Sharding

Master/Slave (diagram): writes go to the Master DB, which copies them to four Slave DBs; reads come from the slaves.

Master & Slave +/-

• Pro

• Increased read speed

• Takes read load off of master

• Allows us to Join across all tables

• Con

• Doesn’t buy increased write throughput

• Single Point of Failure in Master Node
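The read/write split above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `ReplicatedStore` is a hypothetical class, plain Hashes stand in for the master and slave databases, and the copy happens instantly, whereas real replication usually lags behind the master.

```ruby
# Hypothetical in-memory stand-in for one master database and its read slaves.
class ReplicatedStore
  def initialize(slave_count)
    @master = {}
    @slaves = Array.new(slave_count) { {} }
    @next_slave = 0
  end

  # Every write goes to the single master and is then copied to each slave.
  # (Assumption: the copy is immediate; real replication is usually delayed.)
  def write(key, value)
    @master[key] = value
    @slaves.each { |slave| slave[key] = value }
    value
  end

  # Reads rotate across the slaves, so the master serves none of them.
  def read(key)
    slave = @slaves[@next_slave]
    @next_slave = (@next_slave + 1) % @slaves.size
    slave[key]
  end
end

store = ReplicatedStore.new(4)
store.write("user:3", "schneems")
puts store.read("user:3")
```

Adding slaves spreads reads over more machines, but every write still passes through the one master, which is why write throughput does not improve.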

Sharding (diagram): reads and writes go to separate shards, e.g. Users in USA, Users in Europe, Users in Asia, Users in Africa.

Sharding +/-

• Pro

• Increased Write & Read throughput

• No Single Point of failure

• Individual features can fail

• Con

• Cannot Join queries between shards
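A minimal sketch of sharding by region, assuming each shard can be modeled as a plain Hash. `ShardedUsers` and its region names are hypothetical and only illustrate the routing idea.

```ruby
# Hypothetical shard map: each region's users live in their own store.
class ShardedUsers
  REGIONS = %w[usa europe asia africa].freeze

  def initialize
    @shards = REGIONS.to_h { |region| [region, {}] }
  end

  # Reads and writes touch exactly one shard, chosen by region.
  def write(region, id, attributes)
    shard_for(region)[id] = attributes
  end

  def read(region, id)
    shard_for(region)[id]
  end

  # Anything spanning shards must visit each one and merge in application
  # code, because no single database can join across them.
  def count_all
    @shards.values.sum(&:size)
  end

  private

  def shard_for(region)
    @shards.fetch(region) { raise ArgumentError, "unknown region: #{region}" }
  end
end

users = ShardedUsers.new
users.write("usa", 3, { username: "schneems" })
puts users.read("usa", 3)[:username]
```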

What is a Database?

• Relational Database Management System (RDBMS)

• Stores Data Using Schema

• A.C.I.D. compliant

• Atomic

• Consistent

• Isolated

• Durable

RDBMS

• Relational

• Matches data on common characteristics in data

• Enables “Join” & “Union” queries

• Makes data modular

Relational +/-

• Pros

• Data is modular

• Highly flexible data layout

• Cons

• Getting desired data can be tricky

• Over-modularization leads to many join queries

• Trades performance for searchability

Schema Storage

• Blueprint for data storage

• Break data into tables/columns/rows

• Give data types to your data

• Integer

• String

• Text

• Boolean

• ...

Schema +/-

• Pros

• Regularize our data

• Helps keep data consistent

• Converts to programming “types” easily

• Cons

• Must separately manage schema

• Adding columns & indexes to existing large tables can be painful & slow

ACID

• Properties that guarantee database transactions are processed reliably

• Atomic

• Consistent

• Isolated

• Durable

ACID

• Atomic

• Any database transaction is all or nothing.

• If one part of the transaction fails it all fails

“An Incomplete Transaction Cannot Exist”
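The all-or-nothing rule can be sketched with an in-memory example. `Accounts` and its `transaction` method are hypothetical names for illustration; a real database uses far more sophisticated machinery, but the observable behavior is the same: a failed transaction leaves no trace.

```ruby
# Hypothetical in-memory "database" of account balances.
class Accounts
  def initialize(balances)
    @balances = balances.dup
  end

  def balance(name)
    @balances.fetch(name)
  end

  # Runs the block against a working copy and keeps the copy only if
  # nothing raised. Returns true on commit, false on rollback.
  def transaction
    working_copy = @balances.dup
    yield working_copy
    @balances = working_copy
    true
  rescue StandardError
    false # the original balances were never touched
  end
end

accounts = Accounts.new("alice" => 100, "bob" => 0)

committed = accounts.transaction do |balances|
  balances["alice"] -= 150
  raise "insufficient funds" if balances["alice"].negative?
  balances["bob"] += 150
end

puts committed                 # false
puts accounts.balance("alice") # 100
puts accounts.balance("bob")   # 0
```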

ACID

• Consistent

• Any transaction will take the database from one consistent state to another

“Only Consistent data is allowed to be written”

ACID

• Isolated

• No transaction should be able to interfere with another transaction

“the same field cannot be updated by two sources at the exact same time”

a = 0; two sources update at the same time: a += 1 and a += 2; a = ??
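A small sketch of why the answer is uncertain without isolation. The first method replays one possible interleaving by hand, so the lost update is deterministic; the second serializes the two updates with a `Mutex`. Both method names are hypothetical, and a Ruby lock is only an analogy for what a database does internally.

```ruby
# Simulates two writers that both read a before either one writes back.
def lost_update
  a = 0
  seen_by_first  = a      # first writer reads 0
  seen_by_second = a      # second writer also reads 0
  a = seen_by_first + 1   # first writer stores 1
  a = seen_by_second + 2  # second writer stores 2, silently discarding the +1
  a
end

# Serializes each read-modify-write with a Mutex so neither update is lost.
def isolated_update
  a = 0
  lock = Mutex.new
  threads = [1, 2].map do |increment|
    Thread.new { lock.synchronize { a += increment } }
  end
  threads.each(&:join)
  a
end

puts lost_update     # 2
puts isolated_update # 3
```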

ACID

• Durable

• Once a transaction is committed it will stay that way

“Save it once, read it forever”

What is a Database?

• RDBMS

• Relational

• Flexible

• Has a schema

• Most likely ACID compliant

• Typically fast under low load or when optimized

What is SQL?

• Structured Query Language

• The language databases speak

• Based on relational algebra

• Insert

• Query

• Update

• Delete

“SELECT Company, Country FROM Customers WHERE Country = 'USA' ”

Why people <3 SQL

• Relational algebra is powerful

• SQL is proven

• well understood

• well documented

Why people </3 SQL

• Relational algebra is hard

• Different databases support different SQL syntax

• Yet another programming language to learn

SQL != Database

• SQL is used to talk to an RDBMS (database)

• SQL is not an RDBMS

What is NoSQL?

Not A Relational Database (RDBMS)

Types of NoSQL

• Distributed Systems

• Document Store

• Graph Database

• Key-Value Store

• Eventually Consistent Systems

Mix And Match ↑

Key Value Stores

• Non-Relational

• Typically No Schema

• Map one Key (a string) to a Value (some object)

Example: Redis

Key Value Example

redis = Redis.new
redis.set("foo", "bar")
redis.get("foo")
>> "bar"

Key Value (diagram): each Key maps to one Value

Key Value

• Like a database that can only ever use the primary key (id)

YES: select * from users where id = '3';

NO: select * from users where name = 'schneems';
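A sketch of what that restriction means in practice, with a Ruby Hash standing in for the key-value store. `UserStore` and its key naming scheme are assumptions for illustration: to look a user up by name, the application has to maintain a second key by hand, because the store itself has no index on values.

```ruby
# Hypothetical wrapper around a key-value store (here, a plain Hash).
class UserStore
  def initialize
    @data = {}
  end

  def save(id, name)
    @data["user:#{id}"] = { id: id, name: name }
    @data["username:#{name}"] = id # hand-maintained secondary key
  end

  # YES: lookup by primary key is a single fetch.
  def find(id)
    @data["user:#{id}"]
  end

  # Lookup by name only works because save wrote the extra key above.
  def find_by_name(name)
    id = @data["username:#{name}"]
    id && find(id)
  end
end

store = UserStore.new
store.save(3, "schneems")
puts store.find(3)[:name]
puts store.find_by_name("schneems")[:id]
```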

NoSQL @ Gowalla

• Redis (key-value store)

• Store “Likes” & Analytics

• Memcache (key-value store)

• Cache Database results

• Cassandra

• (eventually consistent, with-schema, key value store)

• Store “feeds” or “timelines”

• Solr (search index)

Memcache

• Key-Value Store

• Open Source

• Distributed

• In memory (ram) only

• fast, but volatile

• Not ACID

• Memory object caching system

Memcache Example

memcache = Memcache.new
memcache.set("foo", "bar")
memcache.get("foo")
>> "bar"

Memcache

• Can store whole objects

memcache = Memcache.new
user = User.where(:username => "schneems").first
memcache.set("user:3", user)

user_from_cache = memcache.get("user:3")
user_from_cache == user
>> true
user_from_cache.username
>> "schneems"
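Storing a whole object works because the client serializes it on the way in and rebuilds it on the way out. The sketch below assumes Ruby's built-in `Marshal` as the serializer, which Ruby memcache clients commonly use, though the exact mechanism depends on the client. `CachedUser` is a hypothetical record type and a Hash stands in for the memcache server.

```ruby
# Hypothetical user record; a named Struct so that Marshal can serialize it.
CachedUser = Struct.new(:id, :username)

# Plain Hash standing in for a memcache server, which only holds bytes.
cache = {}

user = CachedUser.new(3, "schneems")
cache["user:3"] = Marshal.dump(user)             # object -> bytes on set

user_from_cache = Marshal.load(cache["user:3"])  # bytes -> new object on get

puts user_from_cache == user       # true: same field values
puts user_from_cache.equal?(user)  # false: it is a copy, not the same object
```

The cached value is a snapshot: later changes to the original object do not reach the cache until the key is set again.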

Memcache @ Gowalla

• Cache Common Queries

• Decreases Load on DB (Postgres)

• Enables higher throughput from DB

• Faster response than DB

• Users see quicker page load time

What to Cache?

• Objects that change infrequently

• users

• spots (places)

• etc.

• Expensive(ish) SQL queries

• Friend ids for users

• User ids for people visiting spots

• etc.

Memcache Distributed (diagram): nodes A, B, and C; easily add more nodes, such as D.
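One common way a memcache client spreads keys over several nodes is to hash the key and pick a node from the result. The sketch below assumes simple modulo hashing with CRC32; `ShardedCache` is a hypothetical class and Hashes stand in for the servers.

```ruby
require "zlib"

# Hypothetical client-side distribution: the servers never talk to each
# other; the client alone decides which node owns a key.
class ShardedCache
  def initialize(node_names)
    @nodes = node_names.to_h { |name| [name, {}] }
  end

  # The same key always hashes to the same node.
  def node_for(key)
    names = @nodes.keys
    names[Zlib.crc32(key) % names.size]
  end

  def set(key, value)
    @nodes[node_for(key)][key] = value
  end

  def get(key)
    @nodes[node_for(key)][key]
  end
end

cache = ShardedCache.new(%w[A B C])
cache.set("foo", "bar")
puts cache.get("foo")
puts cache.node_for("foo")
```

With plain modulo hashing, adding node D changes the owner of most keys, so the cache starts mostly cold. Many clients use consistent hashing instead to limit that reshuffling.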

Memcache <3's DB

• We use them Together

• If memcache doesn’t have a value

• Fetch from the database

• Set the key from database

• Hard

• Cache Invalidation : (
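The fetch-then-set pattern, plus the simplest form of invalidation, might look like the sketch below. `CachedUsers` is hypothetical, and Hashes stand in for both memcache and the database; a read counter makes the saved database trips visible.

```ruby
# Hypothetical cache-in-front-of-database wrapper.
class CachedUsers
  attr_reader :database_reads

  def initialize(database)
    @database = database
    @cache = {}
    @database_reads = 0
  end

  # If the cache has the value, return it; otherwise fetch it from the
  # database and set the key so the next read is a cache hit.
  def find(id)
    key = "user:#{id}"
    return @cache[key] if @cache.key?(key)

    @database_reads += 1
    @cache[key] = @database[id]
  end

  # Invalidation: every write must also drop the stale cached copy.
  def update(id, attributes)
    @database[id] = attributes
    @cache.delete("user:#{id}")
  end
end

users = CachedUsers.new({ 3 => { username: "schneems" } })
users.find(3)
users.find(3)
puts users.database_reads # 1
```

The hard part is remembering every code path that changes the data: any write that skips `update` leaves a stale value in the cache.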

Redis

• Key Value Store

• Open Source

• Not Distributed (yet)

• Extremely Quick

• “Data structure server”

Redis Example, again

redis = Redis.new
redis.set("foo", "bar")
redis.get("foo")
>> "bar"

Redis - Has Data Types

• Strings

• Hashes

• Lists

• Sets

• Sorted Sets

Redis Example, sets

redis = Redis.new
redis.sadd("foo", "bar")
redis.smembers("foo")
>> ["bar"]
redis.sadd("foo", "fly")
redis.smembers("foo")
>> ["bar", "fly"]

Redis => Likeable

• Very Fast response

• ~ 50 queries per page view

• ~ 1 ms per query

• http://github.com/Gowalla/likeable
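A rough sketch of how "likes" map onto set operations. This is not Likeable's actual API or implementation: `Likes` is a hypothetical class, and Ruby's `Set` stands in for Redis sets, with `add?`, `include?`, and `size` playing the roles of the Redis commands SADD, SISMEMBER, and SCARD.

```ruby
require "set"

# Hypothetical like tracker: one set of user ids per liked item.
class Likes
  def initialize
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Returns true the first time a user likes the target, false on repeats,
  # because a set stores each member only once.
  def like(user_id, target)
    @sets["likes:#{target}"].add?(user_id) ? true : false
  end

  def liked_by?(user_id, target)
    @sets["likes:#{target}"].include?(user_id)
  end

  def count(target)
    @sets["likes:#{target}"].size
  end
end

likes = Likes.new
likes.like(3, "spot:42")
likes.like(3, "spot:42") # duplicate, ignored
likes.like(7, "spot:42")
puts likes.count("spot:42") # 2
```

At roughly 1 ms per query, 50 such lookups issued one after another add up to about 50 ms per page view.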

Cassandra

• Open Source

• Distributed

• Key Value Store

• Eventually Consistent

• Sort of not ACID

• Uses A Schema

• ColumnFamilies

Cassandra Distributed, Eventual Consistency (diagram): nodes A, B, C, and D; data comes in at one node and is copied to extra nodes ... eventually.
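Eventual consistency can be illustrated with a toy model. `EventuallyConsistentStore` is hypothetical and far simpler than Cassandra's real replication: a write lands on one replica right away and is queued for the others, so a read from another replica can briefly return old data.

```ruby
# Hypothetical toy model of replicas that catch up after a delay.
class EventuallyConsistentStore
  def initialize(replica_count)
    @replicas = Array.new(replica_count) { {} }
    @pending = []
  end

  # A write lands on the first replica and is queued for the rest.
  def write(key, value)
    @replicas.first[key] = value
    @replicas.drop(1).each { |replica| @pending << [replica, key, value] }
  end

  def read(replica_index, key)
    @replicas[replica_index][key]
  end

  # Stand-in for background replication catching up.
  def replicate!
    @pending.each { |replica, key, value| replica[key] = value }
    @pending.clear
  end
end

store = EventuallyConsistentStore.new(3)
store.write("feed:3", "checked in")
p store.read(0, "feed:3") # "checked in"
p store.read(2, "feed:3") # nil, not copied yet
store.replicate!
p store.read(2, "feed:3") # "checked in"
```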

Cassandra @ Gowalla

Activity Feeds

Cassandra @ Gowalla

• Chronologic

• http://github.com/Gowalla/chronologic

Should I use NoSQL?

Which One?

Pick the right tool

Tradeoffs

• Every Data store has them

• Know your data store

• Strengths

• Weaknesses

NoSQL vs. RDBMS

• No Magic Bullet

• Use Both!!!

• Model data in a datastore you understand

• Switch when/if you need to

• Understand Your Options

Questions?
