Download - Webinar - Navigating the NoSQL Landscape (Which freaking database should I use?)

Which Freaking Database Should I Use?

Andrew C. Oliver
Open Software Integrators

www.osintegrators.com
@osintegrators

Andrew C. Oliver

• Programming since I was about 8

• Java since ~1997

• Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
  o Former member Jakarta PMC
  o Emeritus member of Apache Software Foundation

• Joined JBoss ~2002

• Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)

• Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
  o I make fanboys cry.

Open Software Integrators

• Founded Nov 2007 by Andrew C. Oliver (me)
  o in Durham, NC

Revenue and staff have at least doubled every year since 2009.

• New office (2012) in Chicago, IL
  o we're hiring mid to senior level as well as UI Developers (JQuery, Javascript, HTML, CSS)
  o up to 25% travel, salary + bonus, 401k, health, etc.
  o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, JQuery
  o nice to have: Hadoop, Neo4j, CouchBase, Ruby, at least one Cloud platform

Overview

• Why not just use the RDBMS for everything?
• Operational vs Analytical
• Key Value
• Column Family
• Document
• Graph
• Hadoop?
• Convergence of "clustered filesystems" and "databases"
• Conclusions

Why not "just use" the RDBMS for everything?

Before we begin...


• Let's handle the Elephant or rather the teddy bears in the room:

http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/




The CAP theorem


RDBMS CAP characteristics

• Great at consistency
• Okay at availability
• Not so great at partition tolerance...

Single process model

• Lots of servers with many connections to few servers.

Multiprocess Model

[Diagram: four nodes, each pairing a Data Manager with a Cluster Manager]

Historical Scalability

• 10 MB disks were "big"
• Scalability meant more disks, controllers and possibly CPUs
• CPUs went from 4.77 MHz to 3.4 GHz
• Disks went from 64kps @ 70ms to 6 Gb/s
• Network speeds went from under 4 Mb/s to gigabit to bonded gigabit and beyond.
• Disk speeds for a long time didn't keep up with CPU...

The Mathematical model

• RDBMS is based on "Relational Algebra", which is just an extension of basic "set theory"
• Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (friend of a friend, FOAF)
• Sometimes relationships are as important as the data
• Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.
• One size never really did fit all

Data Complexity


Datarrhea


• Yes I've already registered that ;-)
• The cheapness of storing data has yielded more demand
  o economics predicted this
• Moore's law ended while you slept
  o Intel says next year (but when did CPU speeds last double?)
• Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)

...but


• If
  o your data is tabular;
  o fits cleanly in a relational model;
  o you aren't having scalability issues;
  o you don't have a large dataset; or
  o a dataset/problem that lends itself to massive parallelization...
• you can probably stick with your RDBMS for now
  o ...and probably aren't at this conference anyhow.

JPA/RDBMS Tables Example

PersonID  Firstname  Lastname  CompanyID
2         Andy       Oliver    3

CompanyID  Name                       City    State
3          Open Software Integrators  Durham  NC

PhoneNumber   Type    PersonID
919.627.1236  google  2
919.321.0119  work    2
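A minimal sketch of what reading this data back looks like in the relational model, assuming Python's built-in sqlite3 module. The column names come from the slide; the table names (person, company, phone) are assumptions, since the slide does not name the tables. Note that one person with two phone numbers already needs a three-way join.

```python
import sqlite3

# In-memory database. Table names are assumed; columns follow the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (CompanyID INTEGER PRIMARY KEY, Name TEXT, City TEXT, State TEXT);
CREATE TABLE person  (PersonID INTEGER PRIMARY KEY, Firstname TEXT, Lastname TEXT,
                      CompanyID INTEGER REFERENCES company(CompanyID));
CREATE TABLE phone   (PhoneNumber TEXT, Type TEXT,
                      PersonID INTEGER REFERENCES person(PersonID));
INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
INSERT INTO person  VALUES (2, 'Andy', 'Oliver', 3);
INSERT INTO phone   VALUES ('919.627.1236', 'google', 2);
INSERT INTO phone   VALUES ('919.321.0119', 'work', 2);
""")

# Reassemble one "person with company and phones" view from three tables.
rows = conn.execute("""
    SELECT p.Firstname, p.Lastname, c.Name, ph.PhoneNumber, ph.Type
    FROM person p
    JOIN company c ON c.CompanyID = p.CompanyID
    JOIN phone ph  ON ph.PersonID = p.PersonID
    ORDER BY ph.Type
""").fetchall()

for row in rows:
    print(row)
```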

Operational vs Analytical


• One DB type is unlikely to be well suited for all of your problems.

• The system doing "short and sweet" "lightweight" transactions is your operational system.

• The system doing long running reports and generating charts and graphs and statistics is your analytical system.

• There is also search. There are recommendation engines, etc.

Other types of databases

Key-Value Stores

• Examples: Couchbase 1.8, Cassandra
  o also: Gemfire, Infinispan (distributed caches)
• Constant Time O(1) - Lookup by key
• Good enough for "right now" stock quotes
• Usually combined with an index for search, but the structure isn't inherently indexed.
• Generally works well with Map Reduce.
• Extremely scalable, easy to partition
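Why "easy to partition" follows from "lookup by key" can be sketched in a few lines. This is a toy, not any product's implementation: the class name PartitionedKV, the CRC32 hash, and the key format are all assumptions made for illustration.

```python
import zlib


class PartitionedKV:
    """Toy key-value store: each key hashes to exactly one partition (a dict)."""

    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash, so the same key always lands on the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        # One hash plus one dict lookup: constant time on average.
        return self._partition_for(key).get(key, default)


store = PartitionedKV(num_partitions=4)
store.put("quote:XYZ", 12.34)   # hypothetical ticker and price
store.put("quote:ABC", 56.78)
print(store.get("quote:XYZ"))
```

Because no operation ever needs more than one partition, each partition could live on a different server. The flip side is that any question other than "what is the value for this key" needs a separate index or a full scan.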

Column Family / Big Table

• Many Key-Value stores support "column families"
  o Cassandra
• Some were designed this way
  o HBase
• Keys and values become composite
• Essentially a hashmap with a multi-dimensional array
  o each column is a row of data
• map-reduce friendly
• Stock quote with time ranges
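The "stock quote with time ranges" case can be sketched as a hashmap of rows whose columns are kept sorted by name, so a time window becomes a slice. This is an illustrative model only, not the HBase or Cassandra API; the class ColumnFamilyRow, the ticker "XYZ", and the prices are made up.

```python
from bisect import bisect_left, bisect_right


class ColumnFamilyRow:
    """One row key holding (column name -> value) pairs, sorted by column name."""

    def __init__(self):
        self._names = []    # column names, kept in sorted order
        self._values = {}   # column name -> value

    def put(self, column, value):
        if column not in self._values:
            self._names.insert(bisect_left(self._names, column), column)
        self._values[column] = value

    def slice(self, start, end):
        """Return (column, value) pairs with start <= column <= end."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


table = {}  # row key -> ColumnFamilyRow (the "hashmap" part)
row = table.setdefault("XYZ", ColumnFamilyRow())
# Timestamps as column names; insertion order does not matter.
row.put("2012-07-17T09:30", 10.0)
row.put("2012-07-17T09:32", 10.4)
row.put("2012-07-17T09:31", 10.2)
row.put("2012-07-17T09:45", 9.9)

window = row.slice("2012-07-17T09:30", "2012-07-17T09:32")
print(window)
```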

HBase Example

Row key                               First name  Last name  Company                    City    State  Phone number  Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-627-1236  google
7b2435c0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-321-0119  work

Document databases

• Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.
• Couchbase 2.0, CouchDB, MongoDB
• JSON documents
• One way to think of this is a Key-Value store that understands the values.
• Not as map-reduce friendly, larger datasets require indexes.
• Clear fits: REST services, operational store

Document databases

• JSON document:

{
  "firstname" : "Andy",
  "lastname" : "Oliver",
  "company" : "Open Software Integrators",
  "location" : {
    "city" : "Durham",
    "state" : "NC"
  },
  "phone" : [
    { "number" : "123 456 7890", "type" : "mobile" },
    { "number" : "123 654 1234", "type" : "work" }
  ]
}
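"A Key-Value store that understands the values" can be sketched with a dict of parsed JSON documents. This is not a Couchbase, CouchDB, or MongoDB API; the key "person::andy" and the helper names are assumptions for illustration.

```python
import json

documents = {}  # key -> parsed JSON document

documents["person::andy"] = json.loads("""
{
  "firstname": "Andy",
  "lastname": "Oliver",
  "company": "Open Software Integrators",
  "location": {"city": "Durham", "state": "NC"},
  "phone": [
    {"number": "123 456 7890", "type": "mobile"},
    {"number": "123 654 1234", "type": "work"}
  ]
}
""")


def find_by_city(docs, city):
    """Look inside the values. Without an index this scans every document."""
    return [key for key, doc in docs.items()
            if doc.get("location", {}).get("city") == city]


def phone_numbers(doc, phone_type):
    """Nested arrays stay with the document; no join is needed."""
    return [p["number"] for p in doc.get("phone", []) if p.get("type") == phone_type]


print(find_by_city(documents, "Durham"))
print(phone_numbers(documents["person::andy"], "work"))
```

The full scan in find_by_city is the reason larger datasets require indexes.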

Graph Databases

• Based on Graph Theory
• Less about volume of the data and more about complexity
• Many are transactional
  o often the transactions are "more correct" than those offered by a relational database.
• FOAF, direct path operations are easy
  o very complicated/inefficient in RDBMS
• Usually paired with an index for search
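The two operations named above, direct path and friend of a friend, are short traversals once relationships are stored as first-class data. A rough sketch using a plain adjacency list and breadth-first search; the people and "knows" relationships are hypothetical, and this is not the Neo4j API.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as an adjacency list.
knows = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = set(graph.get(person, []))
    two_hops = {fof for friend in direct for fof in graph.get(friend, [])}
    return sorted(two_hops - direct - {person})


print(shortest_path(knows, "alice", "erin"))
print(friends_of_friends(knows, "alice"))
```

In a relational schema, each extra hop is typically another self-join on the relationship table, which is where the "very complicated/inefficient" part comes from.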

Design: RDBMS vs Graph


Neo4j Graph Example

[Diagram: property graph. Nodes: a person (Firstname: Andrew, Lastname: Oliver); a company (Open Software Integrators); two phone numbers (919.627.1236, type googlevoice; 919.321.0119, type work); two cities (Durham, NC; Chicago, IL). Relationship labels: HAS, LOCATED, FOUNDED, WORKS FOR, RESIDES.]

Note the extra relationships and details here - graph databases are just fun and easy to understand.

Where does Hadoop fit?

• NoSQL
• Software Framework (lots of pieces/lots of choices):
  o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources
  o Hive - facilitates structure for the data
  o HCatalog - provides inter-operability between these internal systems
  o HBase - Bigtable-type database
  o HDFS - Hadoop file system
• Excellent choice for data processing and data analysis
• MapReduce
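For readers who have not seen MapReduce, here is the shape of it as a single-process sketch: a map phase that emits key/value pairs, a shuffle that groups by key, and a reduce phase that folds each group. This is plain Python, not the Hadoop API; the function names and input lines are made up.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Fold one key's values into a single result."""
    return key, sum(values)


lines = ["which database should I use", "which database fits the data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts["which"], counts["database"], counts["data"])
```

Each map call looks at one line and each reduce call looks at one key, so both phases can be spread over many machines.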

Convergence of Filesystems and Databases

• Hadoop HDFS is...a distributed filesystem
• So are Gluster, Ceph, GFS, etc.
• Hadoop can use Ceph or Gluster in place of HDFS

Other Derivatives

• Triplestores
  o Apache Jena
• OODBMS / ORDBMS
  o Caché

Things you may consider

• Persistence
  o Asynch / Synch
• Replication
• Availability
• Transactions / Consistency
• "Locality"
• Language
• Resources
  o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software
  o http://sevenweeks.org/

Conclusions

• RDBMS may not scale to your needs
• Your data may not map efficiently to tables
• Key Value Store - data by key, fast, scalable, can't handle complex data
• Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
• Document - a good operational system, not your analytical, moderately scalable, matches OO
• Graph - great for complex data, transactional, less scalable
• Filesystems and "databases" are converging

Thank you for attending!

Andrew C. Oliver
Open Software Integrators

www.osintegrators.com
@osintegrators

Introduction to Document Databases and Couchbase

Dipti Borkar
Director, Product Management

Couchbase Server 2.0

NoSQL Document Database

Couchbase Server - Core Capabilities

• Easy Scalability: Grow cluster without application changes, without downtime with a single click
• Consistent High Performance: Consistent sub-millisecond read and write response times with consistent high throughput
• Always On 24x365: No downtime for software upgrades, hardware maintenance, etc.
• Flexible Data Model: JSON document model with no fixed schema.

Relational vs Document data model

• Relational data model: Highly-structured table organization with rigidly-defined data formats and record structure.
• Document data model: Collection of complex documents with arbitrary, nested data formats and varying "record" format.

Making a Change Using RDBMS

User Table

User ID  First  Last    Zip    Country ID
1        Dipti  Borkar  94040  001
2        Joe    Smith   94040  001
3        Ali    Dodson  94040  001
4        Sarah  Gorin   NW1    002
5        Bob    Young   30303  001
6        Nancy  Baker   10010  001
7        Ray    Jones   31311  001
8        Lee    Chen    V5V3M  008
• • •
50000    Doug   Moore   04252  001
50001    Mary   White   SW195  002
50002    Lisa   Clark   12425  001

Photo Table

User ID  Photo ID  Comment  Country ID
2        d043      NYC      001
2        b054      Bday     007
5        c036      Miami    001
7        d072      Sunset   133
5002     e086      Spain    133

Status Table

User ID  Status ID  Text     Country ID
1        a42        At conf  134
4        b26        excited  007
5        c32        hockey   008
12       d83        Go A's   001
5000     e34        sailing  005

Affiliations Table

User ID  Affl ID  Affl Name  Country ID
2        a42      Cal        001
4        b96      USC        001
7        c14      UW         001
8        e22      Oxford     002

Country Table

Country ID  Country name
001         USA
002         UK
003         Argentina
004         Australia
005         Aruba
006         Austria
007         Brazil
008         Canada
009         Chile
• • •
130         Portugal
131         Romania
132         Russia
133         Spain
134         Sweden

Making the Same Change with a Document Database

{
  "ID": 1,
  "FIRST": "Dipti",
  "LAST": "Borkar",
  "ZIP": "94040",
  "CITY": "MV",
  "STATE": "CA",
  "STATUS": {
    "TEXT": "At Conf",
    "GEO_LOC": "134"
  },
  "COUNTRY": "USA"
}

Just add information to a document
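As a rough illustration of "just add information to a document", here is the same change applied to an in-memory document using only Python's json module. No database client is involved; the second document (other) is a hypothetical neighbor added to show that nothing else has to change.

```python
import json

doc = {
    "ID": 1, "FIRST": "Dipti", "LAST": "Borkar",
    "ZIP": "94040", "CITY": "MV", "STATE": "CA",
    "STATUS": {"TEXT": "At Conf"},
}
# A second, hypothetical document that never gets the new fields.
other = {"ID": 2, "FIRST": "Joe", "LAST": "Smith", "ZIP": "94040"}

# The change itself: no ALTER TABLE, no new lookup table, no migration.
doc["STATUS"]["GEO_LOC"] = "134"
doc["COUNTRY"] = "USA"

print(json.dumps(doc, indent=2))
```

The trade-off is that application code now has to cope with documents that lack the new fields, as other does here.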

Couchbase Server 2.0 Architecture

[Architecture diagram, two halves: Data Manager and Cluster Manager]

Cluster Manager (Erlang/OTP)
• Heartbeat
• Process monitor
• Global singleton supervisor
• Configuration manager
• Rebalance orchestrator
• Node health monitor
• vBucket state and replication manager
• HTTP REST management API / Web UI
• Grouping labels in the diagram: "on each node", "one per cluster"
• Ports: HTTP 8091; Erlang port mapper 4369; Distributed Erlang 21100 - 21199

Data Manager
• Query Engine (Query API, port 8092)
• Moxi (port 11211, Memcapable 1.0)
• Memcached (port 11210, Memcapable 2.0)
• Couchbase EP Engine
• storage interface
• New Persistence Layer

Couchbase “The basics”

Basic Operation

• Docs distributed evenly across servers
• Each server stores both active and replica docs
  – Only one server active at a time
• Client library provides app with simple interface to database
• Cluster map provides map to which server doc is on
  – App never needs to know
• App reads, writes, updates docs
• Multiple app servers can access same document at same time

[Diagram: Couchbase Server Cluster of three servers (Server 1, Server 2, Server 3), each holding active and replica docs, with User Configured Replica Count = 1. App Server 1 and App Server 2 each use the Couchbase client library and a cluster map to send READ/WRITE/UPDATE requests.]
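The cluster map idea can be sketched as a two-step lookup: hash the document key to a vBucket, then look the vBucket up in a table of servers. Everything specific here is an assumption for illustration (the vBucket count of 16, the CRC32 hash, the round-robin layout, the server names); it is not the real client library.

```python
import zlib

NUM_VBUCKETS = 16                              # illustrative only
SERVERS = ["server1", "server2", "server3"]    # hypothetical node names


def build_cluster_map(servers, num_vbuckets):
    """vBucket id -> (active server, replica server), with replica count = 1."""
    cluster_map = {}
    for vb in range(num_vbuckets):
        active = servers[vb % len(servers)]
        replica = servers[(vb + 1) % len(servers)]  # always a different server
        cluster_map[vb] = (active, replica)
    return cluster_map


def vbucket_for(key, num_vbuckets):
    """Stable hash of the document key to a vBucket id."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def server_for(key, cluster_map):
    """What the client library does for the app: key -> active server."""
    active, _replica = cluster_map[vbucket_for(key, len(cluster_map))]
    return active


cluster_map = build_cluster_map(SERVERS, NUM_VBUCKETS)
print(server_for("doc5", cluster_map))
```

The app only ever calls something like server_for indirectly through the client library, which is why it "never needs to know" where a doc lives.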

Add Nodes to Cluster

• Two servers added with one-click operation
• Docs automatically rebalance across cluster
  – Even distribution of docs
  – Minimum doc movement
• Cluster map updated
• App database calls now distributed over larger number of servers

[Diagram: the cluster grown from three to five servers (Server 1 through Server 5), each holding active and replica docs after redistribution. App Server 1 and App Server 2 send READ/WRITE/UPDATE requests through their cluster maps.]
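"Even distribution" with "minimum doc movement" can be sketched as a greedy loop that moves one vBucket at a time from the busiest server to the idlest until the loads differ by at most one. This is a simplified illustration under assumed names and counts (12 vBuckets, servers s1 to s5), not Couchbase's actual rebalance algorithm, and it ignores replicas.

```python
def rebalance(assignment, servers):
    """assignment: vBucket id -> server. Returns a new, evened-out assignment.

    Assumes every server already used in `assignment` is listed in `servers`.
    """
    new_assignment = dict(assignment)
    owned = {server: [] for server in servers}
    for vb, server in sorted(new_assignment.items()):
        owned[server].append(vb)
    while True:
        busiest = max(servers, key=lambda s: len(owned[s]))
        idlest = min(servers, key=lambda s: len(owned[s]))
        if len(owned[busiest]) - len(owned[idlest]) <= 1:
            return new_assignment
        vb = owned[busiest].pop()        # move exactly one vBucket per step
        owned[idlest].append(vb)
        new_assignment[vb] = idlest


# 12 vBuckets spread over three servers, then two empty servers join.
before = {vb: ["s1", "s2", "s3"][vb % 3] for vb in range(12)}
after = rebalance(before, ["s1", "s2", "s3", "s4", "s5"])

moved = [vb for vb in before if before[vb] != after[vb]]
print(len(moved), "of", len(before), "vBuckets moved")
```

In this run only the vBuckets needed to fill the two new servers change owner; everything else stays where it was, so most cluster map entries are untouched.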



Fail Over Node

• App servers accessing docs
• Requests to Server 3 fail
• Cluster detects server failed
  – Promotes replicas of docs to active
  – Updates cluster map
• Requests for docs now go to appropriate server
• Typically rebalance would follow

[Diagram: five-server cluster (Server 1 through Server 5) holding active and replica docs; replicas of the docs that were active on the failed Server 3 are promoted to active on the remaining servers. App Server 1 and App Server 2 use the updated cluster maps.]
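The failover steps can be sketched as a pure function over a cluster map: wherever the failed server held the active copy, the replica is promoted, and wherever it held the replica, that replica is simply gone until a rebalance recreates it. The map layout and server names are assumptions for illustration, not Couchbase's internal format.

```python
def fail_over(cluster_map, failed):
    """cluster_map: vBucket id -> {"active": server, "replica": server or None}."""
    updated = {}
    for vb, copies in cluster_map.items():
        active, replica = copies["active"], copies["replica"]
        if active == failed:
            # Promote the replica; there is no spare copy until rebalance.
            updated[vb] = {"active": replica, "replica": None}
        elif replica == failed:
            # Active copy is unaffected, but its replica is lost.
            updated[vb] = {"active": active, "replica": None}
        else:
            updated[vb] = dict(copies)
    return updated


cluster_map = {
    0: {"active": "server1", "replica": "server2"},
    1: {"active": "server2", "replica": "server3"},
    2: {"active": "server3", "replica": "server1"},
}
after_failover = fail_over(cluster_map, "server3")
print(after_failover[2])
```

The missing replicas in the result are why a rebalance typically follows.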



New in 2.0

• JSON support
• Indexing and Querying
• Cross data center replication
• Incremental Map Reduce


Cluster wide - Indexing and Querying


• Indexing work is distributed amongst nodes
• Large data set possible
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes

[Diagram: App Server 1 and App Server 2, each with the Couchbase client library and a cluster map, send a Query to a cluster of three servers (Server 1, Server 2, Server 3), each holding active and replica docs.]
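"Each node has index for data stored on it" plus "queries combine the results" is a scatter-gather pattern. A minimal single-process sketch, with made-up documents and server names; it does not use Couchbase views or the real query API.

```python
import heapq

# Hypothetical docs held by each node.
node_docs = {
    "server1": [{"id": "doc5", "zip": "94040"}, {"id": "doc2", "zip": "30303"}],
    "server2": [{"id": "doc4", "zip": "94040"}, {"id": "doc7", "zip": "10010"}],
    "server3": [{"id": "doc1", "zip": "31311"}, {"id": "doc3", "zip": "94040"}],
}


def build_local_index(docs, field):
    """Each node indexes only its own docs: sorted (field value, doc id) pairs."""
    return sorted((doc[field], doc["id"]) for doc in docs if field in doc)


local_indexes = {node: build_local_index(docs, "zip")
                 for node, docs in node_docs.items()}


def query(indexes, value):
    """Scatter the lookup to every node, then merge the sorted partial results."""
    partials = [[entry for entry in index if entry[0] == value]
                for index in indexes.values()]
    return [doc_id for _value, doc_id in heapq.merge(*partials)]


print(query(local_indexes, "94040"))
```

Index building scales with the number of nodes because no node ever indexes another node's docs; the cost moves to query time, where partial results have to be merged.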

Cluster wide - XDCR

[Diagram: two Couchbase Server clusters, NY DATA CENTER and SF DATA CENTER. Each has three servers (Server 1, Server 2, Server 3) holding active docs in RAM and on DISK.]

Full Text Search Integration

• Elastic Search is good for ad-hoc queries and faceted browsing

• Couchbase adapter uses XDCR to push mutations to ES

• Couchbase ES Adapter is cluster-aware

[Diagram: Couchbase Server feeding ElasticSearch through Unidirectional Cross Data Center Replication]

Couchbase Server Admin Console

Use cases

Data driven use cases

• Support for unlimited data growth
• Data with non-homogenous structure
• Need to quickly and often change data structure
• 3rd party or user defined structure
• Variable length documents
• Sparse data records
• Hierarchical data

Performance driven use cases

• Low latency matters
• High throughput matters
• Large number of users
• Unknown demand with sudden growth of users/data
• Predominantly direct document access
• Workloads with very high mutation rate per document

Use Case Examples

Web app or Use-case                     Couchbase Solution                                 Example Customer
Content and Metadata Management System  Couchbase document store + Elastic Search          McGraw-Hill…
Social Game or Mobile App               Couchbase stores game and player data              Zynga…
Ad Targeting                            Couchbase stores user information for fast access  AOL…
User Profile Store                      Couchbase Server as a key-value store              TuneWiki…
Session Store                           Couchbase Server as a key-value store              Concur…
High Availability Caching Tier          Couchbase Server as a memcached tier replacement   Orbitz…
Chat/Messaging Platform                 Couchbase Server                                   DOCOMO…

Andrew [email protected]

Dipti [email protected]