NoSQL with Hadoop and HBase

142
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org NoSQL with HBase and Hadoop BeJUG - 17/6/2010 http://www.flickr.com/photos/wolfgangstaudt/2215246206/

description

NoSQL overview presentation for BeJUG - 17/6/2010

Transcript of NoSQL with Hadoop and HBase

Page 1: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NoSQLwith HBase and HadoopBeJUG - 17/6/2010

http://www.flickr.com/photos/wolfgangstaudt/2215246206/

Page 2: NoSQL with Hadoop and HBase

THIS N OT E B OO K B ELO N GS TO:

Noteblock_03.indd 1Noteblock_03.indd 1 23/05/10 14:4223/05/10 14:42

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Who am I

» Steven Noels - [email protected]

»Outerthought : scalable content applications

»makers of Daisy, Lily and Kauri : open source internet/Java/REST/content apps

2

Page 3: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

1. Intro2. Theory3. Technology4. Experiences

Page 4: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

An evolution driven by pain.

Page 5: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

History

5

hierarchical databases

IMS

OODBMS

XMLDB RDBMS

1. standardization

2. simplification

Page 6: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

History

6

RDBMS NOSQL

cachingdenormalisationshardingreplication ...

3. pain

4. rethinkingthe problem

Page 7: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Four Trends

»Trend 1 : Data Size

»Trend 2 : Connectedness

»Trend 3 : Semi-structure

»Trend 4 : Architecture

7

Page 8: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 8

2006 2007 2008 2009 2010

0

250

500

750

1000

161

253

397

623

988

ExaBytes (10!") of data stored per year

3

Trend 1: Data size

Data source: IDC 2007

Each year more and more digital data is created. Over two years we create more digital data than all the data created in history before that.

Page 9: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 9

Trend 2: Connectedness

4

Text documents

1990

Info

rmat

ion

conne

ctiv

ity

Folksonomies

Tagging

User-generated

contentWikis

RSS

Blogs

Hypertext

2000 2010 2020

web 1.0 web 2.0 “web 3.0”

Ontologies

RDF

Giant

Global

Graph (GGG)

Over time data has evolved to be more and more interlinked and connected.Hypertext has links,Blogs have pingback,Tagging groups all related data

Page 10: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 10

Trend 3: Semi-structure

5

! Individualization of content

• In the salary lists of the 1970s, all elements had exactly one job

• In the salary lists of the 2000s, we need 5 job columns! Or 8?

Or 15?

!All encompassing “entire world views”

• Store more data about each entity

!Trend accelerated by the decentralization of content generation

that is the hallmark of the age of participation (“web 2.0”)

Page 11: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 11

Trend 4: Architecture

6

DB

Application

1980s: Mainframe applications

Page 12: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 12

Trend 4: Architecture

7

DB

Application

1990s: Database as integration hub

Application Application

Page 13: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 13

DBDB DB

Trend 4: Architecture

8

Application

2000s: (moving towards) Decoupled serviceswith their own backend

Application Application

Page 14: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 14

http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html

Page 15: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Enter NoSQL

Page 16: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

It’s a Cambrian Explosion

16

NoSQL

Cassandra

neo4j

Page 17: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 17

?Buzz-oriented development

Page 18: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 18

Page 19: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Common themes

19

» SCALE SCALE SCALE

» new datamodels

» devops

»N-O-SQL

»The Cloud :technology is of no interest anymore

Page 20: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

New Data

» sparse structures

»weak schemas

» graphs

» semi-structured

» document-oriented

20

Page 21: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NoSQL

»Not a movement.

»Not ANSI NoSQL-2010.

»Not one-size-fits-all.

»Not (necessarily) anti-RDBMS.

»No silver bullet.

21

Page 22: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NoSQL = pro Choice

22

Page 23: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NoSQL = toolbox

23

Page 24: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NOSQL, if you need ...

» horizontal scaling (out rather than up)

» unusually common data (aka free-structured)

» speed (especially for writes)

» the bleeding edge

24

Page 25: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

SQL/RDBMS, if you need ...

» SQL

»ACID

» normalisation

» a defined liability

25

Page 26: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Some RDBMS bashing

» sparse and dynamic tables

26

Page 27: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Some RDBMS bashing

» solution

27

mysql> desc thefields;+---------------------+---------------+------+-----+---------+-------+| Field | Type | Null | Key | Default | Extra |+---------------------+---------------+------+-----+---------+-------+| doc_id | bigint(20) | NO | PRI | NULL | | ...| fieldtype_id | bigint(20) | NO | PRI | NULL | | ...| stringvalue | varchar(255) | YES | MUL | NULL | | | datevalue | datetime | YES | MUL | NULL | | | datetimevalue | datetime | YES | MUL | NULL | | | integervalue | bigint(20) | YES | MUL | NULL | | | floatvalue | double | YES | MUL | NULL | | | decimalvalue | decimal(10,5) | YES | MUL | NULL | | | booleanvalue | char(1) | YES | MUL | NULL | | ...+---------------------+---------------+------+-----+---------+-------+25 rows in set (0.00 sec)

Page 28: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

More RDBMS bashing

» replication and failure recovery» (when working on a budget)

» application-level partitioning logic

28

Page 29: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

1. Intro2. Theory3. Technology4. Experiences

Page 30: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Academic background

»Amazon Dynamo

»Google BigTable

» Eric Brewer CAP theorem

30

Page 31: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Shameless plug

31

nosqlsummer.org

Page 32: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Shameless plug

31

nosqlsummer.org

Page 33: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Amazon Dynamo

32

» coined the term ‘eventual consistency’

» consistent hashing

Page 34: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Eventual Consistency Gone Wild

33

Page 35: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 34

server 1 replicaserver 2

1. update ACL: disallow mother from folder ‘spring break’

2. upload spring break pictures

how is my boydoing on hisspring break?

1.

2.

Page 36: NoSQL with Hadoop and HBase

» a solution for naive mod n distributions» specifically in the case of adding or deleting nodes

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Consistent hashing

35

Page 37: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Consistent hashing

36

N value02160

2160/2

2160/4

node 0

node 1

node 2

node 3

hash(<<"artist">>,<<"REM">>)

5

Tuesday, November 17, 2009

(c) Basho/Riak

Page 38: NoSQL with Hadoop and HBase

»multi-dimensional column-oriented database

» on top of GoogleFileSystem

» object versioning

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Google BigTable

37

Page 39: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CAP theorem

38

strong consistency

highavailability

partition-tolerance

Page 40: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CAP

»Strong Consistency: all clients see the same view, even in the presence of updates

»High Availability: all clients can find some replica of the data, even in the presence of failures

»Partition-tolerance: the system properties hold even when the system is partitioned

39

Page 41: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Culture Clash

40

»ACID» highest priority: strong

consistency for transactions

» availability less important

» pessimistic

» rigorous analysis

» complex mechanisms

» BASE» availability and scaling

highest priorities

» weak consistency

» optimistic

» best effort

» simple and fast

spectrum

Page 42: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Availability ≠ total async !

Page 43: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The Enterprise Service Bus

42

✘bus =

congestion

Page 44: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Bus systems

43

» objects don’t fit in a pipe

» object ➙ message

» serialization / de-serialization cost

»message size

» queuing = cost

Page 45: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Use a mixture of both

»async + sync

44

stuff which matters !

Page 46: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

2.1 Interlude

Page 47: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

2.1 Interlude

Page 48: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

2.1 Interlude

Page 49: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Processing large datasets :

Hadoop + Map/Reduce

Page 50: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Hadoop: HDFS + MapReduce» single filesystem + single execution-space

47

Page 51: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

M/R Execution

48

(c) Google

Page 52: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

MapReduce example: WordCount

49

Page 53: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Hadoop ecosystem» Hadoop Common

» Subprojects

» Chukwa: A data collection system for managing large distributed systems.» HBase: A scalable, distributed database that supports structured data storage for

large tables.» HDFS: A distributed file system that provides high throughput access to application

data.» Hive: A data warehouse infrastructure that provides data summarization and ad hoc

querying.

» MapReduce: A software framework for distributed processing of large data sets on compute clusters.

» Pig: A high-level data-flow language and execution framework for parallel computation.

» ZooKeeper: A high-performance coordination service for distributed applications.» Mahout: machine learning libraries

50

Page 54: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Processing large datasets with MR

51

»Benefit from parallellisation

» Less modelling upfront (ad-hoc processing)

»Compartmentalized approach reduces operational risks

»AsterData et al. have SQL/MR hybrids for huge-scale BI

Page 55: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

1. Intro2. Theory3. Technology4. Experiences

Page 56: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

We welcome the Polyglot Persistence overlords.

Page 57: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The NOSQL footprint

54

AC

ID,

sim

ple

oper

atio

nal

const

rain

ts

free-structured or sparse data

SQL

NOSQL

referential integrity,typed data

high

ly scalable an

davailab

le (com

plex

ity)

HBase

Cassandra

CouchDB

MongoDB

neo4j

(c) me!

Page 58: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Categories

» key-value stores

» column stores

» document stores

» graph databases

55

Page 59: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Key-value stores

» Focus on scaling to huge amounts of data

»Designed to handle big loads

»Often: cfr. Amazon Dynamo

» ring partitioning and replication

»Data model: key/value pairs

56

Page 60: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Key-value stores

»Redis

»Voldemort

»Tokyo Cabinet

57

Page 61: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Redis

»REmote DIctionary Server

» http://code.google.com/p/redis/

» vmware

58

Page 62: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Redis Features» persisted memcache, ‘awesome’

» RAM-based + persistable

» key ➙ values: string, list, set

» higher-level ops

» i.e. push/pop and sort for lists

» fast (very)

» configurable durability

» client-managed sharding

59

Page 63: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Voldemort

» http://project-voldemort.com/

» LinkedIn

60

Page 64: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Voldemort

» persistent

» distributed

» fault-tolerant

» hash table

61

Page 65: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Voldemort

62

API: GET, PUT,DELETE

Page 66: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Voldemort

63

routing logic moving up the stack,smaller latency

Page 67: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Column stores

»BigTable clones

» Sparseness!

»Data model: columns ➙ column families ➙ cells

»Datums keyed by: row, column, time, index

» Row-range ➙ tablet ➙ distribution

64

Page 68: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Column stores

»BigTable

»HBase

»Cassandra

65

Page 69: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

BigTable

» http://labs.google.com/papers/bigtable.html

»Google

» layered on top of GFS

66

Page 70: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

HBase

» http://hadoop.apache.org/hbase/

» StumbleUpon / Adobe / Cloudera

67

Page 71: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

HBase

» sorted» distributed» column-oriented»multi-dimensional» highly-available» high-performance

» persisted» storage system

» adds random access reads and writes atop HDFS

68

Page 72: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

HBase data model

69

»Distributed multi-dimensional sparse map

»Multi-dimensional keys:(table, row, family:column, timestamp) → value

»Keys are arbitrary strings

»Access to row data is atomic

Page 73: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Sample schema

70

!"#$$%&'(

!"#$%&'(%)*+#,-%

)#&*+,-./%&

0122345322!12

)#&*+,-./%&

0122345322!14

(c) eCircle

Page 74: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Storage architecture

71

© lars george

Page 75: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Cassandra

» http://cassandra.apache.org/

»Rackspace / Facebook

72

Page 76: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Cassandra

»Key-value store (with added structure)

»Reliability (identical nodes)

» Eventual consistent

»Distributed

»Tunable

» Partitioning

» Replication

73

CA

P

Page 77: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Cassandra applicability

74

FIT

» Scalable reliability (through identical nodes)» Linear scaling»Write throughput» Large Data Sets

NO FIT

» Flexible indexing»Only PK-based

querying»Big Binary Data» 1 Row must fit in

RAM entirely

Page 78: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 75

Page 79: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 76

Page 80: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Document databases

»≈ K/V stores, but DB knows what the Value is

» Lotus Notes heritage

»Data model: collections of K/V collections

»Documents often versioned

77

Page 81: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Document stores

»CouchDB

»MongoDB

»Riak

78

Page 82: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB

» http://couchdb.apache.org/

» couch.io

79

Page 83: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB

» fault-tolerant

» schema-free

» document-oriented

» accessible via a RESTful HTTP/JSON API

80

Page 84: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB documents

{ “_id”: ”BCCD12CBB”, “_rev”: ”AB764C”, “type”: ”person”, “name”: ”Darth Vader”, “age”: 63, “headware”: [“Helmet”, “Sombrero”], “dark_side”: true }

81

Page 85: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB REST API

»HTTP

» PUT /db/docid

»GET /db/docid

» POST /db/docid

»DELETE /db/docid

82

Page 86: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB Views»MapReduce-based

» Filter, Collate, Aggregate

» Javascript

83

function (Key, Values) { var sum = 0; for(var i in Values) sum += Values[i]; return sum; }

function (doc) { for(var i in doc.tags) emit(doc.tags[i], 1); }

map reduce

Page 87: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

CouchDB

» be careful on semantics

» replication ≠ partioning/sharding !

» distributed database = distributable database

» sharded / distributed deploymentrequires proxy layer

84

Page 88: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

MongoDB

» http://www.mongodb.org/

» 10gen

85

Page 89: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

MongoDB

» cfr. CouchDB, really

» except for:

»C++

» performance focus

» runtime queries (mapreduce still available)

» native drivers (no REST/HTTP layering)

» no MVCC: update-in-place

» auto sharding (alpha)

86

Page 90: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Graph databases

» Focus on modeling structure of data - interconnectivity

» Scale, but only to the complexity of data

»Data model: property graphs

87

Page 91: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Graph databases

»Neo4j

»AllegroGraph (RDF)

88

Page 92: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Neo4j

» http://neo4j.org/

»Neo Technology

89

Page 93: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Neo4j» data = nodes + relationships + key/value properties

90

Page 94: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Neo4j

»many language bindings, little remoting

» ‘whiteboard’ friendly

» scaling to complexity (rather than volume?)

» lots of focus on domain modelling

» SPARQL/SAIL impl for triple geeks

»mostly RAM centric (with disk swapping & persistence)

91

Page 95: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Bandwagonjumpers

» JCR / Jackrabbit

»GT/M

»RDF stores

92

Page 96: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Market maturization

Page 97: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Rise of integrators

»Cloudera (H-stack)

»Riptano (Cassandra)

»Cloudant (hosted CouchDB)

» (Outerthought: HBase)

94

Page 98: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

VC capital

»Cloudera

» couch.io

»Neo

» 10gen

»many others

95

Page 99: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

1. Intro2. Theory3. Technology4. Experiences

Page 100: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 97

the fireside conversations

http://www.flickr.com/photos/52641994@N00/516394238/

Page 101: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

NOSQL applicability

»Horizontal scaling

»Multi-Master

»Data representation

» search of simplicity

» data that doesn’t fit the E-R model(graphs, trees, versions)

» Speed

98

Page 102: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Tool selection

» be careful with the marketeese:smoke and mirrors beware!

»monitor dev list, IRC, Twitter, blogs

»monitor project ‘sponsors’

»mix-and-match: polyglot persistency

»DON’T NOSQL WITHOUT INTERNAL SYS ARCHS & DEV(OP)S !

99

Page 103: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 100

Our Context: Lily

» cloud-scalable content store and search repository

» successor (in many ways) of Daisy

Page 104: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

101

complexity

age

1.0

2.0

3.0

software architecture

Page 105: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

102

complexity

age

1.0

2.0

3.0

user interest

Page 106: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business Development 101

103

budget

user interest

Page 107: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution

104

sophistication

nosql?

1.0

2.0

3.0

ability to cope

mysql

Page 108: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

We Prefer Sophistication

105

» the challenge for us was to scale ...without dropping features

Page 109: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

106

database (+opt. filesystem) (+ opt. full-text indexes)

Page 110: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

107

application

database (+opt. filesystem) (+ opt. full-text indexes)

cache

Page 111: NoSQL with Hadoop and HBase

cacheapplication

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

108

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 112: NoSQL with Hadoop and HBase

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

109

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 113: NoSQL with Hadoop and HBase

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

110

client

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 114: NoSQL with Hadoop and HBase

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

111

client (+cache)

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 115: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

What we found hard to scale

» access control

» facet browsing

» all the nifty stuff people were using our software for

» ... anything that required random accessto in-memory-cache data for computations

112

Page 116: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond the ‘scaling’ problem

» three-prong data layer

» result set merging (between MySQL & Lucene)» happened in appcode/memory

» ‘transactions’, set operations = hard

113

fs

Page 117: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase I

114

» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, ability to participate in the development thus drive the direction of the project

» some preference for a Java-based solution

Page 118: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase II

»After careful consideration, we realized the important choices were also:

» consistency: no chance of having two conflicting versions of a row

» atomic updates of a single row, single-row transactions

» bonus points for MapReduce integration» e.g. full-text index rebuilding

115

Page 119: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

That brought us to HBase, which bought us:

» a datamodel where you can have column families which keep all versions and others which do not, which fits very well on our CMS document model

» ordered tables with the ability to do range scans on them, which allows to build scalable indexes on top of it

» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar environment for us

116

Page 120: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 117

»OK, so now we had a data store !

Page 121: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 118

»However, content repository =store + search

ouch!

Page 122: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 119

That was

easy !

(however ...)

Page 123: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 120

Search ponderings

»CMS = two types of search

» structured search» numbers, strings» based on logic (SQL, anyone?)

» information retrieval (or: full-text search)» text» based on statistics

Page 124: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Search ponderings

»All of that, at scale

121

Page 125: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Structured Search

»HBase Indexing Library

» idea from Google App Engine datastore indexes

» http://code.google.com/appengine/articles/index_building.html

122

rowkey

A

B

col

val3

val2

col

foo6

foo7

content table index table A

rowkey

val2-B

val3-A

col

order

Page 126: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Full-text / IR search

» Lucene?

» no sharding (for scale)

» no replication (for availability)

» batched index updates (not real-time)

123

Page 127: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond Lucene» Katta

» scalable architecture, however only search, no indexing

» Elastic Search

» very young (sorry)

» hbasene et al.

» stores inverted index in HBase, might not scale all features

» SOLR

» widely used, schema, facets, query syntax, cloud branch

More info: http://lilycms.org/lily/prerelease/technology.html

124

Page 128: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 125

+?

=Easy ! O

r ?

Page 129: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 126

Remember distribution ?Remember secondary indexes ?

➙ Need for reliable queuing

Page 130: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 127

Connecting things

»we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)

» indexing, reindexing, mass reindexing (M/R)

»we need a reliable method of updating HBase secondary indexes

» all of that eventually to run distributed

» distribution means coping with failure

Page 131: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution

»ACMEMessageQueue ? Bzzzzzt.We wanted fault-safe HBase persistence for the queues.Also for ease of administration.

»➙ WAL & Queue implemented on top of HBase tables

128

Page 132: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

WAL / Queue

» WAL» guaranteed execution

of synchronous actions

» call doesn’t return before secondary action finishes

» e.g. update secondary actions

» if all goes well, size = #concurrent ops

» will be useful/made available outside of Lily context as well!

» Queue» triggering of async

actions

» e.g. (re)index (updated) record with SOLR back-end

» size depends on speed of back-end process

129

Page 133: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The Sum» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through SOLR

» using a WAL/Queue mechanismimplemented in HBase

» runtime based on Kauri

» with client/server comms via Avro

130

Page 134: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 131

Architecture

Page 135: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 132

Architecture

Page 136: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Roadmap

» June 7-8: release of learning material (architecture, model, API, Javadoc)➥ www.lilycms.org➥ bit.ly/lilyprerelease

»Tomorrow: WAL/queue

»Mid July = ‘proof of architecture’ release

» from there on, ca. 3-monthly releasesleading up to Lily 1.0

133

Nearly there!

Page 137: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 134

bit.ly/lilyprerelease

Page 138: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

License

»Apache

135

Page 139: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business model» Consulting, mentoring, turn-key projects

» audience: developers

» Strong focus on partner relations

» targeting vertical markets

» geographic coverage

» SaaS offerings

»Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data

»Not: OLAP

136

Page 140: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Reading material

»Amazon Dynamo, Google BigTable, CAP

» http://nosql.mypopescu.com/

» http://nosql-database.org/

» http://twitter.com/nosqlupdate

» http://highscalability.com/

137

Page 141: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Questions?

138

http://www.flickr.com/photos/leehaywood/4237636853/

Page 142: NoSQL with Hadoop and HBase

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 139

» [email protected]

» @stevenn

Thanks for your attention !