ToroDB: a bridge between the NoSQL and Relational worlds

37
ToroDB A BRIDGE BETWEEN THE NOSQL AND RELATIONAL WORLDS Álvaro Hernández <[email protected]>

description

In the recent years, NoSQL databases have been gaining a lot of traction. Most of them haven been designed and written from scratch. Building on the principles of schema-less and high scalability, they offer a distinct approach to that of relational databases. But rather than re-using what the industry has learned in the last 3 decades of database development, most of these databases are re-inventing the wheel and designing the data storage layers -one of the toughest part when building a database- from scratch. ToroDB is a database that uses instead relational databases as well-known, durable, scalable and fast -despite what many would saystorage layers as a foundation to build a schema-less, document-oriented, scalable database. This project has been recently published as open-source software. It will effectively be the very first general-purpose database ever built in Spain. Document databases store documents, which are basically hierarchical, nested data structures of sets of key-value pairs. Current approaches to store them in relational databases are limited to storing documents in some form of binary serialization. We found is a set of algorithms to transform a document into a set of document-parts that can individually be stored in relational tables. This includes dynamic creation of tables, when needed, to match a table&#x27;s structure to that of the information to be stored. This means there is no engineering effort required in building the storage subsystem, which should handle durability, isolation and concurrency –all of which are tough properties to implement. But even more importantly, there are very significant performance advantages, both in query time and storage savings. Query time improves as queries targeting subsets of the documents (which are most of the queries) need only to address a subset of the data -as it is partitioned into tables- rather than reading the whole database. Storage savings are achieved by avoiding repetition of the schema of every document –many documents share the same schema (“structure”) but all them need to repeat that. Our benchmarks shows that JSON documents require in ToroDB 29% to 68% of the storage required for the same data on a MongoDB database. These means significant less I/O, significant less cost, and greater (vertical) scalability. This presentation shows how ToroDB works, how the JSON documents are split into tables. Why current document-oriented databases fail to maximize the performance of BigData requirements –ToroDB also includes a mechanism for storing in columnar format parts of the documents to improve aggregate-type queries, obtaining impressive performance benefits. And, finally, how this all can be done in a compatible way with existing systems: ToroDB includes a layer that natively speaks the MongoDB protocol, hence becoming a drop-in replacement for MongoDB installations, but running on top of existing relational databases.

Transcript of ToroDB: a bridge between the NoSQL and Relational worlds

Page 1: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDBA BRIDGE BETWEEN

THE NOSQL AND RELATIONAL WORLDS

Álvaro Hernández <[email protected]>

Page 2: ToroDB: a bridge between the NoSQL and Relational worlds

About *8Kdata*

● Research & Development in databases

● Consulting, Training and Support in PostgreSQL

● Founders of PostgreSQL España, 3rd largest PUG in the world (322 members as of today)

● About myself: CEO at 8Kdata:@ahachetehttp://linkd.in/1jhvzQ3

www.8kdata.com

Page 3: ToroDB: a bridge between the NoSQL and Relational worlds

How big is “NoSQL”?

Source: 451 Research

Page 4: ToroDB: a bridge between the NoSQL and Relational worlds

Why people want “NoSQL”?

● Schema-less

● High availability

● It's cool

Page 5: ToroDB: a bridge between the NoSQL and Relational worlds

The schema-less fallacy

{“name”: “Álvaro”,“surname”: “Hernández”,“height”: 200,“hobbies”: [

“PostgreSQL”, “triathlon”]

}

Page 6: ToroDB: a bridge between the NoSQL and Relational worlds

The schema-less fallacy

{“name”: “Álvaro”,“surname”: “Hernández”,“height”: 200,“hobbies”: [

“PostgreSQL”, “triathlon”]

}metadata → Isn't that... schema?

Page 7: ToroDB: a bridge between the NoSQL and Relational worlds

The schema-less fallacy: BSON

metadata → Isn't that... schema?

{“name”: (string) “Álvaro”,“surname”: (string) “Hernández”,“height”: (number) 200,“hobbies”: {

“0”: (string) “PostgreSQL” , “1”: (string) “triathlon”

}}

Page 8: ToroDB: a bridge between the NoSQL and Relational worlds

The schema-less fallacy

● It's not schema-less

● It is “attached-schema”

● It carries an overhead which is not 0

Page 9: ToroDB: a bridge between the NoSQL and Relational worlds

High availability: at what cost?

MongoDB:➔ Unacknowledged: 42% data loss➔ Safe: 37% data loss➔ Only majority is safe

http://aphyr.com/posts/284-call-me-maybe-mongodb

Jepsen!!! :)

Page 10: ToroDB: a bridge between the NoSQL and Relational worlds

More NoSQL struggle

● Durability is sometimes not guaranteed on a single node

● Programming for AP systems may be a big burden

● Most (all?) NoSQL databases wrote their storage from scratch. Journaling, concurrency are really hard

Page 11: ToroDB: a bridge between the NoSQL and Relational worlds

Can we do a better “NoSQL”?

● Document model is very appealing to many. Let's offer it

● DRY: why not use relational databases? They are proven, durable, concurrent and flexible

● Why not base it on relational databases, like PostgreSQL?

Page 12: ToroDB: a bridge between the NoSQL and Relational worlds

Schema-attached repetition

{ “a”: 1, “b”: 2 }{ “a”: 3 }{ “a”: 4, “c”: 5 }{ “a”: 6, “b”: 7 }{ “b”: 8 }{ “a”: 9, “b”: 10 }{ “a”: 11, “b”: 12, “j”: 13 }{ “a”: 14, “c”: 15 }

Counting “document types” in collections of millions: at most, 1000s of different types

Page 13: ToroDB: a bridge between the NoSQL and Relational worlds

Schema-attached repetition

How data is stored in schema-less

Page 14: ToroDB: a bridge between the NoSQL and Relational worlds

Pettus and BTP inspired us

https://wiki.postgresql.org/images/b/b4/Pg-as-nosql-pgday-fosdem-2013.pdfhttp://www.slideshare.net/nosys/billion-tables-project-nycpug-2013

Page 15: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB – Teaser https://flic.kr/p/9HzWhT

ToroDB

Page 16: ToroDB: a bridge between the NoSQL and Relational worlds

What is ToroDB

● Open source, document-oriented, JSON database that runs on top of PostgreSQL

● JSON documents are stored relationally, not as a blob: significant storage and I/O savings

● Wire-protocol compatibility with Mongo

Page 17: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB benefits

● 100% durable database

● High concurrency and performance

● Compatible with existing mongo API programs, clients

● Full set of JSON operations (MongoDB's “SELECT” API)

Page 18: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage

● Data is stored in tables

● JSON documents are split by hierarchy levels, and each (plain) level goes to a different table

● Subdocuments are classified by “type”, which maps to tables

Page 19: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage (II)

● A “structure” table keeps the subdocument “schema”

● Keys in JSON are mapped to attributes, which retain the original name

● Tables are created dinamically and transparently to match the exact types of the documents

Page 20: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage (III)

How data is stored in ToroDB

Page 21: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage internals

{ "name": "ToroDB", "data": { "a": 42, "b": "hello world!" }, "nested": { "j": 42, "deeper": { "a": 21, "b": "hello" } }}

Page 22: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage internals

The document is split into the following subdocuments:

{ "name": "ToroDB", "data": {}, "nested": {} }

{ "a": 42, "b": "hello world!"}

{ "j": 42, "deeper": {}}

{ "a": 21, "b": "hello"}

Page 23: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage internals

select * from demo.t_3┌─────┬───────┬────────────────────────────┬────────┐│ did │ index │ _id │ name │├─────┼───────┼────────────────────────────┼────────┤│ 0 │ ¤ │ \x5451a07de7032d23a908576d │ ToroDB │└─────┴───────┴────────────────────────────┴────────┘select * from demo.t_1┌─────┬───────┬────┬──────────────┐│ did │ index │ a │ b │├─────┼───────┼────┼──────────────┤│ 0 │ ¤ │ 42 │ hello world! ││ 0 │ 1 │ 21 │ hello │└─────┴───────┴────┴──────────────┘select * from demo.t_2┌─────┬───────┬────┐│ did │ index │ j │├─────┼───────┼────┤│ 0 │ ¤ │ 42 │└─────┴───────┴────┘

Page 24: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage internals

select * from demo.structures┌─────┬────────────────────────────────────────────────────────────────────────────┐│ sid │ _structure │├─────┼────────────────────────────────────────────────────────────────────────────┤│ 0 │ {"t": 2, "data": {"t": 1}, "nested": {"t": 3, "deeper": {"i": 1, "t": 1}}} │└─────┴────────────────────────────────────────────────────────────────────────────┘

select * from demo.root;┌─────┬─────┐│ did │ sid │├─────┼─────┤│ 0 │ 0 │└─────┴─────┘

Page 25: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB storage and I/O savings

29% - 68% storage required,compared to Mongo 2.6

Page 26: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB performance

Page 27: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB performance (II)

Page 28: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: query “by structure”

● ToroDB is effectively partitioning by type

● Structures (schemas, partitioning types) are cached in ToroDB memory

● Queries only scan a subset of the data.

● Negative queries are served directly from memory.

Page 29: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: Developer Preview

● ToroDB launched on October 2014, as a Developer Preview. Support for CRUD and most of the SELECT API

● github.com/torodb

● RERO policy. Comments, feedback, patches... greatly appreciated

● AGPLv3

Page 30: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: Developer Preview

● Clone the repo, build with Maven

● Or download the JAR:http://maven.torodb.com/release/com/torodb/torodb/0.11/torodb-0.11-jar-with-dependencies.jar

●Usage:java -jar torodb-version.jar –helpjava -jar torodb/target/torodb-version.jar -d dbname -u dbuser -P 27017Connect with normal mongo console!

Page 31: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: Community Response

Page 32: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: Community Response

Page 33: ToroDB: a bridge between the NoSQL and Relational worlds

ToroDB: Roadmap

● Current Developer Preview is single-node

● Version 1.0:➔ Expected Q1 2015➔ Production-ready➔ MongoDB Replication support (Paxos-based replication protocol?)

➔ Very high compatibility with Mongo API

Page 34: ToroDB: a bridge between the NoSQL and Relational worlds

Big Data speaking mongo:Vertical ToroDB

What if we use CitusData's cstore to store the JSON documents?

Page 35: ToroDB: a bridge between the NoSQL and Relational worlds

1.17% - 20.26% storage required,compared to Mongo 2.6

Big Data speaking mongo:Vertical ToroDB

Page 36: ToroDB: a bridge between the NoSQL and Relational worlds

“Software acknowledgements”

● PostgreSQL!

● The Netty framework

● jOOQ

● Guava, guice, findbugs

● Hikari CP

Page 37: ToroDB: a bridge between the NoSQL and Relational worlds