SQLFire at VMworld Europe 2011

© 2009 VMware Inc. All rights reserved Managing High Performance Data with vFabric SQLFire Carter Shanklin Product Manager for vFabric

description

SQLFire is a high-performance, memory-optimized distributed SQL database. SQLFire databases run on multiple servers simultaneously but present a standard SQL interface to client applications and appear to be just one database. SQLFire also makes it easy to add or remove servers at any time, which makes redundancy and elastic scaling very simple. This presentation has an overview of SQLFire as well as a walkthrough of the SQL extensions SQLFire uses to create a real distributed SQL database. Importantly, all of the extensions are in the way tables are defined (i.e. the DDL commands) rather than extensions to data inserts or queries, so clients are completely unaware of SQLFire's distributed nature.

Transcript of SQLFire at VMworld Europe 2011

Page 1: SQLFire at VMworld Europe 2011

© 2009 VMware Inc. All rights reserved

Managing High Performance Data with vFabric SQLFire

Carter Shanklin, Product Manager for vFabric

Page 2: SQLFire at VMworld Europe 2011

Agenda

What is SQLFire?

Why SQL vs. NoSQL

Why SQLFire versus other SQL databases

SQLFire features + Demo

How SQLFire Scales

• Hash partitioning

• Entity groups and collocation

• Data-aware stored procedures

Consistency model

Shared nothing persistence

Page 3: SQLFire at VMworld Europe 2011

What is vFabric SQLFire?

SQLFire is a memory-optimized distributed SQL database.

SQLFire attacks scalability challenges in two ways:

• Relaxes ACID semantics somewhat (in transactions and in replication)

• Horizontally scalable. Add capacity by adding nodes.

SQLFire has built-in high availability and native support for replication to multiple datacenters.

SQLFire provides a real SQL interface. Ships with JDBC and ADO.NET bindings with more to come.

SQLFire can also be used as a cache in front of other databases.

Page 4: SQLFire at VMworld Europe 2011

SQLFire at-a-glance

• As data changes, subscribers are pushed notification events

• Data transparently replicated and/or partitioned; redundant storage can be in memory and/or on disk

• Many physical machine nodes appear as one logical system

• Synchronous read-through, write-through or asynchronous write-behind to other data sources and sinks (other databases, file system)

• Increase/decrease capacity on the fly

• Client access: Java Client, C# Client (JDBC or ADO.NET)

• Shared-nothing disk persistence: each cache instance can optionally persist to disk

Page 5: SQLFire at VMworld Europe 2011

The database world is changing.

Many new data models (NoSQL) are emerging

• Key-value

• Column family (inspired by Google BigTable)

• Document

• Graph

Most focus on making model less rigid than SQL

Consistency model is not ACID

Different tradeoffs for different goals

Low scale: STRICT – Full ACID (RDB)
High scale: Tunable Consistency
Very high scale: Eventual

Page 6: SQLFire at VMworld Europe 2011

SQLFire Versus NoSQL

Attribute | NoSQL | SQLFire
DB Interface | Idiosyncratic (i.e. each is custom). | Standard SQL.
Querying | Idiosyncratic or not present. | SQL queries.
Data Consistency | Tunable, most favor eventual consistency. | Tunable, favors high consistency.
Transactions | Weak or not present. | Linearly scalable transaction model.
Interface Design | Designed for simplicity. | Designed for compatibility.
Data Model | Wide variety of different models. | Relational model.
Schema Flexibility | Focus on extreme flexibility, dynamism. | SQL model, requires DB migrations, etc.

Page 7: SQLFire at VMworld Europe 2011

SQLFire Versus Other SQL Databases

Attribute | SQLFire | Other SQL DBs
DB Interface | Standard SQL. | Standard SQL.
Data Consistency | Tunable. Mix of eventual consistency and high consistency. | High consistency.
Transactions | Supported. | Very strong support.
Scaling Model | Scale out, commodity servers. | Scale up.

Page 8: SQLFire at VMworld Europe 2011

SQLFire challenges traditional DB design, not SQL

Too much I/O

Design roots don’t necessarily apply today

• Too much focus on ACID

• Disk synchronization bottlenecks

First write to LOG

Second write to data files

Buffers primarily tuned for I/O

Page 9: SQLFire at VMworld Europe 2011

SQLFire 1.0 Notable Features

Horizontally scalable with Partitioning and Replication

Multiple Topologies

• Client/Server, Asynchronous replication over WAN

Queries

• Distributed and memory-optimized

Procedures and Functions

• Standard Java stored procedures with “data awareness”

Caching

• Loader, writers, Eviction, Overflow and Expiration

Event framework

• Listeners, triggers, Asynchronous write behind

Command line tools

Manageability, Security

Page 10: SQLFire at VMworld Europe 2011

Scaling SQLFire

Partitioning & Replication

Page 11: SQLFire at VMworld Europe 2011

How SQLFire scales a common DB schema.

FLIGHTS
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEPART_TIME TIME, …
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTAVAILABILITY
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  FLIGHT_DATE DATE NOT NULL,
  ECONOMY_SEATS_TAKEN INTEGER, …
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE)
  FOREIGN KEY (FLIGHT_ID, SEGMENT_NUMBER) REFERENCES FLIGHTS (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTHISTORY
  FLIGHT_ID CHAR(6),
  SEGMENT_NUMBER INTEGER,
  ORIG_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  DEST_AIRPORT CHAR(3), …

Relationship labels in the diagram: 1 – M, 1 – 1

SEVERAL CODE/DIMENSION TABLES
  AIRLINES: AIRLINE INFORMATION (VERY STATIC)
  COUNTRIES: LIST OF COUNTRIES SERVED BY FLIGHTS
  CITIES:
  MAPS: PHOTOS OF REGIONS SERVED

Assume thousands of FLIGHTS rows and millions of FLIGHTAVAILABILITY records.

Page 12: SQLFire at VMworld Europe 2011

Creating Tables

CREATE TABLE AIRLINES (
  AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
  AIRLINE_FULL VARCHAR(24),
  BASIC_RATE DOUBLE PRECISION,
  DISTANCE_DISCOUNT DOUBLE PRECISION, …
);

(Diagram: Table; three SQLF servers.)

Page 13: SQLFire at VMworld Europe 2011

Replicated Tables

CREATE TABLE AIRLINES (
  AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
  AIRLINE_FULL VARCHAR(24),
  BASIC_RATE DOUBLE PRECISION,
  DISTANCE_DISCOUNT DOUBLE PRECISION, …
) REPLICATE;

(Diagram: three SQLF servers, each holding a Replicated Table.)

Page 14: SQLFire at VMworld Europe 2011

Partitioned Tables

CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL
) PARTITION BY COLUMN (FLIGHT_ID);

(Diagram: three SQLF servers, each holding a Partitioned Table and a Replicated Table.)

Page 15: SQLFire at VMworld Europe 2011

Partition Redundancy

CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL
) PARTITION BY COLUMN (FLIGHT_ID) REDUNDANCY 1;

(Diagram: three SQLF servers, each holding a Partitioned Table, a Redundant Partition, and a Replicated Table.)
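To make the effect of REDUNDANCY 1 concrete, here is a minimal Python sketch of bucket placement. It is an illustration only: the round-robin placement rule, the bucket count, and the server names are assumptions for this example and are not taken from SQLFire.

```python
def place_buckets(num_buckets, servers, redundancy):
    """Assign each bucket a primary plus `redundancy` extra copies on distinct servers.

    Round-robin placement is an assumption made for this sketch.
    """
    if redundancy + 1 > len(servers):
        raise ValueError("need at least redundancy + 1 servers")
    placement = {}
    for bucket in range(num_buckets):
        # The first entry acts as the primary; the rest are redundant copies.
        placement[bucket] = [
            servers[(bucket + i) % len(servers)] for i in range(redundancy + 1)
        ]
    return placement


servers = ["sqlf-1", "sqlf-2", "sqlf-3"]  # hypothetical member names
placement = place_buckets(6, servers, redundancy=1)
```

With one redundant copy per bucket, every bucket remains available on some server after any single server is lost.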

Page 16: SQLFire at VMworld Europe 2011

Partition Colocation

CREATE TABLE FLIGHTAVAILABILITY (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  FLIGHT_DATE DATE NOT NULL,
  ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …
) PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS);

(Diagram: three SQLF servers, each holding a Partitioned Table, a Colocated Partition, a Redundant Partition, and a Replicated Table.)

Page 17: SQLFire at VMworld Europe 2011

Persistent Tables

CREATE TABLE FLIGHTAVAILABILITY (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  FLIGHT_DATE DATE NOT NULL,
  ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …
) PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS) PERSISTENT persistentStore ASYNCHRONOUS;

(Diagram: three SQLF servers, each holding a Partitioned Table, a Colocated Partition, a Redundant Partition, and a Replicated Table.)

sqlf backup /export/fileServerDirectory/sqlfireBackupLocation

Data dictionary is always persisted in each server

Page 18: SQLFire at VMworld Europe 2011

Demo Scaling with partitioned tables.

(Schema diagram repeated from page 11: FLIGHTS, FLIGHTAVAILABILITY, FLIGHTHISTORY, and the code/dimension tables.)

Page 19: SQLFire at VMworld Europe 2011

Hash partitioning for linear scaling

Key hashing provides single-hop access to a row's partition.

But what if the access is not based on the key … say, joins are involved?
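The single-hop idea can be sketched in a few lines of Python. This is only an illustration: the CRC32 hash, the modulo mapping, and the server names are assumptions for the example, not SQLFire's actual partitioning function.

```python
import zlib

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical member names


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partitioning key to a partition index using a stable hash."""
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def owner_of(flight_id: str) -> str:
    """Return the single server that owns the partition for this key."""
    return SERVERS[partition_for(flight_id, len(SERVERS))]
```

Because any client or server can compute `owner_of` locally, a lookup by key goes straight to the owning server. A query that does not supply the key gives the router nothing to hash, which is the problem the next slides address.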

Page 20: SQLFire at VMworld Europe 2011

Pure hash-based partitioning will only get you so far.

Consider this query:

select * from flights, flightAvailability
where flights.id = flightAvailability.flightid
and flights.fromAirport = 'CPH';

If both tables are simply hash-partitioned, the join logic must execute on all nodes where flightAvailability data is stored.

This will not scale.

• Joins across distributed nodes could involve distributed locks and potentially a lot of intermediate data transfer across nodes

Page 21: SQLFire at VMworld Europe 2011

To scale we need partition-aware DB design.

DB architect must think about how data maps to partitions.

The main idea is to:

• minimize excessive data distribution by keeping the most frequently accessed and joined data collocated on partitions.

Read Pat Helland’s “Life beyond Distributed Transactions” and the Google Megastore paper.

Page 22: SQLFire at VMworld Europe 2011

SQLFire allows partition-aware design with the COLOCATE WITH clause.

Entity Groups

Table FlightAvailability partitioned by FlightID colocated with Flights

FlightID is the entity group Key

Page 23: SQLFire at VMworld Europe 2011

Solving this scalability problem with SQLFire.

Create flightAvailability as follows:

CREATE TABLE flightAvailability ( … )
PARTITION BY COLUMN (flightid) COLOCATE WITH (flights);

Re-run the query:

select * from flights, flightAvailability
where flights.id = flightAvailability.flightid
and flights.fromAirport = 'CPH';

The query is restricted to nodes containing flights with CPH as the fromAirport.
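The effect of colocation on this join can be sketched in Python. Everything here is illustrative: the sample rows, the node count, and the CRC32-based placement are assumptions, not SQLFire internals. The point is that when both tables are placed by the same function of the flight ID, each node can join its own rows without fetching anything from another node.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def node_for(flight_id):
    """Both tables use this same placement function, which is what colocation means here."""
    return zlib.crc32(flight_id.encode("utf-8")) % NUM_NODES


# Hypothetical sample rows: (flight id, fromAirport) and (flight id, date, seats taken).
flights = [("AA1111", "CPH"), ("AA2222", "SFO"), ("BA3333", "CPH")]
availability = [
    ("AA1111", "2011-10-18", 12),
    ("AA1111", "2011-10-19", 3),
    ("AA2222", "2011-10-18", 40),
    ("BA3333", "2011-10-20", 7),
]

nodes = defaultdict(lambda: {"flights": [], "availability": []})
for row in flights:
    nodes[node_for(row[0])]["flights"].append(row)
for row in availability:
    nodes[node_for(row[0])]["availability"].append(row)


def local_join(node, from_airport):
    """Join using only the rows stored on one node."""
    wanted = {fid for fid, airport in node["flights"] if airport == from_airport}
    return [row for row in node["availability"] if row[0] in wanted]


# Each node answers from local data; the coordinator only concatenates results.
result = sorted(r for node in nodes.values() for r in local_join(node, "CPH"))
```

Nodes that hold no CPH flights return an empty list immediately, so the useful work happens only where matching flights live.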

Page 24: SQLFire at VMworld Europe 2011

More about partition-aware database design.

OLTP systems tend to be partitionable.

• Typically it is the number of entities that grows over time and not the size of the entity.

– Customer count perpetually grows, not the size of the customer info

• Most often access is very restricted and based on select entities

– given a FlightID, fetch flightAvailability records

– given a customerID, add/remove orders, shipment records

Identify partition key for “Entity Group”

• "entity groups": set of entities across several related tables that can all share a single identifier

– flightID is shared between the parent and child tables

– CustomerID shared between customer, order and shipment tables

Page 25: SQLFire at VMworld Europe 2011

Scaling Application logic with Parallel “Data Aware procedures”

Page 26: SQLFire at VMworld Europe 2011

Stored Procedures in SQLFire.

SQLFire stored procedures.

• Written in pure Java rather than proprietary extensions.

• Created and defined based on SQL standards.

• Support “data awareness” and run only on nodes where applicable data resides.

• They support a map/reduce-like execution style.

Benefits:

• Write procedures in pure Java or take advantage of existing Java libraries.

• Easily take advantage of SQLFire as a highly scalable distributed system.

Page 27: SQLFire at VMworld Europe 2011

Procedures

Java Stored Procedures may be created according to the SQL Standard

SQLFire also supports the JDBC type Types.JAVA_OBJECT. A parameter of type JAVA_OBJECT supports an arbitrary Serializable Java object.

In this case, the procedure will be executed on the server to which a client is connected (or locally for Peer Clients)

CREATE PROCEDURE getOverBookedFlights
  (IN argument OBJECT, OUT result OBJECT)
  LANGUAGE JAVA PARAMETER STYLE JAVA
  READS SQL DATA DYNAMIC RESULT SETS 1
  EXTERNAL NAME com.acme.OverBookedFLights;

Page 28: SQLFire at VMworld Europe 2011

Data Aware Procedures

Parallelize procedure and prune to nodes with required data

Extend the procedure call with the following syntax:

CALL [PROCEDURE] procedure_name
  ( [ expression [, expression ]* ] )
  [ WITH RESULT PROCESSOR processor_name ]
  [ { ON TABLE table_name [ WHERE whereClause ] } |
    { ON { ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) } }
  ]

Hint the data the procedure depends on:

CALL getOverBookedFlights(<bind arguments>)
  ON TABLE FLIGHTAVAILABILITY
  WHERE FLIGHTID = <SomeFLIGHTID>;

If the table is partitioned by columns in the WHERE clause, the procedure execution is pruned to nodes with the data (the node with <SomeFLIGHTID> in this case).

(Diagram: Client, Fabric Server 1, Fabric Server 2.)
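The pruning rule can be sketched as a small routing function in Python. This is a simplified illustration: the server names, the CRC32 placement, and the representation of the WHERE clause as a dictionary of equality predicates are all assumptions for the example.

```python
import zlib

SERVERS = ["fabric-server-1", "fabric-server-2"]  # hypothetical member names


def target_servers(partition_column, where):
    """Pick the servers a procedure call should run on.

    `where` is a dict of column -> value equality predicates (a simplification).
    """
    if partition_column in where:
        # The predicate pins the partitioning key, so only one server holds the data.
        key = str(where[partition_column]).encode("utf-8")
        return [SERVERS[zlib.crc32(key) % len(SERVERS)]]
    # No predicate on the partitioning column: the call cannot be pruned.
    return list(SERVERS)


pruned = target_servers("FLIGHT_ID", {"FLIGHT_ID": "AA1111"})
unpruned = target_servers("FLIGHT_ID", {"FLIGHT_DATE": "2011-10-18"})
```

A call that filters on the partitioning column runs on one server, while a call that filters on any other column fans out to every server.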

Page 29: SQLFire at VMworld Europe 2011

Parallelize procedure then aggregate (reduce)

CALL [PROCEDURE] procedure_name
  ( [ expression [, expression ]* ] )
  [ WITH RESULT PROCESSOR processor_name ]
  [ { ON TABLE table_name [ WHERE whereClause ] } |
    { ON { ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) } }
  ]

Register a Java Result Processor (optional in some cases):

CALL SQLF.CreateResultProcessor(processor_name, processor_class_name);

(Diagram: Client, Fabric Server 1, Fabric Server 2, Fabric Server 3.)
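The parallelize-then-aggregate pattern resembles a small map/reduce. The Python sketch below illustrates the shape only: the per-server data, the `over_booked` procedure body, and the `merge_results` processor are hypothetical, and threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local data per server: (flight id, seats taken, capacity).
local_data = {
    "fabric-server-1": [("AA1111", 120, 100), ("AA2222", 80, 100)],
    "fabric-server-2": [("BA3333", 150, 140)],
    "fabric-server-3": [("UA4444", 60, 90)],
}


def over_booked(rows):
    """The 'procedure': runs against one server's local rows only (the map step)."""
    return [fid for fid, taken, capacity in rows if taken > capacity]


def merge_results(partials):
    """The 'result processor': combines per-server results (the reduce step)."""
    merged = []
    for partial in partials:
        merged.extend(partial)
    return sorted(merged)


with ThreadPoolExecutor() as pool:
    partials = list(pool.map(over_booked, local_data.values()))

overbooked = merge_results(partials)
```

Each server returns only its own small result, and the processor on the calling side does the final merge.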

Page 30: SQLFire at VMworld Europe 2011

Consistency model

Page 31: SQLFire at VMworld Europe 2011

Consistency Model without Transactions

Replication within cluster is always eager and synchronous

Row updates are always atomic; No need to use transactions

FIFO consistency: writes performed by a single thread are seen by all other processes in the order in which they were issued

Consistency in Partitioned tables

• A partitioned table row is owned by one member at any point in time

• All updates are serialized to replicas through the owner

• "Total ordering" at a row level: atomic and isolated

Membership changes and consistency – need another hour

Pessimistic concurrency support using ‘Select for update’

Support for referential integrity

Page 32: SQLFire at VMworld Europe 2011

SQLFire Transactions

Highly scalable without any centralized coordinator or lock manager

We make some important assumptions

• Most OLTP transactions are small in duration and size

• Write-write conflicts are very rare in practice

How does it work?

• Each data node has a sub-coordinator to track TX state

• Eagerly acquire local “write” locks on each replica

– Object owned by a single primary at a point in time

• Fail fast if lock cannot be obtained

Atomic and works with the cluster Failure detection system

Isolated until commit

• Only support local isolation during commit
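The eager-lock, fail-fast behavior can be sketched with a tiny lock table in Python. This is a single-process illustration under stated assumptions: the `LockTable` and `WriteConflict` names are hypothetical, and a real system would need per-node coordination and thread safety that this sketch omits.

```python
class WriteConflict(Exception):
    """Raised immediately when a row is already write-locked by another transaction."""


class LockTable:
    """Per-node table of row write locks (single-threaded sketch)."""

    def __init__(self):
        self._owners = {}  # row key -> transaction id holding the write lock

    def try_lock(self, row_key, tx_id):
        # Take the lock if it is free; never wait if another transaction holds it.
        owner = self._owners.setdefault(row_key, tx_id)
        if owner != tx_id:
            raise WriteConflict(f"row {row_key!r} is locked by {owner}")

    def release_all(self, tx_id):
        # Called at commit or rollback to drop every lock held by the transaction.
        for key in [k for k, o in self._owners.items() if o == tx_id]:
            del self._owners[key]


locks = LockTable()
locks.try_lock(("AA1111", 1), "tx-1")
try:
    locks.try_lock(("AA1111", 1), "tx-2")
    conflict = False
except WriteConflict:
    conflict = True  # tx-2 fails fast instead of blocking
locks.release_all("tx-1")
locks.try_lock(("AA1111", 1), "tx-2")  # succeeds once tx-1 is done
```

Failing fast avoids waiting on locks, which fits the stated assumption that write-write conflicts are rare.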

Page 33: SQLFire at VMworld Europe 2011

Scaling disk access with shared nothing disk files and a “journaling” store design

Page 34: SQLFire at VMworld Europe 2011

Disk persistence in SQLF

Parallel log structured storage

Each partition writes in parallel

Backups write to disk also

• Increase reliability against h/w loss

(Diagram, repeated twice in the slide: Memory Tables, append-only operation logs holding Record1, Record2, Record3, OS buffers, and a LOG compressor.)

• Don’t seek to disk

• Don’t flush all the way to disk

– Use OS scheduler to time write

• Do this on primary + secondary

• Realize very high throughput
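The append-only idea can be sketched as a toy store in Python. It is a sketch under assumptions: the `AppendOnlyStore` class, the JSON record format, and the file name are invented for illustration and do not reflect SQLFire's on-disk format. Writes are appended and handed to OS buffers without forcing a sync, and recovery replays the log in order.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """In-memory table backed by an append-only operation log (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.table = {}
        self._log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        # Append the operation; never seek back to overwrite an earlier record.
        self._log.write(json.dumps({"k": key, "v": value}) + "\n")
        # flush() hands the bytes to the OS; no os.fsync(), so the OS schedules the disk write.
        self._log.flush()
        self.table[key] = value

    def close(self):
        self._log.close()

    @classmethod
    def recover(cls, path):
        """Rebuild the in-memory table by replaying the log; later records win."""
        store = cls(path)
        with open(path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                store.table[record["k"]] = record["v"]
        return store


log_path = os.path.join(tempfile.mkdtemp(), "flights.oplog")  # hypothetical file name
store = AppendOnlyStore(log_path)
store.put("AA1111", {"economy_seats_taken": 10})
store.put("AA1111", {"economy_seats_taken": 11})
store.close()

recovered = AppendOnlyStore.recover(log_path)
recovered.close()
```

Skipping the forced sync trades a small durability window on one machine for throughput, which is why the slide pairs it with writing on both primary and secondary.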

Page 35: SQLFire at VMworld Europe 2011

Performance benchmark

Page 36: SQLFire at VMworld Europe 2011

How does it perform? Scale?

Scale from 2 to 10 servers (one per host)

Scale from 200 to 1200 simulated clients (10 hosts)

Single partitioned table: int PK, 40 fields (20 ints, 20 strings)

Page 37: SQLFire at VMworld Europe 2011

How does it perform? Scale?

CPU% remained low per server – about 30% indicating many more clients could be handled

Page 38: SQLFire at VMworld Europe 2011

Is latency low with scale?

Latency decreases with server capacity

50-70% take < 1 millisecond

About 90% take less than 2 milliseconds

Page 39: SQLFire at VMworld Europe 2011

Q & A

SQLFire beta available now

http://vmware.com/go/sqlfire