A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access Ben Stopford : RBS


This lecture was presented at UCL on the Financial Computing course in October 2011.


Page 1

A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access

Ben Stopford : RBS

Page 2

How fast is a HashMap lookup?

~20 ns

Page 3

That’s how long it takes light to travel a room

Page 4

How fast is a database lookup?

~20 ms

Page 5

That’s how long it takes light to go to Australia and back

Page 6

3 times
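The light-travel comparisons can be sanity-checked with a little arithmetic. The sketch below uses only the two lookup times quoted on the slides and the speed of light in a vacuum; the class and method names are illustrative. By this estimate light covers about 6 m in 20 ns and about 6,000 km in 20 ms, so the room comparison holds well, while the Australia comparison is best read as illustrative rather than exact. The point that survives either way is the ratio: the two lookups differ by a factor of about a million.

```java
// Rough check of the light-travel comparisons (names are illustrative).
final class LightTravel {
    // Speed of light in a vacuum, in metres per second.
    static final double C_METRES_PER_SECOND = 299_792_458.0;

    // Distance light covers in the given number of seconds.
    static double metresIn(double seconds) {
        return C_METRES_PER_SECOND * seconds;
    }

    public static void main(String[] args) {
        double hashMapLookup = 20e-9;  // ~20 ns, from the slide
        double databaseLookup = 20e-3; // ~20 ms, from the slide
        System.out.printf("20 ns: %.1f m%n", metresIn(hashMapLookup));
        System.out.printf("20 ms: %.0f km%n", metresIn(databaseLookup) / 1000.0);
        System.out.printf("ratio: %.0f%n", databaseLookup / hashMapLookup);
    }
}
```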

Page 7

Computers really are very fast!

Page 8

The problem is we’re quite good at writing software that slows them down

Page 9

Desktop Virtualization

Page 10

We love abstraction

Page 11

There are many reasons why abstraction is a good idea… performance just isn’t one of them

Page 12

Question: is it fair to compare a Database with a HashMap?

Page 13

Not really…

Page 14

Key Point

On one end of the scale sits the HashMap…

…on the other sits the database…

…but it’s a very, very long scale that sits between them.

Page 15

Times are changing

Page 16

Database Architecture is Aging

Page 17

Page 18

The Traditional Architecture

Page 19

[Diagram: architecture options, labelled Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory and Simpler Contract]

Page 20

Simplifying the Contract

Page 21

How big is the internet?

5 exabytes

(which is 5,000 petabytes or 5,000,000 terabytes)

Page 22

How big is an average enterprise database?

80% < 1TB (in 2009)

Page 23

Simplifying the Contract

Page 24

Databases have huge operational overheads

Taken from “OLTP Through the Looking Glass, and What We Found There”, Harizopoulos et al.

Page 25

Avoid that overhead by simplifying the contract and avoiding I/O

Page 26

Improving Database Performance: Shared Disk Architecture

Shared Disk

Page 27

Improving Database Performance: Shared Nothing Architecture

Page 28

Each machine is responsible for a subset of the records. Each record exists on only one machine.

[Diagram: a client and several machines, each holding a distinct set of record keys (1, 2, 3…; 97, 98, 99…; 169, 170…; 244, 245…; 333, 334…; 765, 769…)]
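The shared-nothing idea on this slide, where every record has exactly one owner, can be sketched in a few lines. This is a minimal illustration rather than any particular product's design: the `PartitionedStore` class is hypothetical, plain in-process maps stand in for machines, and hashing the key to choose an owner is an assumption (the slide does not say how keys are assigned).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shared-nothing store: each map stands in for one machine.
final class PartitionedStore {
    private final List<Map<Integer, String>> partitions = new ArrayList<>();

    PartitionedStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Every key maps to exactly one partition, so each record has a single owner.
    int ownerOf(int key) {
        return Math.floorMod(Integer.hashCode(key), partitions.size());
    }

    void put(int key, String value) {
        partitions.get(ownerOf(key)).put(key, value);
    }

    // The client routes the lookup straight to the owning partition.
    String get(int key) {
        return partitions.get(ownerOf(key)).get(key);
    }

    int sizeOf(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedStore store = new PartitionedStore(4);
        for (int key = 1; key <= 100; key++) {
            store.put(key, "record-" + key);
        }
        System.out.println(store.get(97) + " lives on partition " + store.ownerOf(97));
    }
}
```

Because the owner is a pure function of the key, a client needs no central directory to find a record, which is what lets the machines operate independently.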

Page 29

Improving Database Performance (3): In-Memory Databases (single address space)

Page 30

Databases must cache subsets of the data in memory

Cache

Page 31

Not knowing what you don’t know

Data on Disk

90% in Cache

Page 32

If you can fit it ALL in memory you know everything!!
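One way to read "not knowing what you don't know": with a partial cache, a miss is ambiguous, so the lookup has to fall through to disk; with a complete in-memory copy, a miss is a definitive answer. The sketch below illustrates that difference under simplifying assumptions: the `Disk` class is a hypothetical stand-in for slow storage that merely counts how often it is consulted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class CacheVersusFullCopy {
    // Stand-in for the slow on-disk store; counts how often it is consulted.
    static final class Disk {
        final Map<String, String> rows = new HashMap<>();
        int reads = 0;

        Optional<String> read(String key) {
            reads++;
            return Optional.ofNullable(rows.get(key));
        }
    }

    // Partial cache: a miss is ambiguous, so we must fall through to disk.
    static Optional<String> lookupViaCache(Map<String, String> cache, Disk disk, String key) {
        String hit = cache.get(key);
        return hit != null ? Optional.of(hit) : disk.read(key);
    }

    // Full in-memory copy: a miss is a definitive "does not exist".
    static Optional<String> lookupInMemory(Map<String, String> all, String key) {
        return Optional.ofNullable(all.get(key));
    }

    public static void main(String[] args) {
        Disk disk = new Disk();
        disk.rows.put("a", "1");
        disk.rows.put("b", "2");
        Map<String, String> cache = new HashMap<>();
        cache.put("a", "1"); // only part of the data is cached

        lookupViaCache(cache, disk, "missing");              // costs a disk read
        lookupInMemory(new HashMap<>(disk.rows), "missing"); // costs none
        System.out.println("disk reads: " + disk.reads);
    }
}
```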

Page 33

The architecture of an in memory database

Page 34

Memory is at least 100x faster than disk

[Figure: latency scale marked in ms, μs, ns and ps, comparing L1 Cache Ref, L2 Cache Ref, Main Memory Ref, 1MB Main Memory, Cross Network Round Trip, Cross Continental Round Trip and 1MB Disk/Network]

* L1 ref is about 2 clock cycles or 0.7ns. This is the time it takes light to travel 20cm

Page 35

Memory allows random access. Disk only works well for sequential reads

Page 36

This makes them very fast!!

Page 37

The proof is in the stats. TPC-H Benchmarks on a 1TB data set

Page 38

So why haven’t in memory databases taken off?

Page 39

Address spaces are relatively small and of a finite, fixed size

Page 40

Durability

Page 41

One solution is distribution

Page 42

Distributed In Memory (Shared Nothing)

Page 43

Again we spread our data but this time only using RAM.

[Diagram: a client and several machines, each holding a distinct set of record keys in RAM (1, 2, 3…; 97, 98, 99…; 169, 170…; 244, 245…; 333, 334…; 765, 769…)]

Page 44

Distribution solves our two problems

Page 45

We get massive amounts of parallel processing
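The parallelism comes from each node working on its own slice of the data at the same time, with only the small partial results being combined. The sketch below imitates that scatter-gather pattern inside one JVM: threads from a parallel stream stand in for separate machines, and the `countMatching` query is a made-up example, so it shows the shape of the computation rather than a real distributed query.

```java
import java.util.List;
import java.util.Map;

final class ScatterGather {
    // Each map stands in for the records held by one node.
    static long countMatching(List<Map<Integer, Integer>> partitions, int threshold) {
        return partitions.parallelStream()           // scatter: one task per partition
                .mapToLong(p -> p.values().stream()
                        .filter(v -> v > threshold)
                        .count())                    // local work on local data only
                .sum();                              // gather: combine the partial counts
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> partitions = List.of(
                Map.of(1, 10, 2, 50),
                Map.of(97, 70, 98, 5),
                Map.of(333, 90));
        System.out.println(countMatching(partitions, 40));
    }
}
```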

Page 46

But at the cost of losing the single address space

Page 47

[Diagram: architecture options, labelled Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory and Simpler Contract]

Page 48

There are three key themes here:

Distribution: Gain scalability through a distributed architecture.

Simplify the contract: Improve scalability by picking appropriate ACID properties.

No Disk: All data is held in RAM.

Page 49

ODC

Page 50

ODC – Distributed, Shared Nothing, In Memory, Semi-Normalised, Graph DB

450 processes

Messaging (Topic Based) as a system of record (persistence)

2TB of RAM
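The slide says only that topic-based messaging acts as the system of record. One plausible reading, sketched below, is that in-memory state is treated as a view that can be rebuilt by replaying the messages. Everything here is an assumption for illustration, not a description of how ODC actually works: `ReplayableStore` and `Update` are hypothetical names, and a plain list stands in for a durable topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplayableStore {
    // Hypothetical message shape: a key and its new value (null means delete).
    static final class Update {
        final String key;
        final String value;

        Update(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Update> topic;                           // stand-in for the durable topic
    private final Map<String, String> memory = new HashMap<>(); // volatile in-memory state

    // Starting a node replays every message published so far.
    ReplayableStore(List<Update> topic) {
        this.topic = topic;
        for (Update u : topic) {
            apply(u);
        }
    }

    void publish(String key, String value) {
        Update u = new Update(key, value);
        topic.add(u); // the message is the record of the change
        apply(u);     // memory is a view derived from the messages
    }

    private void apply(Update u) {
        if (u.value == null) {
            memory.remove(u.key);
        } else {
            memory.put(u.key, u.value);
        }
    }

    String get(String key) {
        return memory.get(key);
    }

    public static void main(String[] args) {
        List<Update> topic = new ArrayList<>();
        ReplayableStore node = new ReplayableStore(topic);
        node.publish("trade-1", "open");
        node.publish("trade-1", "settled");
        // Simulate losing the process: a fresh node rebuilds state from the topic.
        ReplayableStore restarted = new ReplayableStore(topic);
        System.out.println(restarted.get("trade-1"));
    }
}
```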

Page 51

ODC represents a balance between throughput and latency

Page 52

What is Latency?

Page 53

What is Throughput?

Page 54

Which is best for latency?

[Chart, labelled "Latency?", comparing Traditional Database, Shared Nothing (Distributed) and In-Memory Database]

Page 55

Which is best for throughput?

[Chart, labelled "Latency?", comparing Traditional Database, Shared Nothing (Distributed) and In-Memory Database]

Page 56

So why do we use distributed in memory?

[Diagram: In Memory paired with Latency; Plentiful hardware paired with Throughput]

Page 57

This is the technology of the now. So what is the technology of the future?

Page 58

Terabyte Memory Architectures

Page 59

Fast Persistent Storage

Page 60

New Innovations on the Horizon

Page 61

These factors are remolding the hardware landscape into one where memory is both vast and durable

Page 62

This is changing the way we write software

Page 63

Huge servers in the commodity space are driving us towards single process architectures that utilise many cores and large address spaces

Page 64

We can attain hundreds of thousands of executions per second from a single process if it is well optimised.

Page 65

“All computers wait at the same speed”

Page 66

We need to optimise for our CPU architecture

[Figure: latency scale marked in ms, μs, ns and ps, comparing L1 Cache Ref, L2 Cache Ref, Main Memory Ref, 1MB Main Memory, Cross Network Round Trip, Cross Continental Round Trip and 1MB Disk/Network]

* L1 ref is about 2 clock cycles or 0.7ns. This is the time it takes light to travel 20cm

Page 67

Tools like VTune allow us to optimise software to truly leverage our hardware

Page 68

So what does this all mean?

Page 69

Further Reading