Franz Faerber, Michael Muehle DB development, SAP AG … · 2013-07-11 · Main memory column and...

Main memory column and row store for Enterprise applications Franz Faerber, Michael Muehle DB development, SAP AG April 2011


Main memory column and row store

for Enterprise applications

Franz Faerber, Michael Muehle DB development, SAP AG

April 2011

© 2011 SAP AG. All rights reserved. 2 Confidential

Why In-Memory?

Information at the Speed of Thought

Imagine access to business data,

supporting analytical queries on the transactional data,

no need for data replication, and zero response time

How would this change the way you work with data?

How would this change your perception of information availability?

How would this new information influence your decisions?

In-memory databases are a technology

with huge disruption potential,

we need to assume a competitive leadership position.


Motivation:

Make Use of Modern Hardware

Past

CPU clock rate growing

Software runs faster without change

Today

CPU clock rate growth is flat

Number of cores increases

Write software that scales with number of cores!

Game Changing Hardware Trends

Disk is Tape

Main memory affordable

CPU: more cores, no clock rate increase

Software makers need to react


Massive amount of memory and parallel processing power

for under $1M

1 server ...8 CPUs, each CPU with 8 cores for $100,000

can execute 64 threads in parallel

can hold memory modules adding up to ~2TB today

8 Server blades ...for less than $1M

can execute 512 threads in parallel


can hold memory modules adding up to ~16TB

100GB/sec data throughput per server


Simplified Memory Hierarchy

(based on Intel Nehalem)

CPU with four cores; each core has its own 1st and 2nd level cache, the 3rd level cache is shared

Level: size, latency

1st Level Cache (per core): 64KB, ~4 cycles, 2 ns

2nd Level Cache (per core): 256KB, ~10 cycles, 5 ns

Shared 3rd Level Cache: 8MB, 35-40+ cycles, 20 ns

Main Memory: several GBs up to TBs, ~200 cycles, ~100 ns

Disk: several TB, several million cycles, several ms


Main Memory Bottleneck

Disk is Tape

10^5 times slower compared to main memory access

No disk access during normal operation

Use disk as kind of “archive”

Memory

Price / MB RAM is going down

64 bit architectures make maximum limit of 4GB main memory obsolete

Memory Access

Access to main memory is not arbitrarily fast

Memory bandwidth increases but memory access latency remains the same

Cache misses limit performance

Traditional database: CPU spends 50% of time waiting after cache misses (1999)

Transfer between memory and cache is block-wise (cache lines), for example 64 bytes

Main memory access benefits from data locality, random access is slow

[Figure: CPU with cores and CPU cache, main memory, disk]

Performance bottleneck in the past: Disk I/O

Performance bottleneck today: CPU waiting for data to be loaded from memory into cache


“In-Memory” Computing


Keep all required data in main memory

Rarely accessed data can be moved to disk

(e.g. after 1 year)

Compress data in memory

Disk I/O is no longer an optimization target

Cache sensitive data layout

High locality

(data that is needed together is stored together)

Compression (decompress in cache)

Parallelization

Data structures allow splitting into pieces that can be processed in parallel

Avoid locks

Unify OLAP and OLTP systems: combine column and row store

Column store for reporting-like access

Row store for OLTP-like access


Row Based and Column Based Data Storage

Country  Product  Sales
US       Alpha    3.000
US       Beta     1.250
JP       Alpha    700
UK       Alpha    450

Row-Based:

US Alpha 3.000 US Beta 1.250 JP Alpha 700 UK Alpha 450 …

Aggregate sales figures: Random access!

Sum(sales):

Sales: 4 byte, country: 2 byte, product: 10 byte

Cache line: 64 byte → 48 byte waste = 75%

Columnar:

US US JP UK … Alpha Beta Alpha Alpha … 3.000 1.250 700 450 …

Data stored in contiguous memory

High locality
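The 75% figure can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming packed 16-byte records and reading the sales figures as whole numbers (3.000 as 3000); the function and variable names are illustrative and not part of the slides.

```python
# Field widths in bytes, taken from the slide; records are assumed to be packed.
FIELD_BYTES = {"country": 2, "product": 10, "sales": 4}
CACHE_LINE_BYTES = 64


def wasted_fraction(needed_field: str, columnar: bool) -> float:
    """Fraction of each loaded cache line that a scan of one field does not use."""
    if columnar:
        # A column is one contiguous array of a single field: every byte is useful.
        return 0.0
    row_bytes = sum(FIELD_BYTES.values())           # 16 bytes per record
    rows_per_line = CACHE_LINE_BYTES // row_bytes   # 4 records per cache line
    useful = rows_per_line * FIELD_BYTES[needed_field]
    return (CACHE_LINE_BYTES - useful) / CACHE_LINE_BYTES


# The four example records, stored row by row.
rows = [("US", "Alpha", 3000), ("US", "Beta", 1250),
        ("JP", "Alpha", 700), ("UK", "Alpha", 450)]

# The same data stored column by column.
country, product, sales = (list(col) for col in zip(*rows))

total_row_store = sum(record[2] for record in rows)  # touches every record
total_column_store = sum(sales)                      # touches one array only

print(wasted_fraction("sales", columnar=False))  # 0.75
```

Both layouts return the same sum; the difference is only how much of each fetched cache line carries sales values.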


Row Based and Column Based Data Storage

Column Store

Many calculations on single or few columns only

Searches based on values of a few columns

Big number of columns

Big number of rows and columnar operations are required

aggregate, scan, etc.

High compression rates possible

Most columns contain only few distinct values

Row Store

Application often needs to process single records at one time

many selects and /or updates of single records

Application typically needs to access the complete record

Columns contain mainly distinct values

Aggregations and fast searching not required

Small number of rows (e.g. configuration tables)


In-Memory Computing Engine

Overall Architecture

[Architecture diagram: NewDB]

NewDB Clients (Application Server, Analytics Technology, etc.)

Connection and Session Management: Session Parameters

Request Processing and Execution Control: Request Parser, SQL Script, MDX, Planning Engine, Calc Engine, Optimizer, Execution Layer

Transaction Manager, Metadata Manager, Authorization Manager

Relational Engines: In Memory Row Store, In Memory Column Store, In Memory Object Store, Disk Based Store, Temporary Results

Data Aging Manager

Persistence Layer: Page Management, Logger

Disk Storage: Data Volumes, Transaction Log Volumes


SAP In-Memory Computing Engine (NewDB)

High Performance and Scalability

Compression, cache-aware storage

Executing application logic inside database

Distribution

Reducing Complexity and Cost

Hybrid: column based, row based (future: object store and disk based) in one system

Analytics and OLTP in one system

Support for OLAP

MDX, calculation engine, OLAP views

Compatibility and Standard DBMS Features

SQL, ACID

Run SAP applications that use Open SQL without changes

Multi Tenant Support

Support For Temporal Tables

“time travel”, time slider applications, reporting using historical data, change recording

Main memory text search engine


“In-Memory” Computing

Business Database

Move data-intensive operations to the data layer

Do the relational operations at database level (e.g. select for all entries)

Hierarchy handling (e.g. HR cost center hierarchy)

Reporting and planning functionality

Planning engine

Transactional behaviour

Unify DB and application transaction

Push snapshot isolation to the database / remove transactional buffer in the appserver layer

Business Function Library

Provide business functions in the DB layer like

– Currency /Unit conversion

– Date / time /fiscal period /calendar calculation

– Statistical functionality

Deep integration with the application server

Fast communication layer

SQL extensions (SQL script)

New data types (text, GUID, …)


Temporal Tables

Temporal Table (Logical View)

Initial state:

Row  ID   Text   Size  Valid-From        Valid-To
1    546  Shirt  XL    2010-09-11 08:30
2    712  Shoe   M     2010-10-06 11:42
3    913  Hat    L     2010-10-15 16:11

UPDATE article SET SIZE = 'S' WHERE ID = '712'

Row  ID   Text   Size  Valid-From        Valid-To
1    546  Shirt  XL    2010-09-11 08:30
2    712  Shoe   M     2010-10-06 11:42  2010-10-26 17:05
3    913  Hat    L     2010-10-15 16:11
4    712  Hat    S     2010-10-26 17:05  ∞

DELETE FROM article WHERE ID = '546'

Row  ID   Text   Size  Valid-From        Valid-To
1    546  Shirt  XL    2010-09-11 08:30  2010-10-26 17:30  (historical)
2    712  Shoe   M     2010-10-06 11:42  2010-10-26 17:05  (historical)
3    913  Hat    L     2010-10-15 16:11
4    712  Hat    S     2010-10-26 17:05  ∞                 (insert)

Currently only in column store
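The valid-from / valid-to bookkeeping can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the class and method names are hypothetical, `None` stands for the open end (∞), and an update is modelled as "close the current version, append a new one".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Version:
    id: str
    text: str
    size: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None stands for "∞" (still current)


class TemporalTable:
    """Hypothetical in-memory model of a temporal table."""

    def __init__(self) -> None:
        self.versions: List[Version] = []

    def _current(self, id_: str) -> Optional[Version]:
        for version in self.versions:
            if version.id == id_ and version.valid_to is None:
                return version
        return None

    def insert(self, id_: str, text: str, size: str, at: datetime) -> None:
        self.versions.append(Version(id_, text, size, at))

    def update_size(self, id_: str, size: str, at: datetime) -> None:
        old = self._current(id_)
        if old is None:
            return
        old.valid_to = at  # the old version becomes historical
        self.versions.append(Version(id_, old.text, size, at))

    def delete(self, id_: str, at: datetime) -> None:
        old = self._current(id_)
        if old is not None:
            old.valid_to = at  # nothing is physically removed

    def as_of(self, at: datetime) -> List[Version]:
        """Time travel: all versions that were valid at the given moment."""
        return [v for v in self.versions
                if v.valid_from <= at and (v.valid_to is None or at < v.valid_to)]


table = TemporalTable()
table.insert("546", "Shirt", "XL", datetime(2010, 9, 11, 8, 30))
table.insert("712", "Shoe", "M", datetime(2010, 10, 6, 11, 42))
table.insert("913", "Hat", "L", datetime(2010, 10, 15, 16, 11))
table.update_size("712", "S", datetime(2010, 10, 26, 17, 5))
table.delete("546", datetime(2010, 10, 26, 17, 30))

before = table.as_of(datetime(2010, 10, 20))
after = table.as_of(datetime(2010, 10, 27))
```

A query as of 2010-10-20 still sees all three original rows, while a query as of 2010-10-27 sees only the hat and the resized article 712.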


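Intra-operator parallelization: Aggregation

[Figure: Thread 1, Thread 2, … each aggregate a part of the fact table into a local, cache-sized hash table (Local HT 1, Local HT 2). The local hash tables are handed over through a buffer, and Thread 3 and Thread 4 merge them into the partitioned hash tables Part HT 1 and Part HT 2.]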
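The two-phase structure of this aggregation can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the local hash tables are plain dictionaries without a cache-size limit or overflow buffer, and Python threads only show the structure (they do not give a real speed-up for CPU-bound work).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def local_aggregate(chunk):
    """Phase 1: one thread sums its share of the fact table into a local hash table."""
    table = defaultdict(int)
    for key, value in chunk:
        table[key] += value
    return dict(table)


def merge_partition(local_tables, part, n_parts):
    """Phase 2: one thread merges all keys that hash into its partition."""
    merged = defaultdict(int)
    for table in local_tables:
        for key, value in table.items():
            if hash(key) % n_parts == part:
                merged[key] += value
    return dict(merged)


def parallel_sum(fact_rows, n_workers=2, n_parts=2):
    # Split the fact table into one chunk per aggregation thread.
    chunks = [fact_rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=max(n_workers, n_parts)) as pool:
        local_tables = list(pool.map(local_aggregate, chunks))
        parts = list(pool.map(
            lambda part: merge_partition(local_tables, part, n_parts),
            range(n_parts)))
    # Partitions hold disjoint key sets, so combining them needs no locking.
    result = {}
    for part in parts:
        result.update(part)
    return result


fact = [("US", 3000), ("US", 1250), ("JP", 700), ("UK", 450)]
totals = parallel_sum(fact)
```

Because every key belongs to exactly one partition, the merging threads never write to the same hash table entry, which is what makes the scheme lock-free.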


Column Store: Dictionary Compression

Column "Name" (uncompressed): Jones, Miller, Millman, Zsuwalski, Baker, Miller, John, Miller, Johnson, Jones

Column "Name" (dictionary compressed): value-ID sequence, one element for each row in column; the value IDs point into the dictionary

Dictionary (sorted):

Value ID  Value
0         Baker
1         John
2         Johnson
3         Jones
4         Miller
5         Millman
…         …
N         Zsuwalski

Value ID implicitly given by sequence in which values are stored

Front coding of string values:

5 J o n e s | 2 2 h n | 4 3 s o n | 0 6 M i l l e r | 4 3 m a n

First entry: length of first value, then the value itself (Jones)

Following entries: length of prefix common with last value, length of remaining part after the common prefix, then the remaining part (John, Johnson, Miller, Millman)
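Front coding as described here stores, for each string, only how many leading characters it shares with the previous string plus the remaining characters. A minimal sketch, assuming in-memory Python lists; the function names are illustrative, and the explicit length of the remaining part is left implicit in the Python string:

```python
from typing import List, Tuple


def front_encode(values: List[str]) -> List[Tuple[int, str]]:
    """Encode each value as (length of prefix shared with previous value, rest)."""
    encoded = []
    previous = ""
    for value in values:
        prefix_len = 0
        limit = min(len(previous), len(value))
        while prefix_len < limit and previous[prefix_len] == value[prefix_len]:
            prefix_len += 1
        encoded.append((prefix_len, value[prefix_len:]))
        previous = value
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original strings from the front-coded entries."""
    values = []
    previous = ""
    for prefix_len, rest in encoded:
        value = previous[:prefix_len] + rest
        values.append(value)
        previous = value
    return values


names = ["Jones", "John", "Johnson", "Miller", "Millman"]
coded = front_encode(names)
print(coded)
# [(0, 'Jones'), (2, 'hn'), (4, 'son'), (0, 'Miller'), (4, 'man')]
```

The first entry has no predecessor, so its shared prefix length is 0 and the full value is stored, which corresponds to the "length of first value" entry in the figure.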

Dictionary (Main Storage)

Sorted array of values

Implicit value ID = position in array

Lookup by binary search: works like index

For string data: additional front-coding

Column stored as value ID sequence

Bit-coded using log2(N_DICT) bits per value ID, where N_DICT is the number of dictionary entries

Fast comparison (=, <, >) on integers

Speeds up scan, join, region queries

Dictionary (Delta)

Unsorted array

For lookup: search tree (CSB+ tree)

Search

Find Value in dictionary

scan value ID sequence for occurrences

Optional index:

For each value in dictionary list of rows with value
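The main-storage dictionary and the search procedure above can be sketched as follows. This is a minimal illustration with hypothetical function names; the toy dictionary contains only the seven names visible on the slide, whereas the slide's dictionary elides further entries before Zsuwalski.

```python
from bisect import bisect_left
from math import ceil, log2


def build_dictionary(column):
    """Sorted array of distinct values; the value ID is the array position."""
    return sorted(set(column))


def lookup(dictionary, value):
    """Binary search in the sorted dictionary; returns the value ID or -1."""
    pos = bisect_left(dictionary, value)
    if pos < len(dictionary) and dictionary[pos] == value:
        return pos
    return -1


def encode(column, dictionary):
    """Replace every value by its value ID."""
    return [lookup(dictionary, value) for value in column]


def bits_per_value_id(dictionary):
    """Bits needed to address every dictionary entry."""
    return max(1, ceil(log2(len(dictionary))))


def search(value_ids, dictionary, value):
    """Find the value in the dictionary, then scan the value-ID sequence."""
    value_id = lookup(dictionary, value)
    if value_id < 0:
        return []
    return [row for row, vid in enumerate(value_ids) if vid == value_id]


names = ["Jones", "Miller", "Millman", "Zsuwalski", "Baker",
         "Miller", "John", "Miller", "Johnson", "Jones"]
dictionary = build_dictionary(names)
value_ids = encode(names, dictionary)

print(value_ids)                              # [3, 4, 5, 6, 0, 4, 1, 4, 2, 3]
print(bits_per_value_id(dictionary))          # 3
print(search(value_ids, dictionary, "Miller"))  # [1, 5, 7]
```

The scan compares small integers only; the string comparison happens once, during the dictionary lookup.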


Column Store:

Compression of Value ID Sequence

1) Prefix Coding

Uncompressed: 4 4 4 4 4 4 4 4 3 5 3 1 1 0

Prefix coded: value 4, number of occurrences 8, followed by 3 5 3 1 1 0

2) Run Length Coding

[Figure: each run of identical value IDs in the uncompressed sequence is stored once, as a value together with its start position.]

3) Cluster Coding

N=6, cluster size=4

Bit vector 110101: bit i=1 if cluster i was replaced by single value

[Figure: uncompressed and cluster-coded value ID sequences]

4) Sparse Coding

Removes the value v which occurs most often. Bit vector indicates positions of v in the original sequence.

Uncompressed: 4 4 4 1 0 0 0 3 3 4 4 4 4 0

v = 4

Bit vector: 11100000011110

Sparse coded: 1 0 0 0 3 3 0

5) Indirect Coding

Block-wise dictionary coding of value IDs, block size=8

Uncompressed: 2 126 576 55 126 2 2 55 | 881 212 3 19 461 792 45 13 | 13 911 28 28 13 13 911 911

Dictionary for Block 1: 0 → 2, 1 → 55, 2 → 126, 3 → 576

Block 2 is not compressed

Dictionary for Block 3: 0 → 13, 1 → 28, 2 → 911

Compressed (conceptual): 0 2 3 1 2 0 0 1 | 881 212 3 19 461 792 45 13 | value IDs into the dictionary for Block 3
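The five schemes can each be written as a short encoder. The following is a sketch under stated assumptions, not the engine's storage format: the function names are hypothetical, results are plain Python lists instead of bit-packed arrays, the input for cluster coding is made up to produce the bit vector 110101, and the `max_distinct` threshold that decides when a block is worth compressing is an assumption.

```python
from collections import Counter


def prefix_encode(ids):
    """Replace the leading run of one value by (value, count)."""
    if not ids:
        return None, []
    count = 0
    while count < len(ids) and ids[count] == ids[0]:
        count += 1
    return (ids[0], count), ids[count:]


def run_length_encode(ids):
    """Store each run once as (value, start position)."""
    runs = []
    for pos, value in enumerate(ids):
        if not runs or runs[-1][0] != value:
            runs.append((value, pos))
    return runs


def cluster_encode(ids, cluster_size):
    """Replace every cluster that holds a single value by that value."""
    bits, out = [], []
    for start in range(0, len(ids), cluster_size):
        cluster = ids[start:start + cluster_size]
        if len(set(cluster)) == 1:
            bits.append(1)
            out.append(cluster[0])
        else:
            bits.append(0)
            out.extend(cluster)
    return bits, out


def sparse_encode(ids):
    """Remove the most frequent value v; a bit vector marks its positions."""
    v = Counter(ids).most_common(1)[0][0]
    bits = [1 if value == v else 0 for value in ids]
    rest = [value for value in ids if value != v]
    return v, bits, rest


def sparse_decode(v, bits, rest):
    remaining = iter(rest)
    return [v if bit else next(remaining) for bit in bits]


def indirect_encode(ids, block_size, max_distinct):
    """Per block: use a small local dictionary if the block has few distinct values."""
    blocks = []
    for start in range(0, len(ids), block_size):
        block = ids[start:start + block_size]
        dictionary = sorted(set(block))
        if len(dictionary) <= max_distinct:
            blocks.append((dictionary, [dictionary.index(x) for x in block]))
        else:
            blocks.append((None, block))  # block stays uncompressed
    return blocks


prefix = prefix_encode([4] * 8 + [3, 5, 3, 1, 1, 0])
runs = run_length_encode([4, 4, 4, 3, 3, 1])
cluster_bits, clustered = cluster_encode(
    [4] * 4 + [3] * 4 + [3, 1, 1, 0] + [0] * 4 + [0, 1, 2, 2] + [2] * 4, 4)
v, sparse_bits, sparse_rest = sparse_encode([4, 4, 4, 1, 0, 0, 0, 3, 3, 4, 4, 4, 4, 0])
blocks = indirect_encode(
    [2, 126, 576, 55, 126, 2, 2, 55,
     881, 212, 3, 19, 461, 792, 45, 13,
     13, 911, 28, 28, 13, 13, 911, 911],
    block_size=8, max_distinct=4)
```

With these inputs, the sparse encoder reproduces the bit vector 11100000011110 and the first indirect block becomes 0 2 3 1 2 0 0 1 over the dictionary 2, 55, 126, 576, matching the examples above.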


Column Store Architecture

[Architecture diagram: Column Store]

Requests: Query (search) and Write Request, coming from SQL Processor and Calc Engine

Column Store Optimizer, OLAP Optimizer, Column Store Executor (Execution Plan)

Column Store Operators (Read): Standard Operators, Operators optimized for OLAP

Write Operations

Consistent View Manager, Consistent View Filter

Transaction Manager: Row Level Locks, Transaction Status, Per Transaction Change Information

Storage: Main Storage, Delta Storage, Merge Operation

History Storage: Main, Delta

Load/Unload Agent: load table

Persistence Interface and Logger: write delta log; write main and delta log virtual files

Persistence Layer: Virtual Main Files, Virtual Delta Log Files, Redo and Undo Log

Column Store

Own optimizer and execution control

OLAP engine: Special optimizer and operators

For each column

Compressed main storage

Delta storage with only basic compression

Write operations go into delta only

Merge operation re-encodes main area

Optional history storage (also main + delta)

Delta log written per table

Main area persisted during merge operation

Write and read possible during merge

Loads tables on demand (preload configurable)
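The interplay of compressed main storage, write-optimized delta storage and the merge operation can be sketched for a single column. This is a simplified illustration with hypothetical names: a Python dict stands in for the delta's search tree, and persistence, logging and concurrent access during the merge are left out.

```python
from bisect import bisect_left


class Column:
    """Hypothetical single column with main and delta storage."""

    def __init__(self):
        self.main_dict = []    # sorted dictionary of the main storage
        self.main_ids = []     # value-ID sequence of the main storage
        self.delta_dict = []   # unsorted dictionary, values in arrival order
        self.delta_index = {}  # value -> delta value ID (stand-in for a search tree)
        self.delta_ids = []    # value-ID sequence of the delta storage

    def insert(self, value):
        """Write operations go into the delta only."""
        value_id = self.delta_index.get(value)
        if value_id is None:
            value_id = len(self.delta_dict)
            self.delta_dict.append(value)
            self.delta_index[value] = value_id
        self.delta_ids.append(value_id)

    def scan(self, value):
        """Row positions of a value; reads combine main and delta."""
        rows = []
        pos = bisect_left(self.main_dict, value)
        if pos < len(self.main_dict) and self.main_dict[pos] == value:
            rows += [r for r, vid in enumerate(self.main_ids) if vid == pos]
        value_id = self.delta_index.get(value)
        if value_id is not None:
            offset = len(self.main_ids)
            rows += [offset + r for r, vid in enumerate(self.delta_ids)
                     if vid == value_id]
        return rows

    def merge(self):
        """Re-encode main storage with a new sorted dictionary, then empty the delta."""
        values = ([self.main_dict[vid] for vid in self.main_ids]
                  + [self.delta_dict[vid] for vid in self.delta_ids])
        new_dict = sorted(set(values))
        position = {value: i for i, value in enumerate(new_dict)}
        self.main_dict = new_dict
        self.main_ids = [position[value] for value in values]
        self.delta_dict, self.delta_index, self.delta_ids = [], {}, []


column = Column()
for name in ["Miller", "Baker", "Miller"]:
    column.insert(name)
rows_before_merge = column.scan("Miller")
column.merge()
column.insert("Jones")
rows_jones = column.scan("Jones")
column.merge()
```

Inserting "Jones" after the first merge shows why the main area must be re-encoded: the new value sorts between existing dictionary entries, so the value IDs of "Miller" shift.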


Concurrency Control and Consistency

MVCC

All records have timestamps (integer sequences CID, TID)

Updates create new version of data record

Running transactions still read old versions (snapshot)

Concurrent readers not blocked by writer

Concurrent write prevented by write locks

Snapshot Isolation

Operations see a snapshot of the database

created either at start of transaction

or for each statement

Consistent View of the Database

Contains only changes that are part of the snapshot

Transaction Token

Internal object used to calculate the consistent view

Contains for example maximum commit sequence number included in the snapshot

Row store, column store can determine visible rows for a given transaction token

Also used by column store for time-based queries (restore historical transaction token)

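Multi Version Concurrency Control

[Figure: timeline with transactions T1 to T5 and the versions of data item D. A write inserts a new version; after commit, V1 and V2 are committed versions. Transactions that are still open keep reading the version that belongs to their snapshot. An old version is redundant if the history function is not required.]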
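How a transaction token decides which record versions are visible can be sketched as follows. This is a minimal model under stated assumptions: only committed versions are tracked, the token carries nothing but the maximum commit sequence number, write locks are omitted, and all class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordVersion:
    value: str
    created_cid: int                   # commit ID of the writing transaction
    deleted_cid: Optional[int] = None  # commit ID that replaced or deleted it


@dataclass
class TransactionToken:
    max_cid: int  # maximum commit sequence number included in the snapshot


def visible(version: RecordVersion, token: TransactionToken) -> bool:
    """A version is visible if it was created, but not yet replaced, in the snapshot."""
    created = version.created_cid <= token.max_cid
    deleted = (version.deleted_cid is not None
               and version.deleted_cid <= token.max_cid)
    return created and not deleted


class VersionedItem:
    """All committed versions of one data item."""

    def __init__(self) -> None:
        self.versions: List[RecordVersion] = []

    def commit_write(self, value: str, cid: int) -> None:
        # An update creates a new version; the old one stays for running readers.
        if self.versions and self.versions[-1].deleted_cid is None:
            self.versions[-1].deleted_cid = cid
        self.versions.append(RecordVersion(value, cid))

    def read(self, token: TransactionToken) -> Optional[str]:
        for version in self.versions:
            if visible(version, token):
                return version.value
        return None


item = VersionedItem()
item.commit_write("V1", cid=10)
old_snapshot = TransactionToken(max_cid=10)  # reader that started before the update
item.commit_write("V2", cid=20)
new_snapshot = TransactionToken(max_cid=20)
```

The reader holding `old_snapshot` continues to see V1 after V2 has been committed. Keeping an old token around is also what a time-based query amounts to in this model.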


Distributed In-Memory Computing Engine (1)

Multi Tenant System

[Figure: one NewDB system spread over several hosts. Each tenant (Tenant 1 … Tenant N) has its own database server process and its own tenant data.]

Default: tenants are not distributed across multiple servers

Data Distribution – No tenant separation

[Figure: one NewDB system spread over several hosts. Each host runs a database server process that holds one data partition.]


Distributed In-Memory Computing Engine (2)

[Figure: several NewDB database servers, each with Execution Layer, In-Memory Stores and Persistence Layer, connected to each other by request channels.]


Distributed In-Memory Computing Engine

Topology Information / metadata

[Figure: hosts running name servers and NewDB database servers. The master name server holds the topology and distribution information; the slave name servers hold replicated copies of it. Each NewDB database server has fast read access to this information via shared memory.]


Distributed In-Memory Computing Engine

Components


Partitioning

Multi level partitioning

Hash partitioning

Value (range) partitioning

Landscape reorganization

Based on table size and join paths

Semi automatic table move and partitioning

Query pruning

• Automatic table query pruning

• Plan rewrite for star schema complex query pruning

Aging (next release)

Based on application information (create / close date)

Automatic Business object aging

New references can change the close date
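Hash partitioning, range partitioning and query pruning can be illustrated with a small sketch. The names, the use of CRC32 as hash function and the year boundaries are assumptions made for the example, not details from the slides.

```python
import zlib
from bisect import bisect_right


def hash_partition(key: str, n_parts: int) -> int:
    """First level: spread rows evenly by hashing the key."""
    return zlib.crc32(key.encode("utf-8")) % n_parts


def range_partition(value: int, upper_bounds) -> int:
    """Second level: sorted exclusive upper bounds; the last partition takes the rest."""
    return bisect_right(upper_bounds, value)


def locate(key: str, year: int, n_hash_parts: int, upper_bounds):
    """Multi-level partitioning: (hash partition, range partition) of one row."""
    return hash_partition(key, n_hash_parts), range_partition(year, upper_bounds)


def prune_ranges(low: int, high: int, upper_bounds):
    """Query pruning: only range partitions that can hold values in [low, high]."""
    first = range_partition(low, upper_bounds)
    last = range_partition(high, upper_bounds)
    return list(range(first, last + 1))


# Hypothetical layout: partition 0 holds years before 2010,
# partition 1 holds 2010, partition 2 holds 2011 and later.
YEAR_BOUNDS = [2010, 2011]

only_2010 = prune_ranges(2010, 2010, YEAR_BOUNDS)
all_years = prune_ranges(2009, 2011, YEAR_BOUNDS)
```

A query restricted to the year 2010 has to touch one range partition only, while an equality predicate on the hash key narrows the first level to a single hash partition.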


Multi-Tenancy Requirements for a Database Management System (1/2)

Minimize TCO by resource sharing

Tenant-independent data

Shared metadata

Use same hardware / lifecycle process

Transparency of resource sharing

Operate on tenant as if it has its own DB

Tenant-specific DB connections

– Local tenant-specific operations

– No need to insert a tenant ID into SQL statements


Multi-Tenancy Requirements for a Database Management System (2/2)

Tenant isolation

One tenant cannot directly access other tenants' data

One tenant should not be influenced by other tenants' operations

– Independence of other tenants' resource consumption or crashes

– Guarantee SLAs

Ease of tenant lifecycle operations

Tenant creation, move, delete, copy, backup

Tenant-specific release management

Debugging, tracing, monitoring


Tenant Separation

ABAP client concept: one Table T1, tenant identified by the Client column

Client  A    B    C
100     765  y    c
200     770  z    a
100     701  w    c
300     733  x    b
300     777  z    a
200     800  z    b
...     ...  ...  ...

NewDB tenant concept: separate instances of T1, defined by central metadata

T1 in Tenant 100

A    B    C
765  y    c
701  w    c
...  ...  ...

T1 in Tenant 200

A    B    C
770  z    a
800  z    b
...  ...  ...

T1 in Tenant 300

A    B    C
733  x    b
777  z    a
...  ...  ...

NewDB tenant

Separate database process

Separate data volumes

Separate instances of tenant-dependent tables

Separate transaction domain
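The difference between the two concepts can be shown with the rows of T1. This is a minimal sketch with hypothetical names: the per-tenant rows are taken from the slide, and the shared table of the client concept is derived from them by adding the client column.

```python
# NewDB tenant concept: every tenant owns a separate instance of table T1.
tenant_t1 = {
    100: [(765, "y", "c"), (701, "w", "c")],
    200: [(770, "z", "a"), (800, "z", "b")],
    300: [(733, "x", "b"), (777, "z", "a")],
}

# ABAP client concept: one shared table with a leading client column.
shared_t1 = [(client, *row)
             for client, rows in tenant_t1.items()
             for row in rows]


def select_client_concept(client):
    """Every query has to filter on the client column."""
    return [row[1:] for row in shared_t1 if row[0] == client]


def select_tenant_concept(tenant):
    """The connection is bound to one tenant, so no tenant ID appears in the query."""
    return list(tenant_t1[tenant])


rows_client = select_client_concept(200)
rows_tenant = select_tenant_concept(200)
```

Both calls return the same rows. In the tenant concept the separation is structural, so a missing filter cannot expose another tenant's data.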


Tenant Separation (1/2)

Tenant-specific data

Tenant-specific tables

Tenant-private tables

Tenant-specific DB / OS process

Every tenant has its own physical DB process on OS level

Makes every tenant independent from effects of other tenants

Easy tenant start & stop

Tenant-specific backup/recovery possible (but not intended)


Tenant Separation (2/2)

Tenant-specific transaction domain

Every tenant has its own transaction domain

Avoids bottleneck of central transaction manager

Allows local transactions only for non-distributed tenants

A tenant is not allowed to read/write other tenants' data/metadata

Exceptions:

– A tenant can read system tenant data

– Lifecycle management operations in system tenant

Disadvantage: Expensive cross-tenant transactions

Tenant-specific extensions

Tenant-specific table metadata for tenant-private tables

Tenant-specific extensions (metadata) to tenant-specific tables


Multi Tenant Data Separation

[Figure: the System Tenant holds Central Metadata and Tenant Independent Tables as shared read-only information. Each Normal Tenant holds Tenant Private Tables, Private Metadata and Tenant Dependent Tables, and reaches the system tenant by indirect access (metadata replication or distributed query).]

Table Category      Metadata       Table Instance  Access to Data
Tenant Dependent    System Tenant  Tenant          Private to tenant
Tenant Independent  System Tenant  System Tenant   Read access from all tenants
Tenant Private      Tenant         Tenant          Private to tenant


Metadata sharing

Metadata (like table definitions) is divided into global and local metadata

Global metadata

Tenant spanning

Stored in system tenant

Replicated to all hosts

Read-only for tenants

Is delivered by application provider

Local metadata

Tenant-local

Stored in local tenant

Replicated to the other tenant-local processes (in case of distributed tenant)

Types

– Private tenant tables

– Extensions to tenant-independent tables (new columns)

During runtime: Union between global and local metadata

During tenant move, the global metadata must be consistent or the move is expensive
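The runtime union of global and local metadata can be sketched as a merge of two catalogs. All names here are hypothetical, table definitions are reduced to lists of column names, and local extensions are assumed to only append columns.

```python
from typing import Dict, List

# Global metadata: stored in the system tenant, read-only for normal tenants.
GLOBAL_METADATA: Dict[str, List[str]] = {"T1": ["A", "B", "C"]}


class Tenant:
    """Hypothetical holder of tenant-local metadata."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.private_tables: Dict[str, List[str]] = {}  # private tenant tables
        self.extensions: Dict[str, List[str]] = {}      # extra columns per table

    def effective_metadata(self, global_metadata: Dict[str, List[str]]):
        """Union of global and local metadata as seen by this tenant at runtime."""
        merged = {table: list(columns) + self.extensions.get(table, [])
                  for table, columns in global_metadata.items()}
        merged.update({table: list(columns)
                       for table, columns in self.private_tables.items()})
        return merged


tenant = Tenant("tenant_100")
tenant.extensions["T1"] = ["Z_EXTRA"]
tenant.private_tables["MY_TABLE"] = ["K", "V"]

catalog = tenant.effective_metadata(GLOBAL_METADATA)
plain_catalog = Tenant("tenant_200").effective_metadata(GLOBAL_METADATA)
```

The global definition is never modified, so a tenant without extensions still sees the delivered table layout. This is one reason why a tenant move is cheap only when source and target agree on the global part.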


Multi Tenancy in the In-Memory Computing Engine

Table: Content / Metadata

Tenant-independent table

Content: Stored in system tenant; read access from other tenants

Metadata: Stored in system tenant; read access from other tenants

Tenant-dependent table

Content: Independent instances of the table exist in each tenant with content that is private to the tenant

Metadata: Table definition is stored centrally in system tenant with read access from other tenants; tenant-specific extensions (additional columns, etc.) are stored as private metadata

Tenant-private table

Content: Stored as private data of the tenant with no access from other tenants

Metadata: Stored as private metadata of the tenant


Special Tenants

System tenant

Metadata owner

Holds system-wide data

– Common master data (country names, currency conversion data)

– Delivered by SAP

Data can be read by all tenants

Owner of lifecycle processes

Can read all tenants' data

Template tenant(s)

Preconfigured

Not productive

Can be used as templates for creating user tenants

User/Customer tenants

Holds private data

Holds private metadata (extensions)

Can read system tenant data (weak transactional)

Can write system tenant data (exceptional mode)


ABAP System On In-Memory Computing Engine

No Tenant Separation on DB Level

[Figure: three ABAP Application Server instances (Instance 1 to Instance 3), each on its own host with a Dispatcher and several Work Processes. The work processes send requests to one NewDB System (no tenant separation) that consists of NewDB Database Server A and NewDB Database Server B, each on its own host.]


ABAP System On In-Memory Computing Engine

With Tenant Separation on DB Level

[Figure: NewDB with Tenant Separation. Host H1 runs the NewDB Database Servers for Tenant 1 and Tenant 2, Host H2 those for Tenant 3 and Tenant 4. ABAP Application Server Instance 1 and Instance 2 each consist of a Dispatcher and several Work Processes. Clients of the users of Tenant 1 to Tenant 4 connect to the application servers. Work processes reach tenant database servers by local communication or by remote communication. The NewDB Database Server of the System Tenant is reached by distributed query execution (read-only).]


Multi Tenancy in the In-Memory Computing Engine

Thank You!

Franz Faerber, Michael Muehle

SAP AG


Row Store Architecture

[Architecture diagram: Row Store]

Execution Layer: Read Operations, Write Operations, Intermediate Results

Transaction Manager: Record Level Write Locks

Transactional Version Memory: Record Versions, addressed by Rowid

Version Memory Consolidation

Segments: Pages (Page Header, Slots) holding Record Versions

Page Manager: Free Pages, Sparse Pages

Index: key → rowid

Checkpoint Writer: invoke checkpoint, write pages

Log Replay/Undo Agent

Persistence Layer: write log entries

Row Store

Data records managed in pages

Temporary versions in separate memory

Index only in memory

Optimistic latch-free index access

Loads all tables at startup

Incremental transaction logs

Parallel write and restart

Data persisted during savepoint operation