Franz Faerber, Michael Muehle DB development, SAP AG … · 2013-07-11 · Main memory column and...
Transcript of Franz Faerber, Michael Muehle DB development, SAP AG … · 2013-07-11 · Main memory column and...
Main memory column and row store
for Enterprise applications
Franz Faerber, Michael Muehle DB development, SAP AG
April 2011
© 2011 SAP AG. All rights reserved. 2 Confidential
Why In-Memory?
Information at the Speed of Thought
Imagine access to business data,
supporting analytical queries on the transactional data,
no need for data replication, and zero response time
How would this change the way you work with data?
How would this change your perception of information availability?
How would this new information influence your decisions?
In-memory databases are a technology
with huge disruption potential,
we need to assume a competitive leadership position.
© 2011 SAP AG. All rights reserved. 3 Confidential
Motivation:
Make Use of Modern Hardware
Past
CPU clock rate growing
Software runs faster without change
Today
CPU clock rate growth is flat
Number of cores increases
Write software that scales with number of cores!
Game Changing Hardware Trends
Disk is Tape
Main memory affordable
CPU: more cores, no clock rate increase
Software makers need to react
© 2011 SAP AG. All rights reserved. 4 Confidential
Massive amount of memory and parallel processing power
for under $1M
1 server ...8 CPUs, each CPU with 8 cores for $100,000
can execute 64 threads in parallel
A A A A A A A A A A A A A A A A can hold memory modules adding up to ~2TB today
8 Server blades ...for less than $1M
can execute 512 threads in parallel
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A A
can hold memory modules adding up to ~16TB
100GB/sec data throughput per server
© 2011 SAP AG. All rights reserved. 5 Confidential
Simplified Memory Hierarchy
(based on Intel Nehalem)
CPU
Core
1st Level
Cache
2nd Level
Cache
Shared 3rd Level Cache
Core
1st Level
Cache
2nd Level
Cache
Core
1st Level
Cache
2nd Level
Cache
Core
1st Level
Cache
2nd Level
Cache
Size Latency
64KB ~4 cycles
2 ns
256KB ~10 cycles
5 ns
8MB 35-40+ cycles
20 ns
several
GBs up
to TBs
~ 200 cycles
~ 100ns
several
TB
several million
cycles
several ms
Main Memory
Disk
© 2011 SAP AG. All rights reserved. 6 Confidential
Main Memory Bottleneck
Disk is Tape
105 times slower compared to main memory access
No disk access during normal operation
Use disk as kind of “archive”
Memory
Price / MB RAM is going down
64 bit architectures make maximum limit of 4GB main memory obsolete
Memory Access
Access to main memory is not arbitrarily fast
Memory bandwidth increases but memory access latency remains the same
Cache misses limit performance
Traditional database: CPU spends 50% of time waiting after cache misses (1999)
Transfer memory-cache is block wise (cache lines) , for example 64 bytes
Main memory access benefits from data locality, random access is slow
CPU
CPU Cache
MainMemory
Disk
CPU Cache
Core
Performance bottleneck in the past:
Disk I/O
Performance bottleneck today:
CPU waiting for data to be
loaded from memory into cache
© 2011 SAP AG. All rights reserved. 7 Confidential
“In-Memory” Computing
“In-Memory” Computing
Keep all required data in main memory
Rarely accessed data can be moved to disk
(e.g. after 1 year)
Compress data in memory
Disk I/O no longer and optimization target
Cache sensitive data layout
High locality
(data that is needed together is stored together)
Compression (decompress in cache)
Parallelization
Data structures allow splitting into pieces that can be processed in parallel
Avoid locks
Unify OLAP and OLTP systems : combine column and row store
Column store for reporting like access
Row store for OLTP like access
© 2011 SAP AG. All rights reserved. 8 Confidential
Row Based and Column Based Data Storage
US
US
US JP
UK
Alpha
Beta
Alpha
Alpha
3.000
1.250
700
450
Country Product Sales
Row-Based
Columnar
Sum(sales):
Sales: 4 byte, country: 2 byte, product: 10 byte
Cache line: 64 Byte → 48 byte waste = 75%
… … …
US Alpha 3.000 US Beta 1.250 JP Alpha 700 UK Alpha 450 …
US US JP UK Alpha Beta Alpha Alpha 3.000 1.250 700 450 … … …
data stored in contiguous memory
High locality
Aggregate sales figures:
Random access!
© 2011 SAP AG. All rights reserved. 9 Confidential
Row Based and Column Based Data Storage
Column Store
Many calculations on single or few columns only
Searches based on values of a few columns
Big number of columns
Big number of rows and columnar operations are required
aggregate, scan, etc.
High compression rates possible
Most columns contain only few distinct values
Row Store
Application often needs to process single records at one time
many selects and /or updates of single records
Application typically needs to access the complete record
Columns contain mainly distinct values
Aggregations and fast searching not required
Small number of rows (e.g. configuration tables)
?
© 2011 SAP AG. All rights reserved. 10 Confidential
In-Memory Computing Engine
Overall Architecture
NewDB
Request Processing
And Execution Control
Disk Storage
Relational Engines
Connection and Session Management
In Memory
Row Store
In Memory
Column
StoreIn Memory
Object
Store
Disk Based
StoreTemporary
Results
Persistence Layer
Data Volumes Transaction Log Volumes
Transaction
Manager
Metadata
Manager
NewDB Clients (Application Server, Analytics Technology, etc)
Page Management Logger
R
Session
ParametersR
SQL Script
R
R
Data Aging Manager
R
MDXPlanning
Engine
Calc EngineAuthorization
Manager
R
R
Request
Parser
Optimizer
Execution Layer
© 2011 SAP AG. All rights reserved. 11 Confidential
SAP In-Memory Computing Engine (NewDB)
High Performance and Scalability
Compression, cache-aware storage
Executing application logic inside database
Distribution
Reducing Complexity and Cost
Hybrid: column based, row based (future: object store and disk based) in one system
Analytics and OLTP in one system
Support for OLAP
MDX, calculation engine, OLAP views
Compatibility and Standard DBMS Features
SQL, ACID
run SAP applications that use Open SQL without changes.
Multi Tenant Support
Support For Temporal Tables
“time travel”, time slider applications, reporting using historical data, change recording
Main memory text search engine
© 2011 SAP AG. All rights reserved. 12 Confidential
“In-Memory” Computing
Business Database
Move data-intensive operations to the data layer
Do the relational operations at database level ( i.e. select for all entries )
Hierarchy handling ( i.e HR cost center hierarchy )
Reporting and planning functionality
Planning engine
Transactional behaviour
Unify DB and application transaction
Push snapshot isolation to the database / remove transactional buffer in the appserver layer
Business Function Library
Provide business functions in the DB layer like
– Currency /Unit conversion
– Date / time /fiscal period /calendar calculation
– Statistical functionality
Deep integration with the application server
Fast communication layer
SQL extensions ( SQL script )
New data types ( text, GUID, …)
© 2011 SAP AG. All rights reserved. 13 Confidential
Temporal Tables
UPDATE article SET SIZE = „S‟ WHERE ID=„712‟
DELETE FROM article WHERE ID=„546‟
Row ID Text Size
1 546 Shirt XL
2 712 Shoe M
3 913 Hat L
Temporal Table (Logical View)
Valid-From
2010-09-11 08:30
2010-10-06 11:42
2010-10-15 16:11
Valid-To
∞
∞
∞
Row ID Text Size
1 546 Shirt XL
2 712 Shoe M
3 913 Hat L
Valid-From
2010-09-11 08:30
2010-10-06 11:42
2010-10-15 16:11
Valid-To
∞
2010-10-26 17:05
∞
4 712 Hat S 2010-10-26 17:05 ∞
Row ID Text Size
1 546 Shirt XL
2 712 Shoe M
3 913 Hat L
Valid-From
2010-09-11 08:30
2010-10-06 11:42
2010-10-15 16:11
Valid-To
2010-10-26 17:30
2010-10-26 17:05
∞
4 712 Hat S 2010-10-26 17:05 ∞
historical
historical
insert
Currently only in column store
© 2011 SAP AG. All rights reserved. 14 Confidential
Intra operator parallization: Aggregation
Thread 1
Thread 2
Fact Table
… Thread 1
Local HT 1 Local HT 2
Cache-Sized Hash Tables
Buffer
Part HT 1 Part HT 2
Thread 3 Thread 4
© 2011 SAP AG. All rights reserved. 15 Confidential
Column Store: Dictionary Compression
JonesMiller
MillmanZsuwalskiBakerMillerJohnMillerJohnsonJones
Column „Name“
(uncompressed)Value-ID sequence
One element for each row in column
415N042431
Va
lue
IDs
JohnsonMiller
JohnJones
01234
Millman
ZsuwalskiN
Dictionary
so
rte
d
Value ID implicitly given
by sequence in which
values are stored
Value
Baker
5
Column „Name“ (dictionary compressed)
point into
dictionary
5 J o en s 2 2 h n 4 3
J h n
s o n 0 6 M i l l e r 4 3 m a n
Length
of first
value
o
Length of
prefix
common with
last valueLength of remaining
part after the
common prefix
0 21 4
J o en s
J o h n
2 2
J h no s o n
34
3
Dictionary (Main Storage)
Sorted array of values
Implicit value ID = position in array
Lookup by binary search: works like index
For strings data: additional front-coding
Column stored as value ID sequence
Bit coded using log2(NDICT) bits
Fast comparison ( =, < , > ) on integers
Speeds up scan, join, region queries
Dictionary (Delta)
Unsorted array
For lookup: search tree (CSB+ tree)
Search
Find Value in dictionary
scan value ID sequence for occurrences
Optional index:
For each value in dictionary list of rows with value
© 2011 SAP AG. All rights reserved. 16 Confidential
Column Store:
Compression of Value ID Sequence
3 12
Run length
encoded
2) Run Length Coding
Uncompressed 4 4 4 4 4 45 2 2
value
8Prefix Coded
1) Prefix Coding
Uncompressed 4 4 4 4 4 4 3 5 3 1 1 04 4
4 3 1 1 03 5
5) Indirect coding
2 126
1
32
Uncompressed
055
576126
2
576 55 126 2 2 55 881 212 3 19 461 792 45 13 13 911 28 28 13 13 911 911
12
028911
13
block size=8
Compressed
(conceptual)0 2 3 1 2 0 0 1 881 212 3 19 461 792 45 13 0 2 1 1 0 0 3 3
Number of occurences
value
block wise dictionary coding of Value-IDs
4 4 3 1
N=6, cluster size=4
3) Cluster Coding
Uncompressed4 4 4 3 3
Cluster
coded
Bit vector 110101 Bit i=1 if cluster i was replaced
by single value
4 4 4 1 0 0 0 4 4 4 4 03 3
Removes the value v which occurs most
often. Bit vector indicates positions of v in the
original sequence.
Uncompressed
Sparse Coded 1 0 0 0 03 3
Bitvector 11100000011110
4) Sparse Coding
4
3 35 2235
0 9
4 44 4 4 4 3 3 3 3 3 1 1 0 0 0 0 0 0
34 3 3 1 0 0 0
Dictionary for
Block 3Dictionary for
Block 1
Block 2 is not compressed
4 235
0 31 4 5 6 87 9 10 11 122 13 14 15
start position
v
© 2011 SAP AG. All rights reserved. 17 Confidential
Column Store Architecture
Column Store
Column Store
Execution Plan
Persistence Layer
Main
Storage
Delta
Merge
Operation
Consistent
View
Manager
Per Transaction
Change
Information
Logger
R
Virtual Main Files Redo and Undo Log
Consistent
View Filter
Delta
Storage
Persistence Interface
Transaction
Manager
R
Row Level
Locks
Transaction
Status
History
Storage
Main
Delta
R
Virtual Delta Log
Files
R
Write Operations
Column Store Executor
Column Store Optimizer
R
SQL Processor Calc Engine
OLAP Optimizer
R
Load/Unload
Agent
R
R
R
Column Store Operators
(Read)Standard Operators
Operators optimized
for OLAP
Write
Request
Query
(search)
R
R
R
R
RR
Write delta logLoad tableWrite main and
delta log virtual
files
R
Column Store
Own optimizer and execution control
OLAP engine: Special optimizer and operators
For each column
Compressed main storage
Delta storage with only basic compression
Write operations go into delta only
Merge operation re-encodes main area
Optional history storage (also main + delta)
Delta log written per table
Main area persisted during merge operation
Write and read possible during merge
Loads tables on demand (preload configurable)
© 2011 SAP AG. All rights reserved. 18 Confidential
Concurrency Control and Consistency
MVCC
All records have timestamp ( integer sequences CID, TID)
Updates create new version of data record
Running transactions still read old versions (snapshot)
Concurrent readers not blocked by writer
Concurrent write prevented by write locks
Snapshot Isolation
Operations see a snapshot of the database
created either at start of transaction
or for each statement
Consistent View of the Database
Contains only changes that are part of the snapshot
Transaction Token
Internal object used to calculate the consistent view
Contains for example maximum commit sequence number included in the snapshot
Row store, column store can determine visible rows for a given transaction token
Also used by column store for time base queries (restore historical transaction token)
Multi Version Concurrency Control
time
T1 T3
open
open
inse
rt
Versions of
data item Dcomitted
comitted
commitcommit
V1
V2
read
read
redundant
If history
function is
not required
T5
T4
T2
write
read
read
© 2011 SAP AG. All rights reserved. 19 Confidential
Distributed In-Memory Computing Engine (1)
© SAP 2008 /
Page 19
NewDB
System
Tenant 1
Database
Server
Process
Tenant 2
Database
Server
Process
Tenant 3
Database
Server
Process
Tenant 4
Database
Server
Process
Tenant 5
Database
Server
Process
Tenant N
Database
Server
Process
Tenant Data Tenant Data Tenant Data Tenant Data Tenant Data Tenant Data
Host Host Host
NewDB
System
Database
Server
Process
Data
Partition
Host
Database
Server
Process
Data
Partition
Host
Database
Server
Process
Data
Partition
Host
Data Distribution – No tenant separation
Multi Tenant System
Default: tenants are not distributed across multiple servers
© 2011 SAP AG. All rights reserved. 20 Confidential
Distributed In-Memory Computing Engine (2)
© SAP 2008 /
Page 20
NewDB
Database Server
NewDB
Database Server
Execution
Layer
In-Memory Stores
Persistence Layer
R
R
NewDB
Database Server
Execution
Layer
In-Memory Stores
Persistence Layer
R
R
Execution
Layer
In-Memory Stores
Persistence Layer
R
R
© 2011 SAP AG. All rights reserved. 21 Confidential
Distributed In-Memory Computing Engine
Topology Information / metadata
© SAP 2008 /
Page 21
Host
Host
Host
Master
Name
Server
Topology
And distribution
Information
Slave
Name
Server
Slave
Name
Server
Slave
Name
Server
Topology and distribution
Information (replicated)
Topology and distribution
Information (replicated)
Topology and distribution
Information (replicated)
NewDB
Database Server
NewDB
Database Server
NewDB
Database Server
R
R
R
Fast read access
via shared memory
© 2011 SAP AG. All rights reserved. 22 Confidential
Distributed In-Memory Computing Engine
Components
© SAP 2008 /
Page 22
© 2011 SAP AG. All rights reserved. 23 Confidential
Partitioning
Multi level partitioning
Hash partitioning
Value ( range ) partitioning
Landscape reorganization
Based on table size and join paths
Semi automatic table move and partitioning
Query pruning
• Automatic table query pruning
• Plan rewrite for star schema complex query pruning
Aging ( next release)
Based on application information ( create / close date )
Automatic Business object aging
New references can change the close date
© 2011 SAP AG. All rights reserved. 24 Confidential
Multi-Tenancy Requirements for a Database Management System (1/2)
Minimize TCO by resource sharing
Tenant-independent data
Shared metadata
Use same hardware / lifecycle process
Transparency of resource sharing
Operate on tenant as if it has its own DB
Tenant-specific DB connections
– Local tenant-specific operations
– No need to insert a tenant ID into SQL statements
© 2011 SAP AG. All rights reserved. 25 Confidential
Multi-Tenancy Requirements for a Database Management System (2/2)
Tenant isolation
One tenant cannot directly access other tenant„s data
One tenant should not be influenced by other tenant„s operations
– Independence of other tenant„s resource consumptions or crashes
– Guarantee SLAs
Ease of tenant„s lifecycle operations
Tenant creation, move, delete, copy, backup
Tenant-specific release management
Debugging, tracing, monitoring
© 2011 SAP AG. All rights reserved. 26 Confidential
Tenant Separation
100
200
Client
100
300
200
...
300
A B C
765
770
701
733
777
800
y
z
w
x
z
z
c
a
c
b
a
b
... ... ...
A B C
765
701
y
w
c
c
A B C
770
800
z
z
a
b
... ... ...
A B C
733
777
x
z
b
a
... ... ...... ... ...
T1 in Tenant 100
ABAP client concept
T1 in Tenant 200 T1 in Tenant 300
defined by central
metadata
NewDB tenant concept
Table T1
NewDB tenant
Separate database process
Separate data volumes
Separate instances of tenant-dependent tables
Separate transaction domain
© 2011 SAP AG. All rights reserved. 27 Confidential
Tenant Separation (1/2)
Tenant-specific data
Tenant-specific tables
Tenant-private tables
Tenant-specific DB / OS process
Every tenant has its own physical DB process on OS level
Makes every tenant independent from effects of other tenants
Easy tenant start & stop
Tenant-specific backup/recovery possible (but not intended)
© 2011 SAP AG. All rights reserved. 28 Confidential
Tenant Separation (2/2)
Tenant-specific transaction domain
Every tenant has its own transaction domain
Avoids bottleneck of central transaction manager
Allows local transactions only for non-distributed tenants
A tenant is not allowed to read/write other tenant‟s data/metadata
Exceptions:
– A tenant can read system tenant data
– Lifecycle management operations in system tenant
Disadvantage: Expensive cross-tenant transactions
Tenant-specific extensions
Tenant-specific table metadata for tenant-private tables
Tenant-specific extensions (metadata) to tenant-specific tables
© 2011 SAP AG. All rights reserved. 29 Confidential
Multi Tenant Data Separation
© SAP 2008 /
Page 29
Shared read-only
information
System Tenant
Central
Metadata
Tenant
Independent
Tables
Normal TenantTenant
Private Tables
Private
Metadata
Tenant
Dependent Tables
Normal TenantTenant
Private Tables
Private
Metadata
Tenant
Dependent Tables
Normal TenantTenant
Private Tables
Private
Metadata
Tenant
Dependent Tables
Indirect access (meta data replication or
distributed query)
Table Category Metadata Table Instance Access to Data
Tenant Dependent System Tenant Tenant Read access from all tenants
Tenant Independent System Tenant System Tenant Private to tenant
Tenant Private Tenant Tenant Private to tenant
© 2011 SAP AG. All rights reserved. 30 Confidential
Metadata sharing
Metadata (like table definitions) are divided in global and local metadata
Global metadata
Tenant spanning
Stored in system tenant
Replicated to all hosts
Read-only for tenants
Is delivered by application provider
Local metadata
Tenant-local
Stored in local tenant
Replicated to the other tenant-local processes (in case of distributed tenant)
Types
– Private tenant tables
– Extensions to tenant-independent tables (new columns)
During runtime: Union between global and local metadata
During tenant move, the global metadata must be consistent or the move is
expensive
© 2011 SAP AG. All rights reserved. 31 Confidential
Multi Tenancy in the In-Memory Computing Engine
Table Content Metadata
Tenant-independent
table
Stored in system tenant;
Read access from other
tenants
Stored in system tenant;
Read access from other
tenants
Tenant-dependent table Independent instances of
the table exist in each
tenant with content that is
private to the tenant
Table definition is stored
centrally in system tenant
with read access from
other tenants; tenant-
specific extensions
(additional columns, etc.)
are stored as private
metadata
Tenant-private table Stored as private data of
the tenant with no access
from other tenants
Stored as private meta
data of
the tenant
© 2011 SAP AG. All rights reserved. 32 Confidential
Special Tenants
System tenant
Metadata owner
Holds system-wide data
– Common master data (country names, currency conversion data)
– Delivered by SAP
Data can be read by all tenants
Owner of lifecycle processes
Can read all tenant‟s data
Template tenant(s)
Preconfigured
Not productive
Can be used as templates for creating user tenants
User/Customer tenants
Holds private data
Holds private metadata (extensions)
Can read system tenant data (weak transactional)
Can write system tenant data (exceptional mode)
© 2011 SAP AG. All rights reserved. 33 Confidential
ABAP System On In-Memory Computing Engine
No Tenant Separation on DB Level
NewDB System
(no tenant separation)
ABAP Application Server
Instance 3
ABAP Application Server
Instance 2
ABAP Application Server
Instance 1
HostHostHost
Dispatcher
Work
Process
R
Work
Process
R
Work
Process
R
Dispatcher
Work
Process
R
Work
Process
R
Work
Process
R
Dispatcher
Work
Process
R
Work
Process
R
Work
Process
R
NewDB Database Server A
NewDB Database Server B
Host
Host
R R R
© 2011 SAP AG. All rights reserved. 34 Confidential
ABAP System On In-Memory Computing Engine
With Tenant Separation on DB Level
ABAP Application
Server Instance 2
ABAP Application
Server Instance 1
NewDB
With Tenant
Separation
Dispatcher
Work
Process
R
Work
Process
R
R R
NewDB Database
Server
Tenant 1
Host H1
NewDB Database
Server
Tenant 2
Work
Process
R
R
Work
Process
R
R
Dispatcher
Work
Process
R
Work
Process
R
R R
NewDB Database
Server
Tenant 3
Host H2
NewDB Database
Server
Tenant 4
Work
Process
R
R
Work
Process
R
Rlocal
communication
Client
R
R
Client
R
R
R R
Client
R
R
Client
R
R
Users of
Tenant 2
Users of
Tenant 3
Users of
Tenant 4Users of
Tenant 1
NewDB Database
Server
System Tenant
R
RR
Rlocal
communication
remote
communication
Distributed query
execution (readonly)
© 2011 SAP AG. All rights reserved. 38 Confidential
Row Store Architecture
Row Store
Transactional
Version Memory
Record
Level Write
Locks
Read
Operations
Version Memory
Consolidation
Persistence Layer
Intermediate
Results
R
Write Operations
Free Pages,
Sparse Pages
Record
Versions
Rowid
Page
Manager
R
Rowid
Transaction
Manager Execution Layer
Checkpoint
Writer
IndexIndex
key rowid
R
R
Log Replay/
Undo Agent
R
R
Record
Versions
Segment
Page
Header
Slot
Page
Page
Write
log entries
invoke
checkpoint
Write
pages R
Row Store
Data records managed in pages
Temporary versions in separate memory
Index only in memory
Optimistic latch free index access
Loads all tables at startup
Incremental transaction logs
Parallel write and restart
Data persisted during savepoint operation