Relational Database Internals

Alex Scotti, Bloomberg LP

Outline of talk

• History - origins and background

• Internals - theory and practice

• Internals - brief discussion of real systems

• Future - observations, trends, predictions

History

• The early database systems differed from the relational ones in 2 main regards

• Data model

• Transactional semantics

• We’ll be more heavily focused on the transactional issues than the data modeling issues in this talk

• Pre Relational Systems

• Hierarchical Data Models

• IMS

• Network Data Models

• CODASYL

• IMS

• Each record was typed by a “record type” (think “table”)

• Relationships between records are represented as trees (hierarchies) between records linked by their “keys”

• Writing a “query” consisted of writing a program to navigate through these links, traversing records until the right one was found.

• Data types available were SEQUENTIAL, HASH, TREE

• Each acted differently - A program written to use a tree could not have the data structure changed out from under it

• lack of PHYSICAL INDEPENDENCE

• CODASYL

• A more complex “evolution” of the IMS idea, standardized by ANSI, implemented by several vendors

• Honeywell, DEC, Univac

• Idea is instead of pointers forming a strict hierarchy, they now form an arbitrarily complex “network.”

• Able to represent graphs

• Even HARDER to program with than IMS

• In 1970, Ted Codd wrote the foundational paper “A Relational Model of Data for Large Shared Data Banks”

• Codd was primarily a mathematician, not particularly concerned with transaction processing

• However, the two problems were incredibly tightly coupled

• Work at IBM began on “relational databases”

• Including locking, logging, all sorts of things that became the core of an RDBMS

• Codd’s insight

• A database is nothing more than a “fact store” from which it should be possible to logically infer “new facts.”

• Simple but amazingly powerful.

• If the goal is to store facts, then there is no benefit from storing the same fact multiple times or in multiple forms. A fact does not become MORE TRUE by repeating it. The basis of “normalization.”

• Codd’s insight

• If the system knows enough about the data it is storing, you can ASK IT QUESTIONS rather than TELL IT WHAT TO DO.

• Declarative vs Procedural programming model

• AKA, The nail in the coffin for all the Pre Relational systems

• Just a matter of time - If a system is easier to use, and performs fine, why wouldn’t you use it?

• Codd’s terminology has (for the most part) become replaced by the SQL terminology, which we’ll be using throughout the rest of this talk

• Generalized simplification

• “Logically organize your data into tables”

• Attribute → Column

• Tuple → Row

• Relation → Table

• Going further, Codd defined 12 “rules” that he hoped would define what it meant to be “relational”

• Key points are

• All information is represented in tables

• Nulls must be uniformly handled by all datatypes

• Physical representation must be abstracted from logical representation

• Physical location and distribution of data must be invisible to users

• Key points are

• “Set based” operations for insert / update / delete

• Integrity constraints must be enforceable by the database system

• There must be no way AROUND the set of enforced constraints

• There must be support for at least 1 “relational language”

• Codd’s work became the basis for a “next generation” database product at IBM called System R.

• System R was treated as a “production proof of concept.” At the end of the project there were several commercial customers.

• Around the same time, work was going on at UC Berkeley on the “Ingres” system, also based on Codd’s idea.

• Neither project was the first to bring a general purpose relational database to market.

• The award goes to Oracle.

• Oracle shipped a working commercial RDBMS, to anyone who would pay, before IBM did.

• Based also on Codd’s work.

• No common code between System R, Ingres, and Oracle - 3 unique lineages all based on the same idea

• IBM evolved the System R “prototype” into their second system : DB2.

• Ingres went on to be the basis of numerous successful commercial products

• Sybase was based on Ingres code

• Informix contains Ingres code (through Illustra)

• MSSQL contained Ingres code (through Sybase)

• Newer systems - all inspired by the same ideas and following the same principles, but without direct code sharing

• MySQL

• SQLite

• PostgreSQL

Internals

• Buffer Pool

• Log

• Concurrency Control

• Btrees

• Relational Layer

Buffer Pool

• Often known as “the cache”

• A page/block oriented data structure

• A page in the pool conceptually “maps” to a block on a disk. (not really always true)

• Needs to interface with the systems BELOW and the systems ABOVE.

• Below - Disks, File systems

• Above - Btrees

• Both page/block oriented interfaces above and below.

• Conceptually, very similar to the VM subsystem of any modern UNIX

• “demand paging”

• Eviction policy based on LRU approximations, often with more “smarts” than VM.

• Higher levels of the system often can pass down “hints” about intended access patterns all the way to the buffer pool.

Buffer Pool - Why?

• What’s the story? It’s a cache, we get it.

• Much more than that going on here!

• Basics of transaction management begins with the buffer pool and the policies and protocols enforced there

• Terminology

• “pinned” - a page that cannot be evicted

• “dirty” - a page that contains data that DOES NOT match the data on the disk

• “clean” - the opposite

• A dirty page BECOMES a clean page when the data in that page is DURABLY written to the disk

• Can we really just write a page to the disk? Not really, it usually involves logging protocols - wait for the next section!

• More terminology

• “forcing” - when a transaction commits, its dirty pages are FORCED to durable storage before the commit is considered complete

• “stealing” - A dirty page which is a part of an UNCOMMITTED transaction can be written to the disk in an effort to produce usable space in the buffer pool

• What is the simplest?

• FORCE / NOSTEAL

• What is the highest performing and most powerful?

• NOFORCE / STEAL

• Not surprisingly, most real world systems today implement a NOFORCE / STEAL buffer pool policy

• Support for this policy requires logging

• More terminology

• OVERWRITE / NO OVERWRITE

• Whether or not the buffer pool will write changes to a page ON TOP of an existing page, or leave the existing page alone and write to a NEW page.

• OVERWRITE systems are higher performing

• Most real world systems implement an OVERWRITE buffer pool.

• NO OVERWRITE example: System R, shadow paging

• How does data actually get written to the disk?

• The “clients” of the buffer pool (the layers above) never concern themselves with writing data. They work at a layer of abstraction where they “get” buffers and “dirty” them.

• Pages get written out (cleaned!) as part of a background process.

• Goal is to keep some portion of the buffer pool clean.

• Why are we trying to keep writing out these pages to disk in the background?

• To make the system more reliable?

• NO! Completely unrelated. Reliability ensured through other means

• To make sure that a READ doesn’t become a WRITE!

• Need a page? Can’t get one, all dirty.

• You get to “clean one” (write it) now!
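The buffer pool terminology above (pinned, dirty, clean, background cleaning, and a read that turns into a write) can be sketched in a few lines. This is a minimal illustration, not how any particular system does it: the `BufferPool` class, its method names, and the dict standing in for the disk are all assumptions made for the example, and the logging protocol that must precede a real page write is left out.

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache with LRU eviction, pin counts and dirty flags (illustrative only)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # dict: page number -> bytes, stands in for the disk
        self.frames = OrderedDict()   # page number -> frame, kept in LRU order
        self.forced_writes = 0        # times a page request had to clean a dirty page first

    def get(self, pageno):
        """Return the frame for pageno, pinned; load it from disk on a miss."""
        if pageno not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[pageno] = {"data": self.disk.get(pageno, b""), "dirty": False, "pins": 0}
        self.frames.move_to_end(pageno)   # mark as most recently used
        frame = self.frames[pageno]
        frame["pins"] += 1                # pinned pages cannot be evicted
        return frame

    def unpin(self, pageno, dirty=False):
        frame = self.frames[pageno]
        frame["pins"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def clean_some(self):
        """Background cleaner: write unpinned dirty pages so later evictions are free."""
        for pageno, frame in self.frames.items():
            if frame["dirty"] and frame["pins"] == 0:
                self.disk[pageno] = frame["data"]   # a real system applies logging rules first
                frame["dirty"] = False

    def _evict(self):
        for pageno, frame in self.frames.items():   # least recently used first
            if frame["pins"] == 0:
                if frame["dirty"]:                  # the READ just became a WRITE
                    self.disk[pageno] = frame["data"]
                    self.forced_writes += 1
                del self.frames[pageno]
                return
        raise RuntimeError("all pages are pinned")

disk = {1: b"a", 2: b"b", 3: b"c"}
pool = BufferPool(capacity=2, disk=disk)
frame = pool.get(1)
frame["data"] = b"A"
pool.unpin(1, dirty=True)
pool.get(2); pool.unpin(2)
pool.get(3); pool.unpin(3)    # pool is full: dirty page 1 must be written before page 3 loads
```

Calling `clean_some()` before the third `get` would have let that request evict page 1 without doing any I/O of its own, which is the whole point of the background cleaner.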

Logging

• Basic Idea behind logging

• Before you do something, write down what it is you intend to do.

• Sounds slow. Why bother with this, just DO IT!

• Nope - The opposite is true. Logging can make things quicker

• The highest performing buffer pool policy of NOFORCE/STEAL actually REQUIRES logging

• Without logging you would compromise with a lower performing policy

• Logging has the capacity to perform “magic”

• Converts RANDOM (slow) I/O into SEQUENTIAL (fast) I/O!

• We’ll come back to this idea

• Expanding on the basic idea of logging

• There are really two distinct things that you are “writing down” here

• Write down what it is you are about to do: REDO logging - can “do it over”

• Write down the procedure to follow to make it as if what you did NEVER HAPPENED: UNDO logging

• Often both pieces of information are embedded in a single “log record”, sometimes not. Conceptually they are 2 separate things.

• Mechanics of logging - What’s the data structure?

• In its basic form, a log is a simple sequential file. Conceptually it’s not unlike a tape drive.

• Each “record” in the log is identified by a unique identifier, which is typically just the physical location of the record in the file.

• Call this the Log Sequence Number (LSN)

• “Log Buffer” - exactly what it sounds like - a buffer of memory in front of the log.

• An obvious and common optimization to make it less expensive to “write to the log”

• Recoverability is endangered unless the log exposes an interface to FLUSH THE BUFFER. (and it gets called at the right places)

• All real systems work this way
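A rough sketch of the log mechanics just described: records are appended to a buffer, each record's LSN is its byte offset in the log, and nothing is durable until the buffer is flushed. The `Log` class and its method names are hypothetical, and an in-memory `bytearray` stands in for the log file (a real log would also force the write to stable storage).

```python
class Log:
    """Toy append-only log: a record's LSN is its byte offset in the log."""

    def __init__(self):
        self.durable = bytearray()   # stands in for the log file on disk
        self.buffer = bytearray()    # the in-memory log buffer

    def append(self, payload: bytes) -> int:
        """Buffer one record (4-byte length prefix + payload) and return its LSN."""
        lsn = len(self.durable) + len(self.buffer)
        self.buffer += len(payload).to_bytes(4, "big") + payload
        return lsn

    def flush(self):
        """Make every buffered record durable."""
        self.durable += self.buffer
        self.buffer.clear()

    def flushed_lsn(self) -> int:
        """Records with an LSN below this value are durable."""
        return len(self.durable)

    def crash(self):
        """Simulate a failure: whatever was only in the buffer is gone."""
        self.buffer.clear()

log = Log()
begin_lsn = log.append(b"begin T1")
update_lsn = log.append(b"T1: add X to page 7")
log.flush()
commit_lsn = log.append(b"commit T1")   # never flushed
log.crash()
# The commit record did not survive, so T1 must not be treated as committed.
```

This is why the flush call has to happen "at the right places": a commit is only real once its record sits below `flushed_lsn()`.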

• Subsystems are said to “generate log records” (calling APIs provided by log subsystem)

• Buffer pool may need to log the allocation of a new page

• Btree may need to log a page split

• Relational layer may log an INSERT statement

• Customers of this subsystem all over the database

• 2 approaches to logging

• “Physical logging”

• “Logical logging”

• Physical logging

• Log entire page images

• “redo record” : “log what the page is GOING TO look like”

• “undo record” : “log what the page LOOKS LIKE NOW”

• Problems?

• Inefficient, expensive

• Poor concurrency

• Problems

• Inefficiency mess

• Why log 2 copies of a page when I only changed a few bytes?

• Concurrency mess

• Systems with concurrency control at a finer granularity than the page cannot log this way. We’ll come back to that.

• On the other hand

• Physical logging is appealing because it is simple, and it works because of a nice property of being “testable”

• We can look at a log record ABOUT a page, then look at the page, and determine which state it’s in because we RECORDED the two possible states

• This turns out to be an essential property of recovery

• Logical logging

• Log the high level operations only

• SQL

• INSERT INTO TBL(A) VALUES(1)

• REDO

• “INSERT INTO TBL(A) VALUES(1)”

• UNDO

• “DELETE FROM TBL WHERE A = 1”

• Elegant!

• Simple!

• Compact!

• but it doesn’t WORK!

• That SQL INSERT could decompose into dozens of page writes.

• Some may have been done, then crash. You can’t look at the pages and tell which ones were done (UNDO them) and which ones weren’t

• NO FORCE allows us to mark a transaction “committed” WITHOUT writing all of the pages.

• Some may have been written, then crash

• We can’t tell which ones WERE NOT written (REDO them) and which ones WERE (leave them alone)

• It’s often UNSAFE to perform actions multiple times

• Making logical logging work - “Physiological Logging”

• “Physical ABOUT pages, logical ABOUT the contents INSIDE the page”

• The idea is to keep the logging centered on the idea of pages, which works well

• But log less information than a physical scheme would require

• Example Physiological Operation

• “Add item X to page N”

• Push down the logical concept into the page level - logical INSIDE the page

• SQL INSERT statement will decompose into several independent physiological operations

• Each one is INDEPENDENTLY TESTABLE / UNDOABLE / REDOABLE

• AKA, “it works”

• Logging for purposes of recovery

• Key technique is based on something called the “pagelsn”

• Intertwining of the buffer pool and the logger

• Each time you modify a page, store the LSN of the log record describing that modification ON THE PAGE ITSELF

• Testability

• Look at the pagelsn to determine state
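As a small illustration of how the pagelsn makes a physiological record testable, the sketch below redoes “add item X to page N” only when the page has not already seen that record. The dict layouts for pages and log records are assumptions for the example, with pagelsn 0 meaning “never modified”.

```python
def redo(page, record):
    """Apply a physiological redo record only if the page has not seen it yet."""
    if page["pagelsn"] >= record["lsn"]:
        return False                        # change is already on the page: leave it alone
    page["items"].append(record["item"])    # "add item X to page N"
    page["pagelsn"] = record["lsn"]         # stamp the page with the record's LSN
    return True

page = {"pageno": 7, "pagelsn": 0, "items": []}
record = {"lsn": 42, "pageno": 7, "item": "X"}

first = redo(page, record)    # applies the change
second = redo(page, record)   # no-op: the page already carries LSN 42
```

Without the pagelsn check, replaying the record twice would add X twice, which is exactly the “unsafe to perform actions multiple times” problem of pure logical logging.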

• Write Ahead Logging (WAL) Protocol

• Tightly integrated with buffer pool

• Before a dirty page is written to disk, the UNDO information for that page must be durable

• Before a transaction is considered committed, the REDO information for that transaction’s pages must be durable

• And that’s how a NO FORCE / STEAL system can convert random I/O into sequential

• Basic idea behind recovery after crash

• REDO all COMMITTED transactions

• Some pages MAY NOT be written

• as allowed by NO FORCE

• UNDO all UNCOMMITTED transactions

• Some pages MAY HAVE BEEN written

• as allowed by STEAL
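The two recovery passes can be sketched as below. This is a deliberately simplified model, not a real recovery algorithm: each page holds a single value, log records carry both the old and new value, and page level locks are assumed to be held until commit so that no committed change can sit on top of an uncommitted one. All record and page field names are made up for the example.

```python
def recover(log, pages):
    """REDO committed work that never reached disk, UNDO uncommitted work that did."""
    committed = {rec["txn"] for rec in log if rec["op"] == "commit"}

    # REDO pass, oldest first: NO FORCE means committed changes may be missing on disk
    for rec in log:
        if rec["op"] == "update" and rec["txn"] in committed:
            page = pages[rec["page"]]
            if page["pagelsn"] < rec["lsn"]:          # testability via the pagelsn
                page["value"], page["pagelsn"] = rec["new"], rec["lsn"]

    # UNDO pass, newest first: STEAL means uncommitted changes may be present on disk
    for rec in reversed(log):
        if rec["op"] == "update" and rec["txn"] not in committed:
            page = pages[rec["page"]]
            if page["pagelsn"] >= rec["lsn"]:         # the change did reach the disk
                page["value"] = rec["old"]
    return pages

log = [
    {"op": "update", "txn": "T1", "lsn": 1, "page": 1, "old": "a", "new": "b"},
    {"op": "update", "txn": "T2", "lsn": 2, "page": 2, "old": "x", "new": "y"},
    {"op": "commit", "txn": "T1", "lsn": 3},
]
# Disk state at the crash: T1 committed but page 1 was never written (NO FORCE),
# T2 is uncommitted but page 2 was already written out (STEAL).
pages = {
    1: {"pagelsn": 0, "value": "a"},
    2: {"pagelsn": 2, "value": "y"},
}
recover(log, pages)
```

After `recover` runs, page 1 holds T1's committed value and page 2 is back to its value from before T2 touched it.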

Concurrency Control

• Let’s talk about ACID now (finally?)

• We’ll use Chris Date’s definition

• Atomic

• A transaction fully completes or no part of it does.

• Correct (Consistent)

• Transactions transform a database from one correct state to another, not necessarily enforcing correctness during the transition between these two states

• Isolated

• Transactions are isolated from each other in such a way that a transaction will be “correct” regardless of what other transactions may be simultaneously executing

• Durable

• A “committed” transaction CAN NOT be “lost” after a system failure

• Concurrency control intertwines with all of these concepts.

• But mostly the I in ACID

• ISOLATED is really just a layman’s shorthand for “SERIALIZABLE”

• Basic Serializability Theory

• A system which runs all transactions sequentially (with no concurrency) produces a “history” known as a “serial history”

• A serial history is BY DEFINITION correct

• You can’t have concurrency problems WITHOUT CONCURRENCY!

• A system which allows for concurrency produces histories comprised of the interleaved execution of the concurrent transactions

• If that history can be said to be EQUIVALENT to a serial history (one produced through non concurrent execution) then the concurrent system’s history is said to be SERIALIZABLE

• EQUIVALENT - “Produces the same output and has the same effect on the database”

• Some formal notation

• rn[x] : Transaction n reads object x

• wm[y] : Transaction m writes object y

• cl : Transaction l commits

• “conflicting operations”

• r conflicts with w

• w conflicts with r

• w conflicts with w

• Conflict Serializability Testing

• A history can be considered equivalent to a serial history if it holds that for all conflicting operations the ordering of the conflicts is the same

• r1[x] r1[y] w1[x] r1[z] c1 r2[x] r2[a] c2

• r1[x] r1[y] w1[x] r2[a] r2[x] r1[z] c1 c2

• r2[x] conflicts with w1[x]

• In both histories order of conflict is same

• Serializability Graph Testing

• A technique to analyze any history for serializability is the “serialization graph”

• For each pair of committed transactions, add a directed edge from T1 to T2 if some step of T1 conflicts with, and comes before, a step of T2

• If the resulting graph contains NO CYCLES then the history is serializable
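The graph test can be tried out on the two histories shown earlier. The sketch below parses the r/w/c notation, adds an edge for every pair of conflicting operations from different transactions (earlier operation to later one), and checks for cycles. It assumes every transaction in the history commits, as they do in these examples; the function names are illustrative.

```python
import re

def parse(history):
    """Turn 'r1[x] w2[x] c1' into [(op, txn, obj), ...]; commit steps are skipped."""
    ops = []
    for token in history.split():
        m = re.fullmatch(r"([rw])(\d+)\[(\w+)\]", token)
        if m:
            ops.append((m.group(1), int(m.group(2)), m.group(3)))
    return ops

def serialization_graph(history):
    """Edge (a, b) means a step of transaction a conflicts with a later step of b."""
    ops = parse(history)
    edges = set()
    for i, (op1, t1, x1) in enumerate(ops):
        for op2, t2, x2 in ops[i + 1:]:
            # Conflict: different transactions, same object, at least one write
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_serializable(history):
    edges = serialization_graph(history)
    nodes = {t for edge in edges for t in edge}
    # Repeatedly strip nodes with no incoming edge; anything left over is on a cycle
    while nodes:
        free = {n for n in nodes if not any(b == n and a in nodes for a, b in edges)}
        if not free:
            return False
        nodes -= free
    return True

serial = "r1[x] r1[y] w1[x] r1[z] c1 r2[x] r2[a] c2"
interleaved = "r1[x] r1[y] w1[x] r2[a] r2[x] r1[z] c1 c2"
lost_update = "r1[x] r2[x] w1[x] w2[x] c1 c2"
```

Both `serial` and `interleaved` yield the single edge T1 → T2, so they are equivalent and serializable. `lost_update` yields edges in both directions between T1 and T2, a cycle, so it is not.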

• “Schedulers”

• Histories are said to be “produced” by the execution of events as determined by the “scheduler”

• This may or may not be a “real thing” in a real system.

• As a mental model we consider the scheduler to be a real thing whose job it is to schedule the interleaving of transactions in such a manner as to produce serializable histories

• “Conservative Schedulers”

• Err on the side of delaying execution (blocking) in the hopes of producing serializable histories

• Extreme case - no concurrent execution allowed!

• “Aggressive Schedulers”

• Aim to run with more concurrency with the understanding that non serializable histories may be produced and later rejected

• Extreme case – SGT based validating scheduler

• Locking based schedulers

• The most common real world schedulers all involve forms of locking as the basic mechanism

• Serializable histories are produced through a locking technique called 2 Phase Locking (2PL)

• 2PL Rules

• Acquire “read locks” on all objects read

• Acquire “write locks” on all objects written

• Only release locks at Commit

• It can be proven mathematically that all possible histories output from a 2PL scheduler are serializable

• It’s not that hard to convince yourself of this intuitively without the math
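The three 2PL rules fit in a very small lock table. In this sketch (a hypothetical `LockManager`, not any product's API) `acquire` does not block; it simply reports whether the lock was granted, so a `False` is the point where a real scheduler would make the transaction wait.

```python
class LockManager:
    """Toy 2PL lock table: readers share, a writer needs the object to itself."""

    def __init__(self):
        self.locks = {}   # object -> (mode, set of transactions holding it)

    def acquire(self, txn, obj, mode):
        """mode is 'R' or 'W'. Returns False where a real scheduler would block."""
        held_mode, holders = self.locks.get(obj, ("R", set()))
        others = holders - {txn}
        if others and (mode == "W" or held_mode == "W"):
            return False
        # Keep (or upgrade to) a write lock; otherwise the lock stays shared
        if mode == "W" or (held_mode == "W" and txn in holders):
            new_mode = "W"
        else:
            new_mode = "R"
        self.locks[obj] = (new_mode, holders | {txn})
        return True

    def commit(self, txn):
        """Locks are only ever released here, all at once."""
        for obj in list(self.locks):
            mode, holders = self.locks[obj]
            holders.discard(txn)
            if not holders:
                del self.locks[obj]

lm = LockManager()
lm.acquire(1, "x", "R")   # granted
lm.acquire(2, "x", "R")   # granted: read locks are shared
lm.acquire(2, "x", "W")   # refused: transaction 1 still holds a read lock
lm.commit(1)
lm.acquire(2, "x", "W")   # granted now
```

If transaction 1 had also asked for a write lock on x while transaction 2 was waiting, each would be waiting for the other, which is the deadlock drawback of 2PL.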

• 2PL drawbacks

• 2PL can be overly conservative in many cases, delaying concurrency needlessly when serializability would not have been compromised

• 2PL suffers from deadlocks as it allows for arbitrary interleaving of concurrent blocking operations in no defined order

• Serialization Graph Testing (SGT) Schedulers

• At commit time build a serialization graph and detect cycles.

• No real world system works this way

• Just too computationally expensive

• (fancy term for “slow”)

• Optimistic Concurrency Control (OCC) Schedulers

• Track “read sets” and “write sets” of all transactions

• At commit, ensure that no conflict between these sets has occurred.

• Make sure no transaction that started after your BEGIN has any overlap in its write set with your read set
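One way to picture commit time validation is the sketch below. It is an interpretation, not a definitive OCC implementation: it validates against transactions that committed after your BEGIN (a common formulation), writes are buffered privately until commit, and the `OCC` class and its fields are invented for the example.

```python
class OCC:
    """Toy optimistic scheduler: run freely, validate read sets at commit."""

    def __init__(self):
        self.db = {}
        self.clock = 0          # logical time, bumped on every successful commit
        self.committed = []     # (commit_time, set of keys written)

    def begin(self):
        return {"start": self.clock, "reads": set(), "writes": {}}

    def read(self, txn, key):
        txn["reads"].add(key)                       # track the read set
        return txn["writes"].get(key, self.db.get(key))

    def write(self, txn, key, value):
        txn["writes"][key] = value                  # track the write set, apply nothing yet

    def commit(self, txn):
        # Validate: nothing we read may have been overwritten since we began
        for commit_time, write_set in self.committed:
            if commit_time > txn["start"] and write_set & txn["reads"]:
                return False                        # rejected: the caller must retry
        self.clock += 1
        self.db.update(txn["writes"])
        self.committed.append((self.clock, set(txn["writes"])))
        return True

occ = OCC()
t1, t2 = occ.begin(), occ.begin()
occ.read(t1, "x")
occ.write(t2, "x", 1)
t2_ok = occ.commit(t2)     # succeeds: nobody committed before it
occ.write(t1, "y", 2)
t1_ok = occ.commit(t1)     # rejected: t1 read x, and t2 has since overwritten it
```

Note that `commit` has to run validation and apply as one indivisible step, which is where the “giant lock” concern comes from.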

• OCC Problems

• Trades the deadlock problem of 2PL for the “rejection” problem of OCC

• Can be very difficult to efficiently track conflicts.

• Difficult to allow high concurrency - “giant lock” around “validate” and “commit” phases

• No real world system implements a 100% pure OCC scheduler

• Predicate based Concurrency Control

• SQL: “UPDATE X WHERE Y>5 AND Y<10”

• Don’t lock all the rows between 5 and 10

• instead lock the SINGLE PREDICATE of

• “5<y<10”

• Need not be a “lock” - Compatible with OCC/Validating techniques as well

• Problems with predicates

• Gets very complicated very fast to support arbitrarily complex predicates

• Gets really really complicated to detect compatibility/conflicts between arbitrary predicates - much worse than the basic OCC problem

• But basic “degenerate” predicates have been used in real systems. In some systems our example would have been a “range lock”

• Less than serializable

• Many real world systems either do not fully implement serializability or offer optional (typically default) isolation levels that are WEAKER than serializable

• This is almost ALWAYS done for reasons of performance

• One very successful model of reduced isolation in real systems is known as “Snapshot Isolation”

• Snapshot Isolation (SI)

• An SI scheduler is frequently implemented as a Multi Version Concurrency Control (MVCC) system

• MVCC permits the notion of “versions” of objects

• The notation r1[x] w2[y] is extended to r1[x2] w2[y4]

• Transaction 1 reads version 2 of x

• Transaction 2 writes version 4 of y

• SI defines 2 rules for an MVCC system to follow

• Each version of an object x that is READ BY transaction T is the most recently committed version of x as of the BEGIN of T

• 2 Transactions that overlap in BEGIN and COMMIT time do not write to the SAME OBJECT

• Problems with SI schedulers

• SI can generate histories that are not serializable

• The main issue is referred to as “write skew” - Idea is it becomes visible sometimes that there is a “skew” in time - as your reads and your writes appear to execute at different points in time

• Write Skew

• Simple example - Imagine trying to enforce an integrity constraint inside the application (the db doesn’t know of this constraint)

• In a 2PL system, it’s easy - read all your conditions before committing

• In an SI system that doesn’t work
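Write skew is easiest to see in a concrete run. The sketch below implements just the two SI rules over a toy versioned store (the `SnapshotDB` class is hypothetical) and then runs two transactions that each check an application level constraint, “at least one of alice and bob is on call”, before taking one person off call.

```python
class SnapshotDB:
    """Toy SI store: every key keeps a list of (commit_time, value) versions."""

    def __init__(self, initial):
        self.versions = {key: [(0, value)] for key, value in initial.items()}
        self.clock = 0

    def begin(self):
        return {"start": self.clock, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        # Rule 1: see the newest version committed as of this transaction's BEGIN
        return [v for ts, v in self.versions[key] if ts <= txn["start"]][-1]

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # Rule 2: overlapping transactions must not write the same object
        for key in txn["writes"]:
            if self.versions[key][-1][0] > txn["start"]:
                return False
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions[key].append((self.clock, value))
        return True

db = SnapshotDB({"alice": True, "bob": True})   # True means "on call"
t1, t2 = db.begin(), db.begin()

# Each transaction checks the constraint against its own snapshot, then writes
if db.read(t1, "alice") and db.read(t1, "bob"):
    db.write(t1, "alice", False)
if db.read(t2, "alice") and db.read(t2, "bob"):
    db.write(t2, "bob", False)

both_committed = db.commit(t1) and db.commit(t2)   # different objects, so rule 2 is happy
t3 = db.begin()
on_call = [name for name in ("alice", "bob") if db.read(t3, name)]
```

Both commits succeed and `on_call` ends up empty: each transaction saw a valid state and made a valid change, yet the combined result breaks the constraint. No serial execution of the two could have produced it.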

• SI anomaly of write skew can be worked around in the application if the DBMS provides explicit “locking” primitives that can be used.

• In previous example, the application would be responsible for “locking” the items read to ensure serializability

• Oracle: SELECT FOR UPDATE

• Comdb2: SELECTV

• SI is often “good enough” and can provide much greater concurrency in many cases than 2PL.

• The SI anomalies are not recognized by ANSI SQL.

• So strangely, according to ANSI SQL, an SI system actually IS serializable. (it isn’t)

• ANSI SQL defines only three anomalies: Dirty Read, Non Repeatable Read, Phantom

• An SI scheduler can be implemented as a type of aggressive, validating scheduler.

• This retains some aspects of OCC, deferring the validation of w-w conflicts (rule 2, no overlap in writes) until commit time

• An SI scheduler can also be built from a conservative locking scheduler

• Write locks can be acquired to enforce the second rule of SI, ensuring blocking or deadlock for non compliant histories

Btrees

• The workhorse data structure of a Relational Database System

• Most common choice for implementing an index. Sometimes a choice for storing data too.

• Key Idea

• Like a binary tree (balanced) but allowing more than one item on a “node” and more than 2 children per node

• A node becomes a PAGE - out of practical necessity

• Buffer pool wants pages

• Logging, recovery wants pages

• Concurrency control wants pages

• A page in Btree maps 1 to 1 into a page in the buffer pool

• Which maps (somehow) into a block on a disk

• Buffer pool could overlay on disk

• Typically it overlays on filesystem

• Filesystem often further abstracted from disk

• Hardware RAID, etc

• Logging is often physiological

• “Add item X to Btree” (operation) can generate log record of

• “Insert item X into the array on page 2”

• Forms of logical logging are often used for internal data structure maintenance

• A “page split” may be a logged event

• Key insight into Btree recovery

• If 2 Btrees ACT the same, then they ARE the same

• Recovery NEED NOT create a bit for bit perfect copy of the original data structure, only one that is indistinguishable from the original over all operations defined to be supported by the Btree

• Concurrency control WITHIN the Btree is typically based on a complex locking protocol with the goal of allowing maximum concurrency (reads and writes) to distinct pages in parallel

• Much like concurrency control in general, many “exotic” non locking variants exist - few if any are really used

• Simplified locking

• Always access tree from “top” (parent) to “bottom” (leaf)

• Always hold lock on item ABOVE before attempting access to item BELOW

• release lock on item ABOVE when you know it’s “safe” (you won’t be going “up”)

• Page level granularity for concurrency in the Btree structure
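The top-down rule can be sketched for the simplest case, a read-only lookup. In this illustration a `threading.Lock` on each node stands in for a page lock, the `Node` class is invented for the example, and splits, inserts and shared (reader) lock modes are left out entirely.

```python
import bisect
import threading

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored on this page
        self.children = children or []    # empty list for a leaf page
        self.lock = threading.Lock()      # stands in for the page lock

def search(root, key):
    """Top-down lookup: lock the page below before releasing the page above."""
    node = root
    node.lock.acquire()
    while node.children:
        # Keys equal to a separator live in the subtree to its right
        child = node.children[bisect.bisect_right(node.keys, key)]
        child.lock.acquire()    # still holding the lock on the item ABOVE
        node.lock.release()     # a lookup never goes back up, so this is now safe
        node = child
    found = key in node.keys
    node.lock.release()
    return found

root = Node([10, 20], [Node([1, 5]), Node([10, 15]), Node([20, 30])])
hit = search(root, 15)
miss = search(root, 7)
```

A writer follows the same descent but can only release the parent once it knows the child will not split, which is what “safe” means in the rule above.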

• Real systems often provide transactions with finer (ROW) granularity for concurrency than pages

• Key insight

• A form of logical logging

• Call the low level (page oriented) work the “physical” level and the high level (descriptive) work the “logical level”

• A logical operation to a Btree could be “insert X” while physical (physiological) is

• “Add X to page 43” or even something like

• “split page 43 into pages 43 and 54, update page 532 (parent) to see the new sibling, update page 87 (to the right of 54) to see 54 as its left neighbor, add X to page 54”

• Use logical logging on the Btree for undo

• A tree need not be the same if it can ACT the same!

• In our previous example, we CAN’T physically undo. If we released our page locks BEFORE transaction commit (which we HAVE TO if we want better than page granularity) then another committed transaction could have put data into newly created page 54

• A physical undo would remove page 54.

• It would cause the loss of data from a subsequent COMMITTED transaction!

• We need to logically undo - leave the tree structure alone

• Remove X from 54 is all we need to do.

• Modified 2PL protocol for row level concurrency and serializability

• When reading row X obtain read lock on row

• When writing row Y, obtain write locks on pages modified by row write, obtain row lock on row Y, release page locks on pages modified by row write

• Row locks follow 2PL protocol, always held until commit

• Page locks are released early

Relational Layer

• Relational Algebra defines 8 primitive operations

• RESTRICT: Choose rows

• PROJECT: Choose columns

• PRODUCT: Multiply 2 sets of columns

• UNION: Add 2 sets of rows

• INTERSECT: Produce set of rows in common between 2 sets

• DIFFERENCE: Remove the commonality of set of rows between set 1 and 2 from set 1

• NATURAL JOIN: Produce a set of rows based on common values of a column

• DIVIDE: Opposite of PRODUCT

• Relational Algebra is a procedural way of expressing a problem

• Lay out the “steps” in terms of the “operators”

• Not procedural in terms of “implementation”

• The implementation of each operator is a procedural operation.

• The algebra has specific rules (in terms of what is commutative, etc) which can be used for simplification of expressions

• Relational Languages

• In practice, nobody is writing any math to run a query! No relational algebra, no relational calculus

• SQL is the dominant (only?!) Relational Language. Others have existed.

• Informix: “Informer”

• Ingres: QUEL / PostgreSQL: POSTQUEL

• System R: SEQUEL

• The purpose of a relational language is to expose enough power to allow one to express anything that would be possible in the relational algebra.

• Codd termed this to be “relationally complete”

• SQL is a relationally complete language

• Inspired by parts of the calculus, parts of the algebra, and a desire to be “english like” rather than “mathematical”

• SQL is a “compiled” programming language

• The database parses SQL then compiles it into an intermediary form for execution

• Conceptually, this intermediary form can be thought of as relational algebra

• This compiled form is often referred to as the “query plan”

• SELECT * from users WHERE uuid=123;

• σ uuid=123 (users)

• SELECT name, age FROM users WHERE numchildren > 2 and numcars > 3;

• π name, age (σ numchildren > 2 and numcars > 3 (users) )
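To make the algebra form of the second query concrete, here is a sketch of RESTRICT (σ) and PROJECT (π) as plain functions over lists of dicts. The function names and the sample `users` rows are made up for illustration. PROJECT removes duplicates here because the algebra works on sets, whereas SQL's SELECT keeps duplicates unless DISTINCT is given.

```python
def restrict(rows, predicate):
    """σ: keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """π: keep only the named columns, dropping duplicate results."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.add(values)
            out.append(dict(zip(columns, values)))
    return out

users = [
    {"uuid": 123, "name": "Ann", "age": 41, "numchildren": 3, "numcars": 4},
    {"uuid": 456, "name": "Bob", "age": 35, "numchildren": 1, "numcars": 5},
]

# π name, age (σ numchildren > 2 and numcars > 3 (users))
result = project(
    restrict(users, lambda r: r["numchildren"] > 2 and r["numcars"] > 3),
    ["name", "age"],
)
```

The nesting of the function calls mirrors the nesting of the algebra expression: the innermost operator runs first and feeds its rows to the one wrapped around it.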

• Producing a query plan is HARD WORK!

• It’s the job of a component called the “query planner”

• A single query can be represented by an infinite number of query plans

• Most are absurd and would never be generated by anything other than a defective or malicious planner

• Some are MUCH LESS WRONG

• But only 1 is THE BEST (for this input!)

• The job of the planner is to quickly prune down the search space of plans to ones that might have a chance at being good, then quickly evaluate the “goodness” of the remaining choices

• Quick - This is an overwhelming source of tension in the planner - quick vs correct

• If the system took 1 minute to generate a plan to run your query in 1 second or 1 second to get a plan to run your query in 5 seconds, which would you choose?

• 2 Main approaches

• “Rules” based optimization

• Follow specific mechanical rules about the way the SQL was written to produce a plan

• “Cost” based optimization

• Use heuristics to evaluate multiple plans, looking for the one with the lowest “cost”

• Most real world systems today are cost based, with some cases of using rules

• It may be advantageous to employ boolean algebra to rewrite expressions containing ANDs and NOTs to contain ORs if your system allows for OR to be implemented with multiple indexes (and not AND)

• Called a “query rewrite rule”

• Followed mechanically, as it is considered to be “always good”

• Many of the early systems were purely rules based

• A “bad rule” - The order that the tables are listed in should be the order (inner/outer) of the tables in the nested loop of a JOIN

• Exactly what the rules based systems did for years

• (including the first version of Comdb2)

• Cost based optimization is based on the concept of “statistics.”

• The database keeps internal statistics about the CONTENTS of the data

• SELECT * from tbl where X=5

• Table scan on tbl, filtering on X=5

• Index lookup on X=5

• Which is better? It depends

• In most real systems, a table scan beats an index scan once a “break even point” is passed - when more than some percentage of the rows must be visited

• The system needs to “know” which % of the rows in the table are likely to contain X=5

• Only with that information can it choose the fastest plan. (for this input and this data!)
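The break even idea reduces to a small cost comparison. In the sketch below the per-row costs are invented constants (an index lookup is assumed to cost 4 times as much per row as a sequential scan, putting the break even point at 25% of the table); real planners use far richer models and statistics.

```python
def choose_plan(table_rows, matching_fraction,
                scan_cost_per_row=1.0, index_cost_per_row=4.0):
    """Pick the cheaper of a full table scan and an index lookup (toy cost model)."""
    # A table scan touches every row, however few of them match
    scan_cost = table_rows * scan_cost_per_row
    # An index lookup touches only matching rows, but each touch is random I/O
    index_cost = table_rows * matching_fraction * index_cost_per_row
    if index_cost < scan_cost:
        return ("index lookup", index_cost)
    return ("table scan", scan_cost)

# SELECT * from tbl where X=5, with statistics saying 0.1% of rows have X=5
rare = choose_plan(1_000_000, 0.001)
# Same query, but the statistics say half the table has X=5
common = choose_plan(1_000_000, 0.5)
```

The same SQL gets a different plan purely because the statistics about the data differ, which is why the `matching_fraction` estimate matters so much.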

• SELECT * from tbl where X=5 and Y=6

• Use index on X=5 and filter for Y=6

• Use index on Y=6 and filter for X=5

• SELECT * from tbl1, tbl2 where tbl1.a=tbl2.a

• For every row in tbl1 look into tbl2 with an index to find corresponding a

• For every row in tbl2 look into tbl1 with an index to find corresponding a

• Real systems gather all sorts of statistics about the data which all feed into the query planner

• Size of table, Size of indexes

• Selectivity of indexes

• Distribution of values in indexes

• Sampling of commonly occurring values

• And more. An open field, still filled with trade secrets.

• Running a query

• The query planner ultimately generates a “program” in the form of some internal intermediary representation of the procedural execution of the query which is handed to the “query executor” for execution.

• The query executor is a “customer” of all the subsystems

• Query executor

• Uses Btrees for access to indexes

• Uses concurrency control to support the SQL notion of a transaction

• Uses logging to make modifications Durable and Atomic

• Uses the buffer pool to retrieve items from disk

Real world Systems

• DB2

• Oracle

• Postgres

• Comdb2

DB2

• Provides serializable isolation

• Uses a 2PL locking protocol

• Complex Btree locking techniques

• Next-key locking

• Key/Value locking

• Key-range locking

• NO FORCE / STEAL buffer policy

• System R was originally FORCE / NO STEAL

• System R didn’t even log

• Over time, it became clear that logging + no force / steal is the key to high performance systems

• UNDO and REDO logging

• Sophisticated cost based query planner

• Cost based query planning was invented in the System R project, described by Selinger in a paper published in 1979

• Oracle sold a Rules based planner until 1992

• Row level locking

• System R was using Row level locking in the late 70s, early 80s.

• DB2 gained row locking in 1995 (mainframe only; it took even longer to reach UNIX)

• Oracle gained row locking in 1988

Oracle

• Oracle is its own system - Shares nothing at all with System R

• Many interesting approaches to solving the same problems were developed

• Provides Snapshot Isolation

• Does not use 2PL, instead uses a form of MVCC

• The buffer pool itself in Oracle is versioned

• Objects (rows) are not versioned per se

• The pages they exist on are

• When an update occurs, pages are modified IN PLACE.

• When a read needs to see an earlier version of a page, the UNDO logs are consulted to recreate a prior version of this page and place it into the buffer pool

• The algorithm is roughly based on usage of the “pagelsn” (Different terminology in Oracle, same rough concept)

• When you start a transaction (snapshot) record the current LSN as your “birthlsn”

• If you are looking at a page, and the page has a pagelsn LESS THAN the birthlsn of your transaction, then you know you are meant to see everything on that page

• Else, use UNDO log records to reconstruct a version of that page that now has a pagelsn LESS THAN your birthlsn

• Place new(old) page in buffer pool, proceed
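The rough algorithm above can be sketched as follows. This is only an illustration of the idea in the talk's own terms (pagelsn, birthlsn), not Oracle's actual data structures: the undo record layout, including the `prev_lsn` field holding the page's pagelsn from before the change, is an assumption made for the example.

```python
import copy

def read_consistent(page, undo_log, birthlsn):
    """Return a copy of page rolled back to how a snapshot begun at birthlsn sees it."""
    version = copy.deepcopy(page)          # the current page itself is left untouched
    # Walk the undo records newest first, backing out changes made after the snapshot
    for rec in sorted(undo_log, key=lambda r: r["lsn"], reverse=True):
        if version["pagelsn"] < birthlsn:
            break                          # everything on this version is visible
        if rec["pageno"] == version["pageno"] and rec["lsn"] == version["pagelsn"]:
            version["rows"][rec["key"]] = rec["old_value"]
            version["pagelsn"] = rec["prev_lsn"]
    return version

page = {"pageno": 9, "pagelsn": 30, "rows": {"a": 3}}
undo_log = [
    {"lsn": 20, "pageno": 9, "key": "a", "old_value": 1, "prev_lsn": 10},
    {"lsn": 30, "pageno": 9, "key": "a", "old_value": 2, "prev_lsn": 20},
]

current = read_consistent(page, undo_log, birthlsn=35)   # nothing to undo
older = read_consistent(page, undo_log, birthlsn=25)     # backs out the change at LSN 30
oldest = read_consistent(page, undo_log, birthlsn=15)    # backs out LSN 30 and LSN 20
```

Readers never block writers in this scheme: the writer keeps modifying the page in place while each reader builds the older version it needs from undo.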

• UNDO and REDO logging

• NO FORCE / STEAL buffer management

• UNDO and REDO logs are physically “split” into 2 distinct data structures

• The REDO logs act like a conventional “log” file in Oracle

• The UNDO logs have a much more complex organization for performance reasons due to the unique requirements Oracle places on UNDO for MVCC

• Oracle’s locking protocol is relatively simple

• MVCC takes care of many of the issues that DB2 solves with locking

• No long term read locking ever, not even on rows.

• “first rule of SI” enforced through MVCC policy of producing most recently committed data as of BEGIN

• Long term write locks taken on modified rows

• Used to enforce “second rule of SI”

PostgreSQL

• “Second system” developed after Ingres

• Not based on Ingres, at the time meant as a proving ground for “new ideas”

• Key idea

• NO OVERWRITE

• At the row level. The buffer pool will overwrite pages.

• Old versions of rows don’t disappear after an update, they simply become “older versions” of that row

• Used to implement an SI isolation model on top of a row based MVCC system

• NO FORCE / STEAL

• REDO logging

• No UNDO logging!

• Able to get away with this because of the “no overwrite” nature of updates!

• Earlier versions of PostgreSQL attempted to run without logging. They used a FORCE policy

• Eventually came to the same conclusions as everyone: LOG + NO FORCE + STEAL

Comdb2

• “Second system” developed after Comdbg

• Attempt to produce a relational system maintaining some level of compatibility with earlier pre relational systems.

• Provides Snapshot Isolation

• Rows are versioned

• Undo logs are used to reconstruct rows (not pages)

• Does not use 2PL

• Uses a form of OCC

• Aggressive, Validating scheduler

• Attempt to run transactions concurrently under hopes that work to backout and retry will be minimal

Future

• The RDBMS will continue to evolve as our hardware continues to change

• X86 is dominant - overperforming, underpriced!

• Support that platform and support it well

• Memory is becoming cheap and huge

• Assumptions about what is reasonable to keep in memory and what is on disk are changing

• Networks are reaching latency levels comparable to SMP interconnects

• Distributed systems are more realistic now

• Conversely, HIGH LATENCY, low availability (“the internet”) networks are becoming another reality that must be acknowledged

• Research on relaxed isolation levels that scale across these types of environments will continue - Last word far from said there

• Generally speaking, the highly available, distributed systems will be the most able to adapt and survive

• The idea of a “disk” is changing

• SSD challenges many assumptions about “sequential” vs “random” access

• At best, “tuning” may be needed for some RDBMS

• At worst, a “rewrite” may be in order

• SSD challenges the notion of the OVERWRITE buffer policy being hands down superior.

• An SSD at heart (under the hood) IS a NO OVERWRITE system. It’s easy to imagine a NO OVERWRITE buffer pool manager plugged DIRECTLY into SSD, bypassing file system abstractions

• The best ideas from the “post relational” (no sql) camp will converge with the ideas from the RDBMS producing best of breed systems

• Ease of scaling across commodity hardware

• High availability DESPITE unreliable hardware

• The two unstoppable ideas from the Relational Systems will continue to be the reason why these systems will dominate

• Data Abstraction

• Declarative languages
