Relational Database Internals

Relational Database Internals Alex Scotti Bloomberg LP


Transcript of Relational Database Internals

Page 1: Relational Database Internals

Relational Database Internals

Alex Scotti, Bloomberg LP

Page 2: Relational Database Internals

Outline of talk

• History - origins and background

• Internals - theory and practice

• Internals - brief discussion of real systems

• Future - observations, trends, predictions

Page 3: Relational Database Internals

History

• The early database systems differed from the relational ones in 2 main regards

• Data model

• Transactional semantics

• We’ll be more heavily focused on the transactional issues than the data modeling issues in this talk

Page 4: Relational Database Internals

• Pre Relational Systems

• Hierarchical Data Models

• IMS

• Network Data Models

• CODASYL

Page 5: Relational Database Internals

• IMS

• Each record was typed by a “record type” (think “table”)

• Relationships between records are represented as trees (hierarchies) between records linked by their “keys”

• Writing a “query” consisted of writing a program to navigate through these links, traversing records until the right one was found.

Page 6: Relational Database Internals

• Data types available were SEQUENTIAL, HASH, TREE

• Each acted differently - A program written to use a tree could not have the data structure changed out from under it

• lack of PHYSICAL INDEPENDENCE

Page 7: Relational Database Internals

• CODASYL

• A more complex “evolution” of the IMS idea, standardized by ANSI, implemented by several vendors

• Honeywell, DEC, Univac

• Idea is instead of pointers forming a strict hierarchy, they now form an arbitrarily complex “network.”

• Able to represent graphs

• Even HARDER to program with than IMS

Page 8: Relational Database Internals

•In 1970, Ted Codd wrote the foundational paper “A Relational Model of Data for Large Shared Data Banks”

•Codd was primarily a mathematician, not particularly concerned with transaction processing

•However, the two problems were incredibly tightly coupled

•Work at IBM began on “relational databases”

• Including locking, logging, all sorts of things that became the core of an RDBMS

Page 9: Relational Database Internals

• Codd’s insight

• A database is nothing more than a “fact store” from which it should be possible to logically infer “new facts.”

• Simple but amazingly powerful.

• If the goal is to store facts, then there is no benefit from storing the same fact multiple times or in multiple forms. A fact does not become MORE TRUE by repeating it. The basis of “normalization.”

Page 10: Relational Database Internals

• Codd’s insight

• If the system knows enough about the data it is storing, you can ASK IT QUESTIONS rather than TELL IT WHAT TO DO.

• Declarative vs Procedural programming model

• AKA, The nail in the coffin for all the Pre Relational systems

• Just a matter of time - If a system is easier to use, and performs fine, why wouldn’t you use it?

Page 11: Relational Database Internals

• Codd’s terminology has (for the most part) been replaced by SQL terminology, which we’ll be using throughout the rest of this talk

• Generalized simplification

• “Logically organize your data into tables”

• Attribute → Column

• Tuple → Row

• Relation → Table

Page 12: Relational Database Internals

• Going further, Codd defined 12 “rules” that he hoped would define what it meant to be “relational”

• Key points are

• All information is represented in tables

• Nulls must be uniformly handled by all datatypes

• Physical representation must be abstracted from logical representation

• Physical location and distribution of data must be invisible to users

Page 13: Relational Database Internals

• Key points are

• “Set based” operations for insert / update / delete

• Integrity constraints must be enforceable by the database system

• There must be no way AROUND the set of enforced constraints

• There must be support for at least 1 “relational language”

Page 14: Relational Database Internals

• Codd’s work became the basis for a “next generation” database product at IBM called System R.

• System R was treated as a “production proof of concept.” At the end of the project there were several commercial customers.

• Around the same time, work was going on at UC Berkeley on the “Ingres” system, also based on Codd’s idea.

Page 15: Relational Database Internals

• Neither system was successful at commercializing a general purpose database.

• The award goes to Oracle.

• Oracle shipped a working commercial RDBMS, to anyone who would pay, before IBM did.

• Based also on Codd’s work.

• No common code between System R, Ingres, and Oracle - 3 unique lineages all based on the same idea

Page 16: Relational Database Internals

• IBM evolved the System R “prototype” into their second system : DB2.

• Ingres went on to be the basis of numerous successful commercial products

• Sybase was based on Ingres code

• Informix contains Ingres code (through Illustra)

• MSSQL contained Ingres code (through Sybase)

Page 17: Relational Database Internals

• Newer systems - all inspired by the same ideas and following the same principles, but without direct code sharing

• MySQL

• SQLite

• PostgreSQL

Page 18: Relational Database Internals
Page 19: Relational Database Internals

Internals

• Buffer Pool

• Log

• Concurrency Control

• Btrees

• Relational Layer

Page 20: Relational Database Internals

Buffer Pool

•Often known as “the cache”

•A page/block oriented data structure

•A page in the pool conceptually “maps” to a block on a disk. (not really always true)

•Needs to interface with the systems BELOW and the systems ABOVE.

•Below - Disks, File systems

•Above - Btrees

Page 21: Relational Database Internals

• Both page/block oriented interfaces above and below.

• Conceptually, very similar to the VM subsystem of any modern UNIX

• “demand paging”

• Eviction policy based on LRU approximations, often with more “smarts” than VM.

• Higher levels of the system often can pass down “hints” about intended access patterns all the way to the buffer pool.

Page 22: Relational Database Internals

Buffer Pool - Why?

• What’s the story? It’s a cache, we get it.

• Much more than that going on here!

Page 23: Relational Database Internals

• The basics of transaction management begin with the buffer pool and the policies and protocols enforced there

• Terminology

• “pinned” - a page that cannot be evicted

• “dirty” - a page that contains data that DOES NOT match the data on the disk

• “clean” - the opposite

Page 24: Relational Database Internals

• A dirty page BECOMES a clean page when the data in that page is DURABLY written to the disk

• Can we really just write a page to the disk? Not really, it usually involves logging protocols - wait for the next section!

Page 25: Relational Database Internals

• More terminology

• “forcing” - when a transaction commits, its dirty pages are FORCED to durable storage before considering the commit complete

• “stealing” - A dirty page which is a part of an UNCOMMITTED transaction can be written to the disk in an effort to produce usable space in the buffer pool

Page 26: Relational Database Internals

• What is the simplest?

• FORCE / NOSTEAL

• What is the highest performing and most powerful?

• NOFORCE / STEAL

• Not surprisingly, most real world systems today implement a NOFORCE / STEAL buffer pool policy

• Support for this policy requires logging

Page 27: Relational Database Internals

•More terminology

•OVERWRITE / NO OVERWRITE

•Whether or not the buffer pool will write changes to a page ON TOP of an existing page, or leave the existing page alone and write to a NEW page.

•OVERWRITE systems are higher performing

•most real world systems implement an OVERWRITE buffer pool.

•NO OVERWRITE example: System R, shadow paging

Page 28: Relational Database Internals

• How does data actually get written to the disk?

• The “clients” of the buffer pool (the layers above) never concern themselves with writing data. They work at a layer of abstraction where they “get” buffers and “dirty” them.

• Pages get written out (cleaned!) as part of a background process.

• Goal is to keep some portion of the buffer pool clean.

Page 29: Relational Database Internals

• Why are we trying to keep writing out these pages to disk in the background?

• To make the system more reliable?

• NO! Completely unrelated. Reliability ensured through other means

• To make sure that a READ doesn’t become a WRITE!

• Need a page? Can’t get one, all dirty.

• You get to “clean one” (write it) now!
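The buffer pool terminology above can be put together in a minimal sketch. It assumes a dict stands in for the disk and a plain LRU order stands in for the eviction policy, and it ignores the logging protocol entirely. The names `BufferPool`, `get`, `unpin` and `mark_dirty` are hypothetical, not taken from any real system. The last line shows a READ turning into a WRITE because the only evictable page is dirty.

```python
from collections import OrderedDict


class BufferPool:
    """Toy page cache: LRU eviction, pin counts, dirty flags."""

    def __init__(self, disk, capacity):
        self.disk = disk              # page number -> bytes (stands in for the disk)
        self.capacity = capacity
        self.frames = OrderedDict()   # page number -> [data, pin_count, dirty]
        self.writes = 0               # page writes caused by eviction

    def get(self, pageno):
        """Return the page data, pinned. The caller must unpin() when done."""
        if pageno not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[pageno] = [bytearray(self.disk[pageno]), 0, False]
        self.frames.move_to_end(pageno)   # most recently used goes last
        frame = self.frames[pageno]
        frame[1] += 1
        return frame[0]

    def unpin(self, pageno):
        self.frames[pageno][1] -= 1

    def mark_dirty(self, pageno):
        self.frames[pageno][2] = True

    def _evict(self):
        # The oldest unpinned page is the victim; a dirty victim is written first.
        for victim, (data, pins, dirty) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[victim] = bytes(data)
                    self.writes += 1
                del self.frames[victim]
                return
        raise RuntimeError("all pages are pinned")


disk = {1: b"aaaa", 2: b"bbbb"}
pool = BufferPool(disk, capacity=1)
page = pool.get(1)
page[0:1] = b"X"
pool.mark_dirty(1)
pool.unpin(1)
pool.get(2)   # reading page 2 forces a write of dirty page 1
```

A background cleaner, as described on the slide, would write dirty unpinned pages ahead of time so that `_evict` usually finds a clean victim.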

Page 30: Relational Database Internals
Page 31: Relational Database Internals

Logging

• Basic Idea behind logging

• Before you do something, write down what it is you intend to do.

• Sounds slow. Why bother with this, just DO IT!

• Nope - The opposite is true. Logging can make things quicker

Page 32: Relational Database Internals

• The highest performing buffer pool policy of NOFORCE/STEAL actually REQUIRES logging

• Without logging you would compromise with a lower performing policy

• Logging has the capacity to perform “magic”

• Converts RANDOM (slow) I/O into SEQUENTIAL (fast) I/O!

• We’ll come back to this idea

Page 33: Relational Database Internals

•Expanding on the basic idea of logging

•There are really two distinct things that you are “writing down” here

•Write down what it is you are about to do: REDO logging - can “do it over”

•Write down the procedure to follow to make it as if what you did NEVER HAPPENED: UNDO logging

•Many times both of these pieces of information are embedded into a single “log record.” Or not. Conceptually they are 2 things.

Page 34: Relational Database Internals

• Mechanics of logging - What’s the data structure?

• In its basic form, a log is a simple sequential file. Conceptually it’s not unlike a tape drive.

• Each “record” in the log is identified by a unique identifier, which is typically just the physical location of the record in the file.

• Call this the Log Sequence Number (LSN)

Page 35: Relational Database Internals

• “Log Buffer” - exactly what it sounds like - a buffer of memory in front of the log.

• An obvious and common optimization to make it less expensive to “write to the log”

• Recoverability is endangered unless the log exposes an interface to FLUSH THE BUFFER. (and it gets called at the right places)

• All real systems work this way
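A minimal sketch of the log mechanics described above, assuming an in-memory `io.BytesIO` stands in for the log file and a 4-byte length prefix frames each record. The class `Log` and its methods are hypothetical. It shows the LSN as the byte offset of a record, and a log buffer that is only made "durable" by an explicit flush.

```python
import io


class Log:
    """Toy sequential log. The LSN of a record is its byte offset in the file."""

    def __init__(self):
        self.file = io.BytesIO()   # stands in for the durable log file
        self.buffer = []           # records appended but not yet flushed
        self.next_lsn = 0          # offset at which the next record will land
        self.flushed_lsn = 0       # everything below this offset is "durable"

    def append(self, payload: bytes) -> int:
        """Buffer a record and return its LSN. Nothing is durable yet."""
        record = len(payload).to_bytes(4, "big") + payload   # length-prefixed
        lsn = self.next_lsn
        self.buffer.append(record)
        self.next_lsn += len(record)
        return lsn

    def flush(self):
        """Write out the buffer. A real system would also sync to disk here."""
        for record in self.buffer:
            self.file.write(record)
        self.buffer.clear()
        self.flushed_lsn = self.next_lsn

    def read(self, lsn: int) -> bytes:
        """Fetch a durable record directly by its LSN."""
        if lsn >= self.flushed_lsn:
            raise KeyError("record not flushed yet")
        raw = self.file.getvalue()
        size = int.from_bytes(raw[lsn:lsn + 4], "big")
        return raw[lsn + 4:lsn + 4 + size]


log = Log()
first = log.append(b"alloc page 7")
second = log.append(b"split page 7")
log.flush()
```

Because the LSN is a file offset, LSNs only ever increase, and comparing two LSNs tells you which record was logged first.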

Page 36: Relational Database Internals

• Subsystems are said to “generate log records” (calling APIs provided by log subsystem)

• Buffer pool may need to log the allocation of a new page

• Btree may need to log a page split

• Relational layer may log an INSERT statement

• Customers of this subsystem all over the database

Page 37: Relational Database Internals

• 2 approaches to logging

• “Physical logging”

• “Logical logging”

Page 38: Relational Database Internals

• Physical logging

• Log entire page images

• “redo record” : “log what the page is GOING TO look like”

• “undo record” : “log what the page LOOKS LIKE NOW”

• Problems?

• Inefficient, expensive

• Poor concurrency

Page 39: Relational Database Internals

• Problems

• Inefficiency mess

• Why log 2 copies of a page when I only changed a few bytes?

• Concurrency mess

• Systems with concurrency control at a finer granularity than the page cannot log this way. We’ll come back to that.

Page 40: Relational Database Internals

• On the other hand

• Physical logging is appealing because it is simple, and it works because of a nice property of being “testable”

• We can look at a log record ABOUT a page, then look at the page, and determine which state it’s in because we RECORDED the two possible states

• This turns out to be an essential property of recovery

Page 41: Relational Database Internals

• Logical logging

• Log the high level operations only

• SQL

• INSERT INTO TBL(A) VALUES(1)

• REDO

• “INSERT INTO TBL(A) VALUES(1)”

• UNDO

• “DELETE FROM TBL WHERE A = 1”

Page 42: Relational Database Internals

• Elegant!

• Simple!

• Compact!

• but it doesn’t WORK!

• That SQL INSERT could decompose into dozens of page writes.

• Some may have been done, then crash. You can’t look at the pages and tell which ones were done (UNDO them) and which ones weren’t

Page 43: Relational Database Internals

• NO FORCE allows us to mark a transaction “committed” WITHOUT writing all of the pages.

• Some may have been written, then crash

• We can’t tell which ones WERE NOT written (REDO them) and which ones WERE (leave them alone)

• It’s often UNSAFE to perform actions multiple times

Page 44: Relational Database Internals

• Making logical logging work - “Physiological Logging”

• “Physical ABOUT pages, logical ABOUT the contents INSIDE the page”

• The idea is to keep the logging centered on the idea of pages, which works well

• But log less information than a physical scheme would require

Page 45: Relational Database Internals

• Example Physiological Operation

• “Add item X to page N”

• Push down the logical concept into the page level - logical INSIDE the page

• SQL INSERT statement will decompose into several independent physiological operations

• Each one is INDEPENDENTLY TESTABLE / UNDOABLE / REDOABLE

• AKA, “it works”

Page 46: Relational Database Internals

• Logging for purposes of recovery

• Key technique is based on something called the “pagelsn”

• Intertwining of the buffer pool and the logger

• Each time you modify a page, store the LSN of the log record describing that modification ON THE PAGE ITSELF

• Testability

• Look at the pagelsn to determine state
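The pagelsn test can be sketched as follows, assuming a page is a list of items and a physiological record says "add item X to page N". The names `Page`, `AddItem` and `redo` are hypothetical. The point is that replaying the same record twice is harmless, because the page itself records how far it has advanced.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    items: list = field(default_factory=list)
    pagelsn: int = -1          # LSN of the last log record applied to this page


@dataclass
class AddItem:
    """Physiological record: physical about the page, logical inside it."""
    lsn: int
    pageno: int
    item: str


def redo(record, pages):
    """Apply the record only if the page has not already seen it."""
    page = pages[record.pageno]
    if page.pagelsn >= record.lsn:
        return False           # the change is already on the page; skip it
    page.items.append(record.item)
    page.pagelsn = record.lsn
    return True


pages = {5: Page()}
rec = AddItem(lsn=100, pageno=5, item="X")
applied_first = redo(rec, pages)
applied_again = redo(rec, pages)   # replaying is harmless
```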

Page 47: Relational Database Internals

• Write Ahead Logging (WAL) Protocol

• Tightly integrated with buffer pool

• Before a dirty page is written to disk, the UNDO information for that page must be durable

• Before a transaction is considered committed, the REDO information for that transaction’s pages must be durable

• And that’s how a NO FORCE / STEAL system can convert random I/O into sequential
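The two WAL rules can be sketched as below, assuming a list stands in for the log and "flushing" just advances a counter. `WalLog`, `write_page` and `commit` are hypothetical names. Note that `commit` touches only the log (sequential I/O) and never the data pages (random I/O).

```python
class WalLog:
    """Toy log: the LSN of a record is its 1-based position in the list."""

    def __init__(self):
        self.records = []
        self.flushed_lsn = 0   # highest LSN known to be durable

    def append(self, record):
        self.records.append(record)
        return len(self.records)

    def flush(self, upto_lsn):
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)


def write_page(page, log, disk):
    """Rule 1: the log describing a page must be durable before the page is."""
    if log.flushed_lsn < page["lsn"]:
        log.flush(page["lsn"])
    disk[page["no"]] = dict(page)


def commit(last_lsn, log):
    """Rule 2: commit forces the log up to the transaction's last record."""
    log.flush(last_lsn)


log, disk = WalLog(), {}
lsn = log.append("add X to page 3")
page3 = {"no": 3, "lsn": lsn, "items": ["X"]}
write_page(page3, log, disk)           # STEAL: page written before any commit

lsn2 = log.append("add Y to page 4")
commit(lsn2, log)                      # NO FORCE: page 4 is not written
```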

Page 48: Relational Database Internals

• Basic idea behind recovery after crash

• REDO all COMMITTED transactions

• Some pages MAY NOT be written

• as allowed by NO FORCE

• UNDO all UNCOMMITTED transactions

• Some pages MAY HAVE BEEN written

• as allowed by STEAL
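A minimal recovery sketch under stated assumptions: log records are plain tuples, each page is a dict carrying its own pagelsn, and no two transactions touch the same key. Rather than redoing only committed work, this sketch redoes every logged change first and then undoes the uncommitted ones (a "repeat history" approach), which reaches the same end state as the summary above while keeping the pagelsn test simple. The function `recover` and the record layout are hypothetical.

```python
def recover(pages, log):
    """pages: pageno -> {"lsn": int, "data": dict}; log: tuples in LSN order.

    Records are ("update", lsn, txn, pageno, key, old, new) or ("commit", lsn, txn).
    """
    committed = {rec[2] for rec in log if rec[0] == "commit"}

    # Redo pass: reapply any logged change the on-disk page has not seen.
    for rec in log:
        if rec[0] != "update":
            continue
        _, lsn, txn, pageno, key, old, new = rec
        page = pages[pageno]
        if page["lsn"] < lsn:
            page["data"][key] = new
            page["lsn"] = lsn

    # Undo pass: walk backwards, restoring old values written by uncommitted work.
    for rec in reversed(log):
        if rec[0] == "update" and rec[2] not in committed:
            _, lsn, txn, pageno, key, old, new = rec
            pages[pageno]["data"][key] = old
    return pages


log = [
    ("update", 1, "T1", 10, "a", 0, 1),
    ("update", 2, "T2", 11, "b", 0, 2),
    ("commit", 3, "T1"),
]
# State found on disk after the crash:
pages = {
    10: {"lsn": 0, "data": {"a": 0}},   # T1 committed, page never written (NO FORCE)
    11: {"lsn": 2, "data": {"b": 2}},   # T2 uncommitted, page was written (STEAL)
}
recover(pages, log)
```

A real system would also log the undo actions themselves so that a crash during recovery is survivable; that is left out here.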

Page 49: Relational Database Internals
Page 50: Relational Database Internals

Concurrency Control

• Let’s talk about ACID now (finally?)

• We’ll use Chris Date’s definition

Page 51: Relational Database Internals

• Atomic

• A transaction fully completes or no part of it does.

• Correct (Consistent)

• Transactions transform a database from one correct state to another, not necessarily enforcing correctness during the transition between these two states

Page 52: Relational Database Internals

• Isolated

• Transactions are isolated from each other in such a way that a transaction will be “correct” regardless of what other transactions may be simultaneously executing

• Durable

• A “committed” transaction CAN NOT be “lost” after a system failure

Page 53: Relational Database Internals

• Concurrency control intertwines with all of these concepts.

• But mostly the I in ACID

• ISOLATED is really just a layman’s shorthand for “SERIALIZABLE”

Page 54: Relational Database Internals

• Basic Serializability Theory

• A system which runs all transactions sequentially (with no concurrency) produces a “history” known as a “serial history”

• A serial history is BY DEFINITION correct

• You can’t have concurrency problems WITHOUT CONCURRENCY!

Page 55: Relational Database Internals

• A system which allows for concurrency produces histories comprised of the interleaved execution of the concurrent transactions

• If that history can be said to be EQUIVALENT to a serial history (one produced through non concurrent execution) then the concurrent system’s history is said to be SERIALIZABLE

• EQUIVALENT - “Produces the same output and has the same effect on the database”

Page 56: Relational Database Internals

• Some formal notation

• rn[x] : Transaction n reads object x

• wm[y] : Transaction m writes object y

• cl : Transaction l commits

• “conflicting operations”

• r conflicts with w

• w conflicts with r

• w conflicts with w

Page 57: Relational Database Internals

• Conflict Serializability Testing

• A history can be considered equivalent to a serial history if it holds that for all conflicting operations the ordering of the conflicts is the same

• r1[x] r1[y] w1[x] r1[z] c1 r2[x] r2[a] c2

• r1[x] r1[y] w1[x] r2[a] r2[x] r1[z] c1 c2

• r2[x] conflicts with w1[x]

• In both histories order of conflict is same
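The comparison of the two histories above can be checked mechanically. This is a sketch, assuming histories are written as space-separated strings in the notation of the previous slide, that both histories contain the same operations, and that each operation appears at most once. The helper names are hypothetical.

```python
import re

OP = re.compile(r"([rw])(\d+)\[(\w+)\]")


def conflict_pairs(history):
    """Return the set of ordered conflicting pairs (earlier_op, later_op)."""
    ops = OP.findall(history)          # commits (c1, c2) carry no conflicts
    pairs = set()
    for i, (kind_a, txn_a, obj_a) in enumerate(ops):
        for kind_b, txn_b, obj_b in ops[i + 1:]:
            different_txn = txn_a != txn_b
            same_object = obj_a == obj_b
            one_is_write = "w" in (kind_a, kind_b)
            if different_txn and same_object and one_is_write:
                pairs.add(((kind_a, txn_a, obj_a), (kind_b, txn_b, obj_b)))
    return pairs


def conflict_equivalent(h1, h2):
    """Same conflicts, in the same order, in both histories."""
    return conflict_pairs(h1) == conflict_pairs(h2)


serial = "r1[x] r1[y] w1[x] r1[z] c1 r2[x] r2[a] c2"
interleaved = "r1[x] r1[y] w1[x] r2[a] r2[x] r1[z] c1 c2"
same_order = conflict_equivalent(serial, interleaved)
```

Moving r2[x] ahead of w1[x] would reverse the only conflict, and the check would then fail.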

Page 58: Relational Database Internals

• Serializability Graph Testing

• A technique to analyze any history for serializability is the “serialization graph”

• For each pair of committed transactions, add a directed edge from T1 to T2 if any operation of T1 precedes and conflicts with an operation of T2

• If the resulting graph contains NO CYCLES then the history is serializable
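A sketch of the serialization graph test, assuming every transaction in the history has committed and that histories use the string notation from the earlier slides. The function names are hypothetical. Cycle detection is a plain depth-first search.

```python
import re

OP = re.compile(r"([rw])(\d+)\[(\w+)\]")


def serialization_graph(history):
    """Edges Ti -> Tj where an operation of Ti precedes and conflicts with one of Tj."""
    ops = OP.findall(history)
    edges = set()
    for i, (kind_a, txn_a, obj_a) in enumerate(ops):
        for kind_b, txn_b, obj_b in ops[i + 1:]:
            if txn_a != txn_b and obj_a == obj_b and "w" in (kind_a, kind_b):
                edges.add((txn_a, txn_b))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True            # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


def is_serializable(history):
    return not has_cycle(serialization_graph(history))


good = is_serializable("r1[x] w1[x] c1 r2[x] c2")
bad = is_serializable("r1[x] r2[x] w1[x] w2[x] c1 c2")
```

In the second history each transaction reads x before the other writes it, so the graph has edges in both directions and no serial order can explain it.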

Page 59: Relational Database Internals

• “Schedulers”

• Histories are said to be “produced” by the execution of events as determined by the “scheduler”

• This may or may not be a “real thing” in a real system.

• As a mental model we consider the scheduler to be a real thing whose job it is to schedule the interleaving of transactions in such a manner as to produce serializable histories

Page 60: Relational Database Internals

• “Conservative Schedulers”

• Err on the side of delaying execution (blocking) in the hopes of producing serializable histories

• Extreme case - no concurrent execution allowed!

• “Aggressive Schedulers”

• Aim to run with more concurrency with the understanding that non serializable histories may be produced and later rejected

• Extreme case – SGT based validating scheduler

Page 61: Relational Database Internals

• Locking based schedulers

• The most common real world schedulers all involve forms of locking as the basic mechanism

• Serializable histories are produced through a locking technique called 2 Phase Locking (2PL)

Page 62: Relational Database Internals

• 2PL Rules

• Acquire “read locks” on all objects read

• Acquire “write locks” on all objects written

• Only release locks at Commit

• It can be proven mathematically that all possible histories output from a 2PL scheduler are serializable

• It’s not that hard to convince yourself of this intuitively without the math
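The 2PL rules can be sketched with a toy lock table. As a simplifying assumption, `acquire_read` and `acquire_write` report a conflict by returning False instead of blocking, and there is no deadlock handling. The class `LockManager` is hypothetical.

```python
class LockManager:
    """Toy 2PL lock table. acquire_*() reports a conflict instead of blocking."""

    def __init__(self):
        self.readers = {}   # object -> set of transactions holding read locks
        self.writer = {}    # object -> transaction holding the write lock

    def acquire_read(self, txn, obj):
        holder = self.writer.get(obj)
        if holder is not None and holder != txn:
            return False                      # would block behind a writer
        self.readers.setdefault(obj, set()).add(txn)
        return True

    def acquire_write(self, txn, obj):
        holder = self.writer.get(obj)
        if holder is not None and holder != txn:
            return False                      # would block behind a writer
        if self.readers.get(obj, set()) - {txn}:
            return False                      # would block behind other readers
        self.writer[obj] = txn
        return True

    def commit(self, txn):
        """Release every lock the transaction holds, all at once."""
        for holders in self.readers.values():
            holders.discard(txn)
        for obj in [o for o, t in self.writer.items() if t == txn]:
            del self.writer[obj]


locks = LockManager()
results = [
    locks.acquire_read("T1", "x"),    # granted
    locks.acquire_read("T2", "x"),    # granted: readers share
    locks.acquire_write("T2", "x"),   # refused: T1 still reads x
]
locks.commit("T1")
results.append(locks.acquire_write("T2", "x"))   # granted after T1 commits
```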

Page 63: Relational Database Internals

• 2PL drawbacks

• 2PL can be overly conservative in many cases, delaying concurrency needlessly when serializability would not have been compromised

• 2PL suffers from deadlocks as it allows for arbitrary interleaving of concurrent blocking operations in no defined order

Page 64: Relational Database Internals

• Serialization Graph Testing (SGT) Schedulers

• At commit time build a serialization graph and detect cycles.

• No real world system works this way

• Just too computationally expensive

• (fancy term for “slow”)

Page 65: Relational Database Internals

• Optimistic Concurrency Control (OCC) Schedulers

• Track “read sets” and “write sets” of all transactions

• At commit, ensure that no conflict between these sets has occurred.

• Make sure no transaction that started after your BEGIN has any overlap in its write set with your read set
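A sketch of commit-time validation, assuming a logical clock and tracking of read and write sets as Python sets. It checks against transactions that COMMITTED after your BEGIN, which is one common reading of the rule above and is an assumption of this sketch. The class `OCC` is hypothetical.

```python
class OCC:
    """Toy backward-validating optimistic scheduler using a logical clock."""

    def __init__(self):
        self.clock = 0
        self.committed = []    # list of (commit_time, write_set)

    def begin(self):
        self.clock += 1
        return {"start": self.clock, "reads": set(), "writes": set()}

    def commit(self, txn):
        """Validate, then commit. Returns False if the transaction is rejected."""
        for commit_time, write_set in self.committed:
            if commit_time > txn["start"] and write_set & txn["reads"]:
                return False   # someone overwrote what we read; reject
        self.clock += 1
        self.committed.append((self.clock, frozenset(txn["writes"])))
        return True


occ = OCC()
t1, t2 = occ.begin(), occ.begin()
t1["reads"].add("x")
t2["writes"].add("x")
first_ok = occ.commit(t2)    # nothing to conflict with yet
second_ok = occ.commit(t1)   # rejected: t2 wrote x after t1 began
```

Nothing ever blocks here, so there are no deadlocks. The cost is the rejection: t1 has to be retried from the start.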

Page 66: Relational Database Internals

• OCC Problems

• Trade the deadlock problem of 2PL for the “rejection” problem of OCC

• Can be very difficult to efficiently track conflicts.

• Difficult to allow high concurrency - “giant lock” around “validate” and “commit” phases

• No real world system implements a 100% pure OCC scheduler

Page 67: Relational Database Internals

• Predicate based Concurrency Control

• SQL: “UPDATE X WHERE Y>5 AND Y<10”

• Don’t lock all the rows between 5 and 10

• instead lock the SINGLE PREDICATE of

• “5<y<10”

• Need not be a “lock” - Compatible with OCC/Validating techniques as well

Page 68: Relational Database Internals

• Problems with predicates

• Gets very complicated very fast to support arbitrarily complex predicates

• Gets really really complicated to detect compatibility/conflicts between arbitrary predicates - much worse than the basic OCC problem

• But basic “degenerate” predicates have been used in real systems. In some systems our example would have been a “range lock”

Page 69: Relational Database Internals

• Less than serializable

• Many real world systems either do not fully implement serializability or offer optional (typically default) isolation levels that are WEAKER than serializable

• This is almost ALWAYS done for reasons of performance

• One very successful model of reduced isolation in real systems is known as “Snapshot Isolation”

Page 70: Relational Database Internals

• Snapshot Isolation (SI)

• An SI scheduler is frequently implemented as a Multi Version Concurrency Control (MVCC) system

• MVCC permits the notion of “versions” of objects

• The notation r1[x] w2[y] is extended to r1[x2] w2[y4]

• Transaction 1 reads version 2 of x

• Transaction 2 writes version 4 of y

Page 71: Relational Database Internals

• SI defines 2 rules for an MVCC system to follow

• Each version of an object x that is READ BY transaction T is the most recently committed version of x as of the BEGIN of T

• 2 Transactions that overlap in BEGIN and COMMIT time do not write to the SAME OBJECT

Page 72: Relational Database Internals

• Problems with SI schedulers

• SI can generate histories that are not serializable

• The main issue is referred to as “write skew” - Idea is it becomes visible sometimes that there is a “skew” in time - as your reads and your writes appear to execute at different points in time

Page 73: Relational Database Internals

• Write Skew

• Simple example - Imagine trying to enforce an integrity constraint inside the application (the db doesn’t know of this constraint)

• In a 2PL system, it’s easy - read all your conditions before committing

• In an SI system that doesn’t work
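Write skew can be reproduced with a toy MVCC store that enforces the two SI rules. This is a sketch assuming a logical clock for commit timestamps. The class `SnapshotStore` and the "alice + bob >= 1" rule are made up for illustration. The application-level rule is checked by each transaction against its own snapshot, and both commits succeed because their write sets do not overlap.

```python
class SnapshotStore:
    """Toy MVCC store enforcing the two SI rules with a logical clock."""

    def __init__(self, initial):
        self.clock = 0
        # key -> list of (commit_timestamp, value), oldest first
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return {"start": self.clock, "writes": {}}

    def read(self, txn, key):
        # Rule 1: see the latest version committed as of BEGIN (plus own writes).
        if key in txn["writes"]:
            return txn["writes"][key]
        visible = [v for ts, v in self.versions[key] if ts <= txn["start"]]
        return visible[-1]

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # Rule 2: reject if an overlapping transaction already wrote the same key.
        for key in txn["writes"]:
            if self.versions[key][-1][0] > txn["start"]:
                return False
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions[key].append((self.clock, value))
        return True


db = SnapshotStore({"alice": 1, "bob": 1})   # application rule: alice + bob >= 1
t1, t2 = db.begin(), db.begin()

# Each transaction checks the rule against its own snapshot, then writes.
if db.read(t1, "alice") + db.read(t1, "bob") >= 2:
    db.write(t1, "alice", 0)
if db.read(t2, "alice") + db.read(t2, "bob") >= 2:
    db.write(t2, "bob", 0)

ok1, ok2 = db.commit(t1), db.commit(t2)   # disjoint write sets: both pass rule 2
final = db.begin()
total = db.read(final, "alice") + db.read(final, "bob")   # the rule is now broken
```

Under 2PL the read locks taken during the check would have made one of the two writers wait, and it would then have seen the other's update.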

Page 74: Relational Database Internals

• SI anomaly of write skew can be worked around in the application if the DBMS provides explicit “locking” primitives that can be used.

• In previous example, the application would be responsible for “locking” the items read to ensure serializability

• Oracle: SELECT FOR UPDATE

• Comdb2: SELECTV

Page 75: Relational Database Internals

• SI is often “good enough” and can provide much greater concurrency in many cases than 2PL.

• The SI anomalies are not recognized by ANSI SQL.

• So strangely, according to ANSI SQL, an SI system actually IS serializable. (it isn’t)

• Dirty Read, Non Repeatable Read, Phantom

Page 76: Relational Database Internals

•An SI scheduler can be implemented as a type of aggressive, validating scheduler.

•It retains some aspects of OCC, deferring the validation of w-w conflicts (rule 2, no overlap in writes) until commit time

•An SI scheduler can also be built from a conservative locking scheduler

•Write locks can be acquired to enforce second rule of SI, ensuring blocking or deadlock for non compliant histories

Page 77: Relational Database Internals
Page 78: Relational Database Internals

Btrees

• The workhorse data structure of a Relational Database System

• Most common choice for implementing an index. Sometimes a choice for storing data too.

Page 79: Relational Database Internals

• Key Idea

• Like a balanced binary tree, but allowing more than one item on a “node” and more than 2 children per node

• A node becomes a PAGE - out of practical necessity

• Buffer pool wants pages

• Logging, recovery wants pages

• Concurrency control wants pages
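A lookup in such a tree can be sketched as below. It assumes a layout where interior pages hold only separator keys and the items live in the leaves, with the convention that keys greater than or equal to separator i live in child i + 1. The class `Node` and function `search` are hypothetical. Each loop iteration corresponds to fetching one page from the buffer pool.

```python
from bisect import bisect_left, bisect_right


class Node:
    """One Btree page: sorted keys, plus child pages unless it is a leaf."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # interior: len(children) == len(keys) + 1

    @property
    def is_leaf(self):
        return not self.children


def search(node, key):
    """Descend from the root to a leaf, one page per level."""
    pages_visited = 1
    while not node.is_leaf:
        # Assumed separator convention: keys >= keys[i] live in children[i + 1].
        node = node.children[bisect_right(node.keys, key)]
        pages_visited += 1
    i = bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    return found, pages_visited


root = Node([10, 20], [Node([1, 5, 7]), Node([10, 12, 18]), Node([20, 25, 30])])
hit = search(root, 12)
miss = search(root, 8)
```

With hundreds of keys per page instead of three, even a very large index is only a handful of levels deep, which is why so few page reads are needed per lookup.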

Page 80: Relational Database Internals

• A page in Btree maps 1 to 1 into a page in the buffer pool

• Which maps (somehow) into a block on a disk

• Buffer pool could overlay on disk

• Typically it overlays on filesystem

• Filesystem often further abstracted from disk

• Hardware RAID, etc

Page 81: Relational Database Internals

• Logging is often physiological

• “Add item X to Btree” (operation) can generate log record of

• “Insert item X into the array on page 2”

• Forms of logical logging are often used for internal data structure maintenance

• A “page split” may be a logged event

Page 82: Relational Database Internals

• Key insight into Btree recovery

• If 2 Btrees ACT the same, then they ARE the same

• Recovery NEED NOT create a bit for bit perfect copy of the original data structure, only one that is indistinguishable from the original over all operations defined to be supported by the Btree

Page 83: Relational Database Internals

• Concurrency control WITHIN the Btree is typically based on a complex locking protocol with the goal of allowing maximum concurrency (reads and writes) to distinct pages in parallel

• Much like concurrency control in general, many “exotic” non locking variants exist - few if any are really used

Page 84: Relational Database Internals

• Simplified locking

• Always access tree from “top” (parent) to “bottom” (leaf)

• Always hold lock on item ABOVE before attempting access to item BELOW

• release lock on item ABOVE when you know it’s “safe” (you won’t be going “up”)

Page 85: Relational Database Internals

• Page level granularity for concurrency in the Btree structure

• Real systems often provide transactions with finer (ROW) granularity for concurrency than pages

• Key insight

• A form of logical logging

Page 86: Relational Database Internals

• Call the low level (page oriented) work the “physical” level and the high level (descriptive) work the “logical level”

• A logical operation to a Btree could be “insert X” while physical (physiological) is

• “Add X to page 43” or even something like

• “split page 43 into pages 43 and 54, update page 532 (parent) to see the new sibling, update page 87 (to the right of 54) to point left to 54, add X to page 54”

Page 87: Relational Database Internals

• Use logical logging on the Btree for undo

• A tree need not be the same if it can ACT the same!

• In our previous example, we CAN’T physically undo. If we released our page locks BEFORE transaction commit (which we HAVE TO if we want better than page granularity) then another committed transaction could have put data into newly created page 54

Page 88: Relational Database Internals

• A physical undo would remove page 54.

• It would cause the loss of data from a subsequent COMMITTED transaction!

• We need to logically undo - leave the tree structure alone

• Remove X from 54 is all we need to do.

Page 89: Relational Database Internals

• Modified 2PL protocol for row level concurrency and serializability

• When reading row X obtain read lock on row

• When writing row Y, obtain write locks on pages modified by row write, obtain row lock on row Y, release page locks on pages modified by row write

• Row locks follow 2PL protocol, always held until commit

• Page locks are released early

Page 90: Relational Database Internals
Page 91: Relational Database Internals

Relational Layer

• Relational Algebra defines 8 primitive operations

Page 92: Relational Database Internals

• RESTRICT: Choose rows

• PROJECT: Choose columns

• PRODUCT: Multiply 2 sets of rows - every row of one combined with every row of the other

• UNION: Add 2 sets of rows

• INTERSECT: Produce set of rows in common between 2 sets

• DIFFERENCE: Remove the commonality of set of rows between set 1 and 2 from set 1

Page 93: Relational Database Internals

• NATURAL JOIN: Produce a set of rows based on common values of a column

• DIVIDE: Opposite of PRODUCT
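Several of these operators map directly onto set operations. The sketch below assumes a relation is a Python set of rows and each row is a frozenset of (column, value) pairs. The helpers `rel`, `product` and `natural_join` are hypothetical, and the sample data is made up. UNION, INTERSECT and DIFFERENCE are then just `|`, `&` and `-` on the sets.

```python
def rel(*rows):
    """Build a relation: a set of rows, each a frozenset of (column, value) pairs."""
    return {frozenset(row.items()) for row in rows}


def product(a, b):
    """Every row of a paired with every row of b (column names must not clash)."""
    return {ra | rb for ra in a for rb in b}


def natural_join(a, b):
    """Pair rows that agree on every column name the two relations share."""
    joined = set()
    for ra in a:
        for rb in b:
            da, db = dict(ra), dict(rb)
            if all(da[c] == db[c] for c in da.keys() & db.keys()):
                joined.add(ra | rb)
    return joined


emp = rel({"name": "Ann", "dept": 1}, {"name": "Bob", "dept": 2})
dept = rel({"dept": 1, "title": "Eng"})
colors = rel({"color": "red"}, {"color": "blue"})

joined = natural_join(emp, dept)           # only Ann has a matching dept
pairs = product(emp, colors)               # 2 rows x 2 rows
both = emp | rel({"name": "Cy", "dept": 1})   # UNION
only_bob = emp - rel({"name": "Ann", "dept": 1})   # DIFFERENCE
```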

Page 94: Relational Database Internals

• Relational Algebra is a procedural way of expressing a problem

• Lay out the “steps” in terms of the “operators”

• Not procedural in terms of “implementation”

• The implementation of each operator is a procedural operation.

• The algebra has specific rules (in terms of what is commutative, etc) which can be used for simplification of expressions

Page 95: Relational Database Internals

• Relational Languages

• In practice, nobody is writing any math to run a query! No relational algebra, no relational calculus

• SQL is the dominant (only?!) Relational Language. Others have existed.

• Informix: “Informer”

• Ingres: QUEL / PostgreSQL: POSTQUEL

• System R: SEQUEL

Page 96: Relational Database Internals

• The purpose of a relational language is to expose enough power to allow one to express anything that would be possible in the relational algebra.

• Codd termed this to be “relationally complete”

• SQL is a relationally complete language

• Inspired by parts of the calculus, parts of the algebra, and a desire to be “english like” rather than “mathematical”

Page 97: Relational Database Internals

• SQL is a “compiled” programming language

• The database parses SQL then compiles it into an intermediary form for execution

• Conceptually, this intermediary form can be thought of as relational algebra

• This compiled form is often referred to as the “query plan”

Page 98: Relational Database Internals

•SELECT * from users WHERE uuid=123;

•σ uuid=123 (users)

•SELECT name, age FROM users WHERE numchildren > 2 and numcars > 3;

•π name, age (σ numchildren >2 and numcars > 3 (users) )
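The two algebra expressions above can be evaluated directly. This is a sketch assuming a table is a list of dicts; `restrict` and `project` are hypothetical helpers and the rows in `users` are made-up sample data. The nesting of the function calls mirrors the nesting of π around σ.

```python
def restrict(rows, predicate):
    """σ: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """π: keep only the named columns, dropping duplicate rows as a relation would."""
    seen, result = set(), []
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(columns, values)))
    return result


users = [
    {"uuid": 123, "name": "Ann", "age": 41, "numchildren": 3, "numcars": 4},
    {"uuid": 456, "name": "Bob", "age": 35, "numchildren": 1, "numcars": 5},
]

# σ uuid=123 (users)
by_uuid = restrict(users, lambda r: r["uuid"] == 123)

# π name, age (σ numchildren > 2 and numcars > 3 (users))
names = project(
    restrict(users, lambda r: r["numchildren"] > 2 and r["numcars"] > 3),
    ["name", "age"],
)
```

One difference from SQL: a plain SELECT keeps duplicate rows unless DISTINCT is given, while the algebra's PROJECT works on sets.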

Page 99: Relational Database Internals

• Producing a query plan is HARD WORK!

• It’s the job of a component called the “query planner”

• A single query can be represented by an infinite number of query plans

• Most are absurd and would never be generated by anything other than a defective or malicious planner

• Some are MUCH LESS WRONG

• But only 1 is THE BEST (for this input!)

Page 100: Relational Database Internals

• The job of the planner is to quickly prune down the search space of plans to ones that might have a chance at being good, then quickly evaluate the “goodness” of the remaining choices

• Quick - This is an overwhelming source of tension in the planner - quick vs correct

• If the system took 1 minute to generate a plan to run your query in 1 second or 1 second to get a plan to run your query in 5 seconds, which would you choose?

Page 101: Relational Database Internals

• 2 Main approaches

• “Rules” based optimization

• Follow specific mechanical rules about the way the SQL was written to produce a plan

• “Cost” based optimization

• Use heuristics to evaluate multiple plans, looking for the one with the lowest “cost”

Page 102: Relational Database Internals

• Most real world systems today are cost based, with some cases of using rules

• It may be advantageous to employ boolean algebra to rewrite expressions containing ANDs and NOTs to contain ORs if your system allows for OR to be implemented with multiple indexes (and not AND)

• Called a “query rewrite rule”

• Followed mechanically because it is considered to be “always good”

Page 103: Relational Database Internals

• Many of the early systems were purely rules based

• A “bad rule” - The order that the tables are listed in should be the order (inner/outer) of the tables in the nested loop of a JOIN

• Exactly what the rules based systems did for years

• (including the first version of Comdb2)

Page 104: Relational Database Internals

• Cost based optimization is based on the concept of “statistics.”

• The database keeps internal statistics about the CONTENTS of the data

• SELECT * from tbl where X=5

• Table scan on tbl, filtering on X=5

• Index lookup on X=5

• Which is better? It depends

Page 105: Relational Database Internals

• In most real systems, a table scan becomes faster than an index scan once a “break even point” is reached - when more than some percentage of the rows must be visited

• The system needs to “know” which % of the rows in the table are likely to contain X=5

• Only with that information can it choose the fastest plan. (for this input and this data!)
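
A rough sketch of the break even decision, under stated assumptions: the per-row costs below are made-up numbers (an index lookup is assumed to cost 4x a sequential row read), and `choose_access_path` is a hypothetical name. Real planners use far more detailed cost models.

```python
def choose_access_path(table_rows, matching_fraction,
                       scan_cost_per_row=1.0, index_cost_per_row=4.0):
    """Pick the cheaper access path for a predicate like X=5.

    matching_fraction is the estimated share of rows where the predicate
    holds (this is what the statistics are for). The two cost constants
    are illustrative assumptions, not measured values.
    """
    scan_cost = table_rows * scan_cost_per_row
    index_cost = table_rows * matching_fraction * index_cost_per_row
    return "index lookup" if index_cost < scan_cost else "table scan"
```

With these assumed constants the break even point sits at 25% of the rows: below it the index wins, above it the table scan wins.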

Page 106: Relational Database Internals

• SELECT * from tbl where X=5 and Y=6

• Use index on X=5 and filter for Y=6

• Use index on Y=6 and filter for X=5

• SELECT * from tbl1, tbl2 where tbl1.a=tbl2.a

• For every row in tbl1 look into tbl2 with an index to find corresponding a

• For every row in tbl2 look into tbl1 with an index to find corresponding a
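
A small sketch of how a cost-based planner might compare the two join orders above. The cost formula (one row read per outer row plus one B-tree probe of roughly log2 of the inner table size) and the function names are assumptions for illustration only.

```python
import math

def nested_loop_cost(outer_rows, inner_rows):
    """Assumed cost: read each outer row, then probe the inner index once."""
    probe_cost = math.log2(max(inner_rows, 2))
    return outer_rows * (1 + probe_cost)

def pick_join_order(row_counts):
    """row_counts maps exactly two table names to their row counts.

    Returns (outer_table, inner_table) for the cheaper nested loop.
    """
    (a, rows_a), (b, rows_b) = row_counts.items()
    if nested_loop_cost(rows_a, rows_b) <= nested_loop_cost(rows_b, rows_a):
        return (a, b)
    return (b, a)
```

Under this model the smaller table ends up as the outer loop, regardless of the order in which the tables were listed in the SQL.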

Page 107: Relational Database Internals

• Real systems gather all sorts of statistics about the data which all feed into the query planner

• Size of table, Size of indexes

• Selectivity of indexes

• Distribution of values in indexes

• Sampling of commonly occurring values

• And more. An open field, still filled with trade secrets.
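
One way such statistics could be gathered and used, as a hedged sketch: row count, distinct count, and a short list of commonly occurring values, with a uniform-distribution assumption for everything else. The names `build_stats` and `estimate_selectivity` and the dictionary layout are hypothetical.

```python
from collections import Counter

def build_stats(values, top_n=3):
    """Collect simple column statistics from the column's values."""
    counts = Counter(values)
    return {
        "rows": len(values),
        "distinct": len(counts),
        "common": dict(counts.most_common(top_n)),  # value -> occurrences
    }

def estimate_selectivity(stats, value):
    """Estimate the fraction of rows where column = value."""
    if value in stats["common"]:
        return stats["common"][value] / stats["rows"]
    # Assumption: values outside the common list are uniformly distributed.
    rest_rows = stats["rows"] - sum(stats["common"].values())
    rest_distinct = stats["distinct"] - len(stats["common"])
    if rest_distinct <= 0:
        return 0.0
    return (rest_rows / rest_distinct) / stats["rows"]
```

The estimate for X=5 is what the planner would feed into its cost comparison between a table scan and an index lookup.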

Page 108: Relational Database Internals

• Running a query

• The query planner ultimately generates a “program” in the form of some internal intermediary representation of the procedural execution of the query which is handed to the “query executor” for execution.

• The query executor is a “customer” of all the subsystems

Page 109: Relational Database Internals

• Query executor

• Uses Btrees for access to indexes

• Uses concurrency control to support the SQL notion of a transaction

• Uses logging to make modifications Durable and Atomic

• Uses the buffer pool to retrieve items from disk

Page 110: Relational Database Internals
Page 111: Relational Database Internals

Real world Systems

• DB2

• Oracle

• Postgres

• Comdb2

Page 112: Relational Database Internals

DB2

• Provides serializable isolation

• Uses a 2PL locking protocol

• Complex Btree locking techniques

• Next-key locking

• Key/Value locking

• Key-range locking

Page 113: Relational Database Internals

• NO FORCE / STEAL buffer policy

• System R was originally FORCE / NO STEAL

• System R didn’t even log

• Over time, it became clear that logging + no force / steal is the key to high performance systems

Page 114: Relational Database Internals

• UNDO and REDO logging

• Sophisticated cost based query planner

• Cost based query planning was invented in the System R project, described by Selinger in a paper published in 1979

• Oracle sold a Rules based planner until 1992

Page 115: Relational Database Internals

• Row level locking

• System R was using Row level locking in the late 70s, early 80s.

• DB2 gained row locking in 1995 (mainframe only; it took even longer to reach UNIX)

• Oracle gained row locking in 1988

Page 116: Relational Database Internals

Oracle

• Oracle is its own system - Shares nothing at all with System R

• Many interesting approaches to solving the same problems were developed

Page 117: Relational Database Internals

• Provides Snapshot Isolation

• Does not use 2PL, instead uses a form of MVCC

• The buffer pool itself in Oracle is versioned

• Objects (rows) are not versioned per se

• The pages they exist on are

Page 118: Relational Database Internals

• When an update occurs, pages are modified IN PLACE.

• When a read needs to see an earlier version of a page, the UNDO logs are consulted to recreate a prior version of this page and place it into the buffer pool

• The algorithm is roughly based on usage of the “pagelsn” (Different terminology in Oracle, same rough concept)

Page 119: Relational Database Internals

• When you start a transaction (snapshot) record the current LSN as your “birthlsn”

• If you are looking at a page, and the page has a pagelsn LESS THAN the birthlsn of your transaction, then you know you are meant to see everything on that page

• Else, use UNDO log records to reconstruct a version of that page that now has a pagelsn LESS THAN your birthlsn

• Place new(old) page in buffer pool, proceed
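
The steps above can be sketched as a toy model. This is not Oracle's implementation: it assumes a single page, treats every update as its own already-committed transaction, and uses hypothetical names (`Store`, `read_page`). `None` stands for "key did not exist before".

```python
import copy

class Page:
    def __init__(self):
        self.lsn = 0      # pagelsn: LSN of the last change applied to the page
        self.rows = {}

class Store:
    def __init__(self):
        self.page = Page()
        self.next_lsn = 1
        self.undo = {}    # lsn -> (key, old value, pagelsn before the change)

    def begin_snapshot(self):
        return self.next_lsn          # the transaction's birthlsn

    def update(self, key, value):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.undo[lsn] = (key, self.page.rows.get(key), self.page.lsn)
        self.page.rows[key] = value   # page is modified IN PLACE
        self.page.lsn = lsn

    def read_page(self, birth_lsn):
        if self.page.lsn < birth_lsn:
            return self.page          # everything on the page is visible
        old = copy.deepcopy(self.page)
        while old.lsn >= birth_lsn:   # roll back changes made after BEGIN
            key, old_value, prev_lsn = self.undo[old.lsn]
            if old_value is None:
                old.rows.pop(key, None)
            else:
                old.rows[key] = old_value
            old.lsn = prev_lsn
        return old                    # the new(old) version of the page
```

A reader that began before later updates gets a reconstructed copy, while the current page stays untouched for everyone else.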

Page 120: Relational Database Internals

• UNDO and REDO logging

• NO FORCE / STEAL buffer management

• UNDO and REDO logs are physically “split” into 2 distinct data structures

• The REDO logs act like a conventional “log” file in Oracle

• The UNDO logs have a much more complex organization for performance reasons due to the unique requirements Oracle places on UNDO for MVCC

Page 121: Relational Database Internals

• Oracle’s locking protocol is relatively simple

• MVCC takes care of many of the issues that DB2 solves with locking

• No long term read locking ever, not even on rows.

• “first rule of SI” enforced through MVCC policy of producing most recently committed data as of BEGIN

• Long term write locks taken on modified rows

• Used to enforce “second rule of SI”

Page 122: Relational Database Internals

PostgreSQL

• “Second system” developed after Ingres

• Not based on Ingres, at the time meant as a proving ground for “new ideas”

Page 123: Relational Database Internals

• Key idea

• NO OVERWRITE

• At the row level. The buffer pool will overwrite pages.

• Old versions of rows don’t disappear after an update, they simply become “older versions” of that row

• Used to implement an SI isolation model on top of a row based MVCC system
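
A toy illustration of the no overwrite idea, under heavy simplifying assumptions: no aborts, no write-write conflict handling, no cleanup of old versions, and hypothetical field names (`created_by`, `deleted_by`). PostgreSQL's real visibility rules are considerably more involved.

```python
class Table:
    def __init__(self):
        self.versions = []      # every row version ever written; none overwritten
        self.committed = set()  # ids of committed transactions
        self.next_txid = 1

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        # The snapshot is the set of transactions committed as of BEGIN.
        return txid, frozenset(self.committed)

    def commit(self, txid):
        self.committed.add(txid)

    def write(self, txid, key, value):
        for v in self.versions:
            if v["key"] == key and v["deleted_by"] is None:
                v["deleted_by"] = txid  # old version is superseded, not erased
        self.versions.append({"key": key, "value": value,
                              "created_by": txid, "deleted_by": None})

    def read(self, txid, snapshot, key):
        def seen(t):
            return t == txid or t in snapshot
        for v in self.versions:
            superseded = v["deleted_by"] is not None and seen(v["deleted_by"])
            if v["key"] == key and seen(v["created_by"]) and not superseded:
                return v["value"]
        return None
```

A reader whose snapshot predates an update keeps finding the older version of the row, because that version is still physically present.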

Page 124: Relational Database Internals

• NO FORCE / STEAL

• REDO logging

• No UNDO logging!

• Able to get away with this because of the “no overwrite” nature of updates!

• Earlier versions of PostgreSQL attempted to run without logging. They used a FORCE policy

• Eventually came to the same conclusions as everyone: LOG + NO FORCE + STEAL

Page 125: Relational Database Internals

Comdb2

• “Second system” developed after Comdbg

• Attempt to produce a relational system maintaining some level of compatibility with earlier pre relational systems.

Page 126: Relational Database Internals

• Provides Snapshot Isolation

• Rows are versioned

• Undo logs are used to reconstruct rows (not pages)

• Does not use 2PL

• Uses a form of OCC

• Aggressive, Validating scheduler

• Attempt to run transactions concurrently under hopes that work to backout and retry will be minimal
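
A generic sketch of optimistic concurrency control with commit-time validation. This is not Comdb2's actual scheduler; the class `OCCStore`, the per-key version counters, and the single-threaded commit are all assumptions made for illustration.

```python
class OCCStore:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def begin(self):
        # Reads remember the version seen; writes are buffered until commit.
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        value, version = self.data.get(key, (None, 0))
        txn["reads"].setdefault(key, version)  # keep the first version seen
        return value

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # Validate: everything this transaction read must be unchanged.
        for key, seen_version in txn["reads"].items():
            if self.data.get(key, (None, 0))[1] != seen_version:
                return False  # caller backs out and retries
        for key, value in txn["writes"].items():
            version = self.data.get(key, (None, 0))[1]
            self.data[key] = (value, version + 1)
        return True
```

No locks are held while the transaction runs. The bet is that conflicts are rare, so the cost of an occasional failed validation and retry stays small.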

Page 127: Relational Database Internals
Page 128: Relational Database Internals

Future

• The RDBMS will continue to evolve as our hardware continues to change

Page 129: Relational Database Internals

• X86 is dominant - overperforming, underpriced!

• Support that platform and support it well

• Memory is becoming cheap and huge

• Assumptions about what is reasonable to keep in memory and what is on disk are changing

• Networks are reaching latency levels comparable to SMP interconnects

• Distributed systems are more realistic now

Page 130: Relational Database Internals

• Conversely, HIGH LATENCY, low availability (“the internet”) networks are becoming another reality that must be acknowledged

• Research on relaxed isolation levels that scale across these types of environments will continue - Last word far from said there

• Generally speaking, the highly available, distributed systems will be the most able to adapt and survive

Page 131: Relational Database Internals

• The idea of a “disk” is changing

• SSD challenges many assumptions about “sequential” vs “random” access

• At best, “tuning” may be needed for some RDBMS

• At worst, a “rewrite” may be in order

Page 132: Relational Database Internals

• SSD challenges the notion of the OVERWRITE buffer policy being hands down superior.

• SSD at heart (under the hood) IS a NO OVERWRITE system. It’s easy to imagine a NO OVERWRITE buffer pool manager plugged DIRECTLY into SSD, bypassing file system abstractions

Page 133: Relational Database Internals

• The best ideas from the “post relational” (no sql) camp will converge with the ideas from the RDBMS producing best of breed systems

• Ease of scaling across commodity hardware

• High availability DESPITE unreliable hardware

Page 134: Relational Database Internals

• The two unstoppable ideas from the Relational Systems will continue to be the reason why these systems will dominate

• Data Abstraction

• Declarative languages

Page 135: Relational Database Internals