Database Techniek Martin Kersten Peter Boncz CWI.
Database Techniek
Martin Kersten, Peter Boncz
CWI
©Silberschatz, Korth and Sudarshan4.2Database System Concepts
Outline
Introduction & Course Organization
Recap of Introductory Database Course
  SQL
  Relational Algebra (X100 flavor)
Storage and File Structures
Why a DBMS?
Main Advantages
Centralization (at least conceptually)
Data Independence (physical changes don’t break legacy apps)
Declarative Data Integrity Constraints
Atomic actions (DBMS recovers consistently from system crash)
Consistency under Multi-User Concurrent Updates
Declarative & Powerful Query Language, Automatically Optimized
Multi-user security
The DBMS is now the basic building block of all information systems
Almost everybody in IT works with a DBMS on a daily basis
Goal
Gain insight into the implementation techniques inside a relational DBMS
Grading: Grade = (2*exam + practicum)/3
exam >= 6, practicum >= 6
Literature: A. Silberschatz et al., 'Database System Concepts', 4th ed., McGraw-Hill, 2002
http://www.cwi.nl/~manegold/teaching/DBtech/
Lectures
No.  Date    Topic                               Material                      Lecturer
1    Feb 8   SQL + X100 Alg, Storage + B-Trees   Ch. 4 + X100 doc, Ch. 11-12   Kersten/Boncz
2    Feb 15  Query Processing                    Ch. 13                        Boncz
     Feb 22  Query Optimization                  Ch. 14                        Boncz
3    Mar 1   Transactions                        Ch. 15-17                     Kersten
4    Mar 8   MonetDB/SQL                                                       Kersten/Nes
5    Mar 15  MonetDB/XQuery                                                    Kersten/Boncz
Exam: last week of March
Practicum
Assignment 0:
• Hands-on experience with relational DBMSs & SQL
Assignment 1:
• Translating SQL to X100 algebra ("by hand")
Assignment 2: (choose one of)
a) Building logical cost functions for X100 algebra operations ("by hand")
b) Analyse and explain the behaviour of a query optimizer
Supervisor: Marc Makkes ([email protected])
Hard deadlines (first: Saturday, February 17, 2007, 23:59:59 CET! )
Work in couples
SQL re-cap: Basic Structure
A typical SQL query has the form:
    select A1, A2, ..., An
    from r1, r2, ..., rm
    where P
  Ai represent attributes
  ri represent relations
  P is a predicate.
This query is equivalent to the relational algebra expression:
    project_{A1, A2, ..., An}(select_P(r1 join_true r2 join_true ... join_true rm))
The result of an SQL query is again a relation.
SQL relations may have duplicates
  Use select distinct to get a set
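The equivalence between the select-from-where form and project(select(cross product)) can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper `sql_query` and the toy relations `r1` and `r2` are hypothetical and not part of the course material, and rows are modelled as dicts with disjoint attribute names.

```python
from itertools import product

def sql_query(relations, predicate, attributes, distinct=False):
    """Evaluate: select <attributes> from <relations> where <predicate>.

    relations is a list of bags (Python lists) of dict rows.
    """
    result = []
    for combo in product(*relations):      # r1 join_true r2 ... = cross product
        row = {}
        for part in combo:
            row.update(part)
        if predicate(row):                 # select_P
            result.append(tuple(row[a] for a in attributes))  # project
    if distinct:                           # 'select distinct' removes duplicates
        result = list(dict.fromkeys(result))
    return result

# Hypothetical toy relations, not taken from the slides.
r1 = [{"a": 1, "b": "x"}, {"a": 2, "b": "x"}]
r2 = [{"c": 1}, {"c": 2}]

bag = sql_query([r1, r2], lambda t: t["a"] == t["c"], ["b"])
unique = sql_query([r1, r2], lambda t: t["a"] == t["c"], ["b"], distinct=True)
print(bag)     # duplicates are kept
print(unique)  # duplicates are removed
```

Without `distinct`, the projected value "x" appears once per qualifying combination, which is the bag behaviour the slide refers to.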
Relational algebra
Query processing pipeline:
SQL → (parsing, normalization) → logical algebra → (logical query optimization, physical query optimization) → physical algebra → query execution
The Practicum
SQL → (parsing, normalization) → X100 algebra → (logical query optimization, physical query optimization) → physical algebra → X100 system
X100 relational algebra
MonetDB/X100 is a CWI research project
http://www.cwi.nl/~boncz/x100.html
High-performance experimental DBMS for e.g.
  Data warehousing
  Data mining
  Information Retrieval
  Video databases (retrieval by content)
Research goal:
  study the interaction between modern hardware and database internals
  High-performance algorithms, compression
  E.g. exploit CPU caches, Multi-Processors, MEMS
X100 relational algebra (Cont.)
X100 has a relational algebra interface
Table ::= table(Identifier)
        | select(Table, Expr<bool>)
        | project(Table, [ Expr<T> ])
        | join(Table, Table, Expr<bool>)
        | aggr(Table, [ Expr<T> ], [ AggrFcn<T> ])
        | order(Table, [ Expr<T> ])
        | topn(Table, [ Expr<T> ], Expr<int>)
Identifier = Table
select(Table, Expr<bool>)
• Relation r (columns A, B, C, D; the A and B values did not survive extraction):
C   D
1   7
5   7
12  3
23  10
• select (r, and( ==(A,B), >(D, int('5')) ) )
C   D
1   7
23  10
Functional, C-like notation; the predicate above means: A = B and D > 5
All constants are denoted as a cast: TYPE('string')
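A minimal Python sketch of what select does to a bag of rows. The function `select` here is a hypothetical stand-in, not the X100 implementation. The C and D values are the ones on the slide; the A and B values are assumed placeholders, chosen only to be consistent with the slide's result.

```python
def select(table, predicate):
    """X100-style select sketch: keep the rows for which the predicate holds."""
    return [row for row in table if predicate(row)]

# C and D come from the slide. A and B are assumed placeholder values.
r = [
    {"A": "a", "B": "a", "C": 1,  "D": 7},
    {"A": "a", "B": "b", "C": 5,  "D": 7},
    {"A": "b", "B": "b", "C": 12, "D": 3},
    {"A": "b", "B": "b", "C": 23, "D": 10},
]

# select(r, and(==(A,B), >(D, int('5')))); the constant is a cast from a string
result = select(r, lambda t: t["A"] == t["B"] and t["D"] > int("5"))
for row in result:
    print(row["C"], row["D"])
```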
project(Table, [ Expr<T> ] )
Relation r (columns A, B, C; the A values did not survive extraction):
B   C
10  1
20  1
30  1
40  2
project (r, [ A, D=*(C,int('10')) ] )
Result (columns A, D):
D
10
10
10
20
X100 is a bag algebra: no duplicate elimination
join(Table, Table, Expr<bool>)
Relations r, s (only part of the column values survived extraction):
r (columns A, B, C, D): B = 1, 2, 4, 1, 2; D = a, a, b, a, b
s (columns E, F): E = 1, 3, 1, 2, 3
join(r, s, ==(B,E))
Result (columns A, B, C, D, F): B = 1, 1, 1, 1, 2; D = a, a, a, a, b
join(Table, Table, Expr<bool>)
X100 join result is the union of all attributes.
Name conflicts must be resolved with an extra project.
Relation t has columns E and C (E = 1, 3, 1, 2, 3); its C conflicts with the C of r, so it is renamed to F first:
s = project( t, [ E, F=C ] )
join(r, s, ==(B,E))
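The renaming step can be illustrated with a small Python sketch. The functions `project` and `join` below are hypothetical stand-ins for the X100 operators, the join is a naive nested loop, and the row values are made up for illustration because the slide's values are incomplete.

```python
def project(table, exprs):
    """exprs maps each output column name to a function of the input row."""
    return [{name: f(row) for name, f in exprs.items()} for row in table]

def join(left, right, predicate):
    """Nested-loop join; each result row is the union of both rows' attributes."""
    out = []
    for l in left:
        for r in right:
            clash = l.keys() & r.keys()
            if clash:
                raise ValueError(f"name conflict on {sorted(clash)}")
            merged = {**l, **r}
            if predicate(merged):
                out.append(merged)
    return out

# Made-up rows; both r and t have a column named C.
r = [{"B": 1, "C": "p"}, {"B": 2, "C": "q"}]
t = [{"E": 1, "C": "u"}, {"E": 3, "C": "v"}, {"E": 1, "C": "w"}]

# s = project(t, [E, F=C]) resolves the conflict before joining
s = project(t, {"E": lambda x: x["E"], "F": lambda x: x["C"]})
result = join(r, s, lambda x: x["B"] == x["E"])
print(result)
```

Calling `join(r, t, ...)` directly raises an error in this sketch, which mirrors the rule that name conflicts have to be resolved by an extra project.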
aggr(Table, [Expr<T>], [AggrFcn<T>])
Relation account grouped by branch-name:
branch-name  account-number  balance
Perryridge   A-102           400
Perryridge   A-201           900
Brighton     A-217           750
Brighton     A-215           750
Redwood      A-222           700
aggr( account, [ branch-name ], [ balance = sum(balance) ] )
branch-name  balance
Perryridge   1300
Brighton     1500
Redwood      700
Each aggregate in the list has the form: Identifier = AggrFcn(Identifier)
AggrFcn<T> ::= count<uint>() | avg<T>(T) | sum<T>(T) | min<T>(T) | max<T>(T)
aggr(Table, [Expr<T>], [AggrFcn<T>])
Relation r (columns A, B, C; only the C values survived extraction):
C
7
7
3
10
aggr( r, [], [total = sum(C)])
total
27
Empty groupby-list → Global aggregate
aggr(Table, [Expr<T>], [AggrFcn<T>])
Relation account grouped by branch-name:
branch-name  account-number  balance
Perryridge   A-102           400
Perryridge   A-201           900
Brighton     A-217           750
Brighton     A-215           750
Redwood      A-222           700
aggr( account, [ branch-name ], [] )
branch-name
Perryridge
Brighton
Redwood
Empty AggrFcn-list → Duplicate elimination
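The three uses of aggr shown on these slides (grouped aggregate, global aggregate, duplicate elimination) can be sketched with one small Python function. The function `aggr` and its argument conventions are assumptions made for this illustration; the data is taken from the slides.

```python
def aggr(table, group_by, aggregates):
    """X100-style aggr sketch.

    group_by:   list of column names
    aggregates: dict mapping output name -> (aggregate function, input column)
    """
    groups = {}
    for row in table:
        key = tuple(row[c] for c in group_by)
        groups.setdefault(key, []).append(row)
    out = []
    for key, rows in groups.items():
        result = dict(zip(group_by, key))
        for name, (fn, col) in aggregates.items():
            result[name] = fn(r[col] for r in rows)
        out.append(result)
    return out

account = [
    {"branch-name": "Perryridge", "account-number": "A-102", "balance": 400},
    {"branch-name": "Perryridge", "account-number": "A-201", "balance": 900},
    {"branch-name": "Brighton",   "account-number": "A-217", "balance": 750},
    {"branch-name": "Brighton",   "account-number": "A-215", "balance": 750},
    {"branch-name": "Redwood",    "account-number": "A-222", "balance": 700},
]
r = [{"C": c} for c in (7, 7, 3, 10)]

per_branch = aggr(account, ["branch-name"], {"balance": (sum, "balance")})
total = aggr(r, [], {"total": (sum, "C")})        # empty groupby-list
branches = aggr(account, ["branch-name"], {})     # empty AggrFcn-list
print(per_branch, total, branches, sep="\n")
```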
order (Table, [ Expr<T>])
• Relation r (columns A, B, C, D; the A and B values did not survive extraction):
C   D
23  10
12  9
35  7
25  7
• order(r, [D, C desc])
C   D
35  7
25  7
12  9
23  10
topn(Table, [ Expr<T>], int)
• Relation r (columns A, B, C, D; the A and B values did not survive extraction):
C   D
23  10
12  9
35  7
25  7
• topn(r, [D, C desc], int('2') )
C   D
35  7
25  7
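A short Python sketch of order and topn on the C and D columns from the slides. The functions are hypothetical illustrations; in particular, this topn simply sorts everything and cuts off, whereas a real system would more likely keep only the best n rows while scanning.

```python
def order(table, keys):
    """keys is a list of (column, descending) pairs, most significant first."""
    rows = list(table)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-key order.
    for col, desc in reversed(keys):
        rows.sort(key=lambda t: t[col], reverse=desc)
    return rows

def topn(table, keys, n):
    """Naive topn sketch: full sort followed by a cut-off."""
    return order(table, keys)[:n]

r = [{"C": 23, "D": 10}, {"C": 12, "D": 9}, {"C": 35, "D": 7}, {"C": 25, "D": 7}]

ordered = order(r, [("D", False), ("C", True)])   # order(r, [D, C desc])
top2 = topn(r, [("D", False), ("C", True)], 2)    # topn(r, [D, C desc], int('2'))
print([(t["C"], t["D"]) for t in ordered])
print([(t["C"], t["D"]) for t in top2])
```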
TPC-H: Data Warehousing Scenario
http://www.tpc.org
• TPC-C transaction processing
• TPC-H data warehousing
Large repository of data about Orders, consisting of Lineitems, delivered to Customers.
CUSTOMER 1:n ORDER 1:n LINEITEM
Query 3:
"Give date, priority and sum of the top 10 high revenue orders for construction customers that had been ordered but not yet shipped on March 15"
SQL Data Warehousing Query (TPC-H Query 3)
select l_orderkey, o_orderdate, o_shippriority,
       sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and c_mktsegment = 'BUILDING'
  and o_orderdate < date '1995-03-15'
  and l_shipdate > date '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;
SQL → Algebra translation
The clauses of Query 3 map onto X100 operators:
join: from customer, orders, lineitem, with c_custkey = o_custkey and l_orderkey = o_orderkey
select: c_mktsegment = 'BUILDING', o_orderdate < date '1995-03-15', l_shipdate > date '1995-03-15'
aggr: group by l_orderkey, o_orderdate, o_shippriority, with revenue = sum(l_extendedprice * (1 - l_discount))
topn: order by revenue desc, o_orderdate, limit 10
Query in X100 Algebra
Storage Hierarchy
level                size   bandwidth        latency      EUR/GB  unit
Tape (HP)            300GB  80MB/s           10 min       0.10    32KB
Magnetic disk (IDE)  300GB  80MB/s           10000000ns   0.30    8KB
NAND Flash           4GB    60MB/s (20MB/s)  100000ns     20      2KB
RAM (DDR2)           2GB    3000MB/s         70ns         60      64B
L2 CPU cache         2MB    7000MB/s         10ns                 64B
L1 CPU cache         64KB   24000MB/s        1ns                  64B
CPU registers        128B   24000MB/s        1                    8B
Hardware Trends
(Figure: trends in CPU speed (KHz), RAM size (KB), disk size (MB), RAM bandwidth (MB/s), disk bandwidth (MB/s), RAM latency (ns) and disk latency (ms))
Storage Hierarchy (Cont.)
primary storage: fastest media but volatile (cache, main memory).
secondary storage: next level in hierarchy, non-volatile, moderately fast access time
  also called on-line storage
  E.g. flash memory, magnetic disks
tertiary storage: lowest level in hierarchy, non-volatile, slow access time
  also called off-line storage
  E.g. magnetic tape, optical storage
Magnetic Hard Disk Mechanism
NOTE: Diagram is schematic, and simplifies the structure of actual disk drives
Performance Measures of Disks
Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
  Seek time – time it takes to reposition the arm over the correct track.
    Average seek time is 1/2 the worst case seek time.
      – Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
    4 to 10 milliseconds on typical disks
  Rotational latency – time it takes for the sector to be accessed to appear under the head.
    Average latency is 1/2 of the worst case latency.
    Worst case (one full rotation) is 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
  20 to 60 MB per second is typical
  Multiple disks may share a controller, so the rate that the controller can handle is also important
    E.g. ATA: 100 MB/second, SCSI: 320 MB/
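The numbers on this slide can be combined into a rough cost estimate for one disk read. The sketch below is an illustration only: the function names and the example parameters (4 ms average seek, 15000 r.p.m., an 8 KB block, 40 MB/s) are assumptions picked from within the ranges quoted above, with 1 MB taken as 10^6 bytes.

```python
def rotation_ms(rpm):
    """Time for one full rotation, i.e. the worst-case rotational latency."""
    return 60_000 / rpm

def avg_access_ms(avg_seek_ms, rpm):
    """Average access time = average seek + half a rotation."""
    return avg_seek_ms + rotation_ms(rpm) / 2

def read_ms(avg_seek_ms, rpm, nbytes, mb_per_s):
    """Access time plus transfer time for nbytes at the given transfer rate."""
    transfer_ms = nbytes / (mb_per_s * 1_000_000) * 1000
    return avg_access_ms(avg_seek_ms, rpm) + transfer_ms

print(rotation_ms(5400))              # about 11.1 ms per rotation
print(rotation_ms(15000))             # 4.0 ms per rotation
print(avg_access_ms(4, 15000))        # 6.0 ms
print(read_ms(4, 15000, 8192, 40))    # access time dominates the 8 KB transfer
```

For small blocks the transfer term is a fraction of a millisecond, so the cost of a random read is almost entirely seek time and rotational latency.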
Magnetic Disk Hardware Trends
Performance Measures (Cont.)
Mean time to failure (MTTF) – the average time the disk is expected to run continuously without any failure.
  Typically 3 to 5 years
  Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk
    E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
  MTTF decreases as disk ages
RAID
RAID: Redundant Arrays of Independent Disks
  disk organization techniques that manage a large number of disks, providing a view of a single disk of
    high capacity and high speed by using multiple disks in parallel, and
    high reliability by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
  E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
  Techniques for using redundancy to avoid data loss are critical with large numbers of disks
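Both MTTF examples in these slides (1000 disks with an MTTF of 1,200,000 hours, and 100 disks with an MTTF of 100,000 hours) follow from the same rule of thumb: with independent failures, the expected time until some disk in a set of N fails is roughly the single-disk MTTF divided by N. A minimal sketch, where the function name is hypothetical:

```python
HOURS_PER_DAY = 24

def hours_between_failures(mttf_hours, n_disks):
    """Expected time until some disk in the set fails, assuming independent failures."""
    return mttf_hours / n_disks

fleet = hours_between_failures(1_200_000, 1000)
system = hours_between_failures(100_000, 100)
print(fleet)                    # 1200.0 hours
print(system / HOURS_PER_DAY)   # about 41.7 days
```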
Improvement of Reliability via Redundancy
Redundancy – store extra information that can be used to rebuild information lost in a disk failure
E.g., Mirroring (or shadowing)
  Duplicate every disk. Logical disk consists of two physical disks.
  Every write is carried out on both disks
    Reads can take place from either disk
  If one disk in a pair fails, data still available in the other
    Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
      – Probability of combined event is very small
        » Except for dependent failure modes such as fire or building collapse or electrical power surges
Mean time to data loss depends on mean time to failure, and mean time to repair
  E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500 × 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
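The slide does not state the formula behind the 500 × 10^6 hours. The figure is reproduced by the common approximation MTTF² / (2 × MTTR) for a mirrored pair with independent failures; the sketch below assumes that formula, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365

def mirrored_mttdl_hours(mttf_hours, mttr_hours):
    """Mean time to data loss for a mirrored pair, assuming independent
    failures and the approximation MTTF^2 / (2 * MTTR)."""
    return mttf_hours ** 2 / (2 * mttr_hours)

mttdl = mirrored_mttdl_hours(100_000, 10)
print(mttdl)                    # 500000000.0 hours, i.e. 500 * 10^6
print(mttdl / HOURS_PER_YEAR)   # roughly 57,000 years
```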
RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
  Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
  Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
  Offers best write performance.
  Popular for applications such as storing log files in a database system.
RAID Levels (Cont.)
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
  E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
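The parity placement rule for the 5-disk example can be written out directly. This is a small sketch of the rule as stated on the slide; the function name `raid5_layout` is hypothetical, and disks are numbered from 1 as in the slide.

```python
def raid5_layout(n, disks=5):
    """Return (parity disk, data disks) for the nth set of blocks.

    Parity goes to disk (n mod disks) + 1; the data blocks of that set
    go to the remaining disks.
    """
    parity = n % disks + 1
    data = [d for d in range(1, disks + 1) if d != parity]
    return parity, data

for n in range(6):
    print(n, raid5_layout(n))
```

The parity disk rotates through all disks, so no single disk becomes the bottleneck for parity updates.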
Choice of RAID Level
Level 0 provides maximum performance, no safety
Level 1 provides much better write performance than level 5
  Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
  Level 1 preferred for high update environments such as log disks
Level 1 has higher storage cost than level 5
  disk drive capacities increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
  I/O requirements have increased greatly, e.g. for Web servers
  When enough disks have been bought to satisfy required rate of I/O, they often have spare storage capacity
    so there is often no extra monetary cost for Level 1!
Level 5 is preferred for applications with low update rate, and large amounts of data
Level 1 is preferred for all other applications
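Why a Level 5 write of a single block costs 2 reads and 2 writes can be seen from how XOR parity is updated: the new parity is computed from the old data block and the old parity block, without touching the other data disks. The sketch below assumes XOR parity, which is the usual choice; the helper names and block contents are made up for illustration.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    """New parity after overwriting one data block.

    Needs 2 block reads (old data, old parity); the caller then performs
    2 block writes (new data, new parity).
    """
    return xor(xor(old_parity, old_data), new_data)

# Made-up contents of the 4 data blocks in one stripe.
blocks = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = reduce(xor, blocks)

new_block = b"\xff\x00"
new_parity = raid5_small_write(blocks[2], parity, new_block)
blocks[2] = new_block

# The shortcut agrees with recomputing parity over the whole stripe.
print(new_parity == reduce(xor, blocks))
```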
Hardware Issues
Hot swapping: replacement of disk while system is running, without power down
  Supported by some hardware RAID systems,
  reduces time to recovery, and improves availability greatly
Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
  Reduces time to recovery greatly
Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
  Redundant power supplies with battery backup
  Multiple controllers and multiple interconnections to guard against controller/interconnection failures
Index Classification
Primary vs. Secondary
  primary – the index on the primary key
  unique – an index on a candidate key
  secondary – not primary
Clustered vs. Unclustered
  clustered – key order corresponds with record order
    E.g. B-tree separate from record file
    Index-organized table: B-tree leaves store records (no file)
  unclustered – index contains record-IDs in random order
B+Tree, n=4
Root: [100]
Non-leaf nodes: [30] [120 150 180]
Leaves: [3 5 11] [30 35] [100 101 110] [120 130] [150 156 179] [180 200]
Sample non-leaf
Keys: 57 | 81 | 95
Pointers: to keys k < 57 | to keys 57 ≤ k < 81 | to keys 81 ≤ k < 95 | to keys k ≥ 95
Sample leaf node:
From non-leaf node
Keys: 57 | 81 | 95
Pointers: to record with key 57 | to record with key 81 | to record with key 95 | to next leaf in sequence
Non-root nodes have to be at least half-full
Use at least
Non-leaf: ⌈n/2⌉ children
Leaf: ⌈(n-1)/2⌉ pointers to data
n=4       Full node      Min. node
Non-leaf  [120 150 180]  [30]
Leaf      [3 5 11]       [30 35]
Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
(simple case) Insert key = 32, n=4
Before: root [100]; non-leaf [30]; leaves [3 5 11] [30 31]
After: leaf [30 31] becomes [30 31 32]; no other node changes
(leaf overflow) Insert key = 7, n=4
Before: root [100]; non-leaf [30]; leaves [3 5 11] [30 31]
After: leaf [3 5 11] splits into [3 5] and [7 11]; key 7 is added to the parent, which becomes [7 30]
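(internal overflow) Insert key = 160, n=4
Before: root [100]; non-leaf [120 150 180]; leaves [150 156 179] [180 200]
After: leaf [150 156 179] splits into [150 156] and [160 179]; the non-leaf node overflows and splits into [120 150] and [180]; key 160 moves up into the root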
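(new root) Insert key = 45, n=4
Before: root [10 20 30]; leaves [1 2 3] [10 12] [20 25] [30 32 40]
After: leaf [30 32 40] splits into [30 32] and [40 45]; the old root overflows and splits into [10 20] and [40]; key 30 moves up into a new root [30]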
insert:
1, 2, 10, 20, 3, 12, 30, 32, 25, 40, 45 (n=4)
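The four insert cases can be tried out on this key sequence with a compact Python sketch. It is an illustration under assumptions, not the textbook's algorithm verbatim: n=4 is taken to mean at most 3 keys per node, an overflowing leaf with 4 keys is split 2 + 2 with the smallest key of the new right leaf copied up, and an overflowing non-leaf node moves its third key up. Leaf chaining, record pointers and deletion are left out.

```python
import bisect

MAX_KEYS = 3  # n = 4 pointers per node, so at most n - 1 = 3 keys

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # child nodes (non-leaf nodes only)

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split                 # case (d): the old root was split
    new_root = Node(leaf=False)
    new_root.keys = [sep]
    new_root.children = [root, right]
    return new_root

def _insert(node, key):
    """Insert into a subtree; return (separator, new right node) on a split."""
    if node.leaf:
        bisect.insort(node.keys, key)
        if len(node.keys) <= MAX_KEYS:
            return None                # case (a): space available in leaf
        mid = len(node.keys) // 2      # case (b): leaf overflow, split 2 + 2
        right = Node(leaf=True)
        right.keys = node.keys[mid:]
        node.keys = node.keys[:mid]
        return right.keys[0], right    # copy the smallest right key upwards
    i = bisect.bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2          # case (c): non-leaf overflow
    up = node.keys[mid]                # the middle key moves up, it is not copied
    new = Node(leaf=False)
    new.keys = node.keys[mid + 1:]
    new.children = node.children[mid + 1:]
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up, new

def leaves(node):
    """Key lists of all leaves, left to right."""
    if node.leaf:
        return [node.keys]
    return [ks for child in node.children for ks in leaves(child)]

root = Node(leaf=True)
for k in [1, 2, 10, 20, 3, 12, 30, 32, 25, 40, 45]:
    root = insert(root, k)
print(root.keys, [c.keys for c in root.children], leaves(root))
```

Under these assumptions the tree just before inserting 45 has root [10 20 30] and leaves [1 2 3] [10 12] [20 25] [30 32 40], which is the starting point of the new-root example, and inserting 45 produces the new root [30].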
Interesting problem:
For B+tree, how large should n be?
…
n is number of keys / node
Assumptions
You have the right to set the disk page size for the disk where a B-tree will reside.
Compute the optimum page size n assuming that
  The items are 4 bytes long and the pointers are also 4 bytes long.
  Time to read a node from disk is 10 + 0.0002n
  Time to process a block in memory is unimportant
  B+tree is full (i.e., every page has the maximum number of items and pointers)
Find $n_{opt}$ by $f'(n) = 0$
What happens to $n_{opt}$ as
  Disk bandwidth increases?
  Access time stays behind?
  CPUs get faster?
$f(n)$ = time to find a record
$f(n) = \log_n(T) \cdot (10 + 0.0002\,n)$
1994 (book): n = 500
2004 (now): n = 4000
1994
  Table: 1M records
  10ms access time
  4MB/s bandwidth
  n ~ 500-1000
  4KB / 8KB pages
  Be conservative to limit RAM consumption
2004
  Table: 10M records
  6ms access time
  40MB/s bandwidth
  n ~ 1000-4000
  8KB / 32KB pages
  relative benefit decreases so don't overdo it
Find $n_{opt}$ by $f'(n) = 0$
Answer should be $n_{opt}$ = "few thousand"
What happens to $n_{opt}$ as
  block sizes are increasing..
  Disk bandwidth increases?
  Access time stays behind?
  CPUs get faster?
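The optimum can be checked numerically. Writing the node read time as a + b·n, the function f(n) = log_n(T) · (a + b·n) has f'(n) = 0 exactly when n · (ln n − 1) = a / b, independent of T. The sketch below solves this by bisection. It rests on assumptions: times are in milliseconds, each entry is 8 bytes (4-byte item plus 4-byte pointer, as in the assumptions slide), 1 MB is 10^6 bytes, and the function names are hypothetical.

```python
import math

def per_entry_ms(mb_per_s, entry_bytes=8):
    """Transfer time per (item, pointer) entry, assuming 8 bytes per entry."""
    return entry_bytes / (mb_per_s * 1_000_000) * 1000

def search_time_ms(n, records, access_ms, entry_ms):
    """f(n) = log_n(T) * (access + per-entry transfer * n)."""
    return math.log(records) / math.log(n) * (access_ms + entry_ms * n)

def n_opt(access_ms, entry_ms):
    """Solve f'(n) = 0, i.e. n * (ln n - 1) = access_ms / entry_ms, by bisection."""
    target = access_ms / entry_ms
    lo, hi = math.e, 1e9               # n * (ln n - 1) is increasing for n > 1
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * (math.log(mid) - 1) < target:
            lo = mid
        else:
            hi = mid
    return lo

n_1994 = n_opt(10, per_entry_ms(4))    # 10 ms access, 4 MB/s
n_2004 = n_opt(6, per_entry_ms(40))    # 6 ms access, 40 MB/s
n_slide = n_opt(10, 0.0002)            # the formula 10 + 0.0002n as given
print(round(n_1994), round(n_2004), round(n_slide))
```

With these inputs the 1994 optimum falls within the 500-1000 range quoted in the slides and the 2004 optimum lands close to 4000. Higher bandwidth lowers b and pushes the optimum up, while a lower access time pulls it down.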
Primary or Auxiliary Structure
Primary index
  Leaf blocks in sequence → clustered index
  Main storage structure for a database table
    E.g. B+-tree organized file / hash structured files
  Typically an index on a unique key
    But not necessarily
  Normally, you can have only one clustered index!
Secondary index
  Also called unclustered index
  A separate file from where the table is stored
  Refers with (block/offset) pointers to records in the table file
  You can define as many as you want (to maintain)
Clustered vs. Unclustered Index
Range query (low .. high) on a primary B-tree index:
  1 access only (rest is 'just' bandwidth)
Range query (low .. high) on a secondary B-tree index:
  Pay N times access cost
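A back-of-the-envelope comparison shows how quickly "pay N times access cost" catches up with a full sequential scan. The sketch uses the 2004 figures from the earlier slides (6 ms access time, 40 MB/s, 10M records); the record size of 100 bytes is an assumption made only for this illustration, as are the function names, and 1 MB is taken as 10^6 bytes.

```python
ACCESS_MS = 6            # random access time (2004 figure from the slides)
BANDWIDTH_MB_S = 40      # sequential bandwidth (2004 figure from the slides)
RECORDS = 10_000_000     # table size in records (2004 figure from the slides)
RECORD_BYTES = 100       # assumed record size, not from the slides

def sequential_ms(nbytes):
    """One access, the rest is 'just' bandwidth (clustered range or full scan)."""
    return ACCESS_MS + nbytes / (BANDWIDTH_MB_S * 1e6) * 1000

def unclustered_ms(n_records):
    """One random access per qualifying record."""
    return n_records * ACCESS_MS

full_scan = sequential_ms(RECORDS * RECORD_BYTES)
break_even = full_scan / ACCESS_MS    # records fetched before the scan wins
print(full_scan)                      # milliseconds for the whole table
print(break_even, break_even / RECORDS)
```

Under these assumptions, the unclustered index loses to a plain table scan once a query touches more than a few thousand records, a tiny fraction of the table.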
Are Unclustered Indices a Good Idea?
Secondary indices depend on random I/O
  can do asynchronous I/O (multiple I/Os at-a-time)
  degenerates into full table scans
Block size for sequential reads?
When do random I/Os make sense?
Are Unclustered Indices a Good Idea? (Cont.)
Is not using an index at all better?
  I.e. read the entire table sequentially without any index
Use redundant clustered orderings
  – Materialized views
  – C-STORE (Stonebraker et al., VLDB 2005), MonetDB/X100
  – Database Cracking (Kersten, CIDR 2005+2007)