Architecture-Conscious Database Systems
Anastassia Ailamaki
Ph.D. Examination, November 30, 2000
© 2000 Anastassia Ailamaki Ph.D. Defense 2
[Diagram: a DBMS on a 1980 computer. The processor (10 cycles/instruction) runs the DBMS; main memory (1 Megabyte, 6 cycles away) holds the buffer pool with the hot data plus DBMS data and instructions; the database on disk is reached at < 1 MBps.]

The main performance bottleneck was I/O latency

A DBMS on a 1980 Computer
[Diagram: a DBMS on a modern platform. The processor (0.33 cycles/instruction) executes the DBMS out of a cache holding data and instructions; main memory (1 Gigabyte, 70 cycles away) holds DBMS data and instructions; the databases on disk total 1 Terabyte; bandwidths between the levels are 25 MBps and 75 MBps.]

Hot data migrates to larger and slower main memory

Present and Future Platforms
[Chart: cycles per instruction and memory latency, VAX 11/780 vs. Pentium II Xeon. CPI drops from 10 to 0.33 while memory latency grows from 6 to 70 cycles.]

One access to memory is 100's of instruction opportunities

Processor & Memory Speed Gap
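The "100's of instruction opportunities" claim can be checked with back-of-the-envelope arithmetic; this sketch uses only the CPI and latency figures from the chart:

```python
# Instruction opportunities lost per memory access, from the chart's figures.
machines = {
    "VAX 11/780":      {"cpi": 10.0, "mem_latency_cycles": 6},
    "Pentium II Xeon": {"cpi": 0.33, "mem_latency_cycles": 70},
}

for name, m in machines.items():
    # While one memory access is outstanding, this many instructions
    # could have been issued instead.
    lost = m["mem_latency_cycles"] / m["cpi"]
    print(f"{name}: ~{lost:.0f} instruction opportunities per memory access")
```

On the Xeon this works out to roughly 70 / 0.33 ≈ 212 instructions per access, versus less than one on the VAX.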
On Today's Computers

"When you think about what today's machines do - they look at the instruction stream dynamically, find parallelism on the fly, execute instructions out of order, and speculate on branch outcomes - it's amazing that they work."
John Hennessy, IEEE Computer, August 1999

New architectures are more sophisticated
[Chart: cycles per instruction by workload. Theoretical minimum: 0.33; Desktop/Engineering (SPECInt): 0.8; Decision Support: 1.4; Online Transaction Processing: 3.0.]

High average time per instruction for DB workloads

Why Study Database Performance?
Problem: Where does query execution time go?

Proposed evaluation framework [VLDB'99]
- Identified bottlenecks in hardware: memory access, hardware implementation details
- Discovered two memory-related bottlenecks: second-level cache data access, first-level instruction cache access
- Methodological discovery: micro-benchmarks

A systematic evaluation framework

Contributions (I): Analysis
Problem: Current data placement hurts caches

Proposed novel data placement [subm. SIGMOD'01]
- Rearranges data records on the disk page
- Optimizes data cache performance

Evaluated it against the popular scheme
- 70% less data-related memory access delays
- Does not affect I/O behavior
- Especially beneficial for decision support workloads

A cache-conscious data placement

Contributions (II): Software
Problem: Hardware design affects DB behavior

Compared Shore on four different systems
- Different processor architectures/µ-architectures
- Different memory subsystems

Found evidence that DBMSs would benefit from
- 2-4 way associative, larger L2, no inclusion
- Large blocks, no sub-blocking
- High-accuracy branch prediction
- Memory-aggressive execution engine

A step towards a DSS-centric machine

Contributions (III): Hardware
Introduction
PART I: Analysis
- Background
- Query execution time breakdown
- Experimental results
- Bottleneck assessment
PART II: Partition Attributes Across (PAX)
PART III: Towards a DSS-centric h/w design
Conclusions

Outline
Workload characterization studies, e.g., [Barroso 98], [Keeton 98]
- Various platforms, mostly multiprocessor
- One DBMS per platform

Results:
- Commercial apps differ from scientific apps
- OLTP differs from DSS workloads
- Memory is the major bottleneck

No coherent study across DBMSs and workloads

Previous Work
[Diagram: an out-of-order execution pipeline. A fetch/decode unit feeds an instruction pool; a dispatch/execute unit and a retire unit drain it. The pipeline is served by the L1 I-cache and L1 D-cache, backed by the L2 cache and main memory.]

Branch prediction, non-blocking caches, out-of-order execution

An Execution Pipeline
Delays (stalls) come from memory, branch mispredictions, and hardware resources.

Overlap opportunity:
  Load A
  D = B + C
  Load E

Naive model:   Execution Time = Computation + Stalls
With overlap:  Execution Time = Computation + Stalls - Overlap

Where Does Time Go?
- Four commercial DBMSs: A, B, C, D
- 6400 PII Xeon/MT running Windows NT 4
- Used processor counters to measure/estimate

Range selection (sequential, indexed):
  select avg(a3) from R where a2 > Lo and a2 < Hi

Equijoin (sequential):
  select avg(a3) from R, S where R.a2 = S.a1

Crafted microbenchmarks to isolate execution loops

Setup and Methodology
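The shape of the range-selection microbenchmark can be sketched in a few lines; this is only a stand-in mirroring the query above (the real runs executed inside commercial DBMSs, and the table contents here are made up):

```python
import random

# Hypothetical in-memory stand-in for relation R(a1, a2, a3).
random.seed(42)
R = [(i, random.randint(0, 1000), random.random()) for i in range(10_000)]

def range_select_avg(table, lo, hi):
    """select avg(a3) from R where a2 > lo and a2 < hi"""
    hits = [a3 for (_a1, a2, a3) in table if lo < a2 < hi]
    return sum(hits) / len(hits) if hits else None

avg = range_select_avg(R, 100, 200)
print(avg)
```

Varying `lo`/`hi` varies the selectivity, which is what makes the microbenchmark's inner loop easy to isolate and measure.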
Measured: resource stalls, L1I stalls
Estimated:
- L1 data stalls: # misses * penalty
- L2 stalls: # misses * measured memory latency
- Branch misprediction stalls: # mispredictions * penalty
Overlap: measured CPI / expected CPI

Time Calculations
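The accounting above can be sketched directly; the counter values and penalties below are made-up placeholders, only the formulas follow the slide:

```python
# Sketch of the slide's stall-time accounting (placeholder numbers).
def stall_breakdown(counters, l1d_penalty, mem_latency, br_penalty):
    return {
        "L1 data":  counters["l1d_misses"] * l1d_penalty,
        "L2":       counters["l2_misses"] * mem_latency,
        "branch":   counters["br_mispredictions"] * br_penalty,
        # Resource and L1-I stalls are measured directly, not estimated.
        "resource": counters["resource_stalls"],
        "L1 instr": counters["l1i_stalls"],
    }

counters = {"l1d_misses": 1_000, "l2_misses": 200,
            "br_mispredictions": 50, "resource_stalls": 4_000,
            "l1i_stalls": 9_000}
stalls = stall_breakdown(counters, l1d_penalty=4, mem_latency=70, br_penalty=15)
total_stalls = sum(stalls.values())

# Overlap factor: measured cycles over the cycles expected with no overlap.
computation = 20_000       # cycles, placeholder
measured_cycles = 30_000   # placeholder
overlap = measured_cycles / (computation + total_stalls)
print(stalls, overlap)
```

An overlap factor below 1 means the out-of-order engine hid part of the estimated stall time.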
[Charts: clock-cycle breakdowns (computation, memory, branch misprediction, resource stalls) for System B and System D on mBench(seq), TPC-D, mBench(idx), and TPC-C.]

High CPI compared to integer workloads
The sequential-scan microbenchmark tracks TPC-D; the secondary-index microbenchmark tracks TPC-C

Microbenchmarks vs. TPC
[Charts: execution time breakdown (computation, memory, branch mispredictions, resource stalls) for DBMSs A-D on a 10% sequential scan, a 10% secondary-index selection, and a join with no index.]

Stalls take at least 50% of the time
Memory stalls are the major bottleneck

Execution Time Breakdown (%)
[Charts: memory stall time breakdown (L1 data, L2 data, L1 instruction, L2 instruction) for DBMSs A-D on the same three queries.]

L1 instruction and L2 data stalls dominate
Different memory bottlenecks across DBMSs and queries

Memory Stalls Breakdown (%)
- We can use microbenchmarks instead of TPC
- Execution time breakdown shows trends
- Memory access is a major bottleneck

Increasing memory-processor performance gap; deeper memory hierarchies expected
- L2 cache data misses: L2 grows (8MB), but will be slower
- Stalls due to L1 I-cache misses: L1 I-cache not likely to grow as much as L2

We need to address every reason for stalls

Summary of Analysis
Who should address each stall component:
- Memory, D-cache: DBMS (improve locality)
- Memory, I-cache: DBMS + compiler
- Branch mispredictions: compiler + hardware
- Hardware resources: hardware

Data cache: a clear responsibility of the DBMS

Addressing Bottlenecks
Introduction
PART I: Where Does Time Go?
PART II: Partition Attributes Across
- The current scheme: slotted pages
- Partition Attributes Across (PAX)
- Performance results
PART III: Towards a DSS-centric h/w design
Conclusions

Outline
Slotted pages: used by all commercial DBMSs
- Store table records sequentially
- Intra-record locality (attributes of record r together)
- ...but pollutes the cache

Inspiration: vertical partitioning [Copeland'85]
- Store an n-attribute table as n single-attribute tables
- Problem: high record reconstruction cost

Partition Attributes Across (PAX): have the cake and eat it, too!
PAX: inter-record locality, low reconstruction cost

The Data Placement Tradeoff
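The reconstruction cost of vertical partitioning can be seen in a toy example; the layout names follow the slide, the tiny record values are from the talk's running example, and the join code is only an illustration:

```python
# NSM-style: whole records together -- attributes of a record are adjacent.
nsm = [(1237, "Jane", 30), (4322, "John", 45), (1563, "Jim", 20)]

# Vertical partitioning: one (record-id, value) table per attribute.
ssn  = [(1, 1237), (2, 4322), (3, 1563)]
name = [(1, "Jane"), (2, "John"), (3, "Jim")]
age  = [(1, 30), (2, 45), (3, 20)]

def reconstruct(*columns):
    """Rebuild full records: one join per attribute on the record id."""
    byrid = {}
    for col in columns:
        for rid, val in col:
            byrid.setdefault(rid, []).append(val)
    return [tuple(vals) for _rid, vals in sorted(byrid.items())]

recs = reconstruct(ssn, name, age)
print(recs)
```

Every full-record access pays this join; PAX avoids it by keeping all of a record's attributes on the same page.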
Relation R:

RID  SSN   Name   Age
1    1237  Jane   30
2    4322  John   45
3    1563  Jim    20
4    7658  Susan  52
5    2534  Leon   43
6    8791  Dan    37

[Diagram: the corresponding slotted page. PAGE HEADER, then records back to back (RH1 1237 Jane 30, RH2 4322 John 45, RH3 1563 Jim 20, RH4 7658 Susan 52, ...), with offsets to the start of each record stored at the end of the page.]

Records are stored sequentially
Offsets to the start of each record at the end of the page
Formal name: NSM (N-ary Storage Model)

Current Scheme: Slotted Pages
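A minimal sketch of the slotted-page mechanics described above; the page size, slot width, and record encoding are invented for illustration:

```python
# Slotted (NSM) page: record data grows from the front, the slot array of
# record offsets grows from the back.
PAGE_SIZE = 256

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free = 0          # next free byte for record data
        self.nslots = 0

    def insert(self, record: bytes) -> bool:
        needed = len(record) + 2           # 2 bytes for the new slot entry
        if self.free + needed > PAGE_SIZE - 2 * self.nslots:
            return False                   # page full
        off = self.free
        self.data[off:off + len(record)] = record
        self.free += len(record)
        self.nslots += 1
        slot_pos = PAGE_SIZE - 2 * self.nslots
        self.data[slot_pos:slot_pos + 2] = off.to_bytes(2, "little")
        return True

    def record_offset(self, i: int) -> int:
        slot_pos = PAGE_SIZE - 2 * (i + 1)
        return int.from_bytes(self.data[slot_pos:slot_pos + 2], "little")

p = SlottedPage()
p.insert(b"1237|Jane|30")
p.insert(b"4322|John|45")
print(p.record_offset(1))   # offset of the second record
```

The slot indirection is what lets records move within the page without invalidating external record ids.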
[Diagram: record layout. HEADER (null bitmap, record length, offsets to variable-length fields), then the fixed-length values, then the variable-length values.]

All attributes of a record are stored together

Current Scheme: Slotted Pages
select name from R where age > 40

[Diagram: running the query on an NSM page. Each age tested drags a whole cache block of unrelated fields from main memory into the cache: block 1 (30 Jane RH2 ...), block 2 (45 RH3 1563 ...), block 3 (Jim 20 RH4 ...), block 4 (52 2534 Leon ...).]

NSM pollutes the cache and wastes bandwidth

NSM Cache Behavior
[Diagram: the same records in an NSM page (whole records back to back) and in a PAX page (values grouped by attribute: SSNs 1237 4322 1563 7658, then names Jane John Jim Susan, then ages 30 45 20 52).]

Partition data within the page for spatial locality

Partition Attributes Across (PAX)
select name from R where age > 40

[Diagram: the same query on a PAX page. The ages (30 45 20 52) sit together in one minipage, so a single cache block serves the whole predicate.]

Fewer cache misses, low reconstruction cost

PAX: Mapping to Cache
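The cache effect can be counted directly. This sketch tallies how many distinct cache blocks the age values of the predicate touch under each layout; the record and field sizes are invented, and the 32-byte block matches what the deck reports for the Xeon:

```python
# Distinct cache blocks touched by the age column of
# "select name from R where age > 40" under each layout.
BLOCK = 32
RECORD_SIZE = 24          # whole NSM record (header + SSN + name + age)
AGE_SIZE = 4
N = 1000                  # records on an (oversized) toy page

def blocks_touched(addresses):
    return len({a // BLOCK for a in addresses})

# NSM: each age sits inside its 24-byte record.
nsm_age_addrs = [i * RECORD_SIZE + 20 for i in range(N)]
# PAX: ages are packed contiguously in an age minipage.
pax_age_addrs = [i * AGE_SIZE for i in range(N)]

print(blocks_touched(nsm_age_addrs), blocks_touched(pax_age_addrs))  # 750 125
```

With these sizes the predicate touches 6x fewer blocks under PAX, which is the spatial-locality argument of the slide in miniature.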
[Diagram: PAX page layout. Page header (pid, # attributes, attribute sizes, free space, # records, v-offsets), then one minipage per attribute: F-minipages for fixed-length values (e.g., SSNs 1237 4322; ages 30 45) trailed by presence bits, and V-minipages for variable-length values (e.g., names Jane John) trailed by v-offsets.]

PAX: Detailed Design
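The bookkeeping in that layout can be sketched as plain objects; this models the structure only (not the byte-level layout or Shore's actual code), and the class and field names are invented:

```python
# Sketch of the PAX page organization: a header plus one minipage per
# attribute (F-minipages for fixed-length values, V-minipages for
# variable-length ones).
class FMinipage:
    """Fixed-length attribute values plus presence (null) bits."""
    def __init__(self, size):
        self.size = size          # attribute size, from the page header
        self.values, self.present = [], []

    def append(self, v):
        self.values.append(v)
        self.present.append(v is not None)

class VMinipage:
    """Variable-length values plus v-offsets to each value's end."""
    def __init__(self):
        self.data, self.offsets = b"", []

    def append(self, v: bytes):
        self.data += v
        self.offsets.append(len(self.data))

class PaxPage:
    def __init__(self, attr_sizes):
        # attr_sizes: fixed byte size per attribute, or None for variable.
        self.nrecords = 0
        self.minipages = [FMinipage(s) if s else VMinipage()
                          for s in attr_sizes]

    def insert(self, record):
        for mp, v in zip(self.minipages, record):
            mp.append(v)
        self.nrecords += 1

    def get(self, rid):
        """Reconstruct record `rid` -- one lookup per minipage, no join."""
        out = []
        for mp in self.minipages:
            if isinstance(mp, FMinipage):
                out.append(mp.values[rid])
            else:
                start = mp.offsets[rid - 1] if rid else 0
                out.append(mp.data[start:mp.offsets[rid]])
        return tuple(out)

page = PaxPage([4, None, 4])          # SSN, Name (variable), Age
page.insert((1237, b"Jane", 30))
page.insert((4322, b"John", 45))
print(page.get(1))
```

Because all minipages live on the same page, reconstruction is positional, which is the "low reconstruction cost" half of the tradeoff.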
Main-memory resident R; query:
  select avg(ai) from R where aj >= Lo and aj <= Hi

- PII Xeon running Windows NT 4
- 16KB L1-I, 16KB L1-D, 512 KB L2, 512 MB RAM
- Used processor counters
- Implemented schemes on the Shore Storage Manager

Similar behavior to commercial database systems

Basic Evaluation: Methodology
Range selection query on 4 commercial DBMSs + Shore; breakdown of execution and memory delays.

[Charts: execution time breakdown (computation, memory, branch mispredictions, resource stalls) and memory stall breakdown (L1 data, L2 data, L1 instruction, L2 instruction) for a range selection (no index) on DBMSs A-D and Shore.]

We can use Shore to evaluate DSS workload behavior

Why Use Shore?
[Charts: (left) sensitivity to selectivity: L2 data stall cycles per record for NSM and PAX at 1%-100% selectivity; (right) L1 and L2 data stall cycles per record, NSM vs. PAX.]

PAX incurs 70% less data cache penalty than NSM
PAX reduces cache misses at both L1 and L2
Selectivity doesn't matter for PAX data stalls

Effect on Accessing Cache Data
[Charts: (left) execution time breakdown (computation, memory, branch mispredictions, resource stalls) in clock cycles per record for NSM vs. PAX; (right) elapsed time (sec) vs. number of attributes in the record (2-64) for NSM and PAX.]

PAX: 75% less memory penalty than NSM (10% of execution time)
Execution times converge as the number of attributes increases

Time and Sensitivity Analysis
Elapsed time sensitivity to projectivity and to the number of attributes in the predicate; range selection queries, 1% selectivity.

[Charts: elapsed time (sec) for NSM and PAX vs. projectivity (1-7) and vs. number of attributes in the predicate (1-7).]

PAX and NSM times converge as the query covers the entire tuple

Sensitivity Analysis (2)
Loaded 100M, 200M, and 500M TPC-H DBs. Ran queries:
- Range selections w/ variable parameters (RS)
- TPC-H Q1 and Q6: sequential scans, lots of aggregates (sum, avg, count), grouping/ordering of results
- TPC-H Q12 and Q14: (adaptive hybrid) hash join, complex 'where' clause, conditional aggregates

PII Xeon running Windows NT 4; used processor counters

Evaluation Using a DSS Benchmark
Insertion algorithm:
- Estimate average field sizes
- Start inserting records
- If a record doesn't fit, reorganize the page (move minipage boundaries) and adjust the average field sizes

50% of reorganizations accommodate a single record

[Chart: elapsed bulk load times (seconds) for 100 MB, 200 MB, and 500 MB databases under NSM, PAX, and DSM.]

PAX loads a TPC-H database 2-26% slower than NSM

Insertions with PAX
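The insertion policy above can be modeled loosely; this sketch is not Shore's code, and the page size, budgets, and field sizes are all invented:

```python
# Loose model of PAX bulk loading: minipage boundaries start from
# estimated average field sizes; a record that overflows its minipage
# triggers a reorganization (boundaries move) when the page as a whole
# still has room, else a new page starts.
PAGE_BYTES = 1024

def proportional(need):
    """Minipage budgets proportional to `need`; each budget covers its
    `need` entry whenever sum(need) <= PAGE_BYTES."""
    total = sum(need)
    return [PAGE_BYTES * x // total for x in need]

def bulk_load(records, est_avg_sizes):
    pages, reorgs = 0, 0
    used = [0] * len(est_avg_sizes)
    budget = proportional(est_avg_sizes)
    for rec in records:
        sizes = [len(f) for f in rec]
        if any(u + s > b for u, s, b in zip(used, sizes, budget)):
            if sum(used) + sum(sizes) <= PAGE_BYTES:
                reorgs += 1          # record fits the page: move boundaries
                budget = proportional([u + s for u, s in zip(used, sizes)])
            else:                    # page is full: start a fresh one
                pages += 1
                used = [0] * len(sizes)
                budget = proportional(est_avg_sizes)
                if any(s > b for s, b in zip(sizes, budget)):
                    reorgs += 1
                    budget = proportional(sizes)
        used = [u + s for u, s in zip(used, sizes)]
    return pages + 1, reorgs

# 50 identical 44-byte records: 23 fit per 1 KB page, no reorgs needed.
pages, reorgs = bulk_load([(b"x" * 8, b"y" * 32, b"z" * 4)] * 50, [8, 32, 4])
print(pages, reorgs)
```

Reorganizations only pay off when field sizes drift from the estimates, which is why the deck reports that half of them admit just a single extra record.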
[Charts: elapsed execution time (sec) for TPC-H Q1, Q6, Q12, Q14 under NSM and PAX on the 100M, 200M, and 500M databases.]

PAX improves performance up to 42% even with I/O

Elapsed Execution Time
[Chart: PAX/NSM speedup on PII/NT for RS, Q1, Q6, Q12, and Q14 at 100 MB, 200 MB, and 500 MB.]

PAX improves performance up to 42% even with I/O
Speedup differs across DB sizes

Speedup
Advantages:
- High data cache performance
- Faster than NSM for DSS queries
- Orthogonal to other storage decisions
- Does not affect I/O performance

Current disadvantages:
- Complex free space management with variable-length attributes ⇒ complicates the update algorithm

PAX is beneficial for read-mostly workloads (e.g., DSS); update-intensive workloads are future work

PAX: Summary
Introduction
PART I: Where Does Time Go?
PART II: Partition Attributes Across
PART III: Towards DSS-Centric H/W
- Memory subsystem
- Branch prediction mechanism
- Processor pipeline
Conclusions

Outline
Architecture
- RISC or CISC instruction set

Microarchitecture
- Pipeline
- Speculation (out-of-order, multiple issue)
- Branch prediction

Memory subsystem
- Cache size, associativity
- Block size, subblocking
- Inclusion

Which design looks beneficial for DSS workloads?

Platform Differences
Used four machines:
- Sun UltraSparc: US-II and US-IIi, Solaris 2.6/2.7
- Intel P6: PII Xeon, Linux v2.2
- DEC Alpha: 21164A, OSF1 v.4.0

Architecture and processor microarchitecture:

Characteristic    Alpha 21164   PII Xeon   US-II     US-IIi
speed             532 MHz       400 MHz    300 MHz   296 MHz
introduced in     1996          1998       1997      1997
instruction set   RISC          CISC       RISC      RISC
out of order?     no            yes        no        no

Experimental Setup
Characteristic            Alpha 21164    PII Xeon      US-II        US-IIi
L1 D (size, assoc)        8KB, DM        16KB, 2-way   16KB, DM     16KB, DM
  block/subblock          32/32          32/32         32/16        32/16
L1 I (size, assoc)        8KB, DM        16KB, 4-way   16KB, 2-way  16KB, 2-way
  block/subblock          32/16          32/32         32/32        32/32
L2 (size, assoc)          96KB, 3-way    512KB, 4-way  512KB, DM    2 MB, DM
  block/subblock          64/32          32/32         64/64        64/64
L3 (size, assoc)          4 MB, DM       N/A           N/A          N/A
  block/subblock          64/64          N/A           N/A          N/A
inclusion by L2 (data)    yes            no            yes          yes
inclusion by L2 (instr.)  no             no            yes          yes
inclusion by L3           yes            N/A           N/A          N/A

Cache Hierarchies
Compiled Shore with gcc 2.95.2 (Alpha version not optimized)

Ran DSS workload:
- Range selections w/ variable parameters (RS)
- TPC-H Q1, Q6, Q12, Q14

Used the processors' counters:
- Sun: run-pic (by Glenn Ammons, modified)
- PII: PAPI (public-domain counter library)
- Alpha: DCPI (sampling software by Compaq)

Methodology
[Charts: (left) Alpha 21164 issue breakdown (pipeline dry, 0-, 1-, 2-issue) per query; (right) PII Xeon (NT) retire breakdown (0-3 instructions retired) per DBMS, as % of total clock cycles.]

Alpha issues at most 2 instructions/cycle (max = 4)
>60% of the time the Xeon retires 0 or 1 instruction (max = 3)
The current issue/retire width remains unexploited

Superscalar Processor Capability
[Charts: clock-cycles-per-record breakdown (D-stalls, I-stalls, branch mispredictions, other) for the UltraSparc-II, PII Xeon, and Alpha 21164 on RS, Q1, Q6, Q12, Q14.]

Memory + branch misprediction stalls = 35-60% of time
Data accesses: the major memory bottleneck (esp. Q12, Q14)

Clock-per-Record Breakdown
Measured branch behavior on the PII Xeon and Alpha 21164, split between RS/Q1/Q6 and Q12/Q14:
- Branch frequency: 7%-18%
- Branch misprediction rate: 1%-22%, depending on machine and query
- Branch penalty: 5 cycles (Alpha) vs. 15 cycles (PII Xeon)

Branch penalty = frequency * misprediction rate * penalty
Branch frequency is typically 20-25%
In-order processors => lower penalty
Low misprediction accuracy may break it (e.g., UltraSparc)

Conclusion: high-accuracy predictors

Branch Prediction
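The formula above makes the penalty-vs-accuracy tradeoff concrete; the penalties are the slide's, but the frequency and misprediction rates here are illustrative choices, not measured values:

```python
# Branch stall cycles per instruction =
#   branch frequency * misprediction rate * misprediction penalty.
def branch_cpi(freq, mispredict_rate, penalty):
    return freq * mispredict_rate * penalty

# In-order Alpha 21164: short pipeline, 5-cycle penalty, weaker predictor.
alpha = branch_cpi(freq=0.20, mispredict_rate=0.10, penalty=5)
# Out-of-order PII Xeon: 15-cycle penalty but a more accurate predictor.
xeon = branch_cpi(freq=0.20, mispredict_rate=0.035, penalty=15)
print(alpha, xeon)   # 0.1 vs 0.105
```

With these numbers a 3x-deeper misprediction penalty is almost fully offset by a ~3x-better prediction accuracy, which is the slide's point about high-accuracy predictors.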
[Chart: UltraSparc-II vs. UltraSparc-IIi on RS, normalized: elapsed time and L1D, L1I, L2D, L2I misses.]

Small caches should not maintain inclusion

Cache Inclusion
[Chart: PAX/NSM data miss rate improvement at the lowest cache level (PII Xeon L2, US-II L2, Alpha 21164 L3) for RS, Q1, Q6, Q12, Q14.]

Larger cache line = lower miss rates

Cache Block Size
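Why a larger line helps a sequential access pattern is simple arithmetic; the 4-byte value size below is an illustrative assumption, the block sizes are the ones the deck reports:

```python
# A scan over contiguous values misses once per block: one miss
# brings in a whole block's worth of useful values.
def seq_scan_miss_rate(value_bytes, block_bytes):
    values_per_block = block_bytes // value_bytes
    return 1 / values_per_block

print(seq_scan_miss_rate(4, 32))    # 32-byte blocks (PII Xeon L2) -> 0.125
print(seq_scan_miss_rate(4, 64))    # 64-byte blocks -> 0.0625
```

Doubling the block halves the miss rate for this pattern, as long as the extra bytes fetched are actually used, which PAX's attribute-contiguous minipages ensure.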
[Charts: L1 data cache misses per record (RS) vs. projectivity (1-15) and vs. selectivity (2%-100%) on the US-II, US-IIi, and PII Xeon.]

UltraSparc: direct-mapped, subblocking (32/16)
Xeon: 2-way, no subblocking (32/32)

Conclusion: high associativity, no sub-blocking

Sub-Blocking / Associativity
[Chart: PAX/NSM speedup on Unix (100MB database) for RS, Q1, Q6, Q12, Q14 on the PII Xeon, UltraSparc-II, and Alpha 21164.]

PAX improves all queries

PAX vs. NSM Across Platforms
Memory hierarchy:
- Non-blocking caches
- >64-byte blocks, no sub-blocking
- Generous-sized L1-I (128K) and L2 (> 2MB); a tiny, fast L1/L2 with a large, slow L3 won't add much
- High associativity (2-4 way)
- No inclusion (at least for instructions)

Processor pipeline:
- Issue width is fine; out-of-order execution overlaps stall time
- An execution engine that sustains >1 load/store instruction
- High-accuracy branch prediction

...provided that implementation cost remains stable.

Summary
Found trends in the behavior of commercial DBMSs, using an analytic framework to model execution time

Identified bottlenecks among hardware components
- Main memory access is the new DB bottleneck
- Major showstoppers: L1 instruction + L2 data

Proposed a new design for cache performance
- Increases spatial locality using novel data placement
- 70% less data-related memory access delays
- Significant improvement on sequential scans

Evaluated several hardware parameters; suggested a DSS-centric processor and memory design

Conclusions