Decibel presentation
-
Upload
albertrcarter -
Category
Engineering
-
view
180 -
download
0
Transcript of Decibel presentation
Decibel: The Relational Dataset Branching System
Michael Maddox (MIT CSAIL), David Goehring (MIT CSAIL), Aaron J. Elmore (University of
Chicago), Samuel Madden (MIT CSAIL), Aditya Parameswaran (University of Illinois), Amol
Deshpande (University of Maryland)
•Motivations:• Need for data management systems that natively support versioning or
branching of datasets to enable concurrent analysis, integration, manipulation, or curation of data across teams• Storage efficiency (reduce redundancy)• Lineage tracking of modifications• Versioned Queries (with runtime efficiency)
•Overview• git versioning semantics applied to a dataset, collection of tables• Append only data store
• Contributions• 3 storage models that implement dataset versioning• Versioned benchmark for evaluating performance of these storage
models
Basic Operations and Workflow• Branch• Commit/Checkout• Data Modification (CRUD)• Single Branch Scan• Multi-Branch Scan• Diff Branches• Merge Branches• Theses basic primitives can be used to implement a wide variety
of versioned queries.
Assumes a unique primary key for each record to track that record across versions.
Versioned Queries
Decibel Architecture
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5 ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
Branch A : File 01
Branch C : File 03Branch B : File 02 Branch E : File 05
Branch D : File 04
Branch point Added after branch
Version First
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5 ID |Ver A|Ver B|Ver C|Ver D|Ver E
Tuple First
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
Branch C : File 03 Branch E : File 05
Branch D : File 04
ID |Brc A|Brc B|Brc C|Brc D|Brc E
Segment Index
ID | Brc C|Brc D
Segment Index
File |Brc A|Brc B|Brc C|Brc D|Brc E
Branch-Segment Bitmap
134
5
ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5
Version A
Hybrid
Storage Model ComparisonsVersion First Tuple First Hybrid
Single Branch Scan
Multi Branch Scan
Branch Membership
Bitmap Storage Efficiency -
Versioned Benchmark: Branching Strategies
Deep
Flat
Scientific
Curation
Master 1
Master 2
Master 4
Master 3
Master 1
Active 1 Active 3
Active 2
Active 2
Master 1
Master 2
Master 4
Master 3Active 1
Active 3
Master 2
Master 4
Master 3
Dev 1
Fix 1
Fix 3
Master 1
Dev 2
Feature 1 Feature2
Dev 2
Feature 1
Fix 2
(a) (b) (c)
(d)
Versioned Benchmark: Data Generation and Loading
Clustered Interleaved• Vary number of branches and data set size• Stagger branch points• Mixed workload (e.g. 20% updates, 80% inserts)• Commits at regular intervals• Insert skew• Follow branching strategy for evolving version
graph• Loading Mode:• Insert/Update into branches to achieve
Physical layout, e.g. clustering level in clustered mode
• Or follow branching strategy in interleaved mode
• Load older branches
Single Version Scan on Flat (100 GB DB) Multi Version Scan on Deep (100 GB DB)
Hybrid storage model captures the best of both worlds and outperforms both extreme storage models in almost every experiment.
Commits• In Version First commits are trivially quick (just storing an offset), but
Tuple First and Hybrid require taking snapshots of columns in the relevant bitmaps.
Future Work• Extending Hybrid in several ways• Leveraging versioning to achieve serializable transactions in OCC style• Schema versioning• Materialization and query optimization• Current system is a row store with full copy on write, next steps would
be experimenting with a column store for more storage efficiency in a manner similar to git