Decibel presentation

14
Decibel: The Relational Dataset Branching System Michael Maddox (MIT CSAIL), David Goehring (MIT CSAIL), Aaron J. Elmore (University of Chicago), Samuel Madden (MIT CSAIL), Aditya Parameswaran (University of Illinois), Amol Deshpande (University of Maryland)

Transcript of Decibel presentation

Page 1: Decibel presentation

Decibel: The Relational Dataset Branching System

Michael Maddox (MIT CSAIL), David Goehring (MIT CSAIL), Aaron J. Elmore (University of

Chicago), Samuel Madden (MIT CSAIL), Aditya Parameswaran (University of Illinois), Amol

Deshpande (University of Maryland)

Page 2: Decibel presentation

•Motivations:• Need for data management systems that natively support versioning or

branching of datasets to enable concurrent analysis, integration, manipulation, or curation of data across teams• Storage efficiency (reduce redundancy)• Lineage tracking of modifications• Versioned Queries (with runtime efficiency)

•Overview• git versioning semantics applied to a dataset, collection of tables• Append only data store

• Contributions• 3 storage models that implement dataset versioning• Versioned benchmark for evaluating performance of these storage

models

Page 3: Decibel presentation

Basic Operations and Workflow• Branch• Commit/Checkout• Data Modification (CRUD)• Single Branch Scan• Multi-Branch Scan• Diff Branches• Merge Branches• Theses basic primitives can be used to implement a wide variety

of versioned queries.

Assumes a unique primary key for each record to track that record across versions.

Page 4: Decibel presentation

Versioned Queries

Page 5: Decibel presentation

Decibel Architecture

Page 6: Decibel presentation

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5 ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

Branch A : File 01

Branch C : File 03Branch B : File 02 Branch E : File 05

Branch D : File 04

Branch point Added after branch

Version First

Page 7: Decibel presentation

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5 ID |Ver A|Ver B|Ver C|Ver D|Ver E

Tuple First

Page 8: Decibel presentation

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

Branch C : File 03 Branch E : File 05

Branch D : File 04

ID |Brc A|Brc B|Brc C|Brc D|Brc E

Segment Index

ID | Brc C|Brc D

Segment Index

File |Brc A|Brc B|Brc C|Brc D|Brc E

Branch-Segment Bitmap

134

5

ID | Attr 1|Attr 2|Attr 3|Attr 4|Attr 5

Version A

Hybrid

Page 9: Decibel presentation

Storage Model ComparisonsVersion First Tuple First Hybrid

Single Branch Scan

Multi Branch Scan

Branch Membership

Bitmap Storage Efficiency -

Page 10: Decibel presentation

Versioned Benchmark: Branching Strategies

Deep

Flat

Scientific

Curation

Master 1

Master 2

Master 4

Master 3

Master 1

Active 1 Active 3

Active 2

Active 2

Master 1

Master 2

Master 4

Master 3Active 1

Active 3

Master 2

Master 4

Master 3

Dev 1

Fix 1

Fix 3

Master 1

Dev 2

Feature 1 Feature2

Dev 2

Feature 1

Fix 2

(a) (b) (c)

(d)

Page 11: Decibel presentation

Versioned Benchmark: Data Generation and Loading

Clustered Interleaved• Vary number of branches and data set size• Stagger branch points• Mixed workload (e.g. 20% updates, 80% inserts)• Commits at regular intervals• Insert skew• Follow branching strategy for evolving version

graph• Loading Mode:• Insert/Update into branches to achieve

Physical layout, e.g. clustering level in clustered mode

• Or follow branching strategy in interleaved mode

• Load older branches

Page 12: Decibel presentation

Single Version Scan on Flat (100 GB DB) Multi Version Scan on Deep (100 GB DB)

Hybrid storage model captures the best of both worlds and outperforms both extreme storage models in almost every experiment.

Page 13: Decibel presentation

Commits• In Version First commits are trivially quick (just storing an offset), but

Tuple First and Hybrid require taking snapshots of columns in the relevant bitmaps.

Page 14: Decibel presentation

Future Work• Extending Hybrid in several ways• Leveraging versioning to achieve serializable transactions in OCC style• Schema versioning• Materialization and query optimization• Current system is a row store with full copy on write, next steps would

be experimenting with a column store for more storage efficiency in a manner similar to git