7/19/18

Solar: Towards a Shared-Everything Database on Distributed Log-Structured Storage

Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, Huiqi Hu


Background

- Single-node in-memory DBMS
  - Pro: high transaction processing performance
  - Con: limited scalability
- Shared-everything DBMS
  - Pro: scalable storage and transactions via fast inter-node communication
  - Con: expensive network infrastructure
- Shared-nothing DBMS
  - Pro: scale out via horizontal partitioning
  - Con: poor performance with distributed transactions

[Figure: data items D1 to D4 and their placement across nodes, including a split into D1 D2 and D3 D4.]

Architecture

- Design considerations
  - General workloads with distributed transactions
  - Storage scalability
  - Commodity machines

Goal: a high-performance OLTP DBMS without assumptions on workloads or hardware

Architecture

- Overview
  - Several S-nodes (storage and snapshot reads)
  - A T-node (transaction validation/commit and delta reads)
  - Several P-units (business logic processing)

[Figure: three layers. Storage layer: S-nodes. Transaction layer: T-node. Processing layer: P-units.]

Architecture

- S-nodes
  - Distributed storage engine
  - Role: storing a consistent database snapshot (SSTable)
  - Feature: supporting scalable data storage

Tablet 1
id  price  quantity
1   1.0    10
2   2.0    20
3   3.0    30

Tablet 2
id  price  quantity
4   4.0    40
5   5.0    50
6   6.0    60

[Figure: one T-node, three S-nodes, three P-units; the tablets are stored on the S-nodes.]

Architecture

- T-node
  - In-memory transaction engine
  - Role: managing newly committed data since the last snapshot (Memtable)
  - Feature: serving high-performance transactional writes

[Figure: the T-node Memtable holds the deltas id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL. The S-nodes hold Tablet 1 (id, price, quantity): (1, 1.0, 10), (2, 2.0, 20), (3, 3.0, 30) and Tablet 2: (4, 4.0, 40), (5, 5.0, 50), (6, 6.0, 60).]

Architecture

- P-units
  - Distributed query processing engine
  - Role: SQL, stored procedures, query processing, remote data access
  - Feature: scalable computation power

[Figure: three P-units on top of the T-node (Memtable deltas id 1: quantity=5 -> NULL, id 3: quantity=15 -> NULL) and the S-nodes (Tablet 1 with ids 1 to 3, Tablet 2 with ids 4 to 6).]

LSM-Tree style storage

- SSTable
  - A consistent snapshot
  - Data partitioned into tablets (ranges over tables)
  - Tablets replicated over S-nodes
- Memtable
  - Newly committed data
  - Stored in memory on the T-node
  - Multi-version storage
  - Replicated to backup T-nodes

S-nodes: Tablet 1 (id, price, quantity): (1, 1.0, 10), (2, 2.0, 20), (3, 3.0, 30); Tablet 2: (4, 4.0, 40), (5, 5.0, 50), (6, 6.0, 60)
T-node: Memtable deltas id 1: quantity=5 -> NULL, id 3: quantity=15 -> NULL

Item Table (SSTable merged with Memtable)
id  price  quantity
1   1.0    5
2   2.0    20
3   3.0    15
4   4.0    40
5   5.0    50
6   6.0    60
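
The Item Table above is what a reader sees once the static snapshot and the deltas are combined. A minimal sketch of that merge, assuming plain dictionaries stand in for tablets and the Memtable (the names `tablet1`, `tablet2`, `memtable`, and `read_row` are illustrative, not Solar's API):

```python
# Static snapshot rows held by S-nodes (SSTable), split into two tablets.
tablet1 = {1: {"price": 1.0, "quantity": 10},
           2: {"price": 2.0, "quantity": 20},
           3: {"price": 3.0, "quantity": 30}}
tablet2 = {4: {"price": 4.0, "quantity": 40},
           5: {"price": 5.0, "quantity": 50},
           6: {"price": 6.0, "quantity": 60}}

# Deltas committed since the snapshot, held by the T-node (Memtable).
memtable = {1: {"quantity": 5}, 3: {"quantity": 15}}


def read_row(row_id):
    """Merge the static version with the delta version; the delta wins per column."""
    tablet = tablet1 if row_id in tablet1 else tablet2
    row = dict(tablet[row_id])            # copy, the snapshot stays untouched
    row.update(memtable.get(row_id, {}))  # overlay newer column values
    return row


item_table = {row_id: read_row(row_id) for row_id in range(1, 7)}
```

Rows 1 and 3 pick up the newer quantities from the Memtable, while the other rows come straight from the snapshot.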

Transaction Processing

- Start a transaction
- Read a record
- Process the SQL
- Write a record
- Commit

Example: UPDATE Item SET quantity = quantity - 15 WHERE id = 3;

[Figure: the P-unit sends start and commit to the T-node and exchanges request/response messages with the T-node and an S-node. S-node Tablet 1 (id, price, quantity): (1, 1.0, 10), (2, 2.0, 20), (3, 3.0, 30). T-node Memtable: id 1: quantity=5 -> NULL, id 3: quantity=15 -> NULL. P-unit buffer (id, quantity): (3, 15). Response label: Item(id=1, quantity=30) NULL.]

Data Compaction

- All data are first written into the T-node
- Data compaction moves committed data into S-nodes
  - Does not block ongoing and future transactions

[Figure: one T-node, three S-nodes, three P-units.]

Concurrency Control

- Use MVOCC to support Snapshot Isolation (SI)
  - Prevent the lost update anomaly
- Data structures on the T-node
  - A timestamp counter (MVCC)
  - Row-level latch (OCC)
- Snapshot acquisition
- Transaction validation

[Figure: Memtable versions key=1 (wts=4, col1=5), key=2 (wts=3, col2=5), and an older version (wts=2, col1=2). Counter: 5. Transaction tx has read-timestamp rts = 5 and issues Write(key=1, col1=3).]

Recovery

- Write-ahead log
  - Generate redo log entries
  - Group commit
- T-node recovery
  - Replay redo log entries
  - The replay position is determined by the last finished data compaction
- S-nodes do not lose data
- P-units do not store data

[Figure: the T-node holds the Memtable and the redo log with its replay position; the S-nodes hold the SSTable as Tablet 1, Tablet 2, and Tablet 3.]

Data Compaction

- Data compaction (DC) starts when the T-node runs out of memory
  - A new Memtable m1 accepts transactions after DC initiation
  - Memtable m0 is frozen and merged into the SSTable

[Figure: timeline with the events initiate, get t_dc, start, and end. Memtable m0 is followed by Memtable m1; SSTable s0 is followed by SSTable s1.]
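
The freeze-and-merge idea can be sketched as follows. This is a simplified, single-threaded illustration under assumed names (`TNode`, `initiate_compaction`, `finish_compaction` are hypothetical); it ignores timestamps, validation, and replication:

```python
class TNode:
    """Toy T-node: one active Memtable plus an optional frozen one."""

    def __init__(self):
        self.active = {}     # m0: receives writes before compaction
        self.frozen = None   # set only while a compaction is in progress

    def write(self, row_id, column, value):
        self.active.setdefault(row_id, {})[column] = value

    def initiate_compaction(self):
        # Freeze m0 and open an empty m1 so that writers are not blocked.
        self.frozen, self.active = self.active, {}

    def finish_compaction(self, sstable):
        # Build s1 by merging s0 with the frozen m0, then release m0.
        merged = {row_id: dict(row) for row_id, row in sstable.items()}
        for row_id, delta in self.frozen.items():
            merged.setdefault(row_id, {}).update(delta)
        self.frozen = None
        return merged


s0 = {1: {"quantity": 10}, 3: {"quantity": 30}}
t_node = TNode()
t_node.write(1, "quantity", 5)      # goes to m0
t_node.initiate_compaction()
t_node.write(3, "quantity", 15)     # goes to m1 while m0 is being merged
s1 = t_node.finish_compaction(s0)
```

The write issued after initiation lands in the new Memtable and is not part of `s1`; it will be folded in by the next compaction.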

Transaction and CC during Data Compaction

- Goal: minimize blocking of transaction processing

[Figure: timeline with the events initiate, get t_dc, start, and end; Memtable m0 then m1; SSTable s0 then s1. Transactions t_i, t_j, and t_k are drawn along the timeline with "rw/validate" and "ro/validate" phases and a "wait for" marker.]

Remote Data Access Optimization

- Shared-everything architecture
  - Latency is bounded by remote data access between
    - P-unit and T-node
    - P-unit and S-node
  - Reducing remote data access cost
    - => more concurrent transactions
    - => higher throughput

Local SSTable Cache

- Build an SSTable cache on the P-unit
  - The SSTable is immutable
  - The P-unit examines its local cache before communicating with S-nodes

[Figure: threads 1 to n on a P-unit share an LRU row cache that sits in front of the S-nodes.]
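
A minimal sketch of an LRU row cache in front of the S-nodes, assuming a callback fetches rows remotely on a miss. `LRURowCache` and `fetch_from_s_node` are illustrative names, and the sketch leaves out thread safety and invalidation after a compaction produces a new SSTable:

```python
from collections import OrderedDict


class LRURowCache:
    """Row cache that evicts the least recently used row when full."""

    def __init__(self, capacity, fetch_from_s_node):
        self.capacity = capacity
        self.fetch = fetch_from_s_node
        self.rows = OrderedDict()   # key -> row, least recently used first
        self.remote_reads = 0

    def get(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)     # mark as most recently used
            return self.rows[key]
        self.remote_reads += 1             # cache miss: ask the S-node
        row = self.fetch(key)
        self.rows[key] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # drop the least recently used row
        return row


sstable = {1: {"price": 1.0}, 2: {"price": 2.0}, 3: {"price": 3.0}}
cache = LRURowCache(capacity=2, fetch_from_s_node=sstable.__getitem__)
for row_id in (1, 2, 1, 3):   # the second read of row 1 is served locally
    cache.get(row_id)
```

Because the SSTable is immutable, a cached row never becomes stale within one snapshot, which is what makes this kind of cache safe without a coherence protocol.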

Asynchronous Bit Array

- Empty reads on the T-node
  - The T-node stores a small part of the data
  - Reading non-existing data items results in useless empty reads

[Figure: Read(id=1, price=?) is sent to both an S-node and the T-node. The S-node, holding Tablet 1 (ids 1 to 3) and Tablet 2 (ids 4 to 6), returns price=1.0. The T-node, holding only id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL, returns NULL: an empty read.]

Asynchronous Bit Array

- Asynchronous bit array
  - Encodes whether any row in Tablet x has its column y modified
  - Periodically synchronized to P-units
    - False positive => empty read (corrected after the first access)
    - False negative => validating empty reads and retry

Any data in the T-node?
          price  quantity
Tablet 1  0      1
Tablet 2  0      0

[Figure: the S-nodes hold Tablet 1 (ids 1 to 3) and Tablet 2 (ids 4 to 6); the T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL.]
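
The bit array in the table can be sketched as a small lookup keyed by (tablet, column). The helper names below (`tablet_of`, `build_bit_array`, `needs_t_node_read`) are assumptions for illustration, and the periodic synchronization is reduced to a single copy:

```python
TABLETS = {"Tablet 1": range(1, 4), "Tablet 2": range(4, 7)}
COLUMNS = ["price", "quantity"]


def tablet_of(row_id):
    for name, ids in TABLETS.items():
        if row_id in ids:
            return name
    raise KeyError(row_id)


def build_bit_array(memtable):
    """Set bit (tablet, column) if any row of that tablet has the column modified."""
    bits = {(tablet, column): 0 for tablet in TABLETS for column in COLUMNS}
    for row_id, delta in memtable.items():
        for column in delta:
            bits[(tablet_of(row_id), column)] = 1
    return bits


def needs_t_node_read(bits, row_id, column):
    """A P-unit consults its local copy before contacting the T-node."""
    return bits[(tablet_of(row_id), column)] == 1


memtable = {1: {"quantity": 5}, 3: {"quantity": 15}}
local_copy = build_bit_array(memtable)   # stands in for a synchronized copy
```

With this copy, a read of `price` for id 1 skips the T-node, while a read of `quantity` for id 2 still goes there and comes back empty, because the bit is kept per tablet rather than per row.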

Transaction Compilation

- Model a transaction as a directed acyclic graph
- Move reads to the start if possible

[Figure: the transaction as a graph and as an operation sequence between Start and Commit. Operations: MemtableRead(Item, r1); SSTableRead(Item, r1); v_price = r1.price; MemtableRead(Cust, r2); SSTableRead(Cust, r2); v_balance = r2.balance; v_balance -= v_price; BufferWrite(Cust, r2, balance = v_balance).]

Transaction Compilation

- Group T-node access

[Figure: the same operations as on the previous slide, between Start and Commit. In the rewritten sequence, MemtableRead(Item, r1) and MemtableRead(Cust, r2) are issued together as one group.]

Transaction Compilation

- Pre-execute S-node access

[Figure: the same operations as on the previous slide, between Start and Commit. MemtableRead(Item, r1) and MemtableRead(Cust, r2) remain grouped, and SSTableRead(Item, r1) and SSTableRead(Cust, r2) are pre-executed.]

Experiment

- Setting
  - CPU: 2.4 GHz, 16 cores
  - Memory: 64 GB
  - 10 servers
  - Connected by a 1 Gigabit network
- Benchmark: TPC-C
- Systems
  - Workload: TPC-C
  - MySQL Cluster
  - VoltDB
  - Tell

[Figures: Scalability; Cross-Partition Transactions.]

Experiment

- Data compaction
- Remote data access optimization
- System recovery

Summary

- Solar
  - A shared-everything OLTP DBMS on commodity hardware
    - High-performance transaction processing
    - Scalable data storage capacity
  - Several novel optimizations to improve performance
  - Empirical evaluation shows high performance and scalability

Transaction Compilation

- Group T-node access
  - Normal execution issues T-node accesses one by one
  - Try to batch multiple T-node communications together

[Figure: two message sequences among S-nodes, P-unit, and T-node, each with the steps start, read(r1.delta), read(r1.static), read(r2.delta), read(r2.static), local process, and commit. In the first, read(r1.delta) and read(r2.delta) reach the T-node separately. In the second, start, read(r1.delta), and read(r2.delta) are sent together.]
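
The benefit of grouping can be illustrated by counting round trips. This sketch uses a stub in place of the T-node; `TNodeStub`, `read_batch`, and the row values are hypothetical and only serve to show the difference between one-by-one and batched access:

```python
class TNodeStub:
    """Stand-in for the T-node that counts network round trips."""

    def __init__(self, memtable):
        self.memtable = memtable
        self.round_trips = 0

    def read(self, key):
        self.round_trips += 1          # one message per record
        return self.memtable.get(key)

    def read_batch(self, keys):
        self.round_trips += 1          # one message for the whole group
        return [self.memtable.get(key) for key in keys]


memtable = {("Item", "r1"): {"price": 2.0}, ("Cust", "r2"): {"balance": 10.0}}
keys = [("Item", "r1"), ("Cust", "r2")]

one_by_one = TNodeStub(memtable)
separate = [one_by_one.read(key) for key in keys]

grouped = TNodeStub(memtable)
batched = grouped.read_batch(keys)
```

Both variants return the same deltas, but the grouped variant pays for a single round trip, which matters when each trip crosses a commodity network.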

Transaction Compilation

- Pre-execute S-node access
  - Normal execution issues S-node accesses after the transaction is started
  - Try to pre-execute S-node reads

[Figure: two message sequences among S-nodes, P-unit, and T-node, each with the steps start plus read(r1.delta) and read(r2.delta) in one T-node message, read(r1.static), read(r2.static), local process, and commit. The sequences differ in when the S-node reads are issued relative to start.]

Architecture

- Design considerations
  - A shared-everything architecture
    - 2-layer LSM-Tree style storage
    - Decouple computation from storage
  - High-performance in-memory transaction processing
    - MVOCC, combining OCC and MVCC
    - A non-blocking data compaction algorithm
  - Fine-grained remote data access
    - Data cache
    - Asynchronous bit array
    - Transaction compilation

Goal: a high-performance OLTP DBMS without assuming a partitionable workload or advanced hardware

Architecture

- Overview

[Figure: one T-node, three S-nodes, three P-units.]

Architecture

- Log-structured write

[Figure: one T-node, three S-nodes, three P-units; three arrows labeled "Write Request".]

Architecture

- Log-structured write

[Figure: one T-node, three S-nodes, three P-units; an arrow labeled "Data Compaction".]

Architecture

- S-nodes
  - Distributed storage engine
  - Role: storing a consistent database snapshot (SSTable)
  - Feature: supporting scalable data storage

[Figure: three S-nodes, each holding several tablets on disk-based storage.]

Architecture

- T-node
  - In-memory transaction engine
  - Role: managing the recently committed data (Memtable)
  - Feature: providing high-performance transactional writes

[Figure: the T-node and two T-node replicas use in-memory storage holding Indexes, TxnLog, and BitArray (bits 1 0 0 0 1 / 0 0 1 1 0 ...). Compaction moves data to the S-nodes, which hold tablets on disk-based storage.]

Architecture

- P-units
  - Distributed query processing engine
  - Role: SQL, stored procedures, query processing, remote data access
  - Feature: providing scalable computation power

[Figure: three P-nodes, each containing Application Logic, Compiler, Executor, Storage Interface, Bit Array, and Cache. Arrows labeled "records" and "mutations" connect them to the storage side. Below them, the T-node and its two replicas (in-memory storage: Indexes, TxnLog, BitArray) compact data into the S-nodes (tablets on disk-based storage).]

LSM-Tree style storage

- SSTable
  - A consistent snapshot
  - Partitioned into tablets
  - Replicated over S-nodes
- Memtable
  - Newly committed data
  - Stored in memory on the T-node
  - Multi-version storage
  - Replicated to backup T-nodes

[Figure: S-nodes hold Tablet 1 (id, price, quantity): (1, 1.0, 10), (2, 2.0, 20), (3, 3.0, 30) and Tablet 2: (4, 4.0, 40), (5, 5.0, 50), (6, 6.0, 60). The T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL. Merged Item Table: (1, 1.0, 5), (2, 2.0, 20), (3, 3.0, 15), (4, 4.0, 40), (5, 5.0, 50), (6, 6.0, 60).]

Reads & Writes

- Read
  - Read and merge versions from both the T-node and one of the S-nodes
- Write
  - Write directly into the T-node

[Figure: for a read, the P-unit receives the static version from an S-node and the delta version from the T-node. For a write, the P-unit sends the delta version to the T-node.]

Transaction Processing

- A P-unit executes transactions
  - Start a transaction
  - Fetch records from remote nodes
  - Execute user-defined logic
  - Buffer data writes
  - Commit the transaction

[Figure: message sequence among S-nodes, P-unit, and T-node: start, read(r.delta), read(r.static), merge(r), r.new = expr(r), buffer(r.new), process other SQLs, commit.]

Background

- Single-node in-memory DBMS
  - Hekaton, HyPer
  - Features
    - No disk I/O during transaction processing (in-memory storage)
    - Transaction compilation
    - Lightweight concurrency control (OCC, MVCC, determinism)
    - Simplified write-ahead logging
    - Very high-performance transaction processing
  - Limitations
    - Database size should be smaller than memory capacity

Background

- Shared-nothing DBMS
  - VoltDB/H-Store, Spanner
  - Features
    - Use horizontal partitioning
    - Rely on two-phase commit
    - Scalable transaction processing and storage
  - Limitations
    - Partitionable workload
    - Low percentage of distributed transactions

Background

- Shared-everything DBMS
  - Oracle RAC, Tell
  - Features
    - Share data/cache among nodes
    - Rely on fast inter-node communication
    - Scalable transaction processing and storage
  - Limitations
    - Require advanced network infrastructure
    - An InfiniBand switch with 43 TB/s and 216 ports costs about $60,000

Transaction Compilation

- Many remote data accesses between start and commit
- Group reads to reduce read latency

Example statements:
v_price = Read(Item, id=1, price);
v_balance = Read(Cust, id=5, balance);

[Figure: operation graph between Start and Commit: MemtableRead(Item, r1); SSTableRead(Item, r1); v_price = r1.price; MemtableRead(Cust, r2); SSTableRead(Cust, r2); v_balance = r2.balance; v_balance -= v_price; BufferWrite(Cust, r2, balance = v_balance).]

Transaction Compilation

- Reordering ops without data dependencies does not change semantics

[Figure: two blocks shown in both orders. "Read Item Price": MemtableRead(Item, r1); SSTableRead(Item, r1); v_price = r1.price. "Read Customer Balance": MemtableRead(Cust, r2); SSTableRead(Cust, r2); v_balance = r2.balance.]

Transaction Compilation

- Only ops with data dependencies cannot be reordered
  - They use the same variable, and one of them is a write (identified by variable name)
  - They use the same record, and one of them is a write (identified by table name)

[Figure: four blocks. "Read Item Price": MemtableRead(Item, r1); SSTableRead(Item, r1); v_price = r1.price. "Read Customer Balance": MemtableRead(Cust, r2); SSTableRead(Cust, r2); v_balance = r2.balance. "Calculate the New Balance": v_balance -= v_price. "Update Customer Balance": BufferWrite(Cust, r2, balance = v_balance).]
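
The two dependency rules can be sketched as a check over read and write sets, where variables are identified by name and records by table name. The `op` and `depends` helpers and the way each block is summarized into sets are assumptions made for illustration:

```python
def op(name, reads, writes):
    """Describe an operation block by the names it reads and writes."""
    return {"name": name, "reads": set(reads), "writes": set(writes)}


def depends(a, b):
    """True if a and b share a variable or table and at least one of them writes it."""
    return bool(a["writes"] & (b["reads"] | b["writes"])
                or a["reads"] & b["writes"])


read_item = op("Read Item Price", reads={"Item"}, writes={"v_price"})
read_cust = op("Read Customer Balance", reads={"Cust"}, writes={"v_balance"})
calc = op("Calculate the New Balance",
          reads={"v_price", "v_balance"}, writes={"v_balance"})
update = op("Update Customer Balance", reads={"v_balance"}, writes={"Cust"})

can_swap_reads = not depends(read_item, read_cust)
```

Under this model the two read blocks are independent and may be reordered or grouped, while the balance calculation must stay after both reads and the buffered write must stay after the calculation.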

Data Access During Compaction

- Memtable read: always read the new Memtable m1
- SSTable read
  - Merged data (Tablet 1): read from s1
  - Merging data (Tablet 2): read from s0 and the frozen Memtable m0

[Figure: frozen Memtable (m0) and active Memtable (m1). Tablet 1 (s0) has been merged into Tablet 1' (s1); Tablet 2 (s0) is still merging into Tablet 2' (s1). Arrows: read s1, read m1.]
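
These read rules can be sketched as a small routing function. The function name, its parameters, and the sample values for Tablet 2 and m1 are hypothetical; the point is only which sources are consulted for a merged tablet versus a tablet that is still merging:

```python
def read_during_compaction(row_id, tablet, merged_tablets, s0, s1, m0, m1):
    """Pick the static source per tablet, then overlay the active Memtable."""
    if tablet in merged_tablets:
        row = dict(s1[tablet][row_id])    # s1 already contains m0's deltas
    else:
        row = dict(s0[tablet][row_id])    # old snapshot ...
        row.update(m0.get(row_id, {}))    # ... plus the frozen Memtable
    row.update(m1.get(row_id, {}))        # newest deltas always come from m1
    return row


s0 = {"Tablet 1": {1: {"price": 1.0, "quantity": 10}},
      "Tablet 2": {4: {"price": 4.0, "quantity": 40}}}
m0 = {1: {"quantity": 5}, 4: {"quantity": 35}}      # frozen
s1 = {"Tablet 1": {1: {"price": 1.0, "quantity": 5}}}  # Tablet 1 is merged
m1 = {4: {"price": 4.5}}                            # active
merged = {"Tablet 1"}

row1 = read_during_compaction(1, "Tablet 1", merged, s0, s1, m0, m1)
row4 = read_during_compaction(4, "Tablet 2", merged, s0, s1, m0, m1)
```

Either path yields the same logical row as a read against the fully compacted state, so readers do not have to wait for the compaction to finish.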

Snapshot Isolation During Data Compaction

- Classify transactions into three types:
  - Type 1: starts validation before the compaction is initiated
    - validate on m0, write on m0
  - Type 2: starts validation after the compaction is initiated
    - validate on m0 and m1, write on m1
  - Type 3: starts processing after the compaction is started
    - validate on m1, write on m1

[Figure: timeline with the events initiate, start, and end; Memtable m0 then m1. Markers: t1 starts validation, t2 starts validation, t3 starts processing.]
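
One way to read the three types is as a function of when a transaction starts processing and validation relative to the compaction's initiate and start events. The sketch below encodes that reading with made-up integer timestamps; `classify` and its parameters are illustrative names, and the boundary conditions are an assumption:

```python
def classify(processing_start, validation_start, initiate, start):
    """Return which Memtables a transaction validates against and writes to."""
    if validation_start < initiate:
        return {"type": 1, "validate_on": ["m0"], "write_on": "m0"}
    if processing_start >= start:
        return {"type": 3, "validate_on": ["m1"], "write_on": "m1"}
    # Began before the compaction started but validates after initiation.
    return {"type": 2, "validate_on": ["m0", "m1"], "write_on": "m1"}


INITIATE, START = 10, 20
t1 = classify(processing_start=1, validation_start=5,
              initiate=INITIATE, start=START)
t2 = classify(processing_start=8, validation_start=15,
              initiate=INITIATE, start=START)
t3 = classify(processing_start=25, validation_start=30,
              initiate=INITIATE, start=START)
```

Only Type 2 transactions need to look at both Memtables, since their snapshot may predate writes that ended up in either one.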

Recovery during Data Compaction (DC)

- Compaction start log entry (CSLE)
  - Persisted when the DC starts
  - Acts as a border between redo log entries
- Compaction end log entry (CELE)
  - Persisted when the DC ends
  - Saves the position of the CSLE of the DC
- Recovery procedure
  - Read the CELE to find the position of the CSLE
  - Replay the redo log from the CSLE
  - At first, replay data into m0
  - Once a CSLE is encountered, replay data into m1
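
The recovery procedure can be sketched over a redo log modeled as a Python list. The tuple layout (`"WRITE"`, `"CSLE"`, `"CELE"` with the CSLE position) and the `recover` function are assumptions for illustration, not Solar's log format:

```python
def recover(log):
    """Rebuild the Memtables m0 and m1 from a redo log after a crash."""
    start = 0
    for entry in log:
        if entry[0] == "CELE":
            start = entry[1] + 1      # resume right after that DC's CSLE

    m0, m1 = {}, {}
    target = m0                       # at first, replay into m0
    for entry in log[start:]:
        if entry[0] == "CSLE":
            target = m1               # an unfinished DC had frozen m0
        elif entry[0] == "WRITE":
            _, row_id, column, value = entry
            target.setdefault(row_id, {})[column] = value
    return m0, m1


log = [
    ("WRITE", 1, "quantity", 5),   # position 0, already merged into the SSTable
    ("CSLE",),                     # position 1, first DC starts
    ("WRITE", 3, "quantity", 15),  # position 2
    ("CELE", 1),                   # position 3, first DC ends, points at its CSLE
    ("WRITE", 2, "quantity", 18),  # position 4
    ("CSLE",),                     # position 5, second DC starts, crash before it ends
    ("WRITE", 4, "quantity", 39),  # position 6
]
m0, m1 = recover(log)
```

Entries before the last finished compaction's CSLE are skipped because their effects already live in the SSTable on the S-nodes.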

Asynchronous Bit Array

- Synchronization & usage
  - Periodically synchronized to P-units
  - A P-unit checks its local copy to filter useless T-node accesses

Any data in the T-node?
          price  quantity
Tablet 1  0      1
Tablet 2  0      0

[Figure: the T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL and synchronizes the bit array to the P-unit, which keeps a copy. Example request at the P-unit: Read(id=1, price=?).]

Asynchronous Bit Array

- Synchronization & usage
  - Periodically synchronized to P-units
  - A P-unit checks its local copy to filter useless T-node accesses

Any data in the T-node?
          price  quantity
Tablet 1  0      1
Tablet 2  0      0

[Figure: the T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL; the P-unit keeps a copy of the bit array. Example request at the P-unit: Read(id=3, quantity=?).]

Asynchronous Bit Array

- False positive
  - (row_i, col_j) does not exist on the T-node, but the bit array says yes
    - An empty read
  - Reason: the bit array is maintained at tablet granularity

Any data in the T-node?
          price  quantity
Tablet 1  0      1
Tablet 2  0      0

[Figure: the T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL; the P-unit keeps a copy of the bit array. Read(id=2, quantity=?) reaches the T-node and returns NULL. Tablet 1 (id, quantity): (1, 10), (2, 20), (3, 30).]

Asynchronous Bit Array

- False negative
  - A bit array copy may fall behind the latest version
  - (row_i, col_j) exists on the T-node, but the bit array says no
  - The transaction re-checks all potential empty reads in the validation phase

Any data in the T-node? (P-unit copy)
          price  quantity
Tablet 1  0      1
Tablet 2  0      0

[Figure: the T-node holds id 1: quantity=5 -> NULL and id 3: quantity=15 -> NULL, plus a new entry id 5: quantity=25 -> NULL whose bit is 1 on the T-node. The P-unit still holds the older copy of the bit array. Request shown: Read(id=2, quantity=?). Outcome: Validation, then Aborted & Retry.]

Data Compaction

- Initiate
  - Create a new Memtable
  - Freeze the current Memtable
  - Handle ongoing transactions
    - Case 1: validation starts before the compaction is initiated
      - t_i and t_j are allowed to write data into m0
    - Case 2: validation starts after the compaction is initiated
      - t_k will write data into m1 after the data compaction is started

[Figure: timeline with the event initiate; Memtable m0 then m1. Markers: t_i starts validation, t_j starts validation, t_k starts validation.]

Data Compaction

- Start
  - Get the compaction timestamp t_dc after t_i and t_j abort or obtain a commit timestamp
    - t_k starts validation only after t_dc is obtained
  - Start data compaction after t_i and t_j finish abort/commit
    - Create a new SSTable by merging the old one and the frozen Memtable

[Figure: timeline with the events initiate, obtain t_dc, and start; Memtable m0 then m1; SSTable s0 then s1. Markers: t_i starts validation; t_j starts validation; t_i, t_j abort or obtain commit timestamp; t_k starts validation; t_i, t_j finish abort/commit.]

Data Compaction

- End
  - Wait until s1 is fully created
  - Release the old Memtable and SSTable

[Figure: timeline with the events initiate, start, and end; Memtable m0 then m1; SSTable s0 then s1.]

Concurrency Control

- Data structures on the T-node
  - A timestamp counter (MVCC)
  - Row-level latch (OCC)
- Start
  - Acquire the read timestamp rts
- Process
  - Read the latest version specified by rts

[Figure: Memtable versions key=1 (wts=4, col1=5), key=2 (wts=3, col2=5), and an older version (wts=2, col1=2). Counter: 5. Transaction tx has read-timestamp rts = 5.]

Concurrency Control

- Commit
  - Acquire latches for records in the write set
  - Verify there is no newer version
  - Acquire the write timestamp wts
  - Write and release latches

[Figure: before commit: versions key=1 (wts=4, col1=5), key=2 (wts=3, col2=5), and an older version (wts=2, col1=2); Counter: 5; transaction tx with read-timestamp rts = 5. After commit: Counter: 6; write-timestamp wts = 6; new versions key=1 (wts=6, col1=2) and key=2 (wts=6, col2=2) placed ahead of the older versions (wts=3, col1=5), (wts=4, col1=5), and (wts=2, col1=2).]
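
The start, process, and commit steps can be sketched as a toy, single-threaded model. The `TNode` class and its method names are hypothetical, latches are modeled as a plain set, and the sketch assumes rts is the current counter value and wts is the counter after incrementing, which matches the counter going from 5 to 6 in the figure:

```python
class TNode:
    """Toy MVOCC store: per-key version lists plus a timestamp counter."""

    def __init__(self):
        self.counter = 0
        self.versions = {}    # key -> list of (wts, value), newest last
        self.latched = set()  # stand-in for row-level latches

    def start(self):
        return self.counter   # read timestamp rts

    def read(self, key, rts):
        for wts, value in reversed(self.versions.get(key, [])):
            if wts <= rts:    # latest version visible at rts
                return value
        return None

    def commit(self, rts, write_set):
        keys = sorted(write_set)
        self.latched.update(keys)                  # acquire latches
        try:
            for key in keys:
                chain = self.versions.get(key, [])
                if chain and chain[-1][0] > rts:   # newer version exists
                    return None                    # abort
            self.counter += 1
            wts = self.counter                     # write timestamp
            for key in keys:
                self.versions.setdefault(key, []).append((wts, write_set[key]))
            return wts
        finally:
            self.latched.difference_update(keys)   # release latches


node = TNode()
node.commit(node.start(), {"k": 10})   # initial value, wts = 1
rts_a = node.start()
rts_b = node.start()                   # both transactions see k = 10
wts_a = node.commit(rts_a, {"k": node.read("k", rts_a) + 1})
wts_b = node.commit(rts_b, {"k": node.read("k", rts_b) + 1})
```

The second commit fails validation because a version newer than its snapshot exists, which is how the lost update anomaly is prevented; a real system would retry the aborted transaction.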