Post on 15-Mar-2020
7/19/18
Solar: Towards a Shared-Everything Database on Distributed Log-Structured Storage
Tao Zhu, Zhuoyue Zhao, Feifei Li, Weining Qian, Aoying Zhou, Dong Xie, Ryan Stutsman, Haining Li, Huiqi Hu
1
Background
2
¨ Single-NodeIn-MemoryDBMSü Highxact processingperformanceû Limitedscalability
¨ Shared-everythingDBMSü Scalablestorageandxact viafastinter-node
communicationû Expensivenetworkinfrastructure
¨ Shared-nothingDBMSü Scaleoutviahorizontalpartitioningû Poorperformancew/distributedxact
Data
D1 D2 D3 D4
D1 D2
D3 D4
Architecture
¨ Designconsiderations¤ Generalworkloadsw/distributedtransactions¤ Storagescalability¤ Commoditymachines
Goal:highperformanceOLTPDBMSw/oassumptiononworkloads orhardware
3
Architecture
¨ Overview¤ SeveralS-nodes (storage&snapshotread)¤ AT-node (transactionvalidation/commit&deltaread)¤ SeveralP-units (businesslogicprocessing)
StorageLayer TransactionLayer
ProcessingLayer
T-node
S-nodeS-node S-node
P-unit P-unit P-unit
4
Architecture
¨ S-nodes¤ Distributedstorageengine¤ Role:storingaconsistentdatabasesnapshot(SSTable)¤ Feature:supportingscalabledatastorage
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
T-node
S-nodeS-node S-node
P-unit P-unit P-unit
5
Architecture
¨ T-node¤ In-memorytransactionengine¤ Role:managingnewlycommitteddatasincethelastsnapshot(Memtable)¤ Feature: servicingperformanttransactionwrites
1 quantity=5 NULL
3 quantity=15 NULLT-node
S-nodeS-node S-node
P-unit P-unit P-unit
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
6
Architecture
¨ P-units¤ Distributedqueryprocessingengine¤ Role:SQL,storedprocedure,queryprocessing,remotedataaccess¤ Feature:scalablecomputationpower
T-node
S-nodeS-node S-node
P-unit P-unit P-unit
1 quantity=5 NULL
3 quantity=15 NULL
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
7
LSM-Treestylestorage
¨ SSTable¤ Aconsistentsnapshot¤ Datapartitionedintotablets(rangesovertables)
¤ TabletsreplicatedoverS-nodes
¨ Memtable¤ Newlycommitteddata¤ StoredinmemoryonT-node¤Multi-versionstorage¤ ReplicatedtobackupT-nodes
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
1 quantity=5 NULL
3 quantity=15 NULLS-node
Item Tableid price quantity1 1.0 52 2.0 203 3.0 154 4.0 405 5.0 506 6.0 60
S-node
T-node
8
TransactionProcessing
¨ Startatransaction¨ Readarecord¨ ProcesstheSQL¨ Writearecord¨ Commit
P-unit
T-nodeS-node
UPDATEItemSETquantity=quantity- 15WHEREid=3;
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
1 quantity=5 NULL
3 quantity=15 NULL
Bufferid quantity3 15
Item(id=1,quantity=30)NULL
request response
startcommit
9
DataCompaction
T-node
S-nodeS-node S-node
P-unit P-unit P-unit
¨ AlldataarefirstlywrittenintotheT-node¨ DatacompactionmovescommitteddataintoS-nodes
¤ Doesnotblockon-goingandfuturetransactions
10
ConcurrencyControl
¨ UseMVOCCtosupportSnapshotIsolation(SI)¤ Preventlostupdateanomaly
¨ DatastructuresontheT-node¤ Atimestampcounter(MVCC)¤ Row-levellatch(OCC)
¨ SnapshotAcquisition¨ Transaction Validation
11
key=1 wts=4col1=5
key=2 wts=3col2=5
wts=2col1=2
Counter:5 Txntxread-timestamp:𝑟𝑡𝑠 = 5
Write(key=1,col1 =3)
Recovery
¨ Writeaheadlog¤ Generateredologentries¤ Groupcommit
¨ T-noderecovery¤ Replayredologentries¤ Thereplaypositionisdeterminedbythelastfinisheddatacompaction
¨ S-nodesdonotlosedata¨ P-unitsdonotstoredata
Memtable
S-nodes
T-nodeSSTable Tablet1
Tablet2 Tablet3
Redolog replayposition
12
DataCompaction
¨ Datacompaction(DC)startswhentheT-noderunsoutofmemory¤ NewMemtable𝑚' toaccepttransactionsafterDCinitiation¤Memtable𝑚( isfrozenandmergedintoSSTable
13
timeline
initiate
Memtable𝑚( Memtable𝑚'
SSTable𝑠'SSTable𝑠(
start endget𝑡)*
TransactionandCCduringDataCompaction
14
¨ Goal:minimizeblockingoftransactionprocessing
timeline
initiate
Memtable𝑚( Memtable𝑚'
SSTable𝑠'SSTable𝑠(
start end
𝑡+
get𝑡)*
rw/validate
𝑡,
ro/validate rw/validate
𝑡-
Waitfor
rw/validate
RemoteDataAccessOptimization
¨ Shared-Everythingarchitecture¤ Latencyboundedbyremotedataaccessbetween
n P-unitandT-noden P-unitandS-node
¤ Reducingremotedataaccesscostn =>moreconcurrenttransactionsn =>higherthroughput
15
LocalSSTable Cache
¨ BuildSSTableCacheonP-unit¤ SSTableisimmutable¤ P-unitexaminesitslocalcachebeforecommunicatingwithS-nodes
P-unit
Thread1 Thread2 Threadn…
LRURowCache
S-nodes
16
AsynchronousBitArray
¨ EmptyreadsontheT-node¤ TheT-nodestoresasmallpartofdata¤ Readingnon-existingdataitemsresultsinuselessemptyreads
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
1 quantity=5 NULL
3 quantity=15 NULL
Read(id=1,price=?)
S-node
S-node
T-node
NULL
price=1.0price=1.0
Emptyread
17
Any dataintheT-nodeprice quantity
Tablet1Tablet2
AsynchronousBitArray
¨ AsynchronousBitArray¤ EncodewhetheranyrowinTablet𝑥 hasitscolumn𝑦modified¤ PeriodicallysynchronizedtoP-units
n Falsepositive=>emptyread(correctedafterthefirstaccess)n Falsenegative=>validatingemptyreadsandretry
0 10 0
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
S-node
S-node
1 quantity=5 NULL
3 quantity=15 NULLT-node
18
TransactionCompilation
¨ Modelatransactionasadirectedacyclicgraph¨ Movereadstostartifpossible
MemtableRead(Item,r1) MemtableRead(Cust,r2)
BufferWrite(Cust,r2,balance =v_balance)
v_balance -=v_price
v_price =r1.pricev_balance =r2.balance
SSTableRead(Item,r1) SSTableRead(Cust,r2)
MemtableRead(Item,r1)
MemtableRead(Cust,r2)
SSTableRead(Item,r1)
SSTableRead(Cust,r2)
v_balance =r2.balance
v_price =r1.price
v_balance -=v_price
BufferWrite(Cust,r2,balance =v_balance)
Commit
Start
19
TransactionCompilation
¨ GroupT-nodeaccess
MemtableRead(Item,r1) MemtableRead(Cust,r2)
BufferWrite(Cust,r2,balance =v_balance)
v_balance -=v_price
v_price =r1.pricev_balance =r2.balance
SSTableRead(Item,r1) SSTableRead(Cust,r2)
MemtableRead(Item,r1)
MemtableRead(Cust,r2)
SSTableRead(Item,r1)
SSTableRead(Cust,r2)
v_balance =r2.balance
v_price =r1.price
v_balance -=v_price
BufferWrite(Cust,r2,balance =v_balance)
MemtableRead(Item,r1)MemtableRead(Cust,r2)
Start
Commit 20
TransactionCompilation
¨ Pre-executeS-nodeaccess
MemtableRead(Item,r1) MemtableRead(Cust,r2)
BufferWrite(Cust,r2,balance =v_balance)
v_balance -=v_price
v_price =r1.pricev_balance =r2.balance
SSTableRead(Item,r1) SSTableRead(Cust,r2) SSTableRead(Item,r1)
SSTableRead(Cust,r2)
v_balance =r2.balance
v_price =r1.price
v_balance -=v_price
BufferWrite(Cust,r2,balance =v_balance)
MemtableRead(Item,r1)MemtableRead(Cust,r2)
Start
Commit 21
Experiment
¨ Setting¤ CPU:2.4G Hz 16-Core¤Memory:64GB¤10servers¤Connectedby1GigabitsNetwork
¨ Benchmark:TPC-C¨ Systems
¤Workload:TPC-C¤MySQLCluster¤ VoltDB¤ Tell
Scalability Cross-PartitionTransactions 22
Experiment
¨ Datacompaction
¨ Remotedataaccessoptimization
¨ Systemrecovery
23
Summary
¨ Solar¤ Ashared-everythingOLTPDBMSonCommodityhardware
n Highperformancetransactionprocessingn Scalabledatastoragecapacity
¤ Severalnoveloptimizationtoimproveperformance¤ Empiricalevaluationshowsgreatperformanceandscalability
24
TransactionCompilation
¨ GroupT-nodeaccess¤ NormalexecutionissuesT-nodeaccessone-by-one¤ TrytobatchmultipleT-nodecommunicationstogether
S-nodes P-unit T-node
localprocess
startread(r1.delta)
read(r1.static)
commit
read(r2.delta)
read(r2.static)
localprocess
S-nodes P-unit T-node
localprocess
read(r1.static)
commit
read(r2.static)
localprocess
startread(r1.delta)read(r2.delta)
25
TransactionCompilation
¨ Pre-executeS-nodeaccess¤ NormalexecutionissuesS-nodeaccessaftertransactionisstarted¤ Trytopre-executeS-nodereads
S-nodes P-unit T-node
localprocess
read(r1.static)
commit
read(r2.static)
localprocess
startread(r1.delta)read(r2.delta)
S-nodes P-unit T-node
localprocess
read(r1.static)
commit
read(r2.static)
localprocess
startread(r1.delta)read(r2.delta)
26
Architecture
¨ Designconsiderations¤ Ashared-everythingarchitecture
n 2-LayerLSM-Treestylestoragen Decouplecomputationfromstorage
¤ Highperformancein-memorytransactionprocessingnMVOCC,combiningtheOCCandtheMVCCn Anon-blockingdatacompactionalgorithm
¤ Fine-grainedremotedataaccessn Datacachen Asynchronousbitarrayn Transactioncompilation
Goal:highperformanceOLTPDBMSwithoutassumingapartitionable workloadoradvancedhardwares
27
Architecture
T-node
S-node
S-node
S-node
P-unit
P-unit
P-unit
¨ Overview
28
Architecture
T-node
S-node
S-node
S-node
P-unit
P-unit
P-unit
WriteRequest
WriteRequest
WriteRequest
¨ Log-Structuredwrite
29
Architecture
T-node
S-node
S-node
S-node
P-unit
P-unit
P-unit
DataCompaction
¨ Log-Structuredwrite
30
Architecture
¨ S-nodes¤ Distributedstorageengine¤ Role:storingaconsistentdatabasesnapshot(SSTable)¤ Feature:supportingscalabledatastorage
S-nodeS-node
S-node
TabletTabletTablet TabletTabletTabletTabletTabletTablet
Disk-basedstorage 31
Architecture
¨ T-node¤ In-memorytransactionengine¤ Role:managingtherestrecentlycommitteddata(Memtable)¤ Feature: providingperformanttransactionalwrites
S-nodeS-node
S-node
TabletTabletTablet TabletTabletTabletTabletTabletTablet
Disk-basedstorage
compaction
T-node(Replica)T-node(Replica)
T-node
Indexes TxnLogBitArray1 0 0 0 10 0 1 1 0
...
In-memorystorage 32
Architecture
¨ P-units¤ Distributedqueryprocessingengine¤ Role:SQL,storedprocedure,queryprocessing,remotedataaccess¤ Feature:providingscalablecomputationpower
S-nodeS-node
S-node
TabletTabletTablet TabletTabletTabletTabletTabletTablet
Disk-basedstorage
compaction
T-node(Replica)T-node(Replica)
T-node
Indexes TxnLogBitArray1 0 0 0 10 0 1 1 0
...
In-memorystorage
records recordsmutations
P-nodeApplicationLogic
StorageInterface
BitArray Cache
Compiler Executor
P-nodeApplicationLogic
StorageInterface
BitArray Cache
Compiler Executor
P-nodeApplicationLogic
StorageInterface
BitArray Cache
Compiler Executor
33
LSM-Treestylestorage
¨ SSTable¤ Aconsistentsnapshot¤ Partitionedintotablets¤ ReplicatedoverS-nodes
¨ Memtable¤ Newlycommitteddata¤ In-memorystoredintheT-node¤Multipleversionstorage¤ ReplicatedtobackupT-nodes
Tablet 1id price quantity1 1.0 102 2.0 203 3.0 30
Tablet2id price quantity4 4.0 405 5.0 506 6.0 60
1 quantity=5 NULL
3 quantity=15 NULLS-node
Item Tableid price quantity1 1.0 52 2.0 203 3.0 154 4.0 405 5.0 506 6.0 60
S-node
T-node
34
Read&Writes
¨ Read¤ readandmergeversionsfrombothT-nodeandoneofS-node
¨ Write¤ directlywriteintotheT-node
S-node T-node
P-unit
Staticversion
Deltaversion
S-node T-node
P-unit
Deltaversion
35
TransactionProcessing
¨ P-unitexecutetransactions¤ Startatransaction¤ Fetchrecordsfromremote¤ Executeuser-definedlogics¤ Bufferdatawrites¤ Committhetransaction
S-nodes P-unit T-node
merge(r)
processotherSQLs
startread(r.delta)
read(r.static)
r.new =expr(r)buffer(r.new)
commit
36
Background
¨ Single-NodeIn-MemoryDBMS¤ Hekaton,HyPer¤ Features
n NodiskI/Oduringtransactionprocessing(In-memorystorage)n Transactioncompilationn Lightweightconcurrencycontrol(OCC,MVCC,determinism)n Simplifiedwrite-aheadloggingn Veryhighperformancetransactionprocessing
¤ Limitationsn Databasesizeshouldbesmallerthanmemorycapacity
37
Background
¨ Shared-NothingDBMS¤ VoltDB/HStore,Spanner¤ Features
n Usehorizontalpartitionn Replyontwophasecommitn Scalabletransactionprocessingandstorage
¤ Limitationsn Partitionable workloadn Lowpercentageofdistributedtransactions
38
Background
¨ Shared-EverythingDBMS¤ Oracle RAC,Tell¤ Features
n Sharedata/cacheamongnodesn Relyonfastinter-nodecommunicationn Scalabletransactionprocessingandstorage
¤ Limitationsn Requireadvancednetworkinfrastructure
n InfiniBand switchwith43TB/s,216 ports costsabout$60,000
39
TransactionCompilation
MemtableRead(Item,r1)
MemtableRead(Cust,r2)
BufferWrite(Cust,r2,balance =v_balance)
v_balance -=v_price
v_price =r1.price
v_balance =r2.balance
SSTableRead(Item,r1)
SSTableRead(Cust,r2)
Start
Commit
¨ Manyremotedataaccessbetweenstart andcommit¨ Groupreadstoreducereadlatency
v_price =Read(Item,id=1,price);
v_balance =Read(Cust,id=5,balance);
40
TransactionCompilation
¨ Reorderopsw/odatadependencydoesnotchangesemantics
MemtableRead(Item,r1)
v_price =r1.price
SSTableRead(Item,r1)
ReadItemPrice
MemtableRead(Cust,r2)
v_balance =r2.balance
SSTableRead(Cust,r2)
ReadCustomerBalance
MemtableRead(Item,r1)
v_price =r1.price
SSTableRead(Item,r1)
ReadItemPrice
MemtableRead(Cust,r2)
v_balance =r2.balance
SSTableRead(Cust,r2)
ReadCustomerBalance
41
TransactionCompilation
¨ Onlyopsw/datadependenciescannotbereordered¤ Usethesamevariable,andoneiswrite(identifybyvariablename)¤ Usethesamerecord,andoneiswrite(identifybytablename)
MemtableRead(Item,r1)
v_price =r1.price
SSTableRead(Item,r1)
ReadItemPrice
CalculatetheNewBalance
v_balance -=v_price
MemtableRead(Cust,r2)
v_balance =r2.balance
SSTableRead(Cust,r2)
ReadCustomerBalance
UpdateCustomerBalance
BufferWrite(Cust,r2,balance =v_balance)
42
DataAccessDuringCompaction
¨ MemTable Read:alwaysreadthenewMemTable𝑚'
¨ SSTable Read¤Mergeddata(𝑇𝑎𝑏𝑙𝑒𝑡1):readfrom𝑠'¤Mergingdata(𝑇𝑎𝑏𝑙𝑒𝑡2):readfrom𝑠( andthefrozenMemtable𝑚(
Reads1 Readm1
FrozenMemtable(m0) ActiveMemtable(m1)
Tablet2(s0)
Tablet1(s0) Tablet1’(s1)
Tablet2’(s1)
Merged
Merging
43
SnapshotIsolationDuringDataCompaction
¨ Classifytransactionsintothreetypes:¤ Type1:startvalidationbeforethecompactionisinitialized
n validateon𝑚(,writeon𝑚(
¤ Type2:startvalidationafterthecompactionisinitializedn validateon𝑚( and𝑚',writeon𝑚'
¤ Type3:startsprocessingafterthecompactionisstartedn validateon𝑚',writeon𝑚'
Memtable𝑚( Memtable𝑚'
initiate
timeline
𝑡' startsvalidation
start
𝑡8 startsvalidation
𝑡9 startsprocessing
end 44
RecoveryduringDataCompaction(DC)
¨ Compactionstartlogentry(CSLE)¤ PersistwhentheDCisstarted¤ Actsasaborderofredologentries
¨ Compactionendlogentry(CELE)¤ PersistwhentheDCisended¤ SavethepositionoftheCSLEoftheDC
¨ Recoveryprocedure¤ ReadCELEtofindthepositionofCSLE¤ ReplaytheredologfromCSLE¤ Atfirst,replaydatainto𝑚(¤ OnceCSLEisencountered,repaydatainto𝑚'
45
Any dataintheT-nodeprice quantity
Tablet1Tablet2
AsynchronousBitArray
¨ Synchronization&usage¤ PeriodicallysynchronizedtoP-units¤ AP-unitcheckitslocalcopytofilteruselessT-nodeaccess
0 10 0
1 quantity=5 NULL
3 quantity=15 NULLT-node
P-unit
0 10 0
Acopyofbitarray
synchronization
Read(id=1,price=?)
46
Any dataintheT-nodeprice quantity
Tablet1Tablet2
AsynchronousBitArray
¨ Synchronization&usage¤ PeriodicallysynchronizedtoP-units¤ AP-unitcheckitslocalcopytofilteruselessT-nodeaccess
0 10 0
1 quantity=5 NULL
3 quantity=15 NULLT-node
P-unit
0 10 0
Acopyofbitarray
Read(id=3,quantity=?)
47
Any dataintheT-nodeprice quantity
Tablet1Tablet2
AsynchronousBitArray
¨ Falsepositive¤ (𝑟𝑜𝑤+, 𝑐𝑜𝑙,) doesnotexistontheT-node,butthebitarraysaysyes
n Anemptyread¤ Reason:bitarraymaintainedattabletgranularity
0 10 0
1 quantity=5 NULL
3 quantity=15 NULLT-node
P-unit
0 10 0
Acopyofbitarray
Read(id=2,quantity=?)
NULL
Tablet1id quantity1 102 203 30 48
Any dataintheT-nodeprice quantity
Tablet1Tablet2
AsynchronousBitArray
¨ Falsenegative¤ Abitarraycopymayfallbehindthelatestversion¤ (𝑟𝑜𝑤+, 𝑐𝑜𝑙,) existsontheT-node,butthebitarraysaysno¤ Transactionre-checkallpotentialemptyreadsinthevalidationphase
0 10 0
1 quantity=5 NULL
3 quantity=15 NULLT-node
P-unit
0 10 0
Acopyofbitarray
Read(id=2,quantity=?)
5 Quantity=25 NULL
1
Validation
Aborted&Retry
49
DataCompaction
¨ Initiate¤ CreateanewMemtable¤ FreezethecurrentMemtable¤ Handlingongoingtransactions
n Case1:validationstartsbeforethecompactionisinitiatedn 𝑡+ and𝑡, areallowedtowritedatainto𝑚(
n Case2:validationstartsafterthecompactionisinitiatedn 𝑡- willwritedatainto𝑚' afterthedatacompactionisstarted
Memtable𝑚( Memtable𝑚'
initiate
𝑡+ startsvalidation
timeline
𝑡, startsvalidation
𝑡- startsvalidation
50
DataCompaction
¨ Start¤ Getcompactiontimestamp𝑡)* after𝑡+ and𝑡, abortorobtaincommitTS
n 𝑡- startsvalidationonlyafter𝑡)* isobtained¤ Startdatacompactionafter𝑡+ and𝑡, finishabort/commit
n CreateanewSSTablebymergingtheoldoneandthefrozenMemtable
Memtable𝑚( Memtable𝑚'
initiate
𝑡+ startsvalidation
timeline
𝑡, startsvalidation
SSTable𝑠'SSTable𝑠(
𝑡+,𝑡, abortorobtaincommitTS
startobtain𝑡)*
𝑡+,𝑡, finishabort/commit
𝑡- startsvalidation
51
DataCompaction
¨ End¤Waituntilthe𝑠' isfullycreated¤ ReleasetheoldMemtableandSSTable
Memtable𝑚( Memtable𝑚'
initiate
timeline
SSTable𝑠'SSTable𝑠(
endstart 52
ConcurrencyControl
¨ DatastructuresontheT-node¤ Atimestampcounter(MVCC)¤ Row-levellatch(OCC)
¨ Start¤ Acquireread-timestamp𝑟𝑡𝑠
¨ Process¤ Readlatestversionspecifiedby𝑟𝑡𝑠
key=1 wts=4col1=5
key=2 wts=3col2=5
wts=2col1=2
Counter:5 Txntxread-timestamp:𝑟𝑡𝑠 = 5
53
ConcurrencyControl
¨ Commit¤ Acquirelatchesforrecordsinthewriteset¤ Verifythereisnonewerversion¤ Acquirewritetimestamp𝑤𝑡𝑠¤Writeandreleaselatches
key=1 wts=4col1=5
key=2 wts=3col2=5
wts=2col1=2
Counter:5Txntx
read-timestamp:𝑟𝑡𝑠 = 5
Counter:6
write-timestamp:𝑤𝑡𝑠 = 6key=1 wts=6col1=2
key=2 wts=6col2=2
wts=3col1=5
wts=4col1=5
wts=2col1=2
54