Write Optimization of Column-Store Databases in Out-of-Core Environment
-
Upload
feng-george-yu -
Category
Data & Analytics
-
view
62 -
download
0
Transcript of Write Optimization of Column-Store Databases in Out-of-Core Environment
WriteOptimizationofColumn-StoreDatabasesinOut-of-CoreEnvironment
YOUNGSTOWNSTATEUNIVERSITY
Dr.Feng“George”YuAssistantProfessor
DepartmentofComputerScienceandInformationSystemsYoungstownStateUniversity
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
WhatisColumn-StoreDatabase?Column-storedatabaseisalsoknownascolumnardatabaseorcolumn-oriented database
Thehistoryofcolumn-storedatabasecanbetracedbackto1970s.Notuntilabout2005whenmanyopen-sourceandcommercialimplementationsofcolumn-storedatabasestookoff.
Well-knowncolumn-storedatabases:
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
FeaturesofColumn-StoreDatabasesFitswellintothewrite-once-and-read-many environment.•WorksespeciallywellforOLAPanddataminingqueries• Retrievemanyrecordsbutneedonlyafewattributes.
Higherdatacompressionrate• Lowdata-entropy•Muchbetterthanrow-basedstorage
YOUNGSTOWNSTATEUNIVERSITY
Row-BasedtoColumn-Store
YOUNGSTOWNSTATEUNIVERSITY
Fig.1customerDatainRow-BasedandColumn-Store(BAT)Format
id name balance1 Alissa 100.002 Bob 200.003 Charles 300.00
(a) Row-Based Table customer
oid int101 1102 2103 3
(b) BAT customer id
oid varchar101 Alissa102 Bob103 Charles
(c) BAT customer name
oid float101 100.00102 200.00103 300.00
(d) BAT customer balance
Figure 1: customer Data in Row-Based and Column-Store (BAT) Format
much faster in a column-store database.Another featured benefit of the column-store
database is data compression, which can reach a highercompression rate and higher speed than traditionalrow-based database. One of the major reasons is thatthe information entropy in the data of one column islower compared to that of row-based data.
Optimizing write operations in a column-storedatabase has always been a challenge. The data ina column-store database is vertically decomposed intoBATs and randomly distributed over the storage. Fur-thermore, assuming an external memory storage is em-ployed, there is a non-trivial probability that a BAT istoo large to fit into one page on the storage. Therefore,the writing on a column-store database will be signif-icantly delayed by ad hoc access to large BATs acrossmultiple pages. Optimizing the write operations in acolumn-store database is a demanding request.
Existing works majorly focus on write opti-mizations in a main-memory column-store database.Krueger at el [11, 12] introduced the di�erential updateto improve the write performance in MonetDB. A spe-cial columnar data structure called delta bu�er is in-troduced to temporarily store decomposed row-basedinput data. However, to the best of our knowledge,very few works focused on optimizing the write perfor-mance on the out-of-core (OOC or external memory)column-store databases. Vertica [14], a column-storedatabase on large volume OOC storage, introducesa specially designed data storage procedure called k-safety to ensure ACID of update transactions on largevolumes of data and improve the data importation ef-ficiency. Nevertheless, k-safety focuses more on thetransaction control rather than the write performanceimprovement for high velocity update query streams.
In this research, we focus on optimizing the writeoperations (update and deletion) on an OOC column-store database. An operation called AscynchronousOut-of-Core Update was originally designed based ona new data structure called Timestamped Binary As-sociation Table.
The rest of the paper is structured as follows. Sec-tion 2 is the background introduction of column-storedatabases. Section 3 states the proposed method ofupdate and deletion optimization on the OOC column-store database. Experimental results are illustrated in
section 4. And section 5 is the conclusion and futureworks.
2 Background of the Column-StoreDatabase
The data structure of a column-store databaseexclusively uses BAT s (Binary Association Tables). ABAT is a fragment of an attribute in the original row-based storage, which usually consists of an oid (ObjectIdentifier) or ROWID, along with a column of attributevalues, which in a pair is called a BUN (Binary UNits).It is a physical model in a column-store database andthe sole bulk data structure it implements. The BAT iscategorized in a special group of storage models called‘DSM’ (Decomposed Storage Model) [4, 10].
The row-based storage data is the original user in-put data, called the front-end data or logical data. Toinput the data into a column-store database, a map-ping rule should be defined from the logical data struc-ture to the physical data structure, namely BAT.
Example 1. (From Row-Based Table to BAT) Sup-pose the table name is customer, with id as the pri-mary key.
customer(id, name, balance)Primary Key: id
The row-based data is shown in Fig. 1(a).In a columnar database, this logical table willbe decomposed into 3 BATs namely customer id,customer name, customer balance. Each BAT con-tains two columns: an oid and an attribute value col-umn with the column name as the corresponding col-umn data type.
In the example, the logical table is fully decom-posed into 3 BATs, Fig 1(b)-1(d), with each BAT con-tains one of the attributes. This is also referred to asfull vertical fragmentation [1]. Full vertical fragmenta-tion has many advantages. First of all, data accessingis e�cient for queries accessing many rows but withfewer columns involved in the query. Another advan-tage is the reduction of the workload on the CPU andmemory generated by OLAP and data mining queries,
id name balance1 Alissa 100.002 Bob 200.003 Charles 300.00
(a) Row-Based Table customer
oid int101 1102 2103 3
(b) BAT customer id
oid varchar101 Alissa102 Bob103 Charles
(c) BAT customer name
oid float101 100.00102 200.00103 300.00
(d) BAT customer balance
Figure 1: customer Data in Row-Based and Column-Store (BAT) Format
much faster in a column-store database.Another featured benefit of the column-store
database is data compression, which can reach a highercompression rate and higher speed than traditionalrow-based database. One of the major reasons is thatthe information entropy in the data of one column islower compared to that of row-based data.
Optimizing write operations in a column-storedatabase has always been a challenge. The data ina column-store database is vertically decomposed intoBATs and randomly distributed over the storage. Fur-thermore, assuming an external memory storage is em-ployed, there is a non-trivial probability that a BAT istoo large to fit into one page on the storage. Therefore,the writing on a column-store database will be signif-icantly delayed by ad hoc access to large BATs acrossmultiple pages. Optimizing the write operations in acolumn-store database is a demanding request.
Existing works majorly focus on write opti-mizations in a main-memory column-store database.Krueger at el [11, 12] introduced the di�erential updateto improve the write performance in MonetDB. A spe-cial columnar data structure called delta bu�er is in-troduced to temporarily store decomposed row-basedinput data. However, to the best of our knowledge,very few works focused on optimizing the write perfor-mance on the out-of-core (OOC or external memory)column-store databases. Vertica [14], a column-storedatabase on large volume OOC storage, introducesa specially designed data storage procedure called k-safety to ensure ACID of update transactions on largevolumes of data and improve the data importation ef-ficiency. Nevertheless, k-safety focuses more on thetransaction control rather than the write performanceimprovement for high velocity update query streams.
In this research, we focus on optimizing the writeoperations (update and deletion) on an OOC column-store database. An operation called AscynchronousOut-of-Core Update was originally designed based ona new data structure called Timestamped Binary As-sociation Table.
The rest of the paper is structured as follows. Sec-tion 2 is the background introduction of column-storedatabases. Section 3 states the proposed method ofupdate and deletion optimization on the OOC column-store database. Experimental results are illustrated in
section 4. And section 5 is the conclusion and futureworks.
2 Background of the Column-StoreDatabase
The data structure of a column-store databaseexclusively uses BAT s (Binary Association Tables). ABAT is a fragment of an attribute in the original row-based storage, which usually consists of an oid (ObjectIdentifier) or ROWID, along with a column of attributevalues, which in a pair is called a BUN (Binary UNits).It is a physical model in a column-store database andthe sole bulk data structure it implements. The BAT iscategorized in a special group of storage models called‘DSM’ (Decomposed Storage Model) [4, 10].
The row-based storage data is the original user in-put data, called the front-end data or logical data. Toinput the data into a column-store database, a map-ping rule should be defined from the logical data struc-ture to the physical data structure, namely BAT.
Example 1. (From Row-Based Table to BAT) Sup-pose the table name is customer, with id as the pri-mary key.
customer(id, name, balance)Primary Key: id
The row-based data is shown in Fig. 1(a).In a columnar database, this logical table willbe decomposed into 3 BATs namely customer id,customer name, customer balance. Each BAT con-tains two columns: an oid and an attribute value col-umn with the column name as the corresponding col-umn data type.
In the example, the logical table is fully decom-posed into 3 BATs, Fig 1(b)-1(d), with each BAT con-tains one of the attributes. This is also referred to asfull vertical fragmentation [1]. Full vertical fragmenta-tion has many advantages. First of all, data accessingis e�cient for queries accessing many rows but withfewer columns involved in the query. Another advan-tage is the reduction of the workload on the CPU andmemory generated by OLAP and data mining queries,
ABUN consistsof(oid,value)
MappingRules
RelationalData
Column-Store
Challenge•Optimizingwriteoperationsinacolumn-storedatabasehasalwaysbeenachallengebecause:• Dataisvertically decomposedintoBATsandrandomlydistributedoverthestorage.• Thewritingonacolumn-storedatabasewillbesignificantlydelayedbyadhocaccesstolargeBATsacrossmultiplepages.
YOUNGSTOWNSTATEUNIVERSITY
Out-Of-Core(OOC)?Existingworksmajorlyfocusonwriteoptimizationsformain-memorycolumn-storedatabase.Tothebestofourknowledge,veryfewworksfocusonoptimizingthewriteperformanceontheOut-Of-Core (OOC orexternalmemory)column-storedatabases.
YOUNGSTOWNSTATEUNIVERSITY
Traditional Update on BATIntraditionalBAT,anupdatebyagivenOIDinvolvesin2phases:1. SearchthelocationinBATbyOID(Time-consuming)
2.Updatethevalueatthetargetlocation.
YOUNGSTOWNSTATEUNIVERSITY
Motivation1. AvoidSearching!2. Allowmulti-valuesforagivenOID.3. Keepdataconsistent.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
Timestamped BinaryAssociationTable(TBAT)
YOUNGSTOWNSTATEUNIVERSITY
oid float
101 100.00
102 200.00
103 300.00
optime oid float
time1 101 100.00
time1 102 200.00
time1 103 300.00
customer_balance customer_balance
BAT TBAT
Supposetheexistingrecordswereinsertedinonebatchattime1.
TheprincipleofAOCupdateistoavoidOOCsearchingandwritingineveryeffortandtousethetimestampfieldofTBATtolabel.InAOCupdate,thenewlyupdateddatathatisdirectlyappendedtotheend ofaTBAT.Insuchamanner,wedon'thavetofrequentlyperformadhocdatasearching.
YOUNGSTOWNSTATEUNIVERSITY
AOCUpdateExample
YOUNGSTOWNSTATEUNIVERSITY
Example:Uupdate queryoncustomer table:update customer set balance=201.00 where id=2Currenttimestampistime2(>time1).
ThenewestTBUNfor201.00isappendedtotheendofTBATcustomer_balance
New update ->
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
Body
Appendix
SelectionafterAOCUpdateThedataconsistency willbeintactinaTBATafterAOCupdate.AftertheTBATofcustomerhasappliedAOCupdates,werunthefollowingquery:
SELECT balance FROM customer WHERE id=2IntheupdatedTBATcustomer_balance,twotupleswillbereturned:
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
Wecomparethetimestamps,time2> time1.Then201.00 isreturnedwhichisconsistentwiththelastupdatevalue.
YOUNGSTOWNSTATEUNIVERSITY
AOCUpdateExperimentPreliminaryexperimentresultsaredesignedinordertocomparethespeedperformancebetweenAOCupdatesonTBATsandtraditionalupdatesonBATs.
TheexperimentisperformedonaCentOS 6.5workstationwithIntelCorei7-37003.4GHzCPU,16GBmemory,and250GBSATA7200RPMharddisk.
TheexperimenttestcodeisimplementedinPython2.7.
YOUNGSTOWNSTATEUNIVERSITY
AOCUpdateExperiment(cont.)
2.27
4.71
7.13
9.59
12.01
1.63E-03 3.25E-03 4.81E-03 6.41E-03 7.95E-03 0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
10% 20% 30% 40% 50%
Ela
pase
d Ti
me
(sec
)
Update Percentage
BAT TBAT
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
AOCUpdateandTraditionalUpdateRunningTime
AOCUpdateExperiment(cont.)
YOUNGSTOWNSTATEUNIVERSITY
1392.84
1449.65
1484.23 1495.56
1509.9
1320
1340
1360
1380
1400
1420
1440
1460
1480
1500
1520
10% 20% 30% 40% 50%
Tim
es F
aste
r (x1
.0)
Update Percentage
TimesofAOCUpdateFasterthanTraditionalUpdate=Time(Update on BAT)
Time(AOC Update on TBAT)
Average1466.436
timesfaster
Overheads
PotentialProblemswithAOCUpdateWhenmanyAOCupdatesareperformed,searchingbecomesgraduallyslowerbecauseitsunsortedappendixrequiresalinearsearch;thegreaterthevolumeofupdateddata,theslowerthesearch.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
Body
Appendix
SearchSpeedDegeneration
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
SelectionQueryExecutionOverhead:TBAToverBAT(time(TBAT)/time(BAT)× 100%)
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
DataCleaning AfterAOCUpdateDatacleaninghasrecentlydrawnalotofattention.
Datacleaninginourcontextistheprocessbywhichwemergetheupdatedupdateddatafromtheappendixintothebody.• Removemulti-valuesofthesameOID• Avoidslowerlinearsearch
Duringnon-peaktimes,OfflineDataCleaningallowsfortheseadjustmentstobemadebymergingintothebodytherecentlyupdateddata.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
ProblemswiththeOfflineDataCleaningMethodThismethodcausesthedatabasetogooffline,meaningthatanyincomingquerieswillhavetowaituntilthedatabasecomesbackonline.
Thislapseinservicemaynotbeappropriateforenvironmentsthatrequireaconstantworkload;inappropriateforconstantinput-streams.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
OnlineDataCleaningThemajordifferenceofonlinedatacleaningistheemploymentofasophisticateddatastructurecalledsnapshot.Theideaoflivesnapshotrootsfromcloudcomputing.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
Body SnapshotofBody
AppendixNew
Appendix(original)
online
merge
read
read
read&write
BodyMerged
AppendixNew
DuringOnlineCleaning AfterOnlineCleaning
OnlineDataCleaning(cont.)TheOnlineEagerDataCleaning(speedpriority)methodmergestheentireappendixoftheTBATsintothebodyinonegotosaveontime.
TheOnlineProgressiveDataCleaning(memory-usagepriority)methodisusedduringmoreextremecaseswhenthefullappendixmaynotfitintomemory.TheDBAmanuallydecidesablocksize,andtheappendixissplitintoseveralofthoseblocksandaddedtoanappendixqueue.Theaboveeagermethodisappliedtotheseappendixfilesandanystreamingupdates(present-time)canbeaddedtoanewsplitappendixfiletobequeuedwhenitfillsuptheblocksize.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
ProgressiveData-CleaningResults
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
UpdateonBATinMap-ReduceInaMap-Reduceenvironment,weassumetheupdatelistofOIDsarecollectedandsubmittedinabatchofUPDATE_LIST
1. Map-ReduceJoinBATLEFTOUTERJOINUPDATE_LISTONOID=>(BATcombinedwithUPDATE_LIST)• Map-sidejoin:whenUPDATE_LISTissmallenoughtofitinto
memory• Reduce-sidejoin:whenUPDATE_LISTislargeenough
2. SelectiveProjection(Map-Only)FOReachrecordin(BATcombineUPDATE_LIST)
IFUPDATE_LISTattributeisnotNULL:outputupdatedvalue(keepthemostrecentupdate)
ELSE:outputoriginalvalue
YOUNGSTOWNSTATEUNIVERSITY
TBAT(Timestamped BAT)TBATin HDFS:struct TBUN{
TIMESTAMP optime,ROWID oid,USER_DEFINED_TYPE attrv
}struct TBAT_slip{
TBUN[max_size_per_HDFS_slip] tbuns}
• Noneedforanyglobalpre-sortingorindexing• ‘attrv’iscanbeanyuserdefinedtypethatflexiblydefinearbitrarykindsofschema
YOUNGSTOWNSTATEUNIVERSITY
AMOUpdate(logical)
YOUNGSTOWNSTATEUNIVERSITY
Example:Updatequeryoncustomer table:
update customer set balance=201.00 where id=2Currenttimestampistime2(>time1).
ThenewestTBUNfor201.00isappendedtotheendofTBATcustomer_balance
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
NewData
OldData
AMOUpdateExperimentPerformedonaClouderaDistributedHadoop(CDH)cluster• 1 masterand3slaves• TotalHDFScapacity=310GB(blocksize=64MB)• InterconnectionisGigabitEthernet
Datasets:1GBand10GBrandomsyntheticdatainBATandTBAT.
Updatequeries:from10%to30%oftheoriginaldata.
YOUNGSTOWNSTATEUNIVERSITY
AMOUpdateExperiment(cont.)
YOUNGSTOWNSTATEUNIVERSITY
1GBUpdateRunningTime
0 50
100 150 200 250 300 350 400 450 500
10 15 20 25 30
Run
ning
Tim
e (s
ec)
Update Percentage (%)
BAT TBAT
YOUNGSTOWNSTATEUNIVERSITY
10GBUpdateRunningTime
0 500
1000 1500 2000 2500 3000 3500 4000 4500 5000
10 15 20 25 30
Run
ning
Tim
e (s
ec)
Update Percentage (%)
BAT TBAT
AMOUpdateExperiment(cont.)
YOUNGSTOWNSTATEUNIVERSITY
Relative OverheadChangingoverDataSets
0 20 40 60 80
100 120 140 160 180
10 15 20 25 30
Ove
rhea
d (%
)
Update Percentage (%)
1GB 10GB
AMOUpdateExperiment(cont.)
GeneralizedWriteOptimizationFramework(DEXA’15)
YOUNGSTOWNSTATEUNIVERSITY
AtomicBuffer
(ReadIn)
AtomicBuffer
AtomicBuffer
AtomicBuffer
AtomicBuffer
ReadOptimizedData
SerializedTBAT_1
SerializedTBAT_2
SerializedTBAT_N
InputStream
AtomicBuffer(Full)
AtomicBuffer(Full)
AtomicBuffer(Full)
…
WriteQueue
Buffe
rPoo
l
…
WriteOptim
izedMod
ule
ReadOptim
izedMod
ule
Publications1. HasteningDataRetrievalonOut-of-CoreColumn-StoreDatabasesusingOffsetB+-Tree
F.Yu,E.S.Jones28thInternationalConferenceonComputerApplicationsinIndustryandEngineering(CAINE2015),October12-14,2015,HiltonSanDiego/HarborIsland,SanDiego,California,USA,pp.313-318
2. AFrameworkofWriteOptimizationonRead-OptimizedOut-of-CoreColumn-StoreDatabasesF.Yu,W.-C.Hou26thInternationalConferenceonDatabaseandExpertSystemsApplications(DEXA2015),Valencia,Spain,September1-4,2015,pp.155-169
3. WriteOptimizationusingAsynchronousUpdateonOut-of-CoreColumn-StoreDatabasesinMap-ReduceF.Yu,E.S.Jones,W.-C.Hou2015IEEEInternationalCongressonBigData,June27- July2,2015,NewYork,USA,pp.720-723
4. OnlineDataCleaningforOut-Of-CoreColumn-StoreDatabaseswithTimestamped BinaryAssociationTablesF.Yu,C.Luo,W.-C.Hou,E.S.JonesProceedingof30thInternationalConferenceOnComputersAndTheirApplications(CATA2015),Honolulu,Hawaii,USA,March9-11,2015,pp.407-412
5. AsynchronousUpdateonOut-of-CoreColumn-StoreDatabasesUtilizingtheTimestampedBinaryAssociationTableF.Yu,C.Luo,W.-C.Hou,E.S.JonesProceedingof27thInternationalConferenceon.ComputerApplicationsinIndustryandEngineering(CAINE2014),NewOrleans,Louisiana,LA,October13-15,2014,pp.215-220.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
SourceCodehttps://github.com/YSU-Data-Lab/TBAT-DEXA15
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
NewChallenges•NewIndexonC-SDBs• Localandglobal• Searching• DataCleaning• ParallelProcessing
•BigData• Searching• DataCleaning• AutoMapping• ToIndexornottoindex?
•BroaderApplications• ScientificsDataManagement• BigDataAnalytics• MachineLearning• OLAP• OLTP• HPC• HTC
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
Thankyou!Feng“George”Yu
ComputerScienceandInformationSystemsYoungstownStateUniversity,Youngstown,OH
YOUNGSTOWNSTATEUNIVERSITY