Column-Oriented Database
Implementation in PostgreSql for
Improving Performance of Read-Only
Queries
Dissertation
submitted in partial fulfillment of the requirements
for the degree of
Master of Technology
by
Aditi D. Andurkar
Roll No: 121022008
under the guidance of
Prof. Shirish P. Gosavi
Department of Computer Engineering and Information Technology
College of Engineering, Pune
Pune - 411005.
June 2012
Dedicated to
my mother
Smt. Rekha D. Andurkar
and
my father
Shri. Deepak A. Andurkar
DEPARTMENT OF COMPUTER
ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation titled
Column-Oriented Database Implementation in PostgreSql for Improving Performance of
Read-Only Queries
has been successfully completed
By
Aditi D. Andurkar
(121022008)
and is approved for the degree of
Master of Technology.
Prof. Shirish P. Gosavi,
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

Dr. Jibi Abraham,
Head,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.
Date :
Acknowledgments
I express my sincere gratitude towards my guide Prof. Shirish P. Gosavi for
his constant help, encouragement and inspiration throughout the project work.
Without his invaluable guidance, this work would never have been a successful
one. I would also like to thank Dr. Arvind Hulgeri for his valuable suggestions
and helpful discussions. Last, but not least, I would like to thank my dearest
and closest friends and of course my parents, who have been the backbone, advisors,
and constant source of motivation throughout the work.
Aditi D. Andurkar
College of Engineering, Pune
May 21, 2012
Abstract
The era of Column-oriented database systems has truly begun, with open-source
database systems like C-Store, MonetDB, LucidDB and commercial ones like Vertica.
A Column-oriented database stores data column-by-column, which means it stores
the information of a single attribute collectively. A Row-Store database stores data
row-by-row, which means it stores all the information of an entity together. Hence,
when there is a need to access data at the granularity of an entity, a Row-Store
performs well. But in decision making applications, data is accessed in bulk at
the granularity of an attribute. The need for Column-oriented databases arose
from the business intelligence needed for efficient decision making, where
traditional Row-oriented databases give poor performance. PostgreSql is a widely
used, open-source, Row-oriented relational database management system which does
not have a facility for storing data in a Column-oriented fashion. This
work discusses the best method for implementing a column-store on top of the
Row-store in PostgreSql, along with the design and implementation of the same.
Performance results of our Column-Store are presented and compared with those of
the traditional Row-store using the TPC-H benchmark. We also discuss the areas in
which this new feature could be used so that performance will be much higher than
with Row-Stores.
Contents

Acknowledgments
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Row-Store
  1.2 Column-Store
  1.3 Comparison of Column-Store vs. Row-Store
  1.4 PostgreSql Database Management System
  1.5 Thesis Objective and Scope
  1.6 Thesis Outline
2 Literature Survey
  2.1 PostgreSql
    2.1.1 System Catalogs and Data Types
    2.1.2 Query Processing
  2.2 Column-Store
  2.3 Approaches for implementing Column-store
    2.3.1 Implementing Column-store on top of row-store
      2.3.1.1 Materialized views
      2.3.1.2 Index-only plans
      2.3.1.3 Vertical partitioning
    2.3.2 Modifying the storage layer
    2.3.3 Modifying the storage and execution layer
    2.3.4 Fractured mirrors approach
    2.3.5 Vertical Partitioning with Super Tuples
  2.4 Summary
3 Application Areas
4 PostgreSql Column-Store Design
  4.1 Creating a Column-store Relation
  4.2 Meta data For the Column-store Relation
  4.3 Inserting data into Column-store table
  4.4 Altering the Column-store table
  4.5 Dropping the Column-store table
  4.6 Selecting data from Column-Store table
  4.7 Summary
5 Experiments and Results
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Related Work
  6.3 Future Work
Bibliography
List of Figures

1.1 Relation stored Row-by-Row
1.2 Relation stored Column-by-Column
1.3 Architecture of PostgreSql Database Management System
2.1 Query Processing Stages
2.2 Types of approaches to design Column-store databases
2.3 B+ tree index for a column
2.4 Vertical Partitioning
4.1 Creating Colstore Table
4.2 System Table pg_map for storing mapping between relation and internal tables
4.3 System Table pg_attrnum for storing meta-data of Column-Store Table
4.4 Taking join of internal tables
4.5 Plan Tree for SELECT query in Row-Store
4.6 Plan Tree for SELECT query in Col-Store
5.1 E-R Diagram of TPC-H Benchmark Dataset
5.2 Comparison of Column-Store vs. Row-Store

List of Tables

5.1 Experimental results for simple select query
Chapter 1
Introduction
Whenever we say relational data, the most obvious interpretation is a table with
attributes as one dimension and entities as the other. We imagine a table stored
on some storage medium in this 2-dimensional form. But this is just a conceptual
aid for understanding a relation; at the physical level it is not possible to store
data the way we imagine it. Instead, data are physically stored consecutively, one
value after another, in a 1-dimensional way. While storing in a 1-dimensional
manner we have 2 choices: we can store the data either entity-by-entity or
attribute-by-attribute. This leads to two kinds of databases, Row-Store and
Column-Store respectively. They are as follows:
1.1 Row-Store
Traditional Row-Store DBMSs store data tuple by tuple, i.e. all attribute values
of an entity are stored together, sequentially one after the other. Hence, a
Row-store is used where information is required from the DBMS at the granularity of
an entity. In a Row-Store, write-queries like insert, delete, and update can be
easily performed [3] since they apply to an entity/tuple. But read-queries like
select have predicates, which are conditions applied on attributes/columns, and
non-predicate attributes, which are the columns projected as the result of the
query. These queries therefore apply to attributes/columns rather than to an
entity/tuple. Hence, a Row-Store is said to be Write-Optimized since it favors
write operations.
Figure 1.1 shows the manner in which data is stored in Row-Stores on
physical media.
Figure 1.1: Relation stored Row-by-Row
1.2 Column-Store
A Column-Store DBMS stores data column by column, i.e. all values of an attribute
are stored collectively, so that the ith values of every column of a relation
together form the ith tuple. Hence, a Column-store is used where information is
required from the DBMS at the granularity of an attribute. In a Column-Store,
read-queries like select can be easily performed [3] since they apply to
attributes/columns. But write-queries like insert, update, and delete, which apply
to an entity or tuple, are not as easily processed. Therefore, a Column-Store is
said to be Read-Optimized since it favors read operations.
Figure 1.2 shows the manner in which data is stored in Column-Stores on
physical media.
Figure 1.2: Relation stored Column-by-Column
1.3 Comparison of Column-Store vs. Row-Store
The question of which type of database system is better depends on the kind of
query workload [3]. If, after data insertion, updates and deletions dominate and
entire tuples need to be accessed, then Row-Stores are the best. They are the most
common choice for business transaction processing. For example, a bank uses
databases to store information about its customers. Some customer A might want to
transfer money to the account of customer B. Here, customers A and B are entities,
and a simple update has to be done on the accounts of both: deduct amount x from
the account of A and credit amount x to the account of B. As can be seen,
information will be required by the bank from the DBMS at the granularity of an
entity, so a Row-Store, which stores data entity-by-entity, is the most obvious
choice of the two database systems we studied.
Now let us consider a different query, which needs to find the average purchase
amount of a customer at a shop. If we look at this query carefully, it can be seen
that the information we require is not entity specific; rather, it is aggregate
information over all entities in the table. For this query, only the shopping
amount attribute is required to find the average, and nothing else. Since a column
can be accessed directly here, a Column-Store will be better than a Row-Store.
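This contrast can be sketched with a toy Python example (the relation, attribute names, and values are invented for illustration): the same relation is held once row-by-row and once column-by-column, and the average is computed from each layout.

```python
# Toy relation of customer purchases; all data is invented.
# Row-store layout: one record (tuple) per entity.
row_store = [
    ("A", "2012-01-05", 1200.0),
    ("B", "2012-01-07", 5400.0),
    ("C", "2012-01-09", 300.0),
]

# Column-store layout: one array per attribute.
col_store = {
    "name":   ["A", "B", "C"],
    "date":   ["2012-01-05", "2012-01-07", "2012-01-09"],
    "amount": [1200.0, 5400.0, 300.0],
}

# Row-store: every full tuple is touched just to reach one field.
avg_rows = sum(t[2] for t in row_store) / len(row_store)

# Column-store: only the single relevant column is scanned.
amounts = col_store["amount"]
avg_cols = sum(amounts) / len(amounts)

print(avg_rows, avg_cols)  # both are 2300.0
```

In a real DBMS the saving is in disk I/O: the row-store must read pages containing all attributes, while the column-store reads only the pages of the amount column.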
Consider another query: find customers shopping for more than Rs. 5000 every
month (the owner thinks that if additional benefits are given to these customers,
they might visit the shop more often). This query needs only the customer name,
amount spent, and date attributes from the entire relation; the remaining 10-15
attributes will be irrelevant (assuming the dataset is very large). This query
helps to gain insight into the data, and it is not a business-critical situation
like transaction processing. For such queries Column-Stores perform better: since
attributes are stored separately, irrelevant attributes need not be accessed,
saving a lot of processing time. If the queries are going to be read queries that
help to gain insight into the data, Column-Stores will certainly perform better.
Therefore, when it comes to analytical or decision making applications,
column-stores prove to be the best [3]. Business organizations have to handle large
amounts of data and extract meaningful information from it for efficient decision
making, which is commonly termed Business Intelligence. This includes finding
associations in data, classifying or clustering data, etc., and has led to a large
area of research called data mining. It is observed that for these kinds of
applications, once the data warehouse is built, i.e. once data is loaded, most of
the operations on the data are read operations. Unlike business transaction
processing, not all attributes of an entity are required for the analysis. A
Row-store, compared with a column-store for these applications, has significantly
slower performance, as has been shown [1].
Again, there are some optimizations that are possible with Column-Stores but not
with Row-Stores, and that can improve the performance of Column-Stores
significantly compared to Row-Stores [1, 3]. The first and most important is
Compression [4]. As data are stored column-by-column, compression can easily be
applied to a column. This is possible because a column has a single data type in
which similar data is stored; for example, a mobile number in India will always
contain 10 digits. If one stores data in compressed format, column extraction
becomes very cheap. Next is block processing, where multiple tuples from a column
are extracted and passed as a block from one operator to another. There is one
more optimization, called Late Materialization, where tuple construction, i.e. the
joining of columns, is performed as late as possible. These optimizations are
specific to Column-Stores because Row-Stores do not have the properties required
to apply them.
In recent times, a lot of work has been done comparing Row-Store and
Column-Store performance [8, 11, 18, 19].
1.4 PostgreSql Database Management System
PostgreSql is the world's most advanced object-relational database management
system [7]. It is free and open-source software, developed by the PostgreSql
Global Development Group, consisting of a handful of volunteers employed and
supervised by companies such as Red Hat and EnterpriseDB. PostgreSql is available
for almost all operating systems: Linux (all recent distributions), Windows, Unix,
Mac OS X, FreeBSD, OpenBSD, Solaris, and other Unix-like systems. It works on a
majority of architectures: x86, x86-64, IA64, PowerPC, Sparc, Alpha, ARM, MIPS,
PA-RISC, VAX, M32R [7]. MySQL and PostgreSql compete strongly in the field of
relational databases since both have advanced functionality, comparable
performance and speed, and, most importantly, both are open-source. PostgreSql is
object-relational in the sense that it is similar to a relational database but its
database model is object-oriented: objects, classes, and inheritance are directly
supported in database schemas and in the query language.
PostgreSql which uses a client/server model can be broken-up into three large
subsystems [7]:
1. Client Server: This subsystem consists of the Client Application and the Client
Interface Library. The Client Application wants to perform some operation on the
data; it can be any one of a text-oriented tool, a graphical application, or some
specialized tool. It is the responsibility of the Client Interface Library to
convert each client application's requests into proper SQL queries that the server
can understand and parse. Hence, the server need not parse different languages and
waste its time, but only interprets SQL queries, which makes the whole system
faster.
2. Server Process: The Postgres server constantly runs a daemon, the postmaster,
which is the master server process. When it receives a connection request from a
client process, it forks a new postgres server process. Once the process is
created, the client and the postgres process are linked so that they can
communicate without the postmaster. An SQL query is passed to the postgres server
and is incrementally transformed into result data. The master server process
always waits for calls from clients, while slave server processes and clients come
and go.
3. Database Control: Stored data is accessed through the Storage subsystem.
Figure 1.3: Architecture of PostgreSql Database Management System
PostgreSql conforms to the SQL standard. Its features [7] are as follows:
1. Complex Queries: Queries can be nested, consist of many operators, or have
complicated criteria.
2. Triggers: A trigger is a piece of code which is executed implicitly when a
certain event occurs on a specific relation in the database.
3. Views: A view is a query stored inside the database. Whenever a view is
accessed, its corresponding query is executed internally; this is transparent to
the user.
4. Foreign keys: A foreign key is a key in a relational table that matches a
candidate key of another table.
5. Transactional Integrity: Integrity ensures that a transaction completes fully;
if not, it is aborted and rolled back fully, so consistency is maintained.
6. Multi-version Concurrency Control: Allows concurrent access to the database
while consistency is maintained.
Also, users can extend [7] PostgreSql by adding new
1. Data types: Types of the data stored in the database.
2. Operators: Operators, such as relational and arithmetic ones, which are
applied to compare data and calculate results.
3. Functions: A function is a routine which accepts arguments, performs some
action, and returns the outcome of that action.
4. Procedural Languages: Languages in which server-side functions and
control-flow logic can be written.
5. Index Methods: For fast data retrieval, data can be indexed using different
types of indexes.
6. Aggregate Functions: Used to perform calculation on multiple values.
PostgreSql provides significant performance features [7] over MySQL, such as:
1. Efficient executor for static or parameterized SQL
2. Advanced cost optimizer with multiple query plans for query execution
3. Indexing: partial, functional, multiple-index combinations
4. Data compression
5. Improved cache management
6. Asynchronous commit
1.5 Thesis Objective and Scope
Our aim is to implement a Column-store on top of the Row-store in PostgreSql to
improve the performance of read queries. PostgreSql is basically an open-source
row-oriented database which has no feature for storing data column-by-column. We
are enhancing PostgreSql with this feature so that, for decision making
applications, the performance of read queries will be higher than with a
row-oriented database system. Modifications will be done in such a way that all
query processing for a Column-Store table is transparent to the user.
1.6 Thesis Outline
In Chapter 2, we explore the structure of PostgreSql, Column-Store optimizations,
and different approaches [1] for implementing a column-store, as part of our
literature survey. Chapter 3 discusses application areas for this work. Chapter 4
explains in detail the Column-store approach implemented in PostgreSql, the new
data structures introduced, and how read/write queries are processed, i.e. the
internal design and working of queries for the column-store implementation. In
Chapter 5, we evaluate the performance of our implementation using the TPC-H
benchmark and compare it with Row-Store PostgreSql. Chapter 6 concludes the report
and explains the future scope of the work.
Chapter 2
Literature Survey
Keeping our aim in mind, our literature survey consists of 3 main areas:
PostgreSql, Column-Store Databases, and Approaches for Column-store
implementation.
2.1 PostgreSql
As we have implemented a Column-Store in PostgreSql, there is a need to
understand the important aspects of PostgreSql. In Chapter 1 we saw the
client/server model of PostgreSql. In this section we look at the system catalogs
[7] and data types defined in PostgreSql, and the query processing stages [7].
2.1.1 System Catalogs and Data Types
The System Catalog holds the metadata for the system; metadata is data about
data, i.e. descriptive information about the stored data itself. For modifying
PostgreSql, these data structures should be thoroughly understood. Also, we
might be required to create new ones. The data structures are as follows:
1. System Catalog for describing a relation
i. pg_class: For each row-oriented table, one row is added to this system
table containing the name of the relation, its owner, its permissions, etc.
ii. pg_attribute: For each row-oriented table, one row is added to this
system table for each attribute present in the table, containing the attribute
name, data type, etc.
iii. pg_index: For every index created on any column of a relation, a row
is added to this system table containing the index name, type of index, etc.
An index is also a relation, so an entry for it is made into pg_class as well.
iv. pg_proc: For every defined function, an entry is made into this system
table with its name, argument types, result type, etc.
v. pg_language: Defined functions are always written in some implementation
language, like C, SQL, or PL/pgSQL; this entry is made into pg_language.
2. System Catalog for Aggregate Functions
i. pg_aggregate: For each defined aggregate function, an entry is made
into this table describing its working-state data type, update function, and
result function.
3. System Catalog for Operators
i. pg_operator: Various types of operators are used while handling
expressions. These operators are recorded in this system table.
4. System Catalog for Data Types
i. pg_type: An entry is made into this system table for every data type
defined, whether built-in or user-defined.
2.1.2 Query Processing
Figure 2.1 shows how a query is processed inside the database management
system. Let us walk through how a query is processed [7]:
1. A connection is established from the application program to the PostgreSql
server. A fired query is transmitted by the application program to this server.
2. Parser Stage:
When a query is fired through an application program, it is parsed first. The
parser is essentially a syntactic analyzer for the query text. The lexical
analyzer (lexer) converts the text query into a sequence of tokens to determine
its grammatical structure. The parser then checks for correct syntax and builds a
parse tree from the identified tokens.
Figure 2.1: Query Processing Stages
3. Traffic Cop:
The Traffic Cop is responsible for differentiating between simple and complex
queries. Examples of simple queries are BEGIN and ROLLBACK; no additional
processing is required, so the rewrite system need not process such queries and
they are passed on to the executor directly. But select queries and joins need
the rewrite system, so these are passed to it.
4. Rewrite System:
The parse tree is given as input to this stage, which transforms it into a query
tree using rules stored in the system catalogs. The query tree is an internal
representation of an SQL statement in which the individual parts it is built from
are stored separately. This stage performs semantic analysis on the text query,
extracting the meaning of the query by identifying which tables, functions, and
operators are referenced by it.
5. Planner/Optimizer:
A query can have several execution paths, so the planner creates all possible
paths leading to the same result. Choosing an optimal path is the job of the
planner/optimizer, which compares the execution costs of all possible paths. The
path which requires minimal execution cost is the optimal one.
6. Executor:
Executor executes the query plan decided by the planner with the help of
the storage system and sends the output (rows retrieved) to the application
program.
We have created a layer between parse tree formation and query tree formation.
This layer transforms queries on column-store tables in a different manner.
Changing the parser would have complicated the code, because changing the parser
amounts to language processing; modifying the query tree is easier and logically
cleaner.
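As a rough illustration of what such a rewriting layer does, the sketch below maps a SELECT over a column-store table onto a join of per-column internal tables. The naming scheme (`table_column` internal tables, a shared `pos` key) is a hypothetical simplification, not the exact scheme of our implementation.

```python
# Hypothetical rewrite step: a SELECT on a column-store table becomes a
# join over per-column internal tables sharing a common position key.
def rewrite_select(table, columns):
    internals = [f"{table}_{c}" for c in columns]  # one internal table per column
    select = ", ".join(f"{t}.value AS {c}" for t, c in zip(internals, columns))
    joins = "".join(
        f" JOIN {t} ON {t}.pos = {internals[0]}.pos" for t in internals[1:]
    )
    return f"SELECT {select} FROM {internals[0]}{joins}"

sql = rewrite_select("customer", ["name", "amount"])
print(sql)
```

The real layer operates on the query tree rather than on SQL text, but the effect is the same: the user writes a query against one logical relation, and the system silently answers it from the internal tables.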
2.2 Column-Store
Column-Stores have some advantages over Row-Stores, which are due to the way
data is stored in a Column-Store.
1. Compression:
Storing the values of a column contiguously makes adjacent values on disk similar
to each other, or at least of the same data type [4]. This leads us to one of the
most important optimizations, called compression: we can compress the data using
several existing compression techniques. This optimization is not possible in
row-stores because within a tuple all the attributes are different; compression is
therefore an optimization strategy specific to column-stores. One might think that
we are compressing data to save space, but disks have become so cheap that saving
space is no longer the goal [22, 23]; instead, we look at compression as a
technique for improving query performance. Since data must be read from disk,
compression improves I/O
performance by reducing seek times (the data are stored nearer to each
other), reducing transfer times (there is less data to transfer), and increasing
buffer hit rate (a larger fraction of the DBMS fits in buffer pool). Column-
oriented compression schemes also improve CPU performance by allowing
database operators to operate directly on compressed data.
2. Materialization:
In Column-stores, the attribute values of a single tuple are stored at multiple
locations on disk, but most of the time queries access more than one column of a
relation. The standard output format is always an entity at a time, not a column
at a time. Therefore, the attributes from different columns of the same relation
have to be combined into rows to be displayed. This reconstruction of tuples is
called materialization [5]; it is nothing but taking a join of the various
columns. There are two ways to reconstruct tuples in column-stores:
i. Early Materialization:
In this approach, the columns mentioned in the query are retrieved first and
joined, so that we have row-oriented tuples, and then the predicates are applied;
the tuples that satisfy them are the answer to the query. But this way much of
the advantage of column-stores is lost and the process becomes less efficient. In
early materialization [3, 5], as soon as a column is accessed, its values are
added to an intermediate-result tuple, eliminating the need for future
re-accesses.
ii. Late Materialization:
In this approach, tuples are not formed until some part of the query plan has
been processed. Predicates are applied on the columns of the relation first, and
the positions which satisfy each predicate are listed; then a position-wise AND
operation is performed on these lists. The result is a list of positions
satisfying all predicates, so the required columns are accessed again, the
corresponding values are extracted, and a join is taken. This is generally a
better strategy than early materialization [3, 5]. The primary advantages of late
materialization [5] are that it allows the executor to use high-performance
operations on compressed, column-oriented data and to defer tuple construction to
later in the query plan, possibly allowing it to construct fewer tuples.
One advantage of Late Materialization is that column values can be
stored together contiguously in memory in column-oriented data struc-
tures. This has two performance advantages: First, the column can be
kept compressed in memory using the same column-oriented compres-
sion techniques as were used to store column on disk. Second, looping
through values from a column-oriented data structure tends to be much
faster than looping through values using the tuple iterator interface.
The fundamental problem with delaying tuple construction is that in some cases
columns may need to be accessed multiple times in the query plan [5]. For
example, suppose a column is accessed once to retrieve positions matching a
predicate and a second time (later in the plan) to retrieve its values. If the
matching positions cannot be determined from an index, then the column values
will be accessed twice. Assuming proper query pipelining, the re-access will
incur no disk cost, but there will be a CPU cost in scanning the block to extract
the values corresponding to the given positions. This cost can be substantial if
the positions are not in sorted order. In many cases, a query outputs fewer
tuples than are actually processed: predicates usually reduce the number of
tuples output, and aggregations combine tuples into summary tuples [5]. Thus, if
the executor waits long enough before constructing a tuple, it might be able to
avoid constructing it altogether.
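The late materialization steps above can be sketched as follows (column data and predicates are invented for illustration): per-column predicate evaluation yields position lists, a position-wise AND intersects them, and tuples are constructed only for the surviving positions.

```python
# Two columns of the same invented relation, aligned by position.
amount = [1200.0, 5400.0, 300.0, 7000.0]
city   = ["Pune", "Pune", "Mumbai", "Pune"]

# Step 1: per-column predicate evaluation produces position lists.
pos_amount = {i for i, v in enumerate(amount) if v > 5000}    # {1, 3}
pos_city   = {i for i, v in enumerate(city) if v == "Pune"}   # {0, 1, 3}

# Step 2: position-wise AND (set intersection).
positions = sorted(pos_amount & pos_city)                     # [1, 3]

# Step 3: materialize tuples only for the surviving positions.
result = [(city[i], amount[i]) for i in positions]
print(result)  # [('Pune', 5400.0), ('Pune', 7000.0)]
```

Only two tuples are ever constructed, although four positions were scanned per column; early materialization would have built all four full tuples before filtering.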
2.3 Approaches for implementing Column-store
Keeping our aim in mind, we surveyed the literature to study various approaches
to Column-Store implementation. We now explore these approaches in detail.
Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem [1, 3] have done a lot of
work on Column-Stores in recent times. In particular, Daniel J. Abadi has done
immensely valuable work in the field of Column-oriented databases [1, 2, 3, 4, 5,
6, 17, 19], including implementing a Column-Store from scratch called ”C-Store”
[2, 12]. They suggested three approaches for the implementation of Column-stores,
which are as follows:
Figure 2.2: Types of approaches to design Column-store databases
2.3.1 Implementing Column-store on top of row-store
This approach implements a column-store using the existing row-store structures;
no physical changes are made to the DBMS. We now briefly explain the various
approaches [1] to implementing a column-store on top of a row-store suggested by
prior research. The approaches are considered in an order such that the
limitations they place on possible queries decrease.
2.3.1.1 Materialized views
Under this approach, an optimal set of materialized views is formed [1]. Every
materialized view in this set contains exactly the set of columns required to
answer a distinct query, so that for every possible query a matching materialized
view is present. Whenever a select query is fired, the relevant materialized view
is accessed, giving quick results. But for this, all possible queries must be
known in advance. In analytical applications specifically, which are the ones we
are targeting, queries cannot be known prior to execution. This is the biggest
disadvantage of this approach and makes it impractical.
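A minimal sketch of the idea, using Python's sqlite3 module: SQLite has no materialized views, so a table created with CREATE TABLE ... AS SELECT stands in for one, and all table and column names here are invented.

```python
# Simulating the materialized-view approach with sqlite3 (stdlib).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (name TEXT, city TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("A", "Pune", 1200.0), ("B", "Pune", 5400.0),
                 ("C", "Mumbai", 300.0)])

# A 'materialized view' holding only the columns one known query needs.
con.execute("CREATE TABLE mv_name_amount AS SELECT name, amount FROM sales")

# The known query is answered from the narrow view, not the base table.
rows = con.execute(
    "SELECT name FROM mv_name_amount WHERE amount > 5000").fetchall()
print(rows)  # [('B',)]
```

The catch described above is visible here: a query touching `city` has no matching view and falls back to the wide base table, so the whole query workload must be known when the views are built.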
2.3.1.2 Index-only plans
This approach stores tables in the usual row-oriented form, but uses indexing for
the column-store implementation: for every column of the table an index is formed
which does not follow the physical order in which values are stored, i.e. an
unclustered B+-tree [9] index on each column. Using these indexes, a query can be
answered without ever going to the underlying row-store tables [1]. When a
predicate is present on a column in the query, the B+-tree index [24] can easily
give us the result, since it divides the domain values of that column into
contiguous ranges. On the other hand, if the predicate is absent, the entire
index must be scanned, which is an overhead. Forming indexes this way can also be
problematic: indexes are formed on every distinct column, so for simultaneously
handling one or more predicate and non-predicate attributes, composite indexes
have to be formed, and there will always be a limitation as to how many composite
indexes can be formed.
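The idea of answering a range predicate from an unclustered per-column index alone can be sketched in Python; a sorted list of (value, position) pairs stands in for the B+-tree, and the column data is invented.

```python
# Sketch of an index-only plan: an unclustered per-column index as a
# sorted list of (value, position) pairs, searched with bisect.
import bisect

amount = [1200.0, 5400.0, 300.0, 7000.0]       # column in physical row order
index = sorted((v, pos) for pos, v in enumerate(amount))

def positions_in_range(index, low, high):
    """Return row positions with low <= value < high, without ever
    touching the base table."""
    lo = bisect.bisect_left(index, (low,))
    hi = bisect.bisect_left(index, (high,))
    return [pos for _, pos in index[lo:hi]]

print(positions_in_range(index, 1000.0, 6000.0))  # [0, 1]
```

As in the discussion above, the index answers value-range predicates cheaply, but a query with no predicate on this column would have to scan the entire index.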
2.3.1.3 Vertical partitioning
The most straightforward approach for designing Column-stores is to partition a
relation vertically into multiple internal tables, such that one internal table
is formed for each attribute of the relation [1, 11]. Since the attribute values
of an entity will
be stored in different internal tables, there has to be some way to link them for
tuple construction, because ultimately the result of a read query is
entity-oriented.

Figure 2.3: B+ tree index for a column

Therefore, one more column holding a table key will be included in every internal
table, so that a join [10] of the various columns of the relation can be taken on
the basis of this common key to construct the entire tuple [1]. The required
common key can be formed from the integer position of the tuple in the relation
table. A primary key could also be used as the common key in all internal tables,
but a primary key can be bulky or composite as well; therefore, the position
number is preferred. In this fashion, tables will be physically created in the
logical schema of the relation.
When a single column is to be fetched, joining will not be required but when
multiple columns are to be fetched, joining will become necessary. When a query
is fired, only those internal tables will be accessed which correspond to mentioned
attribute and remaining ones will be neglected. Then joining will be taken and
then will be processed further.
Although simple, this approach has both advantages and disadvantages. The
most obvious disadvantage is that each internal table requires the common-key
column, which occupies extra space, and the number of tables created per relation
is large, which adds work at table-creation time. On the positive side, there is no
limitation on the queries that can be fired on the column-store, unlike materialized
views, and indexes need not be formed on the attributes as in index-only plans.
Of these approaches, vertical partitioning is the best, since the space overhead is
its only disadvantage.
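The vertical-partitioning scheme described above can be sketched with SQLite. The relation name `emp`, its attributes, and the data are illustrative assumptions; the two internal tables share the common `record_id` key, a single-column fetch needs no join, and a multi-column fetch reconstructs tuples via a natural join on that key.

```python
import sqlite3

# Minimal sketch of vertical partitioning: one internal table per attribute
# of a hypothetical relation "emp", linked by a common record_id.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_name (record_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE emp_salary (record_id INTEGER, salary INTEGER)")

rows = [("alice", 100), ("bob", 200)]
for pos, (name, salary) in enumerate(rows, start=1):
    # The integer position of the tuple serves as the common key.
    cur.execute("INSERT INTO emp_name VALUES (?, ?)", (pos, name))
    cur.execute("INSERT INTO emp_salary VALUES (?, ?)", (pos, salary))

# Fetching a single column needs no join at all.
names = [r[0] for r in cur.execute("SELECT name FROM emp_name")]

# Fetching multiple columns reconstructs tuples via a natural join on record_id.
tuples = cur.execute(
    "SELECT name, salary FROM emp_name NATURAL JOIN emp_salary"
).fetchall()
print(names, tuples)
```

The space overhead mentioned above is visible here: the `record_id` value is stored once per attribute rather than once per tuple.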
2.3.2 Modifying the storage layer
The columns are physically stored one after another so that a positional join can
be taken to form a tuple, and table keys are not required [3]. Therefore, this
approach does not require any change to the logical schema of the database
system.

Figure 2.4: Vertical Partitioning

Rather, the tables are stored on disk column by column. Here, tuple identifiers
are not required for tuple construction; implicit column positions can be used to
take the join. When columns are stored this way, tuple headers and tuple IDs are
not stored along with the actual column data.

In this approach, tuples are constructed before query execution because the
DBMS executor cannot operate on data stored in column format. The columns
must be stored in position order so that the attribute values of any tuple can be
matched up by position.
2.3.3 Modifying the storage and execution layer
The columns are physically stored column by column, so a positional join can be
taken, and the executor can process the data while it is still in columns [3]. In
the previous approach, no changes were made to the query executor, so tuple
construction had to be done before query execution. In this approach, the query
executor is also modified to work on column data, so tuple construction can be
delayed and a different query plan can be used that applies predicates on the
attributes first; the result of this predicate application is then used in the rest of
the query plan, which certainly improves performance.
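Late tuple construction can be shown in a few lines of plain Python. The column contents are made up, but the column names follow the TPC-H attributes used later in the thesis: the predicate runs over a single column on its own, producing matching positions, and tuples are materialized only for those positions.

```python
# Sketch of operating on column data before tuple construction
# (late materialization), with made-up data in two TPC-H-style columns.
l_partkey = [1, 2, 1, 3]                     # predicate column
l_extendedprice = [10.0, 20.0, 30.0, 40.0]   # projected column

# Apply the predicate on the single column alone, yielding matching positions.
positions = [i for i, k in enumerate(l_partkey) if k == 1]

# Construct tuples only for the qualifying positions (a positional join).
result = [(l_partkey[i], l_extendedprice[i]) for i in positions]
print(result)  # [(1, 10.0), (1, 30.0)]
```

Only two of the four rows are ever assembled into tuples, which is the saving this approach aims for.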
2.3.4 Fractured mirrors approach
R. Ramamurthy, D. DeWitt, and Q. Su suggested the "fractured mirrors" approach
[13], also called the hybrid row/column approach, in which the row-store mainly
processes inserts, updates and deletes, whereas the column-store mainly processes
reads. A background process shifts data from the row-store to the column-store.
They concluded that tuple overhead leads to substantial problems and that fetching
large chunks of data is required to reduce tuple reconstruction times.
2.3.5 Vertical Partitioning with Super Tuples
A. Halverson, J. L. Beckmann, J. F. Naughton, and D. J. DeWitt implemented a
column-store in the Shore database using the vertical partitioning approach [11]
and compared its performance with a row-store. They found that this
implementation could give tough competition to column-stores implemented from
scratch if an optimization called "super tuples" is used: header information need
not be duplicated, and chunks of data can be accessed as a block. Although
accessing a single row becomes harder, their analysis showed that the performance
of a from-scratch column-store and the new implementation could match. However,
they did not consider the column-store-specific optimizations available at that time.
2.4 Summary
In the literature survey, we briefly studied PostgreSql query processing. We
decided to make our modifications between the parse-tree and query-tree
formation stages, which simplifies the code and its understanding. We then
studied optimizations specific to column-stores. For implementing a column-store,
we considered various approaches from past research and observed that modifying
the storage layer, the execution layer, or both would change the DBMS completely.
Hence, changing the logical schema was chosen as the approach for our
implementation. Considering the advantages and disadvantages of all three
logical-schema approaches, vertical partitioning was selected for implementing the
column-store on top of the row-store.
Chapter 3
Application Areas
The column-store database system which we have built on top of a row-store has
applications in certain areas, as already discussed. There are certain criteria under
which it should be used to get the full advantage of the implementation; otherwise,
performance can become worse. First, the relations in the database should contain
a large number of columns, i.e. at least 5-6 or more columns in any relation; since
we store data column-by-column, this condition has to be satisfied. Second, the
queries fired should be analytical rather than transactional, accessing few
attributes and involving mostly aggregate operations. Roughly, if up to one third
of the attributes are accessed in the query (including predicates and
non-predicates), the performance obtained is very high, as shown in the
experiments and results chapter. On a very large dataset, the performance
improvement of the column-store over the row-store is much greater than on a
small dataset.
Let us be more specific about the application areas:
1. Data Warehousing
i. High end (clustering): A computer cluster is a collection of closely
connected computers that work together and act as a single system.
Clusters are used to process complicated problems that require a lot of
computation: all the computers work simultaneously on different tasks
of the same problem, which shortens the time to obtain results. In such
cases, large data is stored on the cluster, but each computer processes
only part of it for its assigned job. Here, column-stores prove to be
useful.
ii. Mid end/mass market: The mass market is the largest group of
consumers for a specified industry product. Suppose a shop owner
decides to identify his regular customers who buy large amounts of his
products (allowing him to make a good profit). Such problems can be
solved with better performance with the help of column-stores.
iii. Personal analytics: This is an emerging area in which a person gets to
analyze his or her own life: an automatic computational history
explaining how and why things happened, followed by projections and
predictions. For this, a person's mail, messages, etc. are analyzed.
2. Data Mining: Extracting meaningful information from raw data is nothing
but data mining. Here also, column-stores can be used for retrieving useful
information.
3. Google BigTable: BigTable is a compressed, high-performance, proprietary
data storage system built on the Google File System.
4. Semantic web data management: The semantic web allows data to be shared
and reused across applications, enterprises, and community boundaries. It
uses RDF as a flexible data model and ontologies to represent data semantics.
Here also, one deals with large amounts of data.
5. Information retrieval: Information retrieval includes searching for
information within documents and for metadata about documents. There
may be a large number of documents to process, for which column-stores
can be used.
6. Scientific datasets: Scientific datasets contain huge amounts of data from
which conclusions must be drawn. Column-stores can thus help process the
data faster when proving something experimentally.
Chapter 4
PostgreSql Column-Store Design
In this chapter we give a brief idea of how the vertical partitioning approach [1, 3]
is implemented in PostgreSql to provide the column-store feature. A number of
DDL and DML queries are implemented, and the required design modifications
to PostgreSql are proposed as follows:
4.1 Creating a Column-store Relation
For every column-store relation, internal relations are created, one for each
attribute of the relation, and a unique identifier (Oid) is allotted to every internal
table created. No table is created with the name of the mentioned relation; the
internal tables corresponding to the attributes are created directly, thus saving
one unique identifier.

Each internal relation consists of two columns, <record_id, attribute>, where
the record_id column is common across the internal tables of the same relation
and acts as a unique identifier of the tuple as a whole. For creating a column-store
table, users are given a colstore option: a new keyword colstore is included in the
create query, so the new query looks like

Create colstore table table-name (attr1 datatype, attr2 datatype ...);

All other PostgreSql queries remain syntactically intact. Along with the
internal tables, we propose to create a sequence and a view corresponding to the
main relation. A sequence is a database object which can generate unique integers
sequentially. The reason for creating an internal sequence is that record_id must
be incremented internally whenever data is inserted into the relation; the sequence
value is simply incremented internally and passed along with the values sent by
the user.
A view is basically a stored query; it does not take much space, since only the
view definition is stored inside the database, not the data.

Figure 4.1: Creating Colstore Table

Whenever any changes are made in the tables present in the view, those changes
are automatically reflected in the result, since the query stored for the view is fired
every time the view is called. This feature helps whenever we want the join of all
internal tables (for example, select * from table_name ...). So a logical view is
formed which takes the natural join of all internal tables on the basis of record_id.
When the join of all internal tables is not required, we propose to join only those
internal tables whose corresponding attributes are mentioned in the query as a
predicate or non-predicate, rather than using the existing view. The names of the
internal tables, sequence and view are made unique by concatenating their
respective unique identifiers (Oids) to their names, so ambiguity in names is easily
avoided. It is also ensured that a user cannot create any row-store or column-store
relation with the name of an existing relation.
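As a concrete sketch of what one `create colstore table` command expands into, the following SQLite snippet builds the per-attribute internal tables, a stand-in for the sequence (SQLite has no sequence objects, so a one-row counter table is used), and the logical view joining everything on `record_id`. The relation name `t`, its attributes, and the Oid suffixes are illustrative assumptions, not the actual names PostgreSql would generate.

```python
import sqlite3

# What "create colstore table t (a int, b text)" could expand into,
# with illustrative names and Oid suffixes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

attrs = [("a", "INTEGER"), ("b", "TEXT")]
oid = 16384  # hypothetical starting object identifier

internal = {}
for name, typ in attrs:
    tab = f"t_{name}_{oid}"  # unique name: relation + attribute + Oid
    cur.execute(f"CREATE TABLE {tab} (record_id INTEGER, {name} {typ})")
    internal[name] = tab
    oid += 1

# Stand-in for the internal sequence that generates record_id values.
cur.execute("CREATE TABLE t_seq (last_value INTEGER)")
cur.execute("INSERT INTO t_seq VALUES (0)")

# Logical view: natural join of all internal tables on record_id.
join = " NATURAL JOIN ".join(internal.values())
cur.execute(f"CREATE VIEW t_view AS SELECT * FROM {join}")

names = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master ORDER BY name")]
print(names)
```

No object named plain `t` is created; only the internal tables, the sequence stand-in, and the view exist, matching the design above.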
4.2 Meta data For the Column-store Relation
Now that the internal tables and sequence have been created, there is a need to
store meta-data of the relation, i.e. to identify which internal tables correspond
to which column-store relation. For storing this mapping between the relation
and its internal tables, a new system table named pg_map is created. One more
system table, named pg_attrnum, is created for storing the sequence id generated
for each column-store relation. Every entry in these two system tables is uniquely
identified by a key: for pg_map, <relation-name, attribute number> forms the
key, whereas for pg_attrnum, <relation-name> forms the key. These two system
tables are shown in figures 4.2 and 4.3. When a column-store table is created, the
respective values are entered into these tables. Whenever the user fires any type
of query on a column-store relation, these system tables are accessed. System
caches are built for both relations so that searching these tables is faster
[29, 31, 32]; the tables are searched on the basis of the unique key explained
above. Similarly, when a column-store relation is dropped, all internal tables and
the corresponding sequence are dropped, and the corresponding system table
entries are deleted. Users cannot directly modify the newly created system tables,
so they remain consistent.

In chapter 2, we saw various system catalogs describing row-store relations,
functions, operators etc. Similarly, for the column-store we define new system
catalogs describing a column-store relation.
i. pg_map: For every attribute of a column-store relation, an internal
row-store table is created. Therefore, for each attribute of a column-store
table, a mapping is stored between [colstore table, attribute] and the
corresponding row-store table created. So, for each attribute relation
(row-store relation) created, a row is stored containing its object identifier,
its name, the corresponding attribute number, and the name of the
column-store relation.

ii. pg_attrnum: For each column-store table, there is an entry in this system
table which stores the number of attributes and the object identifiers of the
sequence and view created.
Figure 4.2: System Table pg_map for storing mapping between relation and internal tables
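The two catalogs can be modelled in memory to show how lookups and drops would use them. The field names and Oid values below are assumptions for illustration, not the actual PostgreSql catalog structs; the keys, however, follow the design above: `(relation, attribute number)` for pg_map and the relation name for pg_attrnum.

```python
# Illustrative in-memory model of the two new system catalogs
# (field names and Oids are made up; the keys match the design).
pg_map = {
    ("item", 1): {"internal_table": "item_id_16384", "attname": "id"},
    ("item", 2): {"internal_table": "item_price_16385", "attname": "price"},
}
pg_attrnum = {"item": {"num_attrs": 2, "seq_oid": 16386, "view_oid": 16387}}

def internal_table_for(relation, attnum):
    # Catalog lookup performed when rewriting a query on a colstore table.
    return pg_map[(relation, attnum)]["internal_table"]

def drop_colstore(relation):
    # Dropping a relation must also delete all of its catalog entries.
    n = pg_attrnum.pop(relation)["num_attrs"]
    for attnum in range(1, n + 1):
        del pg_map[(relation, attnum)]

looked_up = internal_table_for("item", 2)
print(looked_up)  # item_price_16385
drop_colstore("item")
print(pg_map, pg_attrnum)  # both catalogs are now empty
```

This mirrors the invariant stated above: once a colstore relation is dropped, no stale entries remain in either catalog.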
4.3 Inserting data into Column-store table
All modifications for executing these data manipulation queries are done at the
query-tree formation stage.

Figure 4.3: System Table pg_attrnum for storing meta-data of Column-Store Table

The values passed by the user for insertion are taken as a list in PostgreSql [7].
In a row-store, this list is inserted directly into the required table; for the
column-store, the list is broken into separate column values, and each value is
passed to the corresponding internal table along with the next unique sequence
value generated. Since each internal table has two attributes <record_id,
attribute>, the value of the attribute column is given as input by the user, while
the input for record_id is generated internally using the unique sequence created
for the relation; the sequence value is incremented on every insert. If a select
clause is present within the insert, it is processed and the generated expression
list is broken up and distributed in the same way. In this way, even if the user
fires only one insert statement, multiple insert statements are generated and
processed internally.
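The insert-splitting step can be sketched as follows, with hypothetical internal table names and a plain counter standing in for the sequence: one user-level value list is broken column-wise, each piece tagged with the same fresh record_id.

```python
import sqlite3

# Sketch of splitting one user-level insert into one insert per internal
# table (illustrative names, following the <record_id, attribute> layout).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_a (record_id INTEGER, a INTEGER)")
cur.execute("CREATE TABLE t_b (record_id INTEGER, b TEXT)")

next_record_id = 0  # stand-in for the internal sequence

def colstore_insert(values):
    # Break the user's value list column-wise, tagging every piece with the
    # next sequence value so the pieces can be re-joined later.
    global next_record_id
    next_record_id += 1
    a, b = values
    cur.execute("INSERT INTO t_a VALUES (?, ?)", (next_record_id, a))
    cur.execute("INSERT INTO t_b VALUES (?, ?)", (next_record_id, b))

colstore_insert((10, "x"))
colstore_insert((20, "y"))
rows = cur.execute(
    "SELECT a, b FROM t_a NATURAL JOIN t_b ORDER BY a").fetchall()
print(rows)  # [(10, 'x'), (20, 'y')]
```

Each user insert became two internal inserts, yet the natural join on record_id reassembles the original tuples exactly.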
4.4 Altering the Column-store table
Adding or dropping a column from a column-store means creating or deleting an
internal table, respectively. So, if the user wants to add a column, an internal
table is created corresponding to the mentioned relation and the system tables
are updated accordingly. The case of dropping a column is similar.
4.5 Dropping the Column-store table
When a column-store relation is dropped, all its internal tables are dropped, along
with the view and sequence created at table-creation time. Most importantly, the
system tables are freed of the entries corresponding to the relation being dropped.
4.6 Selecting data from Column-Store table
The select query is the one whose performance column-stores are expected to
improve. In row-stores, even when only some of the attributes are required, all
irrelevant attributes are accessed, which increases the execution time of the query.
Using column-stores, only the attributes present in the select query as a predicate
or non-predicate are accessed, which reduces execution time compared to
row-stores [8]. This concept is implemented in our column-store implementation
in PostgreSql. Since internal tables are created for every attribute of a
column-store relation, we have to take the join of the required internal tables to
produce the result: because the result of any query is entity-oriented, this tuple
construction is required. The join is a natural join taken on the basis of the
common key defined earlier, record_id. To join only the required internal tables,
we first identify the attributes present in the select query as either predicates or
non-predicates, then find the internal table names for those attributes, take their
natural join on the common key record_id, and form a join node. We add this
join node to the from-clause list. This process is repeated for every relation
present in the from clause entered by the user; note that only the internal tables
corresponding to a single relation are joined together. In this way, irrelevant
attributes of any relation are never accessed: the internal tables corresponding to
irrelevant attributes are simply not included in the join. Only when all attributes
are needed (for example, select * from table-name ...) is the view accessed
directly, rather than adding internal tables one by one to the from clause; the
view is simply the natural join of all internal tables of the mentioned colstore
relation.

Figure 4.4 explains the various stages of a column-store table to which a select
query is applied and makes the scenario clearer.
Figure 4.4: Taking join of internal tables
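The from-clause rewrite described above can be sketched with SQLite. The internal table names follow the relation_attribute convention used in this chapter, and the data is made up; the key point is that only the internal tables of the mentioned attributes enter the join, while unrelated internal tables (here `part_p_size`) are never touched.

```python
import sqlite3

# Sketch of the select-query rewrite: join only the internal tables whose
# attributes appear in the query (illustrative names and data).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE part_p_partkey (record_id INTEGER, p_partkey INTEGER)")
cur.execute("CREATE TABLE part_p_brand (record_id INTEGER, p_brand TEXT)")
cur.execute("CREATE TABLE part_p_size (record_id INTEGER, p_size INTEGER)")
for rid, (pk, brand, size) in enumerate([(1, "Brand#13", 5),
                                         (2, "Brand#42", 7)], start=1):
    cur.execute("INSERT INTO part_p_partkey VALUES (?, ?)", (rid, pk))
    cur.execute("INSERT INTO part_p_brand VALUES (?, ?)", (rid, brand))
    cur.execute("INSERT INTO part_p_size VALUES (?, ?)", (rid, size))

def rewrite_from_clause(relation, attrs):
    # Replace the colstore relation with a natural join of only those
    # internal tables whose attributes appear in the query.
    return " NATURAL JOIN ".join(f"{relation}_{a}" for a in attrs)

# Query mentions p_partkey (non-predicate) and p_brand (predicate);
# part_p_size never appears in the rewritten from clause.
frm = rewrite_from_clause("part", ["p_partkey", "p_brand"])
sql = f"SELECT p_partkey FROM {frm} WHERE p_brand = 'Brand#13'"
result = cur.execute(sql).fetchall()
print(result)  # [(1,)]
```
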
Let us see what difference this approach makes in the query plan of a SELECT
query such as the following:
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem, part
where
p_partkey = l_partkey and p_brand = 'Brand#13'
group by
avg_yearly
order by
avg_yearly;
In this SELECT query, p_partkey, l_partkey and p_brand are predicates: a
predicate is an attribute in the query on which some condition is applied.
l_extendedprice is a non-predicate: a non-predicate is an attribute in the query
which is to be projected. These attributes are considered while taking the natural
join of the corresponding internal tables. Here, two natural joins of internal tables
are taken: one of lineitem_l_extendedprice with lineitem_l_partkey, and another
of part_p_partkey with part_p_brand.

It can be seen in figure 4.6 that the lineitem table has been replaced with the
names of the internal tables whose corresponding attributes are present in the
query. l_partkey is present as a predicate and l_extendedprice as a non-predicate
in the select query, so before forming the query plan from the query tree, we
modify the query tree by replacing lineitem with the natural join of
lineitem_l_extendedprice and lineitem_l_partkey, so that the internal tables are
used for accessing the data. Similarly, the part table is replaced by the natural
join of part_p_partkey and part_p_brand. Now consider the consequences of
doing this. Looking at the plan tree, one might say that an unnecessary join
overhead has been introduced for the lineitem and part tables. But as the
datasets of lineitem and part grow, the performance of the row-store degrades,
because irrelevant columns are also accessed, as can be seen in figure 4.5. Suppose
part contains 9 columns, lineitem contains 16 columns, and the number of rows is
in the thousands. Then in the column-store we just have to consider the 4
internal tables corresponding to the 4 attributes, form 2 join nodes by taking 2
natural joins, and then take their hash join. In the row-store, we have to take the
hash join of 2 tables with 25 attributes in total, one having 16 attributes and the
other having 9. This overhead of the row-store keeps increasing with the number
of columns and the amount of data inserted into the tables. Thus, our approach
works very well on big data with a large number of attributes.

Figure 4.5: Plan Tree for SELECT query in Row-Store
4.7 Summary
In this chapter, the design and architectural details of our column-store
implementation in PostgreSql are described. The create-table query is modified
as required. Insertion is quite slow compared to a row-store, but for large datasets
high insert performance is not required: data loading in a warehouse is not a
time-critical operation. The select query is modified for the column-store in such
a way that its performance improves by a large factor for analytical queries.
Figure 4.6: Plan Tree for SELECT query in Col-Store
Chapter 5
Experiments and Results
The main aim of our work is to improve the performance of the SELECT query.
Write queries like insert, update and delete are very slow in column-stores
compared to row-stores. On a small dataset, the results of SELECT queries in
the column-store are poor, as expected, because of the number of join operations
performed for each relation. But on large datasets, the column-store gives
excellent results for select queries which access a small number of attributes; our
implementation is basically for large datasets, as mentioned in chapter 3. For
evaluating the performance of our implementation, we use the TPC-H benchmark
[14, 15, 16, 25]. The dataset size taken for our analysis is 5000 tuples per table.
The schema diagram of the dataset is given below:
For each attribute of each table shown in figure 5.1, an internal table is created
in our implementation of the column-store database. First, we check the
performance of the select query as the number of columns accessed is gradually
increased. The lineitem table has 16 attributes and the orders table has 9
attributes. Hence, we start by selecting 2 attributes, one from each table, and
gradually increase this number to 25, observing the performance of the SELECT
query. This gives an exact idea of how the SELECT query behaves in the
column-store.
Figure 5.1: E-R Diagram of TPC-H Benchmark Dataset

Table 5.1 shows that as the number of attributes approaches the maximum
possible value, the execution time keeps increasing. Up to 8 attributes, the
execution time of the column-store is less than that of the row-store; in fact, the
column-store execution time is excellent up to that point. Another point to
consider is that orders has 9 attributes, fewer than the 16 of lineitem, so the
increase in execution time is not always in the same proportion. It can be seen
that when the number of attributes is increased from 5 to 6, there is a sudden
increase in execution time, because accessing one more attribute from orders
affects execution time more than accessing one more attribute from lineitem. The
graph is plotted as shown in figure 5.2.
From these results, it is concluded that if about one third of the attributes are
accessed, the performance of the column-store is very good compared to the
row-store, provided two conditions are satisfied: first, the number of attributes
per table must be high; second, the dataset size should be in the range of
thousands, lakhs or more. The more attributes the tables have and the larger the
dataset, the lower the execution time in the column-store relative to the row-store.

Now let us compare the performance of the row-store against the column-store
with the help of some TPC-H benchmark [20] queries. We have considered
Table 5.1: Experimental results for simple select query

Number of Attributes | Execution Time (sec) | Execution Time (sec)
Accessed             | Row-Store            | Column-Store
 2                   |   257.462            |   128.731
 3                   |   257.326            |   128.899
 4                   |   259.526            |   153.923
 5                   |   258.694            |   168.932
 6                   |   260.090            |   208.639
 7                   |   268.338            |   226.282
 8                   |   270.112            |   264.680
 9                   |   273.445            |   288.565
15                   |   280.778            |  8543.667
25                   |   290.199            | 20899.542
Figure 5.2: Comparison of Column-Store vs. Row-Store
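The crossover visible in Table 5.1 can be checked with a quick calculation over a few of the measured times (copied from the table): the column-to-row ratio stays below 1 up to 8 attributes and exceeds 1 from 9 attributes onward.

```python
# Quick check of the crossover in Table 5.1 (times in seconds,
# copied from the table).
row_store = {2: 257.462, 8: 270.112, 9: 273.445, 25: 290.199}
col_store = {2: 128.731, 8: 264.680, 9: 288.565, 25: 20899.542}

# Ratio < 1 means the column-store is faster.
ratios = {n: col_store[n] / row_store[n] for n in row_store}
print(ratios)
assert ratios[2] < 1 and ratios[8] < 1    # column-store wins up to 8 attrs
assert ratios[9] > 1 and ratios[25] > 1   # row-store wins beyond that
```

At 2 attributes the column-store is roughly twice as fast, while at 25 attributes it is about 70 times slower, matching the discussion above.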
those queries which are suitable for column-oriented databases, i.e. queries which
access few attributes. For queries which access a large number of attributes,
performance will certainly be worse than in row-stores. We have considered the
following 10 queries for evaluating performance.
1. Select
l_returnflag, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc
from
lineitem
group by
l_returnflag;
Row-Store: 12.015 sec
Column-Store: 7.0 sec
2. Select
n_name, sum(l_extendedprice) as revenue
from
nation, lineitem, region
where
r_name = 'AFRICA'
group by
n_name
order by
revenue;
Row-Store: 16.086 sec
Column-Store: 1.527 sec
3. Select
c_name, sum(l_quantity)
from
customer, orders, lineitem
where
c_custkey = o_custkey
and o_orderkey = l_orderkey
group by
c_name;
Row-Store: 13.087 sec
Column-Store: 9.14 sec
4. Select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem, part
where
p_partkey = l_partkey
and p_brand = 'Brand#13';
Row-Store: 12.978 sec
Column-Store: 8.284 sec
5. Select
min(ps_supplycost)
from
lineitem, supplier, nation, region, part, partsupp
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AMERICA';
Row-Store: 14.847 sec
Column-Store: 1.739 sec
6. Select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_quantity < 25;
Row-Store: 12.936 sec
Column-Store: 2.638 sec
7. Select
l_shipmode,
sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH'
then 1 else 0 end) as high_line_count,
sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH'
then 1 else 0 end) as low_line_count
from
lineitem, orders
group by
l_shipmode;
Row-Store: 326.421 sec
Column-Store: 162.005 sec
8. Select
100.00 * sum(case when p_type like 'PROMO%' then
l_extendedprice * (1 - l_discount) else 0 end) /
sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem, part;
Row-Store: 388.471 sec
Column-Store: 193.018 sec
9. Select
l_suppkey
from
lineitem
where
l_shipdate >= date '1994-08-01'
and l_shipdate < date '1994-08-01' + interval '3' month
group by
l_suppkey;
Row-Store: 12.959 sec
Column-Store: 6.507 sec
10. Select
substring(c_phone from 1 for 2) as cntrycode
from
customer
where
substring(c_phone from 1 for 2) in ('40', '41', '33', '38', '21', '27', '39');
Row-Store: 0.021 sec
Column-Store: 0.018 sec
Chapter 6
Conclusion and Future Work
6.1 Conclusion
Read queries applied to huge datasets perform poorly in row-stores due to their
tuple-by-tuple storage structure, whereas column-stores give very good
performance for such queries. In PostgreSql, a column-store could not be built
from scratch because of its row-oriented structure, so we decided to implement a
column-store on top of the row-store. Designing a column-store on top of a
row-store is a great challenge, because the modifications must be made at the
proper stages of query processing to get the best performance improvement over
the row-store. In our work, we investigated various approaches for implementing
a column-store on top of a row-store and found vertical partitioning preferable to
the others due to its lower complexity and lack of restrictions on the possible
read queries. We studied the architecture of PostgreSql and, after understanding
its intricacies, found the query-tree formation stage most suitable for
modification. This thesis discussed the design and architecture of the
column-store database system along with its implementation in PostgreSql.

The results show that the performance of our column-store implementation is
much higher than that of the row-store for queries which access few attributes;
the relations should also contain a large number of attributes. As the number of
columns accessed increases, the performance of the column-store degrades, as
expected, because the number of joins of internal tables increases, which increases
execution time; the same case would be very efficient in a row-store. But the idea
behind column-stores is to use them for the specific applications described in
chapter 3.

The main focus of this thesis was to implement a column-store in PostgreSql
in a systematic manner so that the performance of read-oriented queries becomes
better than in the row-store. We compared row-store and column-store
performance on the TPC-H benchmark; the results clearly show that
column-stores are better than row-stores in the cases we expected.
6.2 Related Work
Column-stores are a growing area of research, and a lot of work has been done in
this area recently. Daniel J. Abadi, Samuel R. Madden and Nabil Hachem
investigated which is better, row-store or column-store [1, 3, 4, 5], and whether
there is something fundamental about the way column-stores are architected
internally, or whether the performance gains they obtain are also possible in
conventional systems. They built a column-store system from scratch, called
C-Store, in which they implemented all the optimizations possible for a
column-store, such as compression, late materialization, positional joins and
block iteration, and showed that column-stores owe their performance
improvements to their structure. Daniel J. Abadi, Daniel S. Myers and David J.
DeWitt studied the materialization strategies possible for the execution of select
queries. Miguel C. Ferreira, Daniel J. Abadi and Samuel R. Madden studied the
compression schemes possible in column-stores, their implementation and their
benefits.
Halverson et al. built both a vertically partitioned Shore database (by modifying
the schema of Shore) [11] and a column-store from scratch. They found that if
an optimization called "super tuples" is applied to vertical partitioning, it
becomes competitive with a column-store built from scratch. The super-tuple
concept avoids duplicating header information and groups many tuples together
in a block, which reduces the overheads of the fully vertically partitioned scheme.
The fractured mirrors approach [13] implements a hybrid row/column store based
on a simple observation: a column-store is optimized for read queries and a
row-store for write queries. So the row-store primarily processes updates and the
column-store primarily processes reads, while a background process running in
parallel migrates data from the row-store to the column-store. They concluded
that tuple overheads are a significant problem here, and that prefetching large
blocks of tuples from disk improves tuple reconstruction times.
6.3 Future Work
One very useful extension to this work is to pack many tuples together into
page-sized "super tuples" [11]. This avoids duplicating header information and
lets many tuples be processed together as a block; the super-tuple design uses a
nested iteration model, which ultimately reduces CPU overhead and disk I/O.
Accessing a single tuple, however, becomes difficult. Since our implementation is
application-specific, it can be assumed that access to individual tuples would
rarely be required. Compression techniques [4] could also be applied to the
stored data, not to save disk space but to increase performance by operating on
compressed data. Compression is especially effective in column-stores, since
similar data is stored contiguously on disk: data of the same attribute is of the
same data type.
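The storage saving that motivates super tuples can be sketched with a back-of-the-envelope model. The header size, tuple size, and tuples-per-block figures below are assumptions chosen for illustration, not Shore's or PostgreSql's actual values; the point is that one shared header per block replaces one header per tuple.

```python
# Rough model of the "super tuple" saving (all sizes are assumed values):
# many tuples share one header per page-sized block instead of one each.
HEADER_BYTES = 24        # assumed per-tuple / per-block header size
TUPLE_BYTES = 8          # assumed payload size per tuple
TUPLES_PER_BLOCK = 100   # assumed block capacity

def storage_bytes(n_tuples, super_tuples):
    if super_tuples:
        blocks = -(-n_tuples // TUPLES_PER_BLOCK)  # ceiling division
        return blocks * HEADER_BYTES + n_tuples * TUPLE_BYTES
    return n_tuples * (HEADER_BYTES + TUPLE_BYTES)

plain = storage_bytes(10_000, super_tuples=False)
packed = storage_bytes(10_000, super_tuples=True)
print(plain, packed)  # header overhead drops from 240000 to 2400 bytes
```

Under these assumed numbers, total storage falls from 320,000 to 82,400 bytes, which is the kind of reduction that makes the vertically partitioned scheme competitive.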
Bibliography
[1] Daniel J. Abadi, Samuel R. Madden, Nabil Hachem, Column-Stores vs.
Row-Stores: How Different Are They Really?, SIGMOD '08, Vancouver, BC,
Canada, June 2008.
[2] Mike Stonebraker, D. J. Abadi, C-Store: A Column-oriented DBMS, 31st
VLDB Conference, Trondheim, Norway, 2005.
[3] D. J. Abadi, Query execution in column-oriented database systems, MIT PHD
Dissertation, PhD Thesis, 2008.
[4] D. J. Abadi, S. R. Madden, and M. Ferreira, Integrating and execution in
column-oriented database systems, SIGMOD, pages 671-682, 2006.
[5] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden, Materialization
strategies in a column-oriented DBMS, ICDE, pages 466-475, 2007.
[6] Stavros Harizopoulos (HP Labs), Daniel Abadi (Yale), Peter Boncz (CWI),
Column-Oriented Database Systems, VLDB Tutorial 2009.
[7] http://www.postgresql.org/.
[8] S. Harizopoulos, V. Liang, D. J. Abadi, and S. R. Madden, Performance trade-
offs in read-optimized databases, VLDB, pages 487-498, 2006.
[9] G. Graefe, Efficient columnar storage in B-trees, SIGMOD Rec., 36(1), pages
3-6, 2007.
[10] A. Weininger, Efficient execution of joins in a star schema, SIGMOD, pages
542-545, 2002.
[11] A. Halverson, J. L. Beckmann, J. F. Naughton, and D. J. DeWitt, A Compar-
ison of C-Store and Row-Store in a Common Framework, Technical Report
TR1570, University of Wisconsin-Madison, 2006.
[12] C-Store source code, http://db.csail.mit.edu/projects/cstore/.
[13] R. Ramamurthy, D. DeWitt, and Q. Su, A case for fractured mirrors,
VLDB, pages 89-101, 2002.
[14] TPC-H, http://www.tpc.org/tpch/.
[15] TPC-H benchmark with postgresql, http://www.fuzzy.cz/en/articles/
dss-tpc-h-benchmark-with-postgresql/.
[16] TPC-H Result Highlights Scale 1000GB, http://www.tpc.org/tpch/
results/tpch_result_detail.asp?id=107102903.
[17] Daniel J. Abadi, Column-Stores for Wide and Sparse Data, Conference on
Innovative Data Systems Research (CIDR), 2007.
[18] G. Copeland and S. Khoshafian, A decomposition storage model, SIGMOD,
pages 268-279, 1985.
[19] S. Harizopoulos, V. Liang, D. J. Abadi, and S. Madden, Performance tradeoffs
in read-optimized databases, VLDB, pages 487-498, 2006.
[20] TPC-H Result Highlights Scale 1000GB. http://www.tpc.org/tpch/
results/tpch_result_detail.asp?id=107102903
[21] Rakesh Agrawal, Amit Somani, and Yirong Xu, Storage and Querying of
E-Commerce Data, VLDB, 2001.
[22] Gennady Antoshenkov, David B. Lomet, and James Murray, Order preserving
compression, ICDE '96, IEEE Computer Society, pages 655-663, 1996.
[23] Balakrishna R. Iyer and David Wilhite, Data compression support in
databases, VLDB '94, pages 695-704, 1994.
[24] Patrick O'Neil and Dallan Quass, Improved query performance with variant
indexes, SIGMOD, pages 38-49, 1997.
[25] Patrick E. O'Neil, Elizabeth J. O'Neil, and Xuedong Chen, The Star Schema
Benchmark (SSB), http://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[26] A. Zandi, Balakrishna R. Iyer, and Glen G. Langdon Jr., Sort order preserving
data compression for extended alphabets, Data Compression Conference, pages
330-339, 1993.
[27] Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft, Compressing re-
lations and indexes, ICDE '98, pages 370-379, 1998.
[28] Vertica, http://www.vertica.com/.
[29] Peter Boncz, Stefan Manegold, and Martin Kersten, Database architecture
optimized for the new bottleneck: Memory access, VLDB, pages 54-65, 1999.
[30] Setrag Khoshafian, George Copeland, Thomas Jagodits, Haran Boral, and
Patrick Valduriez, A query processing strategy for the decomposed storage
model, ICDE, pages 636-643, 1987.
[31] Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz, Super-Scalar
RAM-CPU Cache Compression, ICDE, 2006.
[32] J. Zhou and K. A. Ross, Buffering database operations for enhanced instruc-
tion cache performance, SIGMOD, pages 191-202, 2004.