www.elsevier.com/locate/dsw
Decision Support Systems 38 (2005) 621–628
Optimizing disk storage to support statistical analysis operations
Allen Parrish*, Susan Vrbsky, Brandon Dixon, Weigang Ni
Department of Computer Science, The University of Alabama, Box 870290, Tuscaloosa, AL 35487-0290, USA
Received 1 September 2001; accepted 30 September 2003
Available online 11 December 2003
Abstract
Data stored in spreadsheets and relational database tables can be viewed as "worksheets" consisting of rows and columns,
with rows corresponding to records. Correspondingly, the typical practice is to store the data on disk in row major order. While
this practice is reasonable in many cases, it is not necessarily the best practice when computation is dominated by column-based
statistics. This short note discusses the performance tradeoffs between row major and column major storage of data in the
context of statistical data analysis. A comparison of a software package utilizing column major storage and one using row major
storage confirms our results.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Database systems; Data warehouses; Statistical tools; Business intelligence tools; Data mining
1. Introduction
A worksheet is a collection of rows and columns of
data. Spreadsheets and relational database tables both
represent examples of worksheets. Worksheets asso-
ciated with such applications are stored in persistent
form on disk. In addition, there are a number of
statistical analysis applications that operate in a sim-
ilar fashion to spreadsheets but with more advanced
statistical operations. Such applications also manipu-
late worksheets of data that are stored on disk.
Most commercial applications do not publish their
disk data storage formats. However, we believe that
the vast majority of worksheet-based applications
store their data on disk in row major order (i.e., with
each row stored contiguously on the disk). This would be consistent with the row-centric nature of
such data, in the sense that each row typically
represents a record aggregating a set of information
regarding some type of semantic entity. In particular,
rows are often inserted or deleted as a unit, and to
store rows contiguously is most efficient in the
context of record-based updates.
While updates tend to be row-oriented, other
operations tend to be column-oriented. In particular,
many statistical analysis computations are column-
centric and may be applied in many cases to a
relatively small number of columns. In the context
of column-centric operations, the idea of storing
worksheets in column major order (i.e., with each
column stored contiguously on the disk) has some
merit. In general, the performance tradeoffs between
row major and column major storage are not entirely
clear in the context of column-centric operations.
In this short note, we examine the performance
tradeoffs between row major and column major disk
worksheet storage in the context of column-centric
statistical analysis operations. In particular, we show
that different types of access and update patterns favor
one approach over the other. However, under certain
realistic conditions, column major storage is much
faster than row major storage. Given that the under-
lying record-based data is typically row-centric, we
believe this result is not considered in typical com-
mercial practice.
Column major storage requires the use of some
form of indexing whenever a subset of the data is to
be analyzed. Without indexing (or a variant thereof),
column major storage is prohibitively expensive when
accessing a subset of the rows. We introduce a variant
of indexes which we call filters. Filters provide
efficient data access in column major order but
without the transaction-driven overhead of indexing.
Filters represent a lightweight implementation of
indexes that is sufficient for general statistical pro-
cessing of subsets and are suitable for either row
major or column major order. We discuss performance
tradeoffs when processing subsets of data in the
context of filters. In addition, we show the results of
a simple empirical comparison between statistical
packages using column major storage versus row
major storage.
2. Worksheet storage models
Our overall conceptual model is quite simple. As
noted in Section 1, a worksheet is a matrix consisting
of rows and columns. We allow for the possibility that
the number of rows may differ from column to column.
We can refer to cells in each row and column via
indices (e.g., row i, column j). Fig. 1 contains an
example of a worksheet.
As discussed in the Introduction, row major order
involves storing worksheet rows contiguously on
disk; column major order involves storing columns
contiguously on disk. While these schemes have been
classically regarded for many years as alternatives for
in-memory array storage [3], they have not typically
been discussed in the context of disk storage, which is our concern here.
Simple implementations of row major and column
major order are as follows. For row major order (Fig.
2), each row could be stored as a logical record within
a flat file, with some kind of separation delimiter
between the various columns. In the example each
row is a proper subset of a physical block. Fig. 2
illustrates row major storage of the worksheet in Fig. 1.
For column major order (Fig. 3), the above file is
"inverted," much as if it were a standard matrix. In
particular, the ith physical block contains the ith col-
umn of the worksheet. In the example, each column is a
subset of its physically containing block. Fig. 3 illus-
trates column major storage of the worksheet in Fig. 1.
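To make the two layouts concrete, the sketch below writes one small worksheet to disk both ways. The comma delimiter, file names, and text encoding are illustrative assumptions on our part, not the format of any particular package.

```python
# Minimal sketch of the two layouts for the worksheet model above,
# assuming a comma-delimited text encoding (real packages use binary,
# block-aligned formats).

worksheet = [
    [1, 18, 301],   # row 0
    [2, 25, 302],   # row 1
    [3, 40, 303],   # row 2
]

# Row major order: each row is stored contiguously.
with open("rowmajor.dat", "w") as f:
    for row in worksheet:
        f.write(",".join(str(v) for v in row) + "\n")

# Column major ("inverted") order: each column is stored contiguously.
with open("colmajor.dat", "w") as f:
    for j in range(len(worksheet[0])):
        f.write(",".join(str(row[j]) for row in worksheet) + "\n")
```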
3. Performance analysis
We briefly consider a simple performance compar-
ison between the row major and column major
approaches. In conducting this analysis, we first as-
sume that the worksheet is rectangular (that is, that
every column has an equal number of rows and every
row has an equal number of columns). As such,
adding or deleting a row requires adding or deleting
values for all columns. Similarly, adding or deleting a
column requires adding or deleting values for all
rows. This rectangularity assumption is typical for
databases, although it is less necessary for statistical
packages.
We focus principally on the cost of performing a
simple statistical computation on the worksheet. We
choose the frequency distribution (FREQ) operation
as representative of a class of statistical operations
(such as minimum, maximum, sum, average, median,
mode, etc.) which can be applied to one or more
columns of the worksheet. (While FREQ could also
be applied to rows, we assume that it is predominantly
of interest on columns. This is consistent with the
arguments made in Section 1 regarding the row-
centric nature of most data.) FREQ counts the number
of occurrences of each value in the column and
produces a table and/or histogram representing the
distribution. Although many statistical operations are
more complex than FREQ, their disk and memory
costs are proportional to FREQ when the two storage
schemes are compared. In particular, each of these
operations requires access to the values for each row
for one particular column as well as maintenance of a
single value in memory.
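As a point of reference, a frequency distribution over one column is simply a count of occurrences of each value; the sketch below is a generic, storage-agnostic illustration of FREQ, not the disk-level implementation analyzed next.

```python
from collections import Counter

def freq(column_values):
    """FREQ: count occurrences of each value in a single worksheet column."""
    return Counter(column_values)

# Example on a small categorical column.
print(freq(["A", "B", "A", "C", "A", "B"]))
# Counter({'A': 3, 'B': 2, 'C': 1})
```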
We assume that there are n rows and k columns in
the worksheet. Moreover, to simplify the analysis, we
assume that the block size is exactly k. Thus, with row
major storage, each row consumes exactly one phys-
ical block. However, for column major storage, a
column may span multiple blocks. These assumptions
are realistic in that for most data sets of significant
size, the number of rows (records) will be much larger
than the number of columns (attributes). Hence, a
column is going to span multiple blocks, while a row
will be much closer to the size of a single block.
With column major storage, ⌈n/k⌉ disk reads are required to perform FREQ on a single column. The number of disk reads required to perform FREQ on f variables with inverted storage is therefore ⌈n/k⌉·f. On the other hand, exactly n reads are required to perform FREQ when row major storage is used, regardless of the size of f. Note that as f increases, the time requirement for column major storage approaches the time requirement for row major storage. In particular, when f = k, exactly n disk reads are required in both cases.
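Collecting the read counts from this paragraph in one place (block size k, n rows, f analyzed columns):

```latex
\[
\text{reads}_{\text{column major}}(f) = f \left\lceil \frac{n}{k} \right\rceil ,
\qquad
\text{reads}_{\text{row major}} = n \quad (\text{independent of } f),
\]
\[
\text{so that } f = k \;\Rightarrow\; f \left\lceil \frac{n}{k} \right\rceil = k \left\lceil \frac{n}{k} \right\rceil = n
\quad \text{whenever } k \text{ divides } n .
\]
```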
Because statistics may often be computed on a
small number of columns relative to the size of the
worksheet, column major storage is more efficient
than row major storage on column-oriented statistical
operations. Indeed, when the number of columns
involved in the computation is truly small, column
major storage is much more efficient than row major
storage. Of course, this assumes that statistical oper-
ations are entirely column-based in accordance with
the model described in Section 1.
When considering other operations (such as addi-
tions and deletions that modify the worksheet), column
major storage becomes less desirable. In particular,
with the rectangularity constraint, rows and columns
must be added and deleted as a whole. With row major
storage, adding or deleting a row will require only one
disk access. With column major storage, however,
adding or deleting a row will require k disk accesses
(because each column for a given row appears in a
separate block). Because adding a row is a common
database operation, column major storage is likely to
be a disadvantage in a transaction-based environment
where the worksheet is frequently modified in a row-
based fashion. Indeed, with a complete RDBMS
system, it is likely that row major storage is most
efficient for database tables because of the support for
transactions as well as for joins and other general
relational operations. Of course, if the rectangularity
constraint is eliminated, then additions and deletions
can occur one value at a time (rather than a whole row
or column), not requiring multiple disk access for any
single change. Without the rectangularity constraint,
row major order offers no performance advantage in
the context of updates.
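The update asymmetry can be seen directly in how a new row lands on disk. The sketch below appends one row under each layout, assuming (our own convention) a single file for the row major worksheet and one file per column for the column major worksheet.

```python
def append_row_rowmajor(worksheet_path, row):
    # Row major: the whole record goes into one contiguous spot --
    # effectively a single disk access.
    with open(worksheet_path, "a") as f:
        f.write(",".join(str(v) for v in row) + "\n")

def append_row_colmajor(column_paths, row):
    # Column major: one value must be appended to each of the k column
    # files, so the write touches k separate blocks -- k disk accesses.
    for path, value in zip(column_paths, row):
        with open(path, "a") as f:
            f.write(str(value) + "\n")
```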
4. Subsets, indexes and filters
It is typical for most worksheet-based applications
to allow access to a subset of the rows in a particular
worksheet. Such subsets are typically based on Bool-
ean restrictions over a subset of the columns in the
row. With a conventional RDBMS, one can optionally
define indexes (which can be retained in files) in order
to provide fast access to a subset of records satisfying
particular sets of criteria [1]. Indexes are constructed
on one or more columns and may be used to effi-
ciently access records satisfying any Boolean criterion
over those columns.
Indexes may be used with row major or column
major worksheet storage. Indexes are particularly
geared toward transactional RDBMS systems and
require overhead to support transactions and relational
operations. As Section 3 discussed, column major
storage is not generally a good strategy for transac-
tional RDBMS systems. As such, we propose a variant
of the standard index that we call a filter. Filters are
applicable to row major or column major worksheet
storage and require less overhead than indexes. In
particular, a filter is a bit string of length n, where
the ith bit specifies in the obvious way whether the ith
row (record) is part of the analysis (‘0’ out, ‘1’ in).
Filters allow the specification of a complex Boolean
expression of dataset variables to identify subsets of
the data. Filters are applied during the analysis process
to specific files containing variables for which analysis
is desired by serving as a mask to control which values
are read. In particular, values are only read from files
corresponding to positions within the bitmap contain-
ing a ‘1’. Effectively, filters define specific subsets,
while indexes provide a mechanism to obtain fast
access to any subset whose membership criterion is
based on the columns used within the index.
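A filter, as just described, is no more than a length-n bit string built once from a Boolean restriction and then reused as a mask on each column scan. The helper names below are our own, and the sketch ignores blocking and disk layout.

```python
from collections import Counter

def build_filter(column_values, predicate):
    """Bit string over rows: 1 if the row satisfies the restriction, else 0."""
    return [1 if predicate(v) else 0 for v in column_values]

def apply_filter(filter_bits, column_values):
    """Yield only the column values whose rows the filter keeps."""
    for keep, value in zip(filter_bits, column_values):
        if keep:
            yield value

# Example: keep rows with age >= 18, then run FREQ on another column.
ages   = [15, 22, 34, 17, 40]
county = ["A", "B", "A", "C", "B"]
mask   = build_filter(ages, lambda a: a >= 18)   # [0, 1, 1, 0, 1]
print(Counter(apply_filter(mask, county)))       # Counter({'B': 2, 'A': 1})
```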
We note that indexes are optional with row major
order. Without an index, it is still possible to make a
single pass through the worksheet. When each row is
read, it can be analyzed at that point to determine
whether it is in the desired subset (i.e., on the fly).
However, with column major order, the lack of a filter
or index can increase the number of disk reads.
Without filters in column major order, each analysis
operation requires ⌈n/k⌉·j disk reads, where k is the
block size; n is the number of records; and j is the
number of variables to define the filter. Thus, if there
are f analysis operations, the total time required is
proportional to ⌈n/k⌉(f + j). As j becomes larger, this
extra cost becomes prohibitive, making filters very
important for implementing subset restrictions in the
context of column major order. Filters can also be
used as part of a row major storage implementation
as an alternative to more traditional indexing, with
similar performance benefits.
In order to conduct a fair comparison involving
subset restrictions between row major and column
major order implementations, we assume the use of
the same bitmap model of filters for both implemen-
tations as described above. We first consider filter
creation cost. If the number of variables in the
Boolean expression defining a filter is j and there
are n rows with block size k, then j⌈n/k⌉ disk reads are
required to construct a filter in the context of column
major storage. On the other hand, n disk reads are
required in a row major storage implementation.
We now consider the number of disk reads required
for both row major and column major storage struc-
tures when filters are involved. We first consider
column major storage. The best case for column major
occurs when the values in the subset defined by the
filter are clustered so as to minimize the number of
disk reads. For example, suppose there are g values in
the subset defined by the filter. Then if all g values are
contiguous, and g < k, then only one disk access is
required for each variable in the frequency analysis.
Thus, in a case where only one variable is involved in
the frequency analysis, it is possible to only require
one disk access for the entire analysis. The worst case
occurs if the g values are distributed, such that each
value occurs in a different block, in which case g disk
accesses are required for each of the f variables in the
frequency analysis (i.e., fg).
On the other hand, with row major storage, exactly g
disk accesses are required for any frequency analysis,
regardless of the number of variables in the analysis.
Thus, the worst case for row major storage is better than
the worst case for column major storage by a factor of f
(the number of variables in the analysis). Table 1 summarizes this comparison.

Table 1
Row major versus column major storage access with filters (number of disk reads)

             Column major storage   Row major storage
Best case    f                      g
Worst case   fg                     g
A performance comparison involving a complete
traditional implementation of indexing will be similar
to this one. The difference is that the cost of managing
the index is higher than that of managing the filter.
Because transactional systems are not amenable to
column major order, indexing mechanisms with their
associated overhead introduce unnecessary complex-
ity into this analysis.
In Fig. 4(a) and (b), we further illustrate the
performance differences between the row major and
column major storage methods by utilizing specific
values for the block size, database size, number of
columns, etc. The figures show the time to access the
data values in seconds, for varying numbers of data
values accessed. In these figures, we assume a data-
base with 100,000 rows and 200 columns, and we
assume a block (page) size equal to 200 data elements.
In the row major method, we assume that one record
of data fits on one block, and in the column major
method, we assume 200 data values fit per block
(requiring 500 blocks of data for each column). The
time to access the data is based on a fast disk access
rate of 10 ms to access one block of data.
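Under the parameters just stated (100,000 rows, 200 columns, 200 values per block, 10 ms per block read), the curves in Fig. 4 follow from simple block counting. The helper below models only those stated assumptions (how tightly "clustered" best-case values pack into blocks is our reading), not a measured system.

```python
import math

N_ROWS, BLOCK_VALUES = 100_000, 200        # rows; data values per block (page)
SEC_PER_BLOCK = 0.010                      # fast disk: 10 ms per block read
BLOCKS_PER_COLUMN = math.ceil(N_ROWS / BLOCK_VALUES)   # 500 blocks per column

def row_major_time(g):
    """g rows in the subset -> g record (block) reads, whatever the column count."""
    return g * SEC_PER_BLOCK

def col_major_worst_time(g, f):
    """Scattered subset: one block per value, capped at a column's 500 blocks,
    for each of the f analyzed columns."""
    return f * min(g, BLOCKS_PER_COLUMN) * SEC_PER_BLOCK

def col_major_best_time(g, f):
    """Clustered subset: roughly ceil(g / BLOCK_VALUES) blocks per analyzed column."""
    return f * math.ceil(g / BLOCK_VALUES) * SEC_PER_BLOCK

# e.g. 5,500 subset values, 1 column: 55 s row major vs. 5 s column major worst case.
print(row_major_time(5500), col_major_worst_time(5500, 1))
```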
Fig. 4(a) illustrates the number of seconds to access
the data for 1 to 5500 data items. The diagonal line
illustrates the time to access the data using a row
major storage method, and it ranges from 0.01 to 55 s.
As mentioned previously, with each additional data
value to access, the row major storage method
requires an additional record to be read, and hence,
results in an additional disk access. The best and worst
cases for the row major storage method are the same.
The results for the column major storage method
are also illustrated for the column major best case in
Fig. 4(a). In the best case, we assume that all of the
requested data values are clustered on the same pages.
The best case for column major storage with one
column is the lowest line in the graph. The time to
access the data for one column ranges from 0.01 to 0.11 s. The best case, when 10 columns are requested, is similar to the one-column line, with values ranging from 0.1 to 1.1 s.
We also illustrate the time to access the data for the
column major in the worst case, which occurs when
each data value requested is stored on a different
block. Fig. 4(a) shows the access time for 1 to 5500 data items for the worst case column major when 1 column, 2 columns, 5 columns and 10 columns are requested. The row major method is not affected by
the number of columns requested because the entire
record, with all of its columns, is accessed each time.
However, this is not the case for the column major
method. As illustrated in Fig. 4(a), the time to access
one column of the data is the same for the worst case
column major and the row major until more than 500
values are requested. At this point, all of the blocks for
the requested column have been accessed, and hence,
the time to access additional data values does not
increase for the column major and remains at 5 s. The
worst case column major for two columns is greater
than the row major method until more than 1000
values are requested, but then again it remains less
than the row major. A similar trend is illustrated for
the column major worst case when 5 and 10 columns
are accessed. We emphasize that Fig. 4(a) only illus-
trates the results for 5500 of the 100,000 data values
per column, and as the percentage of the data that is
included in the filter increases, the time to access the
data for the row major method continues to increase,
while the time for the worst case column major method
remains the same.
Fig. 4(b) illustrates the access time for large
numbers of columns (100 and 200) for accessing 1
to 100,000 data values for the row major, best case
column major and worst case column major meth-
ods. An average case would fall somewhere in
between the best and worst cases illustrated in Fig.
4. The best case results for 100 and 200 columns
remain significantly smaller than those for the row
major or worst case column major method. The
results for 100 columns when the worst case column
major is utilized are worse than the row major until
50,000 data values are accessed. When more than
50,000 data values are accessed, the row major method has the longer access time. The row major method has a shorter
access time than the column major worst case with
200 columns for all values until all 100,000 of the
data values are read.
Disk access is typically more time consuming than computation on the data. An
efficient implementation could hide the cost of pro-
cessing the data, but if not hidden, we would expect
that the time to process the data would increase
proportionally with the number of data items pro-
cessed for both row and column major methods.
However, we would still expect the disk access time
to dominate the time to process the data. We also
note that we expect an implementation of row major
and column major storage methods to use additional
efficient disk access strategies, such as reading more
than one page at a time (prefetch), to improve disk
access.
In summary, if the number of variables being
processed is small, then column major storage with
filters (or indexing) is the less expensive implemen-
tation in the context of subset restrictions. Filters (or
indexing) represent an essential mechanism for deal-
ing with subset restrictions with column major storage
and a useful mechanism with row major storage if a
large number of analysis operations are conducted
with a fixed subset.
5. A column major implementation
The Critical Analysis Reporting Environment
(CARE) is a software system developed for traffic
crash problem identification and evaluation [2].
CARE was originally developed to provide analyt-
ical processing of traffic crash report databases.
CARE has been developed principally as a stand-alone product for the Windows platform, although
a client–server version also exists that runs over the
World Wide Web. CARE stores its data entirely in
disk-based worksheets in column major order. Fil-
ters are used to support the retrieval of subsets.
CARE exclusively processes categorical data, that
is, data that are either nominal or ordinal, reflecting
the type of multiple-choice data typically recorded
on traffic crash reports. CARE implements a model
of filters (as defined in Section 4) to deal with data
subsets.
In this section, we compare the performance of
CARE with a well-known statistical package (SPSS)
in a limited illustration. This illustration, although
not a complete experiment, supports the analysis
provided in Section 4. As noted above, the CARE
data storage structure is in column major order;
although the storage organization for SPSS is not
published to our knowledge, we expect that it is row
major.
The same data files were used when running both
CARE and SPSS. The data consisted of an actual data
file from the Alabama Department of Transportation
containing Alabama traffic crash data. There were
137,713 rows and 217 columns in the table. We
initially accessed all of the data records in the table.
We then varied the number of columns on which to
perform the frequency counts as: 1, 10, 50, 100, 150
and 200 columns. Table 2 contains our results (in s) in computing frequency counts.

Table 2
Performance results (in s) from selecting 100% of the data records

         1 Column   10 Columns   50 Columns   100 Columns   150 Columns   200 Columns
CARE     0.1        1.1          4.4          8.8           14.1          19.3
SPSS     4.3        5.2          7.8          11.1          13.8          17.25
Note the table confirms our previous analysis in
Section 3. In particular, when analyzing only a small
number of columns, CARE’s performance (using
column major storage) is significantly better than
the row major storage implementation.
However, as more columns are added, CARE's relative performance worsens significantly; with 200 columns, CARE is actually slightly slower than SPSS.
We also examined the performance when a subset
of records was selected. We selected a subset of approximately 50% of the records, chosen by selecting specified data values for one of the columns. In CARE, filters were created on a column to select the desired subset, while in SPSS, the Select Cases option was used. Table 3 illustrates our results in this case.

Table 3
Performance results (in s) from selecting 50% of the data records

         1 Column   10 Columns   50 Columns   100 Columns   150 Columns   200 Columns
CARE     0.1        1.0          4.1          8.2           13.3          18.6
SPSS     3.7        4.8          6.3          8.7           10.1          11.5
Our results in Table 3 are generally similar to Table
2, where all of the records were utilized in the
analysis. Utilizing the Select Cases option of SPSS reduces
the total number of disk accesses to the number of
cases in the subset selected. However, in the case of
column major order, utilizing a filter in CARE will
reduce the number of disk accesses, but as discussed
in Section 4, one cannot predict by how much. With
200 columns in Table 3, SPSS is substantially faster
than CARE. However, CARE (with its column major
data storage) is still substantially faster than SPSS
when a small number of columns is used in the
analysis.
6. Conclusion
In this paper, we have compared the performance
of applications using disk-based data worksheets in
two storage formats: row major versus column major.
Our conclusion is that column major order is better
in situations where a small number of variables are
being analyzed, provided that updates are infrequent.
Because this is not the typical case in general
purpose database systems, the applicability of col-
umn major order is restricted to special purpose data
analysis applications such as our CARE system
presented in Section 5. While this is a very simple
concept, we see little or no evidence of column
major order even in the context of special purpose
applications. In cases where small numbers of vari-
ables but large numbers of data values are analyzed,
column major order provides substantial performance
improvements.
References
[1] R. Elmasri, S. Navathe, Fundamentals of Database Systems,
Addison-Wesley, Boston, MA, 2000.
[2] A. Parrish, B. Dixon, D. Cordes, S. Vrbsky, D. Brown, CARE:
a tool to analyze automobile crash data, IEEE Computer 36 (6)
(June 2003) 22–30.
[3] M. Scott, Programming Language Pragmatics, Morgan Kaufmann, San Francisco, CA, 2000.
Allen Parrish is an Associate Professor
in the Department of Computer Science
and Director of the CARE Research and
Development Laboratory at The Uni-
versity of Alabama. He received a PhD
in Computer and Information Science
from The Ohio State University in
1990. His research interests are in soft-
ware testing, software deployment,
data analysis and visualization, and
highway safety information systems.
His sponsors have included the National Science Foundation, the
Federal Aviation Administration, the National Highway Traffic
Safety Administration and a variety of highway safety-related state
agencies. He is a member of the IEEE Computer Society Educational
Activities Board.
Susan V. Vrbsky is an Associate Professor
in the Department of Computer Science at
The University of Alabama. She received
her PhD in Computer Science from The
University of Illinois, Urbana-Champaign
in 1993. Her research interests include
database systems, uncertainty and approx-
imations, real-time database systems and
mobile data management. Her sponsors
have included the National Science Foun-
dation, the Federal Aviation Administration
and the Alabama Department of Transportation.
Brandon Dixon is an Associate Professor in the Department of
Computer Science at The University of Alabama. He received a
PhD in Computer Science in 1993 from Princeton University. His
interests are in computer science theory, algorithm design and
software engineering. His current research sponsors include the
National Science Foundation, NASA, the Federal Aviation Admin-
istration and the Alabama Department of Transportation.
Weigang Ni is a PhD candidate in the Department of Computer
Science at The University of Alabama. He is expected to receive his
degree in December 2003. His research interests include real-time
databases and mobile databases.