International Journal of Information Technology Vol. 22 No. 1 2016
Keh Kok Yong, Hong Hoe Ong
Accelerative Technology Lab MIMOS Berhad
Kuala Lumpur, Malaysia [email protected], [email protected]
Vooi Voon Yap
Department of Electronic Engineering Universiti Tunku Abdul Rahman
Perak, Malaysia [email protected]
Abstract
The world is growing rapidly more connected, with sensors and devices that continuously report their geographic locations. Data analytics industries are seeking ways to store this data and to turn the raw data into valuable information for business intelligence services. The result is a flood of granular data about our world, and crucially, this flood has outpaced traditional computing's capacity to process and analyze it. This reveals potential economic benefits and opens a compelling new research area that requires sophisticated mechanisms and technologies to meet the demand. Over the past decade, there have been attempts to use accelerators alongside multicore CPUs to speed up large-scale data computation. We propose an SQL-like query accelerator, Mi-Galactica, and extend the system by offloading geo-spatial computation to GPU devices. Query operations execute in parallel, drawing support from high-performance, energy-efficient NVIDIA Tesla technology. Our results show a significant speedup.
Keywords: Geospatial, Graphics Processing Units, Database Query Processing, Big Data, Cloud
GPU SQL Query Accelerator
Keh Kok Yong, Hong Hoe Ong and Vooi Voon Yap
I. Introduction
The world is growing rapidly more connected, with sensors and devices that continuously report their geographic locations. Such location-aware data is referred to as spatial data. Gartner reported that Cisco had projected 50 billion connected objects, and the Digital Universe study by EMC estimated that 44 trillion gigabytes of data will be collected by the year 2020 [1]. Data analytics industries are seeking ways to store these spatial datasets and to turn the raw data into valuable information for business intelligence services. The value of spatial data is already evident: DataSift uses collected social media data to predict consumer actions, and Facebook uses an accumulation of 350 million daily photo uploads for deep learning in image recognition [2]. Importantly, the demand for speedy computation with appealing visualization is crucial to success. This reveals potential economic benefits and opens a compelling new research area that requires sophisticated mechanisms and technologies to meet the demand.
A Graphics Processing Unit (GPU) is no longer used only to optimize image filtering and video processing; it is also widely adopted to accelerate big data analytics in scientific, engineering, and enterprise applications. Jern et al. use the GPU to accelerate texture-based geographic mapping, exploiting its rendering performance [3]. Lai et al. use two recent accelerator technologies, Kepler GPUs and Intel MIC, to accelerate geospatial applications; their parallel implementation shows massive speedup and strong scalability on a cluster [4]. Various recent studies have shown that the GPU delivers unprecedented application acceleration when compute-intensive tasks are offloaded to it [5] [6] [7]. Ultra-fast analytic applications are crucial to driving business success through quick and accurate decisions mined from massive data.
Over the past decade, GPUs have taken a leading role in high performance computing. Their evolution from fixed-function graphics components into fully programmable, powerful co-processors working alongside the CPU has allowed cheaper, more energy-efficient, and faster supercomputers to be built. Fan et al. use a cluster of GPUs with 30 worker nodes to develop a parallel flow simulation using the lattice Boltzmann model (LBM) [8]. Titan was the first major supercomputing system to utilize a hybrid architecture, pairing 16-core AMD Opteron CPUs with NVIDIA Tesla K20 GPU accelerators for scientific computation such as climate change simulation, nuclear energy modelling, nanoscale analysis of materials, and other disciplines. The top-ranked energy-efficient supercomputers in the world, TSUBAME-KFC, Wilkes and HA-PACS, use NVIDIA's Kepler GPUs along with high-speed network interconnects such as InfiniBand. These facilities allow computational scientists and researchers to address the world's most challenging computational problems up to several orders of magnitude faster.
However, some studies point out that using the GPU as a general-purpose computing device has limitations [9] [10]. The fundamental problem of data transfer between CPUs and GPUs is a cause of great concern to the accelerator community: the ultra-high-speed computation provided by the GPU may not compensate for the I/O latency experienced on the PCIe bus. Offloading may even turn out to be more expensive when the parallel computation is not complex enough, as more time is spent transferring data to and from the GPU than on computation. Despite this shortcoming, various empirical studies and experiments have shown that the GPU is highly energy efficient and has contributed to significant performance breakthroughs across the computing industry.
To exploit current GPU computing capabilities for database operations, we have to take the hardware characteristics into consideration when designing parallel algorithms. The main processor, the CPU, is also needed to direct the overall workflow. We propose and implement a GPU query accelerator called Mi-Galactica using CUDA, and benchmark its performance on the NVIDIA Tesla Kepler architecture against standard PostgreSQL and various distributed Hadoop systems. The detailed requirements for this accelerator and for parallel query processing in our work are:
• Partitioning data into fine-grained chunks for parallel processing and reduced I/O access
• Applying compression and decompression mechanisms to speed up data I/O over the PCIe transfer
• Maximizing the use of single-instruction-multiple-data execution to optimize the degree of parallelism for query operations
• Ensuring that the GPU implementation yields a significant acceleration over one based solely on the CPU
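The first of these requirements, partitioning a column into fine-grained chunks, can be sketched as follows (a simplified Python illustration; the function name and chunk size are ours, not part of Mi-Galactica's implementation):

```python
def partition_column(values, chunk_size):
    """Split a column of records into fixed-size chunks that can be
    processed independently.  Each chunk can be handed to a separate
    worker or GPU transfer, reducing peak memory use and allowing
    I/O to overlap with computation."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

column = list(range(10))              # a toy column of 10 records
chunks = partition_column(column, 4)
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```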
The paper is organized as follows. Section 2 reviews related work on database accelerators. Section 3 briefly discusses parallel CUDA programming on the GPU and the architecture of the NVIDIA Tesla Kepler technology. Section 4 presents the implementation of the proposed GPU query accelerator with ESRI GIS software. Section 5 discusses the experiments and performance results. Finally, Section 6 concludes and discusses future work.
II. Database Accelerator
Database systems are extremely important to a wide array of industries, and there have been tremendous changes in the hardware technologies used to accelerate database operations. Well-known emerging technologies such as the GPU and the FPGA (Field-Programmable Gate Array) have driven an evolution in parallelism, compilation, and I/O reduction, producing far more efficient systems. Govindaraju's experiments ran several common query operations over millions of records stored in a database using an NVIDIA GeForce FX5900 [11], showing the GPU to be an effective co-processor for database operations. Mueller used FPGAs to accelerate data processing [12]; this work opened up interesting opportunities for heterogeneous many-core implementations. In addition, these hardware accelerators offer significant benefits in power consumption.
Recent work on FPGA query accelerators has attempted various approaches to parallelizing data processing. Glacier [13] implements a set of streaming operators as composable digital circuits, backed by a library of compositional hardware modules; each circuit executes a specific query. Woods [14] presents an FPGA framework for static complex event detection; this research looks to move more complex data processing to FPGAs as a means of enhancing classical CPU-based architectures. Netezza [15], [16] provides a pipeline of DMA, Decompress, Project, and Restrict computing engines. It reduces the amount of data accessed by performing projection and restriction using data from previously requested tables, and it hides slow I/O transfer latency through compression and decompression of data. However, the FPGA raises immense challenges for developers: it is generally more complicated and difficult to implement and debug, and thus has not gained a large foothold in the market.
A large body of research has investigated the acceleration of database systems using NVIDIA GPUs with CUDA programming. Bakkum implemented a GPU-accelerated database called the Virginian Database¹ [17]. It is based on SQLite and implements a subset of commands that execute directly on the GPU; it also uses GPU-mapped memory with efficient caching, so it can process data sets that exceed the GPU's physical memory. CoGaDB² is a column-oriented, GPU-accelerated database management system. It designs a co-processing scheme for GPU memory by caching the working set of data, and it minimizes the performance penalty by representing data as a list of tuple identifiers rather than the complete data, reducing transport between CPU and GPU [18]. Heimel's approach³ [19] integrates GPU-assisted query optimization for real-valued range queries, based on kernel density estimation, into PostgreSQL. It uses OpenCL because the open standard allows it to be ported easily to other accelerator devices. PG-Strom⁴ develops a Foreign Data Wrapper module for PostgreSQL and offloads sequential scans over massive data to the GPU; it also takes advantage of the GPU's massively parallel computation capabilities for numerical calculation. Todd and Sam built a massively parallel database (MapD⁵) to handle big data analytics for an almost boundless number of interactive socio-economic queries; it is applied to a geospatial visualization tool that can probe and inspect more than a billion tweets worldwide. This has set a new emerging trend for database management systems.
¹ https://github.com/bakks/virginian
² http://wwwiti.cs.uni-magdeburg.de/iti_db/research/gpu/cogadb/
³ https://bitbucket.org/mheimel/gpukde/
⁴ https://wiki.postgresql.org/wiki/PGStrom
⁵ http://www.map-d.com/
A plenitude of research on query-related parallel algorithms has cultivated the development of GPU databases. Red Fox executes relational operators in a GPU-parallel manner [20]. [21], [22] investigate GPU acceleration of indexing, scan, and search operations; [23] examines aggregation as an important computational building block; and [24] focuses on optimizing GPU sort. These studies have significantly raised awareness of using GPUs in big data analytics businesses. It is our belief that the GPU can be beneficial for query processing and widely deployable in big data analytics for database systems. Such a GPU query accelerator has to be carefully designed around parallel data structures, harmonizing the processing between CPU and GPU.
III. Graphics Processing Unit
In this section, we first discuss the background of GPUs and introduce NVIDIA's Kepler architecture. Next, we describe how threads and blocks work in the Kepler architecture. Finally, we discuss its memory hierarchy.
A. Background
GPUs first gained popularity with the rise of 3D gaming in the mid-1990s, and demand for ever more powerful and energy-efficient GPUs has been increasing ever since. The growing computational power of GPUs has attracted many researchers to use them for more general-purpose computing. NVIDIA realized the potential of GPUs for general computing and released CUDA (Compute Unified Device Architecture) in 2006 so that the research community could leverage the power of the large number of streaming processors in GPUs. GPUs nowadays power a large range of industries, from supercomputers to embedded systems.
The latest NVIDIA GPU architecture, Tesla Maxwell, was introduced in Q3 2015; these new cards focus on the deep learning sector. This paper, however, is based on the Kepler architecture, which includes many improvements over its predecessor, Fermi. With this architecture, a single GPU die can contain up to 2880 CUDA cores. Kepler also introduced new features such as Dynamic Parallelism, Hyper-Q, the Grid Management Unit, and NVIDIA GPUDirect, along with an enhanced memory subsystem offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned, substantially faster DRAM I/O implementation. The principal design goal of the Kepler architecture, improved power efficiency, has been met with these new features.
B. Grid, Blocks, Threads and Warps
The CUDA programming model introduces the concepts of threads, blocks, and grids, which execute GPU code called kernels. These threads, blocks, and grids run on the GPU's streaming multiprocessors (SMXs) in groups called warps. Figure 1 shows examples of threads, blocks, and grids. From a programmer's perspective, one only needs to handle the thread, block, and grid assignments and the kernel programming; the hardware manages how the threads, blocks, and grids are mapped onto the SMXs and warps.

In CUDA, all threads in the same grid execute the same kernel function, but each thread generally handles different data. This style of programming model is known as Single Instruction Multiple Data (SIMD). On the Kepler architecture, a block can contain up to 1024 threads in total (with per-dimension limits of 1024 in x and y, and 64 in z), and the maximum number of blocks in the x dimension of a grid is 2³¹ − 1.
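The grid sizing implied by these limits is typically computed by ceiling division over the element count. A minimal sketch (the function name and default block size are illustrative; only the 1024-threads-per-block limit from the text is checked):

```python
def launch_config(n_elements, threads_per_block=256):
    """Compute a 1-D launch configuration: enough blocks of the given
    size to cover n_elements, using ceiling division."""
    if not 1 <= threads_per_block <= 1024:   # Kepler limit: 1024 threads per block
        raise ValueError("threads_per_block must be in 1..1024")
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

blocks, threads = launch_config(1_000_000, 256)
# blocks == 3907, threads == 256  (3907 * 256 = 1,000,192 >= 1,000,000)
```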
Previously, on the Fermi architecture, once a kernel had been launched its grid and block dimensions could not be changed. On the Kepler architecture, the programmer is allowed to launch further grids and blocks from within a kernel, enabling a more flexible programming model. This feature is called Dynamic Parallelism.
A warp is the unit of thread scheduling in the SMXs. Once a block is assigned to an SMX, it is divided into warps; each warp contains up to 32 threads on the Kepler architecture. The threads in a warp run in parallel, executing the same instruction. To keep warps efficient, branch divergence, which occurs when threads inside a warp take different execution paths, should be avoided as much as possible.
Figure 1: Threads, Blocks and Grids
Figure 2: Hierarchy of GPU memory
C. Memory Hierarchy
There are four levels in the memory hierarchy of NVIDIA GPUs, as shown in Figure 2. The first level is register memory, which is local to the CUDA cores and has a total size of 64 KB; it is the fastest memory of all the memory types in the SMX. The second level consists of Shared Memory, the L1 cache, and the read-only data cache. These memories are located very close to the SMX cores and are shared among the 192 CUDA cores of the SMX; Shared Memory is typically used for communication among the threads of a block. The third level is the L2 cache, and the fourth level is the DRAM, which serves as the main storage on the GPU and is used to send and read data in bulk from the CPU's memory.
On the Fermi architecture, Shared Memory and the L1 cache can be configured as 48 KB of Shared Memory with 16 KB of L1 cache, or vice versa. The Kepler architecture allows additional flexibility by also permitting a 32 KB / 32 KB split between Shared Memory and L1 cache. The read-only data cache is also new in Kepler. Previously, programmers had to go through the texture unit to obtain cached read-only loads, a method with many limitations. The benefit of the read-only data cache is that it uses a separate load path with its own footprint, apart from the Shared/L1 cache, and it supports full-speed unaligned memory access patterns.
IV. Implementation
A. Overview of Mi-Galactica
Mi-Galactica is an SQL-like query accelerator. Four major components form the system: the Connector, Preprocessor, Scheduler, and Query Engine, as shown in Figure 3. The Connector enables Mi-Galactica to communicate with PostgreSQL and MySQL, handling frontend application interaction, data extraction, and data interchange; it also supports processing of comma-separated values (CSV) files. The Scheduler is an internal task engine that manages user workloads. The Query Engine analyzes each query, performing basic parsing and positioning operations, and produces an execution plan; the plan is then further adjusted by analyzing and tracing parallelizable points and rearranging the execution order of clause objects. The Mi-Galactica execution engine performs the accelerated query execution on either the CPU or the GPU. Source data in the database is transformed and output to a parallel, columnar, accessible storage system. These components are designed to run on energy-efficient commodity GPU accelerators, empowering the system to tackle big data challenges.
Mi-Galactica adopts effective techniques from the previous studies [17], [18], [19], [20] and [21] on query co-processing of heterogeneous workloads. Figure 4 shows the architectural design for coupled CPU-GPU architectures. The system is designed to support plug-ins for acceleration components, enabling customization; this eases the addition of new features, improves developer productivity, and also reduces the size of the application. Plug-in functionality is implemented using shared libraries installed in a location prescribed by Mi-Galactica.
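As a rough analogue of this plug-in mechanism (Mi-Galactica loads native shared libraries; this hypothetical Python sketch substitutes importable modules for illustration, and the names `load_plugin` and `EXPORTED_OPS` are ours):

```python
import importlib
import sys
import types

def load_plugin(module_name, registry):
    """Load an acceleration plug-in by module name and register the
    operations it exports; mirrors, in spirit, loading a shared
    library from a prescribed plug-in directory."""
    mod = importlib.import_module(module_name)
    for op_name in getattr(mod, "EXPORTED_OPS", []):
        registry[op_name] = getattr(mod, op_name)
    return mod

# Register a toy in-memory plug-in module for demonstration.
demo = types.ModuleType("demo_plugin")
demo.EXPORTED_OPS = ["double"]
demo.double = lambda x: 2 * x
sys.modules["demo_plugin"] = demo

registry = {}
load_plugin("demo_plugin", registry)
# registry["double"](21) == 42
```

The design choice is the same in both worlds: the host discovers the plug-in by name, asks it which operations it provides, and records them in a dispatch table, so new accelerated operators can be added without rebuilding the core.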
Figure 3: Mi-Galactica Four Major Components
Figure 4: Mi-Galactica Architecture
Figure 5: GPU Columnar File System
Source data in the database requires a preprocessing stage that converts the data into a parallelizable file structure, the GPU FS (File System), stored in a column-based orientation, as shown in Figure 5. Data can then be accessed independently, computed in parallel, and exploited by multithreaded CPU processing. Each column is segmented into multiple files, and the size of each segment is customizable, so the CPU and GPU have a sufficient amount of memory to compute larger data sets; it also allows the GPU to process each column segment independently. Note, however, that a change to the data in the database does not automatically trigger an update of the preprocessed data, which must be re-created, or complemented when only new data is added. The CudaSet is a parallel file structure that improves parallel geo-spatial processing jobs on the GPU. Rather than a legacy array-of-structures (AOS) design, which loses bandwidth and wastes L2 cache memory, Mi-Galactica uses the CudaSet representation to arrange data in a structure-of-arrays (SOA) access pattern. This gains high throughput by coalescing memory accesses on the GPU, which is critical for memory-bound kernel functions. The required structure elements can be loaded individually, with no data interleaving, as shown in Figure 6; thus it achieves high global memory performance.
Figure 6: SOA CudaSet Structure
The overhead of data transfer is an important factor: it is the bottleneck when fetching data for GPU computation. Mi-Galactica uses compression to alleviate this performance issue, compressing the data into a smaller size to reduce I/O and offloading the work to the GPU; data processing is restructured with a co-processing scheme suited to the given database architecture. Two compression schemes are implemented on the GPU. First, a scheme for the integer data type, based on Zukowski's PFOR-Delta [25], stores differences instead of actual values: only the delta between subsequent values is kept, and a bit-packing mechanism further optimizes storage by using just enough bits for each element. Second, a string compression scheme for character and text data types is based on the Lempel-Ziv (LZ77) algorithm [26] with dynamic representation and expression matching; it encodes and decodes redundant data in a fine-grained parallel fashion with a flexible representation. The key to efficiency is fast retrieval over the compressed data on the CPU, with the lookup process offloaded to the GPU.
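The delta-plus-bit-packing idea behind the integer scheme can be sketched as follows. This is a simplified illustration: full PFOR-Delta also stores outliers separately as exceptions, which is omitted here, and the helper names are ours:

```python
def delta_encode(values):
    """Store the first value plus successive differences (the 'delta'
    part of PFOR-Delta); sorted or slowly varying columns yield small
    deltas that compress well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by a running sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def bits_needed(deltas):
    """Minimum bit width to pack every (non-negative) delta -- the
    bit-packing step uses just enough bits per element."""
    return max(d.bit_length() for d in deltas) or 1

column = [1000, 1003, 1007, 1010, 1012]
deltas = delta_encode(column)            # [1000, 3, 4, 3, 2]
assert delta_decode(deltas) == column
width = bits_needed(deltas[1:])          # 3 bits per delta instead of full integers
```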
The query engine comprises both CPU and GPU phases. The CPU phase is in charge of parsing clauses into objects: it identifies the required data sources, translates the operations into low-level instruction sets, then arranges the execution sequence and dispatches it for execution. A SQL query parser is implemented using a combination of Bison and Flex. Workloads can be CPU-related, GPU-related, or both; the CPU initializes GPU contexts, prepares input data, launches GPU kernel functions, materializes query results, and controls the steps of query progress. The GPU phase executes specific optimized kernel functions, mostly aggregate and compute-intensive ones, processing the data on thousands of cores at once. The GPU-computed operations involve select, sort, projection, joins, and basic aggregation, implemented with a mixture of our in-house accelerated parallel processing library, Mi-AccLib⁶, and open-source libraries such as NVIDIA Thrust⁷ and CUDPP⁸.
The Scheduler is responsible for managing received queries. Queue processing is implemented across a pool of worker threads on the CPU, controlling the concurrency level and the intensity of resource contention. A resource monitor collects the current usage status of the GPU devices, and the scheduler uses this information to assign tasks to the available GPUs. At this stage the CPU performs an important role in concurrent queueing: data can safely be added by one thread and joined or removed by another without corruption. The scheduler also maintains an optimal concurrency level and workload on the GPUs, and a data-swapping mechanism maximizes the effective utilization of GPU device memory. Through these processes, resource utilization improves. This implementation uses a mixture of APIs (Application Program Interfaces) from the Boost⁹ libraries.
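The concurrent queueing across a pool of worker threads can be sketched in Python (an illustrative standard-library analogue; Mi-Galactica's own scheduler is built on Boost, and the names below are ours):

```python
import queue
import threading

def run_scheduler(tasks, n_workers=4):
    """Thread-safe work queue: tasks are added by the dispatching
    thread and consumed by worker threads without corrupting shared
    state, bounding the concurrency level at n_workers."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:          # sentinel: no more work for this worker
                q.task_done()
                return
            with lock:                # protect the shared results list
                results.append(task())
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    for _ in threads:                 # one sentinel per worker
        q.put(None)
    q.join()                          # wait until every queued item is done
    return results

out = run_scheduler([lambda i=i: i * i for i in range(8)], n_workers=2)
# sorted(out) == [0, 1, 4, 9, 16, 25, 36, 49]
```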
Mi-Galactica optimizes concurrency through a pipelining mechanism that overlaps data transfer over the PCIe bus with arithmetic computation. CUDA streams execute asynchronously, queuing work and returning control to the CPU immediately. Pinned memory is often adopted in certain query implementations; it allows the Direct Memory Access (DMA) engine to achieve a higher percentage of peak bandwidth. Hyper-Q in the Kepler architecture lets multiple CPU threads launch work on the GPU simultaneously, thereby greatly raising GPU utilization and reducing CPU idle time.

⁶ Mi-AccLib is the MIMOS accelerated library, consisting of high-speed multi-algorithm search engines for text processing, a data security engine, and video analytics engines, http://atl.mimos.my/
⁷ Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity, https://developer.nvidia.com/thrust
⁸ CUDPP is the CUDA Data Parallel Primitives Library, http://cudpp.github.io/
⁹ Boost is a set of libraries for the C++ programming language that provide support for tasks and structures, http://www.boost.org/
B. Interfacing Mi-Galactica with ESRI ArcGIS
There are two methods by which Mi-Galactica interacts with ESRI Geographic Information System software: web-based GIS and geodatabase management applications, such as ArcGIS Desktop, FileGDB, and the ArcGIS JavaScript API. Both methods adopt a standard database system to view, store, query, and analyze the contained geo datasets; users may choose a database such as Oracle, PostgreSQL, Microsoft SQL Server, or others. ArcGIS transforms the geographical computation into a set of SQL statements and channels them to Mi-Galactica for parallel computation. Once completed, the result set is returned to the ArcGIS application, which renders the map visualization. Our in-house ODBC database connector diverts the SQL operations to Mi-Galactica.
V. Experiment Results
In this section, we report our experimental results and analysis. We focus on execution time, comparing Mi-Galactica, which utilizes a GPU accelerator, with a CPU-based system, Apache Spark, one of the fastest big data processing engines on the market at the moment. We measure the execution period while adding a fixed number of data records for each test run; the dataset ranges from 1 to 20 million rows of records. In addition, we compare the data preprocessing operation on both systems. A heat map is generated from location data at 2-hour intervals for every day of the month.
A. Hardware and Operating System
We performed the experiments on a single NVIDIA Tesla K20c GPU compute device, launched in 2013, with 2496 CUDA cores and 5 GB of GDDR5 RAM. The workstation is an HP Z800 containing dual-socket Intel Xeon X5680 CPUs with a total of 12 cores at a 3.33 GHz clock rate, 32 GB of DDR3 RAM, a 1 TB hard disk, and an ATI FireMV 2260 as the display device. For software, it runs Microsoft Windows 7 (64-bit), Spark version 1.3.0, and CUDA 7 (Release Candidate).
B. Experiment 1: Data Preprocessing
The computation time of data preprocessing is tested with various data sizes. The raw input data is stored in CSV format and needs to be processed and imported into the corresponding data warehouse system: for Spark it is converted into the columnar Parquet storage format, while for Mi-Galactica it is stored in the GPU-accessible columnar format. Figure 7 shows the execution time of data preprocessing; the timing includes data transfer between the hard disk, CPU, and GPU memory.

Data preprocessing is not a compute-intensive operation, yet it requires heavy I/O data transfer between CPU and GPU memory. In addition, Spark has minimal startup overhead when processing the raw CSV files without data compression enabled. As observed in Figure 7, the processing speed of Spark eventually overtakes Mi-Galactica, as Spark exploits CPU multithreading to prepare these CSV files.
Figure 7: Result of Data Preprocessing
C. Experiment 2: Heat Map Generation
A series of processes produces a heat map, which represents the geographic density of features on a map; the colored areas represent points that form layers with a large number of features. Completing the process and visualizing the map requires certain ArcGIS toolboxes, such as the Density toolset and the Spatial Analyst and Statistics toolboxes. These toolboxes generate the SQL statement and pass it to the database system for execution. Figure 8 shows a sample SQL statement for the heat map.
Figure 8: Sample of Heat Map SQL statement
The speedup is calculated as (Spark execution time) / (Mi-Galactica execution time). The results show that Mi-Galactica outperformed Spark, achieving speedups between 4x and 18x when executing the SQL statement, as shown in Figure 9. As observed, Mi-Galactica's speedup decreases as the number of rows increases. This is due to the handling of I/O (Input/Output) movement between CPU and GPU memory without an efficient streaming mechanism at this stage; the queries also lack computation complex enough to maximize GPU resource utilization, so the optimization effort is lost in transferring data between CPU and GPU. Nevertheless, Mi-Galactica's performance is still good enough to provide timely visualization of geo-location data. Figure 10 shows a snapshot of the final heat map visualization.
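The reported metric can be stated as a one-line calculation (the timings below are illustrative placeholders, not measured values from our experiments):

```python
def speedup(spark_seconds, migalactica_seconds):
    """Speedup as defined above: Spark execution time divided by
    Mi-Galactica execution time for the same query."""
    return spark_seconds / migalactica_seconds

# Hypothetical timings for illustration only:
assert speedup(18.0, 1.0) == 18.0
assert speedup(8.0, 2.0) == 4.0
```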
Figure 9: Heat map query execution
Figure 10: Visualization of Heat Map
VI. Conclusion
We have presented Mi-Galactica, a GPU query accelerator that assists geo-spatial data computation via the SQL statements generated by ESRI software. It can be applied to any data analytic operation expressed in SQL, with data cleansing as one example application. The results show that a GPU-based solution is a viable alternative for big data processing. In addition, our GPU query accelerator approach facilitates seamless integration with other frontend applications via a database connector. It allows users to exploit the power of the GPU by distributing work efficiently across GPU cores with regard to I/O access and compression. In the future, we will extend the system to execute in a distributed, multi-node computation environment to support bigger datasets. Furthermore, Mi-Galactica strives towards full support of the SQL standard with parallel accelerated query processing.
VII. References
[1] C. McLellan, “The internet of things and big data: Unlocking the power,” ZDNet, Mar-2015.
[2] C. Smith, “Social Media’s New Big Data Frontiers -- Artificial Intelligence, Deep Learning,
And Predictive Marketing,” Business Insider Australia, 2014.
[3] M. Jern, T. Astrom, and S. Johansson, “GeoAnalytics Tools Applied to Large Geospatial
Datasets,” in Information Visualisation, 2008. IV ’08. 12th International Conference, 2008,
pp. 362–372.
[4] C. Lai, M. Huang, X. Shi, and H. You, “Accelerating Geospatial Applications on Hybrid
Architectures,” in High Performance Computing and Communications 2013 IEEE
International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013
IEEE 10th International Conference on, 2013, pp. 1545–1552.
[5] J. Zhang and S. You, “CudaGIS: Report on the Design and Realization of a Massive Data
Parallel GIS on GPUs,” in Proceedings of the Third ACM SIGSPATIAL International
Workshop on GeoStreaming, 2012, pp. 101–108.
[6] C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, “GPU accelerated item-based
collaborative filtering for big-data applications,” in Big Data, 2013 IEEE International
Conference on, 2013, pp. 175–180.
[7] S. K. Prasad, M. McDermott, S. Puri, D. Shah, D. Aghajarian, S. Shekhar, and X. Zhou, “A
Vision for GPU-accelerated Parallel Computation on Geo-spatial Datasets,” SIGSPATIAL
Spec., vol. 6, no. 3, pp. 19–26, Apr. 2015.
[8] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, “GPU Cluster for High Performance
Computing,” in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, 2004, p.
47–.
[9] L. A. S. Gomes, B. S. Neves, and L. B. Pinho, “Empirical Analysis of Multicore CPU and
GPU-Based Parallel Solutions to Sustain Throughput Needed by Scalable Proxy Servers for
Protected Videos,” in Computer Systems (WSCAD-SSC), 2012 13th Symposium on, 2012, pp.
49–56.
[10] K. E. Niemeyer and C.-J. Sung, "Recent progress and challenges in exploiting graphics processors in computational fluid dynamics," J. Supercomput., vol. 67, pp. 528–564, 2014.
[11] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, “Fast Computation of
Database Operations Using Graphics Processors .” ACM, New York, NY, USA, 2005.
[12] R. Mueller, J. Teubner, and G. Alonso, “Data Processing on FPGAs,” Proc. VLDB Endow.,
vol. 2, no. 1, pp. 910–921, Aug. 2009.
[13] R. Mueller, J. Teubner, and G. Alonso, “Glacier: A Query-to-hardware Compiler .” ACM,
New York, NY, USA, pp. 1159–1162, 2010.
[14] L. Woods and G. Alonso, “Fast data analytics with FPGAs .” pp. 296–299, Apr-2011.
[15] F. D. Hinshaw, J. K. Metzger, and B. M. Zane, “Optimized database appliance,” 2011.
[16] P. Francisco, “The Netezza Data Appliance Architecture: A Platform for High Performance
Data Warehousing and Analytics ,” 2011.
[17] P. Bakkum and K. Skadron, “Accelerating SQL database operations on a GPU with CUDA,”
in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics
Processing Units, 2010, pp. 94–103.
[18] S. Breß and G. Saake, “Why It is Time for a HyPE: A Hybrid Query Processing Engine for
Efficient GPU Coprocessing in DBMS,” Proc. VLDB Endow., vol. 6, no. 12, pp. 1398–1403,
Aug. 2013.
[19] M. Heimel and V. Markl, “A First Step Towards GPU-assisted Query Optimization,” Proc.
VLDB Endow., pp. 33–44, 2012.
[20] H. Wu, G. Diamos, S. Baxter, M. Garland, T. Sheard, M. Aref, and S. Yalamanchili, "Red Fox: An Execution Environment for Relational Query Processing on GPUs," in Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2014, pp. 44:44–44:54.
[21] F. Beier, T. Kilias, and K.-U. Sattler, “GiST Scan Acceleration Using Coprocessors .” ACM,
New York, NY, USA, pp. 63–69, 2012.
[22] K. K. Yong and E. K. Karuppiah, “Hash match on GPU,” in IEEE International Conference
on Open Systems, ICOS 2013, 2013, pp. 150–155.
[23] T. Lauer, A. Datta, Z. Khadikov, and C. Anselm, “Exploring Graphics Processing Units As
Parallel Coprocessors for Online Aggregation .” ACM, New York, NY, USA, pp. 77–84,
2010.
[24] D. Cederman and P. Tsigas, “GPU-Quicksort: A Practical Quicksort Algorithm for Graphics
Processors,” J. Exp. Algorithmics, vol. 14 . ACM, New York, NY, USA, pp. 4:1.4–4:1.24,
Jan-2010.
[25] M. Zukowski, S. Heman, N. Nes, and P. Boncz, “Super-Scalar RAM-CPU Cache
Compression .” p. 59, Apr-2006.
[26] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” Inf. Theory,
IEEE Trans., vol. 23, no. 3, pp. 337–343, May 1977.