FAST MAP PROJECTION ON CUDA
Yanwei Zhao
Institute of Computing Technology
Chinese Academy of Sciences
July 29, 2011
Outline
Institute of Computing Technology, Chinese Academy of Sciences
Map Projection
Establishes the relationship between two different coordinate systems: geographical coordinates → a planar Cartesian map coordinate system.
Involves complicated, time-consuming arithmetic operations: a fast answer with the desired accuracy beats a slow exact answer.
It needs to be accelerated for interactive GIS scenarios.
GPGPU (general-purpose computing on graphics processing units)
GPGPU is a young area of research.
Advantages of the GPU: flexibility, processing power, low cost.
GPGPU uses the GPU in applications other than 3D graphics: the GPU accelerates the critical path of the application.
CUDA (Compute Unified Device Architecture)
NVIDIA's parallel computing architecture: a C-based programming language and development toolkit.
Advantage: programmers can focus on the important issues rather than on an unfamiliar language, with no need for graphics APIs to write efficient parallel code.
The characteristics of map projection
A huge number of coordinates to handle.
Complex arithmetic operations.
A real-time response requirement.
Our proposal
Use the new CUDA technology on the GPU, taking the Universal Transverse Mercator (UTM) projection as an example.
Performance: an improvement of 6x to 8x (including transfer time); a speedup of 70x to 90x (excluding transfer time).
Outline
Algorithm framework
CPU:
1. Open the shapefile.
2. Read the coordinates of all features.
3. Copy the data from the CPU to GPU global memory.
GPU:
4. Execute the kernel function across Block 0 … Block m, each running Thread 0 … Thread n.
CPU:
5. Copy the result from the GPU back to the CPU.
6. Free the device memory.
7. Save or display the result.
Two task-assignment schemes are considered: striped partitioning and matrix distribution.
Striped partitioning
Define the number of blocks and threads: Block_num, Thread_num.
CUDA built-in parameters: gridDim, blockDim.
Number of geographic features: fn.
Each block handles fn/gridDim.x features.
[Figure: the relationship between blocks and features, and between threads and coordinates. Block 0 … Block m map to feature 0 … feature m on the first pass and feature m+1 … feature 2m on the next; within each feature, thread 0 … thread n map to coord 0 … coord n.]
Striped partitioning
For surrounding loop: Blocks and features Block → Feature[i] i = blockidx.x*(fn/GridDim.x)
(1)
Block → next Feature[k] k = i + fn/GridDim.x (2)
For inner loop: Threads and coordinates thread→coord[j]
j = threadIdx.x thread→next coord[k]
k = j +Thread_numInstitute of Computing Technology,
Chinese Academy of Sciences
Matrix distribution
Define the grid and block dimensions: grid(br, bc), block(tr, tc).
Each block runs k features, where
k = ⌈fn / (gridDim.x · gridDim.y)⌉   (1)
Feature[i]:
i = (blockIdx.y · gridDim.x + blockIdx.x) · k   (2)
and the block handles Feature[i] … Feature[i + k − 1]   (3)
Each thread runs s coordinates, where
s = ⌈feature[i].size / (blockDim.x · blockDim.y)⌉   (1)
coord[j]:
j = (threadIdx.y · blockDim.x + threadIdx.x) · s   (2)
and the thread handles coord[j] … coord[j + s − 1]   (3)
Outline
Experiment Environment
Hardware:
CPU: Intel Core 2 Duo E8500 at 3.18 GHz with 2 GB of main memory.
GPU: NVIDIA GeForce 9800 GTX+ graphics card with 512 MB of memory, 128 CUDA cores, and 16 multiprocessors.
Software: Microsoft Windows XP Pro SP2, Microsoft Visual Studio 2005, NVIDIA driver 2.2, CUDA SDK 2.2, and CUDA Toolkit 2.2.
The data-parallel degree
Total CPU time = initialization and file-reading time + serial projection time.
Map projection can achieve more than 90 percent parallelism.
Comparing with the CPU
Configuration: Block_num = 64, Thread_num = 512.
Total time = map projection time + data transfer time.
Considering the total time, performance improves by 6x to 8x; comparing map projection time alone, we obtain 70x to 90x speedups.
The performance of different task assignments
Striped partitioning: Block_num = 64, Thread_num = 512.
Matrix distribution: dim_grid(32, 32) = 32 × 32 blocks, dim_block(256, 256) = 256 × 256 threads.
Speedup: striped 6x to 8x; matrix 4x to 6x.
The performance of different task assignments
[Figure: global-memory access patterns. Striped: Block 0 … Block m−1, each with thread 0 … thread n−1, access consecutive addresses 0, 1, …, n−1, n, n+1, …, 2n, …, mn+n. Matrix: in Grid 0, blocks B(0,0) … B(m,m) of threads t(0,0) … t(n,n) access memory with a row stride of blockDim.x · gridDim.x.]
In the striped scheme, all threads in a block access consecutive memory; the matrix scheme can only ensure that each row of threads in a block handles consecutive data.
Outline
Conclusion and Future Work
We implement a fast map projection method on CUDA-enabled GPUs with a high speedup over the CPU-based method.
The power of the modern GPU can considerably accelerate work in the field of geoscience, such as DEM-based spatial interpolation and raster-based spatial analysis.
Future work: GPU implementations of other GIS applications.
Thank you! Q & A
Yanwei Zhao
Institute of Computing Technology
Contact: [email protected]