FAST MAP PROJECTION ON CUDA
Yanwei Zhao
Institute of Computing Technology
Chinese Academy of Sciences
July 29, 2011
Outline
Institute of Computing Technology, Chinese Academy of Sciences
Map Projection
Establishes the relationship between two different coordinate systems: geographical coordinates → a planar Cartesian map coordinate system.
Involves complicated, time-consuming arithmetic operations: a fast answer with the desired accuracy beats a slow exact answer.
It needs to be accelerated for interactive GIS scenarios.
GPGPU (general-purpose computing on graphics processing units)
GPGPU is a young area of research.
Advantages of the GPU: flexibility, processing power, low cost.
GPGPU uses the GPU in applications other than 3D graphics: the GPU accelerates the critical path of the application.
CUDA (Compute Unified Device Architecture)
NVIDIA's parallel computing architecture: a C-based programming language and development toolkit.
Advantage: programmers can focus on the important issues rather than on an unfamiliar language, with no need for graphics APIs to write efficient parallel code.
The characteristics of map projection
A huge number of coordinates to handle.
Complex arithmetic operations.
A real-time response requirement.
Our proposal
Use the new CUDA technology on the GPU, taking the Universal Transverse Mercator (UTM) projection as an example.
Performance: an improvement of 6x to 8x (including transfer time); a speedup of 70x to 90x (excluding transfer time).
Outline
Algorithm framework
CPU:
1. Open the shapefile.
2. Read the coordinates of all features.
3. Copy the data from the CPU to GPU global memory.
GPU:
4. Execute the kernel function across Block 0 … Block m, each running Thread 0 … Thread n.
CPU:
5. Copy the result from the GPU back to the CPU.
6. Free the device memory.
7. Save or display the result.
Two task-assignment schemes are considered: striped partitioning and matrix distribution.
Striped partitioning
Define the number of blocks and threads: Block_num, Thread_num.
CUDA built-in parameters: gridDim, blockDim.
Number of geographic features: fn.
Each block handles fn/gridDim.x features.
[Figure: the relationship between blocks and features, and between threads and coordinates. Block 0 … Block m map to feature 0 … feature m on the first pass and feature m+1 … feature 2m on the next; within each feature, thread 0 … thread n map to coord 0 … coord n.]
Striped partitioning
For surrounding loop: Blocks and features Block → Feature[i] i = blockidx.x*(fn/GridDim.x)
(1)
Block → next Feature[k] k = i + fn/GridDim.x (2)
For inner loop: Threads and coordinates thread→coord[j]
j = threadIdx.x thread→next coord[k]
k = j +Thread_numInstitute of Computing Technology,
Chinese Academy of Sciences
Matrix distribution
Define the grid and block dimensions: grid(br, bc), block(tr, tc).
Each block runs k features, where
k = ⌈fn / (gridDim.x · gridDim.y)⌉   (1)
Feature[i]:
i = (blockIdx.y · gridDim.x + blockIdx.x) · k   (2)
and the block handles Feature[i] … Feature[i + k − 1]   (3)
Each thread runs s coordinates, where
s = ⌈feature[i].size / (blockDim.x · blockDim.y)⌉   (1)
coord[j]:
j = (threadIdx.y · blockDim.x + threadIdx.x) · s   (2)
and the thread handles coord[j] … coord[j + s − 1]   (3)
Outline
Experiment Environment
Hardware:
CPU: Intel Core 2 Duo E8500 at 3.18 GHz with 2 GB of main memory.
GPU: NVIDIA GeForce 9800 GTX+ graphics card with 512 MB of memory, 128 CUDA cores, and 16 multiprocessors.
Software: Microsoft Windows XP Pro SP2, Microsoft Visual Studio 2005, NVIDIA driver 2.2, CUDA SDK 2.2, and CUDA Toolkit 2.2.
The data-parallel degree
Total CPU time = initialization and file-reading time + serial projection time.
Map projection can achieve more than 90 percent parallelism.
Comparing with the CPU
Configuration: Block_num = 64, Thread_num = 512.
Total time = map projection time + data transfer time.
Considering the total time, performance improves by 6x to 8x; comparing map projection time alone, we obtain 70x to 90x speedups.
The performance of different task assignments
Striped partitioning: Block_num = 64, Thread_num = 512.
Matrix distribution: dim_grid(32, 32) = 32 × 32 blocks, dim_block(256, 256) = 256 × 256 threads.
Speedup: striped 6x to 8x; matrix 4x to 6x.
The performance of different task assignments
[Figure: global-memory access patterns. Striped: Block 0 … Block m−1, each with thread 0 … thread n−1, access consecutive addresses 0, 1, …, n−1, n, n+1, …, 2n, …, mn+n. Matrix: in Grid 0, blocks B(0,0) … B(m,m) of threads t(0,0) … t(n,n) access memory with a row stride of blockDim.x · gridDim.x.]
In the striped scheme, all threads in a block access consecutive memory; the matrix scheme can only ensure that each row of threads in a block handles consecutive data.
Outline
Conclusion and Future Work
We implement a fast map projection method on CUDA-enabled GPUs with a high speedup over the CPU-based method.
The power of the modern GPU can considerably accelerate work in the field of geoscience, such as DEM-based spatial interpolation and raster-based spatial analysis.
Future work: GPU implementations of other GIS applications.
Thank you! Q & A
Yanwei Zhao
Institute of Computing Technology
Contact: [email protected]