Efficient Parallel CKY Parsing on GPUs
Youngmin Yi (University of Seoul)
Chao-Yue Lai (UC Berkeley)
Slav Petrov (Google Research)
Kurt Keutzer (UC Berkeley)
CKY Parsing
• Find the most likely parse tree for a given sentence
• Parse trees can be used in many NLP applications
  – Machine translation
  – Question answering
  – Information extraction
• Dynamic programming in O(|G|n³)
  – n is the number of words in the sentence
  – |G| is the size of the grammar
  – (a serial sketch follows the chart figure below)
[Figure: CKY chart for the sentence "I love you .", with one cell per span; cells are indexed (0,0) through (3,3), where cell (i,j) covers words i through j.]
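To make the O(|G|n³) structure concrete, here is a minimal serial sketch of the CKY dynamic program. All names (scores, rules, Rule) are hypothetical, scores are assumed to be log-probabilities, and the unary-rule pass is omitted:

    /* Minimal serial CKY sketch (hypothetical names; unary pass omitted). */
    for (int len = 2; len <= n; len++) {                  /* span length  */
        for (int start = 0; start + len <= n; start++) {  /* span start   */
            int stop = start + len;
            for (int r = 0; r < numRules; r++) {          /* binary rules */
                Rule rule = rules[r];                     /* A -> B C     */
                for (int mid = start + 1; mid < stop; mid++) {
                    float s = scores[start][mid][rule.lchild]
                            + scores[mid][stop][rule.rchild]
                            + rule.score;                 /* log-probs    */
                    if (s > scores[start][stop][rule.parent])
                        scores[start][stop][rule.parent] = s;
                }
            }
        }
    }

The O(n²) spans times O(n) split points give the n³ factor; the loop over rules gives the |G| factor.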
Why Faster Parsers?
• O(|G|n³)
  – n is on average about 20
  – |G| is much larger
    • grammars with high accuracy: >1,000,000 rules
• We need faster parsers for real-time NL processing with high accuracy!
GPUs
• Manycore era
  – Due to the "Power Wall", CPUs with faster clock frequencies are unlikely to appear
  – Instead, the number of processing cores will continue to increase
• GPU (Graphics Processing Unit)
  – Currently available manycore architecture
  – 480 processing cores in the GTX480
Overall Structure
• Hierarchical parallel platform
  – Several Streaming Processors (SPs) grouped into a Streaming Multiprocessor (SM)
…
Memory Types
• Different types of memory
  – Global memory: large, off-chip, high latency
  – Shared memory: small, on-chip, fast, private to each thread block
  – Texture and constant memory: read-only from kernels, cached on chip
CUDA
• CUDA (Compute Unified Device Architecture)
  – Parallel programming framework for GPUs
    • Programming model, language, compilers, APIs
  – Allows general-purpose computing on GPUs
Thread and Thread Block in CUDA
• Thread blocks (blocks)
  – Independent execution units
• Threads
  – Maximum threads per block: 512 or 1024
• Warps
  – Groups of 32 threads executed together
• Kernel
  – Launch configured as #blocks, #threads
• Fork-join programming model: host + device program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device kernel C code

    Serial code (host)
      . . .
    Parallel code in kernel (device):
      KernelA<<< nBlk, nThr >>>(args);
    Serial code (host)
      . . .
    Parallel code in kernel (device):
      KernelB<<< nBlk, nThr >>>(args);
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
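A minimal runnable example of this fork-join pattern (the kernel, names, and sizes are illustrative, not from the slides):

    #include <cstdio>

    // Trivial device kernel: each thread writes its global index.
    __global__ void fillIndices(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i;
    }

    int main() {
        const int n = 1024;
        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));

        // Fork: launch the highly parallel part on the device.
        int nThr = 256;
        int nBlk = (n + nThr - 1) / nThr;
        fillIndices<<<nBlk, nThr>>>(d_out, n);

        // Join: wait for the kernel, then continue with serial host code.
        cudaDeviceSynchronize();

        int h_out[4];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("%d %d %d %d\n", h_out[0], h_out[1], h_out[2], h_out[3]);
        cudaFree(d_out);
        return 0;
    }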
Programming Model in CUDA
• SIMT (Single Instruction Multiple Thread) model
  – Not SIMD (Single Instruction Multiple Data), because threads can actually execute different locations of the program:

    __global__ void Kernel1(...)
    {
        if (threadIdx.x < a)
            ...        // threads may take
        else
            ...        // different branches
    }

  – Not SPMD (Single Program Multiple Data), because threads with different execution paths cannot execute in parallel:

    __global__ void Kernel2(...)
    {
        int tx = threadIdx.x;
        for (int i = 0; i < LoopCount[tx]; i++)
            ...        // divergent trip counts serialize within a warp
    }
Parallelisms in CKY Parsing
• Dynamic programming
  – Iterations (span lengths) must be executed serially
• But, within each iteration
  – About a million rules (with thousands of symbols) need to be evaluated for each span
  – (a kernel sketch follows the chart figure below)
[Figure: the CKY chart for "I love you ." again; within one iteration, unary rule relaxation and binary rule relaxation are applied in parallel along two axes: # rules and # spans.]
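One hypothetical realization of a single iteration, combining rule- and span-parallelism (both mappings are discussed on the following slides). The scores layout, all names, and atomicMaxFloat (a helper sketched under Synchronization below) are assumptions for illustration, not the authors' exact code:

    // scores flattened as [len][start][symbol] (the layout adopted later).
    __device__ inline float &cell(float *scores, int len, int start,
                                  int sym, int n, int numSyms) {
        return scores[((size_t)len * n + start) * numSyms + sym];
    }

    // One thread evaluates one binary rule for one span.
    __global__ void binaryRelax(const int *parent, const int *lchild,
                                const int *rchild, const float *ruleScore,
                                int numRules, float *scores,
                                int len, int n, int numSyms) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;  // which rule
        int start = blockIdx.y;                         // which span
        if (r >= numRules) return;

        float best = -1e30f;                            // log-prob "zero"
        for (int split = 1; split < len; split++) {     // all split points
            float s = cell(scores, split, start, lchild[r], n, numSyms)
                    + cell(scores, len - split, start + split, rchild[r], n, numSyms)
                    + ruleScore[r];
            if (s > best) best = s;
        }
        // All rules sharing a parent must reduce to a maximum; see the
        // Synchronization slides (atomic operations vs. parallel reduction).
        atomicMaxFloat(&cell(scores, len, start, parent[r], n, numSyms), best);
    }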
Thread-Mapping
• Map a symbol to a thread?
  – Not good for load balancing
  – Remember SIMT!
• Map a rule to a thread?
  – 850K rules → good concurrency
  – Thread blocks are just groups of the same # of threads
…
Block-Mapping
• Map each symbol to a thread block
  – and map the rules to threads in the thread block that corresponds to the parent symbol
  – (+) All the threads in the same thread block have the same parent
  – (-) What if the #rules of a symbol exceeds the #threads limit? (see the figure and host-side sketch below)
…
Block-Mapping
[Figure: Symbol i has more rules than the per-block thread limit (threads 0..1023), so its rules are split across Virtual Symbol j and Virtual Symbol j+1, each handled by its own thread block.]
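A hypothetical host-side sketch of the virtual-symbol idea (all names illustrative): any symbol whose rule count exceeds the block size is split into chunks, and one thread block is launched per chunk.

    #include <vector>

    // One entry per virtual symbol: the range of rules its block evaluates.
    // Rules are assumed to be sorted by parent symbol.
    struct VirtualSym { int ruleBegin, ruleEnd; };

    std::vector<VirtualSym> buildVirtualSymbols(
            const std::vector<int> &rulesPerSymbol, int blockSize) {
        std::vector<VirtualSym> vsyms;
        int base = 0;                         // first rule of this symbol
        for (int count : rulesPerSymbol) {
            for (int off = 0; off < count; off += blockSize) {
                int chunk = (count - off < blockSize) ? count - off : blockSize;
                vsyms.push_back({base + off, base + off + chunk});
            }
            base += count;
        }
        return vsyms;
    }
    // Launch one thread block per virtual symbol, blockSize threads each:
    //   kernel<<<vsyms.size(), blockSize>>>(...);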
Span-Mapping
• It is easy to exploit a further level of parallelism orthogonally
  – Simply add another dimension to the grid of thread blocks (launch sketch below)

[Figure: a 2D grid of thread blocks; blockIdx.x ranges over symbols (sym0, sym1, ...) and blockIdx.y ranges over span indices 0, 1, ..., n-len+1.]
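A minimal launch-configuration sketch of the 2D grid (kernel name hypothetical):

    // One DP iteration with span-mapping: x = virtual symbols, y = spans.
    dim3 grid(numVirtualSymbols, n - len + 1);
    dim3 block(threadsPerBlock);          // e.g. up to 1024 rules per block
    binaryRelaxBlockMapped<<<grid, block>>>(/* grammar, scores, len, n */);
    // Inside the kernel:
    //   int vsym  = blockIdx.x;   // which (virtual) symbol
    //   int start = blockIdx.y;   // which span start position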
Synchronization
• A massive number of threads with the same parent symbol need to update its score correctly, so that the final reduced value is the maximum

Atomic Operations
• atomicMax(&max, value);
  – CUDA API
  – Much more efficient on shared memory than on global memory
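CUDA's built-in atomicMax operates on integer types, so a float maximum needs a workaround. A sketch of the standard compare-and-swap idiom (a common pattern, not necessarily the authors' implementation):

    // Atomic max for float via atomicCAS on the bit pattern.
    // Works on both shared and global memory addresses.
    __device__ float atomicMaxFloat(float *addr, float value) {
        int *addrAsInt = (int *)addr;
        int old = *addrAsInt;
        // Retry until no other thread changes the cell underneath us.
        while (value > __int_as_float(old)) {
            int assumed = old;
            old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
            if (old == assumed) break;      // our value was written
        }
        return __int_as_float(old);
    }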
Parallel Reduction
• After log₂N steps (N is the #threads in a block), the reduced value is obtained
  – All the threads work on the same symbol
  – An option only for block-mapping
  – Threads synchronize between steps with __syncthreads()
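A minimal sketch of the classic shared-memory tree reduction, specialized to max (block size assumed to be a power of two):

    // In-block max reduction: after log2(blockDim.x) steps,
    // sdata[0] holds the block-wide maximum.
    __global__ void blockMax(const float *in, float *out) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        sdata[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride && sdata[tid + stride] > sdata[tid])
                sdata[tid] = sdata[tid + stride];
            __syncthreads();            // all threads finish each step
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }
    // Launch: blockMax<<<nBlocks, nThreads, nThreads * sizeof(float)>>>(in, out);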
Reducing Global Memory Using Texture Memory
• Grammar information
  – parent[], lchild[], rchild[]
  – Read-only throughout the whole program
• Scores updated in the previous iterations of dynamic programming
  – scores[][][]
  – Read-only within the current iteration
• Locate such read-only data in texture memory!
• But, in the case of scores[][][], we need to place the scores newly updated in the current iteration into texture memory as well
  – Placing an array in texture memory = cudaBindTexture()
  – The execution time of this API is proportional to the array size
  – (-) scores[start][stop][S] is a huge array…
[Figure: a binary rule Sj → Sr Ss reads scores[wp][wd][Sr] and scores[wd+1][wq][Ss].]
Reducing Global Memory Using Texture Memory (Cont'd)
• Change the layout
  – scores[start][stop][S] → scores[len][start][S]
  – We then only need to re-bind the part of scores[][][] where len = current iteration (binding sketch after the figure below)
[Figure: the CKY chart for "I love you ." with the new layout; the chart's diagonals correspond to len=1, len=2, len=3, len=4, so each iteration touches one contiguous scores[len][start][S] slice.]
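A sketch of how the per-iteration binding could look with the legacy texture-reference API of that CUDA generation (names and the exact binding policy are assumptions; the point is that only the newly written slice, not the whole array, is bound each iteration):

    // Legacy 1D texture reference over float scores.
    texture<float, 1, cudaReadModeElementType> texScores;

    // With the scores[len][start][S] layout, the slice written in the
    // previous iteration is contiguous and small, so binding it is cheap.
    void bindScoresForLen(const float *d_scores, int len, int n, int numSyms) {
        size_t sliceBytes = (size_t)n * numSyms * sizeof(float);
        const float *slice = d_scores + (size_t)len * n * numSyms;
        cudaBindTexture(NULL, texScores, slice, sliceBytes);
    }
    // Kernels then read the slice via tex1Dfetch(texScores, index).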
Experimental Results
• GTX285
  – No cache memory supported
  – Lower memory bandwidth

  Version              Speedup
  thread-atom          6.4x
  block-atom           8.1x
  block-pr             10.1x
  block-atom-SS        11.1x
  block-pr-SS          14.2x
  block-atom-SS-tex    11.9x
  block-pr-SS-tex      17.4x
Experimental Results
• GTX480
  – Cache memory supported
  – Higher memory bandwidth

  Version              Speedup
  thread-atom          13.2x
  block-atom           14.1x
  block-pr             25.8x
  block-atom-SS        15.2x
  block-pr-SS          23.4x
  block-atom-SS-tex    13.9x
  block-pr-SS-tex      22.2x
Conclusions
• We explored the design space for parallelizing CKY parsing on a GPU
  – Different mappings and synchronization methods
  – Utilizing different types of memory
• We compared each version on two GPUs
  – 26X on GTX480, 17X on GTX285
• We expect scalable performance gains as the number of processing cores increases in future GPUs