10/10/13 Thinking Parallel, Part III: Tree Construction on the GPU | NVIDIA Developer Zone
https://developer.nvidia.com/content/thinking-parallel-part-iii-tree-construction-gpu 1/11
Thinking Parallel, Part III: Tree Construction on the GPU
By Tero Karras, posted Dec 19 2012 at 10:49PM
Tags: Algorithms, Occupancy, Parallel Programming
In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects, and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.

There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more on constructing it. One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let's look at how.
EXPLOITING THE Z-ORDER CURVE
The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that are located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity.
The Z-order curve is defined in terms of Morton codes. To calculate a Morton code for a given 3D point, we start by looking at the binary fixed-point representation of its coordinates, as shown in the top left part of the figure. First, we take the fractional part of each coordinate and expand it by inserting two gaps after each bit. Second, we interleave the bits of all three coordinates together to form a single binary number. If we step through the Morton codes obtained this way in increasing order, we are effectively stepping along the Z-order curve in 3D (a 2D representation is shown on the right-hand side of the figure). In practice, we can determine the order of the leaf nodes by assigning a Morton code for each object and then sorting the objects accordingly. As mentioned in the context of sort and sweep in part I, parallel radix sort is just the right tool for this job. A good way to assign the Morton code for a given object is to use the centroid point of its bounding box, and express it relative to the bounding box of the scene. The expansion and interleaving of bits can then be performed efficiently by exploiting the arcane bit-swizzling properties of integer multiplication, as shown in the following code.
// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Calculates a 30-bit Morton code for the
// given 3D point located within the unit cube [0,1].
unsigned int morton3D(float x, float y, float z)
{
    x = min(max(x * 1024.0f, 0.0f), 1023.0f);
    y = min(max(y * 1024.0f, 0.0f), 1023.0f);
    z = min(max(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}
In our example dataset with 12,486 objects, assigning the Morton codes this way takes 0.02 milliseconds on GeForce GTX 690, whereas sorting the objects takes 0.18 ms. So far so good, but we still have a tree to build.
TOP-DOWN HIERARCHY GENERATION
One of the great things about LBVH is that once we have fixed the order of the leaf nodes, we can think of each internal node as just a linear range over them. To illustrate this, suppose that we have N leaf nodes in total. The root node contains all of them, i.e. it covers the range [0, N-1]. The left child of the root must then cover the range [0, γ], for some appropriate choice of γ, and the right child covers the range [γ+1, N-1]. We can continue this all the way down to obtain the following recursive algorithm.
Node* generateHierarchy(unsigned int* sortedMortonCodes,
                        int*          sortedObjectIDs,
                        int           first,
                        int           last)
{
    // Single object => create a leaf node.
    if (first == last)
        return new LeafNode(&sortedObjectIDs[first]);

    // Determine where to split the range.
    int split = findSplit(sortedMortonCodes, first, last);

    // Process the resulting sub-ranges recursively.
    Node* childA = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     first, split);
    Node* childB = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     split + 1, last);
    return new InternalNode(childA, childB);
}
We start with a range that covers all objects (first = 0, last = N-1), and determine an appropriate position to split the range in two (split = γ). We then repeat the same thing for the resulting sub-ranges, and generate a hierarchy where each such split corresponds to one internal node. The recursion terminates when we encounter a range that contains only one item, in which case we create a leaf node.
The only remaining question is how to choose γ. LBVH determines γ according to the highest bit that differs between the Morton codes within the given range. In other words, we aim to partition the objects so that the highest differing bit will be zero for all objects in childA, and one for all objects in childB. The intuitive reason that this is a good idea is that partitioning objects by the highest differing bit in their Morton codes corresponds to classifying them on either side of an axis-aligned plane in 3D. In practice, the most efficient way to find out where the highest bit changes is to use binary search. The idea is to maintain a current best guess for the position, and try to advance it in exponentially decreasing steps. On each step, we check whether the proposed new position would violate the requirements for childA, and either accept or reject it accordingly. This is illustrated by the following code, which uses the __clz() intrinsic function available on NVIDIA Fermi and Kepler GPUs to count the number of leading zero bits in a 32-bit integer.
int findSplit(unsigned int* sortedMortonCodes,
              int           first,
              int           last)
{
    // Identical Morton codes => split the range in the middle.
    unsigned int firstCode = sortedMortonCodes[first];
    unsigned int lastCode = sortedMortonCodes[last];

    if (firstCode == lastCode)
        return (first + last) >> 1;

    // Calculate the number of highest bits that are the same
    // for all objects, using the count-leading-zeros intrinsic.
    int commonPrefix = __clz(firstCode ^ lastCode);

    // Use binary search to find where the next bit differs.
    // Specifically, we are looking for the highest object that
    // shares more than commonPrefix bits with the first one.
    int split = first; // initial guess
    int step = last - first;

    do
    {
        step = (step + 1) >> 1; // exponential decrease
        int newSplit = split + step; // proposed new position

        if (newSplit < last)
        {
            unsigned int splitCode = sortedMortonCodes[newSplit];
            int splitPrefix = __clz(firstCode ^ splitCode);
            if (splitPrefix > commonPrefix)
                split = newSplit; // accept proposal
        }
    }
    while (step > 1);

    return split;
}
How should we go about parallelizing this kind of recursive algorithm? One way is to use the approach presented by Garanzha et al., which processes the levels of nodes sequentially, starting from the root. The idea is to maintain a growing array of nodes in breadth-first order, so that every level in the hierarchy corresponds to a linear range of nodes. On a given level, we launch one thread for each node that falls into this range. The thread starts by reading first and last from the node array and calling findSplit(). It then appends the resulting child nodes to the same node array using an atomic counter and writes out their corresponding sub-ranges. This process iterates so that each level outputs the nodes contained on the next level, which then get processed in the next round.
OCCUPANCY
The algorithm just described (Garanzha et al.) is surprisingly fast when there are millions of objects. The algorithm spends most of the execution time at the bottom levels of the tree, which contain more than enough work to fully employ the GPU. There is some amount of data divergence on the higher levels, as the threads are accessing distinct parts of the Morton code array. But those levels are also less significant considering the overall execution time, since they do not contain as many nodes to begin with. In our case, however, there are only 12K objects (recall the example seen in the previous post). Note that this is actually less than the number of threads we would need to fill our GTX 690, even if we were able to parallelize everything perfectly. GTX 690 is a dual-GPU card, where each of the two GPUs can run up to 16K threads in parallel. Even if we restrict ourselves to only one of the GPUs (the other one can, for example, handle rendering while we do physics), we are still in danger of running low on parallelism.
The top-down algorithm takes 1.04 ms to process our workload, which is more than twice the total time taken by all the other processing steps. To explain this, we need to consider another metric in addition to divergence: occupancy. Occupancy is a measure of how many threads are executing on average at any given time, relative to the maximum number of threads that the processor can theoretically support. When occupancy is low, it translates directly to performance: dropping occupancy in half will reduce performance by 2x. This dependence gets gradually weaker as the number of active threads increases. The reason is that when occupancy is high enough, the overall performance starts to become limited by other factors, such as instruction throughput and memory bandwidth.

To illustrate, consider the case with 12K objects and 16K threads. If we launch one thread per object, our occupancy is at most 75%. A bit low, but not by any means catastrophic. How does the top-down hierarchy generation algorithm compare to this? There is only one node on the first level, so we launch only one thread. This means that the first level will run at 0.006% occupancy! The second level has two nodes, so it runs at 0.013% occupancy. Assuming a balanced hierarchy, the third level runs at 0.025% and the fourth at 0.05%. Only when we get to the 13th level can we even hope to reach a reasonable occupancy of 25%. But right after that we will already run out of work. These numbers are somewhat discouraging: due to the low occupancy, the first level will cost roughly as much as the 13th level, even though there are 4096 times fewer nodes to process.
FULLY PARALLEL HIERARCHY GENERATION
There is no way to avoid this problem without somehow changing the algorithm in a fundamental way. Even if our GPU supports dynamic parallelism (as NVIDIA Tesla K20 does), we cannot avoid the fact that every node is dependent on the results of its parent. We have to finish processing the root before we know which ranges its children cover, and we cannot even hope to start processing them until we do. In other words, regardless of how we implement top-down hierarchy generation, the first level is doomed to run at 0.006% occupancy. Is there a way to break the dependency between nodes?

In fact there is, and I recently presented the solution at High Performance Graphics 2012 (paper, slides). The idea is to number the internal nodes in a very specific way that allows us to find out which range of objects any given node corresponds to, without having to know anything about the rest of the tree. Utilizing the fact that any binary tree with N leaf nodes always has exactly N-1 internal nodes, we can then generate the entire hierarchy as illustrated by the following pseudocode.
Node* generateHierarchy(unsigned int* sortedMortonCodes,
                        int*          sortedObjectIDs,
                        int           numObjects)
{
    LeafNode* leafNodes = new LeafNode[numObjects];
    InternalNode* internalNodes = new InternalNode[numObjects - 1];

    // Construct leaf nodes.
    // Note: This step can be avoided by storing
    // the tree in a slightly different way.
    for (int idx = 0; idx < numObjects; idx++) // in parallel
        leafNodes[idx].objectID = sortedObjectIDs[idx];

    // Construct internal nodes.
    for (int idx = 0; idx < numObjects - 1; idx++) // in parallel
    {
        // Find out which range of objects the node corresponds to.
        // (This is where the magic happens!)
        int2 range = determineRange(sortedMortonCodes, numObjects, idx);
        int first = range.x;
        int last = range.y;

        // Determine where to split the range.
        int split = findSplit(sortedMortonCodes, first, last);

        // Select childA.
        Node* childA;
        if (split == first)
            childA = &leafNodes[split];
        else
            childA = &internalNodes[split];

        // Select childB.
        Node* childB;
        if (split + 1 == last)
            childB = &leafNodes[split + 1];
        else
            childB = &internalNodes[split + 1];

        // Record parent-child relationships.
        internalNodes[idx].childA = childA;
        internalNodes[idx].childB = childB;
        childA->parent = &internalNodes[idx];
        childB->parent = &internalNodes[idx];
    }

    // Node 0 is the root.
    return &internalNodes[0];
}
The algorithm simply allocates an array of N-1 internal nodes, and then processes all of them in parallel. Each thread starts by determining which range of objects its node corresponds to, with a bit of magic, and then proceeds to split the range as usual. Finally, it selects children for the node according to their respective sub-ranges. If a sub-range has only one object, the child must be a leaf, so we reference the corresponding leaf node directly. Otherwise, we reference another internal node from the array.

The way the internal nodes are numbered is already evident from the pseudocode. The root has index 0, and the children of every node are located on either side of its split position. Due to some nice properties of the sorted Morton codes, this numbering scheme never results in any duplicates or gaps. Furthermore, it turns out that we can implement determineRange() in much the same way as findSplit(), using two similar binary searches over the nearby Morton codes. For further details on how and why this works, please see the paper.
How does this algorithm compare to the recursive top-down approach? It clearly performs more work: we now need three binary searches per node, whereas the top-down approach only needs one. But it does all of the work completely in parallel, reaching the full 75% occupancy in our example, and this makes a huge difference. The execution time of parallel hierarchy generation is merely 0.02 ms, a 50x improvement over the top-down algorithm!

You might think that the top-down algorithm should start to win when the number of objects is sufficiently high, since the lack of occupancy is no longer a problem. However, this is not the case in practice; the parallel algorithm consistently performs better on all problem sizes. The explanation for this is, as always, divergence. In the parallel algorithm, nearby threads are always accessing nearby Morton codes, whereas the top-down algorithm spreads out the accesses over a wider area.
BOUNDING BOX CALCULATION
Now that we have a hierarchy of nodes in place, the only thing left to do is to assign a conservative bounding box for each of them. The approach I adopt in my paper is to do a parallel bottom-up reduction, where each thread starts from a single leaf node and walks toward the root. To find the bounding box of a given node, the thread simply looks up the bounding boxes of its children and calculates their union. To avoid duplicate work, the idea is to use an atomic flag per node to terminate the first thread that enters it, while letting the second one through. This ensures that every node gets processed only once, and not before both of its children are processed. The bounding box calculation has high execution divergence: only half of the threads remain active after processing one node, one quarter after processing two nodes, one eighth after processing three nodes, and so on. However, this is not really a problem in practice, for two reasons. First, bounding box calculation takes only 0.06 ms, which is still reasonably low compared to, e.g., sorting the objects. Second, the processing mainly consists of reading and writing bounding boxes, and the amount of computation is minimal. This means that the execution time is almost entirely dictated by the available memory bandwidth, and reducing execution divergence would not really help that much.
SUMMARY
Our algorithm for finding potential collisions among a set of 3D objects consists of the following 5 steps (times are for the 12K object scene used in the previous post).

1. 0.02 ms, one thread per object: Calculate bounding box and assign Morton code.
2. 0.18 ms, parallel radix sort: Sort the objects according to their Morton codes.
3. 0.02 ms, one thread per internal node: Generate BVH node hierarchy.
4. 0.06 ms, one thread per object: Calculate node bounding boxes by walking the hierarchy toward the root.
5. 0.25 ms, one thread per object: Find potential collisions by traversing the BVH.

The complete algorithm takes 0.53 ms, out of which 53% goes to tree construction and 47% to tree traversal.
DISCUSSION
We have presented a number of algorithms of varying complexity in the context of broad-phase collision detection, and identified some of the most important considerations when designing and implementing them on a massively parallel processor. The comparison between independent traversal and simultaneous traversal illustrates the importance of divergence in algorithm design: the best single-core algorithm may easily turn out to be the worst one in a parallel setting. Relying on time complexity as the main indicator of a good algorithm can sometimes be misleading or even harmful; it may actually be beneficial to do more work if it helps in reducing divergence.

The parallel algorithm for BVH hierarchy generation brings up another interesting point. In the traditional sense, the algorithm is completely pointless: on a single-core processor, the dependencies between nodes were not a problem to begin with, and doing more work per node only makes the algorithm run slower. This shows that parallel programming is indeed fundamentally different from traditional single-core programming: it is not so much about porting existing algorithms to run on a parallel processor; it is about re-thinking some of the things that we usually take for granted, and coming up with new algorithms specifically designed with massive parallelism in mind.

And there is still a lot to be accomplished on this front.
About the author: Tero Karras is a graphics research scientist at NVIDIA Research.

Parallel Forall is the NVIDIA Parallel Programming blog. If you enjoyed this post, subscribe to the Parallel Forall RSS feed! You may contact us via the contact form.
15%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4860?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4852?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4917?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4905?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4786?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/436?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4842?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4868?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4867?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/496?display[%24ne]=char%28119%2C104%2C115
%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4843?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/321?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttp://developer.nvidia.com/blog/feed/2http://developer.nvidia.com/content/contact-parallel-forallhttp://developer.nvidia.com/about-parallel-forallhttps://developer.nvidia.com/blog?term=2https://developer.nvidia.com/registered-developer-programshttps://developer.nvidia.com/content/contact-parallel-forallhttp://nvdrupws18.nvidia.com/blog/feed/2 -
Copyright 2013 NVIDIA Corporation