https://developer.nvidia.com/content/thinking-parallel-part-iii-tree-construction-gpu

Thinking Parallel, Part III: Tree Construction on the GPU

By Tero Karras, posted Dec 19 2012 at 10:49PM

Tags: Algorithms, Occupancy, Parallel Programming

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects, and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.

There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more time on constructing it. One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations, while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let's look at how.

EXPLOITING THE Z-ORDER CURVE

The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that are located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity.

The Z-order curve is defined in terms of Morton codes. To calculate the Morton code for a given 3D point, we start by looking at the binary fixed-point representation of its coordinates, as shown in the top left part of the figure. First, we take the fractional part of each coordinate and expand it by inserting two gaps after each bit. Second, we interleave the bits of all three coordinates together to form a single binary number. If we step through the Morton codes obtained this way in increasing order, we are effectively stepping along the Z-order curve in 3D (a 2D representation is shown on the right-hand side of the figure). In practice, we can determine the order of the leaf nodes by assigning a Morton code to each object and then sorting the objects accordingly. As mentioned in the context of sort and sweep in part I, parallel radix sort is just the right tool for this job. A good way to assign the Morton code for a given object is to use the centroid point of its bounding box, and express it relative to the bounding box of the scene. The expansion and interleaving of bits can then be performed efficiently by exploiting the arcane bit-swizzling properties of integer multiplication, as shown in the following code.

// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Calculates a 30-bit Morton code for the
// given 3D point located within the unit cube [0,1].
unsigned int morton3D(float x, float y, float z)
{
    x = min(max(x * 1024.0f, 0.0f), 1023.0f);
    y = min(max(y * 1024.0f, 0.0f), 1023.0f);
    z = min(max(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}

In our example dataset with 12,486 objects, assigning the Morton codes this way takes 0.02 milliseconds on GeForce GTX 690, whereas sorting the objects takes 0.18 ms. So far so good, but we still have a tree to build.

TOP-DOWN HIERARCHY GENERATION

One of the great things about LBVH is that once we have fixed the order of the leaf nodes, we can think of each internal node as just a linear range over them. To illustrate this, suppose that we have N leaf nodes in total. The root node contains all of them, i.e. it covers the range [0, N-1]. The left child of the root must then cover the range [0, γ], for some appropriate choice of γ, and the right child covers the range [γ+1, N-1]. We can continue this all the way down to obtain the following recursive algorithm.

Node* generateHierarchy(unsigned int* sortedMortonCodes,
                        int*          sortedObjectIDs,
                        int           first,
                        int           last)
{
    // Single object => create a leaf node.
    if (first == last)
        return new LeafNode(&sortedObjectIDs[first]);

    // Determine where to split the range.
    int split = findSplit(sortedMortonCodes, first, last);

    // Process the resulting sub-ranges recursively.
    Node* childA = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     first, split);
    Node* childB = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     split + 1, last);
    return new InternalNode(childA, childB);
}

We start with a range that covers all objects (first=0, last=N-1), and determine an appropriate position to split the range in two (split=γ). We then repeat the same thing for the resulting sub-ranges, and generate a hierarchy where each such split corresponds to one internal node. The recursion terminates when we encounter a range that contains only one item, in which case we create a leaf node.


The only remaining question is how to choose γ. LBVH determines γ according to the highest bit that differs between the Morton codes within the given range. In other words, we aim to partition the objects so that the highest differing bit will be zero for all objects in childA, and one for all objects in childB. The intuitive reason that this is a good idea is that partitioning objects by the highest differing bit in their Morton codes corresponds to classifying them on either side of an axis-aligned plane in 3D. In practice, the most efficient way to find out where the highest bit changes is to use binary search. The idea is to maintain a current best guess for the position, and try to advance it in exponentially decreasing steps. On each step, we check whether the proposed new position would violate the requirements for childA, and either accept or reject it accordingly. This is illustrated by the following code, which uses the __clz() intrinsic function available on NVIDIA Fermi and Kepler GPUs to count the number of leading zero bits in a 32-bit integer.

int findSplit(unsigned int* sortedMortonCodes,
              int           first,
              int           last)
{
    // Identical Morton codes => split the range in the middle.
    unsigned int firstCode = sortedMortonCodes[first];
    unsigned int lastCode  = sortedMortonCodes[last];

    if (firstCode == lastCode)
        return (first + last) >> 1;

    // Calculate the number of highest bits that are the same
    // for all objects, using the count-leading-zeros intrinsic.
    int commonPrefix = __clz(firstCode ^ lastCode);

    // Use binary search to find where the next bit differs.
    // Specifically, we are looking for the highest object that
    // shares more than commonPrefix bits with the first one.
    int split = first; // initial guess
    int step = last - first;

    do
    {
        step = (step + 1) >> 1;      // exponential decrease
        int newSplit = split + step; // proposed new position

        if (newSplit < last)
        {
            unsigned int splitCode = sortedMortonCodes[newSplit];
            int splitPrefix = __clz(firstCode ^ splitCode);
            if (splitPrefix > commonPrefix)
                split = newSplit;    // accept proposal
        }
    }
    while (step > 1);

    return split;
}

How should we go about parallelizing this kind of recursive algorithm? One way is to use the approach presented by Garanzha et al., which processes the levels of nodes sequentially, starting from the root. The idea is to maintain a growing array of nodes in a breadth-first order, so that every level in the hierarchy corresponds to a linear range of nodes. On a given level, we launch one thread for each node that falls into this range. The thread starts by reading first and last from the node array and calling findSplit(). It then appends the resulting child nodes to the same node array using an atomic counter and writes out their corresponding sub-ranges. This process iterates so that each level outputs the nodes contained on the next level, which then get processed in the next round.

OCCUPANCY

The algorithm just described (Garanzha et al.) is surprisingly fast when there are millions of objects. The algorithm spends most of the execution time at the bottom levels of the tree, which contain more than enough work to fully employ the GPU. There is some amount of data divergence on the higher levels, as the threads are accessing distinct parts of the Morton code array. But those levels are also less significant considering the overall execution time, since they do not contain as many nodes to begin with. In our case, however, there are only 12K objects (recall the example seen in the last post). Note that this is actually less than the number of threads we would need to fill our GTX 690, even if we were able to parallelize everything perfectly. GTX 690 is a dual-GPU card, where each of the two GPUs can run up to 16K threads in parallel. Even if we restrict ourselves to only one of the GPUs (the other one can, e.g., handle rendering while we do physics), we are still in danger of running low on parallelism.

The top-down algorithm takes 1.04 ms to process our workload, which is more than twice the total time taken by all the other processing steps. To explain this, we need to consider another metric in addition to divergence: occupancy. Occupancy is a measure of how many threads are executing on average at any given time, relative to the maximum number of threads that the processor can theoretically support. When occupancy is low, it translates directly to performance: dropping occupancy in half will also drop performance by 2x. This dependence gets gradually weaker as the number of active threads increases. The reason is that when occupancy is high enough, the overall performance starts to become limited by other factors, such as instruction throughput and memory bandwidth.

To illustrate, consider the case with 12K objects and 16K threads. If we launch one thread per object, our occupancy is at most 75%. A bit low, but not by any means catastrophic. How does the top-down hierarchy generation algorithm compare to this? There is only one node on the first level, so we launch only one thread. This means that the first level will run at 0.006% occupancy! The second level has two nodes, so it runs at 0.013% occupancy. Assuming a balanced hierarchy, the third level runs at 0.025% and the fourth at 0.05%. Only when we get to the 13th level can we even hope to reach a reasonable occupancy of 25%. But right after that we will already run out of work. These numbers are somewhat discouraging: due to the low occupancy, the first level will cost roughly as much as the 13th level, even though there are 4096 times fewer nodes to process.

FULLY PARALLEL HIERARCHY GENERATION

There is no way to avoid this problem without somehow changing the algorithm in a fundamental way. Even if our GPU supports dynamic parallelism (as NVIDIA Tesla K20 does), we cannot avoid the fact that every node is dependent on the results of its parent. We have to finish processing the root before we know which ranges its children cover, and we cannot even hope to start processing them until we do. In other words, regardless of how we implement top-down hierarchy generation, the first level is doomed to run at 0.006% occupancy. Is there a way to break the dependency between nodes?

In fact there is, and I recently presented the solution at High Performance Graphics 2012 (paper, slides). The idea is to number the internal nodes in a very specific way that allows us to find out which range of objects any given node corresponds to, without having to know anything about the rest of the tree. Utilizing the fact that any binary tree with N leaf nodes always has exactly N-1 internal nodes, we can then generate the entire hierarchy as illustrated by the following pseudocode.

Node* generateHierarchy(unsigned int* sortedMortonCodes,
                        int*          sortedObjectIDs,
                        int           numObjects)
{
    LeafNode*     leafNodes     = new LeafNode[numObjects];
    InternalNode* internalNodes = new InternalNode[numObjects - 1];

    // Construct leaf nodes.
    // Note: This step can be avoided by storing
    // the tree in a slightly different way.
    for (int idx = 0; idx < numObjects; idx++) // in parallel
        leafNodes[idx].objectID = sortedObjectIDs[idx];

    // Construct internal nodes.
    for (int idx = 0; idx < numObjects - 1; idx++) // in parallel
    {
        // Find out which range of objects the node corresponds to.
        // (This is where the magic happens!)
        int2 range = determineRange(sortedMortonCodes, numObjects, idx);
        int first = range.x;
        int last  = range.y;

        // Determine where to split the range.
        int split = findSplit(sortedMortonCodes, first, last);

        // Select childA.
        Node* childA;
        if (split == first)
            childA = &leafNodes[split];
        else
            childA = &internalNodes[split];

        // Select childB.
        Node* childB;
        if (split + 1 == last)
            childB = &leafNodes[split + 1];
        else
            childB = &internalNodes[split + 1];

        // Record parent-child relationships.
        internalNodes[idx].childA = childA;
        internalNodes[idx].childB = childB;
        childA->parent = &internalNodes[idx];
        childB->parent = &internalNodes[idx];
    }

    // Node 0 is the root.
    return &internalNodes[0];
}

The algorithm simply allocates an array of N-1 internal nodes, and then processes all of them in parallel. Each thread starts by determining which range of objects its node corresponds to, with a bit of magic, and then proceeds to split the range as usual. Finally, it selects children for the node according to their respective sub-ranges. If a sub-range has only one object, the child must be a leaf, so we reference the corresponding leaf node directly. Otherwise, we reference another internal node from the array.

The way the internal nodes are numbered is already evident from the pseudocode. The root has index 0, and the children of every node are located on either side of its split position. Due to some nice properties of the sorted Morton codes, this numbering scheme never results in any duplicates or gaps. Furthermore, it turns out that we can implement determineRange() in much the same way as findSplit(), using two similar binary searches over the nearby Morton codes. For further details on how and why this works, please see the paper.

How does this algorithm compare to the recursive top-down approach? It clearly performs more work: we now need three binary searches per node, whereas the top-down approach only needs one. But it does all of the work completely in parallel, reaching the full 75% occupancy in our example, and this makes a huge difference. The execution time of parallel hierarchy generation is merely 0.02 ms, a 50x improvement over the top-down algorithm!

You might think that the top-down algorithm should start to win when the number of objects is sufficiently high, since the lack of occupancy is no longer a problem. However, this is not the case in practice: the parallel algorithm consistently performs better on all problem sizes. The explanation for this is, as always, divergence. In the parallel algorithm, nearby threads are always accessing nearby Morton codes, whereas the top-down algorithm spreads out the accesses over a wider area.

BOUNDING BOX CALCULATION

Now that we have a hierarchy of nodes in place, the only thing left to do is to assign a conservative bounding box for each of them. The approach I adopt in my paper is to do a parallel bottom-up reduction, where each thread starts from a single leaf node and walks toward the root. To find the bounding box of a given node, the thread simply looks up the bounding boxes of its children and calculates their union. To avoid duplicate work, the idea is to use an atomic flag per node to terminate the first thread that enters it, while letting the second one through. This ensures that every node gets processed only once, and not before both of its children are processed. The bounding box calculation has high execution divergence: only half of the threads remain active after processing one node, one quarter after processing two nodes, one eighth after processing three nodes, and so on. However, this is not really a problem in practice, for two reasons. First, bounding box calculation takes only 0.06 ms, which is still reasonably low compared to e.g. sorting the objects. Second, the processing mainly consists of reading and writing bounding boxes, and the amount of computation is minimal. This means that the execution time is almost entirely dictated by the available memory bandwidth, and reducing execution divergence would not really help that much.

SUMMARY

Our algorithm for finding potential collisions among a set of 3D objects consists of the following 5 steps (times are for the 12K-object scene used in the previous post).

1. 0.02 ms, one thread per object: Calculate bounding box and assign Morton code.
2. 0.18 ms, parallel radix sort: Sort the objects according to their Morton codes.
3. 0.02 ms, one thread per internal node: Generate BVH node hierarchy.
4. 0.06 ms, one thread per object: Calculate node bounding boxes by walking the hierarchy toward the root.
5. 0.25 ms, one thread per object: Find potential collisions by traversing the BVH.

The complete algorithm takes 0.53 ms, out of which 53% goes to tree construction and 47% to tree traversal.

DISCUSSION

We have presented a number of algorithms of varying complexity in the context of broad-phase collision detection, and identified some of the most important considerations when designing and implementing them on a massively parallel processor. The comparison between independent traversal and simultaneous traversal illustrates the importance of divergence in algorithm design: the best single-core algorithm may easily turn out to be the worst one in a parallel setting. Relying on time complexity as the main indicator of a good algorithm can sometimes be misleading or even harmful: it may actually be beneficial to do more work if it helps in reducing divergence.

The parallel algorithm for BVH hierarchy generation brings up another interesting point. In the traditional sense, the algorithm is completely pointless: on a single-core processor, the dependencies between nodes were not a problem to begin with, and doing more work per node only makes the algorithm run slower. This shows that parallel programming is indeed fundamentally different from traditional single-core programming: it is not so much about porting existing algorithms to run on a parallel processor; it is about re-thinking some of the things that we usually take for granted, and coming up with new algorithms specifically designed with massive parallelism in mind.

And there is still a lot to be accomplished on this front.

About the author: Tero Karras is a graphics research scientist at NVIDIA Research.

Parallel Forall is the NVIDIA Parallel Programming blog. If you enjoyed this post, subscribe to the Parallel Forall RSS feed! You may contact us via the contact form.

104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/466?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4873?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/456?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4501?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4933?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4631?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/166?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4857?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4845?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4864?display[%24ne]=char%28119%2C104
%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4860?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4852?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4917?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4905?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4786?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/436?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4842?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4868?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4867?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/496?display[%24ne]=char%28119%2C104%2
C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/4843?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttps://developer.nvidia.com/blog/tags/321?display[%24ne]=char%28119%2C104%2C115%2C83%2C81%2C76%2C105%29content%2Fincomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublashttp://developer.nvidia.com/blog/feed/2http://developer.nvidia.com/content/contact-parallel-forallhttp://developer.nvidia.com/about-parallel-forallhttps://developer.nvidia.com/blog?term=2https://developer.nvidia.com/registered-developer-programshttps://developer.nvidia.com/content/contact-parallel-forallhttp://nvdrupws18.nvidia.com/blog/feed/2
    Developer Blogs

    Parallel Forall Blog

    About

    Contact

    Copyright 2013 NVIDIA Corporation

    Legal Information

    Privacy Policy

    Code of Conduct
