Interactive Rendering with Coherent Ray Tracingwald/Publications/2001/CRT/CRT.pdfBasic ray casting,...

12
EUROGRAPHICS 2001 / A. Chalmers and T.-M. Rhyne (Guest Editors) Volume 20 (2001), Number 3 Interactive Rendering with Coherent Ray Tracing Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner Computer Graphics Group, Saarland University Abstract For almost two decades researchers have argued that ray tracing will eventually become faster than the rasteri- zation technique that completely dominates todays graphics hardware. However, this has not happened yet. Ray tracing is still exclusively being used for off-line rendering of photorealistic images and it is commonly believed that ray tracing is simply too costly to ever challenge rasterization-based algorithms for interactive use. However, there is hardly any scientific analysis that supports either point of view. In particular there is no evidence of where the crossover point might be, at which ray tracing would eventually become faster, or if such a point does exist at all. This paper provides several contributions to this discussion: We first present a highly optimized implementation of a ray tracer that improves performance by more than an order of magnitude compared to currently available ray tracers. The new algorithm makes better use of computational resources such as caches and SIMD instructions and better exploits image and objectspace coherence. Secondly, we show that this software implementation can challenge and even outperform high-end graphics hardware in interactive rendering performance for complex environments. We also provide an brief overview of the benefits of ray tracing over rasterization algorithms and point out the potential of interactive ray tracing both in hardware and software. 1. Introduction Ray tracing is famous for its ability to generate high-quality images but is also well-known for long rendering times due to its high computational cost. This cost is due to the need to traverse a scene with many rays, intersecting each with the geometric objects, shading the visible surface samples, and finally sending the resulting pixels to the screen. Due to the cost associated with ray tracing the technique is viewed almost exclusively as an off-line technique for cases where image quality matters more than rendering speed. On the other hand ray tracing offers a considerable num- ber of advantages over other rendering techniques. Basic ray casting, i.e. sampling the scene with individual rays, is a fun- damental task that is the core of a large number of algorithms not only in computer graphics. Other disciplines use the sa- me approach for example to simulate propagation of radio waves 11 , neutron transport, and diffusion 33 . Another exam- ple is DARPA’s large “Data Intensive Systems” (DIS) pro- ject 10 , which is mainly motivated by the need to speed up ray tracing for computing radar cross sections. But even if we only concentrate on rendering applications, ray tracing offers a number of benefits over rasterization- Figure 1: Interactive ray tracing: The office, conference and Soda Hall models contain roughly 40k, 680k, and 8 million triangles, respectively. Using our software ray tracing im- plementation on a single PC (Dual Pentium-III, 800 MHz, 256MB) at a resolution of 512 2 pixels, these scenes render at roughly 3.6, 3.2, and 1.6 frames per second using both processors. based algorithms that dominate todays algorithms targeted at interactive 3D graphics: Occlusion Culling and Logarithmic Complexity Ray tra- cing enables efficient rendering of complex scenes through its built in occlusion culling as well as its loga- rithmic complexity in the number of scene primitives. c The Eurographics Association and Blackwell Publishers 2001. Published by Blackwell Publishers, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA.

Transcript of Interactive Rendering with Coherent Ray Tracingwald/Publications/2001/CRT/CRT.pdfBasic ray casting,...

  • EUROGRAPHICS2001/ A. ChalmersandT.-M. Rhyne(GuestEditors)

    Volume20 (2001), Number3

    Interacti veRendering with Coherent Ray Tracing

    Ingo Wald,PhilippSlusallek,CarstenBenthin,andMarkusWagner

    ComputerGraphicsGroup,SaarlandUniversity

    AbstractFor almosttwo decadesresearchers havearguedthat ray tracingwill eventuallybecomefasterthan therasteri-zationtechniquethat completelydominatestodaysgraphicshardware. However, this hasnot happenedyet.Raytracing is still exclusivelybeingusedfor off-line renderingof photorealistic imagesand it is commonlybelievedthat ray tracingis simplytoocostlyto everchallenge rasterization-basedalgorithmsfor interactiveuse. However,there is hardly anyscientificanalysisthatsupportseitherpointof view. In particular there is noevidenceof wherethecrossover pointmightbe, at which ray tracingwouldeventuallybecomefaster, or if such a pointdoesexistatall.Thispaperprovidesseveral contributionsto thisdiscussion:Wefirstpresenta highlyoptimizedimplementationofa ray tracerthat improvesperformanceby more thanan order of magnitudecomparedto currentlyavailableraytracers. Thenew algorithm makesbetteruseof computationalresourcessuch as cachesand SIMD instructionsandbetterexploits image andobjectspacecoherence. Secondly, weshowthat this software implementationcanchallenge and evenoutperformhigh-endgraphicshardware in interactive renderingperformancefor complexenvironments.We alsoprovidean brief overview of thebenefitsof ray tracingover rasterizationalgorithmsandpoint out thepotentialof interactiveray tracingbothin hardware andsoftware.

    1. Intr oduction

    Raytracingis famousfor its ability to generatehigh-qualityimagesbut is alsowell-known for long renderingtimesdueto its high computationalcost.This cost is dueto the needto traversea scenewith many rays, intersectingeachwiththe geometricobjects,shadingthe visible surfacesamples,andfinally sendingtheresultingpixelsto thescreen.Duetothecostassociatedwith ray tracingthe techniqueis viewedalmostexclusively asan off-line techniquefor caseswhereimagequality mattersmorethanrenderingspeed.

    On theotherhandray tracingoffersa considerablenum-berof advantagesoverotherrenderingtechniques.Basicraycasting,i.e.samplingthescenewith individualrays,is afun-damentaltaskthatis thecoreof alargenumberof algorithmsnot only in computergraphics.Otherdisciplinesusethesa-me approachfor exampleto simulatepropagationof radiowaves11, neutrontransport,anddiffusion33. Anotherexam-ple is DARPA’s large “Data Intensive Systems”(DIS) pro-ject 10, which is mainly motivatedby the needto speedupray tracingfor computingradarcrosssections.

    But evenif weonly concentrateonrenderingapplications,ray tracing offers a numberof benefitsover rasterization-

    Figure1: Interactiveraytracing:Theoffice, conferenceandSodaHall modelscontainroughly40k,680k,and8 milliontriangles,respectively. Using our software ray tracing im-plementationon a singlePC (Dual Pentium-III, 800 MHz,256MB)at a resolutionof 5122 pixels,thesescenesrenderat roughly3.6, 3.2, and 1.6 framesper secondusingbothprocessors.

    basedalgorithmsthat dominatetodaysalgorithmstargetedat interactive 3D graphics:

    OcclusionCulling and Logarithmic Complexity Raytra-cing enables efficient rendering of complex scenesthroughits built in occlusionculling aswell as its loga-rithmic complexity in thenumberof sceneprimitives.

    c�

    TheEurographicsAssociationandBlackwellPublishers2001.Publishedby BlackwellPublishers,108 Cowley Road,Oxford OX4 1JF, UK and350 Main Street,Malden,MA02148,USA.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    Flexibility Ray tracingallows us to traceindividual or un-structuredgroupsof rays.Thisprovidesfor efficientcom-putationof justtherequiredinformation,e.g.for samplingnarrow glossyhighlights,for filling holesin image-basedrendering,andfor importancesamplingof illumination42.

    Efficient Shading With ray tracing,samplesareonly sha-dedafter visibility hasbeendetermined.Given the trendtowardmoreandmorerealisticandcomplex shading,thisavoidsredundantcomputationsfor invisiblegeometry.

    Simpler ShaderProgramming Programmingshadersthatcreatespeciallighting andappearanceeffectshasbeenatthecoreof realisticrendering.While writing shaders(e.g.for the RenderManstandard3) is fairly straightforward,adoptingtheseshadersto beusedin thepipelinemodelofrasterizationhasbeenverydifficult 31. Sinceraytracingisnot limited to this pipelinemodelit canmake direct useof shaders16� 38.

    Corr ectnessBy default raytracingcomputesmostlyphysi-cally correctreflections,refractions,andshading.In ca-se the correct resultsare not requiredor are too cost-ly to compute,ray tracing can easily make use of thesameapproximationsusedto generatetheseeffects forrasterization-basedapproaches,suchas reflectionor en-vironmentmaps.This is contraryto rasterization,whereapproximationsare the only option and it is difficult toevencomecloseto realisticeffects.

    Parallel Scalability Ray tracingis known for being“tri vi-ally parallel” aslong asa high enoughbandwidthto thescenedatabaseis provided.Giventheexponentialgrowthof availablehardwareresources,raytracingshouldbebet-terableto utilize it thanrasterization,whichhasbeendif-ficult to scaleefficiently 12. However, theinitial resourcesrequiredfor a ray tracingenginearehigherthanthosefora rasterizationengine.

    Coherence is thekey to efficient rendering.Dueto thelowcoherencebetweenrays in traditional recursive ray tra-cing implementations,performancehasbeenratherlow.However, aswe show in this paper, ray tracingstill offersconsiderablecoherencethatcanbeexploitedto speeduprenderingto interactive levelsevenon a standardPCs.

    It is due to this long list of advantagesthat we belie-ve ray tracing is an interestingalternative even in the fieldof interactive 3D graphics.Thechallengeis to improve thespeedof ray tracingto the extent that it cancompetewithrasterization-basedalgorithms.It seemsthatsomehardwaresupportwill eventuallybeneededto reachthisgoal,howeverin thispaperweconcentrateonapuresoftwareimplementa-tion.

    While it is certainlytruethat ray tracinghasa high com-putationalcost,its low performanceon todayscomputersisalsostronglyaffectedby thestructureof thebasicalgorithm.It is well-known thattherecursivesamplingof raytreesneit-herfits with thepipelineexecutionmodelof modernCPUsnor with theuseof cachingto hidelow bandwidthandhighlatency whenaccessingmainmemory28.

    Many researchprojectshave addressedthetopic of spee-ding up ray tracing14� 15 by variousmethodssuchasbetteraccelerationstructures,fasterintersectionalgorithms,paral-lel computation34� 8, approximatecomputations6, etc.Thisresearchhasresultedin a largenumberof improvementstothebasicalgorithmandis documentedin theray tracingli-teratureof thepastdecades.

    In our implementationwebuild onthispreviouswork andcombineit in a novel andoptimizedway, payingparticularattentionto caching,pipelining,andSIMD issuesto achievemore than an order of magnitudeimprovementin ray tra-cing speedcomparedto otherwell-known ray tracerssuchasRayshadeor POV-Ray. As a result we areable achieveinteractive frameratesevenonstandardPCs(seeFigure1).

    We start this paper with a description of our high-performanceray tracingengine,discussinggeneraloptimi-zationstrategies(Section2), a vectorizedintersectionalgo-rithmusingIntel’sSSESIMD instructions,whichis workingonpacketsof rays(Section3),asimpleandefficientBSPtra-versalalgorithmsthatalsoworksonraypackets(Section4),andfinally aSIMD shadingimplementation(Section5).Theperformanceof our ray tracingengineis thenevaluatedinSection6.

    Wethenuseourraytracingenginetoshow thatraytracingis particularlywell suitedfor efficient renderingof complexmodels(Section7). We evaluatetheperformanceandscala-bility of ourraytraceronanumberof modelsrangingfrom afew ten-thousandup to 8 million triangles.Wealsocompareoursoftwareray traceragainsttheperformanceof OpenGL-basedrasterizationhardwaresuchasalow-costNvidiaGPU,a SGI Octaneworkstation,anda SGI ONYX-3 graphicssu-percomputer. We show thateventodaytheperformanceof asoftwareray traceron a singlePC canchallengededicatedrasterizationhardware for complex environments.Additio-nally, we show early resultsof distributedray tracingusinga few desktopPCsthat outperformsthe graphicshardwareabove for complex scenes.

    1.1. PreviousWork

    Even thoughray-tracingis asold as1968 4� 20� 43� 9, its usefor interactiveapplicationsis relatively new. RecentlyParkeret al. 28� 29� 30 demonstratedthat interactive frameratescouldbeachievedwith a full-featuredray traceron a largesharedmemorysupercomputer.

    Their implementationoffersall theusualray tracingfea-tures,includingparametricsurfacesandvolumeobjects,butis carefully optimized for cacheperformanceand parallelexecution in a non-uniform memory accessenvironment.They have proven that ray tracingscaleswell in the num-berof processorsin a sharedmemoryenvironment,andthatevencomplex scenesof severalhundredthousandprimitivescouldberenderedat almostreal-timeframerates.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    Pharret al. 32 have shown that coherencecanbe exploi-ted by completelyreorderingthe ray tracing computation.They wereable to rendersceneswith up to 46 million tri-angles.Theirapproachactively managesthescenegeometryandraysthroughpriority queuesinsteadof relyingonsimplecachingaswe do. However, their systemwasfar from realtime.

    Hardwareimplementationsof ray tracingareavailable40,but are currently limited to acceleratingoff-line-renderingapplications,anddonot targetinteractive framerates.

    2. An Optimized Ray Tracing Implementation

    The following four sectionsdescribeour implementationof a highly optimizedray tracing enginethat outperformscurrentlyavailableray tracersby morethananorderof ma-gnitude(seeSection6).

    We startwith an overview of generaloptimizationtech-niquesandhow they have beenappliedto our ray tracingengine,suchasreducingcodecomplexity, optimizingcacheusage,reducingmemorybandwidth,and prefetchingdata.A similar discussionof optimizationissues– althoughon ahigher-level – canalsobefoundin 39. In thefollowing secti-onswethendiscusstheuseof SIMD instructions,commonlyavailableonmicroprocessorstoday, to efficiently implementthe main threecomponentsof a ray tracer:ray intersectioncomputations,scenetraversal,andshading.

    2.1. CodeComplexity

    A modernprocessorhasseveral hardware featuressuchasbranchprediction,instructionreordering,speculative execu-tion, and other techniques17� 18 in order to avoid expensi-ve pipeline stalls.However, the successof thesehardwareapproachesis fairly limited anddependsto a large degreeon thecomplexity of theinput programcode.Therefore,weprefer simple codethat containsfew conditionals,and or-ganizeit suchthat it canexecutein tight inner loops.Suchcodeis easierto maintainandcanbewell optimizedby theprogrammeraswell as by the compiler. Theseoptimizati-onsbecomemoreandmoreimportantasprocessorpipelinesgetlongerandthegapbetweenprocessorspeedandmemorybandwidthandlatency opensfurther.

    For traversing the scene,we use an axis-alignedBSP-tree23: Its raytraversalalgorithmis shorterandsimplercom-paredto octrees,boundingvolumehierarchies(BVH), andgrids.Eventhoughmosteasilyformulatedrecursively, it canbetransformedto a compactiterative algorithm21.

    We have alsochosento only supporttrianglesasgeome-tric primitives.As a result,the inner loop thatperformsin-tersectioncomputationson lists of objectsdoesnot have tobranchto a differentfunctionfor eachtypeof primitive.Bylimiting the codeto triangleswe loselittle flexibility asallsurfacegeometrycan be converted.The sameapproachis

    beingusedby mostcommercialray tracers(accordingto in-formationfrom their developers).While thenumberof pri-mitivesincreases,this is morethancompensatedby thebet-terperformanceof theray tracingengine.

    For shadingwe needthe flexibility to supportarbitraryshadersandthusallow for dynamicloadingof shaders.Ad-vancedshadingfeatureslikemulti-texturing,bump-mappingor reflectionscould be addedwithout changingthe coreofthe ray tracing engine.Flexibility in the shadingstageismuch lessproblematicthan for intersectioncomputations,asit is only calledoncefor eachshadingray, while we per-form anaverageof 40-50traversalstepsand5-10intersecti-on testsperray.

    2.2. Caching

    Contrary to generalopinion, our careful profiling revealsthat a ray traceris not boundby CPU speed,but is in factbandwidth-boundby accessto main memory. Especiallyshootingraysincoherently(asdonein many globalillumina-tion algorithms)resultsin almostrandommemoryaccessesandbadcacheperformance.On currentPC systems,band-width to mainmemoryis typically upto 8-10timeslessthanto primarycaches.Evenmoreimportantly, memorylatencyincreasesby similar factorsaswegodown thememoryhier-archy. Forexample,ourtriangletestis morethan60%slowerif the datahasto be fetchedfrom main memoryinsteadofbeingin the cache.Memory issuesbecomeeven more im-portantfor BSPtraversal,wheretheratio of computationtomemorybandwidthis lower, thusmakingit moredifficult tohidelatencies.

    Sincedatatransferbetweenmemoryandcacheis alwaysperformedin entirecachelinesof 32bytes,theeffectivecostwhenaccessingmemoryis not directly relatedto the num-berof bytesread,but thenumberof cacheline transfers.Asageneralresultweneedto carefullylay outdatasuchthatitmakesbestuseof theavailablecachesanddesignour algo-rithmssothatwe canefficiently hidelatency by prefetchingdata,suchthat it is alreadyavailable in a cachewhen it isneededfor computations.

    Wecarefullyaligndatato cachelines:Thisminimizestheadditionalbandwidthrequiredto loadtwo cachelinessimplybecausesomedatahappento straddleacacheline boundary.However thereareoften trade-offs. For instanceour trian-gle datastructurerequiresabout37 bytes.By paddingit to48 byteswe trade-off memoryefficiency andcacheline ali-gnment.

    We keepdatatogetherif andonly if it is usedtogether:E.g.only datanecessaryfor atriangleintersectiontest(planeequation,etc.) arestoredin our geometrystructures,whiledatathatis only necessaryfor shading(suchasvertex colorsand normals,shaderparameters,etc.) is storedseparately.Becausewe intersecton averageseveral trianglesbeforewe

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    find an intersection,we avoid loadingdatathat will not beused.

    Given thehugelatency of accessingmainmemoryit be-comesnecessaryto load datainto the cachebeforeit willbe usedin computationsandnot fetch it on demand.Thisway thememorylatency canbecompletelyhidden.Most oftodaysmicroprocessorsoffer instructionsto explicitly pre-fetchdatainto certaincaches.However, in orderto usepre-fetchingeffectively, algorithmsmustbesimpleenoughsuchthat it caneasilybe predictedwhich datawill beneededinthenearfuture.

    2.3. Cachesand Mailboxes

    We also separateread-only, e.g. preprocessedtriangle da-ta, from read-writedata such as mailboxes 15. If a mail-box would be storedwith the triangledataas in the origi-nal proposal,anentirecacheline would bemarkedchangedeventhoughonly asingleintegerhasactuallybeenmodified.This becomesa hugeproblemin a multi-threadedenviron-ment,whereby constantlychangingmailboxeseachproces-sorkeepsinvalidatingcachelinesin all otherprocessors.

    Theseproblemscan easily be resolved by employing asimplehashingmechanism:eachthreadcomputingintersec-tionshasasmallhashtableof entriesof theform (triangleId,rayId). A mailbox lookup thensimply consistsof checkingthecorrespondinghashtableentry. Sincea goodscenetra-versalalgorithmresultsin only a few triangleintersectionsperray, thehash-tablecanbekeptsmall.

    Similarly we do not needan elaboratehashfunction butsimplemaskingof the triangleid will do. Due to thesmallamountof memoryusedandthefrequentaccessesto its ent-ries,thehashtablewill stayin thefirst level cachesmostofthe time. Occasionalredundantintersectionsof objectsdueto hashcollisionsarefaroutweighedby thevastlyimprovedmemory-performance.This mechanismis simpleenoughtobeimplementedby only a few linesof code.

    2.4. Coherencethr oughPacketsof Rays

    The mostimportantaspectof acceleratingray tracingis toexploit coherenceasfaraspossible.Ourmainapproachis toexploit coherenceof primaryandshadow raysby traversing,intersecting,andshadingpacketsof rays in parallel.Usingthis approachwe canreducethe computetime of the algo-rithm by usingSIMD instructionson multiple raysin paral-lel, reducememorybandwidthby requestingdataonly onceperpacket,andincreasecacheutilization at thesametime.

    2.5. Parallelism thr oughSIMD Extensions

    Several modernmicroprocessorarchitecturesoffer SIMDextensions,which allow to executethe samefloating pointinstructionsin parallelon several(typically two to four) da-ta values,therebyyielding a significantspeedupfor floating

    point intensiveapplicationsincluding3D graphics.Suchex-tensionsalso containinstructionsfor explicit cachemana-gementlike prefetching.Examplesof suchextensionsareIntel’s SSE19, AMD’ s 3dNow! 1, andIBM/Motorola’s Al-tiVec26.

    In the following threesectionswe discussin moredetailhow coherentcomputationswith packetsof raysandSIMDoperationscan be usedtogetherto speedup the core of araytracer, namelytriangleintersection,raytraversalandsha-ding.

    3. Ray-Triangle Intersection Computation

    Optimal ray triangleintersectioncodehaslong beenanac-tive field of researchin computergraphicsandhasleadto alargevarietyof algorithms,e.g.Moeller-Trumbore25, Glas-sner15, Badouel5, Pluecker 13, andmany others24. BeforediscussingSIMD implementations,we first describethetri-angletest usedin our C-codewithout using assemblerorSSEoptimizations.This formsthebasefor our laterdiscus-sions.

    3.1. Optimized Barycentric Coordinate Test

    The triangletestusedin our implementationis a modifica-tion of Badouel’s algorithm5. It first computesthedistanceto the point wherethe ray piercesthe planedefinedby thetriangle,andchecksthatdistancefor validity. Only if thedi-stancefallswithin theinterval wheretheray is searchingforintersections,theactualhit pointH is computedandprojec-tedinto a 2D-planeperpendicularto a coordinateaxis.

    In orderto preventnumericalinstabilities,theplanewiththelargestangleto thetrianglenormalis chosenfor thepro-jection.This resultsin threecasesfor the intersectioncom-putation.Thebarycentriccoordinatesof thehit point H canthenbecalculatedefficiently in 2D. Basedonthesebarycen-tric coordinates,it can be decidedwhetherthe ray piercesthetriangleor not.

    For the implementation,we needonly the properlysca-led2D edgeequationsfor two of thetriangleedges,togetherwith the planeequationfor the distancecalculation,andatagto marktheprojectionaxis.By preprocessingandproperscalingof theseequations,this informationcanbeexpressedby 9 floatsplustheprojectionflag.For cachealignmentpur-poses,we padthat datato a total of 48 bytes.An in-depthdescriptionof theimplementationcanbefoundin 41.

    3.2. Evaluating Instruction Level Parallelism

    Theimplementationof thebarycentrictriangletestrequiresonly few instructionsandoffers almostno potentialfor ex-ploiting instruction-level parallelism.Optimizing the algo-rithmsusingtheIntel SSEextensionsresultsin aspeedupofabout20%. It is clearthat this speedupis not sufficient forinteractive ray tracing.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    Bary. Pluecker Bary. speedupC code SSE SSE4-1

    min 78 77 22 3.5max 148 123 41 3.7

    Table 1: Cost(in CPU cycles)for thedifferent intersectionalgorithms.41 cyclescorrespondto roughly20 million in-tersectionspersecondona 800MHzPentium-III.MeasuredbyusingtheinternalPentium-IIICPUcounters.

    As anotheralternative, we alsoevaluateda SIMD imple-mentationof thePluecker triangletest(see36� 13). Dueto ali-nearcontrolflow andasomewhathighercomputationalcost,this triangletestoffersmuchmorepotentialfor instruction-level parallelism.The SSE implementationis straightfor-ward and showed good speedupscomparedto a C imple-mentationof the Pluecker test.However, due to its highercomputationalcostandparticularlyits higherbandwidthre-quirements,the instruction-parallelPluecker codeis effec-tively not significantlyfasterthantheoriginal barycentricCcode(seeTable1).

    3.3. SIMD Barycentric Coordinate Test

    The speedupachieved with the instruction-parallelimple-mentationsis toosmallto beof significantimpacton rende-ring time. We thereforewentbackto the alreadyfastbary-centriccode,anduseddataparallelismby performingfourray-triangletestsin parallel.However, this meansto eitherintersectoneray with four triangles,or to intersecta packetof four rayswith a singletriangle.Thelattercaserequiresachangeto theoverall architectureof theray tracingengine.

    Intersectingoneray with four triangleswould requireusto always have four trianglesavailable for intersectiontoachieve optimal performance.However, voxels of accele-ration datastructuresshouldcontainonly few trianglesonaverage(typically 2-3). More importantly, trianglesfall in-to threedifferentprojectioncaseswith separatecodeeach,which lowerstheoptionsto usedataparallelismevenmoreandprecludestheuseof thisapproach.

    In contrastit is muchsimplerto bundlefour raystogetherandintersectthemwith a singletriangle.However, this ap-proachrequiresus to always have a bundle of four raysavailable together, which requiresa completelynew scenetraversalalgorithm,which wediscussin thenext section.

    Thedata-parallelimplementationcorrespondsalmostex-actly to theoriginal algorithmandis straightforward to im-plementin SSE.A potentialsourceof overheadis thateventhoughsomeraysmay have terminatedearly, all four rayshave to beintersectedwith a triangle.Informationon whichof the four rays is still active is kept in a bit-field, whichcanbe usedto maskout invalid raysin a conditionalmoveinstructionwhenthehit point informationis stored.In prac-

    tice we obtain almostperfectparallelismfor primary raysandto a somewhatlesserdegreefor shadow rays.

    In our implementation,theSSEcodefor intersectingfourrays with a single triangle requires86-163 CPU cycles.Amortizing this cost over the four rays resultsin only 22to 41 cyclesper intersection,which correspondsto rough-ly 20 to 36 million ray-triangleintersectiontestpersecondon a 800 MHz Pentium-III CPU. The observed speedupis3.5-3.7(seeTable1), andis closeto themaximumexpectedvalue.Note that this algorithmcould alsobe usedto acce-lerateotherray tracing-basedrenderingalgorithmssuchasmemorycoherentray tracing32.

    4. BSPTraversal

    Evenbeforeacceleratingthetriangletest,traversalof theac-celerationstructurewastypically 2-3 timesascostlyasray-triangle intersection.As the SSEtriangle intersectioncodereducestheintersectioncostby morethana factorof three,traversalis thelimiting factorin our ray tracingengine.Sin-ceourSSEintersectionprocedurerequiresusto alwayshavefour raysavailablethis suggestsadataparalleltraversalof abundleof at leastfour rays.

    A wide variety of ray tracing accelerationschemesha-vebeendeveloped,suchasoctrees,generalBSP-trees,axis-alignedBSP-trees,regularandhierarchicalgrids,rayclassi-fication,boundingvolumehierarchies,andeven hybridsofseveralof thesemethods.See37� 15 for anoverview andfur-therreferences.Our mainreasonfor usinga BSPtreein ourimplementationis the simplicity of the traversalcode:Tra-versinganodeis basedononly two binarydecisions,oneforeachchild, which canefficiently bedonefor several raysinparallelusingSSE.If any ray traversesa child, all rayswilltraverseit in parallel.

    This is in contrastto algorithmslike octreesor hierar-chical grids, whereeachof the raysmight take a differentdecisionof which voxel to traversenext. Keepingtrack ofthesestatesis non-trivial andwasjudgedto be too compli-catedto beimplementedefficiently. BoundingVolumeHier-archieshave a traversalalgorithmthat comesclosein sim-plicity to BSPtrees.However, BVHs donot implicitly ordertheir child nodes,which is anotherreasonfor our choiceofaxis-alignedBSPtrees.

    4.1. Traversal Algorithm

    Beforedescribingour algorithmfor traversalof four raysinparallel,we first take a look at the traversalof a singleray,as presentedin 23: In eachtraversalstep,we maintainthecurrent ray segment � near� f ar � , which is thepartof theraythatactuallyintersectsthecurrentvoxel. This raysegmentisfirst initializedto � 0 ����� , thenclippedto theboundingboxofthescene,andis updatedincrementallyduringtraversal.For

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    a.)

    +

    b.)

    +

    c.)

    +

    Figure2: Thethreetraversalcasesin a BSPtree:A rayseg-mentis completelyin frontof thesplittingplane(a), comple-tely behindit (b), or intersectsbothsides(c).

    eachtraversednode,we calculatethedistanced to thesplit-ting planedefinedby thatnode,andcomparethatdistancetothecurrentraysegment.

    If theraysegmentliescompletelyononesideof thesplit-ting plane( f ar d or d near), weimmediatelyproceedtothe correspondingchild voxel. Otherwise,we traversebothchildrenin turn,with theraysegmentclippedto therespecti-ve child voxel. Thisavoidsproblemsfor raysthatareforcedby anotherray from thesamepacket to traverseavoxel theywouldnototherwisetraverse.For thoseraystheraysegmentwill betheemptyinterval.

    Thealgorithmfor tracingfour differentraysis essentiallythesame:For eachnode,we useSSEoperationsto compu-te the four distancesto the splitting planeand to comparetheseto the four respective ray segments,all in parallel.Ifall raysrequiretraversalof the samechild, we immediate-ly proceedto that child without having to changethe raysegments.Otherwise,we traverseboth children,with eachraysegmentsupdatedto � near� min f ar� d ��� for thecloser, re-spectively �max near� d ��� f ar � for thedistantchild.

    Whentraversingseveralraysatthesametime,theorderoftraversalcanbeambiguous,sincedifferentraysmight requi-readifferenttraversalorder. Sincetheorderis basedonly onthe sign of the respective direction,this canhappenonly ifthesignsof thefour directionvectorsdonotmatch,whichisa rarecaseif we assumetheraysto becoherent.Additional-ly, it canbeshown thatnotwo rayswith thesameorigin canrequiredifferent traversalorders.This completelyresolvesthis problemfor pinholecamerasandpoint light sources.Ifraysareallowedto startin differentlocations,a straightfor-wardsolutionis to only allow rayswith matchingdirectionsignsin the samepacket, andtracingthe few specialcasesseparately.

    4.2. Memory Layout for Better Caching

    As mentionedabove,theratioof computationto theamountof accessedmemoryis very low for scenetraversal.This re-quiresus to carefully designthe datastructurefor efficientcachingandprefetching.Memorybandwidthsandcacheuti-

    lization have beenimproved with a compact,unified nodelayout.For innernodesin a BSPnodewe have to store

    Pointersto the two child nodes.By implicitly storingthe

    right child immediatelyaftertheleft child, this canbere-presentedwith a singlepointer. A flag on whetherit is a leaf nodeor the type of innernode(splitting axis).This requirestwo bits. The split coordinate,which is the coordinatewheretheplaneintersectsits perpendicularaxis.

    For bestperformancewe useonefloat for thesplit coor-dinateandsqueezethetwo flag bits into the2 low orderbitsof thepointer, which resultsin 8 bytespernodeor 4 nodesper cacheline. By aligning the two childrenof a nodeonhalf a cacheline we make surethat both childrenare fet-chedtogethersincethey arelikely to be traversedtogether.The additionalcomputationsto extract thesetwo bits fromthepointerarenegligible asthetraversalcodeperformsveryfew computationsanyway comparedto the amountof me-mory it accesses.

    For leafnodesthepointeraddressesthelist of objects,andtheotherfieldscanbeusedto storethenumberof objectsinthe list. Using the samepointerfor both nodetypesallowsusto reducememorylatenciesandpipelinestallsby prefet-ching,asthenext data(eithera nodeor thelist of triangles)canbe prefetchedbeforeeven processingthe currentnode.Even thoughprefetchingcanonly be usedwith SSEcachecontroloperations,thereducedbandwidthandimprovedca-cheutilization alsoaffect thepureC implementation.

    4.3. Traversal Overhead

    Traversingpacketsof raysthroughtheaccelerationstructuregeneratessomeoverhead:Even if only a single ray requi-restraversalof a subtreeor intersectionwith a triangle,theoperationis alwaysperformedon all four rays.Our experi-mentshave shown that this overheadis relatively small aslongastheraysarecoherent.Table2 shows theoverheadinadditionalBSPnodetraversalsfor differentpacket sizes.

    As canbe seenfrom this experiment,overheadis in theorder of a few percentfor 2 � 2 packets of rays,but goesup for larger packets.On the otherhand,increasingscreenresolutionalsoincreasescoherencebetweenprimaryrays.

    Most important is the fact that the effective memorybandwidthhasbeenreducedessentiallyby a factorof fourthroughthenew SIMD traversalandintersectionalgorithmsastrianglesandBSPnodesneednotbeloadedseparatelyforeachray. This effect is particularlyimportantfor ray traver-salasthecomputationto bandwidthratio in relatively low.

    Of courseonecouldoperateonevenlargerpacketsof raysto enhancetheeffect.However, our resultsshow thatwe arerunningalmostcompletelywithin theprocessorcachesevenwith only four rays.Wehavethereforechosennotto usemo-re raysper ray packet, asit would additionallyincreasethe

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    2 � 2 4 � 4 8 � 8 2562 10242

    Shirley 6 1.4% 4.4% 11.8% 5.8% 1.4%MGF office 2.6% 8.2% 21.6% 10.4% 2.6%MGF conf. 3.2% 10.6% 28.2% 12.2% 3.2%

    Table 2: Overhead(measuredin numberof additionalnodetraversals)of tracingentirepacketsof raysat animagereso-lution of 10242 in thefirst threecolumns:Asexpected,over-headincreaseswith scenecomplexity (800, 34k, and 280ktriangles,respectively)and packet size, but is tolerable forsmall packet sizes.Thetwo columnson the right showtheoverheadfor 2 � 2 packetsat differentscreenresolutions.

    overheaddueto redundanttraversalandintersectioncompu-tations.

    5. SIMD PhongShading

    As data-parallelintersectionandtraversalhasshown to bevery effective, the samebenefitsshouldalsoapply for sha-ding computations.Similar to the traversalandintersectioncode,we canshadefour raysin parallel.Sincethe four hitpointsmayhave differentmaterials,datahasto be rearran-ged.Althoughthissetupresultsin someoverhead,thefollo-wingshadingoperationscanbeveryefficiently implementedin SSE,yielding almostperfectutilizationof theSSEunits.

    Light sourcesareprocessedin turn: For eachlight sour-ce,we first determineits visibility by shootingshadow raysusing the traversal and intersectionalgorithms describedabove. If a light sourceis visible from at leastone pixel,its contribution to all four hit pointsis computedin parallel.Thiscontribution is thenaddedto thevisiblehit pointsonly,by maskingout shadowed points.This procedurecomputesinformationthatmaygetdiscardedlaterandthushassomeoverhead.However, this happensonly in the casethat thevisibility of a light sourceis differentbetweenthesetof hitpoints.For a coherentsetof raysthis happensbut rarely.

    Specialcarehasto be taken when shootingthe shadowrays.Sinceshadow raystypically make up thelargestfracti-onof all raysin araytracer, shootingthemwith thefastSSEtraversalcodeis desirable.This,however, is only efficientaslongastheraysarecoherent,whichis notautomaticallytruefor shadow rays,sinceall shadow raysfrom asinglehit pointtypically go in very differentdirections.However, coherentprimary rays are also likely to hit similar locationsin thescene,yielding coherentshadow raysif connectedto oneofthelight sources.In theworstcase(i.e. if theraysareinco-herent),performancedegradesto theperformanceachievedwhentracingeachrayon its own.

    Implementingtheshadingin SSEoperationsgivesaspee-dupof 2 to 2.5ascomparedto theC implementationon topof thespeedupobtainedby thegeneraloptimizationsdiscus-sedabove.Texturinghasshown to berelatively cheap.Even

    scene flat textured

    Quake 4.22fps 3.85fpsTerrain 1.05fps 0.96fps

    Figure 3: Texturing comesrathercheaply:Evenfor a com-plex scenewith incoherent texture access,texturing onlyslightlyreducestheframerate, renderingevena sceneof onemillion trianglesinteractively. Image resolutionis 5122.

    Figure4: Framesfromtheaccompanyingvideoshowingtheentire conferenceroombeingreflectedin thefire extinguis-her(left).Performanceonlydropsslightlyevenwhena largefractionof thesceneis reflective. Theofficehasbeenrende-redwith manyreflectivematerials(window, lamp,mug, andothers) andthreepoint light sources.

    anunoptimizedversionhasreducedframeratesby lessthan10percent,ascanbeseenin Figure3.Thiscostcouldproba-bly bereducedevenmoredueto a largepotentialfor prefet-chingandparallelcomputationsthatwecurrentlydonottakeadvantageof. As shadingtypically makesupfor lessthan10percentof total renderingtime,morecomplex shadingope-rationscould easilybe addedwithout a majorperformancehit.

    6. Performanceof the Ray Tracing Engine

    After all thepartsof afull raytracerarenow togetherwecanevaluatetheoverall performanceof our system(RTRT). Westartby evaluatingtheperformancefor primaryraysasthiswill allow us to comparethe ray tracingalgorithmdirectlyto rasterization-basedalgorithmsthatdonotdirectlysupportshadows, reflection,andrefractioneffects.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    Tris Rayshade POV-Ray RTRT

    MGF office 40k 29 22.9 2.1MGF conf. 256k 36.1 29.6 2.3

    MGF theater 680k 56.0 57.2 3.6Library 907k 72.1 50.5 3.4

    SodaFloor5 2.5m OOM OOM 2.9SodaHall 8m OOM OOM 4.5

    Table 3: Performancecomparisonof our ray traceragainstRayshadeandPOV-Ray. All renderingtimesaregivenin mi-crosecondsperprimary ray includingall renderingoperati-onsfor thesameview of each sceneat a resolutionof 5122

    (OOM = outof memory).

    Onasingle800MHz Pentium-III,weachievearenderingperformancefrom about200,000to almost1.5 million pri-maryrayspersecondfor theSSEversionof our algorithm.If wecomparetheperformanceof thethisversionto ourop-timizedC codewe seeanoverall speedupbetween1.8 and2.5. This is a bit lessthanthat for ray-triangleintersectionbut is dueto theworseratio of memoryaccessesto compu-tationsin thetraversalstageandthestrictsequentialtraversalorderthatdoesnotallow for betterprefetching.

    6.1. Comparison to Other Ray Tracers

    In order to evaluatethe performanceof our optimizedraytracingenginewetestedit againstanumberof freelyavaila-ble ray tracers,includingPOV-Ray 27 andRayshade22. Wehave chosenthe setof testscenesso that they spana widerangeregardingthe numberof triangleand the overall oc-clusionwithin thescene.Unfortunately, bothothersystemsfailedto rendersomeof themorecomplex testscenesduetomemorylimitationsevenwith 1GB of mainmemory.

    Thenumbersof theperformancecomparisonfor thecaseof primaryraysaregivenin Table3. It demonstratesclearlythat our new ray tracing implementationimproves perfor-manceconsistentlyby a factorbetween11 and15 (!) com-paredto both POV-Ray andRayshade.The numbersshowthatpayingcarefulattentionto cachingandcoherenceissu-escanhave a tremendouseffect on theoverall performance,evenfor suchwell-analyzedalgorithmsasray tracing.

    The numbersalsoseemto indicatethat the performancegapwidensslightly for morecomplex scenes,which indica-te thatthecachingeffect getevenmorepronouncedin thesecases.Our implementationwastestedonamachinewith on-ly 256MB of mainmemory, while we hadto usea machinewith 1GBof memoryfor theotherray tracers.

    Somecommentson theseresultsarenecessary:Rayshadeis usinga uniform grid asan accelerationstructureandwehad to determinethe bestgrid size for eachsceneby trialanderror. No suchmanualoptimizationswasnecessaryforPOV-Rayandour implementation.Also both otherray tra-

    cerscouldnot dealwell with largescenesandreportedoutof memoryerrorsfor scenesbeyond1 million triangles.

    Of coursePOV-RayandRayshadeoffer considerablymo-re featuresthanour ray tracerengine.However, mostof the-sefeaturesareshadingrelatedandcouldeasilybeaddedtoour engineusingdynamicallyloadableshaders.This wouldhave little effect on theperformanceof thecoreengineun-lessthosefeaturesareused.Theotherraytracersarealsonotlimited to only usetrianglesto representobjects.However,we believe this is actuallyanadvantagefor usandis partlythereasonfor thegoodperformance.Finally, theseotherraytracersarenotwrittenwith highestoptimizationin mindbutaremoretargetedtowardsalargefeatureset.Webelievethatwewill beableto show in thefuturethatthesetwo goalsdonotcontradicteachother.

    6.2. Reflectionand Shadow Rays

    Of coursea ray tracingenginewould not be completeif itcouldnot handleshadows, reflection,andrefraction.Theseeffectsalsochallengeouroverall approachasraycoherencecanbeconsiderablylessfor shadow or evenreflectionrays.Althoughthehandlingof secondaryraysis notyetfully opti-mizedin our implementationweweresurprisedby thegoodperformanceweobservedevenfor extremecasesof reflecti-vity.

    The accompanying video shows a walkthroughof theMGF conferencescene,wheremostof the materialhasatleastaslight contributionby reflectionrays.Eventhedoors,wall panelswith fixtures,andmetalframesof theseatsgene-ratereflectionrays,oftenresultingin multiple reflectionsasclearlyvisible whenzoomingtowardsthefire extinguisher,which reflectstheentirescene(seeFigure4).

    Sphere-like objectssuchas the fire extinguisherarepo-tentialhot spotsin sceneslike theseasthey cantriggerlargenumbersof reflectionraysthatsampletheentirevisible en-vironmentandarelikely to have adverseeffect on caching.It is interestingto seethat theeffect is hardlynoticeableaslongastheseobjectscoveronly moderatepartsof theimage.In thiscaseonly afew raysarereflectedalmostrandomlyin-to theenvironment.Thoserayspotentiallysampletheentirescenebut our accelerationstructuresuccessfullylimits thedatabeing accessedto only a few BSP-cellsand trianglesalongthepathsof thosefew rays.As a resultthe impactonperformanceremainslow.

    Performancedegradessignificantlyonly if zoomingin ona reflective objectsuchthat it fills the field of view. In thiscasealmostall visible geometrywill actually be sampledandcachingwill no longerbeeffective for largescenes.Ho-wever, this is anunavoidableconsequenceof dealingwith aworking setmuchlarger thanthe cache(our largestscenesoccupy closeto 2GB of memorybut renderfine with 256MB of main memory).In thosecasesit seemsunavoidableto useapproximationssuchasa reflectionmap.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    An exampleimagerenderedwith reflectionsandshadowscanalsobe seenon the right in Figure4. Many objectarereflective andgeneratereflectionsrays.Also threeshadowraysaresentfor eachintersection.Theperformanceis main-ly influencedby thenumberof shadow rays.

    7. Ray Tracing of ComplexScenes

    Thereis astrongtrendtowardsmoreandmorecomplex sce-nesthat needto be rendered.Many disciplinesneedto vi-sualizelarge assembliessuchas whole cars,ships,airpla-nes,power plants,andsimilar structures.It is oftenvery ti-meconsumingto preprocessthedatain orderto reducethecomplexity to a level manageableby currentrenderingtech-nology. Thesepreprocessingstepsarenon-trivial andoftenrequireconsiderableuserinteraction2. We expectthis trendto morecomplex modelsto increaseascomputingandme-mory resourcesmake large assemblieseasierto storeandhandle.

    Ray tracingstill needspreprocessingfor thosedatasets,but thispreprocessingis limited to purespatialorderinganddoesnot involveany complex computationsonthegeometryitself, aswould for instancegeometricsimplification.Addi-tionally ray tracinghasocclusionculling built into the al-gorithmsanddoesnot requirecomplex precomputationofdata,suchasthepotentiallyvisible set(PVS)or similar da-ta structures2. Consequently, ray tracingis especiallywellsuitedfor largeandcomplex models.

    7.1. Comparisonwith RasterizationHardware

    We startedthis paperwith the claim madeby researchersin the past that ray tracing would eventually becomefa-sterthanrasterizationhardware.However, it wasunclearatwhich point thatcrossover would happen,if at all. With theray tracing systemdescribedabove we arenow in a posi-tion to answerthis questions:We have alreadyreachedthecrossover point andcannow even outperformrasterizationhardwarewith a softwareray tracer– at leastfor complexscenesandmoderatescreenresolutions.

    For this demonstrationwe comparedthe performanceofour ray tracing implementationwith the renderingperfor-manceof the OpenGL-basedhardware. In order to get thehighestpossibleperformanceon this hardware we choseto renderthe sceneswith SGI Performer35, which is well-known for its highly optimizedrenderingenginethat takesadvantageof most available hardware resourcesincludingmultiprocessingon our multiprocessormachines.We haveusedthedefaultparametersof Performerwhenimportingthescenedatavia theNFF formatandwhile rendering.The32-bit versionof Performerthatwe usedwasunableto handlethe largestscene(SodaHall) becauseit ranout of memory.Weusedsimpleconstantshadingin all cases.

    Therasterizationmeasurementsof our experimentswere

    Scene Tris Octane Onyx PC RTRT

    MGF office 40k >24 > 36 12.7 1.8MGF conf. 256k >5 > 10 5.4 1.6

    MGF theater 680k 0.4 6-12 1.5 1.1Library 907k 1.5 4 1.6 1.1

    SodaFloor 2.5m 0.5 1.5 0.6 1.5SodaHall 8m OOM OOM OOM 0.8

    Table 4: OpenGLrenderingperformancein framesper se-condwith SGI Performeron threedifferent graphicshard-ware platformscompared with our software ray tracerat aresolutionof 5122 pixelson a dual processorPC. Theraytracerusesonlya singleprocessor, whileSGIPerformerac-tually usesall available.

    conductedon threedifferentmachinesin orderto get a re-presentativesampleof todayshardwareperformance.OnourPCs(dualPentium-III,800MHz, 256MB) weusedaNvidiaGeForceII GTS graphicscardrunningunderLinux. Addi-tionally, we usedan SGI Octane(300 MHz R12k, 4 GB)with therecentlyintroducedV8 graphicssubsystemaswellasabrandnew SGIOnyx-3 graphicssupercomputer(8x 400MHz R12k, 8 GB) with InfiniteReality3graphicsandfourrastermanagers.Theresultsareareshown in Table4.

    The resultsshow clearly that the software ray traceral-readyoutperformthebesthardwarerasterizationenginesforsceneswith a complexity of roughly1 million trianglesandmoreandis alreadycompetitive for scenesof abouthalf thesize.The ray tracing numberscan be scaledeasily by ad-ding moreprocessors— just enablingthe secondCPU onourmachinesdoublesourRTRT numbersgivenin Table4.

    In orderto visualizethescalingbehavior of rasterizationand ray tracing-basedrenderers,we usedthe large terrainsceneshown in Figure3 andsubsampledthegeometry. Theresultsareshown in Figure5. Even thoughSGI Performerusesa numberof techniquesto reducerenderingtimes,weseethe typical linearscalingof rasterization.Evenocclusi-on culling would not help in this kind of scene.Raytracingbenefitsfrom thefactthateachray visits roughlya constantnumberof trianglesbut needsto traverseaBSPtreewith lo-garithmicallyincreasingdepth.Raytracingalsosubsamplesthegeometryfor thehigherresolutionterrainasthenumberof pixels is lessthanthenumberof triangles.

    For sceneswith low complexity, rasterizationhardware,benefitsfrom the large initial costper ray for traversalandintersectionrequiredby a ray tracer. However, we believethat for thesecasesthereis still room for performanceim-provements.The large initial costper ray alsofavors raste-rizationfor higherimageresolutions.However, thiseffect islinear in the numberof pixels andcanbe compensatedbyaddingmoreprocessors,for instancein form of adistributedray tracer.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    -200 0 200 400 600 800 1000 1200

    sec per frame

    (thousand triangles)

    onyxnvidia

    octanertrt,512^2

    Figure 5: This figure showsthe logarithmic scalingof raytracingwith input complexity. We alsoshowthe linear sca-ling of differentrasterizationhardware. Thesceneswereob-tainedby subsamplingthe high resolutionterrain from Fi-gure 3 with more than onemillion triangles.Other scenesshowevenbetterresultsdueto more occlusion.

    7.2. Distrib uted Ray Tracing

    Sofar we have concentratedon simpleray tracingwith pri-mary raysonly thatcanbecomparedto rasterization-basedhardware.As weaddspecialray tracingeffectssuchassha-dows, reflections,or even global illumination, we arecon-frontedwith theneedto traceanincreasingnumberof rays.However, ray tracing is well-known for its almostperfectscalabilityin a distributedenvironment.

    Our small distributedray tracingsystemusesthe typicalmaster/slaveapproachwith socket-basedcommunication.Asimpleloadbalancingschemebasedonawork queueonthemasterkeepstheclient processorsbusyalmostall thetime.

    Table5 show thepreliminaryperformanceof our enginewhenconnectingto five dualPentium-IIIdesktopPCs(800and866MHz, 256to 768MB of memory).In particularwegeta decentrenderingspeedwith theoffice sceneevenwithreflectionsandshadows from threepoint light sources.

    The machinesare connectedwith a switched100-MbitEthernetandrenderingspeedis mainly limited by theband-width to the computerusedfor display. The uncompres-sedpixel streameasilysaturatesthe availablelinks andwearecurrentlyexperimentingwith higherbandwidthnetworkcomponentsto eliminateor at leastreducethecurrentbott-leneck.However, the bandwidthrequirementsareonly de-pendenton thesizeof theoutputimageandarenot affectedaswe proceedto morecostly renderingoperations,suchasmorelight sources,moreexpensiveshaders,supersampling,andglobalillumination.

    7.3. Description of AccompanyingVideo

    Thissubmissionalsocontainsavideo.Eachvideoclip is re-cordedlive from the screenof oneof our PCs.Ray tracingis computedat a resolutionof 640 by 480 pixels. The raytracingcomputationsareperformedonfivePCs,whichsend

    Scene Tris framerate

    Library 907k 7.7fpsMGF Theater 680k 7.3fpsMGF office 40k 2.4fps(*)

    Table 5: Interactiverenderingperformanceof a distributedray tracing implementationbasedon theoptimizedray tra-cing core. All imagesare rendered at a resolutionof 5122

    onfivePCs(Pentium-III,800MHzmachines).(*) Werenderprimary raysonly, exceptfor theofficewith containsreflec-tionsandshadowsfromthreepoint light sources.

    the computedpixels to the display host acrossa switched100Mbit Ethernet.Higherresolutionscouldnotberendereddueto network bandwidthrestrictions.Note thatsomesyn-chronizationartifactsarevisible, which aredueto theearlystageof developmentof this distributed renderingsystem.Thecurrentframerateis alwaysdisplayedin thetitle areaofthedisplaywindow.

    Thevideostartswith a simplemodelof a Quake monsterrenderedwith andwithouttexturesto show thattexturinghashardlyany effectontherenderingperformance.In particulartexturingperformanceis independentof scenecomplexity aseachpixel is only shadedonce.All scenesarerenderedwithonly a single primary ray for eachpixel unlessotherwisementioned.

    Thenext clip showstherenderingof atexturedterrainsce-necontaining1 million triangles.Thecamerastartszoomedin on a few triangleswith thepointeroutlining oneof them.We thenzoomout until almostall trianglesarevisible. Therenderingperformancechangesonly slightly in theprocess.

    We thenshow a walkthroughof the MGF theatrescenecontaining200k triangles.The illumination hasbeenpre-computedwith stochasticradiosityusingtheRenderParksy-stemby BekaertandSuykens7.

    Thenext clip actuallyshows two differentmodelsof thefully furnished,sevenfloor SodaHall building from Berke-ley. Theinitial view is thenon-illuminatedmodelconsistingof 1.5million triangles.As we enterthebuilding we switchto a illuminatedmodelsubdivided by the radiositycompu-tation into roughly8 million triangles.Note that the modelcontainscoplanartrianglesof which someare black. Thiscreatesartifactsthatareunrelatedto therenderingalgorithm.

    The next two clips show theMGF office andconferencescenewith reflectionby almostall materials.Pleasenotethereflectionin thedoors,rails alongthewalls, andtheir fixtu-res.Multiple reflectioncanfor instancebeseenin thelowerpart of the desklamp in the office clip. Performancestaysfairly highexceptwhendisplayingthereflective lampin fullscreen.

    Thefinal clip shows theoffice sceneagain,this time withreflectionsandshadows from threepoint light sources.The

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    performancedropsaccordingto the additionalnumbersofraysthatneedto betraced.

    8. Conclusions

    Thecomputationalcostof ray tracingis known to be loga-rithmic in termsof thenumberof triangles.In contrast,therenderingcostusinga rasterizationpipelineappearslinearin the numberof triangleseven with optimizationssuchasview frustumculling. Therefore,a break-even point in mo-delcomplexity wasexpected,abovewhichraytracingwouldbe preferredover rasterizationhardware.The main goal ofthis paperwasto investigatewherethis break-even point islocated,comparingan efficient softwareimplementationoftheray tracingalgorithmon a commodityPCwith state-of-the-artrasterizationhardware.

    Our ray tracingimplementationexploits a numberof no-vel techniques,describedabove, that make it morethananorder of magnitudefasterthan other ray tracerswe couldcomparewith:

    Carefulattentionis paidto exploiting coherencein theray

    tracingalgorithmsin orderto achieve goodcachingbeha-vior suchthat the algorithmscan essentiallyrun withinthe first andsecondlevel datacachesof the processors.Ourexperimentsindicatethatthis resultsin aspeed-upofroughlyhalf anorderof magnitude. Several strategies have been investigatedfor utilizingSIMD instructionsfound on commodity processors.Inour implementation,we usedIntel’s SSEextensionson aPentium-IIIprocessor. A significantspeed-upcanonly beobtainedby re-orderingthe ray tracingalgorithmso thatraysaretracedin packetsof four coherentrays.This re-ducesthememorybandwidthby afactorof four andgivesanadditionalspeed-upfactorof about2.

    We have comparedrenderingspeedsof this ray tracingimplementationon a 800MHz Pentium-III basedLinux PCwith thoseobtainedusing a high-endcommercialvisuali-zationpackage(SGI Performer)on threedifferentgraphicsaccelerators(NVidia GeForceII GTS,SGI Octanewith V8graphicsboard,SGIOnyx-3 with InfiniteReality3 graphics).Our experimentson a varietyof models(seeTable4), sug-gestthat the break-even point is reachedfor modelsof theorderof magnitudeof 1 million trianglesat a screenresolu-tion of 512 � 512.For largermodels,ray tracingwins.

    Both ray tracingand the Z-buffer algorithmhave a costcomponentlinearin thenumberof screenpixelsaswell ho-wever. In hardwareimplementationsof the rasterizationpi-peline,thiscostcomponentis almostnegligible. Thecostofray tracingis directly proportionalto the numberof pixels.Thebreak-evenpoint thereforeshiftstowardsmorecomplexmodels,proportionalto screenresolution.Moreover, moresophisticatedocclusionculling algorithmscurrently beingdevelopedmayreducethecostof a rasterizationpipelinetosub-linear, similar to ray-tracing.

    On theotherhand,by payingcarefulattentionto cachingissues,ray tracingis not limited by memorybandwidthbutrunswithin theprocessorcachesandperformancescalesli-nearlywith theimageresolution,thespeedof theprocessor,andwith thenumberof processors.

    Wetestedour implementationalsoon4-CPUsystemwithno performancedegradationandestimatea gradualbottlen-eck due to limited memorybandwidthonly at around6-8CPUswith currentPC technology. Thememorybandwidthof currentPCsystemsis ratherpoor andmeasuresat about200 MB per secondto main memory. A hardware imple-mentationwould allow for memorybandwidthin theorderof several GB per second,enoughto keepa large numberof parallelray tracingunitsbusy. This would allow for real-time visualizationfor a very wide rangeof models.We areactively investigatingsuitablehardwarearchitecturesfor thisapproach.

    We concludethat,unlike widely believed,theray tracingalgorithmis a viablealternative for a Z-buffer basedrasteri-zationpipelineespeciallywhenit comesto visualizinglargepolygonaldatasets.

    9. Acknowledgements

    We would like to thank the Max-Planck-Institutfür Infor-matik in Saarbrücken,Germany, for giving usaccessto theirequipmentandrasterizationgraphicshardware,andto JörgFischerfor his helpin creatingthestatistics.

    Finally, weareespeciallyindebtedto PhilippeBeakert forsupplyinguswith nicemodels,andfor his active supportingeneratingthecomparisonto rasterizationhardware.

    References

    1. AdvancedMicro Devices. Inside 3DNow![tm] Technology.http://www.amd.com/products/cpg/k623d/inside3d.html.

    2. D. Aliaga,J. Cohen,A. Wilson, E. Baker, H. Zhang,C. Erik-son,K. Hoff, T. Hudson,W. Stürzlinger, R. Bastos,M. Whit-ton,F. Brooks,andD. Manocha.MMR: An Interactive Mas-sive Model RenderingSystemUsing Geometricand Image-BasedAcceleration.In 1999ACM Symposiumon Interactive3D Graphics, pages199–206,Atlanta,USA, April 1999.

    3. Anthony Apodakaand Larry Gritz. AdvancedRenderMan.MorganKaufmann,2000.

    4. A. Appel. Sometechniquesfor shadingmachinerenderingsofsolids.SJCC, pages27–45,1968.

    5. Didier Badouel.An efficient ray polygonintersection.In Da-vid Kirk, editor, GraphicsGemsI. AcademicPress,1992.

    6. Kavita Bala, Julie Dorsey, and SethTeller. Radianceinter-polantsfor acceleratedbounded-errorray tracing.ACM Tran-sactionsonGraphics, 18(3),August1999.

    7. Philippe Bekaertand Frank Suykens. RenderPark – a pho-torealistic rendering tool. http://www.cs.kuleuven.ac.be/-cwis/research/graphics/RENDERPARK.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.

  • Wald, Slusallek,Benthin,andWagner/ InteractiveRenderingwith CoherentRayTracing

    8. Alan Chalmersand Erik Reinhard. Parallel and distributedphoto-realisticrendering.In Coursenotesfor SIGGRAPH98,pages425–432.ACM SIGGRAPH,Orlando,USA,July1998.

    9. RobertL. Cook,ThomasPorter, andLorenCarpenter. Distri-butedray tracing.ComputerGraphics, 18(3):137–145,1984.

    10. DARPA. Data intensive systems. http://www.darpa.mil/-ito/research/dis/index.html, 1997.

    11. G.D.Durgin,N. Patwari, andT.S.Rappaport.An advanced3Dray launchingmethodfor wirelesspropagationprediction. InIEEE 47thVehicularTechnology Conference, volume2, May1997.

    12. Matthew Eldridge,HomanIgehy, andPat Hanrahan.Pome-granate:A fully scalablegraphicsarchitecture.ComputerGra-phics, pages443–454,July 2000.

    13. Jeff Erickson. Pluecker coordinates. Ray TracingNews, 1997. http://www.acm.org/tog/resources/RTNews/-html/rtnv10n3.html#art11.

    14. Foley, van Dam, Feiner, andHughes. ComputerGraphics–Principlesand Practice, 2nd edition in C. AddisonWesley,1997.

    15. Andrew Glassner. An Introductionto Raytracing. AcademicPress,1989.

    16. Larry Gritz andJamesK. Hahn. BMRT: A global illumina-tion implementationof the rendermanstandard. Journal ofGraphicsTools, 1(3):29–47,1996.

    17. JohnL HennessyandDavid A Patterson.ComputerArchitec-ture – A QuantitativeApproach, 2nd edition. MorganKauf-mann,1996.

    18. Intel Corp. Intel ComputerBasedTutorial. http://developer.-intel.com/vtune/cbts/cbts.htm.

    19. Intel Corp. Intel Pentium III StreamingSIMD Extensions.http://developer.intel.com/vtune/cbts/simd.htm.

    20. Timothy L. Kay andJamesT. Kajiya. Ray tracingcomplexscenes.ComputerGraphics, 20(4):269–278,August1986.

    21. A. Keller. Quasi-MonteCarlo Methodsfor RealisticImageSynthesis. PhDthesis,Universityof Kaiserslautern,1998.

    22. Craig Kolb. Rayshadehome-page.http://graphics.stanford.-edu/˜cek/rayshade/rayshade.html.

    23. K.SungandP.Shirley. Raytracingwith theBSPtree.GraphicsGemsIII , pages271—274,1992.

    24. TomasMoeller. Practicalanalysisof optimizedray-triangleintersection.http://www.ce.chalmers.se/staff/tomasm/raytri/.

    25. TomasMoellerandBenTrumbore.Fast,minimumstorageraytriangleintersection.Journal of GraphicsTools, 2(1):21–28,1997.

    26. Motorola Inc. AltiVec Technology Facts. Available athttp://www.motorola.com/AltiVec/facts.html.

    27. Persistenceof Vision DevelopmentTeam. Pov-ray home-page.http://www.povray.org/.

    28. Steven Parker, William Martin, PeterPike Sloan,PeterShir-ley, Brian Smits,andChuckHansen.Interactive ray tracing.

    In 1999ACM Symposiumon Interactive3D Graphics, pages119–126,April 1999.

    29. StevenParker, MichaelParker, YarenLivnat,PeterPikeSloan,ChuckHansen,andPeterShirley. Interactive ray tracingforvolumevisualization. IEEE Transactionson ComputerGra-phicsandVisualization, 5(3):238–250,July-September1999.

    30. Steven Parker, PeterShirley, YardenLivnat,CharlesHansen,and PeterPike Sloan. Interactive ray tracing for isosurfacerendering.In IEEEVisualization’98, pages233–238,October1998.

    31. Mark S. Peercy, Marc Olano,JohnAirey, andP. Jeffrey Un-gar. Interactive Multi-Pass ProgrammableShading. ACMSiggraph,New Orleans,USA, July2000.

    32. Matt Pharr, Craig Kolb, Reid Gershbein,and Pat Hanrahan.Renderingcomplex sceneswith memory-coherentray tracing.ComputerGraphics, 31:101–108,August1997.

    33. PhilippePlasi,BetrantLe Saëc,andGèradVignoles. Appli-cationof renderingtechniquesto monte-carlophysicalsimu-lation of gasdiffusion. In JulieDorsey andPhilipp Slusallek,editors,RenderingTechniques’97, pages297–308.Springer,1997.

    34. David J.Plunkett andMichaelJ.Bailey. TheVectorizationofa RayTracingAlgorithm for ImprovedExecutionSpeed.IE-EEComputerGraphicsandApplications, 6(8):52–60,August1985.

    35. JohnRohlf andJamesHelman. IRIS Performer:A high per-formancemultiprocessingtoolkit for real-time3D graphics.ComputerGraphics, 28(AnnualConferenceSeries):381–394,July1994.

    36. Ken Shoemake. Pluecker coordinatetutorial. Ray TracingNews, 1998. http://www.acm.org/tog/resources/RTNews/-html/rtnv11n1.html#art3.

    37. GeorgeSimiakakis. Accelerating RayTracingwith Directio-nal SubdivisionandParallel Processing. PhDthesis,Univer-sity of EastAnglia, 1995.

    38. Philipp Slusallek,ThomasPflaum, and Hans-PeterSeidel.UsingproceduralRenderManshadersfor globalillumination.In ComputerGraphicsForum(Proc.ofEUROGRAPHICS’95,pages311–324,1995.

    39. BrianSmits.Efficiency issuesfor ray tracing.Journal of Gra-phicsTools, 3(2):1–14,1998.

    40. AdvancedRenderingTechnologies.TheAR250- anew archi-tecturefor ray tracedrendering. In Proceedingsof the Euro-graphics/SIGGRAPHworkshopon Graphicshardware - HotTopicsSession, pages39–42,1999.

    41. IngoWald,CarstenBenthin,MarkusWagner, andPhilippSlu-sallek.Interactive RaytracingonNotebooks.Technicalreport,ComputerGraphicsGroup,SaarlandUniversity, 2000.

    42. Greg Ward.Adaptiveshadow testingfor raytracing.In Photo-realisticRenderingin ComputerGraphics(Proceedingsof theSecondEurographicsWorkshopon Rendering, pages11–20.SpringerVerlag,New York, 1994.

    43. T. Whitted. An improved illumination modelfor shadeddis-play. CACM, 23(6):343–349,June1980.

    c�

    TheEurographicsAssociationandBlackwell Publishers2001.