08 Performance Overview

download 08 Performance Overview

of 37

Transcript of 08 Performance Overview

  • 7/30/2019 08 Performance Overview

    1/37

    Understanding GPUs ThroughUnderstanding GPUs Through

    BenchmarkingBenchmarking

    Mike HoustonMike HoustonStanford UniversityStanford University

  • 7/30/2019 08 Performance Overview

    2/37

    2

    IntroductionIntroduction

    Key areas for GPGPU

    Memory latency behavior

    Memory bandwidths Upload/Download

    Instruction rates

    Branching performance

    Chips analyzed

    ATI X1900XTX (R580)

    NVIDIA 7900GTX (G71)

    NVIDIA 8800GTX (G80)

    AMD HD2900 (R600)

  • 7/30/2019 08 Performance Overview

    3/37

    3

    GPUBenchGPUBench

    An open-source suite of micro-benchmarks

    GL

    DX9 (alpha version)

    Developed at Stanford to aid our

    understanding of GPUs

    Vendors wouldnt directly tell us arch details

    Behavior under GPGPU apps different than games

    and other benchmarks

    Library of results

    http://graphics.stanford.edu/projects/gpubench/

  • 7/30/2019 08 Performance Overview

    4/37

    4

    Memory latencyMemory latency

    Questions

    Can latency be hidden?

    Does access pattern affect latency?

  • 7/30/2019 08 Performance Overview

    5/37

    5

    MethodologyMethodology

    Try different numbers of texture fetches

    Different access patterns:

    Cache hit every fetch to the same texel

    Sequential every fetch increments address by 1

    Random dependent lookup with random texture

    Increase the ALU ops of the shader

    ALU ops must be dependent to avoid

    optimization

    GPUBench test: fetchcost

  • 7/30/2019 08 Performance Overview

    6/37

    6

    Fetch costFetch cost ATIATI cache hitcache hit

    ATI X1800XT ATI X1900XTX

    4 ALU ops 12 ALU ops

    Cost = max(ALU, TEX)

    X1900XTX has 3X the ALUs per pipe

  • 7/30/2019 08 Performance Overview

    7/37

    7

    Fetch costFetch cost ATIATI cache hitcache hit

    ATI X1900XTX

    5 ALU ops

    Cost = max(ALU, TEX)

    AMD HD2900XT

    12 ALU ops

  • 7/30/2019 08 Performance Overview

    8/37

    8

    Fetch costFetch cost ATIATI sequentialsequential

    ATI X1800XT ATI X1900XTX

    8 ALU ops 24 ALU ops

    Cost = max(ALU, TEX)

    X1900XTX has 3X the ALUs per pipe

  • 7/30/2019 08 Performance Overview

    9/37

    9

    Fetch costFetch cost ATIATI sequentialsequential

    ATI X1900XTX

    24 ALU ops

    Cost = max(ALU, TEX)

    AMD HD2900XT

    20 ALU ops

  • 7/30/2019 08 Performance Overview

    10/37

    10

    Fetch costFetch cost NVIDIANVIDIA cache hitcache hit

    Cost = sum(ALU, TEX)

    4 ALU op penalty

    NVIDIA 7900 GTX

  • 7/30/2019 08 Performance Overview

    11/37

    11

    Fetch costFetch cost NVIDIANVIDIA sequentialsequential

    NVIDIA 7900 GTX

    8 ALU op issue penalty

    Cost = sum(ALU, TEX)

  • 7/30/2019 08 Performance Overview

    12/37

    12

    Fetch costFetch cost NVIDIA 8800 GTXNVIDIA 8800 GTX

    Cost = max(ALU, TEX)

    Cache sequential

    4 ALU ops

    8 ALU ops

    NVIDIA 8800 GTXNVIDIA 8800 GTX

  • 7/30/2019 08 Performance Overview

    13/37

    13

    Bandwidth to ALUsBandwidth to ALUs

    Questions

    Cache performance?

    Sequential performance? Random-read performance?

  • 7/30/2019 08 Performance Overview

    14/37

    14

    MethodologyMethodology

    Cache hit

    Use a constant as index to texture(s)

    Sequential Use fragment position to index texture(s)

    Random

    Index a seeded texture with fragment position to

    look up into input texture(s)

    GPUBench test: inputfloatbandwidth

  • 7/30/2019 08 Performance Overview

    15/37

    15

    ResultsResults

    ATI X1900XTX NVIDIA 7900GTX

    Better effective

    cache bandwidth

    Better random

    bandwidth

    Sequential bandwidth

    (SEQ) about the same

  • 7/30/2019 08 Performance Overview

    16/37

    16

    ResultsResults

    NVIDIA 8800GTX

    2X bandwidth of

    7900GTX

    NVIDIA 7900GTX

  • 7/30/2019 08 Performance Overview

    17/37

    17

    ResultsResults

    NVIDIA 8800GTX AMD HD2900XT

  • 7/30/2019 08 Performance Overview

    18/37

    18

    Off-board bandwidthOff-board bandwidth

    Questions

    How fast can we get data on the board (download)?

    How fast can we get data off the board (readback)?

    GPUBench tests:

    download

    readback

  • 7/30/2019 08 Performance Overview

    19/37

    19

    DownloadDownload

    ATI X1900XTX NVIDIA 7900GTX

    Host to GPU is slow

  • 7/30/2019 08 Performance Overview

    20/37

    20

    DownloadDownload

    NVIDIA 7900GTX NVIDIA 8800GTX

    Next generation not much better

  • 7/30/2019 08 Performance Overview

    21/37

    21

    ReadbackReadback

    ATI X1900XTX NVIDIA 7900GTX

    GPU to host is slow

    ATI GL Readback

    performance is

    abysmal

  • 7/30/2019 08 Performance Overview

    22/37

    22

    ReadbackReadback

    NVIDIA 7900GTX NVIDIA 8800GTX

    Next generation not much better

  • 7/30/2019 08 Performance Overview

    23/37

    23

    Instruction Issue RateInstruction Issue Rate

    Questions

    What is the raw performance achievable?

    Do different instructions have different costs? Vector vs. scalar issue differences?

  • 7/30/2019 08 Performance Overview

    24/37

    24

    MethodologyMethodology

    Write longshaders with dependent

    instructions

    >100 instructions All instructions dependent

    But try to structure to allow for multi-issue

    Test float1 vs. float4 performance

    GPUBench tests:

    instrissue

  • 7/30/2019 08 Performance Overview

    25/37

    25

    ResultsResults Vector issueVector issue

    ATI X1900XTX NVIDIA 7900GTX

    = More costly than others

  • 7/30/2019 08 Performance Overview

    26/37

    26

    ResultsResults Vector issueVector issue

    ATI X1900XTX NVIDIA 7900GTX

    Faster

    ADD/SUB

    Peak (single instruction)

    FLOPS with MAD

  • 7/30/2019 08 Performance Overview

    27/37

    27

    ResultsResults Vector issueVector issue

    NVIDIA 7900GTX NVIDIA 8800GTX

    8800GTX is 37%

    faster (peak)

  • 7/30/2019 08 Performance Overview

    28/37

    28

    ResultsResults Vector issueVector issue

    AMD HD2900XT NVIDIA 8800GTX

  • 7/30/2019 08 Performance Overview

    29/37

    29

    When benchmarks go wrongWhen benchmarks go wrong

    Smart compilers subverting testing and optimizing

    away shaders. Bug found in previous subtract test.No clever way to write RCP test found yetAlways sani t y check r esul t s agai nst t heoret i cal

    peak!! !

    NVIDIA 7800GTX

    GPUBench 1.2

  • 7/30/2019 08 Performance Overview

    30/37

    30

    ResultsResults Scalar issueScalar issue

    NVIDIA 7900GTX NVIDIA 8800GTX

    8800GTX is a scalar issue

    processor

  • 7/30/2019 08 Performance Overview

    31/37

    31

    Branching PerformanceBranching Performance

    Questions

    Is predication better than branching?

    Is using Early-Z culling a better option? What is the cost of branching?

    What branching granularity is required?

    How much can I really save branching around heavycomputation?

  • 7/30/2019 08 Performance Overview

    32/37

    32

    MethodologyMethodology

    Early-Z

    Set a Z-buffer and compare function to mask out compute

    Change coherence of blocks

    Change sizes of blocks

    Set differing amounts of pixels to be drawn

    Shader Branching

    If{ do a little }; else { LOTS of math}

    Change coherence of blocks

    Change sizes of blocks

    Have differing amounts of pixels execute heavy math branch

    GPUBench tests:

    branching

  • 7/30/2019 08 Performance Overview

    33/37

    33

    ResultsResults Early-Z - NVIDIAEarly-Z - NVIDIA

    NVIDIA 7900GTX

    4x4 coherence isalmost perfect!

    Random is bad!

  • 7/30/2019 08 Performance Overview

    34/37

    34

    ResultsResults Branching - NVIDIABranching - NVIDIA

    NVIDIA 7900GTX

    Fully coherent

    has good

    performance

    But overhead

  • 7/30/2019 08 Performance Overview

    35/37

    35

    ResultsResults Branching - NVIDIABranching - NVIDIA

    NVIDIA 7900GTX

    Performance

    increases with

    branch

    coherence

    Need > 32x32 branch coherence

  • 7/30/2019 08 Performance Overview

    36/37

    36

    ResultsResults Branching - NVIDIABranching - NVIDIA

    NVIDIA 8800GTX

    Performance

    increases with

    branch

    coherence

    Need > 16x16 branch coherence

    (Turns out 16x4 is as good as 16x16 )

  • 7/30/2019 08 Performance Overview

    37/37

    37

    SummarySummary

    Benchmarks can help discern app behavior

    and architecture characteristics

    We use these benchmarks as predictivemodels when designing algorithms

    Folding@Home

    ClawHMMer

    CFD

    Be wary of driver optimizations

    Driver revisions change behavior

    Raster order, scheduler, compiler