Load Value Approximation: Approaching the Ideal Memory Access Latency
Joshua San Miguel, Natalie Enright Jerger

Transcript:

  • Load Value Approximation: Approaching the Ideal Memory Access Latency

    Joshua San Miguel, Natalie Enright Jerger

  • Chip Multiprocessor

    [Figure: chip multiprocessor - each core has a private cache, backed by shared caches and a network-on-chip, then main memory; a load that misses in the private cache must traverse this whole hierarchy.]

  • Approximate Data

    Many applications can tolerate inexact data values. In approximate computing applications, 40% to nearly 100% of the memory data footprint can be approximated [Sampson, MICRO 2013].

    Approximate data storage:
    Reducing SRAM power by lowering the supply voltage [Flautner, ISCA 2002].
    Reducing DRAM power by lowering the refresh rate [Liu, ASPLOS 2011].
    Improving PCM performance and lifetime by lowering write precision and reusing failed cells [Sampson, MICRO 2013].

  • Outline

    • Load Value Approximation

    • Approximator Design

    • Evaluation

    • Conclusion


  • Load Value Approximation

    [Figure: the chip multiprocessor from before, now with an approximator alongside each core's private cache.]

    On a load to A that misses in the private cache:
    The approximator generates A_approx and returns it to the core, so execution continues immediately.
    A_actual is requested from the memory hierarchy in the background.
    When A_actual returns, it is used to train the approximator.

    Takes memory access off the critical path.
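
    A minimal sketch of this flow, assuming a simple simulator-style model in Python; the PrivateCache and LastValueApproximator classes below are illustrative placeholders (not the talk's hardware), and a closer model of the approximator follows under Approximator Design.

    # Sketch of the load path with load value approximation. The tiny cache
    # and last-value "approximator" are placeholders; the point is the control
    # flow: approximate on a miss, fetch and train off the critical path.

    class PrivateCache:
        def __init__(self):
            self.lines = {}

        def lookup(self, addr):
            return self.lines.get(addr)          # None on a miss

        def fill(self, addr, value):
            self.lines[addr] = value


    class LastValueApproximator:
        def __init__(self):
            self.last = {}

        def approximate(self, pc):
            return self.last.get(pc, 0.0)        # A_approx

        def train(self, pc, actual):
            self.last[pc] = actual               # learn from A_actual


    def load(pc, addr, cache, approximator, pending):
        value = cache.lookup(addr)
        if value is not None:
            return value                         # private-cache hit: unchanged

        a_approx = approximator.approximate(pc)  # miss: generate A_approx now
        pending.append((pc, addr))               # request A_actual in the background
        return a_approx                          # the core continues immediately


    def drain_responses(memory, cache, approximator, pending):
        # When A_actual arrives, train the approximator and fill the cache;
        # there is no rollback, since the core already consumed A_approx.
        for pc, addr in pending:
            a_actual = memory[addr]
            approximator.train(pc, a_actual)
            cache.fill(addr, a_actual)
        pending.clear()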

  • Approximator Design

    [Figure: approximator structure.] On a load to A, a hash h of the load's instruction address (PC) and the values in the global history buffer (e.g. PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1) selects a tagged entry in the approximator table. Each entry holds a local history buffer of recently loaded values (e.g. 4.1, 3.9, 4.0), and an approximation function f over that history (here, the average) produces A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0.
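
    A rough software model of this structure, as a hedged sketch: only the global history buffer (GHB), the tagged table of local history buffers (LHBs), and the averaging come from the slide; the table size, the float-to-bits hashing, and the cold-miss fallback value are assumptions made for illustration.

    # Illustrative model of the approximator sketched above (table size, hash
    # details, and fallback value are assumptions, not details from the talk).
    import struct
    from collections import deque


    class LoadValueApproximator:
        def __init__(self, ghb_size, lhb_size, table_entries=256):
            self.ghb = deque(maxlen=ghb_size)    # global history buffer
            self.lhb_size = lhb_size
            self.table = [None] * table_entries  # entries: (tag, local history)

        def _index(self, pc):
            # Hash h: the load PC XORed with the bit patterns of recent global
            # values, e.g. PC ^ bits(1.0) ^ bits(2.2) ^ bits(3.1) on the slide.
            h = pc
            for v in self.ghb:
                h ^= struct.unpack('<I', struct.pack('<f', v))[0]
            return h % len(self.table), h

        def approximate(self, pc):
            idx, tag = self._index(pc)
            entry = self.table[idx]
            if entry is None or entry[0] != tag or not entry[1]:
                return 0.0                       # no history yet (assumed fallback)
            lhb = entry[1]
            return sum(lhb) / len(lhb)           # f: average, e.g. (4.1 + 3.9 + 4.0) / 3 = 4.0

        def train(self, pc, actual):
            # Called when A_actual returns from the memory hierarchy.
            idx, tag = self._index(pc)
            entry = self.table[idx]
            if entry is None or entry[0] != tag:
                entry = (tag, deque(maxlen=self.lhb_size))
                self.table[idx] = entry
            entry[1].append(actual)              # update this entry's LHB
            self.ghb.append(actual)              # update the shared GHB


    # Example configuration taken from the evaluation table: raytracer uses a
    # GHB size of 1 and an LHB size of 1.
    approx = LoadValueApproximator(ghb_size=1, lhb_size=1)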

  • Approximator Design

    Load value approximators overcome the challenges of traditional value predictors:
    No complexity of tracking speculative values.
    No rollbacks.
    High accuracy/coverage with floating-point values.
    More tolerant of value delay.

  • Evaluation

    EnerJ framework [Sampson, PLDI 2011]: program annotations distinguish approximate data from precise data.

    Evaluate final output error and approximator coverage.

    benchmark    GHB size    LHB size    approximator size
    fft          0           2           49 kB
    lu           3           1           32 kB
    raytracer    1           1           32 kB
    smm          5           1           32 kB
    sor          0           2           49 kB

  • Evaluation

    [Chart: output error and approximator coverage (0%-100%) for fft, lu, raytracer, smm, and sor.]

  • Conclusion

    Low-error, high-coverage approximators allow us to approach the ideal memory access latency.

    Future work:
    Further explore the approximator design space (dynamic/hybrid schemes, machine learning).
    Measure the speedup of load value approximation using full-system simulations.
    Measure power savings (low-power caches/NoCs/memory for approximate data).

  • Thank you

    [Images: raytracer output - baseline (precise) vs. load value approximation.]