Indexing Stream Register Files

Nuwan Jayasena, 10/8/2002

04/19/23 NSJ 2

Indexing Stream Register Files

• Motivation

• Architecture overview

• Usage examples

• Language and compiler issues

• Implementation issues

Stream Memory Hierarchy

• Roughly an order-of-magnitude increase in BW at each level

• Maximize data reuse at each level

• Focus on Stream Register File (SRF) for this talk

[Figure: memory hierarchy levels (Memory Sys, Stream RF, Local Registers, Compute Units) and the data held at each level (streams, records, local variables)]

SRF Data Reuse

• Current SRF only supports in-order reuse

• Indexed access to SRF allows reordered reuse

[Figure: in-order reuse (temporal or producer-consumer locality) vs. reordered reuse (app-dependent reordering, data-dependent reordering)]

SRF-Memory Stream Transfers

• Types of stream transfers

– Compulsory: application I/O

– Capacity: due to SRF capacity pressure

– Reordering: reordering of data already in SRF

• SRF indexing…

– Eliminates most reordering transfers

– Reduces data replication in SRF

  • Eliminates some capacity transfers

Architecture Overview

• High-level view of SRF indexing implementation

• Mostly to highlight capabilities and limitations of SRF indexing

• More detailed view of hardware and mechanisms later

Current Stream Processor Arch.

• N “lanes”, each with an SRF bank and a compute cluster

• Cross-lane communication via inter-cluster switch

[Figure: SRF Bank 0, SRF Bank 1, ... SRF Bank N; Cluster 0, Cluster 1, ... Cluster N; stream buffers; inter-cluster switch]

In-lane SRF Indexing

• Each cluster can index into its own bank of the SRF

• Address queue between cluster and SRF bank

• Sequence of steps for indexed read:

– Cluster places index in address queue

– Bank read using index

– Result placed in stream buffer

– Cluster reads data from stream buffer

(+) High bandwidth indexed accesses

(+) Few changes to existing architecture

(–) Only 1/N of data structure visible within each cluster

[Figure: SRF Bank X and Cluster X]
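The four-step indexed read above can be sketched as a small software model. This is only an illustration: the class and method names are hypothetical, and queue depth, latency, and arbitration are not modeled.

```python
from collections import deque


class Lane:
    """Toy model of one lane: an SRF bank, an address queue, and a stream buffer.

    All names are hypothetical; latency, queue depth, and arbitration are not modeled.
    """

    def __init__(self, bank_words):
        self.bank = list(bank_words)   # this lane's slice of the SRF
        self.address_queue = deque()   # cluster -> SRF bank
        self.stream_buffer = deque()   # SRF bank -> cluster

    def issue_index(self, index):
        # Step 1: the cluster places an index in the address queue.
        self.address_queue.append(index)

    def bank_cycle(self):
        # Steps 2-3: the bank reads one queued index and places the result
        # in the stream buffer.
        if self.address_queue:
            index = self.address_queue.popleft()
            self.stream_buffer.append(self.bank[index])

    def read_data(self):
        # Step 4: the cluster reads data from the stream buffer, in issue order.
        return self.stream_buffer.popleft()


lane = Lane(bank_words=[10, 11, 12, 13])
for idx in (3, 0, 2):
    lane.issue_index(idx)
for _ in range(3):
    lane.bank_cycle()
results = [lane.read_data() for _ in range(3)]
print(results)  # [13, 10, 12]
```

Data comes back in the order the indices were issued, which is what lets the cluster treat the result like an ordinary input stream.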

Cross-lane SRF Indexing

• Any cluster can access any SRF location

• Adds interconnect between clusters for address communication

• Data return takes place over existing inter-cluster network

[Figure: SRF address switch; SRF Bank 0, SRF Bank 1, ... SRF Bank 7; Cluster 0, Cluster 1, ... Cluster 7; inter-cluster switch]

Cross-lane SRF Indexing (Contd.)

• Sequence of steps

– Clusters place indices in their own index queues

– Indices broadcast on address switch

– Arbitrate to resolve bank conflicts

– Access SRF banks and return data via inter-cluster network

– Write data into requesting clusters’ stream buffers

– Clusters read data from stream buffers

(+) Entire data structure visible to all clusters

(–) Low bandwidth (1 word/cycle/cluster peak)

(–) Extra hardware for cross-lane index issue
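A rough simulation of the cross-lane sequence above shows how bank conflicts cost cycles. Two details are assumptions made for the sketch, not taken from the talk: global address `a` is placed in bank `a % N` at row `a // N`, and the lowest-numbered cluster wins a conflict. The function name is hypothetical.

```python
from collections import deque


def cross_lane_read(srf_banks, requests):
    """Simulate cross-lane indexed reads; returns (per-cluster data, cycles used).

    Assumptions (not from the source): address a lives in bank a % N at row
    a // N, and the lowest-numbered cluster wins a bank conflict.
    """
    n = len(srf_banks)
    index_queues = [deque(r) for r in requests]  # one index queue per cluster
    stream_buffers = [[] for _ in range(n)]      # one stream buffer per cluster
    cycles = 0
    while any(index_queues):
        cycles += 1
        granted_banks = set()
        for cluster, queue in enumerate(index_queues):
            if not queue:
                continue
            addr = queue[0]                      # head index is broadcast
            bank = addr % n
            if bank in granted_banks:
                continue                         # lost arbitration; retry next cycle
            granted_banks.add(bank)
            queue.popleft()
            stream_buffers[cluster].append(srf_banks[bank][addr // n])
    return stream_buffers, cycles


# Four banks of two words each; the word at global address a holds 100 + a.
banks = [[100 + (row * 4 + b) for row in range(2)] for b in range(4)]
data, cycles = cross_lane_read(banks, [[0, 5], [4, 1], [2], [7]])
print(data, cycles)  # [[100, 105], [104, 101], [102], [107]] 3
```

Clusters 0 and 1 both target bank 0 in the first cycle, so cluster 1 waits and the two-deep request queues take three cycles instead of two. Each cluster still receives at most one word per cycle, matching the peak quoted above.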

Usage Examples

• Application-specific uses

– Efficient access to application data structures

• System-level uses

– Hide hardware limitations

Multidimensional Data w/o SRF Indexing

• 90° rotation (“corner-turn”) between accesses along different dimensions

[Figure: Memory, SRF, and Clusters over time, showing Compute, Rotate, Compute]

Multidimensional Data w/ SRF Indexing

• Accesses along 2nd dimension can typically use in-lane indexing

• Eliminates data reordering through memory, reducing reordering stream transfers to/from the memory system

[Figure: Memory, SRF, and Clusters, showing Compute, Compute]
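One way to see why second-dimension accesses can stay in-lane is to work out where each element lands. The sketch below assumes a row-major array striped word by word across lanes (address `a` goes to lane `a % N`) and a row width that is a multiple of the lane count. That layout and the helper name are illustrative assumptions, not something the talk specifies.

```python
def lane_and_offset(row, col, width, num_lanes):
    """Map element (row, col) of a row-major array to (lane, offset in that lane's bank).

    Assumes word-interleaved striping across lanes (address a -> lane a % N),
    an illustrative layout rather than one specified in the talk.
    """
    addr = row * width + col
    return addr % num_lanes, addr // num_lanes


WIDTH, HEIGHT, LANES = 8, 4, 4  # width is a multiple of the lane count

# Walk down column 5: every element falls in the same lane, so the access
# along the second dimension needs only in-lane indexing.
column = [lane_and_offset(r, 5, WIDTH, LANES) for r in range(HEIGHT)]
print(column)  # [(1, 1), (1, 3), (1, 5), (1, 7)]
```

Under this layout the column walk is a strided sequence of indices into one bank, so no corner-turn through memory is needed.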

Regular Grid Stencils w/o SRF Indexing

• Each row is a different stream, all streams consumed at same rate

• Values from adjacent columns communicated among neighbor lanes

• 3 streams for 2D grid with 1-wide stencil

• Many streams for higher dimension grids and/or wider stencils

– Number of streams currently limited by hardware resources

Regular Grid Stencils w/ SRF Indexing

• Primary stream consumed sequentially

• Accesses within vertical planes use in-lane indexing

• Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)

• Reduces number of streams needed

– May reduce reordering and/or redundant transfers

Arbitrary Stencils w/o SRF Indexing

• Repeated accesses to the same node lead to data replication in SRF

[Figure: Memory, SRF, and Clusters, showing Index Gen, Lookup, Compute]

Arbitrary Stencils w/ SRF Indexing

• Cross-lane indexing supports arbitrary access pattern

• Eliminates data replication in SRF

– May reduce capacity stream transfers

– Increases strip size

• Reduces redundant transfers from memory system

[Figure: Memory, SRF, and Clusters, showing Compute]
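The replication argument can be made concrete with a toy count of SRF words. The neighbor lists and values below are made-up data, and the sketch ignores any SRF space needed for the indices themselves.

```python
# Each entry lists the neighbor node IDs one stencil point reads (made-up data).
neighbor_lists = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
node_values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

# Without SRF indexing: a gather through memory builds a stream holding one
# copy of a node value per reference, so repeated references are replicated.
gathered_stream = [node_values[n] for nbrs in neighbor_lists for n in nbrs]
words_without_indexing = len(gathered_stream)

# With SRF indexing: each referenced node is stored once and looked up by index.
words_with_indexing = len({n for nbrs in neighbor_lists for n in nbrs})

# Both approaches feed the kernel the same operands.
sums_gathered = [sum(gathered_stream[i:i + 3]) for i in range(0, len(gathered_stream), 3)]
sums_indexed = [sum(node_values[n] for n in nbrs) for nbrs in neighbor_lists]
print(words_without_indexing, words_with_indexing)  # 12 4
```

With each node referenced three times, the gathered stream is three times larger than the indexed data set, which is the space that could otherwise go toward a longer strip.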

Sub-stream Extraction w/o SRF Indexing

• Splitting records requires a pass through memory or passing useless data through clusters

• Same for selecting subset of records

[Figure: Memory, SRF, and Clusters, showing Compute, Compute, and Extract]

Sub-stream Extraction w/ SRF Indexing

• In-lane indexing to select words from records

• Selecting subset of records may require cross-lane indexing to preserve ordering

[Figure: Memory, SRF, and Clusters, showing Compute, Compute]
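Selecting one word from each record with in-lane indexing amounts to generating a strided index sequence. The record layout, field position, and helper name below are hypothetical.

```python
RECORD_WORDS = 4  # hypothetical record layout: [x, y, z, w]
FIELD_Y = 1

# One lane's SRF bank holding three 4-word records back to back.
bank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


def field_indices(num_records, field, record_words=RECORD_WORDS):
    """Indices of one field in consecutive records within a lane's bank."""
    return [rec * record_words + field for rec in range(num_records)]


indices = field_indices(len(bank) // RECORD_WORDS, FIELD_Y)
y_substream = [bank[i] for i in indices]
print(indices, y_substream)  # [1, 5, 9] [2, 6, 10]
```

Only the selected words reach the cluster, so the unused fields are neither copied through memory nor passed through the kernel.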

Virtual Streams

• Current SRF has hard limit on number of streams used by a kernel

– Imposed by hardware constraints

– Exceeding limit requires merging streams, splitting kernels or other workarounds

• Indexing into SRF provides a mechanism to access any number of sequences

– Essentially multiplex multiple logical streams onto one hardware stream
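A minimal sketch of the multiplexing idea, assuming each logical stream is a (base, length) window in one indexed SRF region with its own read cursor. The class and its interface are hypothetical.

```python
class VirtualStreams:
    """Several logical sequential streams multiplexed onto one indexed SRF region.

    Hypothetical sketch: each logical stream is a (base, length) window, and a
    per-stream cursor turns "read next" into an index into the shared region.
    """

    def __init__(self, srf_region, windows):
        self.srf = srf_region
        self.windows = windows                     # name -> (base, length)
        self.cursor = {name: 0 for name in windows}

    def read_next(self, name):
        base, length = self.windows[name]
        pos = self.cursor[name]
        if pos >= length:
            raise IndexError(f"logical stream {name!r} is exhausted")
        self.cursor[name] = pos + 1
        return self.srf[base + pos]                # one indexed access


srf_region = [1, 2, 3, 10, 20, 30, 100, 200, 300]
vs = VirtualStreams(srf_region, {"a": (0, 3), "b": (3, 3), "c": (6, 3)})
interleaved = [vs.read_next(n) for n in ("a", "b", "c", "a", "b", "c")]
print(interleaved)  # [1, 10, 100, 2, 20, 200]
```

The number of logical streams is bounded only by the bookkeeping kept for the cursors, not by the number of hardware stream buffers.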

Other Uses

• Space allocation for variable length streams

– Current SRF requires space allocation for worst case stream size for variable length streams

– Indexing can be used to allocate for common case and degrade gracefully on overflow

• Spill local variables from kernels

– Reduce register pressure for large kernels

• Etc.

Summary of Benefits

• Reduce memory system bandwidth demands

– Eliminates most reordering transfers and some capacity transfers

• Reduce SRF capacity pressure by eliminating replication

– Increases strip sizes

• Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length

– Increases strip size

• Flexible stream control

– More streams per kernel than hardware supports

– Efficient SRF allocation for variable length streams

Language & Compiler Issues

• System-level issues should clearly be handled by compiler/scheduler

– Virtual streams

– SRF allocation for variable length streams

– Register spilling etc.

• How much of the application-level uses can be inferred by compiler?

– Substream extraction, regular stencils etc. can be inferred w/o programmer help?

– Multi-dimensional data structures, irregular stencils etc. need programmer help?

• If so, what should the API be?

Implementation Issues

• Hiding indexed SRF access latency

• Merging scratchpad and SRF

• SRF access arbitration

• Memory array implementation

Hiding SRF Access Delay

• Kernels are statically scheduled

• SRF access by streams is dynamically arbitrated

– Allows optimal run-time allocation of SRF BW to cluster and memory streams

– Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay

• Indexed accesses are treated much like another stream for arbitration purposes

– In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later

– Breaks indexed accesses into two distinct ops at machine level

Hiding SRF Access Delay (Contd.)

• Split read operation example:

• Address/data separation is not critical for writes

User pseudocode:

Kernel XYZ(…, idx_istream<int> S1, …) {
    int a, b, R, S;
    loop(…) {
        Independent_ops;
        a = addr_compute1();
        S1[a] >> R;
        b = addr_compute2();
        S1[b] >> S;
        Use(R, S);
    }
}

Post-compile pseudocode:

loop(…) {
    a = addr_compute1();
    b = addr_compute2();
    S1.index(a);
    S1.index(b);
    Independent_ops;
    S1 >> R;
    S1 >> S;
    Use(R, S);
}
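The benefit of splitting a read into an index issue and a later data read can be illustrated with a small cycle-counting model. Everything here is a hypothetical sketch: `ACCESS_LATENCY` stands in for arbitration plus SRF access delay, every operation is charged one cycle, and "stall" means idle cycles that the static schedule would otherwise have to pad.

```python
from collections import deque


class IndexedInputStream:
    """Split-phase indexed read: index() issues an address, read() returns data later.

    Hypothetical model; ACCESS_LATENCY stands in for arbitration plus SRF access delay.
    """

    ACCESS_LATENCY = 3

    def __init__(self, data):
        self.data = data
        self.pending = deque()  # (ready_cycle, value), in issue order
        self.cycle = 0
        self.stall_cycles = 0

    def tick(self, n=1):
        self.cycle += n         # n cycles of other work

    def index(self, addr):
        self.pending.append((self.cycle + self.ACCESS_LATENCY, self.data[addr]))
        self.tick()

    def read(self):
        ready_cycle, value = self.pending.popleft()
        if ready_cycle > self.cycle:  # data not back yet: idle until it is
            self.stall_cycles += ready_cycle - self.cycle
            self.cycle = ready_cycle
        self.tick()
        return value


def run(split):
    s1 = IndexedInputStream(data=[5, 6, 7, 8])
    if split:
        # Post-compile order: both indices first, then two cycles of independent ops.
        s1.index(1)
        s1.index(3)
        s1.tick(2)
        r = s1.read()
        s = s1.read()
    else:
        # Naive order: each data read immediately follows its index.
        s1.tick(2)
        s1.index(1)
        r = s1.read()
        s1.index(3)
        s = s1.read()
    return r + s, s1.stall_cycles


print(run(split=True), run(split=False))  # (14, 0) (14, 4)
```

Both orderings compute the same result, but only the split ordering overlaps the access delay with the independent operations.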

Merging Scratchpad w/ Indexable SRF

• Data structures in SRF are typically read-only or write-only

• Scratchpad needs to support read/write data

– Pending writes are matched against new reads and multiple writes to same location are collapsed

• Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency

– Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems

• Must sustain at least the current scratchpad bandwidth – one read and one write every cycle
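The write matching and collapsing described above resembles a small write buffer in front of the memory array. The sketch below models only that behaviour; the class is hypothetical, and the preemption and fixed-latency aspects of scratchpad reads are left out.

```python
class WriteBufferedBank:
    """Pending writes are held in a small buffer; reads check it before the array.

    Hypothetical sketch of the matching and collapsing behaviour only; the
    priority and fixed-latency aspects of scratchpad reads are not modeled.
    """

    def __init__(self, size):
        self.array = [0] * size
        self.pending = {}  # addr -> newest pending value

    def write(self, addr, value):
        # Multiple writes to the same location collapse into one pending entry.
        self.pending[addr] = value

    def read(self, addr):
        # A read that matches a pending write gets the pending value;
        # otherwise it goes to the array immediately.
        if addr in self.pending:
            return self.pending[addr]
        return self.array[addr]

    def drain(self):
        # Retire pending writes to the array; returns the number of array writes.
        count = len(self.pending)
        for addr, value in self.pending.items():
            self.array[addr] = value
        self.pending.clear()
        return count


bank = WriteBufferedBank(size=8)
bank.write(2, 7)
bank.write(2, 9)  # collapses with the earlier write to address 2
bank.write(5, 1)
seen = (bank.read(2), bank.read(3))
array_writes = bank.drain()
print(seen, array_writes, bank.array[2])  # (9, 0) 2 9
```

Three writes retire as two array writes, and the read of address 2 sees the newest value before it reaches the array, so read-after-write ordering is preserved.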

SRF Memory Array Implementation

• SSS SRF:

– 64K words total, 4K words per cluster

• Non-indexable bank can be implemented as a single 512x512 bit macro

• Indexing requires some form of banking to sustain a few words/cycle bandwidth for scratchpad + SRF accesses

SRF Memory Array Implementation (Contd.)

• Non-indexed SRF bank

• 512x512 macro

• 4x4 array of blocks assuming 128x128 blocks

• 2:1 column decode to sustain 4 words/cycle peak BW

[Figure: SRAM array with row decoder (Row Dec.), column decoder (Col. Dec.), and Rd/Wr circuits]

• Key is to support word granularity indexed access w/o losing implementation and power efficiency for wide sequential reads
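The numbers quoted for the non-indexed bank can be cross-checked with a few lines of arithmetic. The cluster count and the word width are not stated on the slides; they are derived here from the quoted capacities and should be read as implied values.

```python
TOTAL_WORDS = 64 * 1024        # SSS SRF capacity in words
WORDS_PER_CLUSTER = 4 * 1024   # per-cluster bank capacity in words
MACRO_ROWS = MACRO_COLS = 512  # non-indexed bank as one 512x512-bit macro
BLOCK = 128                    # 128x128-bit blocks
COLUMN_DECODE = 2              # 2:1 column decode

clusters = TOTAL_WORDS // WORDS_PER_CLUSTER                     # implied, not stated
bits_per_word = (MACRO_ROWS * MACRO_COLS) // WORDS_PER_CLUSTER  # implied, not stated
blocks = (MACRO_ROWS // BLOCK, MACRO_COLS // BLOCK)
words_per_access = (MACRO_COLS // COLUMN_DECODE) // bits_per_word
print(clusters, bits_per_word, blocks, words_per_access)  # 16 64 (4, 4) 4
```

The figures are mutually consistent: a 512x512-bit macro holds 4K words of 64 bits, tiles into a 4x4 array of 128x128 blocks, and a 2:1 column decode leaves 256 bits, or 4 words, per row access.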


• Option 1: Multiple narrow columns

• One word per cycle per bank

• All accesses are one word wide

– Best BW utilization for mixed indexed and stream accesses

• High area overhead due to replicated row decoders

• No replication of column decoders and rd/wr circuits

• Power in SRAM array(s) comparable to non-banked memory

[Figure: Option 1 layout; labels show repeated Row Dec. and Col. blocks and one Col./Rd/Wr Circuits block]


• Option 2: Multiple banks along rows of blocks

• Leverage hierarchical bitlines with additional muxing

• With appropriate data interleaving, mux area fairly small

• Low area overhead

• Low power for wide accesses only

• BW utilization may be suboptimal for mixed stream and indexed accesses

[Figure: Option 2 layout; labels show repeated Mux and Row blocks and one Rd/Wr Circuits block]
