
CSE 661 PAPER PRESENTATION

ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR

By D. Wentzlaff et al.

Presented by SALAMI, Hamza Onoruoiza

g201002240

OUTLINE OF PRESENTATION

Introduction
Tile64 Architecture
Interconnect Hardware
Network Uses
Network to Tile Interface
Receive-side Hardware Demultiplexing
Protection
Shared Memory Communication and Ordering
Interconnect Software
Communication Interfaces
Applications
Conclusion

2

INTRODUCTION

The Tile Processor's five on-chip 2D mesh networks differ from:
◦ a traditional bus-based scheme, which requires a global broadcast and hence does not scale beyond 8 – 16 cores
◦ a 1D ring, which is not scalable because its bisection BW is constant (see the note below)

The mesh can support few or many processors with minimal changes to the network structure

3
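To make the scalability comparison on this slide concrete, here is the standard bisection-bandwidth argument (my addition; it follows from the slide's own claims, with b denoting the bandwidth of one link). Cut the machine of N cores into two halves: a ring is crossed by exactly 2 links no matter how large N grows, while a straight cut through a sqrt(N) x sqrt(N) mesh crosses sqrt(N) links:

    B_{\text{ring}} = 2b = O(1), \qquad B_{\text{mesh}} = \sqrt{N}\,b = O(\sqrt{N})

For TILE64's 8 x 8 mesh, each network contributes 8 links of bisection bandwidth, and this count grows with the mesh edge as more tiles are added.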

TILE64 ARCHITECTURE

2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh
1 GHz clock, 3-way VLIW: 192 billion 32-bit instructions/sec (64 tiles x 3 instructions/cycle x 1 GHz)
4.8 MB distributed cache; per-tile TLB
Supports DMA and virtual memory
Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux
Shared memory: cores directly access other cores' caches through the on-chip interconnects

4

TILE64 ARCHITECTURE (2)

Off-chip memory BW ≤ 200 Gbps
I/O BW ≥ 40 Gbps

5

TILE64 ARCHITECTURE (3)

Courtesy: http://www.tilera.com/products/processors/TILE64

6

INTERCONNECT HARDWARE

5 low-latency mesh networks
Each network connects a tile in five directions: north, south, east, west, and to the tile's own processor
Each link consists of two 32-bit unidirectional links, one in each direction

7

INTERCONNECT HARDWARE (2)

1.28 Tb/s of BW in and out of a single tile (consistent with 5 networks x 4 neighbor links x 2 unidirectional 32-bit links x 1 GHz = 1,280 Gb/s)

8

NETWORK USES

4 dynamic networks
◦ Packet header contains the destination's (x, y) coordinates and the packet length (≤ 128 words); see the sketch below
◦ Flow controlled, reliable delivery
◦ UDN: low-latency communication between userland processes without OS intervention
◦ IDN: direct communication with I/O devices
◦ MDN: communication with off-chip memory
◦ TDN: direct tile-to-tile transfers; requests go through the TDN, responses through the MDN

1 static network (STN)
◦ Streams of data instead of packets
◦ First set up a route, then send streams (circuit switched)
◦ Also a userland network

9
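To make the dynamic-network packet format concrete, here is a minimal C sketch. Only the (x, y) destination and the ≤ 128-word length come from the slide; the field widths and layout are illustrative assumptions, not the hardware encoding.

    #include <stdint.h>

    /* Illustrative layout only: the slides specify just the destination
       coordinates and a length of at most 128 words. */
    typedef struct {
        uint8_t dest_x;   /* destination column in the 8 x 8 mesh */
        uint8_t dest_y;   /* destination row in the 8 x 8 mesh */
        uint8_t length;   /* payload length in 32-bit words, <= 128 */
    } dyn_header;

    typedef struct {
        dyn_header hdr;
        uint32_t   payload[128];   /* up to 128 32-bit data words */
    } dyn_packet;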

LOGICAL VS. PHYSICAL NETWORKS

5 physically independent networks
Lots of free nearest-neighbor on-chip wiring
Buffer space accounts for roughly 60% of each network's area, yet each network occupies only about 1.1% of the tile's area
A more reliable on-chip network => less buffering is needed to manage link failure

10

NETWORK TO TILE INTERFACE

Tiles have register-mapped access to the on-chip networks: instructions can read/write from/to the UDN, IDN, or STN directly (see the sketch below)
The MDN and TDN are used indirectly, on cache misses

11
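To illustrate what register-mapped network access means, a hypothetical example: the network ports appear as ordinary registers, so any instruction can name one as a source or destination operand. The mnemonics and register names below are illustrative only, not the actual TILE64 ISA.

    /* Hypothetical assembly, shown as a C comment:
     *
     *   add  udn0, r3, r4   ; r3 + r4, result injected into the UDN
     *   sub  r5, udn0, r6   ; pop the next UDN word, subtract r6
     *
     * No loads, stores, or system calls are involved; reading a network
     * register dequeues a word, and writing one enqueues a word.
     */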

RECEIVE-SIDE HARDWARE DEMULTIPLEXING

Tag word = (sending node, stream number, message type)
Receiving hardware demultiplexes each message into the appropriate queue using its tag (see the sketch below)
On a tag miss, the data is sent to a 'catch-all' queue and an interrupt is raised
UDN has 4 demux queues plus one 'catch-all' queue
IDN has 2 demux queues plus one 'catch-all' queue
128 words of receive-side buffering per tile

12
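A software model of the demultiplexing just described, as a minimal C sketch. The queue counts and the 128-word buffer follow the slide; the tag encoding and function names are my assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define DEMUX_QUEUES 4     /* UDN per the slide; the IDN has 2 */
    #define QUEUE_WORDS  128   /* per-tile receive-side buffering */

    typedef struct {
        uint32_t tag;              /* tag this queue is configured for */
        uint32_t buf[QUEUE_WORDS];
        int      count;
    } queue_t;

    static queue_t demux[DEMUX_QUEUES];
    static queue_t catch_all;

    static void enqueue(queue_t *q, const uint32_t *data, int words) {
        for (int i = 0; i < words && q->count < QUEUE_WORDS; i++)
            q->buf[q->count++] = data[i];
    }

    /* What the receive hardware conceptually does per arriving message. */
    void demux_message(uint32_t tag, const uint32_t *data, int words) {
        for (int q = 0; q < DEMUX_QUEUES; q++)
            if (demux[q].tag == tag) {        /* tag hit */
                enqueue(&demux[q], data, words);
                return;
            }
        enqueue(&catch_all, data, words);     /* tag miss */
        printf("tag miss: interrupt raised for tag %u\n", tag);  /* stand-in */
    }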

RECEIVE-SIDE HARDWARE DEMULTIPLEXING (2)

13

PROTECTION

The Tile Architecture implements a Multicore Hardwall (MH)
MH protects the UDN, IDN, and STN links
Standard memory protection mechanisms are used for the MDN and TDN
MH blocks any attempt to send traffic over a hardwalled link, then signals an interrupt to system software
Protection is implemented on outbound links (see the sketch below)

14
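A conceptual model of the outbound-link check, as a C sketch. The per-direction bitmask and the names are assumptions for illustration; they are not the hardware interface.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum { DIR_NORTH, DIR_SOUTH, DIR_EAST, DIR_WEST };

    /* Assumed representation: one bit per outbound link direction,
       configured by system software; bit set => link is hardwalled. */
    static uint8_t hardwall_mask;

    bool try_send(int dir) {
        if (hardwall_mask & (1u << dir)) {
            /* MH blocks the traffic and interrupts system software. */
            printf("hardwall violation on link %d: interrupt\n", dir);
            return false;
        }
        return true;   /* link open: traffic proceeds */
    }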

SHARED MEMORY COMMUNICATION AND ORDERING

On-chip distributed shared cache. Data can be retrieved from:
1. The local cache
2. The home tile (request sent through the TDN); the data may be available only at its home tile, where coherence is maintained
3. Main memory

No ordering is guaranteed between the networks and shared memory
Memory fence instructions are used to enforce ordering (see the sketch below)

15
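A sketch of why the fence matters: a producer fills a shared buffer and then notifies a consumer over the network. Here __sync_synchronize() is a generic GCC full memory barrier standing in for the Tile memory-fence instruction, and udn_send is a hypothetical send primitive.

    #include <stdint.h>

    extern void udn_send(uint32_t word);       /* hypothetical UDN send */

    static volatile uint32_t shared_buf[64];   /* lives in shared memory */

    void produce(void) {
        for (uint32_t i = 0; i < 64; i++)
            shared_buf[i] = i;     /* 1. write data via shared memory */

        __sync_synchronize();      /* 2. fence: force the writes to be
                                         visible before the notification,
                                         since network and memory traffic
                                         are otherwise unordered */

        udn_send(1);               /* 3. notify the consumer over the UDN */
    }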

INTERCONNECT SOFTWARE

The C-based iLib library provides communication primitives implemented via the UDN:
◦ Lightweight socket-like streaming channels for streaming algorithms
◦ An MPI-like message-passing interface for ad hoc messaging

16

COMMUNICATION INTERFACES

iLib Sockets
◦ Long-lived FIFO point-to-point connection between two processes
◦ Good for producer-consumer relationships (see the sketch below)
◦ Multiple senders to one receiver are also possible; good for forwarding results to a single node for aggregation
◦ Raw channels: low overhead; limited to the space available in the hardware buffer
◦ Buffered channels: higher overhead, but virtualization into memory is possible

17
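A hedged sketch of the producer-consumer pattern over a channel. The functions below are hypothetical stand-ins, not the actual iLib API; the point is the long-lived point-to-point FIFO usage model.

    #include <stdint.h>

    /* Hypothetical channel primitives standing in for iLib's
       socket-like streaming API. */
    typedef struct channel channel_t;
    extern channel_t *channel_connect(int peer_tile);          /* one-time setup */
    extern void       channel_send(channel_t *c, uint32_t w);  /* blocks if full */
    extern uint32_t   channel_recv(channel_t *c);              /* blocks if empty */

    void producer(int consumer_tile) {
        channel_t *c = channel_connect(consumer_tile);
        for (uint32_t i = 0; i < 1024; i++)
            channel_send(c, i);            /* stream words down the FIFO */
    }

    uint32_t consumer(int producer_tile) {
        channel_t *c = channel_connect(producer_tile);
        uint32_t sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += channel_recv(c);        /* consume in FIFO order */
        return sum;
    }

With a raw channel the FIFO is effectively the hardware buffer itself, so a send blocks once that space is exhausted; a buffered channel would spill to memory instead, at higher per-word cost.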

COMMUNICATION INTERFACES (2)

Message Passing API
◦ Similar to MPI
◦ Messages can be sent from any node to any other node at any time
◦ No need to establish connections

Implementation (modeled in the sketch below)
◦ Sender: sends a packet with the message key and size
◦ The receiver's catch-all queue interrupts the processor
◦ If the receiver is expecting a message with this key, it sends a packet to the sender to begin the transfer
◦ Otherwise, the notification is saved; when ilib_msg_receive() is later called with the same key, a packet is sent to interrupt the sender and begin the transfer

18
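A minimal software model of this rendezvous protocol. Only ilib_msg_receive() is named in the slides; the notification table and helper functions here are my assumptions.

    #include <stdbool.h>

    #define MAX_PENDING 32

    typedef struct { int key, sender, size; bool used; } note_t;

    static note_t pending[MAX_PENDING];   /* saved notifications */
    static int    waiting_key = -1;       /* key the receiver expects now */

    extern void send_go_ahead(int sender);   /* hypothetical: tells the
                                                sender to start the transfer */

    /* Runs when a sender's (key, size) packet lands in the catch-all
       queue and interrupts the processor. */
    void catch_all_handler(int key, int sender, int size) {
        if (key == waiting_key) {            /* receiver already waiting */
            send_go_ahead(sender);
            return;
        }
        for (int i = 0; i < MAX_PENDING; i++)
            if (!pending[i].used) {          /* else save the notification */
                pending[i] = (note_t){ key, sender, size, true };
                return;
            }
    }

    /* Simplified stand-in for ilib_msg_receive(). */
    void msg_receive(int key) {
        for (int i = 0; i < MAX_PENDING; i++)
            if (pending[i].used && pending[i].key == key) {
                pending[i].used = false;
                send_go_ahead(pending[i].sender);   /* match a saved note */
                return;
            }
        waiting_key = key;   /* no note yet: wait for the interrupt */
    }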

COMMUNICATION INTERFACES (3)

19

COMMUNICATION INTERFACES (4)

The UDN's maximum BW is 4 bytes/cycle
Raw channels achieve at most 3.93 bytes/cycle; the overhead is the header word and the tag word (see the note below)
Buffered channels add the overhead of memory reads/writes
Message passing adds the overhead of interrupting the receiving tile

20
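A plausible accounting for the 3.93 bytes/cycle figure (my inference from this slide's numbers, not stated in the deck): with one header word and one tag word per maximum-length 128-word packet, the useful fraction of the 4 bytes/cycle link is

    4 \times \frac{128}{128 + 2} \approx 3.94 \ \text{bytes/cycle}

which is close to the reported 3.93; any residual would come from other small per-packet costs.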

COMMUNICATION INTERFACES (5)

21

APPLICATIONS

Corner Turn
◦ Reorganizes a distributed array from one dimension to another
◦ Each core sends data to every other core (see the sketch below)
◦ Important factors:
  - network used for data distribution (TDN using shared memory, or UDN using raw channels)
  - network used for synchronization among tiles (STN or UDN)

22
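For concreteness, a sketch of the all-to-all exchange at the heart of a corner turn, as run on each tile. NTILES, BLOCK, and the send/recv primitives are assumptions; a real implementation must also schedule transfers to avoid deadlock, which is exactly the overhead the next slide discusses.

    #include <stdint.h>

    #define NTILES 64
    #define BLOCK  256   /* words exchanged per peer (assumed size) */

    extern int  my_tile(void);
    extern void send_to(int tile, const uint32_t *buf, int words);
    extern void recv_from(int tile, uint32_t *buf, int words);

    /* Every tile exchanges one block with every other tile: the
       all-to-all traffic pattern of a corner turn. */
    void corner_turn(uint32_t in[NTILES][BLOCK], uint32_t out[NTILES][BLOCK]) {
        int me = my_tile();
        for (int step = 1; step < NTILES; step++) {
            int peer = (me + step) % NTILES;   /* simple round-robin schedule */
            send_to(peer, in[peer], BLOCK);    /* assumes buffered sends */
            recv_from(peer, out[peer], BLOCK);
        }
        for (int i = 0; i < BLOCK; i++)        /* local block: no network */
            out[me][i] = in[me][i];
    }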

APPLICATIONS (2)

◦ Raw channels, STN synchronization: best performance; raw channels have minimal overhead, and the STN keeps synchronization messages from interfering with the data
◦ Raw channels, UDN synchronization: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish between the two
◦ Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock

23

APPLICATIONS (3)

Dot Product
◦ Pairwise element multiplication, followed by addition of all the products (see the sketch below)
◦ 65,536-element dot product
◦ The shared-memory version is not scalable and has higher communication overhead
◦ From 2 to 4 tiles, speedup is sublinear because the dataset completely fits into the tiles' L2 caches

24
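A sketch of the parallel dot product as described. Each tile handles a contiguous slice; my_tile, send_partial, and collect_partials are hypothetical helpers for the final reduction.

    #include <stdint.h>

    #define N      65536   /* per the slide */
    #define NTILES 64

    extern int     my_tile(void);
    extern void    send_partial(int64_t p);    /* forward to tile 0 */
    extern int64_t collect_partials(void);     /* tile 0: sum of peers */

    /* Each tile multiplies and sums its slice of a and b; tile 0 then
       adds up the per-tile partial sums. */
    int64_t dot_product(const int32_t *a, const int32_t *b) {
        int     me    = my_tile();
        int     chunk = N / NTILES;
        int64_t part  = 0;

        for (int i = me * chunk; i < (me + 1) * chunk; i++)
            part += (int64_t)a[i] * b[i];      /* pairwise multiply-add */

        if (me == 0)
            return part + collect_partials();  /* final reduction */
        send_partial(part);
        return 0;
    }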

CONCLUSION

The Tile processor uses an unconventional architecture to achieve high on-chip communication BW
Effective use of this BW is possible thanks to the synergy between the hardware architecture and the software APIs (iLib)

25