November 18, 2005 PACL and ASC Processor Research Overview

Research Overview

Parallel and Associative Computing Group and the ASC Processor Group

Kent State University

Dr. Johnnie Baker, Dr. Robert Walker, and Dr. Jerry Potter (Emeritus), Michael Scherger, Wittaya Chantamas, Hong Wang

Sabegh Singh Virdi, Shannon Steinfadt, Kevin Schaffer

Department of Computer Science

Kent State University

Kent, Ohio

[Figure: research areas of the two groups. Labels in the diagram:]

Associative Models of Computation
Parallel Runtime Environments
Parallel and Associative System Software
Parallel and Associative Applications
Associative and Parallel Algorithms
Parallel and Associative Research Group
ASC Processor Research Group
FPGA-Based ASC Processor
MASC Processor
Structure Codes, ASC-centric Implementations
Pipelined ASC w/ Reconfigurable Network
Multithreaded ASC Processor

Presentation Outline

Short Overview of Associative Models
  The Single Instruction Stream ASC Model
  The Multiple-Instruction Stream MASC Model

Architectural Modeling and Runtime Environments
  MASC Runtime Environments – Michael Scherger
  Supporting Multiple Instruction Streams using the Manager-Worker Paradigm – Wittaya Chantamas

ASC Processor Design
  Scalable Pipelined ASC Processor with Reconfigurable PE Network to Support MASC – Hong Wang

Associative Models of Computation

Associative Computer: A SIMD computer with certain additional hardware features.
  Features can be supported (less efficiently) in software by a traditional SIMD
  The name “associative” is due to its ability to locate items in the memory of PEs by content rather than location.
  Uses associative features to simulate an associative memory

The ASC model (for ASsociative Computing) identifies the properties assumed for an associative computer.

The Associative Computing (ASC) Model

[Figure: one Instruction Stream connected through a Broadcast / Reduction Network to an array of cells, each consisting of a PE and its Memory; the cells are also connected by a Cell Network.]

Associative Properties of the ASC Model

Broadcast data in constant time

Constant time global reduction of
  Boolean values using AND/OR
  Integer values using MAX/MIN

Constant time associative search

Responder processing
  An IS can detect if a data test is satisfied by any of its cells in constant time (i.e., any-responders)
  An IS can select one arbitrary responder in constant time (i.e., pick-one)

Above properties supported in hardware with broadcast and reduction networks
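The associative operations listed above can be illustrated with a small software sketch. This is only a functional model under stated assumptions: the names `search`, `any_responders`, `pick_one`, and `reduce_max` are hypothetical, each cell is modelled as a Python dict, and the loops here run in linear time, whereas the ASC model assumes hardware networks that make each operation constant time.

```python
# Functional sketch of ASC associative operations (hypothetical helper names).
# A "cell" is one PE's local memory, modelled as a dict of fields.

def search(cells, test):
    """Associative search: every cell evaluates the same test on its own data."""
    return [bool(test(cell)) for cell in cells]

def any_responders(mask):
    """OR-reduction of the responder bits."""
    return any(mask)

def pick_one(mask):
    """Select one responder (here: the lowest index), or None if there is none."""
    for index, bit in enumerate(mask):
        if bit:
            return index
    return None

def reduce_max(cells, mask, field):
    """MAX-reduction of an integer field over the responders."""
    values = [cell[field] for cell, bit in zip(cells, mask) if bit]
    return max(values) if values else None

cells = [
    {"id": 0, "age": 31},
    {"id": 1, "age": 47},
    {"id": 2, "age": 25},
    {"id": 3, "age": 47},
]
mask = search(cells, lambda cell: cell["age"] > 30)
print(mask, any_responders(mask), pick_one(mask), reduce_max(cells, mask, "age"))
```

Choosing the lowest-index responder in `pick_one` is an arbitrary design choice for the sketch; the model only requires that some responder be selected.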

References: M. Jin, J. Baker, and K. Batcher, Timings of Associative Operations on the MASC model, Workshop of Massively Parallel Processing, IPDPS ’01.

The MASC Model

[Figure: several Instruction Streams, joined by an Instruction Stream Network, connected through a Broadcast / Reduction Network to an array of cells, each consisting of a PE and its Memory; the cells are also connected by a Cell Network.]

The MASC Model

MASC (i.e., Multiple ASC) is a multiple ASC model
  Multiple SIMD model with more than one Instruction Stream (IS)
  Each IS can execute a separate data-parallel task
  These threads execute to completion without interacting or being interrupted

Dynamically reconfigurable
  Each cell listens to only one IS
  Cells can switch ISs, based on a data test.
  Cells can switch between being active, inactive, or idle

Each IS with its cells satisfies the ASC model

Job/functional parallelism is used to control the ISs

WEBSITE FOR PAPERS

http://www.cs.kent.edu/~parallel

Follow pointer to “papers”

MASC Runtime Environment

Designed extensions to the existing ASC instruction set to support multiple instruction streams
  ISGEN compiler extension
  Reference: Scherger, Michael, Jerry Potter, and Johnnie Baker, “Multiple Instruction Stream Control for an Associative Model of Parallel Computation”, Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), April 2003.

Developed a prototype MASC runtime environment using a cluster (proof of concept for multiple instruction streams)

Parallel if-then-else with Instruction Stream Commands

MI_REGION_BEGIN A
if (parallel conditional expression) then
    MI_BEGIN A0
        <body_1> /* 15 instructions */
    MI_END A0
else
    MI_BEGIN A1
        <body_2> /* 10 instructions */
    MI_END A1
endif;
MI_REGION_END A

Shape Example

[Figure: flowchart with three tests and three actions. Circle? leads to Compute Area of Circle; Rectangle? leads to Compute Area of Rectangle; Triangle? leads to Compute Area of Triangle.]

Runtime Environment

[Figure: the shape flowchart (Circle?, Rectangle?, Triangle?) annotated with the MI_REGION_BEGIN, MI_BEGIN, MI_END, and MI_REGION_END markers from the listing below.]

Instruction stream activity, in order, for each IS:

IS 0: compare shape == circle; non-responders -> IS 1; compute circle area; noop; noop; noop; noop
IS 1: noop; compare shape == rect; non-responders -> IS 2; compute rectangle area; noop; noop; listen IS 0
IS 2: noop; noop; noop; compare shape == triangle; compute triangle area; listen IS 1

MI_REGION_BEGIN A
compare if shape is a circle
MI_BEGIN A0
    5 instructions to compute area of a circle
MI_END A0
MI_BEGIN A1
    MI_REGION_BEGIN B
    compare if shape is a rectangle
    MI_BEGIN A1-B0
        3 instructions to compute area of a rectangle
    MI_END A1-B0
    MI_BEGIN A1-B1
        MI_REGION_BEGIN C
        compare if shape is a triangle
        MI_BEGIN A1-B1-C0
            5 instructions to compute area of a triangle
        MI_END A1-B1-C0
        MI_REGION_END C
    MI_END A1-B1
    MI_REGION_END B
MI_END A1
MI_REGION_END A
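The shape example can be mimicked in software to show how cells move between instruction streams. The sketch below is an illustration only, not the MASC runtime: `run_shape_example` and the cell field names are hypothetical, and the three instruction streams are stepped one after another here, whereas in MASC their work overlaps in time.

```python
import math

def run_shape_example(cells):
    """Cells start on IS 0; non-responders to a test switch to the next IS."""
    tests = ["circle", "rectangle", "triangle"]  # test issued by IS 0, IS 1, IS 2
    area = {
        "circle": lambda c: math.pi * c["r"] ** 2,
        "rectangle": lambda c: c["w"] * c["h"],
        "triangle": lambda c: 0.5 * c["b"] * c["h"],
    }
    listening = {i: 0 for i in range(len(cells))}  # cell index -> IS id
    for is_id, shape in enumerate(tests):
        for i, cell in enumerate(cells):
            if listening[i] != is_id:
                continue                      # this cell listens to another IS
            if cell["shape"] == shape:
                cell["area"] = area[shape](cell)   # responder: compute its area
            elif is_id + 1 < len(tests):
                listening[i] = is_id + 1      # non-responder: switch to next IS
    return listening

cells = [
    {"shape": "circle", "r": 1.0},
    {"shape": "rectangle", "w": 2, "h": 3},
    {"shape": "triangle", "b": 4, "h": 5},
]
listening = run_shape_example(cells)
print(listening, [round(c["area"], 3) for c in cells])
```

The returned mapping records which IS each cell was listening to when its area was computed; the final "listen IS 0" and "listen IS 1" steps that return cells to their parent stream are left out of the sketch.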

Outline

A review of MASC Computational Model using manager/worker paradigm and work pool of tasks

Design and implementation of MASC back-end compiler for ASC language (an ongoing project)

An overview of the MASC emulator (the next project)

MASC Computational Model

Two types of ISs
  one manager IS
    fork and join tasks
    manage work pool
  a few worker ISs
    execute tasks

A work pool of tasks

[Figure: Manager-IS (ID 0), Worker-IS ID 1, and Worker-IS ID 2 joined by an Instruction Stream Network and connected through Broadcast/Reduction Networks to an array of cells; the cells are also connected by a Cell Network.]
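The manager/worker arrangement can be sketched as a simple work-pool loop. This is a minimal sequential model under assumptions: `run_work_pool` is a hypothetical name, tasks are plain Python callables, and the worker "rounds" stand in for work that real worker ISs would perform concurrently.

```python
from collections import deque

def run_work_pool(tasks, num_workers=2):
    """Manager IS (ID 0) hands tasks from the pool to idle worker ISs (IDs 1..n)."""
    pool = deque(tasks)                        # the work pool of tasks
    idle = deque(range(1, num_workers + 1))    # idle worker IS IDs
    running = []                               # (worker_id, task) pairs
    log = []
    while pool or running:
        # Fork: assign tasks while both a task and an idle worker are available.
        while pool and idle:
            running.append((idle.popleft(), pool.popleft()))
        # Each assigned task runs to completion on its worker IS.
        for worker, task in running:
            log.append((worker, task()))
            idle.append(worker)
        running.clear()                        # join: all assigned tasks finished
    return log

tasks = [lambda: "A0", lambda: "A1", lambda: "B0"]
log = run_work_pool(tasks)
print(log)
```

With two workers and three tasks, the third task waits in the pool until a worker becomes idle again, which is the behavior the work pool is meant to capture.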

MASC Directive

Concurrent data-parallel execution of the different paths in a branch can be achieved by using the directive /* .masc fork */

The user has tight control
  Not all paths in all branches will be executed concurrently
  Only those in branches marked with the directive will be

Treated as a comment by the ASC compiler (shown in the .lst file, not shown in the .iob file)
  No need for a new ASC compiler in order to run an ASC program on a MASC system

main test
int parallel b[$], c[$], d[$];
logical parallel BCD[$];
associate b[$], c[$], d[$] with BCD[$];

read b[$] c[$] d[$] in BCD[$];
b[$] = c[$] + 2;
c[$] = d[$] - 3;

/* will be no fork here */
if (b[$] .lt. c[$]) then
    b[$] = c[$];
    d[$] = 4;
else
    c[$] = b[$];
    b[$] = d[$];
endif;
c[$] = d[$];
d[$] = c[$];
end;

Structure codes shown in the task diagram: M1000000, W1100000, M1110000

.MI_BEGIN W1100000
beg_of_stmt 1c00 6 0
beg_read 5a00 SYSOT
BCD B,C,D,
…
beg_of_stmt 1c00 20 0
mvpa_ 4812 C D
.MI_END W1100000

main test
int parallel b[$], c[$], d[$];
logical parallel BCD[$];
associate b[$], c[$], d[$] with BCD[$];

read b[$] c[$] d[$] in BCD[$];
b[$] = c[$] + 2;
c[$] = d[$] - 3;

/*.MASC FORK */
if (b[$] .lt. c[$]) then
    b[$] = c[$];
    d[$] = 4;
else
    c[$] = b[$];
    b[$] = d[$];
endif;
c[$] = d[$];
d[$] = c[$];
end;

Structure codes shown in the task diagram: M1000000, W1100000, M1110000, W1111000, W1112000, W111X100, M111X110

.MI_BEGIN W1112000
beg_of_stmt 1c00 16 0
mvpa_ 4812 B C
beg_of_stmt 1c00 17 0
mvpa_ 4812 D B
.MI_END W1112000
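One way to see why the two paths of the forked branch can run concurrently is that they touch disjoint sets of cells. The sketch below models the example program in Python under assumptions: `run_program` is a hypothetical name, the `read` statement is replaced by passing in the rows directly, and the branch condition is taken to be evaluated once before either path runs. It then checks that either execution order of the two paths gives the same result.

```python
def run_program(rows, else_first=False):
    """Model of the example program; rows holds one dict (b, c, d) per cell."""
    cells = [dict(row) for row in rows]
    for x in cells:
        x["b"] = x["c"] + 2
        x["c"] = x["d"] - 3
    # Assumption: the parallel condition is evaluated once, before either path.
    mask = [x["b"] < x["c"] for x in cells]

    def then_path():
        for x, responder in zip(cells, mask):
            if responder:
                x["b"] = x["c"]
                x["d"] = 4

    def else_path():
        for x, responder in zip(cells, mask):
            if not responder:
                x["c"] = x["b"]
                x["b"] = x["d"]

    order = [else_path, then_path] if else_first else [then_path, else_path]
    for path in order:
        path()
    for x in cells:          # statements after endif run on all cells again
        x["c"] = x["d"]
        x["d"] = x["c"]
    return cells

rows = [{"b": 0, "c": 1, "d": 10}, {"b": 0, "c": 5, "d": 2}]
print(run_program(rows))
print(run_program(rows) == run_program(rows, else_first=True))
```

Because each cell belongs to exactly one path, the order in which the paths execute (or whether they overlap) does not change the outcome for this program.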

A MASC Emulator

Software that emulates the exact behavior of MASC hardware on a PC
  Thus, allows an ASC program to run on a PC as if the program were run on a MASC system

A modified version of the existing ASC emulator with built-in performance monitoring

The manager/worker paradigm and work pool idea will be implemented in the emulator
  MASC runtime system

Outline of Talk

ASC Processor (Work Mostly Complete)
  Pipelined Architecture
  Reconfigurable PE Interconnection Network
  Processor and Network Performance

MASC Architecture (Work in Progress)
  Implementation of Task Manager and Instruction Stream
  Sample Code
  Architecture and Sample Execution

Conclusion

ASC Processor’s Pipelined Architecture

We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs

Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs
  In the Control Unit
    Instruction Fetch (IF)
    Part of Instruction Decode (ID)
  In the Scalar PE (SPE), in each Parallel PE (PPE)
    Rest of Instruction Decode (ID)
    Execute (EX)
    Memory Access (MEM)
    Data Write Back (WB)
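The effect of five single-cycle stages can be sketched with an idealized schedule. This is a textbook-style model, not a description of the actual FPGA design: `pipeline_schedule` and `total_cycles` are hypothetical names, and stalls and hazards are assumed away.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in it.

    Ideal pipeline: instruction i enters IF in cycle i + 1 and advances one
    stage per cycle, with no stalls.
    """
    schedule = {}
    for i in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(i + offset + 1, []).append((i, stage))
    return schedule

def total_cycles(num_instructions):
    """Cycles to finish n instructions: n + (stages - 1), or 0 if n is 0."""
    return num_instructions + len(STAGES) - 1 if num_instructions else 0

sched = pipeline_schedule(3)
print(sched[3], total_cycles(3), total_cycles(100))
```

Under these assumptions, up to five instructions are in flight at once, so a long instruction sequence approaches one completed instruction per clock cycle.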

Pipelined ASC Processor with Reconfigurable Interconnection Network

[Figure: block diagram of the Control Unit (CU), Sequential PE (SPE), and Parallel PE (PPE) Array. Labeled components: Instruction Memory, IF/ID Latch, Decoder, ID/EX Latch, EX/MEM Latch, MEM/WB Latch, Register File, Data Memory, Immediate Data, Broadcast Register Data.]

Processing Element (PE)

[Figure: PE block diagram. Labeled components: Register File, Data Switch, Comparator, ID/EX Latch, Mask, EX/MEM Latch, MEM/WB Latch, Data Memory, MUX.]

Comparator implements associative search, pushes ‘1’ onto top of stack for responders, ‘0’ otherwise

A ‘0’ on top of the mask stack disables the ID/EX Latch
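The comparator and mask stack behavior described above can be modelled in a few lines. This is a behavioral sketch only: the `PE` class and its method names are hypothetical, the stack is assumed to start with a single '1', and the rule that a disabled PE cannot become a responder in a nested search is an assumption, not something stated on the slide.

```python
class PE:
    """Behavioral model of one PE's comparator and mask stack."""

    def __init__(self, value):
        self.value = value
        self.mask_stack = [1]          # assumption: every PE starts enabled

    def compare(self, test):
        """Associative search: push 1 if this PE responds, 0 otherwise."""
        # Assumption: a PE that is currently disabled never responds.
        responds = self.mask_stack[-1] == 1 and test(self.value)
        self.mask_stack.append(1 if responds else 0)

    def end_search(self):
        """Pop the mask pushed by the matching compare."""
        self.mask_stack.pop()

    def execute(self, op):
        """Apply op only if the top of the mask stack is 1."""
        if self.mask_stack[-1] == 1:
            self.value = op(self.value)

pes = [PE(v) for v in (3, 8, 12)]
for pe in pes:
    pe.compare(lambda v: v > 5)        # PEs holding 8 and 12 respond
for pe in pes:
    pe.execute(lambda v: v + 100)      # runs on responders only
after_masked = [pe.value for pe in pes]
for pe in pes:
    pe.end_search()
for pe in pes:
    pe.execute(lambda v: v * 2)        # all PEs are enabled again
print(after_masked, [pe.value for pe in pes])
```

Using a stack rather than a single mask bit is what allows searches to nest: each nested test pushes a new mask and restores the previous one when it ends.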

Pipelined ASC Processor’s Performance

Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs
  Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz

Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors
  With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz

Reconfigurable PE Interconnection Network

Our pipelined ASC Processor also has a reconfigurable PE interconnection network

Reconfigurable PE network allows arbitrary PEs in the PE Array to be connected via
  Linear array (currently implemented), or
  2D mesh (to be implemented soon)
without the restriction of physical adjacency

Each PE in the PE Array can
  Choose to stay in the PE interconnection network, or
  Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication
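The bypass behavior on a linear array can be illustrated as follows. This is a functional sketch under assumptions: `right_neighbor` and `shift_right` are hypothetical names, and the model ignores timing, which is where the real network pays for long bypass paths.

```python
def right_neighbor(in_network, i):
    """Index of the nearest PE to the right of i that is in the network, or None."""
    for j in range(i + 1, len(in_network)):
        if in_network[j]:
            return j
    return None

def shift_right(values, in_network):
    """Each participating PE sends its value to its logical right neighbor.

    PEs that stay out of the network are bypassed and receive nothing.
    """
    received = [None] * len(values)
    for i, active in enumerate(in_network):
        if active:
            j = right_neighbor(in_network, i)
            if j is not None:
                received[j] = values[i]
    return received

values = [10, 11, 12, 13, 14, 15]
in_network = [True, False, False, True, True, False]
received = shift_right(values, in_network)
print(received)
```

Here PE0 and PE3 behave as neighbors even though PE1 and PE2 sit between them physically, which is the sense in which the network removes the restriction of physical adjacency.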

Reconfigurable Network Implementation

[Figure: Data Switch diagram. Labels: Data Switch, Register File, Register Data (from SPE), Immediate Data (from CU), Left Neighbor, Right Neighbor, Top of Mask Stack, Comparator & ID/EX Latch.]

Data switch
  Passes register, broadcast, and immediate data to the PE and to its two neighbors
  Routes data from the PE’s neighbors to its EX stage

Reconfigurable network: supports Bypass Mode to remove the PE non-responders from the network
  Will be needed by MASC Processor

ASC Processor’s Network Performance

Performance of ASC Processor degrades as number of PEs is increased with Bypass Mode present
  Due to the long path from the first PE to the last PE in the PE array

4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present
  When the number of PEs is increased to 50, the clock frequency drops to 22 MHz

In the future we hope to reduce this delay using a pipelined or other multi-hop architecture

[Figure: two state diagrams. Task Manager states: IDLE, Task_Allocation, Wait_For_IS, Join. Instruction Stream states: IDLE, Call_TM, Task_Execution.]

MASC PE Structure

[Figure: MASC PE structure. Labels: PE, IS_TM_Chooser, IS1, IS2, TM1, TM2, ID Register.]

[Figure: the Task Manager state diagram (IDLE, Task_Allocation, Wait_For_IS, Join) and the Instruction Stream state diagram (IDLE, Call_TM, Task_Execution), annotated with TM ID and IS ID labels.]

Assembly Code Example

.
.
101 Parallel_Select_Start Mem(110)
102 Pcase Condition1 Mem(104)
103 Pcase Condition2 Mem(107)
104 Case1
105 …
106 Parallel_Case_End
107 Case 2
108 …
109 Parallel_Case_End
110 Parallel_Select_End (note: this does not trigger JOIN; a lack of tasks does)
.
.

[Figure: three Task Managers (TM0, TM1, TM2), three Instruction Streams (IS0, IS1, IS2), and six PEs (PE0 to PE5).]

Originally All PEs listen to IS0

When Parallel Select is met, Task Manager takes over PEs

101 Parallel_Select_Start Mem(110)

TM then calls IS0 to perform 1st task

102 Pcase Condition1 Mem(104)
104 Case1
105 …

TM then calls IS1 to perform 2nd task

103 Pcase Condition2 Mem(107)
107 Case 2
108 …

102 Pcase Condition1 Mem(104)
104 Case1
105 …

2nd task finishes and gives control back to TM

107 Case 2
108 …
109 Parallel_Case_End

102 Pcase Condition1 Mem(104)
104 Case1
105 …

1st task finishes and gives control back to TM

104 Case1
105 …
106 Parallel_Case_End

Control returns to the last IS to finish, which is IS0

110 Parallel_Select_End
.
.

IS1 meets a nested parallel select code

TM1 allocates the two tasks to IS1 and IS2

A = 2

C = A

B = A

Common Register

Conclusion

We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing
  Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs
  Additional functionality is provided by a reconfigurable PE interconnection network

Future work will include
  Support for multiple Control Units (in progress)
  Performance improvement to support more efficient broadcast to a large number of PEs