Hong Wang & Robert A. Walker Computer Science Department Kent State University Kent, OH 44242 USA

A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications Hong Wang & Robert A. Walker Computer Science Department Kent State University Kent, OH 44242 USA


Page 2:

Outline of Talk

SIMD Associative Computing
  Associative Search & Associative Processing
  PE Interconnection Network
  Multiple Instruction Streams

ASC Processor (Work Mostly Complete)
  Pipelined Architecture
  Reconfigurable PE Interconnection Network
  Processor and Network Performance

MASC Architecture (Work in Progress)
  Implementation of Task Manager and Instruction Stream
  Sample Code
  Architecture and Sample Execution

Sample Application
  String Matching

Conclusion

Page 3:

[Figure: Associative SIMD Array. An Associative Control Unit (CU) drives six memory cells, each paired with an APE.]

Associative Computing
  Tabular data, cells referenced by content
  = Associative Search
    On a successful search, the cell is flagged
  + Associative Processing
    Flagged cells are processed further

SIMD Associative Computing
  Each memory cell uses an associative processing element (APE) to search concurrently

Page 4:

Associative Search

[Figure: an Associative Control Unit (CU) drives five APEs, one per table row, each with a responder flag R.]

Ford       Taurus    $22,000  Kent
Chevrolet  Malibu    $20,000  Akron
Ford       Taurus    $18,000  Akron
Ford       Focus     $14,000  Kent
Jeep       Wrangler  $25,000  Akron

Find all “Ford” cars for sale

Associative PEs (APEs) search for a key, and those that find it are flagged as responders

Page 5:

Associative Processing

[Figure: the same car table and five APEs as on Page 4; three responder flags R are set.]

The responders can be processed further, either one, all sequentially, or all in parallel, as needed
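The search and processing steps on the car table can be sketched in software. This is a minimal illustration, not the ASC hardware: the field names `make`, `model`, `price`, and `city` and the helper `associative_search` are assumptions introduced here, and the sequential loop stands in for comparisons that every APE would perform at once.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Field names are hypothetical; the slides show the table without headers.
    make: str
    model: str
    price: int
    city: str
    responder: bool = False  # the "R" flag held by the cell's APE


def associative_search(cells, field, key):
    """Flag every cell whose `field` equals `key` and return the responder indices.

    The loop models a comparison that all APEs would carry out concurrently.
    """
    for cell in cells:
        cell.responder = getattr(cell, field) == key
    return [i for i, cell in enumerate(cells) if cell.responder]


cells = [
    Cell("Ford", "Taurus", 22000, "Kent"),
    Cell("Chevrolet", "Malibu", 20000, "Akron"),
    Cell("Ford", "Taurus", 18000, "Akron"),
    Cell("Ford", "Focus", 14000, "Kent"),
    Cell("Jeep", "Wrangler", 25000, "Akron"),
]

# Associative search: find all "Ford" cars for sale.
responders = associative_search(cells, "make", "Ford")

# Associative processing: act only on the flagged cells (here, read their prices).
ford_prices = [cells[i].price for i in responders]
```

Processing "one" responder would use only the first index in `responders`; processing "all in parallel" corresponds to applying the same operation to every flagged cell in a single step.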

Page 6:

ASC: Associative SIMD Array w/ PE Network

[Figure: an Associative Control Unit (CU) drives eight APEs, each with its own memory; the APEs are linked by a PE Interconnection Network.]

Page 7:

MASC: ASC + Multiple Control Units / Instruction Streams

[Figure: three Associative Control Units (CUs), linked by a CU Interconnection Network, drive eight APEs, each with its own memory; the APEs are linked by a PE Interconnection Network.]


Page 9:

ASC Processor’s Pipelined Architecture

We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs

Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs
  In the Control Unit
    Instruction Fetch (IF)
    Part of Instruction Decode (ID)
  In the Scalar PE (SPE) and in each Parallel PE (PPE)
    Rest of Instruction Decode (ID)
    Execute (EX)
    Memory Access (MEM)
    Data Write Back (WB)
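The effect of five single-cycle stages can be illustrated with a small timing model. This is a sketch under an idealized assumption that is not stated on the slide: one instruction enters the pipeline per clock and nothing stalls. The helper names `stage_of` and `total_cycles` are hypothetical.

```python
# Stage order taken from the slide; IF and part of ID sit in the Control Unit,
# the rest of ID plus EX, MEM and WB sit in the SPE and in each PPE.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` during clock `cycle` (both 0-based).

    Returns None when the instruction has not entered, or has already left,
    the pipeline. Assumes an ideal pipeline with no stalls.
    """
    s = cycle - instr
    return STAGES[s] if 0 <= s < len(STAGES) else None


def total_cycles(n_instr):
    """Clocks needed to finish n_instr instructions in the ideal pipeline."""
    return n_instr + len(STAGES) - 1 if n_instr > 0 else 0


# Print which stage each of four instructions occupies in each clock.
for instr in range(4):
    row = [stage_of(instr, c) or "--" for c in range(total_cycles(4))]
    print(f"instr {instr}: " + " ".join(f"{name:>3}" for name in row))
```

Under this assumption, 100 instructions finish in 104 clocks rather than 500, which is where the throughput gain of pipelining comes from.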

Page 10:

Pipelined ASC Processor with Reconfigurable Interconnection Network

[Figure: block diagram with three blocks: Control Unit (CU), Sequential PE (SPE), and Parallel PE (PPE) Array. Labels: Instruction Memory, IF/ID Latch, Decoder, Register File, ID/EX Latch, EX/MEM Latch, Data Memory, MEM/WB Latch. Signals: Immediate Data, Broadcast Register Data.]

Page 11:

Processing Element (PE)

[Figure: PE datapath. Labels: Register File, Data Switch, Comparator, ID/EX Latch, Mask, EX/MEM Latch, MEM/WB Latch, Data Memory, MUX.]

Comparator implements associative search, pushes ‘1’ onto top of mask stack for responders, ‘0’ otherwise

A ‘0’ on top of the mask stack disables the ID/EX Latch
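The mask-stack behaviour can be sketched as follows. This is an illustrative model rather than the FPGA design: the class `PE`, the `end_search` pop operation, the initial stack value of 1, and the rule that an already disabled PE cannot respond are all assumptions added for the example.

```python
class PE:
    """Toy processing element with a mask stack (hypothetical model)."""

    def __init__(self, value):
        self.value = value
        self.mask_stack = [1]  # assumption: every PE starts enabled

    @property
    def enabled(self):
        # A 0 on top of the mask stack disables the PE (its ID/EX latch).
        return self.mask_stack[-1] == 1

    def search(self, key):
        # Comparator: push 1 for a responder, 0 otherwise.
        # Assumption: a PE that is already disabled never responds.
        self.mask_stack.append(1 if self.enabled and self.value == key else 0)

    def end_search(self):
        # Assumption: leaving the search scope pops the mask pushed by search().
        self.mask_stack.pop()

    def execute(self, fn):
        # Broadcast instruction: only enabled PEs change state.
        if self.enabled:
            self.value = fn(self.value)


pes = [PE(v) for v in (3, 7, 3)]
for pe in pes:
    pe.search(3)                    # PEs holding 3 become responders
for pe in pes:
    pe.execute(lambda v: v + 10)    # applied by responders only
for pe in pes:
    pe.end_search()

values = [pe.value for pe in pes]
```

Using a stack rather than a single flag lets searches nest: an inner search can narrow the responder set and the outer set is restored when the inner scope ends.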

Page 12:

Pipelined ASC Processor’s Performance

Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs
  Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz

Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors
  With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz

Page 13:

Reconfigurable PE Interconnection Network

Our pipelined ASC Processor also has a reconfigurable PE interconnection network

Reconfigurable PE network allows arbitrary PEs in the PE Array to be connected via
  Linear array (currently implemented), or
  2D mesh (to be implemented soon)
without the restriction of physical adjacency

Each PE in the PE Array can
  Choose to stay in the PE interconnection network, or
  Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication
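The bypass idea for the linear array can be sketched in a few lines. This is a software illustration only; the function names `neighbors` and `shift_right` and the boolean list `in_network` are assumptions, and the hardware resolves these connections combinationally rather than by searching.

```python
def neighbors(in_network, i):
    """Logical (left, right) neighbours of PE i in a linear array.

    PEs whose entry in `in_network` is False are bypassed, so the neighbours
    of an in-network PE are the nearest in-network PEs on either side.
    A PE that has left the network has no neighbours.
    """
    if not in_network[i]:
        return (None, None)
    left = next((k for k in range(i - 1, -1, -1) if in_network[k]), None)
    right = next((k for k in range(i + 1, len(in_network)) if in_network[k]), None)
    return (left, right)


def shift_right(values, in_network):
    """Every in-network PE receives the value of its left logical neighbour.

    Bypassed PEs, and the first in-network PE (which has no left neighbour),
    keep their own values. All transfers read the old values, as in SIMD.
    """
    result = list(values)
    for i in range(len(values)):
        left, _ = neighbors(in_network, i)
        if left is not None:
            result[i] = values[left]
    return result


# PEs 1 and 2 have chosen to stay out of the network.
in_network = [True, False, False, True, True]
shifted = shift_right([10, 20, 30, 40, 50], in_network)
```

Here PE 3 receives its data from PE 0 even though the two are not physically adjacent, which is the property the slide describes.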


Page 15:

Reconfigurable Network Implementation

[Figure: Data Switch. Labels: Register File, Register Data (from SPE), Immediate Data (from CU), Left Neighbor, Right Neighbor, Top of Mask Stack, Comparator & ID/EX Latch.]

Data switch
  Passes register, broadcast, and immediate data to the PE and to its two neighbors
  Routes data from the PE’s neighbors to its EX stage

Reconfigurable network supports Bypass Mode to remove non-responder PEs from the network
  Will be needed by MASC Processor

Page 16:

ASC Processor’s Network Performance

Performance of ASC Processor degrades as number of PEs is increased with Bypass Mode present
  Due to the long path from the first PE to the last PE in the PE array

4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present
  When the number of PEs is increased to 50, the clock frequency drops to 22 MHz

In the future we hope to reduce this delay using a pipelined or other multi-hop architecture


Page 18:

[Figure: two state diagrams. Task Manager: IDLE, Task_Allocation, Wait_For_IS, Join. Instruction Stream: Call_TM, Task_Execution, IDLE.]

Page 19:

MASC PE Structure

[Figure: a PE with an ID Register and an IS_TM_Chooser fed by IS1, IS2, TM1, and TM2.]

Page 20:

[Figure: the same Task Manager and Instruction Stream state diagrams as on Page 18, with added labels TM ID, IS ID, IS ID.]

Page 21:

Assembly Code Example

    .
    .
    101 Parallel_Select_Start Mem(110)
    102 Pcase Condition1 Mem(104)
    103 Pcase Condition2 Mem(107)
    104 Case1
    105 …
    106 Parallel_Case_End
    107 Case 2
    108 …
    109 Parallel_Case_End
    110 Parallel_Select_End (note: this does not trigger JOIN; a lack of tasks does)
    .
    .

Page 22:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

Page 23:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

Originally all PEs listen to IS0

Page 24:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

When Parallel Select is encountered, the Task Manager takes over the PEs

    101 Parallel_Select_Start Mem(110)

Page 25:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

TM then calls IS0 to perform 1st task

    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

Page 26:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

TM then calls IS1 to perform 2nd task

    103 Pcase Condition2 Mem(107)
    107 Case 2
    108 …

1st task, still running:

    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

Page 27:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

2nd task finishes and gives control back to TM

    107 Case 2
    108 …
    109 Parallel_Case_End

1st task, still running:

    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

Page 28:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

1st task finishes and gives control back to TM

    104 Case1
    105 …
    106 Parallel_Case_End

Page 29:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

Control returns to the last IS to finish, which is IS0

    110 Parallel_Select_End
    .
    .
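The walkthrough on Pages 23 to 29 can be approximated by a small software model. This is a sketch built on assumptions, not the MASC implementation: the functions `parallel_select` and `join`, the per-PE data values, and the two predicates are hypothetical, the Task Manager state machine is not modelled, and a PE that satisfies no case is simply left idle.

```python
def parallel_select(pe_data, cases, streams):
    """Assign each case of a Parallel Select to its own instruction stream.

    pe_data : one data value per PE (hypothetical stand-in for PE state)
    cases   : list of (case_name, predicate) pairs, in program order
    streams : names of the idle instruction streams, e.g. ["IS0", "IS1"]

    Returns (listening, assignment): the stream each PE listens to, and the
    stream chosen for each case. A PE that satisfies no case is left as None.
    """
    if len(cases) > len(streams):
        raise ValueError("sketch assumes one idle instruction stream per case")
    assignment = {}
    listening = [None] * len(pe_data)
    for (name, predicate), stream in zip(cases, streams):
        assignment[name] = stream
        for pe, value in enumerate(pe_data):
            # The first matching case wins, as in an ordinary select statement.
            if listening[pe] is None and predicate(value):
                listening[pe] = stream
    return listening, assignment


def join(listening, resume_stream):
    """After every task has finished, all PEs listen to one stream again."""
    return [resume_stream] * len(listening)


pe_data = [1, 8, 3, 9, 2, 7]                     # PE0 to PE5
cases = [("Case1", lambda v: v < 5),             # stands in for Condition1
         ("Case 2", lambda v: v >= 5)]           # stands in for Condition2
listening, assignment = parallel_select(pe_data, cases, ["IS0", "IS1"])

# In the walkthrough, control returns to IS0, the last stream to finish.
after_join = join(listening, "IS0")
```

Splitting the PEs this way lets both branches run at the same time on different instruction streams instead of one after the other on a single stream.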

Page 30:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5.]

IS1 encounters a nested Parallel Select

Page 31:

[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 to PE5. Labels: A = 2; C = A; B = A; Common Register.]

TM1 allocates the two tasks to IS1 and IS2


Page 33:

Sample Application: String Matching

One of the most fundamental computing operations

Variable Length Don’t Care (VLDC) SIMD associative algorithm (Mary Esenwein 1997 and Ping Xu 2005) can find all instances of a pattern string within a larger text string
  Exact-match version shown here
  Extensions support single-character and variable-length “don’t cares”

Demonstrates associative search, associative computing (responder processing), and the linear PE interconnection network

Page 34:

String Match by Associative Computing

[Figure: an Associative Control Unit (CU) drives five APEs (cells 1 to 5), each with a responder flag R.]

Look for match of pattern string AB in text string ABAA

Initialize variables as shown below

Note that “$” indicates a parallel variable

cell  text$  counter$  match$
1     @      0         0
2     A      0         0
3     B      0         0
4     A      0         0
5     A      0         0

patt_string = AB, patt_length = 2, patt_counter = 0

Page 35:

String Match by Associative Computing

Responders are text$ == patt_string[j] and counter$ == patt_counter;

[Figure: same state as on Page 34; j indexes patt_string.]

Page 36:

String Match by Associative Computing

Responders are text$ == patt_string[j] and counter$ == patt_counter;

[Figure: same state as on Page 34; one responder flag R is set (cell 3, which holds B).]

Page 37:

String Match by Associative Computing

Responders add 1 to counter$ and send result to counter$ of preceding cell via network;
patt_counter++;

[Figure: the responder is cell 3; counter$ of cell 2 changes from 0 to 1 and patt_counter changes from 0 to 1. All other values are unchanged.]

Page 38:

String Match by Associative Computing

Responders are text$ == patt_string[j] and counter$ == patt_counter;

[Figure: state for cells 1 to 5: text$ = (@, A, B, A, A), counter$ = (0, 1, 0, 0, 0), match$ = (0, 0, 0, 0, 0); patt_string = AB, patt_length = 2, patt_counter = 1; j indexes patt_string.]

Page 39:

String Match by Associative Computing

Responders are text$ == patt_string[j] and counter$ == patt_counter;

[Figure: same state as on Page 38; one responder flag R is set (cell 2, which holds A and has counter$ = 1).]

Page 40:

String Match by Associative Computing

Responders add 1 to counter$ and send result to counter$ of preceding cell via network;
patt_counter++;

[Figure: the responder is cell 2; counter$ of cell 1 changes from 0 to 2 and patt_counter changes from 1 to 2. All other values are unchanged.]

Page 41:

String Match by Associative Computing

Responders are counter$ == patt_length;

[Figure: state for cells 1 to 5: text$ = (@, A, B, A, A), counter$ = (2, 1, 0, 0, 0), match$ = (0, 0, 0, 0, 0); patt_string = AB, patt_length = 2, patt_counter = 2.]

Page 42:

String Match by Associative Computing

Responders are counter$ == patt_length;

[Figure: same state as on Page 41; the responder flag R of cell 1 is set.]

Page 43:

String Match by Associative Computing

Responders send 1 to match$ of next cell via network;

[Figure: the responder is cell 1; match$ of cell 2 changes from 0 to 1. All other values are unchanged.]

Page 44:

String Match by Associative Computing

Responders are match$ == 1;

Indicates cell(s) where match of pattern string AB in text string ABAA begins

[Figure: state for cells 1 to 5: text$ = (@, A, B, A, A), counter$ = (2, 1, 0, 0, 0), match$ = (0, 1, 0, 0, 0); patt_string = AB, patt_length = 2, patt_counter = 2; the responder flag R of cell 2 is set.]
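The exact-match steps shown on Pages 34 to 44 can be replayed in software. This is a sequential sketch of the parallel steps, reconstructed from the slide captions: it assumes the pattern is scanned from its last character to its first and that cell 1 holds the sentinel "@", and the function name `asc_string_match` is hypothetical. It does not cover the "don't care" extensions.

```python
def asc_string_match(text, pattern):
    """Return the 1-based cell numbers where `pattern` begins in `text`.

    Cell 1 holds the sentinel "@", so text[0] lives in cell 2. Each list
    comprehension models one associative search; each update models one
    transfer over the linear PE network.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    text_cells = ["@"] + list(text)          # text$
    n = len(text_cells)
    counter = [0] * n                        # counter$
    match = [0] * n                          # match$
    patt_counter = 0

    # Assumption: j walks the pattern from its last character to its first.
    for j in range(len(pattern) - 1, -1, -1):
        responders = [i for i in range(n)
                      if text_cells[i] == pattern[j] and counter[i] == patt_counter]
        # All responders act at once: compute every result before writing any.
        updates = {i - 1: counter[i] + 1 for i in responders if i > 0}
        for cell, value in updates.items():
            counter[cell] = value            # sent to the preceding cell
        patt_counter += 1

    # Cells whose counter reached the pattern length precede a full match.
    responders = [i for i in range(n) if counter[i] == len(pattern)]
    for i in responders:
        if i + 1 < n:
            match[i + 1] = 1                 # sent to the next cell

    return [i + 1 for i in range(n) if match[i] == 1]


cells_with_match = asc_string_match("ABAA", "AB")
```

For the slide example the result is cell 2, matching the final responder on Page 44. Overlapping occurrences are also found, for instance pattern AA in text AAA begins in cells 2 and 3.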

Page 45:

Conclusion

We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing
  Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs
  Additional functionality is provided by a reconfigurable PE interconnection network

Future work will include
  Support for multiple Control Units (in progress)
  Performance improvement to support more efficient broadcast to a large number of PEs