Reconfigurable computing


Savitribai Phule PUNE UNIVERSITY Reconfigurable Computing (504203) Module I Notes

INTRODUCTION

Computing architectures can be categorized into three main groups according to their degree of flexibility:

i. The general purpose computing group that is based on the Von Neumann (VN) computing paradigm;

ii. Domain-specific processors, tailored for a class of applications that share a wide range of characteristics;

iii. Application-specific processors tailored for only one application.

1. General Purpose Computing

General purpose computers are based on the Von Neumann architecture, according to which a computer can have a simple, fixed structure, able to execute any kind of computation, given a properly programmed control, without the need for hardware modification.

The general structure of a VN machine is as shown in the figure below. It consists of:

A memory for storing program and data. Harvard architectures contain two parallel accessible memories for storing program and data separately.

A control unit (also called control path) featuring a program counter that holds the address of the next instruction to be executed.

An arithmetic and logic unit (also called data path) in which instructions are executed.

A program is coded as a set of instructions to be executed sequentially, instruction after instruction. At each step of the program execution, the next instruction is fetched from the memory at the address specified in the program counter and decoded. The required operands are then collected from the memory before the instruction is executed. After execution, the result is written back into the memory.

In this process, the control path is in charge of setting all signals necessary to read from and write to the memory, and to allow the data path to perform the right computation: it interprets the instructions and sets the data path's signals accordingly to execute the desired operation.

In general, the execution of an instruction on a VN computer can be done in five cycles: Instruction Read (IR), in which an instruction is fetched from the memory; Decoding (D), in which the meaning of the instruction is determined and the operands are localized; Read Operands (R), in which the operands are read from the memory; Execute (EX), in which the instruction is executed with the read operands; and Write Result (W), in which the result of the execution is stored back to the memory.
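A minimal sketch in Python of the fetch/decode/read/execute/write loop described above. The tiny three-instruction ISA (ADD, MUL, HALT) and the memory layout are made up purely for illustration; they are not part of the notes:

```python
# Minimal model of the five VN execution cycles (IR, D, R, EX, W).
# The mini-ISA (ADD, MUL, HALT) and memory layout are illustrative only.
memory = {
    0: ("ADD", 10, 11, 12),   # program: mem[12] = mem[10] + mem[11]
    1: ("MUL", 12, 11, 13),   #          mem[13] = mem[12] * mem[11]
    2: ("HALT",),
    10: 4, 11: 5, 12: 0, 13: 0,  # data
}

pc = 0
while True:
    instr = memory[pc]                 # IR: fetch the instruction at the program counter
    op = instr[0]                      # D:  decode the operation, localize the operands
    if op == "HALT":
        break
    src1, src2, dst = instr[1:]
    a, b = memory[src1], memory[src2]  # R:  read operands from memory
    result = a + b if op == "ADD" else a * b   # EX: execute with the read operands
    memory[dst] = result               # W:  write the result back to memory
    pc += 1

print(memory[12], memory[13])          # 9 45
```

Note how, in each iteration, only one of the five phases touches each hardware resource; this is the idleness that pipelining later exploits.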

In each of those five cycles, only the part of the hardware involved in the computation is activated; the rest remains idle. For example, when the IR cycle is performed, the program counter is activated to provide the address of the instruction, the memory is addressed, and the instruction register that stores the instruction before decoding is also activated. Apart from those three units (program counter, memory and instruction register), all the other units remain idle. Fortunately, the structure of instructions allows several of them to occupy the idle parts of the processor, thus increasing the computation throughput.


Instruction Level Parallelism

Pipelining is a transparent way to optimize hardware utilization as well as the performance of programs. Because the execution of one instruction cycle affects only a part of the hardware, the idle parts can be activated by having many different cycles executing together.

For one instruction, it is not possible to have many cycles executed together. For instance, any attempt to perform the execute cycle (EX) together with the reading of an operand (R) for the same instruction will not work, because the data needed for EX must first be provided by the R cycle.

Nevertheless, the two cycles EX and R can be performed in parallel for two different instructions. Once the data have been collected for the first instruction, its execution can start while the data are being collected for the second instruction. This overlapping in the execution of instructions is called pipelining or instruction level parallelism (ILP), and it is aimed at increasing the throughput in the execution of instructions as well as the resource utilization. It should be mentioned that ILP does not reduce the execution latency of a single instruction, but increases the throughput of a set of instructions.


If tcycle is the time needed to execute one cycle, then the execution of one instruction requires 5 ∗ tcycle. If three instructions have to be executed, then the time needed to perform the execution of those three instructions without pipelining is 15 ∗ tcycle, as illustrated in the figure. Using pipelining, the ideal time needed to perform those three instructions, when no hazards have to be dealt with, is 7 ∗ tcycle. In reality, hazards must be taken into account, which increases the overall computation time to 9 ∗ tcycle.
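The timing argument can be checked with a short calculation. The ideal-pipeline formula (stages + instructions − 1) cycles is standard; the two extra hazard cycles are taken from the example above:

```python
# Execution time for n instructions on a 5-stage machine, in units of t_cycle.
STAGES = 5            # IR, D, R, EX, W
n = 3                 # number of instructions

no_pipeline    = n * STAGES           # 3 * 5 = 15 cycles
ideal_pipeline = STAGES + (n - 1)     # 5 + 2 = 7 cycles (no hazards)
with_hazards   = ideal_pipeline + 2   # 9 cycles in the example above

print(no_pipeline, ideal_pipeline, with_hazards)   # 15 7 9
```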

The main advantage of the VN computing paradigm is its flexibility, because it can be used to program almost all existing algorithms. However, each algorithm can be implemented on a VN computer only if it is coded according to the VN rules. We say in this case that 'the algorithm must adapt itself to the hardware'.

Disadvantage: Because all algorithms must be sequentially programmed to run on a VN computer, many algorithms cannot be executed with their best potential performance. Algorithms that perform the same set of inherently parallel operations on a huge set of data are not good candidates for implementation on a VN machine.

If the class of algorithms to be executed is known in advance, then the processor can be modified to better match the computation paradigm of that class of application. In this case, the data path will be tailored to always execute the same set of operations, thus making the memory access for instruction fetching as well as the instruction decoding redundant.

Also, because of the temporal use of the same hardware for a wide variety of applications, VN computation is often characterized as 'temporal computation'.


2. Domain-Specific Processors

A domain-specific processor is a processor tailored for a class of algorithms. The data path is tailored for an optimal execution of a common set of operations that mostly characterizes the algorithms in the given class. Also, memory access is reduced as much as possible. Digital Signal Processors (DSPs) belong to the most used domain-specific processors.

A DSP is a specialized processor used to speed up the computation of repetitive, numerically intensive tasks in signal processing areas such as telecommunication, multimedia, automobile, radar, sonar, seismic and image processing. The most often cited feature of DSPs is their ability to perform one or more multiply-accumulate (MAC) operations in a single cycle. Usually, MAC operations have to be performed on a huge set of data. In a MAC operation, data are first multiplied and then added to an accumulated value.

A normal VN computer would perform a MAC in 10 steps: the first instruction (multiply) would be fetched, then decoded, then the operands would be read and multiplied, and the result would be stored back; then the next instruction (accumulate) would be read, the result stored in the previous step would be read again and added to the accumulated value, and the result would be stored back. DSPs avoid those steps by using specialized hardware that directly performs the addition after the multiplication, without having to access the memory.
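A sketch of the multiply-accumulate pattern a DSP optimizes, written as a plain Python loop. The FIR-style dot product and the sample values are illustrative, not taken from the notes; the point is that a DSP performs each multiply and accumulation in a single cycle without writing the intermediate product back to memory:

```python
# Multiply-accumulate (MAC): acc = acc + x[i] * c[i] over a large data set.
# A DSP executes each multiply-and-add in one cycle; a plain VN machine
# needs separate multiply and add instructions with memory traffic in between.
samples      = [1.0, 2.0, 3.0, 4.0]      # illustrative data
coefficients = [0.5, 0.5, 0.25, 0.25]    # illustrative coefficients

acc = 0.0
for x, c in zip(samples, coefficients):
    acc += x * c          # one MAC operation per data pair

print(acc)   # 3.25
```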

Because many DSP algorithms involve performing repetitive computations, most DSP processors provide special support for efficient looping. Often a special loop or repeat instruction is provided, which allows a loop implementation without expending any instruction cycles for updating and testing the loop counter or branching back to the top of the loop.

DSPs are also customized for data with a given width according to the application domain. For example, if a DSP is to be used for image processing, then pixels have to be processed. If the pixels are represented in the Red Green Blue (RGB) system, where each colour is represented by a byte, then an image processing DSP needs no more than an 8-bit data path. Obviously, such an image processing DSP cannot be reused for applications requiring 32-bit computation.

Advantage and Disadvantage: This specialization of DSPs increases the performance of the processor and improves the device utilization. However, the flexibility is reduced, because the processor can no longer be used to implement applications other than those for which it was optimally designed.

3. Application-Specific Processors

Although DSPs incorporate a degree of application-specific features such as MAC operations and data-width optimization, they still follow the VN approach and therefore remain sequential machines; their performance is limited.

If a processor has to be used for only one application, which is known and fixed in advance, then the processing unit can be designed and optimized for that particular application. In this case, we say that 'the hardware adapts itself to the application'.

For example, in multimedia processing, processors are usually designed to perform the compression of video frames according to a video compression standard. Such processors cannot be used for anything other than compression. Even in compression, the standard must exactly match the one implemented in the processor.

A processor designed for only one application is called an Application-Specific Processor (ASIP).

In an ASIP, the instruction cycles (IR, D, R, EX, W) are eliminated.

The instruction set of the application is directly implemented in hardware.

Input data stream into the processor through its inputs, the processor performs the required computation, and the results can be collected at the outputs of the processor.

ASIPs are usually implemented as single chips called Application-Specific Integrated Circuits (ASICs).

Example: If Algorithm 1 has to be executed on a Von Neumann computer, then at least 3 instructions are required.


With tcycle being the instruction cycle time, the program executes in 3 ∗ 5 ∗ tcycle = 15 ∗ tcycle without pipelining. Let us now consider the implementation of the same algorithm in an ASIP. We can implement the instructions d = a + b and c = a ∗ b in parallel. The same is also true for d = b + 1 and c = a − 1, as illustrated in the figure below:

Fig: ASIP implementation of Algorithm 1

The four instructions a + b, a ∗ b, b + 1 and a − 1, as well as the comparison a < b, will be executed in parallel in a first stage. Depending on the value of the comparison a < b, the correct values of the previous stage computations will be assigned to c and d as defined in the program.

Let tmax be the longest time needed by a signal to move from one point to another in the physical implementation of the processor (this will happen on the path input-multiplier-multiplexer). tmax is also called the cycle time of the ASIP processor.

For two inputs a and b, the results c and d can be computed in time tmax. The VN processor can compete with this ASIP only if 15 ∗ tcycle < tmax, i.e. tcycle < tmax/15. The VN must be at least 15 times faster than the ASIP to be competitive.
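Algorithm 1 itself is not reproduced in these notes; from the description above, a plausible reconstruction is: if a < b then d = a + b and c = a ∗ b, else d = b + 1 and c = a − 1. The sketch below models the ASIP datapath under that assumption: all four arithmetic units and the comparator work in parallel in the first stage, and multiplexers select the outputs. The whole function corresponds to one ASIP cycle of length tmax, against 15 ∗ tcycle for the non-pipelined VN implementation.

```python
# Software model of the ASIP datapath for a reconstruction of Algorithm 1.
# Stage 1: all functional units operate in parallel on the inputs.
# Stage 2: the comparison drives two multiplexers that select c and d.
def asip_algorithm1(a, b):
    sum_ab   = a + b      # adder
    prod_ab  = a * b      # multiplier
    b_plus_1 = b + 1      # incrementer
    a_minus1 = a - 1      # decrementer
    lt       = a < b      # comparator

    d = sum_ab if lt else b_plus_1     # output multiplexer for d
    c = prod_ab if lt else a_minus1    # output multiplexer for c
    return c, d

print(asip_algorithm1(2, 5))   # a < b:  c = 10, d = 7
print(asip_algorithm1(7, 3))   # a >= b: c = 6,  d = 4
```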

ASIPs use a spatial approach to implement only one application. The functional units needed for the computation of all parts of the application must be available on the surface of the final processor. This kind of computation is called 'spatial computing'.

Once again, an ASIP that is built to perform a given computation cannot be used for tasks other than those for which it has been originally designed.

4. Reconfigurable Computing

We can identify two main criteria to characterize processors: flexibility and performance.

VN COMPUTERS

VN computers are very flexible because they are able to compute any kind of task. This is the reason why the terminology GPP (General Purpose Processor) is used for the VN machine.

They do not provide much performance, because they cannot compute in parallel. Moreover, the five steps (IR, D, R, EX, W) needed to perform one instruction become a major drawback, in particular if the same instruction has to be executed on huge sets of data.

Flexibility is possible because 'the application must always adapt to the hardware' in order to be executed.

ASIPs:

ASIPs provide high performance because they are optimized for a particular application. The instruction set required for that application can then be built into a chip.

Performance is possible because 'the hardware is always adapted to the application'.


Ideally, we would like to have the flexibility of the GPP and the performance of the ASIP in the same device. We would like to have a device able 'to adapt to the application' on the fly. We call such a hardware device a reconfigurable hardware, reconfigurable device or reconfigurable processing unit (RPU). Reconfigurable computing is defined as the study of computation using reconfigurable devices.

For a given application, at a given time, the spatial structure of the device will be modified so as to use the best computing approach to speed up that application. If a new application has to be computed, the device structure will be modified again to match the new application. Contrary to VN computers, which are programmed by a set of instructions to be executed sequentially, the structure of reconfigurable devices is changed by modifying all or part of the hardware at compile-time or at run-time, usually by downloading a so-called bitstream into the device.

Configuration is the process of changing the structure of a reconfigurable device at start-up time.

Reconfiguration is the process of changing the structure of a reconfigurable device at run-time.

5. Fields of Application

5.1 Rapid Prototyping

The development of an ASIC, which can be seen as the physical implementation of an ASIP, is a cumbersome process consisting of several steps, from the specification down to the layout of the chip and the final production. Because of the different optimization goals in development, several teams of engineers are usually involved. This increases the non-recurring engineering (NRE) cost, which can only be amortized if the final product is produced in a very large quantity.

Contrary to software development, errors discovered after production mean enormous losses, because the produced pieces become unusable.

Rapid prototyping allows a device to be tested in real hardware before the final production. In this sense, errors can be corrected without affecting the pieces already produced. A reconfigurable device is useful here, because it can be used several times to implement different versions of the final product until an error-free state is reached.

One of the concepts related to rapid prototyping is hardware emulation, in analogy to software simulation. With hardware emulation, a hardware module is tested under real operating conditions in the environment where it will be deployed later.

5.2 In-System Customization

Time to market is one of the main challenges that electronic manufacturers face today. In order to secure market segments, manufacturers must release their products as quickly as possible. In many cases, a well-working product can be released with less functionality. The manufacturer can then upgrade the product in the field to incorporate new functionalities.

A reconfigurable device provides such a capability of being upgraded in the field, simply by changing its configuration.

In-system customization can also be used to upgrade systems that are deployed in inaccessible or very difficult to access locations. One example is the Mars rover vehicle, in which some FPGAs that can be modified from Earth are used.

5.3 Multi-modal Computation

The number of electronic devices we interact with is permanently increasing. Many people carry, besides a mobile phone, other devices such as handhelds, portable mp3 players and portable video players. Besides those mobile devices, fixed devices such as navigation systems, music and video players as well as TV sets are available in cars and at home.

All those devices are equipped with electronic control units that run the desired application on the desired device. Furthermore, many of the devices are used in a time-multiplexed fashion: it is difficult to imagine someone playing mp3 songs while watching a video clip and taking a phone call.

For a group of devices used exclusively in a time-multiplexed way, a single electronic control unit can be used. Whenever a service is needed, the control unit is connected to the corresponding device at the correct location and reconfigured with the adequate configuration.

For instance, a domestic mp3 player, a domestic DVD player, a car mp3 player, a car DVD player as well as a mobile mp3 player and a mobile video player can all share the same electronic unit, if they are always used by the same person. The user just needs to remove the control unit from the domestic devices and connect it to the car device when going to work. The control unit can be removed from the car and connected to a mobile device if the user decides to go for a walk. Coming back home, the electronic control unit is removed from the mobile device and used for watching video.


5.4 Adaptive Computing Systems

Advances in computation and communication are helping the development of ubiquitous and pervasive computing. The design of ubiquitous computing systems is a cumbersome task that cannot be dealt with only at compile time. Because of the uncertainty and unpredictability of such systems, it is impossible, at compile time, to address all scenarios that can happen at run-time due to unpredictable changes in the environment.

We need computing systems that are able to adapt their behavior and structure to changing operating and environmental conditions, to time-varying optimization objectives, and to physical constraints such as changing protocols and new standards. We call such computing systems adaptive computing systems. Reconfiguration can provide a good foundation for the realization of adaptive systems, because it allows systems to quickly react to changes by adopting the optimal behavior for a given run-time scenario.

6. Reconfigurable Devices

Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives, interconnected via a configurable network. The operation of each primitive can be programmed, as well as the interconnect pattern. Computational tasks can be implemented spatially on the device, with intermediates flowing directly from the producing function to the receiving function.

Reconfigurable computing generally provides spatially-oriented processing rather than the temporally-oriented processing typical of programmable architectures such as microprocessors.

6.1 Reconfigurable Device Characteristics

The key differences between reconfigurable machines and conventional processors are:

Instruction Distribution – Rather than broadcasting a new instruction to the functional units on every cycle, instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and effectively deliver more instructions into active silicon on each cycle.

Spatial routing of intermediates – As space permits, intermediate values are routed in parallel from producing function to consuming function rather than forcing all communication to take place in time through a central resource bottleneck.

More, often finer-grained, separately programmable building blocks – Reconfigurable devices provide a large number of separately programmable building blocks, allowing a greater range of computations to occur per time step. This effect is largely enabled by the compressed instruction distribution.

Distributed deployable resources, eliminating bottlenecks – Resources such as memory, interconnect, and functional units are distributed and deployable based on need rather than being centralized in large pools. Independent, local access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central resource bottleneck.

7. Programmable, Configurable and Fixed Function Devices

Programmable – we will use the term "programmable" to refer to architectures which heavily and rapidly reuse a single piece of active circuitry for many different functions. The canonical example of a programmable device is a processor, which may perform a different instruction on its ALU on every cycle. All processors, be they microcoded, SIMD, vector, or VLIW, are included in this category.

Configurable – we use the term "configurable" to refer to architectures where the active circuitry can perform any of a number of different operations, but the function cannot be changed from cycle to cycle. FPGAs are our canonical example of a configurable device.

Fixed Function, Limited Operation Diversity, High Throughput – When the function and data granularity to be computed are well understood and fixed, and when the function can be economically implemented in space, dedicated hardware provides the most computational capacity per unit area to the application.

Variable Function, Low Diversity – If the function required is unknown or varying, but the instruction or data diversity is low, the task can be mapped directly to a reconfigurable computing device, which efficiently extracts high computational density.

Space Limited, High Entropy – If we are limited spatially and the function to be computed has a high operation and data diversity, we are forced to reuse limited active space heavily and accept limited instruction and data bandwidth. In this regime, conventional processor organizations are most effective, since they dedicate considerable space to on-chip instruction storage in order to minimize off-chip instruction traffic while executing descriptively complex tasks.


8. Key Relations

From an empirical review of conventional reconfigurable devices, we see that 80-90% of the area is dedicated to the switches and wires making up the reconfigurable interconnect. Most of the remaining area goes into configuration memory for the network. The actual logic function only accounts for a few percent of the area in a reconfigurable device.

This interconnect and configuration overhead is responsible for the roughly 100× density disadvantage which reconfigurable devices suffer relative to hardwired logic.

To a first order approximation, this gives us Relation 1.1: Arealogic ≪ Areaconfig ≪ Areanet, with relative sizes of roughly 1 : 10 : 100. It is this basic relationship which characterizes the RP design space.

Since Areaconfig ≪ Areanet, devices with a single on-chip configuration, such as most reconfigurable devices, can afford to exert fine-grained control over their operations – any savings associated with sharing configuration bits would be small compared to the network area.

Since Areaconfig ≪ Areanet, to pack the most functional diversity into a part, one can allocate multiple configurations on chip. With the order-of-magnitude relative sizes given in Relation 1.1, up to a 10× increase in the functional diversity per unit area is attainable in this manner.

However, since Areanet is only about 10× Areaconfig, if the number of configurations is large, say 100 or more, the configuration memory area will become the dominant size factor. Processors are essentially optimized into this regime, and this accounts for their roughly 10× lower performance density compared to reconfigurable devices.

Once we go to a large number of contexts, such that the total configuration memory space begins to dominate the interconnect area, fine granularity becomes costly. In this regime, wide-word operations allow us to reduce instruction area across multiple bit-processing elements. This simultaneously allows machines with wide (e.g. 32-bit) datapaths to hold thousands of configurations on chip while making them only about 10× less computationally dense than fine-grained, single-context devices.
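The relative-area argument can be made concrete with a short, purely illustrative calculation. The 1 : 10 : 100 ratio below is an assumption read off the order-of-magnitude figures above (a few percent logic, roughly 10× that for one configuration context, roughly 10× again for interconnect); the exact numbers are not from the notes:

```python
# Illustrative area model, in arbitrary units, using the rough 1 : 10 : 100
# ratio (logic : one configuration context : interconnect) suggested above.
AREA_LOGIC  = 1
AREA_CONFIG = 10      # area of one on-chip configuration context
AREA_NET    = 100     # interconnect (switches and wires)

for contexts in (1, 10, 100, 1000):
    total = AREA_LOGIC + AREA_NET + contexts * AREA_CONFIG
    diversity_per_area = contexts / total
    config_share = contexts * AREA_CONFIG / total
    print(f"{contexts:4d} contexts: diversity/area = {diversity_per_area:.3f}, "
          f"config memory = {config_share:.0%} of area")

# 1 context    -> 0.009 configurations per unit area,  9% config memory
# 10 contexts  -> 0.050 (several times more diversity per area), 50% config memory
# 100+ contexts -> configuration memory dominates the total area, which is
# the regime conventional processors are optimized into.
```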

9. General-Purpose Computing

General-purpose computing devices are specifically intended for those cases where we cannot or need not dedicate sufficient spatial resources to support an entire computational task, or where we do not know enough about the required task or tasks prior to fabrication to hardwire the functionality.

The key ideas behind general-purpose processing are:

i. Postpone binding of functionality until the device is employed, i.e. after fabrication;

ii. Exploit temporal reuse of limited functional capacity.

Delayed binding and temporal reuse work closely together and occur at many scales to provide the characteristics we now expect from general-purpose computing devices.

Market Level – Rather than dedicating a machine design to a single application or application family, the design effort may be utilized for many different applications.

System Level – Rather than dedicating an expensive machine to a single application, the machine may perform different applications at different times by running different sets of instructions.

Application Level – Rather than spending precious real estate to build a separate computational unit for each different function required, central resources may be employed to perform these functions in sequence, with an additional input, an instruction, telling it how to behave at each point in time.

Algorithm Level – Rather than fixing the algorithms which an application uses, an existing general-purpose machine can be reprogrammed with new techniques and algorithms as they are developed.

User Level – Rather than fixing the function of the machine at the supplier, the instruction stream specifies the function, allowing the end user to use the machine as best suits his needs. Machines may be used for functions which the original designers did not conceive. Further, machine behavior may be upgraded in the field without incurring any hardware or hardware handling costs.

10. General-Purpose Computing Issues

There are two key features associated with general-purpose computers which distinguish them from their specialized counterparts:

Interconnect

In general-purpose machines, the datapaths between functional units cannot be hardwired. Different tasks will require different patterns of interconnect between the functional units. Within a task, individual routines and operations may require different interconnectivity of functional units.

General-purpose machines must provide the ability to direct data flow between units. In the extreme of a single functional unit, memory locations are used to perform this routing function. As more functional units operate together on a task, spatial switching is required to move data among functional units and memory.

The flexibility and granularity of this interconnect is one of the big factors determining the yielded capacity on a given application.

Instructions

Since general-purpose devices must provide different operations over time, either within a computational task or between computational tasks, they require additional inputs, instructions, which tell the silicon how to behave at any point in time.

Each general-purpose processing element needs one instruction to tell it what operation to perform and where to find its inputs.

The handling of this additional input is one of the key distinguishing features between different kinds of general-purpose computing structures.

When the functional diversity is large and the required task throughput is low, it is not efficient to build up the entire application dataflow spatially in the device. Rather, we can realize applications, or collections of applications, by sharing and reusing limited hardware resources in time (see Figure 2.1) and only replicating the less expensive memory for instruction and intermediate data storage.

11. Regular and Irregular Computing Tasks

Computing tasks can be classified informally by their regularity:

Regular tasks perform the same sequence of operations repeatedly. Regular tasks have few data-dependent conditionals, such that all data processed requires essentially the same sequence of operations with a highly predictable flow of execution. Nested loops with static bounds and minimal conditionals are the typical example of regular computational tasks.

Irregular computing tasks, in contrast, perform highly data-dependent operations. The operations required vary widely according to the data processed, and the flow of operation control is difficult to predict.

12. Metrics: Density, Diversity, and Capacity

The goal of general-purpose computing devices is to provide a common computational substrate which can be used to perform any particular task. Metrics are tools to characterize the amount of computation provided by a given general-purpose structure.

For example, if we were simply comparing tasks in terms of a fixed processor instruction set, we might measure this computational work in terms of instruction evaluations. If we were comparing tasks to be implemented in a gate array, we might compare the number of gates required to implement the task and the number of time units required to complete it. Here, we want to consider the portion of the computational work done for the task on each cycle of execution of diverse, general-purpose computing devices of a certain size.

Functional Density – Functional Capacity

Functional capacity is a space-time metric which tells us how much computational work a structure can do per unit time. Correspondingly, our "gate-evaluation" metric is a unit of space-time capacity. That is, if a device can provide a capacity of Dcap gate evaluations per second, optimally, to the application, a task requiring Nge gate evaluations can be completed in time:

Ttask = Nge / Dcap


In practice, limitations on the way the device's capacity may be employed will often cause the task to take longer, resulting in a lower yielded capacity. If the task takes time TTask_actual to perform NTask_ge gate evaluations, then the yielded capacity is:

Dyielded = NTask_ge / TTask_actual

The capacity which a particular structure can provide generally increases with area. To understand the relative benefits of each computing structure or architecture independent of the area in a particular implementation, we can look at the capacity provided per unit area. We normalize area in units of λ, half the minimum feature size, to make the results independent of the particular process feature size.

Our metric for functional density, then, is capacity per unit area, measured in terms of gate evaluations per unit space-time, i.e. in units of gate-evaluations/(λ² · s). The general expression for computing functional density, given an operational cycle time tcycle for a unit of silicon of size A evaluating Nge gate evaluations per cycle, is:

Dfd = Nge / (A ∗ tcycle)
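A small worked example of the capacity and density expressions above; all numerical values (Nge, Dcap, A, tcycle and so on) are made up for illustration and are not taken from the notes:

```python
# Worked example of the capacity and density metrics; all numbers are illustrative.
N_ge    = 1_000_000        # gate evaluations required by a task
D_cap   = 2_000_000        # device capacity in gate evaluations per second
T_task  = N_ge / D_cap     # ideal completion time: 0.5 s

T_actual  = 0.8                     # measured time, longer due to mapping limits
D_yielded = N_ge / T_actual         # yielded capacity: 1.25e6 gate evaluations/s

A        = 5.0e8           # silicon area in lambda^2
t_cycle  = 1.0e-8          # operational cycle time in seconds (100 MHz)
N_per_cy = 10_000          # gate evaluations performed per cycle
D_fd     = N_per_cy / (A * t_cycle)   # functional density in gate-evals/(lambda^2 * s)

print(T_task, D_yielded, D_fd)        # 0.5  1250000.0  2000.0
```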

Concluding Remarks on Functional Density:

This density metric tells us the relative merits of dedicating a portion of our silicon budget to a particular computing architecture. The density focus makes the implicit assumption that capacity can be composed and scaled to provide the aggregate computational throughput required for a task. To the extent this is true, architectures with the highest capacity density that can actually handle a problem should require the least size and cost.

Remember also that resource capacity is primarily interesting when the resource is limited. We look at computational capacity to the extent that we are compute limited. When we are I/O limited, performance may be controlled much more by I/O bandwidth.

Functional Diversity – Instruction Density

Functional density alone only tells us how much raw throughput we get. It says nothing about how many different functions can be performed and on what time scale. For this reason, it is also interesting to look at the functional diversity or instruction density. With Ninstr distinct instructions resident in an area A, we thus define functional diversity as:

Dfdiv = Ninstr / A

Concluding Remarks on Functional Diversity:

We use functional diversity to indicate how many distinct function descriptions are resident per unit of area. This tells us how many different operations can be performed within part or all of an IC without going outside of the region or chip for additional instructions.

Data Density

Space must also be allocated to hold data for the computation. This area competes with logic, interconnect, and instructions for silicon area. Thus, we will also look at data density when examining the capacity of an architecture. If we put Nbits of data for an application into a space A, we get a data density:

Ddata = Nbits / A


13. Fine-Grained and Coarse-Grained Reconfigurable Architectures

Reconfigurable computing devices are divided into two groups according to the level of hardware abstraction, or granularity, used:

i. Fine-Grained Reconfigurable Architectures

ii. Coarse-Grained Reconfigurable Architectures

Fine-grained reconfigurable devices

These are hardware devices whose functionality can be modified at a very low level of granularity. For instance, the device can be modified so as to add or remove a single inverter or a single two-input NAND gate.

Fine-grained reconfigurable devices are mostly represented by programmable logic devices (PLDs). These include programmable logic arrays (PLAs), programmable array logics (PALs), complex programmable logic devices (CPLDs) and field programmable gate arrays (FPGAs).

On fine-grained reconfigurable devices, the number of gates is often used as the unit of measure for the capacity of an RPU.

Example: FPGA

Issues:

Bit-level operation => many processing units => large routing overhead => low silicon area efficiency => higher power consumption

High volume of configuration data needed => large configuration memory => more power dissipation and long configuration time

Coarse-grained reconfigurable devices

Also called pseudo-reconfigurable devices.

These devices usually consist of a set of coarse-grained elements such as ALUs that allow for the efficient implementation of a dataflow function.

Hardware abstraction is based on a set of fixed blocks (like functional units, processor cores and memory tiles); reconfiguration is merely the reprogramming of the interconnections.

Coarse-grained reconfigurable devices do not have any established measure of comparison. However, the number of PEs on a device as well as their characteristics, such as granularity, the amount of memory on the device and the variety of peripherals, may provide an indication of the quality of the device.

Coarse-grained architectures overcome the disadvantages of fine-grained architectures:

They provide multiple-bit wide datapaths and complex operators.

They avoid routing overhead.

They need fewer configuration lines -> lower area use.