Clockless Chips

Seminar on Clockless Chips

Date: October 25, 2005.

Presented by: K. Subrahmanya Sreshti (05IT6004)

ABSTRACT

The clockless approach, which uses a technique known as asynchronous logic, differs from conventional computer circuit design in that the switching on and off of digital circuits is controlled individually by specific pieces of data rather than by a tyrannical clock that forces all of the millions of circuits on a chip to march in unison. It addresses the main disadvantages of clocked circuits, such as limited speed, high power consumption, and high electromagnetic noise. For these reasons, clockless technology is considered the technology that will drive the majority of electronic chips in the coming years.

Introduction

Over the years, the designers of microprocessors have resorted to all sorts of

tricks to make their products run faster. Modern chips, for example, queue up several

instructions in a “pipeline” and analyze them to see if switching the order in which they

are executed can produce the correct result, only more quickly.

After a point, cranking up the clock speed becomes an exercise in diminishing

returns. That's why a one-gigahertz chip doesn't run twice as fast as a 500-megahertz

chip. The clock, through the work it must do to coordinate millions of transistors on a

chip, generates its own overhead. The faster the clock, the greater the overhead becomes.

The clock in a state-of-the-art microprocessor can consume up to 30 percent of the chip's

computing capability, with that percentage increasing at an ever faster rate as clock

speeds increase.

Faced with diminishing returns, however, chip designers are dusting down two

technologies—called multi-threading and asynchronous logic—that were both invented

decades ago. At the time, neither was competitive with conventional designs, but

important uses have since emerged for each of them. Multi-threading can increase the

performance of database- and web-servers, while asynchronous logic is ideal for wireless

devices and smart cards.

Problems with Synchronous Approach

The synchronous approach predominated, largely because it is easier to design

chips in which things happen only when the clock ticks.

As chips get bigger, faster and more complicated, distributing the clock signal

around the chip becomes harder. Another drawback with clocked designs is that they

waste a lot of energy, since even inactive parts of the chip have to respond to every clock

tick. Clocked chips also produce electromagnetic emissions at their clock frequency,

which can cause radio interference.

Each tick must be long enough for signals to traverse even a chip’s longest wires

in one cycle. However, the tasks performed on parts of a chip that are close together

finish well before a cycle but can’t move on until the next tick. As chips get bigger and

more complex, it becomes more difficult for ticks to reach all elements, particularly as

clocks get faster.

In today's chips, the clock remains the key part of the action. As a microprocessor

performs a given operation, electronic signals travel along microscopic strips of metal, forking, intersecting again, and encountering logic gates, until they finally deposit the results of the computation in a temporary memory bank called a register. Let's say you want to

multiply 4 by 6. If you could slow down the chip and peek into the register as this

calculation was being completed, you might see the value changing many times, say,

from 4 to 12 to 8, before finally settling down into the correct answer. That's because the

signals transmitted to perform the operation travel along many different paths before

arriving at the register; only after all signals have completed their journey is the correct

value assured. The role of the clock is to guarantee that the answer will be ready at a

given time. The chip is designed so that even the slowest path through the circuit (the path with the longest wires and the most gates) is guaranteed to reach the register within a single clock tick.
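The timing rule described above can be sketched in a few lines. The path names, the delay figures, and the 60 ps margin are invented for illustration; only the rule itself (the clock period must cover the slowest path) comes from the text.

```python
# Hypothetical path delays in picoseconds; the figures are illustrative only.
PATH_DELAYS_PS = {
    "short_local_path": 120,
    "typical_path": 310,
    "longest_wires_most_gates": 940,
}


def min_clock_period_ps(path_delays, margin_ps=60):
    """Smallest safe clock period: the slowest path plus an assumed safety margin."""
    return max(path_delays.values()) + margin_ps


def idle_time_ps(path_delays, period_ps):
    """Time each path sits idle after settling, waiting for the next tick."""
    return {name: period_ps - delay for name, delay in path_delays.items()}


period = min_clock_period_ps(PATH_DELAYS_PS)
idle = idle_time_ps(PATH_DELAYS_PS, period)
print(period)  # 1000
print(idle)    # the short path waits 880 ps of every 1000 ps cycle
```

Under these assumed numbers, the short path finishes its work in 120 ps but cannot hand on its result until the full 1000 ps period has elapsed.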

The chip’s clock is an oscillating crystal that vibrates at a regular frequency,

depending on the voltage applied. This frequency is measured in gigahertz or megahertz.

All the chip’s work is synchronized via the clock, which sends its signals out along all

circuits and controls the registers, the data flow, and the order in which the processor

performs the necessary tasks. An advantage of synchronous chips is that the order in

which signals arrive doesn’t matter. Signals can arrive at different times, but the register

waits until the next clock tick before capturing them. As long as they all arrive before the

next tick, the system can process them in the proper order. Designers thus don’t have to

worry about related issues, such as wire lengths, when working on chips. And it is easier

to determine the maximum performance of a clocked system. With these systems,

calculating performance simply involves counting the number of clock cycles needed to

complete an operation.

Calculating performance is less well defined with asynchronous designs.

The clocks themselves consume power and produce heat. In addition, in

synchronous designs, registers use energy to switch so that they are ready to receive new

data whenever the clock ticks, whether they have inputs to process or not. In

asynchronous designs, gates switch only when they have inputs.

The job of coordinating tens of millions of transistors at a billion ticks per second

requires the consumption of a lot of energy, most of which ends up as heat. Patrick Gelsinger, then chief technology officer at Intel, referred to the problem in his keynote speech at the International Solid-State Circuits Conference in February 2001. Gelsinger was only half-joking when he said that if microprocessors continued to be run by ever-faster clocks, then by 2005 a chip would run as hot as a nuclear reactor.

Throwing out the clock means abandoning the fundamental way that chips have organized and executed their work. Within every one-gigahertz microprocessor lies an oscillating crystal ticking one billion times a second, and engineers are trained to design chips in which the first consideration is getting work done before the next clock tick comes around. For most chip designers, throwing out the clock is difficult to imagine.

The clock establishes a timing constraint within which all chip elements must

work, and constraints can make design easier by reducing the number of potential

decisions.

Asynchronous logic circuits (Stop the clocks)

As its name suggests, asynchronous logic does away with the cardinal rule of chip design: that

everything marches to the beat of an oscillating crystal “clock”. For a 1GHz chip, this

clock ticks one billion times a second, and all of the chip’s processing units co-ordinate

their actions with these ticks to ensure that they remain in step. Asynchronous, or

“clockless”, designs, in contrast, allow different bits of a chip to work at different speeds,

sending data to and from each other as and when appropriate.

Clockless processors, also called asynchronous or self-timed, don’t use the

oscillating crystal that serves as the regularly “ticking” clock that paces the work done by

traditional synchronous processors. Rather than waiting for a clock tick, clockless-chip

elements hand off the results of their work as soon as they are finished.

Figure 1.

How clockless chips work

There are no purely asynchronous chips yet. Instead, today’s clockless processors

are actually clocked processors with asynchronous elements. Clockless elements use

perfect clock gating, in which circuits operate only when they have work to do, not

whenever a clock ticks. Instead of clock-based synchronization, local handshaking

controls the passing of data between logic modules. The asynchronous processor places

the location of the stored data it wants to read onto the address bus and issues a request

for the information. The memory reads the address off the bus, finds the information, and

places it on the data bus. The memory then acknowledges that it has read the data.

Finally, the processor grabs the information from the data bus.
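The request/acknowledge exchange described above can be modelled as a small software sketch. The class and field names (`Bus`, `Memory`, `processor_read`) are hypothetical, and the final step that returns both handshake wires to idle assumes a four-phase style of handshake, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bus:
    """Wires shared by processor and memory (field names are illustrative)."""
    address: Optional[int] = None
    data: Optional[int] = None
    request: bool = False
    acknowledge: bool = False
    log: List[str] = field(default_factory=list)


class Memory:
    def __init__(self, contents):
        self.contents = contents

    def respond(self, bus):
        # The memory does nothing at all unless a request is pending.
        if bus.request and not bus.acknowledge:
            bus.data = self.contents[bus.address]
            bus.acknowledge = True
            bus.log.append("memory: data on bus, acknowledge raised")


def processor_read(bus, memory, address):
    # 1. Place the address on the bus and raise the request.
    bus.address = address
    bus.request = True
    bus.log.append(f"processor: address {address} on bus, request raised")
    # 2. Wait for the acknowledge; no clock decides how long this takes.
    while not bus.acknowledge:
        memory.respond(bus)
    # 3. Take the data, then release both wires (assumed four-phase style).
    value = bus.data
    bus.request = False
    bus.acknowledge = False
    bus.log.append("processor: data taken, handshake wires released")
    return value


bus = Bus()
memory = Memory({0x10: 42})
value = processor_read(bus, memory, 0x10)
print(value)  # 42
for entry in bus.log:
    print(entry)
```

The point of the sketch is that each side acts only in response to the other side's signal, so nothing switches while no read is in progress.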

According to Jorgenson, “Data arrives at any rate and leaves at any rate. When

the arrival rate exceeds the departure rate, the circuit stalls the input until the output

catches up.”

The many handshakes themselves require more power than a clock’s operations.

However, clockless systems more than offset this because, unlike synchronous chips,

each circuit uses power only when it performs work.

Clockless advantages

In synchronous designs, the data moves on every clock edge, causing voltage

spikes. In clockless chips, data doesn’t all move at the same time, which spreads out

current flow, thereby minimizing the strength and frequency of spikes and emitting less

EMI. Less EMI reduces both noise-related errors within circuits and interference with

nearby devices.
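A toy model makes the difference in current spikes concrete. It assumes that supply current in a time slot is proportional to the number of gates switching in that slot, and that clockless switching times are spread uniformly at random; both are simplifications chosen for illustration, not measurements.

```python
import random


def current_profile(switch_times, n_slots):
    """Count gates switching per time slot (a stand-in for supply current)."""
    profile = [0] * n_slots
    for t in switch_times:
        profile[t] += 1
    return profile


N_GATES, N_SLOTS = 1000, 20
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Clocked: every gate switches in the slot right after the clock edge.
clocked = current_profile([0] * N_GATES, N_SLOTS)
# Clockless: gates switch whenever their data arrives (modelled as uniform random).
clockless = current_profile(
    [rng.randrange(N_SLOTS) for _ in range(N_GATES)], N_SLOTS
)

print(max(clocked))    # 1000: one large spike per cycle
print(max(clockless))  # a much smaller peak, since activity is spread out
```

The total amount of switching is the same in both cases; only its distribution over time differs, which is what lowers the peak.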

Power efficiency, responsiveness, and robustness

Because asynchronous chips have no clock and each circuit powers up only when

used, asynchronous processors use less energy than synchronous chips by providing only

the voltage necessary for a particular operation.

According to Jorgenson, clockless chips are particularly energy-efficient for

running video, audio, and other streaming applications — data-intensive programs that

frequently cause synchronous processors to use considerable power. Streaming data

applications have frequent periods of dead time — such as when there is no sound or

when video frames change very little from their immediate predecessors — and little

need for running error-correction logic. During this inactive time, asynchronous

processors don’t use much power. Clockless processors activate only the circuits needed to handle data, so unused circuits remain ready to respond quickly to other demands.

Asynchronous chips run cooler and have fewer and lower voltage spikes. Therefore, they

are less likely to experience temperature-related problems and are more robust. Because

they use handshaking, clockless chips give data time to arrive and stabilize before circuits

pass it on. This contributes to reliability because it avoids the rushed data handling that

central clocks sometimes necessitate, according to University of Manchester Professor

Steve Furber, who runs the Amulet project.

Simple, efficient design

Logic modules could be developed without regard to compatibility with a central

clock frequency, which makes the design process easier. Also, because asynchronous

processors don’t need specially designed modules that all work at the same clock

frequency, they can use standard components. This enables simpler, faster design and

assembly.

The recent use of both domino logic and the delay-insensitive mode in asynchronous processors has created a fast approach known as integrated pipelines mode. Domino logic improves performance because a system can evaluate several lines of data at a time in one cycle, as opposed to the typical approach of handling one line in each

cycle. Domino logic is also efficient because it acts only on data that has changed during

processing, rather than acting on all data throughout the process. The delay-insensitive

mode allows an arbitrary time delay for logic blocks. “Registers communicate at their

fastest common speed. If one block is slow, the blocks that it communicates with slow

down,” said Jorgenson. This gives a system time to handle and validate data before

passing it along, thereby reducing errors.

Advantages of the Clockless chips

A clocked chip can run no faster than its most slothful piece of logic; the answer

isn't guaranteed until every part completes its work. By contrast, the transistors on an

asynchronous chip can swap information independently, without needing to wait for

everything else. The result? Instead of the entire chip running at the speed of its slowest

components, it can run at the average speed of all components. At both Intel and Sun, this

approach has led to prototype chips that run two to three times faster than comparable

products using conventional circuitry.

Clockless chips draw power only when there is useful work to do, enabling a huge

savings in battery-driven devices; an asynchronous-chip-based pager marketed by Philips

Electronics, for example, runs almost twice as long as competitors' products, which use

conventional clocked chips.

Asynchronous chips use 10 percent to 50 percent less energy than synchronous

chips, in which the clocks are constantly drawing power. That makes them ideal for

mobile communications applications - which usually need low power sources - and the

chips' quiet nature also makes them more secure, as typical hacking techniques involve

listening to clock ticks.

Another advantage of clockless chips is that they give off very low levels of

electromagnetic noise. The faster the clock, the more difficult it is to prevent a device

from interfering with other devices; dispensing with the clock all but eliminates this

problem. The combination of low noise and low power consumption makes asynchronous

chips a natural choice for mobile devices. "The low-hanging fruit for clockless chips will

be in communications devices," starting with cell phones.

Asynchronous logic would offer better security than conventional chips:

"The clock is like a big signal that says, Okay, look now," says Fant. "It's like looking for

someone in a marching band. Asynchronous is more like a milling crowd. There's no

clear signal to watch. Potential hackers don't know where to begin."

Analyzing the power consumption for each clock tick can crack the encryption on

existing smart cards. This allows details of the chip’s inner workings to be deduced. Such

an attack would be far more difficult on a smartcard based on asynchronous logic.

Asynchronous circuits can perform encryption in a way that is harder to identify and to crack.

Improved encryption makes asynchronous circuits an obvious choice for smart cards—

the chip-endowed plastic cards beginning to be used for such security-sensitive

applications as storage of medical records, electronic funds exchange and personal

identification.

Ivan Sutherland of Sun Microsystems, who is regarded as the guru of the field,

believes that such chips will have twice the power of conventional designs, which will

make them ideal for use in high-performance computers. But Dr Furber suggests that the

most promising application for asynchronous chips may be in mobile wireless devices

and smart cards.

Different styles:

There are several styles of asynchronous design. Conventional chips represent the

zeroes and ones of binary digits (“bits”) using low and high voltages on a particular wire.

One clockless approach, called “dual rail”, uses two wires for each bit. Sudden

voltage changes on one of the wires represent a zero, and on the other wire a one.

"Dual-rail" circuits use two wires giving the chip communications pathways, not

only to send bits, but also to send "handshake" signals to indicate when work has been

completed. Fant replaces the conventional system of digital logic with what he calls "null convention logic," a scheme that identifies not only "yes" and "no," but also "no answer yet", a convenient way for clockless chips to recognize when an operation has not yet been completed.

Another approach is called “bundled data”. Low and high voltages on 32 wires

are used to represent 32 bits, and a change in voltage on a 33rd wire indicates when the

values on the other 32 wires are to be used.
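The two signalling styles can be sketched in software as follows. This is a simplified, level-based model: the particular code words chosen for NULL, zero, and one, and all function names, are assumptions made for illustration rather than details taken from any specific chip.

```python
# Dual rail: two wires per bit, written as (zero_rail, one_rail).
# The assignment of code words is an assumption for this sketch.
NULL, ZERO, ONE = (0, 0), (1, 0), (0, 1)


def dual_rail_encode(bit):
    return ONE if bit else ZERO


def dual_rail_decode(pair):
    """Return 0 or 1, or None while the pair still says 'no answer yet'."""
    if pair == NULL:
        return None
    if pair == ZERO:
        return 0
    if pair == ONE:
        return 1
    raise ValueError("both rails high is not a valid code word")


def word_complete(pairs):
    """A dual-rail word is ready once every bit has left the NULL state."""
    return all(dual_rail_decode(p) is not None for p in pairs)


# Bundled data: 32 ordinary data wires plus one extra wire whose change
# in level tells the receiver that the data wires may now be used.
def bundled_send(value, request_level, width=32):
    wires = [(value >> i) & 1 for i in range(width)]
    return wires, 1 - request_level  # toggle the 33rd wire


def bundled_receive(wires, request_level, last_seen_level):
    if request_level == last_seen_level:
        return None  # no change on the 33rd wire: data not yet valid
    return sum(bit << i for i, bit in enumerate(wires))


print(word_complete([ONE, NULL, ZERO]))  # False
print(word_complete([ONE, ONE, ZERO]))   # True
wires, level = bundled_send(0xDEADBEEF, request_level=0)
print(hex(bundled_receive(wires, level, last_seen_level=0)))  # 0xdeadbeef
```

Dual rail pays for its built-in "no answer yet" state with twice the wires, whereas bundled data adds only one wire but relies on that wire changing after the data wires have settled.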

Applications of Clockless Chips (in more technical detail)

1. High performance.

2. Low power dissipation.

3. Low noise and low electro-magnetic emission.

4. A good match with heterogeneous system timing.

1. Asynchronous for High Performance

In an asynchronous circuit the next computation step can start immediately after

the previous step has completed: there is no need to wait for a transition of the clock

signal. This leads, potentially, to a fundamental performance advantage for asynchronous

circuits, an advantage that increases with the variability in delays associated with these

computation steps. However, part of this advantage is canceled by the overhead required

to detect the completion of a step. Furthermore, it may be difficult to translate local

timing variability into a global system performance advantage.

Data-dependent delays

The delay of the combinational logic circuit shown in Figure 1 depends on the current

state and the value of the primary inputs. The worst-case delay, plus some margin for

flip-flop delays and clock skew, is then a lower bound for the clock period of a

synchronous circuit. Thus, the actual delay is always less (and sometimes much less) than

the clock period.

A simple example is an N-bit ripple-carry adder (Figure 2). The worst-case delay occurs when 1 is added to $2^N - 1$. Then the carry ripples from $FA_1$ to $FA_N$. In the best case there is no carry ripple at all, as, for example, when adding 1 to 0. Assuming random inputs, the average length of the longest carry-propagation chain is bounded by $\log_2 N$. For a 32-bit wide ripple-carry adder the average length is therefore about 5, but the clock period must cover all 32 stages, roughly 6 times longer! On the other hand, the average length determines the average

case delay of an asynchronous ripple-carry adder, which we consider next. In an

asynchronous circuit this variation in delays can be exploited by detecting the actual

completion of the addition. Most practical solutions use dual-rail encoding of the carry

signal (Figure 2(b)); the addition has completed when all internal carry-signals have been

computed, that is, when each pair $(cf_i, ct_i)$ has made a monotonic transition from $(0, 0)$ to $(0, 1)$ (carry = false) or to $(1, 0)$ (carry = true). Dual-rail encoding of the carry signal

has also been applied to a carry bypass adder. When inputs and outputs are dual-rail

encoded as well, the completion can be observed from the outputs of the adder.
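The gap between worst-case and average carry rippling can be checked with a short simulation. The function name `longest_carry_chain` is hypothetical, and the counting convention (the stage that generates a carry counts as length 1) is an assumption; other conventions shift the numbers slightly.

```python
import random


def longest_carry_chain(a, b, n_bits):
    """Length of the longest run of stages that a single carry ripples through."""
    longest = current = 0
    for i in range(n_bits):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        if a_i & b_i:
            current = 1  # generate: a new carry starts at this stage
        elif a_i ^ b_i:
            # propagate: an incoming carry, if there is one, moves on
            current = current + 1 if current else 0
        else:
            current = 0  # kill: any incoming carry stops here
        longest = max(longest, current)
    return longest


N = 32
worst = longest_carry_chain(1, 2**N - 1, N)  # carry ripples through every stage
best = longest_carry_chain(1, 0, N)          # no carry at all

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = 10_000
average = sum(
    longest_carry_chain(rng.getrandbits(N), rng.getrandbits(N), N)
    for _ in range(samples)
) / samples

print(worst, best)  # 32 0
print(average)      # a small number, in line with the log2(N) = 5 bound
```

A clocked adder must budget for the 32-stage case on every addition, while a self-timed adder that detects completion would, on random inputs, typically finish after only a handful of stages.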

Elastic pipelines

In general it is not easy to translate a local asynchronous advantage in average-

case performance into a system-level performance advantage. Today's synchronous

circuits are heavily pipelined and retimed. Critical paths are nicely balanced and little

room is left to obtain an asynchronous benefit. Moreover, an asynchronous benefit of this

kind must be balanced against a possible overhead in completion signaling and

asynchronous control.

In an asynchronous pipeline, the controller of each stage communicates exclusively with the controllers of the immediately

preceding and succeeding stages by means of handshake signaling, and controls the state

of the data latches (transparent or opaque). Between the request and the next

acknowledge phase the corresponding data wires must be kept stable.
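The neighbour-to-neighbour handshaking described above can be approximated in software. In this sketch each stage holds at most one item and passes it on only when the next stage is empty; the function name `step` and the list representation are assumptions, and real latch controllers are considerably more subtle.

```python
def step(stages, new_item=None, consumer_ready=True):
    """Run one round of handshakes; returns (output, accepted)."""
    output = None
    # Start at the output end so each stage can see whether its successor is free.
    if consumer_ready and stages[-1] is not None:
        output, stages[-1] = stages[-1], None
    for i in range(len(stages) - 2, -1, -1):
        if stages[i] is not None and stages[i + 1] is None:
            stages[i + 1], stages[i] = stages[i], None
    # The input is accepted only if the first stage has room; otherwise it stalls.
    accepted = False
    if new_item is not None and stages[0] is None:
        stages[0] = new_item
        accepted = True
    return output, accepted


stages = [None, None, None]
for item in ["a", "b", "c"]:
    step(stages, item, consumer_ready=False)
print(stages)                                   # ['c', 'b', 'a']
print(step(stages, "d", consumer_ready=False))  # (None, False)
print(step(stages, "d", consumer_ready=True))   # ('a', True)
print(stages)                                   # ['d', 'c', 'b']
```

When the consumer stops taking data the pipeline fills up and the input stalls, and as soon as the consumer is ready again everything shifts forward, without any global clock being involved.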

2. Asynchronous for Low Power

Dissipating when and where active

The classic example of a low-power asynchronous circuit is a frequency divider. A D-flip-flop with its inverted output fed back to its input divides an incoming (clock) frequency by two (Figure 4(a)). A cascade of N such divide-by-two elements (Figure 4(b)) divides the incoming frequency by $2^N$.

The second element runs at only half the rate of the first one and hence dissipates only

half the power; the third one dissipates only a quarter, and so on. Hence, the entire

asynchronous cascade consumes, over a given period of time, slightly less than twice the

power of its head element, independent of N. That is, fixed power dissipation is obtained.

In contrast, a similar synchronous divider would dissipate in proportion to N. A

cascade of 15 such divide-by-two elements is used in watches to convert a 32 kHz crystal

clock down to a 1 Hz clock. The potential of asynchronous for low power depends on the

application.
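The divider arithmetic above can be verified directly. The sketch assumes that each stage's power is proportional to its toggle rate and that every stage of the synchronous version is clocked at the full input rate; power is expressed relative to the head element.

```python
def async_cascade_power(n_stages, head_power=1.0):
    """Stage k toggles at 1/2**k of the input rate, so its power scales likewise."""
    return sum(head_power / 2**k for k in range(n_stages))


def sync_divider_power(n_stages, head_power=1.0):
    """Assumption: every stage is clocked at the full input rate."""
    return n_stages * head_power


print(async_cascade_power(15))  # 1.99993896484375 (just under twice the head element)
print(sync_divider_power(15))   # 15.0
# A nominal 32 kHz watch crystal runs at 32768 Hz, which is 2**15 Hz.
print(32768 / 2**15)            # 1.0
```

The asynchronous total is a geometric series that never reaches 2 however many stages are added, whereas the synchronous total keeps growing with N.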

For example, in a digital filter where the clock rate equals the data rate, all flip-

flops and all combinational circuits are active during each clock cycle. Then little or

nothing can be gained by implementing the filter as an asynchronous circuit. However, in

many digital-signal processing functions the clock rate exceeds the data (signal) rate by a

large factor, sometimes by several orders of magnitude. In such circuits, only a small

fraction of registers change state during a clock cycle. Furthermore, this fraction may be

highly data dependent. The clock frequency is chosen that high to accommodate sequential algorithms that share resources over subsequent computation steps. In such circuits, one benefit of asynchronous operation is vastly improved electrical efficiency, which leads directly to prolonged battery life.

One application for which asynchronous circuits can save power is Reed-Solomon

error correctors operating at audio rates, as demonstrated at Philips Research

Laboratories. Two different asynchronous realizations of this decoder (single-rail and

dual-rail) were compared with a synchronous (product) version. The single-rail version was clearly superior and consumed one-fifth of the power of the synchronous version.

A second example is the infrared communications receiver IC designed at

Hewlett-Packard/Stanford. The receiver IC draws only leakage current while waiting for

incoming data, but can start up as soon as a signal arrives so that it loses no data. Also,

most modules operate well below the maximum frequency of operation.

The filter bank for a digital hearing aid was the subject of another successful

demonstration, this time by the Technical University of Denmark in cooperation with

Oticon Inc. They re-implemented an existing filter bank as a fully asynchronous circuit.

The result was a factor-of-five reduction in power consumption.

A fourth application is a pager in which several power-hungry subcircuits were redesigned as asynchronous circuits.

3. Asynchronous for Low Noise and Low Emission

Subcircuits of a system may interact in unintended and often subtle ways. For example, a digital subcircuit generates voltage noise on the power-supply lines or induces currents in the silicon substrate. This noise may affect the performance of an analog-to-digital converter that draws power from the same source or is integrated on the same substrate. Another example is that of a digital subcircuit that emits electromagnetic radiation at its clock frequency (and the higher harmonic frequencies), and a radio receiver subcircuit that mistakes this radiation for a radio signal.

Due to the absence of a clock, asynchronous circuits may have better noise and

EMC (Electro-Magnetic Compatibility) properties than synchronous circuits. This

advantage can be appreciated by analyzing the supply current of a clocked circuit in both

the time and frequency domains.

Circuit activity of a clocked circuit is usually maximal shortly after the productive

clock edge. It gradually fades away and the circuit must become totally quiescent before

the next productive clock edge. Viewed differently, the clock signal modulates the supply

current as depicted schematically in Figure 5(a). Due to parasitic resistance and

inductance in the on-chip and off-chip supply wiring this causes noise on the on-chip

power and ground lines.

4. Heterogeneous Timing

There are two on-going trends that affect the timing of a system-on-a-chip: the

relative increase of interconnect delays versus gate delays and the rapid growth of

design reuse. Their combined effect results in an increasingly heterogeneous organization

of system-on-a-chip timing. According to Figure 7, gate delays rapidly decrease with

each technology generation. By contrast, the delay of a piece of interconnect of fixed

modest length increases, soon leading to a dominance of interconnect delay over gate

delay. The introduction of additional interconnect layers and new materials (copper and

low dielectric constant insulators) may slow down this trend somewhat. Nevertheless,

new circuits and architectures are required to circumvent these parasitic limitations. For

example, across-chip communication may no longer fit within a single clock period of a

processor core.

Heterogeneous system timing will pose considerable design challenges for system-level interconnect, including buses, FIFOs, switch matrices, routers, and multi-port

memories. Asynchrony makes it easier to deal with interconnecting a variety of different

clock frequencies, without worrying about synchronization problems, differences in clock

phases and frequencies, and clock skew. Hence, new opportunities will arise for

asynchronous interconnect structures and protocols. Once asynchronous on-chip

interconnect structures are accepted, the threshold to introduce asynchronous clients to

these interconnects is lowered as well. Also, mixed synchronous-asynchronous circuits

hold promise.

Clockless challenges

Asynchronous chips face a couple of important challenges.

Integrating clockless and clocked solutions

In today’s clockless chips, asynchronous and synchronous circuitry must

interface. Unlike synchronous processors, asynchronous chips don’t complete

instructions at times set by a clock. This variability can cause problems interfacing with

synchronous systems, particularly with their memory and bus systems. Clocked

components require that data bits be valid and arrive by each clock tick, whereas

asynchronous components allow validation and arrival to occur at their own pace. This

requires special circuits to align the asynchronous information with the synchronous

system’s clock.
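The alignment problem can be illustrated with a deliberately simple timing model. It assumes integer nanosecond timestamps and one extra clock tick spent in the aligning circuit; the function names and the extra-tick figure are assumptions for illustration, and the text does not say how real interface circuits are built.

```python
def next_tick_ns(arrival_ns, period_ns):
    """First clock tick at or after an asynchronous arrival (integer nanoseconds)."""
    return -(-arrival_ns // period_ns) * period_ns  # ceiling division


def align_to_clock(arrivals_ns, period_ns, extra_ticks=1):
    """Tick at which the clocked side may use each item.

    extra_ticks models time spent in the aligning circuit; the value 1 is an
    assumption for this sketch, not a figure from the text.
    """
    return [next_tick_ns(t, period_ns) + extra_ticks * period_ns for t in arrivals_ns]


arrivals = [3, 10, 17]  # asynchronous results finish whenever they finish
usable = align_to_clock(arrivals, period_ns=10)
print(usable)  # [20, 20, 30]
```

Each asynchronous result is held until a clock tick, so part of the time saved by finishing early is given back at the boundary with the clocked system.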

Lack of tools and expertise

Because most chips use synchronous technology, there is a shortage of expertise, as well as coding and design tools, for clockless processors. Not only is there little opportunity for developers to gain experience with clockless chips, but colleges also offer fewer asynchronous design courses.

References:

1) Scanning the Technology: Applications of Asynchronous Circuits – C. H. (Kees)

van Berkel, Mark B. Josephs, and Steven M. Nowick

2) http://ieeexplore.ieee.org/iel5/2/30617/01413111.pdf (October 2001)

3) http://csdl2.computer.org/comp/mags/dt/2003/06/d6005.pdf

4) http://www1.cs.columbia.edu/async/misc/technologyreview_oct_01_2001.html

5) http://www.technologyreview.com/articles/01/10/tristram1001.asp

6) http://www1.cs.columbia.edu/async/misc/economist/Economist_com.htm
