A dynamic clock synchronization technique for large systems



350 IEEE TRANSACTIONS ON COMPONENTS, PACKAGING, AND MANUFACTURING TECHNOLOGY-PART B, VOL. 17, NO. 3, AUGUST 1994

A Dynamic Clock Synchronization Technique for Large Systems

D. E. Brueske and S. H. K. Embabi, Member, IEEE

Abstract: This paper reports on a circuit technique which can be used to reduce the clock skew in ULSI and WSI systems. It is suitable for large systems which are divided into isochronic modules with locally optimized clock distribution. The inter-module clock distribution uses tunable delay elements to compensate for the differences in the phase delay of the individual modules. The delay elements introduce a phase shift to the clock signals going to each region to align the clock edges at the leaves of the local clock trees. This technique is dynamic in the sense that it guarantees clock synchronization in the presence of process or ambient variations. Whenever the clock skew exceeds a specific limit, the clock synchronization system is activated to restrain the skew. The advantage of this technique over using phase locked loop circuits is that it saves power consumption and area. In addition, its lock-in time is much shorter. Experiments have shown that using the tunable delay element approach is capable of reducing the skew to less than loops. A stability criterion was developed. The effect of substrate and power supply noise on the synchronization scheme was also investigated.

I. INTRODUCTION

Clock skew used to be a secondary issue in designing synchronous systems. Since the skew is to be within 10% of the clock period [1], the skew must be scaled down as the clock frequency is increased. The reduction of the clock skew is becoming a necessity in future generations which feature large die size and higher clock frequencies. Recently, more attention has been dedicated to clock distribution, with special emphasis on skew time minimization. Most of the research in this area revolves around RC balancing techniques to equate the delay along the branches of the clock tree [2]-[6]. Such techniques could be efficient for trees of limited size. As the die size and the number of buffering levels in the clock tree increase, the skew due to buffer mismatching will grow. The layout process will also be very difficult and will require sophisticated CAD tools; otherwise several iterations will be required for placing and routing the balanced clock network with its buffers [6].

An approach which may be more suitable for future systems is to assume that a system can be partitioned into modules. The size of each module has to be small enough so that conventional clock distribution schemes would yield acceptable local skew times. Hence, these modules can be regarded as isochronic. The phase delays¹ of the modules can be different. These differences in the phase delay can then be compensated for by the inter-module clock distribution. Hence, the global synchronization can be achieved without interfering with the internal module design. Such an approach has been previously proposed by Friedman and Powell [7] and Anceau [8]. This technique fits with the nature of the VLSI, WSI or MCM design process, where each module can be designed separately. Each team can focus on the assigned module including the clock network. Another advantage of the modular clock distribution is that it allows for saving power by powering up and powering down the clock to each module upon request. This is only feasible by having independent clock nets for each module.

The approach adopted by Friedman and Powell is susceptible to process variations because the compensation for the differences in the phase delays of the modules is achieved by parameterizing buffers in the inter-module clock interconnections [7].

Fig. 1. (a) Unsynchronized clock tree. MA and MB are two isochronic modules with different phase delays. PA and PB are the paths between the master CLK and the A and B pins respectively. (b) Synchronized clock tree with two variable delay elements DEA and DEB which are tuned via the control signals VCA and VCB generated by the skew sensor SS.

¹The phase delay of a module is defined as the propagation delay of the local clock tree of the module.

Manuscript received January 12, 1994; revised April 11, 1994.

D. E. Brueske was with Texas A&M University, Dept. of Electrical Engineering, College Station, TX 77843-3128, and is currently with Motorola, Fort Lauderdale, FL 33322 USA.

S. H. K. Embabi is with Texas A&M University, Department of Electrical Engineering, College Station, TX 77843 USA.

IEEE Log Number 9402894.

1070-9894/94$04.00 © 1994 IEEE


Fig. 2. (a) Schematic of the phase detector. (b) Schematic of the charge pump. (c) Block diagram of the skew sensor.


Fig. 3. (a) Schematic of the current starved inverter delay cell (CSI). (b) Schematic of the variable RC delay cell.

Another alternative is the use of a PLL for each module. The advantage of a PLL is that it is not sensitive to process or ambient variations. The drawbacks, however, are the large area and power that PLLs consume and their long lock-in time (several thousand clock cycles).

In this paper we propose to insert tunable delay elements in the inter-module clock lines to adjust their delays so that they offset the module phase delay. This technique has the advantages of the PLL-based approach because it is not susceptible to process and ambient variations. Yet, it occupies less area and consumes less power compared to PLL-based techniques. Furthermore, its lock-in time is much shorter than that of the PLL. The proposed technique is similar to that proposed by Johnson et al. [9], where variable delay lines were used to synchronize a CPU chip and a floating-point coprocessor. The system proposed here is intended to be used for a fairly large number of modules; hence, it has to be simple and compact so that it can be integrated on chip with minimum area overhead.

The paper is divided into seven sections. The basic concept of the proposed deskewing technique is introduced in Section II. In Section III, the basic building blocks are presented. Different implementation issues are discussed in Section IV. Section V deals with the stability of the deskewing system. The impact of the substrate noise and power supply level fluctuations on the jitter is discussed in Section VI. Section VII shows experimental results.

II. BASIC CONCEPT OF THE DESKEWING TECHNIQUE

Given a partitioned system with locally optimized clock nets we may assume that the skew within each module is within the required limits. Hence, any clock pin (one of the leaves of the local clock tree) can be used to measure the clock phase of the whole module. Consider two modules MA and MB with phase delays $\phi_A$ and $\phi_B$ respectively. Both modules receive the clock signal from the master clock via the clock paths PA and PB as shown in Fig. 1(a). A and B are clock pins in modules MA and MB, respectively. Clock skew between nodes A and B may result because the phase delays $\phi_A$ and $\phi_B$ are not equal and/or the propagation delays of PA and PB are different. The proposed scheme is based on measuring the skew time using what we call a skew sensor (SS) to produce a signal that tunes the delay elements, which are inserted in the paths PA and PB as shown in Fig. 1(b). The tuning process will be active until the skew is eliminated. The clock phase introduced by the delay elements will then cancel the differences between $\phi_A$ and $\phi_B$ and the differences between the delays of PA and PB.

Fig. 4. (a) The propagation delay of the CSI delay element (one pair of delay cells) versus the control voltage (VC). (b) The propagation delay of the RC delay element (one pair of delay cells) versus the control voltage (VC).
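The cancellation described in this section can be illustrated with a small arithmetic sketch. It assumes every quantity is expressed as a time delay; the function names and the numerical values are hypothetical and are not taken from the paper.

```python
def required_delay_offset(p_a, phi_a, p_b, phi_b):
    """Extra delay DEB must add relative to DEA so the edges at A and B align.

    Arrival at A: d_a + p_a + phi_a; arrival at B: d_b + p_b + phi_b.
    Setting the two equal gives d_b - d_a = (p_a + phi_a) - (p_b + phi_b).
    """
    return (p_a + phi_a) - (p_b + phi_b)


def skew(d_a, p_a, phi_a, d_b, p_b, phi_b):
    """Skew between pins A and B (positive when the edge at A arrives later)."""
    return (d_a + p_a + phi_a) - (d_b + p_b + phi_b)


# Hypothetical delays in nanoseconds (illustrative only).
p_a, phi_a = 1.0, 2.5   # path PA delay, module MA phase delay
p_b, phi_b = 0.5, 2.0   # path PB delay, module MB phase delay

offset = required_delay_offset(p_a, phi_a, p_b, phi_b)
print("extra delay needed in DEB:", offset, "ns")
print("residual skew after tuning:", skew(0.0, p_a, phi_a, offset, p_b, phi_b), "ns")
```

In the actual scheme the offset is not computed explicitly; the skew sensor drives the delay elements until the measured skew vanishes.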

III. COMPONENTS OF THE DESKEWING SYSTEM

The following sections describe the details of individual components of the synchronization system.

A. Skew Sensor

The skew sensor (SS) consists of a phase detector which generates a pulse whose width is proportional to the skew time, which is given by

$$t_{skew} = \frac{\theta_e}{\omega_{clk}} \qquad (1)$$

where $\theta_e$ is the phase difference between pin A and pin B of Fig. 1(b), and $\omega_{clk}$ is the system clock frequency. Two pulsed signals are generated by the phase detector. One results from negative skew and the other from positive skew. The voltage pulses produce constant current pulses via a charge pump. This charge is pumped onto a capacitor which acts as an integrator. The voltage change on the capacitor is proportional to the phase error $\theta_e$, and is given by

$$\Delta V_C = \frac{I_{cp}}{C_{cp}} \, \frac{\theta_e}{\omega_{clk}} \qquad (2)$$

where $I_{cp}$ is the magnitude of the current pulse from the charge pump, and $C_{cp}$ is the value of the charge pump capacitance. Equation (2) shows that $\Delta V_C$ is proportional to the skew time.

Fig. 5. (a) HSPICE waveforms for the clock pins A and B (top panel), the UP and DW signals (second panel), and VCA and VCB (third panel). (b) The waveforms for A, B, VCA and VCB during lock-in period.
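As a rough numerical illustration of (1) and (2), the following Python sketch converts a skew time into a phase error and then into the change in control voltage. It assumes the phase is in radians and that the clock frequency in (1) is the angular frequency $2\pi f_{clk}$. The function names and all component values are hypothetical.

```python
import math


def phase_error(t_skew, f_clk):
    """Phase error theta_e (rad) from eq. (1): t_skew = theta_e / omega_clk."""
    omega_clk = 2.0 * math.pi * f_clk
    return t_skew * omega_clk


def delta_vc(theta_e, f_clk, i_cp, c_cp):
    """Control-voltage change from eq. (2): (I_cp / C_cp) * (theta_e / omega_clk)."""
    omega_clk = 2.0 * math.pi * f_clk
    return (i_cp / c_cp) * (theta_e / omega_clk)


# Hypothetical operating point (illustrative values only).
f_clk = 100e6     # clock frequency, Hz
t_skew = 1e-9     # skew time, s
i_cp = 10e-6      # charge pump current, A
c_cp = 5e-12      # charge pump capacitance, F

theta = phase_error(t_skew, f_clk)
dv = delta_vc(theta, f_clk, i_cp, c_cp)
print(f"theta_e = {theta:.4f} rad, delta Vc = {dv * 1e3:.3f} mV")
```

Because $\theta_e / \omega_{clk}$ is simply the skew time, the voltage step reduces to $I_{cp} \, t_{skew} / C_{cp}$, which is why a smaller capacitor gives a larger correction per detected pulse.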

1) Phase Detector: Fig. 2(a) shows a gate level schematic of the phase detector [10]. The UP signal becomes active if CLKA's falling edge comes before that of CLKB. When the falling edge of CLKB arrives at time $t_2$ the phase detector resets itself. This means the UP and DW signals will become inactive as shown in Fig. 2(a). If the CLKB falling edge occurs before the CLKA edge, the DW signal will become active. Again the circuit will reset itself after the falling edge of CLKA arrives. The signals should remain inactive on the rising edges of CLKA and CLKB. If CLKA is picked as the reference clock, then positive skew (i.e. $t_1 - t_2$) would be detected by the UP signal and negative skew (i.e. $t_2 - t_1$) by the DW signal.
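The behaviour described above can be summarized by an idealized behavioural model. This is only a sketch: it ignores gate delays and the finite reset time of the real circuit, and the function name and return convention are assumptions made for illustration.

```python
def phase_detector(t_fall_a, t_fall_b):
    """Idealized UP/DW pulse widths for one pair of falling edges.

    UP is active from the falling edge of CLKA until the falling edge of
    CLKB when CLKA falls first; DW is active in the opposite case. Both
    stay inactive when the edges coincide.
    Returns (up_width, dw_width) in the same time unit as the inputs.
    """
    if t_fall_a < t_fall_b:
        return (t_fall_b - t_fall_a, 0.0)
    if t_fall_b < t_fall_a:
        return (0.0, t_fall_a - t_fall_b)
    return (0.0, 0.0)


# Hypothetical edge times in nanoseconds.
print(phase_detector(0.0, 1.0))   # CLKA falls first: UP pulse
print(phase_detector(2.0, 0.5))   # CLKB falls first: DW pulse
print(phase_detector(3.0, 3.0))   # aligned edges: no pulse
```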

2) Charge Pump: Once a pulse is generated from the UP and DW signals, a charge pump (Fig. 2(b)) converts the voltage pulse into a current pulse. The current pulse is then used to charge the capacitor $C_{cp}$ (for example, $C_{cp}$ = 5 pF) so that the output voltage of the pump is proportional to the detected phase difference. The charge pump proposed here is simpler than the one used in [9], and consumes less power. Furthermore, it is symmetrical in the sense that it achieves almost equal charging and discharging current pulses, which is difficult to achieve with the charge pump presented in [10].

The block diagram of the skew sensor used in our implementation is given in Fig. 2(c).

B. Delay Element

The delay element shifts the phase of the clock in proportion to the analog control voltage which is produced by the charge pump. Two types of delay cells will be examined in this section: the current starved inverter (CSI) delay cell shown in Fig. 3(a) and the variable RC delay cell [9] shown in Fig. 3(b). The current starved inverter used in our implementation is a modified version of the conventional one reported in [10]. The control voltage $V_C$ in Fig. 3(a) controls the current through M3 and M1. The current variation is then mirrored to M2, which provides the charging current to the $C_{load}$ capacitor. Thus, the low-to-high output transition can be controlled by $V_C$. A delay element consists of an even number of cascaded CSI cells to equally delay both edges of the incoming clock. To reshape the output clock, two inverters are added in series with each delay cell. The delay of the conventional current starved inverter increases sharply as $V_C$ approaches the threshold voltage of the NMOS transistor (M3) because the starved inverter tends to turn off. The steep slope of the delay curve of the conventional current starved inverter amplifies the effect of noise and may cause jitter [9]. This problem can be solved by adding the biasing current ($I_{bias}$) as shown in Fig. 3(a) to prevent the starved inverter from turning off as $V_C$ decreases below the threshold voltage of M3. Fig. 4(a) illustrates the delay accomplished by the current starved inverter versus the control voltage as measured from the HSPICE simulations. Note that the delay range and the slope of the delay can be controlled via $I_{bias}$.

Fig. 6. Schematic of the deskewing system with compensation for the difference in the delay of the IA and IB interconnects.

Fig. 7. Schematic of the parallel multi-module synchronization architecture.

The propagation delay of the RC type is determined by the variable NMOS resistor, which is controlled via $V_C$. Again, an even number of the RC type delay cells should be cascaded for equal phase shifts on both transitions of the incoming clock signal. A plot of the simulated propagation delay versus control voltage is shown in Fig. 4(b). Further comparison between the two types will be discussed in Section VI.

IV. IMPLEMENTATION OF THE DESKEWING SYSTEM

The skew sensor shown in Fig. 2(c) and the current starved inverter delay element were used for the deskewing system shown in Fig. 1(b). HSPICE simulations were run to verify the functionality of the deskewing system. The waveforms generated by HSPICE are depicted in Fig. 5. Initially, the clock at pin A leads that of pin B by 1 ns as shown in Fig. 5(a) (top panel). The UP and DW signals are shown in the second panel, and the corresponding analog control signals VCA and VCB in the third panel of Fig. 5(a). When the UP signal becomes active, VCA increases and VCB decreases in proportion to the UP pulse width. When the UP and DW signals are equal, the control voltages stop changing and the phase shift introduced by the delay elements has virtually eliminated the skew (skew < 60 ps). This occurs after eight falling edges of the clock signal, or 60 ns, as can be seen in Fig. 5(b).

Fig. 8. Schematic of the series multi-module synchronization architecture.

Fig. 9. HSPICE waveforms for the clock pins A, B, D, C of Fig. 8 (a) without synchronization and (b) with synchronization.

Fig. 10. Block diagram for the control model used for the stability analysis of the basic deskewing system of Fig. 1(b).

So far, we have assumed that the interconnects IA and IB between the clock pins and the skew sensor (see Fig. 1(b)) contribute equal delays. Now, assume that the interconnects IA and IB are long and may require buffers. It can no longer be assumed that the delays are equal, since mismatching due to process variations is likely. To eliminate the phase error due to the difference in the delay of interconnects IA and IB, a scheme similar to that of Fig. 1(a) can be used. The modified scheme is depicted in Fig. 6. The points A' and B' are as close as possible to A and B, respectively. Interconnects IA' and IB', which are exact images (including the repeaters) of IA and IB, respectively, connect point X to A' and B'. Another pair of interconnects IA" and IB", which are also duplicates of IA and IB, connect A' to Y and B' to Z. Note that IA, IA' and IA" must be laid out parallel to each other in close proximity. The same applies for IB, IB', and IB". Also, note that delay elements DEA, DEA', and DEA" (DEB, DEB', and DEB") are to be laid out in a similar fashion and should be as close as possible to each other. By doing this, the mismatches between IA, IA' and IA", their repeaters and their delay elements are minimized. This also applies to the B set of interconnects. A clock signal (any clock signal can be used for this purpose) is injected at point X. The delay of this clock signal from X to Y is different from that between X and Z. The secondary phase detector senses the Y-Z phase difference and tunes the delay of all elements (DEA, DEA', DEA", DEB, DEB', and DEB") until Z and Y are in phase. This guarantees that the delay between pin A and CLKA is equal to that between pin B and CLKB, and hence, the phase error sensed by the primary phase detector will be only due to the delay difference of PA and PB. The functionality of this technique has been verified using HSPICE simulations.

Fig. 11. Root locus in the z-domain generated by MATLAB for (a) m = 3 and (b) m = 4.

Fig. 12. Unit-step response based on (5) using MATLAB for (a) K = 0.2, overdamped, (b) K = 0.4, slightly underdamped, (c) K = 0.8, underdamped, and (d) K = 1.1, unstable.

So far, we have considered the case where only two modules are to be synchronized. In practice, more modules will need deskewing. This can be achieved through parallel synchronization or series synchronization. Parallel synchronization is achieved by synchronizing all regions to one reference node. The second method, series synchronization, tries to align the clock phase of each module to that of its neighbors. A more complete description is given in the following paragraphs.

Parallel synchronization provides a simple architecture for multi-module synchronization, as shown in Fig. 7. Pins A, B, C, and D are representative points of the isochronic modules. Each region has a locally optimized clock network. All modules, as shown in the figure, are synchronized to a reference pin, which is pin A in this case. Each skew sensor (SS) detects the phase difference between pin A and one of the clock pins B, C or D. The delay elements are adjusted via the control voltage, VCB, VCC or VCD, until the regional clocks at pins B, C and D are all aligned with the clock phase of pin A. Each regional pin is independently adjusted to synchronize with pin A. Therefore the lock-in times are all independent of each other. The system lock-in time should be approximately the same as the lock-in time for two pin synchronization.

Fig. 13. Unit-step response obtained from the HSPICE simulations for (a) K = 0.2, overdamped ($C_{cp}$ = 30 pF), (b) K = 0.4, underdamped ($C_{cp}$ = 7 pF), and (c) K > 1, unstable ($C_{cp}$ = 2 pF). For each case, the top panel shows the control voltage of the delay element and the lower panel shows the clock at pins A and B.
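The independence of the lock-in times in the parallel architecture can be illustrated with a toy model. The sketch below assumes an idealized first-order loop with no loop delay, in which each skew sensor removes a fixed fraction `k` of the measured skew per clock cycle. The function name, the gain, the initial skews and the 60 ps lock tolerance are assumptions for illustration, not circuit-level results.

```python
def parallel_sync(initial_skews, k=0.4, tol=60e-12, max_cycles=1000):
    """Cycles needed for each pin to lock to reference pin A.

    initial_skews maps a pin name to its initial skew (seconds) relative
    to pin A. Every pin runs its own loop, so the loops never interact.
    """
    lock_cycles = {}
    for pin, skew in initial_skews.items():
        n = 0
        while abs(skew) > tol and n < max_cycles:
            skew -= k * skew      # correction applied by this pin's delay element
            n += 1
        lock_cycles[pin] = n
    return lock_cycles


# Hypothetical initial skews of pins B, C and D with respect to pin A.
cycles = parallel_sync({"B": 1e-9, "C": -2e-9, "D": 3e-9})
print(cycles)
print("system lock-in time:", max(cycles.values()), "cycles")
```

In this model the system lock-in time is set by the pin with the largest initial skew, and adding more pins does not lengthen the lock-in of any existing pin.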

The series synchronization scheme is shown in Fig. 8. The delay elements to each module have two control voltages. A, B, C, and D are representative pins of their corresponding isochronic modules. The master clock is injected at the center and eventually passes through each delay element. The pins are then synchronized by complementary recursive adjustment of the delay element control voltages. For instance, suppose the skew sensor between pins A and D detects a positive phase difference with respect to pin A. The control voltages will change to compensate for the skew. The voltage to the delay element DED will increase and, in an equal but opposite manner, VC-1 will increase and VC~1 will decrease by an equal amount. However, pins C and D are synchronizing at the same time as pins A and D. This means any adjustments by control voltage VC-2 will change the phase at pin D, which may unsynchronize pins A and D. Again all pins will try to synchronize on the next clock cycle. Eventually the nodes will be aligned. Fig. 9 illustrates the HSPICE simulation waveforms for the clock signal of all four pins before and after synchronization. The maximum skew was 3 ns before synchronization as shown in Fig. 9(a), and was reduced to about 30 ps as shown in Fig. 9(b).

Fig. 14. Schematic of the RC model of the interconnect carrying the analog control signal from the charge pump to the delay element. The circuit consisting of Vdzglta,, C,, C,, R, and L , models the parasitics which determine the substrate noise [13].

Under most conditions, this process will take longer to synchronize than the parallel architecture. In addition, the series approach may not be as effective as the parallel method in eliminating the skew. The skew between any two neighboring nodes in the loop of the series architecture will always have some small but finite value. This small skew may accumulate around the loop and may cause the skew between any two distant points to be relatively large. In the parallel architecture the worst case skew is fairly constant and is equal to that which may exist in the simple dual module system of Fig. 1(b).

V. STABILITY ANALYSIS

The presented deskewing scheme is a feedback system and may become unstable under certain conditions to be revealed in this section. Since the system is by nature sampled, a linearized sampled analysis similar to the one done for PLL in [11] can be performed.

A. Linear Control Model

The deskewing system shown in Fig. 1(b) can be modeled by the block diagram shown in Fig. 10. The phase error, which is measured by the phase detector, is given by

$$\theta_e = (\theta_{PA} + \theta_{DA}) - (\theta_{PB} + \theta_{DB}) = \theta_P - \theta_D \qquad (3)$$

where $\theta_{PA}$ ($\theta_{PB}$) is the clock phase shift due to the delay of the path PA (PB), and $\theta_{DA}$ ($\theta_{DB}$) is the phase shift caused by the delay of the delay element DEA (DEB). For convenience, we will use $\theta_P$ for $\theta_{PA} - \theta_{PB}$ and $\theta_D$ for $\theta_{DB} - \theta_{DA}$. The phase detector can be represented by a summer, and the charge pump can be modeled by a constant gain block ($K_{cp}$) and an integrator. The gain $K_{cp}$, which is the ratio of the change in $V_C$ ($\Delta V_C$) to $\theta_e$, can be derived from (2) and is given by

$$K_{cp} = \frac{I_{cp}}{C_{cp}\,\omega_{clk}} \qquad (4)$$

The constant gain block, $K_D$, in Fig. 10 represents the gain of the delay element. The $z^{-m}$ block accounts for the total delay in the loop, which is defined as the time elapsed from the point when an UP or DW signal is generated until the corresponding clock phase shift is produced. This includes the delays of the phase detector, the charge pump, the interconnects carrying the control signal to the delay element and finally that of the delay element itself. The delay of the loop ($T_{LP}$) may be less than or greater than the clock period ($T_{clk}$). The factor m can be calculated as the greatest integer function of the ratio $T_{LP}/T_{clk}$ shifted by 1, i.e. $m = \lfloor 1 + T_{LP}/T_{clk} \rfloor$. This means that m can be obtained by rounding the ratio $T_{LP}/T_{clk}$ up to the nearest integer. Finally, the gain of the delay element, which is strictly defined as the derivative of the element's delay w.r.t. the control voltage, has no single value because of the non-linear relation of the delay versus $V_C$ as shown in Fig. 4(a). To simplify the analysis we will use the maximum value of the derivative as $K_D$. This yields a worst case analysis.

The linearized model in Fig. 10 yields the following transfer function

$$H(z) = \frac{\theta_D(z)}{\theta_P(z)} = \frac{K}{z^m - z^{m-1} + K} \qquad (5)$$

where K is equal to the loop gain $K_{cp} K_D$. To maintain stability in a discrete control system the poles must be within the unit circle of the z-plane. For a first order system, which means that the loop delay is less than the clock period or m = 1, there is only one pole at -K. To guarantee stability K must be less than 1, which implies that

$$\frac{K_D I_{cp}}{C_{cp}\,\omega_{clk}} < 1 \quad \text{or} \quad C_{cp} > \frac{K_D I_{cp}}{\omega_{clk}} \qquad (6)$$
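The loop order m and the capacitance bound in (6) can be evaluated numerically. The following is a minimal sketch; it assumes that $K_D$ is expressed as a phase gain in rad/V so that K is dimensionless and that $\omega_{clk} = 2\pi f_{clk}$. The function names and all numerical values are hypothetical.

```python
import math


def loop_order(t_lp, t_clk):
    """m = floor(1 + T_LP / T_clk), as defined in the text."""
    return math.floor(1.0 + t_lp / t_clk)


def loop_gain(k_d, i_cp, c_cp, f_clk):
    """K = K_cp * K_D with K_cp = I_cp / (C_cp * omega_clk), eq. (4)."""
    return k_d * i_cp / (c_cp * 2.0 * math.pi * f_clk)


def min_ccp(k_d, i_cp, f_clk):
    """Lower bound on C_cp from eq. (6): C_cp > K_D * I_cp / omega_clk."""
    return k_d * i_cp / (2.0 * math.pi * f_clk)


# Hypothetical design values (illustrative only).
t_clk = 7.5e-9    # clock period, s
k_d = 3.0         # delay element gain, rad/V (assumed unit)
i_cp = 10e-6      # charge pump current, A
f_clk = 1.0 / t_clk

print("m for a 3 ns loop delay: ", loop_order(3e-9, t_clk))
print("m for a 10 ns loop delay:", loop_order(10e-9, t_clk))
c_min = min_ccp(k_d, i_cp, f_clk)
print(f"C_cp must exceed {c_min * 1e15:.1f} fF")
print(f"K with C_cp = 5 pF: {loop_gain(k_d, i_cp, 5e-12, f_clk):.4f}")
```

Note that `floor(1 + r)` and rounding `r` up agree except when the ratio is an exact integer, in which case the floor form gives the larger value.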

The maximum value of K which maintains stability decreases as the order of the system increases, or in other words as the loop delay increases. Fig. 11(a) and 11(b) show the root locus plots for m = 3 and m = 4 in the z-domain. For the case where m = 3, the system becomes unstable when K > 0.625, while for m = 4, K has to be less than 0.46. This implies that a shorter loop delay yields a more robust system. The transfer function (5) has been used successfully to predict the limits of stability of the deskewing system, and can be used to find the optimum design which yields minimum lock-in time without making the system unstable. The following shows an example in which we have used the transfer function to optimize a second order system. The unit-step time response of the second order system (with m = 2), which was analyzed based on the above transfer function using MATLAB, has shown that slight underdamping (K = 0.4 in Fig. 12(b)) offers the shortest lock-in time (8 clock cycles). This has been verified using HSPICE. About 6 cycles were required for the skew elimination when K ≈ 0.4 (Fig. 13(b)). Fig. 13(c) shows how the system becomes unstable, resulting in severe clock jitter, when K > 1.
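The stability limits and step responses discussed above can be explored with a short script based only on the denominator of (5). This is a sketch rather than the MATLAB analysis used by the authors: it applies a Schur-Cohn (Jury-type) reduction to test whether all roots of $z^m - z^{m-1} + K$ lie inside the unit circle, finds the largest stable K by bisection (assuming a single stability boundary between 0 and 2), and iterates the difference equation implied by (5). All function names are hypothetical.

```python
def denominator(m, k):
    """Coefficients of z^m - z^(m-1) + K in descending powers of z."""
    c = [0.0] * (m + 1)
    c[0] = 1.0
    c[1] -= 1.0
    c[m] += k
    return c


def is_stable(coeffs):
    """Schur-Cohn test: True if every root lies strictly inside |z| = 1."""
    p = list(coeffs)
    while len(p) > 1:
        if abs(p[-1]) >= abs(p[0]):
            return False
        r = p[-1] / p[0]
        n = len(p) - 1
        # Subtract r times the reversed polynomial and divide by z.
        p = [p[i] - r * p[n - i] for i in range(n)]
    return True


def critical_gain(m, iters=60):
    """Largest stable K found by bisection on the interval (0, 2)."""
    lo, hi = 0.0, 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_stable(denominator(m, mid)):
            lo = mid
        else:
            hi = mid
    return lo


def step_response(m, k, n_cycles=30):
    """theta_D[n] for a unit step in theta_P.

    From (5): y[n] = y[n-1] + K * (u[n-m] - y[n-m]), with u = 1 for n >= 0.
    """
    y = [0.0] * n_cycles
    for n in range(n_cycles):
        prev = y[n - 1] if n >= 1 else 0.0
        if n >= m:
            y[n] = prev + k * (1.0 - y[n - m])
        else:
            y[n] = prev
    return y


for m in (2, 3, 4):
    print(f"m = {m}: stable for K below about {critical_gain(m):.3f}")
for k in (0.2, 0.4, 0.8):
    y = step_response(2, k)
    print(f"K = {k}: peak = {max(y):.3f}, final = {y[-1]:.3f}")
```

Under these assumptions the bisection gives limits of roughly 0.62 for m = 3 and 0.45 for m = 4, close to the values read from the root locus plots, and a limit of 1 for m = 2, in line with the instability reported for K > 1.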

The series multi-module synchronization method requires more complex linear control modeling. However, to ensure its stability one should use small loop gain as concluded from the above analysis.

Fig. 15. The frequency response of substrate noise coupled to the control input of the delay element for different values of $C_{CP}$. Curve 1 corresponds to the smallest value $C_{CP}$=2pF and the bottom curve (curve 5) corresponds to $C_{CP}$=20pF.

Fig. 17. Schematic of the CSI cell with a stabilization capacitance.

From the simulations and equations derived in this section it is apparent that a fast lock-in time can readily be achieved without loss of stability. To satisfy the stability requirements of the PLL-based systems the capacitance of the low-pass filter of the PLL has to be very large (for example, 100pF). The lock-in time is, therefore, very large. The improvement in the lock-in time makes the proposed deskewing system attractive.

VI. NOISE ANALYSIS

Using an analog control voltage for the variable delay elements yields the desired precision necessary for high clock frequency phase synchronization. A digital implementation with the same resolution would occupy more silicon area compared to the analog counterpart. The disadvantage of using analog elements mixed with a digital system is that noise caused by the fast switching digital transients feeds through to the sensitive analog signals. Noise coupled onto the delay cell's control voltage signals can cause significant phase variations between the synchronized clock pins. This is known as clock jitter. Another source of clock jitter is the low frequency voltage fluctuations in the power supply [12]. The noise immunity of the variable RC and the CSI types and the jitter of the overall synchronization system are discussed next.

Fig. 18. The clock jitter (due to the power supply level fluctuation) as a percentage of the clock cycle for the CSI, RC and CSI with $C_{stab}$ elements.

A. Substrate Noise

Noise due to the fast switching transients can couple onto the analog interconnect either up through the substrate or through cross-talk between interconnects. The analysis presented here focuses on the noise coupled from the substrate to the analog interconnect carrying the control signal to the delay elements. An excellent analysis of substrate noise can be found in [13], where the parasitics from the integrated circuit package are approximated by a lumped RLC. Circuit models presented in [13] will be used to emulate the noise transients which typically occur on the substrate. The interconnect carrying the control voltage from the charge pump to the delay element is modeled by an RC π model to account for the capacitance ($C_{int}$) and resistance ($R_{int}$) of the interconnect as shown in Fig. 14. The substrate noise is generated by the RLC lumped model shown in Fig. 14. Its elements are the metal-to-substrate capacitance, which couples the digital switching ($V_{digital}$) to the substrate; the capacitance between the substrate and the package cavity; the spreading resistance from the substrate contact to the substrate node; and the inductance associated with the pin of the substrate contact.

Fig. 19. (a) Schematic of the set-up which was fabricated. (b) The photomicrograph of the chip.

The charge pump capacitor, $C_{CP}$, can be used to filter out some of this noise if it is deliberately placed as close to the control voltage input of the delay element as possible. An AC analysis of the passive circuit (RC π model, the substrate noise generator and the charge pump capacitance) was used to measure the transfer function $H = V_C/V_{sub}$, which is the ratio of the noise coupled to the input of the delay cell to that of the substrate. The Bode plot shown in Fig. 15 indicates that the noise coupled through the capacitance of the interconnect is attenuated as the charge pump capacitance increases. This is highly desirable; the drawback is that the system time response is slowed, which means the lock-in time will increase.
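For first-order intuition only, the filtering effect of $C_{CP}$ can be pictured as a capacitive divider: noise that couples through the interconnect capacitance onto the control node is shared with the charge pump capacitor. This is a deliberate simplification and not the RLC model of Fig. 14; it ignores $R_{int}$, the package parasitics and all frequency dependence. The value of `C_INT` is hypothetical, while the two $C_{CP}$ values are the extremes quoted for Fig. 15.

```python
import math

def divider_attenuation_db(c_int, c_cp):
    """Attenuation of a simple capacitive divider, in dB.

    Assumed model (not the paper's): V_C / V_sub = C_int / (C_int + C_CP).
    """
    return 20.0 * math.log10(c_int / (c_int + c_cp))

C_INT = 0.2e-12  # F, hypothetical interconnect coupling capacitance

att_small_cap = divider_attenuation_db(C_INT, 2e-12)   # C_CP = 2 pF
att_large_cap = divider_attenuation_db(C_INT, 20e-12)  # C_CP = 20 pF
```

Even in this crude picture, raising $C_{CP}$ from 2pF to 20pF increases the attenuation, which matches the trend reported for the Bode plot; the actual attenuation figures depend on the full model.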

In [9] it was concluded that the variable RC delay cell is less prone to noise coupled to its control input when compared to the CSI. This conclusion was based on the assumption that the magnitude of the noise component is determined solely by the slopes of the curves in Figs. 4(a) and 4(b) (delay versus control voltage curves). However, this is not true since these curves are based upon a steady state analysis of the control voltage. A simple test is performed on both delay cell types to compare their actual noise immunity. A 0.6V (=20% of $V_{DD}$) peak-to-peak sinusoidal noise source was added in series with the DC value of $V_C$. Three frequencies are chosen for this noise source: 100, 200, and 340 MHz. Note that the system clock is 150MHz, so the noise frequency spectrum should consist of frequencies near 150MHz or greater. The peak-to-peak jitter in the delay due to the noise was measured for the RC and CSI types and is depicted in Fig. 16. Note that the peak-to-peak jitter was measured as a percentage of the maximum delay range achievable by the RC and CSI delay cells (the delay range was equal for both types). Fig. 16 shows that the CSI cell is less sensitive to noise than the RC cell; however, their noise immunities approach each other as the noise frequency increases. One of the problems with the RC cell is the parasitic capacitive coupling between the gate and drain of the NMOS variable resistor. The fast switching transients at the output of the RC cell feed through the parasitics onto the control voltage. This results in small voltage fluctuations which change the delay of the RC cell. In essence, it can create its own noise. Another advantage of using the CSI is that an additional stabilizing capacitance, $C_{stab}$, can be connected between the gate and source of the PMOS transistor M1 (see Fig. 17 for details). The peak-to-peak jitter of the CSI cell with the stabilizing capacitance and a 200 MHz noise source is 0.3%, as opposed to 1.1% for the CSI cell without $C_{stab}$. It is interesting to note that the stabilizing capacitance does not affect the lock-in time of the system.

Fig. 20. The measured waveforms of the clock signals (a) without synchronization ($X_A$ and $X_B$) and (b) with synchronization ($Y_A$ and $Y_B$).

To further investigate noise performance, the synchronization system was simulated together with the circuit models for the interconnect and substrate noise generator to measure the overall jitter due to substrate noise. This was measured as the difference between the skew of a noisy and a noiseless system as a percentage of the clock cycle. The peak-to-peak and rms jitter of the CSI with and without $C_{stab}$ and the RC delay cells extracted from the HSPICE simulations are shown in Table I. The overall jitter of the CSI and the RC cells is comparable. The stabilizing capacitance, however, reduces the jitter significantly.
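The jitter measure described above, the deviation of the noisy skew from the noiseless skew normalized to a reference time, can be sketched as follows. The function name, the sample values and the choice of a raw (not mean-removed) rms are assumptions made for illustration; the paper does not state how its rms figure was computed.

```python
import math

def jitter_percent(skew_noisy, skew_clean, reference_time):
    """Return (peak-to-peak, rms) jitter as a percentage of reference_time.

    skew_noisy, skew_clean: per-cycle skew samples in seconds, same length.
    reference_time: normalizing time in seconds (for example the clock period).
    """
    if len(skew_noisy) != len(skew_clean) or not skew_noisy:
        raise ValueError("sample lists must be non-empty and of equal length")
    deviation = [a - b for a, b in zip(skew_noisy, skew_clean)]
    peak_to_peak = max(deviation) - min(deviation)
    rms = math.sqrt(sum(d * d for d in deviation) / len(deviation))
    return 100.0 * peak_to_peak / reference_time, 100.0 * rms / reference_time

# Hypothetical samples: +/-10 ps of deviation around a 1 ns reference.
pp, rms = jitter_percent([10e-12, -10e-12, 0.0], [0.0, 0.0, 0.0], 1e-9)
```

Passing the clock period or the full delay range as `reference_time` gives the two normalizations used in the text and in Table I respectively.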

B. Power Supply Noise

The capacitive load associated with integrated circuits increases with larger integration levels and I/O count. As a result the current drawn from the power supply to recharge the capacitive loads is also increasing. This may lead to transient power supply level fluctuations due to the dI/dt noise [1]. If the switching activity on a chip increases over a given period of time, the average current drawn from the supply will also increase, and will cause a supply voltage drop due to the resistance of the supply lines [12]. This abrupt change in the supply level causes severe jitter in PLL-based systems [12]. The impact of this phenomenon on the proposed deskewing system was investigated. The jitter was measured as the deviation of the skew from its nominal² value in response to an abrupt change of 10% in the supply voltage. HSPICE simulations were used to measure the jitter of the overall system for three cases: first with the CSI delay cell, second with the RC delay cell and lastly using the CSI with a stabilizing capacitance. Fig. 18 shows the jitter generated due to a power supply step for all three types. The jitter in the case of the CSI delay cell is smaller than that of the RC delay cell. The CSI with stabilizing capacitance shows an improvement in the maximum peak jitter, but the response tends to oscillate.

TABLE I
THE CLOCK JITTER (DUE TO THE SUBSTRATE NOISE) AS PERCENTAGE OF THE FULL DELAY RANGE OF THE DELAY ELEMENT. ALL TYPES HAD THE SAME DELAY RANGE.

               CSI delay cell    CSI delay cell with Cstab    RC delay cell
Jitter (p-p)   2.34%             0.85%                        2.52%
Jitter (rms)   0.52%             0.15%                        0.57%

TABLE II
THE AREAS OF THE PHASE DETECTOR, CHARGE PUMP AND THE CSI DELAY ELEMENT LAID OUT IN A 0.8µm CMOS PROCESS.

Element                     Area (0.8µm process)    Area as Percentage of 1cm x 1cm Die
Phase Detector              13,894µm²               0.0139%
Charge Pump w/Capacitor     2,564µm²                0.0026%
Delay Element               4,511µm²                0.0046%

VII. EXPERIMENTAL RESULTS

The basic scheme of Fig. 1(b) was fabricated using Orbit's 2µm p-well CMOS process available through MOSIS. The synchronizing scheme was implemented using the current starved inverter delay cell. The clock skew was physically introduced into the system by inserting two distributed poly-to-poly RC lines ($R_AC_A$ and $R_BC_B$) with different lengths in the PA and PB propagation paths as shown in Fig. 19(a). The photomicrograph of the chip is shown in Fig. 19(b). The same circuit was used to measure the skew with and without synchronization. Fig. 20(a) shows the measured waveforms of the clock signals before synchronization (at $X_A$ and $X_B$ in Fig. 19(a)). The skew between $X_A$ and $X_B$ is 1.75ns (Fig. 20(a)). With synchronization the skew time was reduced to about 40ps. The clock waveforms measured at $Y_A$ and $Y_B$ show that the clock edges are virtually aligned, as depicted in Fig. 20(b).

² The nominal value of the skew is defined as the skew when the supply voltage is fixed at its nominal value (e.g. 3V in our study).

The areas occupied by the phase detector, the charge pump and the delay element are relatively small. Table II shows the area of each component measured from the layout using 0.8µm CMOS design rules. It also shows the relative areas normalized to a 1cm² die.
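The normalization to a 1cm² die is simple arithmetic and can be reproduced as a quick check. The sketch below uses the areas listed in Table II; the function and variable names are hypothetical.

```python
DIE_AREA_UM2 = 1e4 * 1e4  # 1 cm x 1 cm die; 1 cm = 10,000 um

def area_percent(area_um2, die_area_um2=DIE_AREA_UM2):
    """Area of a block as a percentage of the die area."""
    return 100.0 * area_um2 / die_area_um2

# Areas in square micrometres, as listed in Table II.
areas_um2 = {
    "Phase Detector": 13894,
    "Charge Pump w/Capacitor": 2564,
    "Delay Element": 4511,
}

percentages = {name: area_percent(a) for name, a in areas_um2.items()}
total_percent = area_percent(sum(areas_um2.values()))
```

The three blocks together occupy about 0.021% of the die. The phase detector and charge pump entries reproduce the tabulated percentages; the delay element evaluates to about 0.0045%, marginally below the tabulated 0.0046%.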

VIII. CONCLUSION

A dynamic deskewing technique for ULSI and WSI systems was presented. The proposed system is independent of the process and ambient variations. It suits systems with very high integration levels since it allows modular clock distribution where the local networks can be designed separately from the global network. The power and area consumed are smaller than those of the PLL-based synchronization systems. Since the synchronization is achieved via variable delay elements instead of a VCO, the lock-in time is smaller than that of the PLL. The stability analysis shows that the stability of the deskewing system can be easily guaranteed, as shown by the developed stability criterion. The jitter due to substrate noise and power supply level fluctuations was also investigated. A modified current starved inverter delay cell, with high noise immunity, was proposed. Experimental results have demonstrated that the variable delay elements can be used efficiently in practice to reduce the skew from a few nanoseconds to less than 100ps.

REFERENCES

[1] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990.

[2] D. F. Wann and M. A. Franklin, “Asynchronous and clocked control structures for VLSI based interconnections networks,” IEEE Trans. Comput., pp. 284-293, Mar. 1983.

[3] Ting-Hai Chao et al., “Zero skew clock net routing,” 29th ACM/IEEE Design Automat. Conf. Digest Tech. Papers, pp. 518-523, 1992.

[4] F. Minami and M. Takano, “Clock tree synthesis based on RC delay balancing,” IEEE 1992 Custom Integrated Circuits Conf. Digest of Tech. Papers, pp. 28.3.1-28.3.4, 1992.

[5] Ren-Song Tsay, “Exact zero skew,” in IEEE Intl. Conf. Comput.-Aided Design Digest Tech. Papers, pp. 336-339, 1991.

[6] S. Boon et al., “High performance clock distribution for CMOS ASICs,” in IEEE 1989 Custom Integrated Circuits Conf. Digest Tech. Papers, pp. 15.4.1-15.4.5, 1989.

[7] E. G. Friedman and S. Powell, “Design and analysis of a hierarchical clock distribution system for synchronous standard cell/macrocell VLSI,” IEEE J. Solid-State Circuits, vol. SC-21, no. 2, pp. 240-256, Apr. 1986.

[8] Francois Anceau, “A synchronous approach for clocking VLSI systems,” IEEE J. Solid-State Circuits, vol. SC-17, no. 1, Feb. 1982.

[9] M. G. Johnson and E. L. Hudson, “A variable delay line PLL for CPU-coprocessor synchronization,” IEEE J. Solid-State Circuits, vol. SC-23, no. 5, pp. 1218-1223.

[10] Deog-Kyoon Jeong et al., “Design of PLL-based clock generation circuits,” IEEE J. Solid-State Circuits, vol. SC-22, no. 2, pp. 255-261, Apr. 1987.

[11] F. M. Gardner, “Charge-pump phase-lock loops,” IEEE Trans. Commun., vol. COM-28, no. 11, pp. 1849-1856, Nov. 1980.

[12] I. A. Young et al., “PLL clock generator with 5 to 110MHz of lock range for microprocessors,” IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1599-1606, Nov. 1992.

[13] D. K. Su et al., “Experimental results and modeling techniques for substrate noise in mixed-signal integrated circuits,” IEEE J. Solid-State Circuits, vol. SC-28, no. 4, pp. 420-430, Apr. 1993.

Daniel E. Brueske received a B.S.E.E. degree in 1992 from the University of Portland, Oregon and a M.S.E.E. in 1994 from Texas A&M University, College Station, Texas.

Currently, he is working for Motorola, Inc., Radio Products Group, in Ft. Lauderdale, Florida. His current interests are mixed mode circuit design, frequency synthesis, and high frequency VCO design.

Sherif Embabi (S’87-M’91) received the B.Sc. and M.Sc. degrees in electronics and communications from Cairo University, Egypt, in 1983 and 1986 respectively, and the Ph.D. degree in electrical engineering from the University of Waterloo, Canada, in 1991.

He is currently an Assistant Professor of Electrical Engineering at Texas A&M University. His current interests are in digital circuit design, in particular BiCMOS circuits, low power circuits and clock distribution. He is also interested in analog circuit design with emphasis on the development of field programmable analog arrays.

Dr. Embabi is a co-author of Digital BiCMOS Integrated Circuit Design, Kluwer Academic Publishers, 1993.