14 Ref Import

download 14 Ref Import

of 10

Transcript of 14 Ref Import

  • 8/2/2019 14 Ref Import

    1/10

    Fault-TolerantClock Synchronizationin Distributed Systems

    Parameswaran Ramanathan, University of Wisconsin

    Kang G . Shin, University of Michigan

    Ricky W . Butler, NASA Langley Research Center

    igi ta l compu ters have become es-sential to critical real-time appli-ca t ions such as aerospace systems,l i fe support sys tems, nuclear power p lants ,dr ive-by-wire systems, and compu ter- in-tegra ted manufactur ing systems. Comm on

    to all these applicat ions is the demand formax imum reliability and high performan cefrom com puter control lers. This require-ment is necessarily stringent because asingle controller failure in these applica-tions can lead to disas ter . For ex ample , theal lowable probabil ity of fa i lure for a com -mercial aircraft is specified to be less thanper 1 0-hour miss ion because a con-troller failure during flight could result in acrash.Because of such stringent requiremen ts,t radi t ional methods for des ign and v al ida-tion of computer controllers are often in-adequate. Ad hoc techniques that appearsound under a careful failure-modes-and-effects analysis are often susceptible tocertain subtle failure modes. The clocksynchroniza t ion problem show n in Figure1 is a c lass ic example . The f igure shows athree-node system in which each node hasi ts own c lock. Th e c locks are synchronizedby adjust ing each to the median of the three

    Software algorithmsare suitable only whereloose synchronization

    is acceptable, andhardware algorithmsare expensive. A hybrid

    scheme achievesreasonably tight

    synchronization and iscost-effective.

    clock values. This intuitively correct al-gori thm works f in e as long as a l l the c locksare consis tent in the ir behavior , as Figu rela i l lus tra tes . However , i f one c lock isfaulty and mis informs the other two c locks ,

    the two nonfaulty c locks cannot be syn-chronized. For examp le , in Figure 1b thefaulty clock B reports incorrec tly to c locksA an d C. A s a result , c locks A an d C d o n o tmake any correc tions because both behaveas if they are the median c lock.Lamport an d Melliar-Smith were the firstto study the three-clock synchronizationproblem in the presence of arbitrary faultbehavior . They coined the term Byzantinefault to refer to the fault model in which afaulty c lock can exhibi t a rbi trary behaviorincluding , but not limited to, misrepresent-ing its value to other c locks in the system.They sho wed that in the presence of Byz-antine faults , no a lgori thm can guaranteesynchroniza t ion of the nonfaulty c locks ina three-node system. They a lso showedthat 3m + 1 clocks are suff ic ient to ensuresynchron ization of the nonfaulty cloc ks inthe presence of m Byzantine faults . Thiscondition later proved not only sufficientbut a lso necessary for ensuring synchroni-zation in the presence of Byzantine faults.Since the initial study by Lamp ort and Mel-liar-Smith, the problem of clock synchroniza-tion in the presence of Byzantine faults hasbeen studied extensively by several other re-searchers. Al l this attention is mainly be-

    October 1990 @ @ 18~91h? i9Ol l0@ @ -0033$@ 1.001990 IEEE 33

  • 8/2/2019 14 Ref Import

    2/10

    cause many applications require aconsist entview of time across all nodes of ad istributedsystem, and this can be achieved only throughclock synchronization.Solutions proposed in the literature onclock synchronization take either a soft-ware or a hardware approach. The softwareapproach is flexible and economical, butadditional messages must be exchangedsole ly for sync hr~ niz a t io n . ' -~ecause theydepend on message exchanges, the worst-case skews guaranteed by most of thesesolutions are greater than the differencebetween the maximum and minimum mes-sage transit delay between any two nodes inthe system. The hardware approach, on th eother hand, uses special hardware at eachnode to achieve a tight synchronization withm i ni m al t im e o ~ e r h e a d . ~ - ~oweve r, the costof additional hardware precludes this ap-proach in large distributed syst ems unless avery tight synchronization is essential.Hardware solutions also require a separatenetwork of clocks that is different from theinterconnection network between the nodesof the distributed system.6Because of these limitations in the soft-ware and hardware approaches, research-ers have begun investigating a hybrid ap-proa c h . A ha rdwa re -a s s i s t e d sof twa resynchronization scheme has been proposedthat strikes a balance between the hard-ware requirement at each node and thec lo c k sk e ws a t t a i ~ ~ a b l e . ~nother recentlyproposed approach is probabilistic clocksynchronization, in which the worst-caseskews can b e made as small as desired."34

    However , depending on the desired worst-case skews, there is a nonzero probabilityof loss of synchronization. Also, this ap-proach may ind uce very high traffic to thesystem.This article compares and con trasts ex-isting fault-tolerant clock syn chronizationalgorithms. The worst-case clock skewsguaranteed by representative algorithmsare compared, along with other importantaspects such as time, message, and costove rhe a d im pose d by the a lgor i thm s .Special emphasis is given to more recentdevelopments such as hardware-assistedsoftware synchroniza t ion, probabil is t icclock synchronization, and algorithms forsynchronizing large, partially connecteddistributed systems.

    Preliminary conceptsFirst we define some of the conceptscommon to most c lock synchroniza t ionalgorithms and introduce the notation thatwill be used throughout the article (seesidebar on next page). We begin with thenotion of a clock.Definition 1: Time that is directly ob-servable in some particular clock is calledits clock time. This contrasts with theterm real time, which is measured in anassumed Newtonian time frame that isnot directly observable.It is convenient to d efine the local clock

    at a node as a mapping from real time toclock time. In other words, let C be amapping from real time to clock time, thenC(t)= T means that when the real time is t,the clock time at a particular node is T . W eadopt the convention of using lowercaseletters to denote quantities that representreal time and uppercase letters to denotequantities that represent clock time. Figure 2i l lustra tes the concept of ac lockfunction /mapping. A perfect clock is one in which aunit of clock time elapses for every unit ofreal time. If mo re or less than one unit ofclock time elapses for every unit of realtime, the clock is said to be fast or slow,respectively.

    Since a properly functioning clock is amonoto nic increasing function, its inversefunction is well defined. Let c(7') = C-'(T)= t denote this inverse function. In theliterature some of the results in clocksynchronization are formulated using theclock function, while others are formulat-ed using the inverse function. We will usesubscripts to distinguish between the dif-ferent clocks in the system. For example,C,(t) will denote the clock at node p , an dCJt) the clock at node q . A clock is con-sidered nonfaulty if th ere is a bound on theamount of deviation from real time for anygiven finite time interval. In other words,even nonfaulty clocks do not always main-tain perfect time.

    Definition 2: A clock c is said to benonfaulty during the real-time interval[ t l , 2 ] f it is a monotonic, differentiable

    COMPUTER

  • 8/2/2019 14 Ref Import

    3/10

    function on [ T I , 2 ]where c(T, )= t, , i =1,2, and for all T E [ T I , 2 ]

    for some constant p.The con stantp is sa idto be the drift rate of the nonfaulty clock.Instead of the above definition, somesynchroniza t ion a lgori thm s are def inedusing one of the followin g near-equivalentdefinitions for a nonfaulty clock:Definition 2a: A clock c is said to benonfaulty if there exists a constant p2such that for i l an d t2

    Defini t ion 2b: A clock c is said to benonfaulty if there exists a constant p 3such that for t l an d t2

    The near equivalence of these defini-tions follows from the Taylor series ex-pansion of ( 1 + p)-I:(1 +p)- '= l -p+p*-pP3+p4- . . .

    Fo r a typical value of p = the second-order and high-ord er terms can be ignored,thus implying p 2 = p 3 = p/2.The above notion of a clock does notimply any particular implementation. Infac t , some synchroniza t ion a lgori thms dealwith hardware clocks, while others dealwith logical clocks. Hardware clocks arethe actual clock pu lses that control circuitrytiming; logical clocks are values derivedfrom hardware clocks. For instance, a log-ical clock might be the value of a counterthat is incremented once every predeter-mined number of pulses of the hardwareclock. In either case, we can talk aboutsynchronization in terms of the abstractnotion of the mapping function from clocktime to real time. The main difference willbe in the granularity, or skew, of synchro-nization.

    Definition 3:Two clocks c , an d c2 re saidto be &synchronized at a clock t ime T ifand only if Icl(T) - c2(T)I 2 6. A set ofclocks are said to be wellsynchronized ifand only if any two nonfaulty clock s inthis set are 6-synchronized for somespecified constant 6.

    Notations used in this articleClock time at node p when the real time is tReal time when the clock time at node p is TMaximum drift rateof all clocks in the systemMaximum skew betw een any two nonfaulty clocks in the systemUpper bound on the read errorMaximum number of faulty clocks in the systemTotal numbe r of clocks in the systemResynchronization ntervalUpper bound on message transit delayNode p's perception of its skew with respect to clock at node q

    Because of the nonzero drift rates of allc locks , a set of cloc ks does not remain wellsynchronized without some periodic re-synchroniza t ion. This means tha t the nodesof a distributed system must periodicallyresynchronize their local clocks to main-tain a global time base across the entiresystem. Synchronization requires each no deto read the other nodes' clock values. Theactual mechanism used by a node to readother clocks differs from one algorithm toanother . In hardware algorithms the clocksignal from each of the other nodes (or anapprop riate subset of nodes) is an input tothe synchronization circuitry at each node.In software algorithms each node eitherbroadcasts its clock value to all nodes atspecified times or sends its clock valueindividually to requesting nodes.Regardless of the actual reading mech-anism, a node can obta in only an approx-imate view of its skew with respect to othernodes in the system. Errors occur mainlybecause it takes a finite and unpredictableamoun t of time to deliver a clock s ignal ora clock message from one nod e to another .In hardware a lgori thms, errors are duemainly to the unpredictable propagationdelays for clock signals, whereas in soft-ware algorithm s errors are due to variationin the message transit delays. Most of thesynchronization algorithms discussed inthis article are based on the assumptionthat if two nodes are n onfaulty, the error in

    .-cY85

    r

    Real timeFigure 2. The clock fun ction.

    reading each other's clock is bounded. Sincethe actual errors that occu r differ from on ealgorithm to ano ther, we will discuss themfurther as we describe each algorithm.The t ime a t which a node de cides to readthe clocks at other nodes also depends onthe a lgor i thm unde r c ons ide ra t ion . Inhardware a lgori thms, synchroniza t ion c ir-cuitry continuously monitors the frequen-cy and phase of all clocks . On the basis ofthe input signals, the circuitry also updatesthe local c lock continuously. Hardwarealgorithms can therefore be classified ascontinuous-update a lgori thms. Softwaresynchronization algorithms, on the other

    October 1990 35

  • 8/2/2019 14 Ref Import

    4/10

    hand, are usual ly discrete-update algo-rithms; that is, correction to a local clock iscomputed at discrete t ime intervals. How-ever, software algorithm s may differ in theway they apply this correction to the localclock. In some sof tware synchronizationalgorithms the correction is applied in-stantaneou sly, whereas in others it is spreadover a time interval.

    The t ime at which the correct ion iscomputed is determined by each node onthe basis of its own clock . The time intervalbetween successive correct ions is cal ledthe resynchronization interval, denoted R ,and is usually a constant known to allnodes. If T ( O j denotes the time at which thesystem began its operation, then the time ofthe ith resynchronization is f l i ) f l O )+ iR.The t ime interval R()= [Tc),T+ 1 is oftencalled the ith resynchronization interval.In discrete-update algorithm s, we can viewthe local clock of a node as a series offunct ions, one for each resynchronizationinterval. That i s , the clock at n od ep in the(i + 1)th resynchronization is given by

    where Cgrepresents the change in p sclocksince the start of the system.

    SoftwaresynchronizationThe basic idea of sof tware synchroniza-tion algorithms is that each node has alogical clock that provides a time base for

    all activities on that node. This logicalclock is der ived f rom the hardware clockon that node, though it usually has a muc hlarger granularity than the hardware clock.The algorithm executed by each node forsynchronizing the logical clocks can beviewed as a clock process invoked at theen d of every resynchronization interval.This clock process is responsible for per i-odically reading the clock values at othernodes and then adjusting the correspond-ing local clock value.Informally, any software synchronizationalgor i thm m ust sat isfy the fol lowing twocondi t ions:

    Agreement : The skew between al l non-faulty clocks in the system is bounded.Accuracy: The logical clocks keep upwith real time.

    The first condition states that there is aconsistent view of time ac ross all nodes inthe system. The second condi t ion states

    that this view of time is consistent withwhat is happening in the environment . Thesynchronization algorithms differ in theway these two conditions are specified, aswell as in the way the cloc k processes readthe clock values at other nodes. In somealgor i thms, a clock process requests andreceives a clock value f rom each of theother nodes, whi le in other algor i thms theclock process broadcasts i t s local clockvalue w hen it thinks it is time for a resyn-chronizat ion. The other nodes receive thebroadcast message and use the clock valuefor correcting themse lves at a later time. Inei ther case, the skew perceived by a receiv-ing clock process di f fers f rom the actualskew that exists between the two clocks,because of the errors in reading the clocks.These er rors are due mainly to unpredict -able var iat ion in the comm unicat ion delaybetween the two nodes. This not ion isformalized in the following discussion.

    Le t p an d q be any two nonfaul ty nodesin the system. Let T be the clock time ofnode q when the clock process at node qinitiates a broadcast of the local clock value.In other words, let the clock process atnode q initiate a broadcast of the localclock value at real time c,(T). This mes-sage wi l l reachp af ter some t ime delay, sayQ. That is , the broadcast message is de-l ivered to node p at real time cq(T)+ Q . Atthat instant, the clock times at nodesp an dspectively. That is, the actual skew exist-ing between p an d q at the instant when preceives this message is given by Cq(c,(T)+ Q )-C p ( c q ( T ) Q). However , nodep hasno way of determining C,(c,(T) + Q ) ex -actly. The best it can $0 is estimate thedelay Q by, say, 1 + p@ , and compute theskew as fol lows:

    4 are C,(c,(T) +Q ) an d Cq(c,(T) + Q ), e-

    This m eans that the error in the perceivedskew at p is

    This error is small if we can obtain a goo destimate of the communica tion delay and ifthe clock drifts of p an d q are comparable,especially d uring the communication delay.Al l synchronizat ion algor i thms discussedin this article are based on the assum ptionthat if p an d q are nonfaulty, then eq p sbounded.From the above discussion we can con-clude that the read error is not a seriousproblem if the algorithm is used to syn-chronize a ful ly connected system. How-

    ever, it can be a serious lim itation in futuredist r ibuted systems because system size isincreasing, and it is difficult to fully con-nect all the nodes in a large distributedsystem. This problem has been specificallyaddressed by some of the recent algo-rithms.9*0Software synchronization algorithms canbe classi f ied as convergence algor i thmswith averaging, convergence algor i thmswithout averaging, or consistency algo-rithms (see Figure 3).The leave s of the treein Figure 3are the representative softwaresynchronization algorithms surveyed in thisarticle. (For additional reading refer toSchneider . )

    Convergence-averaging algorithms.Conver gence- aver ag ing a lgo r i t hms a r ebased on the fol lowing idea: The clockprocess at each node broadcasts a resyncmessage when the local time equals T(O)+iR- S for some integer i and a parameter Sto be determined by the algorithm. Afterbroadcasting the clock value, the clockprocess waits for S time units. During thiswaiting period, the clock process collectsthe resync messages broadcast by othernodes. For each resync messa ge, the clockprocess records the time, according to itsclock, when the message was received. Atthe end of the waiting period, the clockprocess estimates the skew of its clockwith respect to each of the other nodes onthe basis of the times at which it receivedresync messa ges. It then compu tes a fault-tolerant average of the estimated skewsand uses it to correct the local clock beforethe start of the next resynchronization in-terval.The convergence-averaging algor i thmsproposed in the literature differ mainly inthe fault-tolerant averaging function usedto compute the correction. In algorithmCNV, each node uses the ar i thmet ic meanof the estimated skews as its correction.However, to limit the impact of faultyclocks on the mean, the estimated skewwith respect to each node is comparedagainst a threshold, and skew s greater thanthe threshold are set to zero before compu t-ing the mean. In contrast, with the algo-r i thm in L unde l ius - Welch and L ynch3(henceforth referred to as algorithm LL),each node limits the impact of faulty clocksby first discarding the rn highest and rnlowest estimate d skews and then using themidpoint of the remaining skews as itscorrect ion, where m i s the maximum num-ber of faulty clocks to be tolerated.One main l imitation of the convergence-averaging algorithms is that they are in-

    36 C O M P U T E R

  • 8/2/2019 14 Ref Import

    5/10

    tended for a fully connected network ofnodes. As a result, the algorithms are noteasily scalable. In addition, they all requireknown upper bounds on clock read errorand initial synchronization of the clocks.The need for initial synchron ization is nota serious problem, because there are sever-al algorithms to ensure it.3 However, theassumption about the bound on the clockread error is a serious problem because theworst-case skews guaran teed by these con-vergence-av eraging algorithm s are usuallygreater than the bound on the read error.In addition to the read error, the worst-case skews depend on the time intervalbetween the resynchronizations, the clockdrift rates, the numb er of faults to be toler-ated, the total number of clocks in thesystem, and the duration of the time inter-val during which the clock processes readthe clock values at other nodes. However,amon g all these factors, the read error hasthe greatest impact on the worst-case skew.LamportandMelliar-Smith' show the worst-case skew for a lgori thm CNV to bemax ([N/N - 3 r n ] [ 2 ~ p(R + 2[(N -m)/

    NlS)I,60 + PR Iwhere 6o is the maximum skew after theinitial synchronizatio n, N is the total num-ber of clocks in the system, rn is the max-imum num ber of faults to be tolerated, S isthe duration of the time interval duringwhich clock processes read the clock val-ues at other nodes, and E is the assumedbound on the read error. Similarly, in Lun-delius-Welch and Lynch3 the wors t-caseskew of algorithm LL was shown to be

    p +E+P(7P+ U+4E ) +4$(2 +p) (P+ U )where U is the maximum message transitdelay between any two nodes in the system,E is such that U - 2~ is the minimum mes-sage transit delay between any two nodes inthe system, and p is roughly 4(&+ pR).Since, typically, p is on the order of

    R is on the order of a few second s, Uan d S are on the order of a few milliseconds,an d E is on the order of a few microsecond s,p U

  • 8/2/2019 14 Ref Import

    6/10

    va l (F )- kU, fl)), where U is themaximum message transit delay be-tween any pair of nodes in the sys-tem.

    In S rik an th a nd T o ~ e g , ~owever , a nodeconsiders all resynchronization messagessuspicious. Therefore, a node is wi l l ing toresynchronize before i t s clock reaches thetime for the next resynchronization only ifit receives ame ssage from m + 1other nodesindicating that they have resynchronized.This is done to ensure that at least onenonfaulty nodes clock has reached thetime for resynchronization.

    T he main l im i t a t i on o f conver gence-nonaveraging algor i thms is that the worst -case skew is greater than the maximummessage t ransi t delay between any pair ofnodes. For instance, the worst -case ske wsof the algor i thms in Halpern e t al .* andS r ikan th and T oueg4 a r e (1 + p ) U + p ( 2 ++ U ( + p ) , espectively. S ince all the termsare posi t ive, the worst -case skew is great -er than the maximum message t ransi t de-l ay , U , n both algor ithms. Sinc e U can beon the order of tens of mil l i seconds, espe-cial ly in a large, par t ial ly conne cted dis-t r ibuted system, the worst -case skewsoffered by these algor i thms are also o nthe order of tens of mil l i seconds, whichmay not be acceptable in many appl ica-t ions.

    One of the unique features in Sr ikanthand Toueg s algorithm is that the accuracyof the logical clocks is guaranteed to beoptimal in the se nse that the drift rate of thelogical clocks is the sam e as the drift rate ofthe underlying hardware clocks. This is notnecessarily the case with other softwareclock synchronization algorithms, wherethe logical clocks are guaranteed only to bewithin a linear envelope of the hardwareclocks. Furthermore, in contrast to the al-gorithm in Halpern et al. , Srik anth andTouegs algor i thm does not require acloc kvalue to be sent in a resynchronizat ionmessage. T his character ist ic can be usedto el iminate some of the possible fai luremodes observed in a clock synchroniza-t i on scheme.

    p ) R an d [R(1 + P ) + U l W + p H 1 + P ) I

    Consistency algorithms. Consistency-based algorithms use a completely differ-ent principle than convergence-based al-gorithms. They treat clock values as dataand try to ensure agreement by using aninteractive consistency algorithm. Thisis a dist r ibuted algor i thm that ensuresagreement among nonfaulty nodes on th eprivate value of a designated sender nodethrough a series of message exchanges. By

    agreement we mean the following twocondi t ions are sat isf ied: receive three copies of ps value: onedirectly from p and the other two f rom the(1) All nonfaulty nodes agree on thesenders private value.(2) If the sende r is nonfaulty, the value

    agreed on by the nonfaulty nodes equalsthe senders private value.Note that nonfaulty nodes must agree onthe senders private value regardless ofwhether the sender is faulty or not. How-ever, if the sender is faulty, he value agreedon by the nonfaulty nodes need not equalthe senders pr ivate value.

    With consistency-based algorithms, atthe end of every resynchronization intervaleach node treats its local time as a privatevalue and uses an interactive consistencyalgorithm to convey this value to othernodes. From the clock values thus obtainedfrom all the other nodes, each node computesan estimate of the skew with respect toevery other node. Each node then uses themedian skew to correct the local clock atthe start of the next resynchronization in-terval.Like convergence algor i thms, consis-tency-based a lgorithm s require a bound onthe read error. Read errors occur in thesealgorith ms because there can be a slightdifference between the times at which twonodes, sayp and q , decide on the clock valueof another node, say r . Consequent ly , al -t houghp a ndq wi ll ag r ee on the clock valueof r , he est imate of the skew they computebetween their own clocks and r s clockmight be slightly different. In addition to abound on the read error, consistency-basedalgorithms also require certain conditionsfor correctly ex ecuting an interactive con-sistency algor i thm. These requirementsinclude condi t ions on the minimum num -ber of nodes, sufficient connec tivity, and abound on the message transit delay. Lam-por t , Shostak, and Pease provide moredetai ls on these requirements.Note that these algorithms require noassumption about initial synchronization.Nor do they require a direct connectionbetween all the nodes, although the algo-r i thms descr ibed by Lam por t and Mel l iar-Smith do require such a connect ion.The fol lowing example should clar i fythe basic idea of these algorithms. Con sid-er a four-node system with a maximum ofone Byzan t ine f au l t . In an interact iveconsistency algorithm,2 a node p wouldconvey i ts pr ivate value to other nodesusing the following scheme: Nod ep w ouldsend its value to every other node, which inturn would relay the value to the two re-maining nodes. That i s , each node would

    other two nodes. Each node would thenest im ateps value to be the median of thevalues in the three copies. This sch eme canbe modified for clock synchronization byletting each node independently use th einteractive consistency algorithm to conveyits clock value to other nodes. At the end offour applications of the interactiv e con-sistency algorithm (on e for each node ), thelocal clock at each node could be adjustedto equal the median of the other nodesclock values.

    This clock synchronization scheme canbe further extended to situations with morethan one Byzant ine faul t , ust l ike the inter-act ive co nsistency a lgor i thm.I2 However ,this will require at least m + 1 rounds ofmessage exchang e, where m is the maxi-mum number of faul ts to be tolerated, andthis impl ies more overhead. The worst -case skew guaranteed by consistency-basedalgorithms, however, is usually smallerthan those guaranteed by the converge nce-based algorithms. F or exam ple, if the totalnumber of clocks in the system equals 3m+ 1, then the worst-case skew of algorithmCOM in Lampor t and Mel liar -Smith wasshown to be (6 m + 4 ) ~(4 m + 3) pS + p Ras opposed to ( 6 m+2 ) ~(4 m +2) pS + (3m+ 1)pR or a lgo r i t hmCNV, w h e r e & , S , a n dR are as described earlier under Conver-gence-averaging algori thms.The number of messages required forconsistency-based algorith ms can be re-duced by using digital signatures to au -thenticate the messages. This is true ofalgorithm CSM in Lamport and Melliar-Smith . Algorithms with authentication re-quire fewer messages and fewer nodes totolerate a given number of faults. They alsoachiev e a tighter skew than algorithms thatd o not use authentication.

    Probabilistic synchronization. T h emain limitation of the algorithms discussedabove is that the worst-case skew dep endsheavily on the maximum read error. Thiscan be a ser ious problem in future dist rib-uted systems bec ause read erro rs are likelyto increase wi th system size. To remedythis problem, Cristian proposed a proba-bilistic synchronization scheme in whichthe worst -case skew s can be made as smal las desi red. However , the overhead im-posed by the synchronization algorithmincreases drastically as we reduce the skew.Furthermore, unlike other algorithm s wehave discusse d, this algor i thm does notguarantee synchronization with probabili-ty one. In fact, there is a non zero probabil-

    38 C O M P U T E R

  • 8/2/2019 14 Ref Import

    7/10

    ity of loss of synchronization, and thisprobab ility increases with ade creas e in thedesired skew.Cristians idea is to assume that theprobability distribution of message transitdelay is known and let each node makeseveral attempts to read the other clocks.After each attempt, the node can calculatethe maximum error that might occur if the

    clock value obtained in that attempt wereused to determine the correction. By retryingoften enough, a node can read the otherc locks to any given precis io n with prob-abil i ty as c lose to one as des ired. Thisscheme is particularly suitable for systemsha ving a m a s te r - s l a ve a r r a nge m e nt inwhich one c lock has been designated ore lec ted master and the other c lock s ac t ass laves .More specifically, each node periodical-ly sends a message to the master noderequest ing i ts c lock value . When the mas-ter receives this request, it respond s with amessage saying Th e time is T. When theresponse reaches the request ing node, itcalculates the total round trip delay ac-cording to its own clock . If D is the roundtrip delay as measured by a node p , and ifq is the master node, then Cristian provedthat if nodes p and q are nonfaulty, the localtime at node q when p receives the re-sponse messag e lies in the interval

    where U,,,,, is the minimum m essage transitde lay between p an d q . Therefore, themaximum read error is D ( 1 + 2 p ) - 2Urn,,.It follows from this result that if theround trip delay is small (close to 2Urnj,),then so is the read error. Cristians algorithmuses this observation to limit the read error.Each node is allowed to read the mastersclock repeatedly until the round trip delayis such that the maximum read error isbelow a given thresh old. Once a node readsthe masters clock to the desired precision,it sets its clock to that value.One limitation of this approach is theneed to restrict the numb er of read attemp ts,s ince these are direc t ly re la ted to theoverhead imposed by the algorithm. Thisimplies that a node may not alway s be ableto read the masters clock to the desiredprecision and hence cou ld result in loss ofsynchroniza t ion. A second limitation isthat it is not fault tolerant, since it is noteasy to detect the masters failure. Fur-thermore, algorithms to designate or electa new m aster are fa ir ly complex and t ime-consuming.

    HardwaresynchronizationThe pr inciple of hardware synchroniza-tion algorithms is that of a phase-lockedloop. The hardware c lock a t each node is anoutput of the voltage-controlled oscillator.The voltage applied to the o sc i l la tor comes

    from a phase detec tor whose output isproport ional to the phase error be tween thephase of its clock (the output of the voltage-controlled oscillator it is controlling) and areference signal generated by using theother clocks in the system. Thus, by ad-jus t ing the frequency of each individualclock to the reference signal, the clocks canalways be kept in lock-step with respect toone another .Algorithms. One of the first hardwaresynchroniza t ion a lgori thms res i l ient toByzantine faults was proposed originallyby T.B. S mith. This a lgori thm was devel-oped to synchronize the processors of the

    fault-tolerant m ultiprocessor (FT MP). 3 Th eFTM P uses only four c locks and was ex-pected to tolera te a maximum of one By z-antine fault, so Smiths a lgori thm wasspecifically develop ed for that situation. Inthe algorithm , each clock observes the otherthree c locks continuously and determinesits phase difference with each of them.These ph ase differences are arranged in anascending order , and the c lock corre-spondin g to the median phase difference isselected as the reference signal. Each clockuses a phase- locked loop to synchronizeitself with the referen ce signal it selected.The wors t-case skew in this a lgori thmdepend s on the characteristics of the phase-locked loop. Smith did not characterizethis wors t-case skew, but experimenta lresults on the FTMP h ave shown that theskews are around 50 nanoseconds. Thisshould be compared with the skews in thesoftware synchronization algorithms, whichare on the order of microseco nds. Surpris-ingly, this algorithm does not easily gen-eralize to a situation in which more thanone Byzantin e fault must be tolerated. Thisis because with a median select algorithmit is possible for Byzantine failures to par-tition the set of clocks into two or moreseparate cliques that are internally syn-chronized but are not synchronized withother cliques.As an illustration, consider a system ofseven clocks. Such a system should be ableto mask the failure of two clocks. Supposethat the five nonfaulty nodes (clocks) arenamed a, b, e , d , an d e and the faulty nodes

    (clocks) are named x an d y. If the trans-miss ion delays between c locks are n egli-gible, then the order of arrival of the clocksignals from the nonfau lty nodes should bethe same at all the nodes. If the faultyclocks behav e erratically, their order maybe seen differently by different nodes.Consider the fol lowing order ing of s ignalsas seen by the various nod es in the system:Order seen by a: x y a b c d eOrder seen by b: x y a b c d eOrder seen by c: a b c d e x yOrder seen by d : a b c d e x yOrder seen by e : a b c d e x yNow suppose each node uses the median

    select algorithm to select a reference signalto synchronize i ts own c lock. In the aboveexample , this would mean tha t each nodeselects the third clock, not counting its own,as the reference signal; that is, nodes a, b,c, d , an d e will synchron ize to nodes b,a, d ,c, an d e , respectively. As a result, there aretwo nonsynchroniz ing c l iques , [ a ,b ) an d[c, d, e ) .By nonsynchroniz ing c l iques wemean that a an d b are synchronized to eachother and e ,d ,an d e are synchronized to eachother, but there is nothing to prevent a an db from drifting far apart from c, d , an d e .To overcome this problem, Krishna, Shin,and Butler proposed tha t each node p us ethe f p ( N ,m)th clock signal as the referencesignal instead of the median sig nal, where

    and where N is the total num ber of clocks,m is the maximum number of faults to betolerated, and A,, is the position of nod ep inthe perceived order of the arriving clocksignals a t node p . They sho wed that if N Z3m + 1, then this function for selecting thereference signal will ensure that all thenonfaulty nodes remain synchronized re-ga rd le s s of how the fa u l ty node s be h a ~ e . ~For the situation abov e, N = 7 , m =2, an d

    Thus , a synchronizes to b, b synchronizesto c, c synchronizes to e ,d synchronizes toe , an d e synchronizes to c. This sequenc e ofsynchroniza tion among the nodes resultsin a s ingle c l ique conta ining a l l f ive non-faulty nodes .The problem with Krishna, Shin, andButlers scheme is that selection of thereference signal is based on the position ofthe local clock in the perceived sequence

    October 1990 39

  • 8/2/2019 14 Ref Import

    8/10

    loo 780 -60 -40 -20 -o ! , I , , I , I , I ,

    0 20 40 60 80 100 120Number of clocks

    Figure 4. Percentage of reduction over a fully connected network versus systemsize.

    of the incoming clock signals. This com-plicates the hardware at each node becausethe hardware must order the incoming clocksignals and dynamically cho ose a referencesignal by identifying the local clocks po-sition in the ordered sequence. To ove rcomethis problem, Vasanthavada and Marinosproposed a slight variation of the abovescheme, using tw o reference signals insteadof one.8

    In their scheme it is not necessary togenerate the com plete sequence of all in-coming signals. Instead, it is sufficient toidentify the ( m + 1)th and the ( N -m - 1)thclocks in the temporal sequence. This iseasy to do in hardware because we cancount the clock signals that have arrivedand record the identity of th e ( m + 1)th andth e ( N - m - 1) th clock signals. The ave r-age phase difference of the local clock withrespect to these two clock signa ls s used tocorrect the local clock in the subsequentcycles.

    El iminat ing major l imi tat ions inhardware synchronization. T he mainadvantage of the above hardware syn-chronization algorithms is that the worst-case skews are several orders of magni tudesmaller than those in software synchroni-zation algorithms. Ho wever, these hardwaresynchronization algorithms have two ma-jorlimita tions. Th e first is that the hardwaresynchronization algorithms as describedabove require a fully connected network ofclocks. Because of the many interconnec-

    tions in a fully connected network, thisrequirement means that the reliability ofsynchronization would be determined bythe failure rates of these interconnectionsrather than by the failure rate of the clocks.Furth ermo re, the large numbe r of intercon-nections would cau se proble ms of fan-inand fan-out.

    The second limitation is that the abovehardware synchronization algorithms arebased on the assumption that the trans-mission delays are negligible comparedwith other parameters relevant to the algo-rithms. In a large system , the physical sep-aration between a pair of clocks can beenough to result in non-negl igible t rans-mission delays. A signif icant di f ference inthe transmission delays between tw o pairsof nonfaulty clocks can change the order inwhich a nonfaulty clock perceives othernonfaulty clocks and thus affect the selectionof the reference signal.Thes e two limitations can be eliminatedfrom any of the above algo r i thms with thehelp of the recent dev elopments surveyedbe low.

    Interconnection problem . Shin and Ra-manathan recently proposed an intercon-nection strategy for synch ronizin g a largedistributed system using any of the aboveh a r d w a r e s y n c h r o n i za t i o n a l g o r i t h m s 6Their interconnection strategy requires only20-30 percent of the total number of in-terconnections in afully connected network.The idea behind their st rategy is to

    synchronize the cloc ks in the system at twodifferent levels. The clocks are first parti-tioned into several clusters. Each clockthen synchronizes itself not only with re-spect to al l the clocks in i t s own cluster butalso wi th one clock f rom each of the otherclusters. As aresul t of this mutual coupl ingbetween the clusters, the clusters remainsynchronized with respect to each other,and the system as a whole remains wellsynchronized.

    S h in and Ramana than f o r mula t e t heproblem of partitioning the clocks intoclusters as a small integer-programmingproblem. The object ive is to minimize thetotal number of interconnections subject tothe constraint that the fault tolerance re-quirement is satisfied. For each clock, thes o l u t i o n t o t h e i n t e g e r - p r o g r a m m i n gproblem identifies all other clocks that itshould moni tor . Shin and Ramanathadshow that this interconnection strategy doesnot result in loss of synchronization andthe worst-case skew increases by a factorof three as compared with the skew in afully connected network of clocks.The reduction in the total number ofinterconnections over a fully connectednetwork as a function of system size isshown in Figure 4.It follows from this plotthat the percenta ge of reduction increaseswith system size; this is precisely the sit-uation in which the number of intercon-nections is a serious problem.

    Transmission delay problem. The ef -fects of transmission delay in phase-lockedloops have been studied extensively in thecomm unication area but not in the presenceof Byzantine fau lts. Shin and Ramanathanshow that it is easy to incorporate ideasf rom the comm unicat ion area to take intoacco unt the presence of Byzantine faultsand non-negligible transmission delays.Figure 5 illustrates their underlying ideaand shows the hardware required at node ito determine the exact phase di f ferencebetween its clock and the clock at node j .Instead of one phase detector for everynode pair, as in other algorithms, this solu-tion requires two phase detectors and anaverager . The inputs to the f i r st phase de-tector, PD1 ,are theclock signals f romnodesi an d j . Since the clock signal from node jencounters a delay in reaching node i, th ephase di f ference detected by PDI doe s notrepresent the true phase difference betweenthe two clocks.The inputs to the second phase detector ,PD2, are the clock signals f rom node; andthe clock signal from node i that is returnedfrom node . n other words, the clock signal

    40 C O M P U T E R

    1 --

  • 8/2/2019 14 Ref Import

    9/10

    that nodej is monitor ing is re turned to nodei to be compared with the s ignal f rom nodej . Because of the transmission delays , thephase difference detec ted by PD2 a lso doesnot represent the true phase difference be-tween the two c locks . However , Shin andRamanathan prove that the average of theoutputs of PD, and PD,, e,,, is proport ionalto the exact phase difference between theclocks a t nodes i an d j regardless of thetransmiss ion delays .This t rue phase difference can then beused for any hardware synchroniza t ion a l-gorithm. The disadvantages of this approachare the need for tw o phase detec tors ins teadof one and the doubling of the numb er ofinterconnections. However, when this so-lut ion is combined with Shin and Ra-manathans interconnection scheme, weg et a viable solut ion even for a very largedistributed system.

    Hybrid synchronizationEarl ier we noted tha t the main drawbackof sof tware synchroniza t ion a lgori thms isthat the worst-case skews are at least aslarge as the var ia t ion in the message transi tde lay in the system. This can be a seriousproblem in a la rge dis tr ibuted system be-cause this var ia t ion can be substantia l . Forexample , in som e systems wors t-case mes-

    sage transit delays as la rge as 100 imes themean m essage transi t de lay have been ob-served. In other words, the worst-case skew sin such system s would be at least this largeif a software clock synchronization algo-r i thm were adopted. At the other extreme,hardware a lgori thms achieve very t ightskews with minimal overhead. The skewsin hardware algorithm s are typically on th eorder of tens of nanoseco nds,8 as opposedto tens of milliseconds in the softwarealgorithms. However, he hardware schemesare prohibitively expensive, especially inlarge distributed systems.T o remove the inherent l imita t ions ofboth the hardware and sof tware approaches ,Ramanathan, Kandlur , and Sh in recentlyproposed a hybrid synchroniza t ion schemethat strikes a balance between the c lockskews a t ta inable and the hardware require-ment.9Their scheme s par t icular ly sui tablefor large, partially c onnected homogen eousdis tr ibuted systems with point- to-pointinterconnection topologies such as hyper-cubes or meshes .This hybrid schem e s similar to algorithmCNV. As in CN V, e a c h node ha s a clockprocess that is responsible for maintainingthe t ime on tha t node. At a specified time inOctober 1990

    To node Clock at-T fl Delay T node iClockatnode i Delay

    Delay I At node i

    Figure 5. Elimination of transmission delay effects.

    the resynchroniza t ion interval , a clockprocess broadc asts the local clock value toall other c lock processes . The broadcastalgorithm is such that all clock processesreceive multiple copies of the clock mes-sa ge through node -di s jo in t pa ths . Thenumber of copies used in the broadcasta lgori thm depends on the maximum num-ber of faults to be tolerated and the faultmodel for the system. When a clock pro-cess receives a clock message sent by som eother c lock process , i t records the t ime(according to i ts local c lock) a t which themessage was received. Then, in accordancewith the broadcast algorithm, i t relays themessage to othe r c lock processes.Before re laying the message , a clockprocess appends to the m essage the t imeelapsed (according to i ts own c lock) s incereceipt of the message . At the end of theresynchronizat ion interval , i t comp utes theskews between the local clock and theclock of the source node fo r each copy i thas received. The c lock process then se lec tsth e ( m+ 1)th argest value as an est imate ofthe skew between the tw o c locks . The av-erage of the es t imated skews ov er a l l nodesis used as the correction to the local clock.A s in CNV, a m inim um of 3m + 1 nodes isrequired to tolerate m Byzantine faults.Ramanathan, Kandlur , and Sh ins algo-r i thm has severa l advantages over the a lgo-rithms discussed earlier. First, the sendingof clock values by different nodes occursthroughout the resynchronizat ion intervalrather than during the last S t ime units ofthe interval . For hyp ercube and mesh ar-chitectures, they show that the resynchro-nization interval in their algorithm is so

    la rge tha t there is a t most one node broad-casting its clock value at any given time.This prevents abrupt degradation in themessage del ivery imes, which occurs whenall nodes in the system broadcast c lockvalues a lmost s imultaneously. Second, theonly hardware requirement a t each node isfor t ime-s tamping of c lock messages . Thethird, and probably m ost important, advantageis that the worst-case skews are about two tothree orders of magnitude tighter than theskews in software schemes. Furthermore, be-cause of the hardware support, the worst-case skews are insensitive to variations inmessage transit delay in the system.To com pare the hybrid synchroniza t ionscheme with the var ious sof tware andhardware synchroniza t ion schemes, con-sider the worst-case skew this algorithmguarantees. Raman athan, Kandlur, and Shinshow that this worst-case skew is

    +- 6 , + p N U )( N - 3 m ) where the notation is the same as that usedfor character iz ing a lgori thm CNV underConvergence-averaging a lgori thms. Th esl ight difference between the above ex-press ion and the corresponding express ionfor a lgori thm CNV is tha t for a lgori thmCNV the read error E incorporated the ef-fects of message transit delays betweenany two of the system nodes . This is be-cause a lgori thm CNV was intended for aful ly connected system in which the mes-sage transit de lays are very small . In con-

    4 1

  • 8/2/2019 14 Ref Import

    10/10

    trast, the algo rithm in the hybrid schem e isintended for a partially connected systemin which the m essage transit delay can b efairly large. Therefore, the effect of mes-sage transit delay ( U ) n the skew has beenseparated from the effect of other readerrors (E).The key conclusion we can draw fromthe above expression is that the worst-caseskew is not an explicit function of U as inmost software synchronization algorithmsbut a function of p . U. As a result, fortypical values of p = an d E = 20 mi-croseconds, Ramanathan, Kandlur, and Shinshow that the worst-case skew in a 512-node hypercube with m = 2 is less than 20 0microseconds even when the maximummessage transit delays are as large as 50milliseconds. These skews are still muchlarger than those achieved by hardwaresynchron ization algorithms, but the cost ofhardware synch ronization for a system thislarge would be exorbitant.

    lock synchronization is an impor-tant matter in any fault-tolerant dis-C ributed system and has been ex-tensively studied in recent years. Unfortu-nately, the various solutions proposed aredifficult to com pare because they are pre-sented under different notations and as-sumption s. Additional difficulty arises be-cause of slight differences in the assumptionsmade by the different synchronization al-gorithms. In this article we classified andpresented several software and hardwaresynchronization algorithms as well as ahybrid synchronization algorithm, all us -ing a consistent notation. W e also identi-fied the assumptio ns each algorithm m akes.

    Software algorithms require nodes toexchange an d adjust their individ ual clockvalues periodically. Since the clock valuesare exchanged via message passing, thetime overhead these algorithms impo se canbe substantial, especially if a tight synchro-nizationis desired. They are herefore suitablefor applications that can tolerate a loosesynchronization between the system nodes.Hardwa re algorithms, on the other hand,use special hardware to ensure a very tightsynchronization. Although the overheadthey impose on the system is minimal, thehardware a lgori thms are too expensive touse for all but small systems.The hybrid scheme is cost-effective andachieves a reasonably tight synchroniza-tion. It is also suitable for synchronizinglarge, partially connected systems and henceis the most viable scheme for future dis-tributed systems. H

    AcknowledgmentsThe work reported in this article was supp ort-ed in part by NASA u nder grant No. AG - 1-296and the Office of Naval Research un der contrac tNO. N00014-85-K-0531.

    References1. L. Lamport and P.M. M elliar-Smith, Syn-chronizingClocks in the Presence ofFaults,J.ACM,Vol. 32,No. 1,Jan. 19 85,pp. 52-78.2. J .Y. Halpern et al., Fault-Tolerant ClockSynchronization, Proc . Third Ann. ACMSymp. Pr incip les of Dis t r ibuted Comput ing ,ACM, New York, 1984, pp. 89-102.3. J. Lundelius-Welch and N. Lynch, A NewFault-Tolerant Algorithm for Clock Syn-chronization, Information and Computa-t ion , Vol. 77, No. 1, 1988, pp. 1-36.4. T.K. Srikanth and S. Toueg , Optimal ClockSynchronization, J . ACM, Vol. 34, No. 3,July 1 987, pp. 626-645.5. C.M. Krishna, K.G. Shin, and R.W. Butler,

    Ensuring Fault Tolerance of Phase-LockedClocks, IEEE Trans . Computers , Vol. C-34, NO. 8, Aug. 1 985, pp. 752-756.6. K.G. Shin and P. Ramanathan, Clock Syn-chronization of a Large MultiprocessorSystem n the Presence of Maliciou s Faults,IEEE Trans . Computers , Vol. C-36, No. 1,Jan. 1987, pp. 2-12.7. K.G. Shin and P. Ramanathan, Transmis-sion Delays in Hardware Clock Sy nchroni-zation,IEEE Trans . Computers , Vol. c-3 7,NO. 11 , NOV.1988, pp. 1,465-1,467.8. N. Vasanthavada and P.N. Marinos, Syn-chronization of Fault-TolerantClocks in thePresence of Malicious Failures,lEE E Trans .

    Computers , Vol. C-37,No. 4, Apr. 1988 ,pp.440-448.9. P. Ramanathan, D.D. Kandlur, and K.G.Shin, Hardware-Assisted Software ClockSynchronization for Homogeneous Distrib-uted Systems,IEEE Trans .Computers ,Vol.C-39, No. 4, Apr. 1990, pp. 514-524.10. F. Cristian, Probabilistic Clock Synchroni-zation,Tech. ReportRJ6432 (62550), IBMAlmaden Research Center, Sept. 1988.11. F.B. Schneider, A Paradigm for ReliableClock Synchronization, Tech. Report T R-86-735, Computer Science Dept., CornellUniv., Ithaca, N.Y., Feb. 1986.12. L. L amport, R. Shostak, and M. Pease, TheByzantin e Generals Problem , ACM Trans .Programming Languages and Sys tems ,Vol.4, No. 3, July 1982, pp. 382-401.13. A.L. Hopkins, T.B. Sm ith, and J.H. Lala,FTMP- Highly Reliable Fault-Toler-ant Mu ltiprocessor for Aircraft,P r o c . EEE.Vol. 66, NO,10,Oct. 197 8, pp. 12 21-1240.

    4L

    Parameswaran Ramanathan is an assistantprofessor of electrical and computer engineer-ing at the University of Wisconsin, Madison.From 1984 o 1989 he was aresearch assistant inthe Department of Electrical Engineering andComputer Science at the University of Michi-gan, Ann Arbor. His activities have focused onfault-tolerant computing, distributed real-timecomputing, VLSI design, and computer archi-tecture.Ramanathan received the B. Tech deg ree fromthe Indian Institute of Technology, Bombay,India, in 1984 and MSE and PhD degrees fromthe University of Michigan, Ann Arbor, in 1986and 198 9, respectively. He is a member of theIEEE Computer Society.

    Kang G. Shin joined the University of Michi-gan, Ann Arbo r, in 198 2, where he is currentlya professor of electrical engineering and com-puter science. In 1985he founded the Real-TimeComputing Laboratory, where he and his col-leagues are currently building a 19-node hexag-onal mesh multicomputer to validate variousarchitectures and an alytic results in the area ofdistributed real-time computing.Shin received a BS in electronics engineeringfrom Seoul National University,South Korea, in1970 and MS and PhD degrees in electricalengineering from Cornell University, Ithaca,New York, in 197 6 and 19 78, espectively. He isa member of the IEEE Computer Society.

    Ricky W . Butler is a research engineer at theLangley Research Cen ter. His research interestslie in the design and validation of fault-tolerantcomputer systems used for flight-critical appli-cations. He received his BA in mathematics in1976 and his MS in computer science in 1978,both from the University of Virginia.Inquiries to the authors can be addressed toParameswaran Ramanathan, Dept. of Electricaland Co mputer Engineering, University of Wis-consin, Madison, 3436 Engineering Bldg., 1415Johnson Dr., Madison, WI 53706-1691.

    COMPUTER