
© 2010 IBM Corporation

Server Time Protocol Recovery Considerations

Noshir Dhondy ([email protected])


Agenda

• ETR and STP Recovery Concepts
  – Recovery design rules and terminology
  – Switch to Local Timing mode
• Mixed Coordinated Timing Network (Mixed CTN) recovery
  – Failure scenarios
• STP-only CTN recovery (Backup Time Server (BTS) assigned)
  – Server Offline Signal (OLS), Console Assisted Recovery
  – Failure scenarios
• STP-only CTN recovery (BTS and Arbiter assigned)
  – Arbiter Assisted Recovery
  – Failure scenarios
• STP-only CTN recovery with Internal Battery Feature (IBF)
• Site failure scenarios
• External Time Source (ETS) Recovery
  – ETS Recovery using NTP Servers
  – ETS Recovery using NTP Servers with PPS


STP-only CTN Terminology

• CTN
  – Collection of servers that are time synchronized to a time value called Coordinated Server Time (CST)
• Server/CF roles
  – Preferred Time Server/CF (PTS)
    • Server that is preferred to be the Stratum 1 server
  – Backup Time Server/CF (BTS)
    • Role is to take over as the Stratum 1 under planned or unplanned outages, without disrupting synchronization capability of STP-only CTN
  – Current Time Server/CF (CTS)
    • Active S1 Server/CF
      – Only one S1 allowed
      – Only the PTS or BTS can be assigned as the CTS
      – Normally the PTS is assigned the role of CTS (Active S1)
      – BTS typically is the Inactive S1
      – BTS can take over as Active S1, or be assigned Active S1 for planned actions
      – PTS is the Inactive S1 in those cases
  – Arbiter
    • Provides additional means to determine if BTS should take over as the CTS under unplanned outages


ETR/STP availability/recovery requirements

• Availability
  – When primary source of time fails, applications that depend on time synchronization can continue processing with data integrity.
    • Parallel Sysplex
    • GDPS customers having multi-site sysplex require Site 2 systems to continue processing when Site 1 fails and vice versa
    • z/OS Global Mirror (XRC) that uses time stamps associated with data updates to make sure secondary copy of the data is consistent
    • Non-sysplex applications that may use other than coupling links for messaging
• ETR/STP recovery must ensure data integrity when time consistency cannot be maintained
  – Availability can be compromised but not data integrity
  – Current designs (ETR and STP) have failure scenarios where availability is compromised, resulting in z/OS systems posting a WTOR


Sysplex Timer Recovery design rule

• Sysplex Timer (ST) design rule to ensure data integrity
  – If Sysplex Timers lose capability to synchronize, one of the Sysplex Timers must disable transmission to servers
• Sysplex Timer (ST) detects a failure
  – Failing ST transmits 'Going away signal' Off Line Sequence (OLS) Symbol on Control Link Oscillator (CLO) links
• If STs lose capability to synchronize
  – Primary ST continues to transmit ETR signals (if Primary is operational), OLS received or not
  – If Secondary ST receives OLS
    • Secondary ST becomes Primary ST
  – If Secondary ST does not receive OLS
    • Secondary ST discontinues transmission of ETR signals

[Figure: External Time Reference (ETR) Network, ETR Network ID = 15. Sysplex Timers 9037A (Primary) and 9037B (Secondary) are joined by CLO links and connect over active and alternate ETR links to z10 (P1), z9 (P2), and z890 (P3) in a Parallel Sysplex. The servers are connected by ISC-3 links (Peer Mode) and ICB-3 links.]


STP recovery design rules and overview

• CANNOT have two Stratum 1 servers in timing network
• Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
  – Preferred Time Server (PTS) can indicate it has “failed”
    • PTS, if operational, MUST surrender role of CTS
  – BTS can unambiguously determine the PTS has “failed”


Switch to Local Timing Mode

• Server in ETR network or CTN becomes unsynchronized (S0 in CTN):
  – z/OS system images running in ETR or STP timing mode switch to local timing mode.
  – Impact of switching depends on
    • PLEXCFG parameter in IEASYSxx, and
    • ETRMODE or STPMODE specified in CLOCKxx.
  – z/OS systems that specify:
    • PLEXCFG=MULTISYSTEM or PLEXCFG=ANY in IEASYSxx, and
    • ETRMODE YES or STPMODE YES in CLOCKxx
  – Issue a WTOR message to allow operator intervention to resolve the problem before a wait state is loaded
    • z/OS systems that specify ETRMODE YES and are running in ETR timing mode issue WTOR message IEA015A.
    • z/OS systems that specify STPMODE YES and are running in STP timing mode issue WTOR message IEA394A.
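The message selection above can be sketched as a small lookup. This is an illustrative sketch only: the function name and its parameters are hypothetical, and it returns None for any combination the slide does not describe rather than guessing at z/OS behavior.

```python
from typing import Optional


def wtor_on_local_timing_switch(plexcfg: str, timing_mode: str,
                                etrmode: bool, stpmode: bool) -> Optional[str]:
    """Return the WTOR message ID the slide associates with a switch to
    local timing mode, or None when the slide does not cover the case.

    plexcfg     -- PLEXCFG value from IEASYSxx (e.g. "MULTISYSTEM", "ANY")
    timing_mode -- "ETR" or "STP": the mode the image was running in
    etrmode     -- True if ETRMODE YES is specified in CLOCKxx
    stpmode     -- True if STPMODE YES is specified in CLOCKxx
    """
    # The slide only covers PLEXCFG=MULTISYSTEM and PLEXCFG=ANY.
    if plexcfg not in ("MULTISYSTEM", "ANY"):
        return None
    if timing_mode == "ETR" and etrmode:
        return "IEA015A"
    if timing_mode == "STP" and stpmode:
        return "IEA394A"
    return None
```

For example, an image with PLEXCFG=MULTISYSTEM and STPMODE YES that was running in STP timing mode maps to IEA394A.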


WTOR – IEA015A

• WTOR allows time window to correct the problem and respond “RETRY” if problem corrected or “ABORT” if problem cannot be corrected
  – “ABORT” will load wait state 0A2-114
• Secondary Sysplex Timer can be reconfigured to an Expanded Basic configuration (single Sysplex Timer) to restart transmission of ETR signals before WTOR messages responded to with “RETRY”
• New function in z/OS 1.7 for Sysplex Failure Management (SFM) to recognize that WTOR IEA015A issued


WTOR – IEA394A

• WTOR allows time window to correct the problem and respond “RETRY” if problem corrected or “ABORT” if problem cannot be corrected
  – “ABORT” will load wait state 0A2-158
• Backup Time Server or another operational server in the CTN can be reconfigured to be the Current Time Server (CTS) before WTOR messages responded to with “RETRY”
• New function in z/OS 1.7 for SFM to recognize that WTOR IEA394A issued


IEA394A WTOR

Important: Priority message checkbox must be selected when responding to WTOR


Sysplex Failure Management (SFM) considerations

• SFM allows an installation to code a policy that defines the recovery actions to be initiated automatically following detection of a Parallel Sysplex failure.
  – Actions include fencing off the failed image (which prevents its access to shared resources), logical partition deactivation, or dynamic storage reconfiguration.
• New function in z/OS 1.7 and higher for SFM to recognize that WTOR IEA015A or IEA394A has been issued
  – If the WTOR message is issued by all the z/OS images in the sysplex, the user is not time constrained to do timing network reconfiguration before replying to IEA394A or IEA015A.
  – Once the WTOR on the first system image has been responded to with “RETRY”, XCF allows a limited delay to respond to the last outstanding WTOR message IEA394A or IEA015A:
    • Number of z/OS images in sysplex less than or equal to 8: four (4) minutes
    • Number of z/OS images greater than 8: number of z/OS images × 30 seconds
  – z/OS system images will enter disabled-wait states should the user not be able to respond to the IEA394A or IEA015A WTOR message in the allotted time.
  – If the message is issued only on a subset of participating sysplex images, the SFM settings specified in the SFM Policy must be considered
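The response window described on this slide is simple arithmetic, sketched below. The function name is hypothetical; the two branches come directly from the slide (4 minutes for up to 8 images, otherwise 30 seconds per image).

```python
def wtor_response_window_seconds(num_images: int) -> int:
    """Seconds XCF allows for answering the last outstanding WTOR
    (IEA394A or IEA015A) after the first one is answered with RETRY,
    per the rule stated on the slide."""
    if num_images < 1:
        raise ValueError("a sysplex has at least one z/OS image")
    if num_images <= 8:
        return 4 * 60            # four minutes
    return num_images * 30       # 30 seconds per image


# A 12-image sysplex gets 12 x 30 = 360 seconds (6 minutes).
print(wtor_response_window_seconds(12))
```

Note that the two branches agree at the boundary: 8 images × 30 seconds is also 4 minutes.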


STP Recovery terminology

• Coordinated Server Time
  – Coordinated Server Time (CST) represents the time for the CTN and is the time at a Stratum 1 server
• Synchronization check threshold
  – Server/CF considered to be in synchronized state if TOD clock within synchronization check threshold of CST
  – STP synchronization check threshold 50 microseconds
  – If TOD clock differs from CST by more than +/- 50 microseconds, server/CF becomes unsynchronized
    • Can become a Stratum 0 (S0) server/CF
• Freewheel Interval
  – Amount of time a Stratum 2 or Stratum 3 server can remain synchronized without receiving messages from its clock source
    • Approximately 1 second (Mixed-CTN)
    • Approximately 10 seconds (STP-only CTN)
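The two thresholds above can be expressed as simple checks. This is a sketch under stated assumptions: the constant and function names are hypothetical, and the freewheel intervals are treated as exact values even though the slide gives them only as approximate.

```python
SYNC_CHECK_THRESHOLD_US = 50.0           # from the slide: +/- 50 microseconds
FREEWHEEL_INTERVAL_S = {                 # from the slide, approximate values
    "mixed": 1.0,
    "stp_only": 10.0,
}


def is_synchronized(tod_offset_us: float) -> bool:
    """True if the TOD clock is within the synchronization check
    threshold of CST. tod_offset_us is TOD minus CST in microseconds."""
    return abs(tod_offset_us) <= SYNC_CHECK_THRESHOLD_US


def freewheel_expired(seconds_without_messages: float, ctn_type: str) -> bool:
    """True if a Stratum 2 or 3 server has gone longer than the freewheel
    interval without messages from its clock source."""
    return seconds_without_messages > FREEWHEEL_INTERVAL_S[ctn_type]
```

The same two seconds of silence from the clock source exceeds the freewheel interval in a Mixed CTN but not in an STP-only CTN.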


Improved RAS for S1 servers/CFs – Mixed CTN

• ETR Network recovery same as before
• Improved RAS over ETR network for Stratum 1 servers/CFs
  – If either z990 or z890 loses all timing signals from the Sysplex Timers, they are capable of becoming Stratum 2 (S2) servers in a Mixed CTN

[Figure: Mixed CTN, CTNID=ITSOPOK-15, ETR Network ID = 15. Sysplex Timers 9037A (Primary) and 9037B (Secondary) are joined by CLO links and connect over active and alternate ETR links to z990 (S1, P1), z900 (P2), and z890 (S1, P3) in a Parallel Sysplex. ISC-3 links carry CF + STP messages; ICB-3 links are also shown.]


Improved RAS for S1 servers/CFs – Mixed CTN (continued)

• Improved RAS over ETR network for Stratum 1 (S1) servers/CFs
  – If either z990 or z890 loses all timing signals from the Sysplex Timers, they are capable of becoming Stratum 2 (S2) servers in a Mixed CTN
• Example:
  – z890 loses both ETR links
  – Can synchronize to z990 and become a S2 server

[Figure: same Mixed CTN as the previous slide (CTNID=ITSOPOK-15, ETR Network ID = 15, Sysplex Timers 9037A Primary and 9037B Secondary, active and alternate ETR links, CLO links). z990 (P1) remains S1, z900 is P2, and z890 (P3) is now S2. ISC-3 links carry CF + STP messages; ICB-3 links are also shown.]


Coupling link recovery – Mixed CTN example

• Link recovery same for Mixed CTN and STP-only CTN
• Both ISC-3 links between z990 and z890 established as paths that can be used for STP message exchanges
• Only one established path used to exchange STP messages for synchronization
  – Messages exchanged every 64 ms
  – z890 synchronized to z990 using ISC-3 link (1)
  – ISC-3 link (2) is an established path NOT exchanging messages.
• If ISC-3 link (1) fails
  – Redundant ISC-3 link (2) will be used for synchronizing z890 to z990

[Figure: Mixed CTN, CTNID=ITSOPOK-15, ETR Network ID = 15, with ETR links and CLO links to the Sysplex Timers. z990 (S1, P1), z900 (P2), and z890 (S2, P3) are in a Parallel Sysplex. ISC-3 links (1) and (2) connect z990 and z890; other ISC-3 links are labeled "CF messages only"; ICB-3 links are also shown. Annotation: ISC-3 link (2) will be used for synchronization if ISC-3 link (1) fails.]
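The failover between established paths can be pictured with the sketch below. It is only an illustration of the behavior the slide describes (stay on the current path, fall back to a redundant established path when it fails); the function name and the first-available selection order are assumptions, not the actual STP path selection algorithm.

```python
from typing import Optional, Sequence, Set


def select_sync_path(established: Sequence[str], failed: Set[str],
                     current: Optional[str] = None) -> Optional[str]:
    """Pick the path used to exchange STP messages for synchronization.

    Keeps the current path while it is healthy; otherwise returns the
    first established path that has not failed, or None if none is left.
    """
    if current is not None and current in established and current not in failed:
        return current
    for path in established:
        if path not in failed:
            return path
    return None


paths = ["ISC-3 link (1)", "ISC-3 link (2)"]
# Link (1) fails, so the redundant link (2) takes over.
print(select_sync_path(paths, {"ISC-3 link (1)"}, current="ISC-3 link (1)"))
```

When the last established path has failed the function returns None, which corresponds to the last-link-failure scenarios covered later in the deck.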


Mixed-CTN (Stratum 1 failure)

• At least two Stratum 1 servers recommended in a Mixed CTN to avoid single point of failure
• STP messages exchanged between S2 and all available S1s
  – Algorithm selects one of the available S1s as clock source
• In this example:
  – z890 synchronized to z9 EC using ISC-3 link (1)
  – z890 and z990 also exchanging messages using ISC-3 link (4)
• If z9 EC fails or is taken down for planned outage,
  – z890 selects z990 as clock source, exchanging messages using ISC-3 link (4)

[Figure: Mixed CTN, CTNID=ITSOPOK-15, ETR Network ID = 15, with ETR links and CLO links to the Sysplex Timers. z9 EC (S1, P1), z900 (P2), z890 (S2, P3), and z990 (S1, P4) are in a Parallel Sysplex, connected by ISC-3 links numbered (1) to (4). Annotation: ISC-3 link (4) will be used for synchronization if z9 EC fails or has planned outage.]


STP-only CTN with 2 servers/CFs

• CTN only has a PTS and BTS assigned
  – Arbiter NOT ASSIGNED
• Assumption: PTS also assigned the CTS role
• CANNOT have two Stratum 1 servers in timing network
• Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
  – Preferred Time Server (PTS) can indicate it has “failed”, or
  – BTS can unambiguously determine the PTS has “failed”
• PTS, if operational, MUST surrender role of CTS
• Combination of:
  – Server Offline Signal (OLS, the channel going away signal) and
  – Console Assisted Recovery (CAR)
• Used to determine if BTS can take over as CTS


Server Offline Signal (OLS)

• Server Offline signal (OLS) transmitted on a channel by the server to indicate that the channel is going offline
• Signals are independent of STP
• Conditions when OLS transmitted by server include:
  – Server or LPAR dump
  – Server Power off
  – Chpid configure off
• OLS may not be transmitted for certain failures:
  – Server or site power outage
  – Channel subsystem fails
  – System Assist Processor (SAP) recovery
  – Link failures


Console Assisted Recovery (CAR)

• CAR uses HMC/SE LAN to determine
  – CTS has failed or operational
  – BTS can take over as CTS
• BTS initiates CAR process when:
  – BTS has lost communication with the CTS
• BTS sends command to its Support Element (SE) to determine the state of the CTS
• BTS SE communicates via HMC with CTS SE
• If CTS state determined to have “failed”
  – BTS takes over as CTS
• If CTS state “good” or “indeterminate”
  – BTS CANNOT take over as S1
  – BTS eventually becomes unsynchronized at end of Freewheel Interval

[Figure: P000STP2 (z9 EC, PTS/CTS, S1, P1) and SCZP101 (z990, BTS, S2, P2) in a Parallel Sysplex, connected by coupling links. The z9 EC SE and z990 SE are both attached to the HMC.]
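The CAR outcome for the BTS reduces to a three-way decision, sketched below. The enum and function names are hypothetical, and the returned strings are just labels for the outcomes listed on the slide.

```python
from enum import Enum


class CtsState(Enum):
    """CTS state as reported back to the BTS over the HMC/SE LAN."""
    FAILED = "failed"
    GOOD = "good"
    INDETERMINATE = "indeterminate"


def bts_action_after_car(cts_state: CtsState) -> str:
    """Outcome for the BTS once Console Assisted Recovery has reported
    the state of the CTS."""
    if cts_state is CtsState.FAILED:
        return "take over as CTS"
    # "good" or "indeterminate": a second Stratum 1 must never appear,
    # so the BTS may not take over.
    return "cannot take over; unsynchronized at end of freewheel interval"
```

The key design point is that only a positive "failed" result permits takeover; an indeterminate result is treated the same as a healthy CTS.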


OLS and CAR Recovery Rules

• Applicable in an STP-only CTN when optional BTS assigned, but Arbiter NOT assigned
• OLS rules applicable when two or more links between servers
• If Backup Time Server (BTS) receives OLS on the last two established STP paths to Current Time Server (CTS) within two seconds:
  – BTS takes over as CTS (S1)
  – CAR used to confirm PTS has failed or has surrendered as CTS
• If the PTS/CTS has sent OLS on the last two established STP paths to BTS within two seconds:
  – PTS will surrender its role of CTS
• If only a single link between PTS and BTS, or OLS on the last two established STP paths received more than 2 seconds apart:
  – CAR used to determine if BTS can take over as CTS
  – OLS rules do not apply
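The rules above choose between two recovery mechanisms, which the sketch below makes explicit. The function name, parameter names, and return labels are hypothetical; only the two-link and two-second conditions come from the slide.

```python
from typing import Optional


def bts_recovery_mechanism(num_links: int,
                           ols_gap_seconds: Optional[float]) -> str:
    """Which mechanism decides whether the BTS may take over (no Arbiter).

    num_links       -- established STP paths between PTS and BTS
    ols_gap_seconds -- time between OLS arriving on the last two
                       established paths, or None if OLS was not
                       received on both of them
    """
    ols_rules_apply = (
        num_links >= 2
        and ols_gap_seconds is not None
        and ols_gap_seconds <= 2.0
    )
    if ols_rules_apply:
        # BTS takes over as CTS; CAR then confirms the PTS state.
        return "OLS"
    # Single link, OLS missing, or OLS more than 2 seconds apart.
    return "CAR"
```

With two links and OLS received 1.5 seconds apart the OLS rules apply; with a single link, or OLS 3 seconds apart, the decision falls to CAR.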


CTS failure – OLS on last two paths received within 2 secs

• If BTS (SCZP101) receives OLS on last two STP paths to CTS (P000STP2) within 2 seconds
  – BTS takes over as CTS (S1)
  – To assure only 1 CTS
    • PTS surrenders role of CTS
    • CAR confirms CTS has failed
• z/OS systems on P000STP2 may have posted WTOR (IEA394A)
• z/OS systems on SCZP101 not affected
• STP user actions:
  – Repair CTS (P000STP2)
  – STP does an automatic re-takeover
    • P000STP2 joins as S2
    • Retakes role of CTS after verification checks
    • SCZP101 becomes S2

[Figure: before and after views of P000STP2 (z9 EC, PTS/CTS, P1) and SCZP101 (z990, BTS, P2) in a Parallel Sysplex, connected by coupling links, with the z9 EC SE and z990 SE attached to the HMC. Before: P000STP2 is S1 and SCZP101 is S2. After recovery: P000STP2 is S0 and SCZP101 is S1.]


CTS failure – OLS on last two paths NOT received within 2 seconds; CAR unsuccessful

• BTS does not receive OLS on last two established STP paths to CTS within 2 seconds:
  – BTS initiates “Console assisted recovery”
  – BTS (SCZP101) SE attempts to determine state of CTS (P000STP2) by communicating via HMC with CTS SE
• CTS (P000STP2) state “indeterminate”
  – BTS CANNOT take over as S1
  – BTS eventually becomes unsynchronized at end of Freewheel Interval
• z/OS systems (STPMODE YES) post WTOR (IEA394A)
• STP user actions
  – Reassign BTS as CTS
  – Respond with Retry to WTOR
  – NOTE: When PTS rejoins, it will not re-takeover role of CTS, since roles reassigned

[Figure: before and after views of P000STP2 (z9 EC, PTS/CTS, P1) and SCZP101 (z990, BTS, P2) in a Parallel Sysplex, connected by coupling links, with the z9 EC SE and z990 SE attached to the HMC. Before: P000STP2 is S1 and SCZP101 is S2. After recovery (assuming CAR is unsuccessful): SCZP101 is S0, and the caption labels both servers S0.]


Reconfiguration after CTS Failure – BTS unsynchronized (S0)

• Select System (Sysplex) Time task of SCZP101
  – Server that will become the new CTS after reconfiguration
• Select Network Configuration tab
• Assign SCZP101 as BTS and CTS
• Select “Force configuration”
  – Since starting from Stratum 0
• Respond “Retry” to each WTOR (IEA394A) posted
• Note that after responding to the first WTOR, the remaining WTORs in the Sysplex have to be responded to within approximately 4 minutes if up to 8 z/OS images (additional 30 secs per image if more than 8 images)


Last Link Failure

• When multiple links configured between PTS and BTS, a single link failure results in BTS selecting redundant link
• Failure of last Coupling link between BTS and CTS
  – CTS/PTS not affected
  – BTS loses communication with CTS
  – BTS initiates “Console assisted recovery”
• CTS (PTS) state “good”
  – BTS unsynchronized
  – z/OS systems (STPMODE YES) on BTS post WTOR (IEA394A)
• STP user actions
  – Repair “failing” link
    • BTS joins CTN as S2
  – Respond with Retry to WTOR

[Figure: before and after views of a z9 EC (PTS/CTS, P1) and a z990 (BTS, P2) in a Parallel Sysplex, connected by a single coupling link, with the z9 EC SE and z990 SE attached to the HMC; server labels SCZP901 and SCZP101. Before: the PTS/CTS is S1 and the BTS is S2. After recovery: the PTS/CTS remains S1 and the BTS is S0.]


STP-only CTN with 3 or more servers/CFs

• CTN has a PTS, BTS, and Arbiter assigned
• Assumption: PTS also assigned the CTS role
• CANNOT have two Stratum 1 servers in timing network
• Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
  – Preferred Time Server (PTS) can indicate it has “failed”, or
  – BTS can unambiguously determine the PTS has “failed”
• PTS, if operational, MUST surrender role of CTS
• Arbiter Assisted Recovery used to determine if BTS can take over as CTS


Arbiter Assisted Recovery

• Arbiter provides additional means to determine if BTS can take over as the CTS
• Handles failure scenarios when OLS may not be sent by CTS or received by BTS
• If BTS loses communication on all established paths to CTS
  – BTS does not invoke OLS recovery rules
  – BTS and Arbiter communicate to establish if Arbiter also has lost communication on all established paths to CTS
• If both BTS and Arbiter cannot communicate with CTS
  – BTS takes over as CTS (S1)
• Failure also implies PTS cannot communicate with BTS and Arbiter
  – Since only 1 CTS (S1) can exist,
  – PTS initially surrenders role of CTS
    • Has to assume that BTS has taken over as CTS
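Both sides of Arbiter Assisted Recovery apply the same two-out-of-three reasoning, which the sketch below spells out. The function and parameter names are hypothetical; each flag means "has lost communication on all established paths to that server".

```python
def bts_takes_over(bts_lost_cts: bool, arbiter_lost_cts: bool) -> bool:
    """BTS side: take over as CTS only when both the BTS and the Arbiter
    have lost communication with the CTS."""
    return bts_lost_cts and arbiter_lost_cts


def pts_surrenders(pts_lost_bts: bool, pts_lost_arbiter: bool) -> bool:
    """PTS side: if the PTS can reach neither the BTS nor the Arbiter,
    it must assume the BTS has taken over and surrender the CTS role."""
    return pts_lost_bts and pts_lost_arbiter


# A CTS that is isolated from both peers sees the mirror image of what
# the peers see, so exactly one Stratum 1 remains after recovery.
print(bts_takes_over(True, True), pts_surrenders(True, True))
```

If only the BTS has lost the CTS while the Arbiter still has connectivity, bts_takes_over returns False and the current CTS keeps its role.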


CAR – BTS and Arbiter assigned

• Console assisted recovery uses HMC/SE LAN to determine
  – BTS can take over as CTS (initiated by BTS case below)
  – PTS can retake role of CTS (initiated by PTS case below)
• BTS initiates Console Assisted recovery process when:
  – BTS has lost communication with the CTS, and
  – BTS cannot communicate with the Arbiter to initiate the Arbiter Assisted recovery process.
  – Works the same way as case when no Arbiter assigned
• PTS initiates Console Assisted recovery process when:
  – PTS has lost communication with the BTS and the Arbiter
  – PTS initially surrenders role of CTS
    • Has to assume that BTS has taken over as CTS
  – Needs to determine if BTS failed or operational
  – If BTS determined to have failed
    • PTS retakes its role of CTS
  – If BTS state good (BTS is capable of taking over as CTS) or “indeterminate”
    • PTS either becomes a S3 server if a S2 clock source is available, or
    • PTS becomes unsynchronized at end of Freewheel period
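The PTS-initiated case can be sketched as follows. The enum, function name, and return labels are hypothetical; the three outcomes are the ones listed on the slide.

```python
from enum import Enum


class BtsState(Enum):
    """BTS state as reported to the PTS over the HMC/SE LAN."""
    FAILED = "failed"
    GOOD = "good"
    INDETERMINATE = "indeterminate"


def pts_action_after_car(bts_state: BtsState,
                         s2_clock_source_available: bool) -> str:
    """Outcome for a PTS that has surrendered the CTS role and then used
    Console Assisted Recovery to check on the BTS."""
    if bts_state is BtsState.FAILED:
        return "retake CTS role"
    # BTS is good or indeterminate: it may already be the Stratum 1.
    if s2_clock_source_available:
        return "become S3"
    return "unsynchronized at end of freewheel period"
```

As on the BTS side, only a positive "failed" result lets the PTS act as Stratum 1 again.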


STP-only CTN (Preferred, Backup and Arbiter assigned) CTS failure or power outage

• BTS loses communication with CTS on all established paths
• BTS does not invoke OLS recovery rules
• BTS and Arbiter communicate to establish if Arbiter also cannot communicate with CTS
• If both BTS and Arbiter cannot communicate with CTS
  – BTS takes over as CTS (S1)
• Since only 1 CTS (S1) can exist,
  – PTS initially surrenders role of CTS
  – PTS initiates “Console Assisted Recovery” to determine if BTS failed or operational
  – If BTS determined to have failed
    • PTS retakes its role of CTS

[Figure: before and after views of a three-server CTN (CTNID=ITSOPOK -) in a Parallel Sysplex: z9 EC (PTS/CTS, P1), z990 (BTS, P2), and z890 (Arbiter, P3), connected by coupling links and attached to the HMC; server labels SCZP901, SCZP101, and P000STP2. Before: PTS/CTS is S1, BTS and Arbiter are S2. After recovery: PTS is S0, BTS is S1, Arbiter is S2.]


STP-only CTN (Preferred, Backup and Arbiter assigned) CTS failure or power outage (continued)

• STP user actions
  – Repair CTS
• STP does an automatic re-takeover
  – SCZP101 joins as S2
  – Retakes role of CTS after verification checks
  – SCZP901 becomes S2

[Figure: the original configuration after the repair action (CTNID=ITSOPOK -): z9 EC (PTS/CTS, S1, P1), z990 (BTS, S2, P2), and z890 (Arbiter, S2, P3) in a Parallel Sysplex, connected by coupling links and attached to the HMC; server labels SCZP901, SCZP101, and P000STP2.]


STP-only CTN (Preferred, Backup and Arbiter assigned) Last link failures

• Last link between CTS and BTS
  – BTS loses communication with CTS on all established paths
  – BTS initiates Arbiter assisted recovery
  – Arbiter still has connectivity to CTS
  – BTS synchronizes to Arbiter and becomes S3
  – STP User actions: None
• Last link between CTS and Arbiter
  – Arbiter synchronizes to BTS and becomes S3
  – STP User actions: None
• Last link between BTS and Arbiter
  – BTS and Arbiter both stay as S2
  – STP User actions: None
  – Arbiter Assisted Recovery exposed for any subsequent loss of communication between BTS and CTS

[Figure: z9 EC (PTS/CTS, S1, P1), z990 (BTS, S2, P2), and z890 (Arbiter, S2, P3) in a Parallel Sysplex, connected by coupling links and attached to the HMC; server labels SCZP901, SCZP101, and P000STP2.]
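The three last-link cases above can be summarized as a mapping from the failed link to the resulting stratum levels. This is an illustrative sketch: the function name and the role labels used as dictionary keys are hypothetical, and it assumes the starting point shown in the figure (CTS at S1, BTS and Arbiter at S2).

```python
from typing import Dict, Iterable


def strata_after_last_link_failure(failed_link: Iterable[str]) -> Dict[str, int]:
    """Stratum level of each role after the last link between two roles
    fails, per the three cases on the slide."""
    link = frozenset(failed_link)
    strata = {"CTS": 1, "BTS": 2, "Arbiter": 2}
    if link == frozenset({"CTS", "BTS"}):
        strata["BTS"] = 3        # BTS synchronizes to the Arbiter
    elif link == frozenset({"CTS", "Arbiter"}):
        strata["Arbiter"] = 3    # Arbiter synchronizes to the BTS
    elif link != frozenset({"BTS", "Arbiter"}):
        raise ValueError("expected a link between two of CTS, BTS, Arbiter")
    return strata


print(strata_after_last_link_failure({"CTS", "BTS"}))
```

In all three cases exactly one Stratum 1 remains and no user action is required.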


Power Outage PTS/CTS with Internal Battery Feature (IBF)

• With IBF on CEC1
• CEC1 power outage, enters IBF state
• CEC1 notifies CEC2 it is running on IBF
• CEC2 waits for 30 seconds to take action
  – Could be a power glitch
  – If notified within 30 seconds that CEC1 back to “normal power”, no further action
• If CEC1 in IBF state > 30 seconds,
  – CEC2 takes over as the CTS
  – CEC1 becomes S2 until IBF no longer functional and power drops
• CEC1 power resumes
  – Automatic re-takeover as PTS/CTS
• IBF is designed to enable PTS/CTS to reconfigure the BTS as the CTS if
  – Power outage of PTS/CTS
  – Power outage of site where PTS/CTS and Arbiter are located

[Figure: two configurations, each with CEC1 (PTS/CTS, S1) and CEC2 (BTS, S2) connected by coupling links, HMCs, and partitions P1, P2, P3. One is titled "CEC power outage in same data center", the other "Site power outage – 2 data centers".]
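The 30-second wait can be sketched as a small decision function evaluated by the BTS (CEC2). The names and return labels are hypothetical; the 30-second grace period and the three outcomes come from the slide.

```python
IBF_GRACE_SECONDS = 30.0   # from the slide: CEC2 waits 30 seconds


def bts_action_on_ibf(seconds_on_ibf: float, normal_power_restored: bool) -> str:
    """What CEC2 (BTS) does after CEC1 (PTS/CTS) reports running on IBF.

    seconds_on_ibf        -- how long CEC1 has been (or was) in IBF state
    normal_power_restored -- True if CEC1 has reported normal power again
    """
    if seconds_on_ibf > IBF_GRACE_SECONDS:
        # Past the grace period: CEC2 takes over; CEC1 continues as S2
        # until the battery is exhausted.
        return "take over as CTS"
    if normal_power_restored:
        return "no further action"   # treated as a power glitch
    return "keep waiting"
```

A 10-second glitch therefore causes no role change, while an outage still in progress at 31 seconds triggers the takeover.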


Power outage of data center (Site 1) with PTS and Arbiter

• With Internal Battery Feature (IBF) on CEC1 and CEC3
  – Site 1 power outage
  – CEC1 and CEC3 enter IBF state
  – CEC1 and CEC3 notify CEC2 that they are running on IBF
• CEC2 waits for 30 seconds to take action
  – Could be a power glitch
  – If notified within 30 seconds that CEC1 back to “normal power”, no further action
• If CEC1 and CEC3 in IBF state > 30 seconds,
  – CEC2 takes over as the CTS
  – Not dependent on Arbiter Assisted Recovery
• CEC1 becomes S2 until IBF no longer functional and power drops
• CEC1 power resumes
  – Automatic re-takeover as PTS/CTS

[Figure: two-site configuration with HMCs and coupling links between the servers. CEC1 (PTS/CTS, S1), CEC2 (BTS, S2), CEC3 (Arbiter, S2), and CEC4 (S2) host partitions P1 to P4; CEC1 and CEC3 are in Site 1.]


IBF Recommendations

Single data center

IBF only protects against a server power outage

CTN with 2 servers: install IBF on at least the PTS/CTS

Also recommend IBF on BTS to provide recovery protection when BTS is the CTS

CTN with 3 or more servers: IBF is not required to recover from a CTS power outage if an Arbiter is configured

Two data centers

IBF protects against both server and site power outage scenarios

CTN with 2 servers (one in each data center): install IBF on at least the PTS/CTS

Also recommend IBF on BTS to provide recovery protection when BTS is the CTS

CTN with 3 or more servers: install IBF on CTS and Arbiter (in same site as CTS)

Also recommend IBF on BTS to provide recovery protection when BTS is the CTS
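The recommendations above amount to a small lookup by number of data centers and number of servers. A minimal sketch follows, assuming a hypothetical function name and return shape; the one combination the slide does not address (single data center, 3 or more servers, no Arbiter) is deliberately returned as `None` rather than guessed.

```python
from typing import Optional


def ibf_recommendation(data_centers: int, servers: int,
                       arbiter_configured: bool = False) -> Optional[dict]:
    """Look up where the slide recommends installing IBF.

    Returns a dict with the roles that should have IBF ("install_on") and
    the roles for which IBF is additionally recommended
    ("also_recommended_on"), or None for a case the slide does not state.
    """
    if data_centers not in (1, 2) or servers < 2:
        raise ValueError("the slide covers 1 or 2 data centers and at least 2 servers")
    if servers == 2:
        # Same advice for one or two data centers.
        return {"install_on": ["PTS/CTS"], "also_recommended_on": ["BTS"]}
    if data_centers == 1:
        if arbiter_configured:
            # Arbiter handles a CTS power outage; IBF not required.
            return {"install_on": [], "also_recommended_on": []}
        return None  # not covered by the slide
    # Two data centers, 3 or more servers: Arbiter is in the same site as the CTS.
    return {"install_on": ["CTS", "Arbiter"], "also_recommended_on": ["BTS"]}
```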



STP-only CTN (Preferred and Backup assigned) Site 1 Failure

[Figure: Site 1 contains SCZP101 (PTS/CTS, S1) running P1; Site 2 contains SCZP901 (BTS, S2) running P2. The servers are connected by coupling links, with an HMC at each site. P1, P2 in Parallel Sysplex; CTNID=ITSOPOK -]

BTS (SCZP901) loses all communication with CTS (SCZP101)

BTS most probably does not receive OLS

BTS initiates “Console assisted recovery”

Results of “Console assisted recovery”

CTS state most probably indeterminate

BTS eventually becomes unsynchronized at end of Freewheel Interval

z/OS systems (STPMODE YES) in site 2 post WTOR (IEA394A)

STP User actions:

Reassign BTS as CTS

Respond with Retry to WTOR


STP-only CTN (Preferred and Backup assigned) Site 2 failure

[Figure: Site 1 contains SCZP101 (PTS/CTS, S1) running P1; Site 2 contains SCZP901 (BTS, S2) running P2. The servers are connected by coupling links, with an HMC at each site. P1, P2 in Parallel Sysplex; CTNID=ITSOPOK -]

PTS (SCZP101) continues role of CTS

z/OS systems in Site 1 requiring STPMODE YES not affected

STP User actions:

Restore Site 2


STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 1 Failure –

Arbiter in same site as BTS

[Figure: Site 1 contains SCZP101 (PTS/CTS, S1); Site 2 contains SCZP901 (BTS, S2) and P000STP2 (Arbiter, S2); STP1 (S2) is also shown. The servers are connected by coupling links, with an HMC at each site. P1, P2, P3, P4 in Parallel Sysplex; CTNID=ITSOPOK -]

BTS (SCZP901) loses all communication with CTS (SCZP101)

BTS and Arbiter communicate to establish whether the Arbiter also cannot communicate with the CTS

Neither can communicate with the CTS

BTS takes over as CTS (S1)

z/OS systems in Site 2 requiring STPMODE YES not affected

STP User actions:

None


STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 2 Failure –

Arbiter in same site as BTS

PTS/CTS (SCZP101) loses communication with both BTS and Arbiter

PTS surrenders role of CTS

PTS initiates “Console assisted recovery” to determine if BTS failed or operational

Results of “Console assisted recovery”

BTS state most probably indeterminate

PTS CANNOT retake role of CTS

All z/OS systems in site 1 post WTOR (IEA394A)

STP User actions:

Reassign PTS as CTS

Respond with Retry to WTOR

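[Figure: Site 1 contains SCZP101 (PTS/CTS, S1); Site 2 contains SCZP901 (BTS, S2) and P000STP2 (Arbiter, S2); STP1 (S2) is also shown. The servers are connected by coupling links, with an HMC at each site. P1, P2, P3, P4 in Parallel Sysplex.]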


STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 1 Failure –

Arbiter located in same site as PTS/CTS

[Figure: Site 1 contains SCZP101 (PTS/CTS, S1) and P000STP2 (Arbiter, S2); Site 2 contains SCZP901 (BTS, S2); STP1 (S2) is also shown. The servers are connected by coupling links, with an HMC at each site.]

BTS loses all communication with CTS

BTS cannot communicate with Arbiter

BTS initiates “Console assisted recovery”

Results of “Console assisted recovery”

CTS state most probably indeterminate

BTS CANNOT take over as S1

BTS eventually becomes unsynchronized

z/OS systems (STPMODE YES) in site 2 post WTOR (IEA394A)

Similar to case with only PTS and BTS assigned

STP User actions:

Reassign BTS as CTS

Respond with Retry to WTOR


STP-only CTN (Preferred, Backup, and Arbiter assigned) -

Reconfiguration after Site 1 Failure

Select System (Sysplex) Time task of SCZP901

Server that will become the new CTS after reconfiguration

Select Network Configuration tab

Assign SCZP901 as PTS and CTS

Assign STP1 as BTS

Select “Force configuration”, since starting from Stratum 0

Respond “Retry” to each WTOR (IEA394A) posted

Note that after responding to the first WTOR, the remaining WTORs in the Sysplex have to be responded to within approximately 4 minutes if there are up to 8 z/OS images (an additional 30 seconds per image if there are more than 8 images)
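The response window in the note above is simple arithmetic, sketched here for planning purposes. The function name is hypothetical, and the result is approximate, as the slide itself says.

```python
def wtor_response_window_seconds(zos_images: int) -> int:
    """Approximate time allowed to answer the remaining IEA394A WTORs
    after the first one has been answered, per the note on this slide."""
    if zos_images < 1:
        raise ValueError("at least one z/OS image is required")
    base_seconds = 4 * 60                        # about 4 minutes for up to 8 images
    extra_images = max(0, zos_images - 8)        # images beyond the first 8
    return base_seconds + extra_images * 30      # 30 more seconds per extra image
```

With 12 z/OS images, for instance, the window is about 240 + 4 × 30 = 360 seconds, or roughly 6 minutes.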


STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 2 Failure –

Arbiter located in same site as PTS/CTS

CTS loses communication with only the BTS

CTS maintains communication with Arbiter

PTS maintains role of CTS (S1)

STP-only CTN servers in Site 1 stay synchronized to CTS (S1)

z/OS systems in Site 1 requiring STPMODE YES not affected

STP User actions:

None

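[Figure: Site 1 contains SCZP101 (PTS/CTS, S1) and P000STP2 (Arbiter, S2); Site 2 contains SCZP901 (BTS, S2); a z9 BC (S2) is also shown. The servers are connected by coupling links, with an HMC at each site.]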


Multi-site CTN Rules and Recommendations

Provide redundant routes for fiber links between sites

Use only qualified DWDMs

If 3 or more servers in CTN, assign BTS and Arbiter

Locate the Arbiter in same site as PTS

Provides better recovery for scenarios when:

OLS may not be sent from CTS, or

OLS may not be received by BTS
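The BTS-side outcomes in the Site 1 failure scenarios can be summarized as one decision function. This is a sketch under stated assumptions, not the STP recovery algorithm: the function, its parameters, and the result strings are hypothetical, and the branches marked "assumption" are not spelled out on these slides (which only say the CTS state is "most probably indeterminate").

```python
from typing import Optional

TAKE_OVER = "BTS takes over as CTS (S1)"
NO_TAKEOVER = "BTS does not take over"
UNSYNCHRONIZED = "BTS cannot take over; becomes unsynchronized, WTOR IEA394A posted"


def bts_decision(arbiter_assigned: bool,
                 bts_reaches_arbiter: bool,
                 arbiter_reaches_cts: bool,
                 console_cts_failed: Optional[bool]) -> str:
    """Outcome after the BTS has lost all communication with the CTS.

    console_cts_failed is the result of Console assisted recovery:
    True (CTS known failed), False (CTS known operational), or
    None (CTS state indeterminate).
    """
    if arbiter_assigned and bts_reaches_arbiter:
        if not arbiter_reaches_cts:
            # Both BTS and Arbiter have lost the CTS: Arbiter assisted recovery.
            return TAKE_OVER
        return NO_TAKEOVER  # assumption: the Arbiter still sees the CTS
    # No Arbiter, or the Arbiter is unreachable: Console assisted recovery.
    if console_cts_failed is True:
        return TAKE_OVER    # assumption: a determinate "failed" allows takeover
    if console_cts_failed is False:
        return NO_TAKEOVER  # assumption: the CTS is known to be operational
    return UNSYNCHRONIZED   # indeterminate CTS state, as in the slides
```

The sketch makes the recommendation visible: with the Arbiter in the PTS site, a Site 1 failure leaves the BTS without an Arbiter to consult, so it falls through to Console assisted recovery, whereas a Site 2 failure leaves the PTS and Arbiter together and needs no user action.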



ETS Recovery - DISCLAIMER

The following section is intended to provide ONLY a basic overview of ETS Recovery

For more detailed recovery information and the actions that must be taken in response to various failures, please see the ETS recovery information in:

STP Planning Guide, SG24-7280

STP Implementation Guide, SG24-7281


ETS Recovery introduction

External time source in an STP-only CTN can be provided by:

Using dial-out on the HMC

Using an NTP server (LAN connection)

Using an NTP server with a pulse per second output option (LAN connection and coaxial cable to the PPS port of an ETR card)

Limited recovery actions when ETS configured to use dial-out:

HMC attempts to redial if line is busy

Option to have more than one HMC act as a phone server

Regardless of the ETS option selected, failures associated with ETS do not affect the capability of servers in a CTN to stay synchronized with each other.

As long as the timing state of the servers remains synchronized, z/OS images that depend on synchronization are not affected.

The only effect of unsuccessful recovery for an ETS failure is that the CTN will slowly drift away from ETS time.


NTP Server Redundancy Recommendations

At least one NTP server must be configured on the PTS/CTS

Only the Current Time Server (CTS) makes time adjustments based on information from the NTP Server

Also recommended to configure at least one NTP server on the BTS. This allows:

Continuous NTP server access when BTS becomes the CTS

Time adjustments to the STP-only CTN when the PTS/CTS cannot access any of its NTP servers

If two NTP servers are configured, user is responsible for selecting preferred NTP server

This NTP server is called the selected NTP server; the other NTP server is called the non-selected NTP server.

Recommendations apply when using NTP servers with or without PPS


ETS Recovery design using NTP Servers

Configured NTP servers on the PTS/CTS are accessed once every 10 minutes by the SNTP client.

Once every hour, assuming a successful access of the selected NTP server, the SNTP client sends a CST adjustment to the STP facility.

Normally, the SNTP client on the CTS uses the time information from the selected NTP server to perform the time adjustment.

The time information from the non-selected NTP server is only used when there is a failure associated with accessing time information from the selected NTP server.

Configured NTP servers on the BTS are also accessed once every 10 minutes.

The BTS calculates a value for time adjustment based on this access, and communicates the information to the PTS over the coupling links.

If the PTS/CTS cannot access either of its configured NTP servers, it will switch over to using the timing information sent from the BTS to steer the STP-only CTN.


Order of Recovery actions –

ETS using NTP Servers

After two unsuccessful attempts (two hours) at sending a CST adjustment value based on the selected NTP server, the SNTP client will switch to sending timing adjustment information based on the non-selected NTP server

After two unsuccessful attempts (two hours) at sending a CST adjustment value based on the non-selected NTP server, STP will steer the CTN using the calculation from the BTS

BTS information could be based on:

Selected NTP server at the BTS, or

Non-selected NTP server, if valid data cannot be accessed from the selected NTP server

When STP is not able to switch to any operational NTP server, automatic base steering continues

Base steering allows STP to compensate for the drift characteristics of the oscillator, thereby maintaining relatively good time accuracy at the Current Time Server, even if an ETS is not available.
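The order of recovery actions above can be read as a fallback chain. The sketch below illustrates that chain only; the function name, parameters, and returned labels are hypothetical, and the real SNTP client and STP facility are not exposed as an API like this.

```python
ATTEMPTS_BEFORE_SWITCH = 2  # two unsuccessful hourly attempts, i.e. two hours


def steering_source(selected_failed_attempts: int,
                    non_selected_failed_attempts: int,
                    bts_adjustment_available: bool,
                    non_selected_configured: bool = True) -> str:
    """Pick the source the CTS uses to steer the CTN toward ETS time.

    The failed-attempt counters are the numbers of consecutive unsuccessful
    hourly CST adjustments for each NTP server configured on the PTS/CTS.
    """
    if selected_failed_attempts < ATTEMPTS_BEFORE_SWITCH:
        return "selected NTP server"
    if non_selected_configured and non_selected_failed_attempts < ATTEMPTS_BEFORE_SWITCH:
        return "non-selected NTP server"
    if bts_adjustment_available:
        # The BTS calculation may itself rest on its selected or
        # non-selected NTP server.
        return "BTS calculation"
    # No operational NTP server anywhere: automatic base steering continues.
    return "base steering only"
```

Note that base steering is always running; the last branch only means that no ETS-derived adjustment is being applied on top of it.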


Possible failures -

ETS using NTP Servers

[Figure: a PTS & CTS or BTS whose SNTP client reaches the selected NTP server (Stratum 1) through an Ethernet switch; a System z HMC is also shown. Numbered markers 1 and 2 indicate the failure points listed below.]

1. Loss of LAN connectivity between the Support Element and the NTP server

2. Complete NTP server failure or bad NTP data from the NTP server


Scenario 1 -

Redundant NTP Servers on PTS/CTS

[Figure: PTS/CTS (S1) whose SNTP client reaches two NTP servers through an Ethernet switch: NTP server 1 (Stratum 1, selected) and NTP server 2, the HMC NTP server (Stratum 2, non-selected). A further Stratum 1 NTP server and the corporate network are also shown.]

Recovery

If the selected NTP server becomes unavailable, BUT the non-selected NTP server is still available (failure 2), the SNTP client will use the non-selected NTP server as its ETS, and will continue steering the CTN using timing information received from NTP server 2.

If the failure is a LAN failure (failure 1), NO recovery is possible, and the CTN continues to use automatic base steering

Failure types:

1. Loss of LAN connectivity between the Support Element and the NTP server

2. Complete NTP server failure or bad NTP data from the NTP server


Scenario 2 -

Redundant NTP Servers on PTS and BTS

Compared to Scenario 1: This configuration provides an additional degree of continuous availability of NTP servers

Suitable for a dual site implementation, with PTS and BTS in different sites.

Recovery

If PTS/CTS is not able to access NTP server 1 for two hours:

It will start using time adjustment information sent by the BTS approximately an hour later to steer the CTN.

If BTS is not able to access NTP server 2 for two hours:

NO recovery action.

However, the problem should be corrected as soon as possible to maintain ETS redundancy.

[Figure: Coordinated Timing Network with PTS/CTS and BTS. NTP server 1 (Stratum 1) is the selected NTP server at the PTS and NTP server 2 (Stratum 1) is the selected NTP server at the BTS; each server's SNTP client reaches its NTP server through an Ethernet switch, and a System z HMC is shown on each side.]


Continuous NTP server availability -

Enhanced Configuration

[Figure: IBM System z Coordinated Timing Network with PTS/CTS (S1) in site 1 and BTS (S2) in site 2. At the PTS, NTP server 1 (Stratum 1) is selected and the System z HMC with NTP server enabled (Stratum 2) is non-selected; at the BTS, NTP server 2 (Stratum 1) is selected. A further Stratum 1 NTP server and the corporate network are also shown; the SNTP clients connect through Ethernet switches.]

To provide even more redundancy, also consider configuring an additional NTP server on the HMC

The NTP server on the HMC is the non-selected NTP server at the PTS/CTS.

If the selected NTP server fails at the PTS/CTS, the non-selected NTP server takes over the ETS role and provides the time information.

If both NTP servers in site 1 are inaccessible for a certain period of time (for example, because of LAN problems), the time adjustment information sent by the BTS will be used.



ETS Recovery design using NTP Servers with PPS

Configured NTP servers on PTS/CTS are accessed once a minute by SNTP client.

Once every 10 minutes, assuming successful access of both NTP servers, the SNTP client sends time adjustment information based on both NTP servers to the STP facility.

Configured NTP servers on BTS are also accessed once a minute by SNTP client.

Once every 10 minutes, time adjustment information based on both NTP servers is sent to the STP facility on BTS.

Normally, STP facility on BTS uses the time information in conjunction with the PPS signal from the selected NTP server to calculate a time adjustment.

BTS then communicates this information to the PTS over the coupling links.

Adjustment calculation based on time information and PPS signal from non-selected NTP server on BTS is only used when there is a failure associated with accessing time information or PPS signals from the selected NTP server.

If the PTS/CTS cannot access either of its configured NTP servers, it will switch over to using the timing information sent from the BTS to steer the STP-only CTN.


Possible Failures -

ETS using NTP Servers with PPS

[Figure: a PTS/CTS or BTS connected to an NTP server (Stratum 1) in two ways: the SNTP client reaches it over the LAN through an Ethernet switch, and the NTP server's PPS output goes to PPS port 0 of the ETR card. A System z HMC is also shown. Numbered markers 1, 2 and 3 indicate the failure points listed below.]

Possible failures

1. Loss of LAN connectivity between SE and NTP server or bad NTP data

2. PPS signal not received by PPS port on the ETR card.

3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.


Order of Recovery actions -

ETS using NTP Servers w/PPS

[Figure: Coordinated Timing Network with PTS/CTS (S1) and BTS (S2). NTP server 1 (Stratum 1, selected at the PTS) and NTP server 2 (Stratum 1, selected at the BTS) each provide NTP data to an SNTP client through an Ethernet switch and a PPS output to the ETR card; ETR card PPS ports 0 and 1 and a System z HMC are shown on each side.]

If failure type 1, STP will continue using PPS signals received on PPS port of the selected NTP server on the PTS/CTS.

If failure type 2 or 3, STP will switch to using time adjustment information received from BTS.

Failure types:

1. Loss of LAN connectivity between SE and NTP server or bad NTP data

2. PPS signal not received by PPS port on the ETR card.

3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.

Note: Refer to SG24-7280 and SG24-7281 when NTP server with PPS configuration is different


Order of Recovery actions -

ETS using NTP Servers w/PPS (continued)

Regardless of the specific redundancy provided by an NTP server with PPS configuration:

If PPS signals are not received from any of the configured NTP servers on the PTS/CTS and the BTS, BUT valid NTP data is available, STP will continue using the NTP data for steering the CTN, following the same recovery flow described in the previous “ETS recovery using NTP servers” section

When STP is not able to switch to any operational NTP server, the automatic base steering continues.

Base steering allows STP to compensate for drift characteristics of the oscillator, thereby maintaining relatively good time accuracy at the Current Time Server, even if an ETS is not available.


Scenario 1 -

Redundant NTP Servers with PPS on PTS/CTS

Recovery

If NTP server 1 is not accessible by the SNTP client on the SE (failure 1), BUT the PPS signal is still received on PPS port 0:

NO recovery is required because STP will continue to steer the CTN using the PPS signals from NTP server 1.

For failures 2 and 3 on NTP server 1, STP will switch to using the time information and the PPS signals from the non-selected server, NTP server 2.

Failure types:

1. Loss of LAN connectivity between SE and NTP server or bad NTP data

2. PPS signal not received by PPS port on the ETR card.

3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.

[Figure: PTS/CTS (S1) with two NTP servers, both Stratum 1 with PPS output: NTP server 1 (selected at the PTS) and NTP server 2 (non-selected at the PTS). The SNTP client reaches both through an Ethernet switch; the PPS outputs go to PPS ports 0 and 1 of the ETR card. A System z HMC is also shown.]


Scenario 2 -

Redundant NTP Servers with PPS on PTS and BTS

Recovery

If NTP server 1 is not accessible by the SNTP client on the SE (failure 1), BUT the PPS signal is still received on PPS port 0:

NO recovery is required because STP will continue to steer the CTN using the PPS signals from NTP server 1.

For failures 2 and 3 on NTP server 1, the PTS/CTS will start using the time adjustment information received from the BTS, which is based on NTP server 2 and its PPS signals.

For failures 1, 2 and 3 on NTP server 2:

NO recovery required

[Figure: Coordinated Timing Network with PTS/CTS (S1) and BTS (S2). NTP server 1 (Stratum 1, selected at the PTS) and NTP server 2 (Stratum 1, selected at the BTS) each provide NTP data to an SNTP client through an Ethernet switch and a PPS output to the ETR card; ETR card PPS ports 0 and 1 and a System z HMC are shown on each side.]

Failure types:

1. Loss of LAN connectivity between SE and NTP server or bad NTP data

2. PPS signal not received by PPS port on the ETR card.

3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.
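The handling of the three failure types on the selected NTP server (NTP server 1) in the two PPS scenarios can be sketched as follows. The function name, the `fallback` parameter, and the result strings are hypothetical and serve only to restate the slides; this is not an STP interface.

```python
FAILURE_TYPES = {
    1: "Loss of LAN connectivity between SE and NTP server or bad NTP data",
    2: "PPS signal not received by PPS port on the ETR card",
    3: "Complete NTP server failure affecting both NTP data and PPS output",
}


def recovery_for_selected_server(failure_type: int, fallback: str) -> str:
    """Recovery action for a failure on the selected NTP server at the PTS/CTS.

    fallback names what the configuration provides as the next source:
    "non-selected NTP server 2" in Scenario 1 (both servers on the PTS/CTS),
    or "time adjustment information from the BTS" in Scenario 2.
    """
    if failure_type not in FAILURE_TYPES:
        raise ValueError(f"unknown failure type: {failure_type}")
    if failure_type == 1:
        # NTP data is lost but PPS is still arriving on the PPS port.
        return "no recovery required; continue steering with PPS from NTP server 1"
    # Types 2 and 3 both remove the PPS signal from the selected server.
    return f"switch to {fallback}"
```

In Scenario 2, any of the three failures on NTP server 2 (the server selected at the BTS) requires no recovery, so it is not modeled here.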


Summary –

Mixed CTN

Configure for link redundancy

Attach (synchronize) at least 2 STP-configured servers to the Sysplex Timers in an Expanded Availability configuration

Multiple S1s allowed in Mixed-CTN

For configuration across two sites

Locate Sysplex Timers in different sites

Intermediate site may be required to locate second Sysplex Timer if two sites separated by 100 km

Provide redundant routes for fiber links between sites


Summary –

STP-only CTN

Configure for link redundancy

Initialize configuration with the PTS assigned as the Current Time Server

PTS, CTS must be assigned

Assign at least a Backup Time Server

Can take over as CTS - active S1

If 3 or more servers in CTN, assign BTS and Arbiter

For configuration across 2 sites:

Provide redundant routes for fiber links between sites

Use only qualified DWDMs

Locate the Arbiter in same site as PTS

Provides better recovery for scenarios when:

OLS may not be sent from CTS, or

OLS may not be received by BTS


Summary –

ETS Recovery

Failures associated with ETS and possible recovery actions do not affect the capability of servers in a CTN to stay synchronized with each other.

The Current Time Server (CTS) is the only server that adjusts the Coordinated Server Time (CST) by steering it to the time obtained from an external time source (ETS). Either the PTS or the BTS can be the CTS.

It is recommended to configure at least one unique NTP server or NTP server with PPS on the PTS and the BTS. Configuring an NTP server on the BTS provides two benefits:

Access to an NTP server when the BTS becomes the CTS as the result of planned or unplanned recovery

Time adjustments based on an NTP server when the PTS/CTS cannot access any of its NTP servers

Multi-site CTN configurations do not have any specific ETS redundancy considerations, other than the general recommendation to configure an NTP server both on the PTS and the BTS.

The CTS assignment does not change as a consequence of an ETS failure.


Additional Information

Redbooks®

Server Time Protocol Planning Guide SG24-7280

Server Time Protocol Implementation Guide SG24-7281

Server Time Protocol Recovery Guide SG24-7380

Education

Introduction to Server Time Protocol (STP)

Available on Resource Link™

www.ibm.com/servers/resourcelink/hom03010.nsf?OpenDatabase

STP Web site

www.ibm.com/systems/z/pso/stp.html

Systems Assurance

The IBM team is required to complete a Systems Assurance Review (SAPR Guide SA06-012) and to complete the Systems Assurance Confirmation Form via Resource Link

Techdocs and WSC Flashes

http://www-03.ibm.com/support/techdocs/atsmastr.nsf/Web/Techdocs

Search on “STP”


IBM Implementation Services for System z –

Server Time Protocol (6948-J56)

Offering Description

This offering is designed to assist clients to quickly and safely implement Server Time Protocol within their existing environments. STP provides clients with the capability to efficiently manage time synchronization within their multi-server infrastructure. Following best practices and using detailed planning services, IBM helps clients identify various implementation models and engage in the appropriate configuration required to effectively support STP for driving a more responsive business and IT infrastructure.

Program, Play, Industry Alignment

Infrastructure Improvement; Energy Efficiency; Better performance and lower operational cost

Client Value (enables customers to...)

Swift and secure implementation of STP for improved availability, integrity and performance

Improves multi-server time synchronization without interrupting operations

Enables integration with next generation of System z infrastructure

Target Audience

Primarily core, Large Enterprise customers.

Existing z midrange clients

Key Competitors

In house staff

Competitive Differentiation

Leverages best practices with secure implementation

Short implementation time - lower risk

Provides support and facilitates knowledge sharing through IBM’s mainframe expertise

Proof Points & Claims for Client Value / Differentiation

Need to safely implement a reliable replacement for Sysplex Timer® while maintaining continuous operations

Cost of providing and maintaining hardware, floor space and solution support for additional Sysplex Timer intermediate site

Lack of in-house expertise, skills and resources for implementing Server Time Protocol

Engagement Portfolio

http://spimweb1.boulder.ibm.com/services/sosf/dyno.wss?oid=50423&loc=All&langcd=en-US#1

Offering Manager

Anna Lee/Southbury/IBM, 512-590-8914, T/L: 268-9318


IBM Announces – IBM Implementation Services for System z – Server Time Protocol

Offering: Assist clients to quickly and safely implement Server Time Protocol within their existing environments. IBM helps clients identify various implementation models and engage in the appropriate configuration required to effectively support STP for driving a more responsive business and IT infrastructure

Customer Value:

- Improves multi-server time synchronization without interrupting operations

- Enables integration with next generation of System z infrastructure

- Swift and secure implementation of STP for improved availability, integrity, and performance

- Reduces hardware maintenance and power costs while eliminating intermediate site requirements for Sysplex Timer

Leverages IBM’s knowledge and best practices to help implementation of Server Time Protocol

Implementation of STP for improved availability and performance


Reference Material -

Terminology

APAR: Authorized Program Analysis Report
ARB: Arbiter
BTS: Backup Time Server
CF: Coupling Facility
CTS: Current Time Server
CTN: Coordinated Timing Network
DWDM: Dense Wave Division Multiplexer
ETR: External Time Reference
ETS: External Time Source
FC: Feature Code
HMC: Hardware Management Console
HCA: Host Channel Adapter
ICB: Integrated Cluster Bus
IPL: Initial Program Load
ISC: InterSystem Coupling Channel
LAN: Local Area Network
LIC: Licensed Internal Code
LPAR: Logical Partition
NTP: Network Time Protocol
PR/SM: Processor Resource / Systems Manager
PSIFB: Parallel Sysplex InfiniBand
PTF: Program Temporary Fix
PTS: Preferred Time Server
SW: Software (programs and operating systems)
SE: Support Element
TPF: Operating System
UTC: Coordinated Universal Time
zVM: Operating System
zVSE: Operating System
z/OS: Operating System
z/VM: Operating System


Questions?


Thank You

Tak (Danish)
Danke (German)
Dank u (Dutch)
Obrigado (Brazilian Portuguese)
ขอบคุณ (Thai)
Grazie (Italian)
go raibh maith agat (Gaelic)
Trugarez (Breton)
Merci (French)
Gracias (Spanish)
Спасибо (Russian)
நன்றி (Tamil)
धन्यवाद (Hindi)
شكراً (Arabic)
감사합니다 (Korean)
תודה רבה (Hebrew)
Tack så mycket (Swedish)
Dankon (Esperanto)
ありがとうございます (Japanese)
谢谢 (Chinese)
děkuji (Czech)
Mercés (Catalan)


Trademarks

The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.

The following are trademarks or registered trademarks of other companies.

* All other products may be trademarks or registered trademarks of their respective companies.

Notes:

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.

This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.

IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.

For a complete list of IBM Trademarks, see www.ibm.com/legal/copytrade.shtml:

*, AS/400®, e business(logo)®, DBE, ESCO, eServer, FICON, IBM®, IBM (logo)®, iSeries®, MVS, OS/390®, pSeries®, RS/6000®, S/30, VM/ESA®, VSE/ESA, WebSphere®, xSeries®, z/OS®, zSeries®, z/VM®, System i, System i5, System p, System p5, System x, System z, System z9®, System z10®, BladeCenter®

Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark nor does it mean that the product is not actively marketed or is not significant within its relevant market.

Those trademarks followed by ® are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.


ETR network failures

ETR link or CEC ETR port failure
– When Sysplex Timer signals are not received by the server's active ETR port, the server switches to the alternate ETR port

Single CLO link failure
– Both Sysplex Timers stay in sync and continue to transmit to attached servers

Both CLO links fail
– Primary timer continues to transmit when communication between the Timers is lost
– Secondary timer stops transmitting when communication between the Timers is lost

[Figure: ETR network, ETR Network ID = 15. Sysplex Timers 9037A (Primary) and 9037B (Secondary) are joined by CLO links; servers P1 (z990), P2 (z900) and P3 (z890) attach over active and alternate ETR links and are coupled by ISC-3 (peer mode) and ICB-3 links. Callout: Stops Transmitting.]
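The CLO link rules on this slide can be sketched as a small lookup. This is an illustrative model only, not Sysplex Timer code: the function name `timers_transmitting` and its return shape are assumptions made for the example.

```python
def timers_transmitting(working_clo_links: int) -> dict:
    """Report which Sysplex Timers keep transmitting to attached servers.

    Hypothetical model of the slide's rules:
    - at least one CLO link up: the timers stay in sync, both transmit
    - both CLO links down: Primary keeps transmitting, Secondary stops
    """
    if working_clo_links not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 working CLO links")

    timers_can_communicate = working_clo_links >= 1
    return {
        "primary": True,                      # Primary transmits in every case listed
        "secondary": timers_can_communicate,  # Secondary stops once isolated
    }
```

With one link lost, `timers_transmitting(1)` still reports both timers transmitting; only `timers_transmitting(0)` silences the Secondary.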


ETR Network Failures (continued)

Primary Sysplex Timer fails or loses power
– OLS received by Secondary ST
– Secondary ST becomes Primary ST
– z/OS systems (ETRMODE YES) on all servers are not affected

Primary Sysplex Timer in Site 1, Secondary Sysplex Timer in Site 2
– Site 1 fails
– Secondary ST most probably does not receive OLS
– Secondary ST stops transmitting
– z/OS systems in Site 2 (ETRMODE YES) post WTOR (IEA015)

[Figure: ETR network, ETR Network ID = 15. Sysplex Timers 9037A (Primary) and 9037B (Secondary) are joined by CLO links; servers P1 (z990), P2 (z900) and P3 (z890) attach over active and alternate ETR links and are coupled by ISC-3 (peer mode) and ICB-3 links. Callout: Stops Transmitting.]
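The two outcomes above differ only in whether the Secondary Sysplex Timer sees the Offline Signal (OLS). A minimal sketch of that branch follows; the function name and the dictionary keys are hypothetical and chosen only for illustration.

```python
def secondary_timer_outcome(ols_received: bool) -> dict:
    """Model what follows loss of the Primary Sysplex Timer.

    Hypothetical helper; the outcome strings paraphrase the slide.
    """
    if ols_received:
        # Primary timer failure or power outage: OLS reaches the Secondary
        return {
            "secondary_timer": "becomes Primary",
            "zos_etrmode_yes": "not affected",
        }
    # Site failure: the Secondary most probably never sees an OLS
    return {
        "secondary_timer": "stops transmitting",
        "zos_etrmode_yes": "posts WTOR (IEA015)",
    }
```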


System Failure Handling in a sysplex

To help get the sick/dead system out of the way as quickly as possible, IBM introduced the Sysplex Failure Management (SFM) component of XCF.

SFM can (under installation control) automatically partition a system from the sysplex if:

– The Failure Detection Interval has been reached AND

– No heartbeat has been received AND

– The apparently dead system is not sending any XCF signals.
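The three conditions are joined by AND, so a single live indicator is enough to prevent automatic partitioning. The predicate below is a sketch under that reading; the function and parameter names are assumptions, and real SFM behavior depends on the installation's policy.

```python
def sfm_may_partition(
    policy_allows_partitioning: bool,
    fdi_reached: bool,
    heartbeat_received: bool,
    sending_xcf_signals: bool,
) -> bool:
    """Return True only when every condition listed on the slide holds.

    Hypothetical model: 'under installation control' is represented by
    the policy_allows_partitioning flag.
    """
    return (
        policy_allows_partitioning
        and fdi_reached
        and not heartbeat_received
        and not sending_xcf_signals
    )
```

For example, a system that has missed its heartbeat past the interval but is still sending XCF signals is not partitioned by this rule.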

The Failure Detection Interval (prior to z/OS 1.11) defaults to either

– 25 seconds (LPAR with dedicated CPs) or

– 85 seconds (LPAR with shared CPs)

– The default can be overridden in the COUPLExx member
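The default selection can be written out as a short function. This is a sketch of the pre-z/OS 1.11 rule as stated on the slide; the function name and the `couplexx_override_seconds` parameter are hypothetical stand-ins for whatever value the installation codes in COUPLExx.

```python
from typing import Optional


def failure_detection_interval(
    dedicated_cps: bool,
    couplexx_override_seconds: Optional[int] = None,
) -> int:
    """Failure Detection Interval in seconds (pre-z/OS 1.11 defaults).

    An explicit override (assumed to come from the COUPLExx member)
    wins over the defaults.
    """
    if couplexx_override_seconds is not None:
        return couplexx_override_seconds
    # 25 s for an LPAR with dedicated CPs, 85 s for one with shared CPs
    return 25 if dedicated_cps else 85
```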

To encourage customers to use SFM, health checks were provided to verify that an SFM policy is active and that the policy specifies the ISOLATETIME option.


SFM pre-z/OS 1.11

If SFM is NOT active,

– The operator would eventually be prompted with message IXC402D, asking them to RESET the LPAR and then reply DOWN.

If SFM is active and ISOLATETIME is specified (as recommended),

– The system would (eventually) attempt to automatically fence the problem system and partition it out of the sysplex.

• Fencing required a Coupling Facility

• Without the CF (in a base sysplex, for example) there is no ability to fence a system.

If the operator observed messages indicating that a system appeared non-responsive,

– The operator could check the system status on the HMC and take manual action if the system was in fact dead.
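The pre-z/OS 1.11 behavior above amounts to a small decision tree. The sketch below is illustrative only. The slide does not say what happens when SFM is active without ISOLATETIME, or when no CF is available, so routing those cases to the operator prompt is an assumption, as are the function name and return strings.

```python
def pre_zos_1_11_handling(
    sfm_active: bool,
    isolatetime_specified: bool,
    cf_available: bool,
) -> str:
    """Pick the recovery path for an unresponsive system (hypothetical model)."""
    if sfm_active and isolatetime_specified and cf_available:
        # Fencing needs a Coupling Facility, so all three must hold
        return "automatic: fence the system and partition it out of the sysplex"
    # Assumption: every other combination ends with the operator,
    # as the slide describes for the SFM-not-active case
    return "manual: operator answers IXC402D (RESET the LPAR, then reply DOWN)"
```

In a base sysplex (`cf_available=False`) this model always returns the manual path, matching the note that a system cannot be fenced without a CF.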

z/OS 1.11 introduces some fundamental changes to this philosophy...