© 2010 IBM Corporation
Server Time Protocol Recovery Considerations
Noshir Dhondy ([email protected])
Agenda
ETR and STP Recovery Concepts
– Recovery design rules and terminology
– Switch to Local Timing mode
Mixed Coordinated Timing Network (Mixed CTN) recovery
– Failure scenarios
STP-only CTN recovery (Backup Time Server (BTS) assigned)
– Server Offline Signal (OLS), Console Assisted Recovery
– Failure scenarios
STP-only CTN recovery (BTS and Arbiter assigned)
– Arbiter Assisted Recovery
– Failure scenarios
STP-only CTN recovery with Internal Battery Feature (IBF)
Site failure scenarios
External Time Source (ETS) Recovery
– ETS Recovery using NTP Servers
– ETS Recovery using NTP Servers with PPS
STP-only CTN Terminology
CTN
– Collection of servers that are time synchronized to a time value called Coordinated Server Time (CST)
Server/CF roles
– Preferred Time Server/CF (PTS)
• Server that is preferred to be the Stratum 1 server
– Backup Time Server/CF (BTS)
• Role is to take over as the Stratum 1 under planned or unplanned outages, without disrupting synchronization capability of STP-only CTN
– Current Time Server/CF (CTS)
• Active S1 Server/CF
  – Only one S1 allowed
  – Only the PTS or BTS can be assigned as the CTS
  – Normally the PTS is assigned the role of CTS
  – Active S1
  – BTS typically is the Inactive S1
  – BTS can take over as Active S1, or be assigned Active S1 for planned actions
  – PTS is the Inactive S1 in those cases
– Arbiter
• Provides additional means to determine if BTS should take over as the CTS under unplanned outages
ETR/STP availability/recovery requirements
Availability
– When primary source of time fails, applications that depend on time synchronization can continue processing with data integrity.
• Parallel Sysplex
• GDPS customers having multi-site sysplex require Site 2 systems to continue processing when Site 1 fails and vice versa
• z/OS Global Mirror (XRC) that uses time stamps associated with data updates to make sure secondary copy of the data is consistent
• Non-sysplex applications that may use other than coupling links for messaging
ETR/STP recovery must ensure data integrity when time consistency cannot be maintained
– Availability can be compromised but not data integrity
– Current designs (ETR and STP) have failure scenarios where availability is compromised, resulting in z/OS systems posting a WTOR
Sysplex Timer Recovery design rule
Sysplex Timer (ST) design rule to ensure data integrity
– If Sysplex Timers lose capability to synchronize, one of the Sysplex Timers must disable transmission to servers
Sysplex Timer (ST) detects a failure
– Failing ST transmits 'Going away signal' Off Line Sequence (OLS) Symbol on Control Link Oscillator (CLO) links
If STs lose capability to synchronize
– Primary ST continues to transmit ETR signals (if Primary is operational)
• OLS received or not
– If Secondary ST receives OLS
• Secondary ST becomes Primary ST
– If Secondary ST does not receive OLS
• Secondary ST discontinues transmission of ETR signals
[Figure: External Time Reference (ETR) Network. Sysplex Timers 9037A (Primary) and 9037B (Secondary), ETR Network ID = 15, joined by CLO links; z9, z10, and z890 servers (P1, P2, P3 in Parallel Sysplex) attached by active and alternate ETR links and coupled by ISC-3 (peer mode) and ICB-3 links.]
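The Sysplex Timer design rule above reduces to a single decision on the Secondary timer. The following is a minimal sketch of that decision only; the function name and the returned strings are hypothetical labels, not part of any IBM interface.

```python
def secondary_timer_action(ols_received: bool) -> str:
    """Action of the Secondary Sysplex Timer once the two timers can no
    longer synchronize (hypothetical names, rule taken from the slide)."""
    if ols_received:
        # The failing Primary announced it is going away, so it is safe
        # for the Secondary to become the Primary.
        return "become_primary"
    # Without an OLS the Primary may still be transmitting, so the
    # Secondary stops sending ETR signals to protect data integrity.
    return "discontinue_etr_transmission"
```

Either branch leaves at most one timer transmitting to the servers, which is the point of the design rule.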
STP recovery design rules and overview
CANNOT have two Stratum 1 servers in timing network
Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
– Preferred Time Server (PTS) can indicate it has “failed”
• PTS, if operational, MUST surrender role of CTS
– BTS can unambiguously determine the PTS has “failed”
Switch to Local Timing Mode
Server in ETR network or CTN becomes unsynchronized (S0 in CTN):
– z/OS system images running in ETR or STP timing mode switch to local timing mode.
– Impact of switching depends on
• PLEXCFG parameter in IEASYSxx, and
• ETRMODE or STPMODE specified in CLOCKxx.
– z/OS systems that specify:
• PLEXCFG=MULTISYSTEM or PLEXCFG=ANY in IEASYSxx, and
• ETRMODE YES or STPMODE YES in CLOCKxx
– Issue a WTOR message to allow operator intervention to resolve the problem before a wait state is loaded
• z/OS systems that specify ETRMODE YES and are running in ETR timing mode issue WTOR message IEA015A.
• z/OS systems that specify STPMODE YES and are running in STP timing mode issue WTOR message IEA394A.
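The conditions above can be read as a small lookup. The sketch below is illustrative only: the message IDs and parameter values come from the slide, while the function name, argument names, and the use of `None` for configurations the slide does not describe are assumptions.

```python
from typing import Optional

# Message IDs from the slide, keyed by the timing mode the image was running in.
WTOR_BY_TIMING_MODE = {"ETR": "IEA015A", "STP": "IEA394A"}


def wtor_on_sync_loss(plexcfg: str, timing_mode: str, mode_yes: bool) -> Optional[str]:
    """Return the WTOR a z/OS image issues after switching to local timing mode.

    plexcfg     -- PLEXCFG value from IEASYSxx
    timing_mode -- "ETR" or "STP", the mode the image was running in
    mode_yes    -- True if the matching ETRMODE YES / STPMODE YES is in CLOCKxx
    Returns None for cases the slide does not cover (hypothetical convention).
    """
    if plexcfg not in ("MULTISYSTEM", "ANY") or not mode_yes:
        return None
    return WTOR_BY_TIMING_MODE.get(timing_mode)
```

For example, an image with PLEXCFG=MULTISYSTEM and STPMODE YES running in STP timing mode maps to IEA394A.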
WTOR – IEA015A
WTOR allows time window to correct the problem and respond “RETRY” if problem corrected or “ABORT” if problem cannot be corrected
– “ABORT” will load wait state 0A2-114
Secondary Sysplex Timer can be reconfigured to an Expanded Basic configuration (single Sysplex Timer) to restart transmission of ETR signals before
– WTOR messages responded to with “RETRY”
New function in z/OS 1.7 for Sysplex Failure Management (SFM) to recognize that WTOR IEA015A issued
WTOR – IEA394A
WTOR allows time window to correct the problem and respond “RETRY” if problem corrected or “ABORT” if problem cannot be corrected
– “ABORT” will load wait state 0A2-158
Backup Time Server or another operational server in the CTN can be reconfigured to be the Current Time Server (CTS) before
– WTOR messages responded to with “RETRY”
New function in z/OS 1.7 for SFM to recognize that WTOR IEA394A issued
IEA394A WTOR
Important: Priority message checkbox must be selected when responding to WTOR
Sysplex Failure Management (SFM) considerations
SFM allows installation to code a policy to define the recovery actions to be automatically initiated following detection of a Parallel Sysplex failure.
– Actions include fencing off the failed image that prevents access to shared resources, logical partition deactivation, or dynamic storage reconfiguration.
New function in z/OS 1.7 and higher for SFM to recognize that WTOR IEA015A or IEA394A issued
– If the WTOR message is issued by all the z/OS images in the sysplex, the user is not time constrained to do timing network reconfiguration before replying to IEA394A or IEA015A.
– Once the WTOR on the first system image is responded to with “RETRY”:
• If the number of z/OS images in the sysplex is less than or equal to 8, XCF will allow a delay of four (4) minutes to respond to the last outstanding WTOR message IEA394A or IEA015A
• If the number of z/OS images is greater than 8, XCF will allow a delay of (number of z/OS images × 30 seconds)
z/OS system images will enter disabled-wait states should the user not be able to respond to the IEA394A or IEA015A WTOR message in the allotted time.
If the message is issued only on a subset of participating sysplex images, the SFM settings specified in the SFM Policy must be considered
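The response window that opens after the first “RETRY” is simple arithmetic, sketched below. The function name is hypothetical; the 8-image boundary, the four-minute value, and the 30 seconds per image come from the slide.

```python
def xcf_reply_window_seconds(image_count: int) -> int:
    """Seconds XCF allows for answering the last outstanding WTOR
    (IEA394A or IEA015A) after the first image is answered with RETRY."""
    if image_count < 1:
        raise ValueError("a sysplex has at least one z/OS image")
    if image_count <= 8:
        return 4 * 60            # four minutes for up to 8 images
    return image_count * 30      # 30 seconds per image beyond that
```

Note that the two branches meet at 8 images (8 × 30 s is four minutes), so the window never shrinks as images are added. A 12-image sysplex would have 360 seconds.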
STP Recovery terminology
Coordinated Server Time
– Coordinated Server Time (CST) represents the time for the CTN and is the time at a Stratum 1 server
Synchronization check threshold
– Server/CF considered to be in synchronized state if TOD clock within synchronization check threshold of CST
– STP synchronization check threshold 50 microseconds
– If TOD clock differs from CST by more than +/- 50 microseconds, server/CF becomes unsynchronized
• Can become a Stratum 0 (S0) server/CF
Freewheel Interval
– Amount of time a Stratum 2 or Stratum 3 server can remain synchronized without receiving messages from its clock source
• Approximately 1 second (Mixed-CTN)
• Approximately 10 seconds (STP-only CTN)
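The two thresholds defined above can be expressed as checks. This is a sketch under stated assumptions: the constant and function names are invented for illustration, and the freewheel values are the approximate figures from the slide, not exact timer settings.

```python
SYNC_CHECK_THRESHOLD_US = 50.0                            # from the slide
FREEWHEEL_INTERVAL_S = {"mixed": 1.0, "stp_only": 10.0}   # approximate values


def is_synchronized(tod_us: float, cst_us: float) -> bool:
    """True while the TOD clock is within the synchronization check
    threshold of CST (both values in microseconds)."""
    return abs(tod_us - cst_us) <= SYNC_CHECK_THRESHOLD_US


def freewheel_expired(seconds_without_messages: float, ctn_type: str) -> bool:
    """True once a Stratum 2 or Stratum 3 server has gone longer than the
    freewheel interval without messages from its clock source."""
    return seconds_without_messages > FREEWHEEL_INTERVAL_S[ctn_type]
```

A server that is 51 microseconds away from CST fails the first check; a server in an STP-only CTN that has heard nothing for 2 seconds is still inside its freewheel interval, while the same silence in a Mixed CTN is not.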
Improved RAS for S1 servers/CFs – Mixed CTN
ETR Network recovery same as before
Improved RAS over ETR network for Stratum 1 servers/CFs
– If either z990 or z890 loses all timing signals from the Sysplex Timers, they are capable of becoming Stratum 2 (S2) servers in a Mixed CTN
[Figure: Mixed CTN, CTNID=ITSOPOK-15. Sysplex Timers 9037A (Primary) and 9037B (Secondary), ETR Network ID = 15, joined by CLO links; z900 (P2), z990 (S1, P1), and z890 (S1, P3) in Parallel Sysplex, attached by active and alternate ETR links; ISC-3 links carry CF + STP messages; ICB-3 links.]
Improved RAS for S1 servers/CFs – Mixed CTN (continued)
Improved RAS over ETR network for Stratum 1 (S1) servers/CFs
– If either z990 or z890 loses all timing signals from the Sysplex Timers, they are capable of becoming Stratum 2 (S2) servers in a Mixed CTN
Example:
– z890 loses both ETR links
– Can synchronize to z990 and become a S2 server
[Figure: Mixed CTN, CTNID=ITSOPOK-15, after z890 loses both ETR links. Sysplex Timers 9037A (Primary) and 9037B (Secondary), ETR Network ID = 15; z900 (P2), z990 (S1, P1), and z890 (S2, P3) in Parallel Sysplex; ISC-3 links carry CF + STP messages; ICB-3 links.]
Coupling link recovery – Mixed CTN example
Link recovery same for Mixed CTN and STP-only CTN
Both ISC-3 links between z990 and z890 established as paths that can be used for STP message exchanges
Only one established path used to exchange STP messages for synchronization
Messages exchanged every 64 ms
z890 synchronized to z990 using ISC-3 link (1)
– ISC-3 link (2) is an established path NOT exchanging messages.
If ISC-3 link (1) fails, redundant ISC-3 link (2) will be used for synchronizing z890 to z990
[Figure: Mixed CTN, CTNID=ITSOPOK-15, ETR Network ID = 15. z900 (P2), z990 (S1, P1), and z890 (S2, P3) in Parallel Sysplex; ISC-3 links (1) and (2) between z990 and z890, with link (2) used for synchronization if link (1) fails; ICB-3 and ISC-3 links to z900 carry CF messages only.]
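The link recovery described on this slide amounts to keeping one active path and falling back to another established path when it fails. The sketch below illustrates that idea only; the function, its arguments, and the "first operational path in order" tie-break are assumptions, since the slide does not say how STP chooses among several surviving paths.

```python
from typing import Dict, Optional


def select_sync_path(paths: Dict[str, bool], current: Optional[str]) -> Optional[str]:
    """Pick the established path used for STP synchronization messages.

    paths   -- established path name -> True if the link is operational
    current -- path currently exchanging STP messages, if any
    """
    # Only one established path is used at a time; keep it while it works.
    if current is not None and paths.get(current, False):
        return current
    # Otherwise fall back to another established, operational path
    # (insertion order is an assumed tie-break).
    for name, operational in paths.items():
        if operational:
            return name
    return None  # last link failed; no path to the clock source remains
```

With `{"ISC-3 (1)": False, "ISC-3 (2)": True}` and `current="ISC-3 (1)"`, the function returns `"ISC-3 (2)"`, matching the example on the slide.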
Mixed-CTN (Stratum 1 failure)
At least two Stratum 1 servers recommended in a Mixed CTN to avoid single point of failure
STP messages exchanged between S2 and all available S1s
– Algorithm selects one of the available S1s as clock source
In this example:
– z890 synchronized to z9 EC using ISC-3 link (1)
– z890 and z990 also exchanging messages using ISC-3 link (4)
If z9 EC fails or is taken down for planned outage,
– z890 selects z990 as clock source, exchanging messages using ISC-3 link (4)
[Figure: Mixed CTN, CTNID=ITSOPOK-15, ETR Network ID = 15. z9 EC (S1, P1), z900 (P2), z890 (S2, P3), and z990 (S1, P4) in Parallel Sysplex, with ISC-3 links (1) to (4); ISC-3 link (4) will be used for synchronization if z9 EC fails or has planned outage.]
STP-only CTN with 2 servers/CFs
CTN only has a PTS and BTS assigned
– Arbiter NOT ASSIGNED
Assumption: PTS also assigned the CTS role
CANNOT have two Stratum 1 servers in timing network
Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
– Preferred Time Server (PTS) can indicate it has “failed”
or
– BTS can unambiguously determine the PTS has “failed”
PTS, if operational, MUST surrender role of CTS
Combination of:
– Server Offline Signal (OLS - Channel going away signal) and
– Console Assisted Recovery (CAR)
Used to determine if BTS can take over as CTS
Server Offline Signal (OLS)
Server Offline signal (OLS) transmitted on a channel by the server to indicate that the channel is going offline
– Signals are independent of STP
Conditions when OLS transmitted by server include:
– Server or LPAR dump
– Server Power off
– Chpid configure off
OLS may not be transmitted for certain failures:
– Server or site power outage
– Channel subsystem fails
– System Assist Processor (SAP) recovery
– Link failures
Console Assisted Recovery (CAR)
CAR uses HMC/SE LAN to determine
– CTS has failed or operational
– BTS can take over as CTS
BTS initiates CAR process when:
– BTS has lost communication with the CTS
BTS sends command to its Support Element (SE) to determine the state of the CTS
BTS SE communicates via HMC with CTS SE
If CTS state determined to have “failed”
– BTS takes over as CTS
If CTS state “good” or “indeterminate”
– BTS CANNOT take over as S1
– BTS eventually becomes unsynchronized at end of Freewheel Interval
[Figure: P1, P2 in Parallel Sysplex. z9 EC (P000STP2), PTS/CTS, S1, and z990 (SCZP101), BTS, S2, joined by coupling links; the z9 EC SE and z990 SE communicate through the HMC.]
OLS and CAR Recovery Rules
Applicable in an STP-only CTN when optional BTS assigned, but Arbiter NOT assigned
OLS rules applicable when two or more links between servers
If Backup Time Server (BTS) receives OLS on the last two established STP paths to Current Time Server (CTS) within two seconds:
– BTS takes over as CTS (S1)
– CAR used to confirm PTS has failed or has surrendered as CTS
If the PTS/CTS has sent OLS on the last two established STP paths to BTS within two seconds:
– PTS will surrender its role of CTS
If only a single link between PTS and BTS or OLS on the last two established STP paths received more than 2 seconds apart:
– CAR used to determine if BTS can take over as CTS
– OLS rules do not apply
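The rules above combine into one takeover decision for the BTS. The following is a minimal sketch, not the STP implementation: all names are hypothetical, the CAR result is passed in rather than obtained over the HMC/SE LAN, and treating a gap of exactly two seconds as "within two seconds" is an assumption.

```python
from enum import Enum
from typing import Optional


class CtsState(Enum):
    """CTS state as reported by Console Assisted Recovery."""
    FAILED = "failed"
    GOOD = "good"
    INDETERMINATE = "indeterminate"


OLS_WINDOW_S = 2.0  # window for OLS on the last two established STP paths


def bts_may_take_over(link_count: int,
                      ols_gap_s: Optional[float],
                      car_state: CtsState) -> bool:
    """Decide whether the BTS may take over as CTS when no Arbiter is assigned.

    link_count -- number of links between PTS and BTS
    ols_gap_s  -- seconds between OLS arrivals on the last two established
                  paths, or None if OLS was not received on both
    car_state  -- CTS state determined by CAR
    """
    ols_rule_met = (link_count >= 2
                    and ols_gap_s is not None
                    and ols_gap_s <= OLS_WINDOW_S)
    if ols_rule_met:
        return True  # CAR is then used only to confirm the takeover
    # OLS rules do not apply: CAR alone decides, and only a CTS that is
    # determined to have failed allows the takeover.
    return car_state is CtsState.FAILED
```

When the function returns False, the BTS cannot take over and eventually becomes unsynchronized at the end of the freewheel interval.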
CTS failure – OLS on last two paths received within 2 secs
If BTS (SCZP101) receives OLS on last two STP paths to CTS (P000STP2) within 2 seconds
– BTS takes over as CTS (S1)
– To assure only 1 CTS
• PTS surrenders role of CTS
• CAR confirms CTS has failed
z/OS systems on P000STP2 may have posted WTOR (IEA394A)
z/OS systems on SCZP101 not affected
STP user actions:
– Repair CTS (P000STP2)
– STP does an automatic retakeover
• P000STP2 joins as S2
• Retakes role of CTS after verification checks
• SCZP101 becomes S2
[Figure: P1, P2 in Parallel Sysplex, before and after recovery. Before: z9 EC (P000STP2), PTS/CTS, S1, and z990 (SCZP101), BTS, S2, joined by coupling links, with both SEs attached to the HMC. After recovery: P000STP2 is an S0 server and SCZP101 is an S1 server.]
CTS failure – OLS on last two paths NOT received within 2 seconds; CAR unsuccessful
BTS does not receive OLS on last two established STP paths to CTS within 2 seconds:
BTS initiates “Console assisted recovery”
– BTS (SCZP101) SE attempts to determine state of CTS (P000STP2) by communicating via HMC with CTS SE
CTS (P000STP2) state “indeterminate”
– BTS CANNOT take over as S1
– BTS eventually becomes unsynchronized at end of Freewheel Interval
– z/OS systems (STPMODE YES) post WTOR (IEA394A)
STP User actions
– Reassign BTS as CTS
– Respond with Retry to WTOR
– NOTE: When PTS rejoins, it will not re-takeover role of CTS, since roles reassigned
[Figure: P1, P2 in Parallel Sysplex, before and after recovery (assume CAR unsuccessful). Before: z9 EC (P000STP2), PTS/CTS, S1, and z990 (SCZP101), BTS, S2, joined by coupling links, with both SEs attached to the HMC. After recovery: SCZP101 (BTS) is S0; the caption reads "S0 server <<< After Recovery >>> S0 server".]
Reconfiguration after CTS Failure – BTS unsynchronized (S0)
Select System (Sysplex) Time task of SCZP101
– Server that will become the new CTS after reconfiguration
Select Network Configuration tab
Assign SCZP101 as BTS and CTS
Select “Force configuration”
– Since starting from Stratum 0
Respond “Retry” to each WTOR (IEA394A) posted
– Note that after responding to the first WTOR, the remaining WTORs in the Sysplex have to be responded to within approximately 4 minutes if up to 8 z/OS images (additional 30 secs per image if more than 8 images)
Last Link Failure
When multiple links configured between PTS and BTS, a single link failure results in
– BTS selecting redundant link
Failure of last Coupling link between BTS and CTS
– CTS/PTS not affected
– BTS loses communication with CTS
– BTS initiates “Console assisted recovery”
• CTS (PTS) state “good”
BTS unsynchronized
– z/OS systems (STPMODE YES) on BTS post WTOR (IEA394A)
STP User actions
– Repair “failing” link
• BTS joins CTN as S2
– Respond with Retry to WTOR
[Figure: P1, P2 in Parallel Sysplex, before and after recovery; servers labeled SCZP901 and SCZP101. Before: z9 EC, PTS/CTS, S1, and z990, BTS, S2, joined by a single coupling link, with both SEs attached to the HMC. After recovery: the PTS/CTS remains an S1 server and the BTS is an S0 server.]
STP-only CTN with 3 or more servers/CFs
CTN has a PTS, BTS, and Arbiter assigned
Assumption: PTS also assigned the CTS role
CANNOT have two Stratum 1 servers in timing network
Backup Time Server (BTS) can take over as Current Time Server (CTS), active Stratum 1, only if either:
– Preferred Time Server (PTS) can indicate it has “failed”
or
– BTS can unambiguously determine the PTS has “failed”
PTS, if operational, MUST surrender role of CTS
Arbiter Assisted Recovery used to determine if BTS can take over as CTS
Arbiter Assisted Recovery
Arbiter provides additional means to determine if BTS can take over as the CTS
– Handles failure scenarios when OLS may not be sent by CTS or received by BTS
If BTS loses communication on all established paths to CTS
– BTS does not invoke OLS recovery rules
– BTS and Arbiter communicate to establish if Arbiter also has lost communication on all established paths to CTS
If both BTS and Arbiter cannot communicate with CTS
– BTS takes over as CTS (S1)
Failure also implies PTS cannot communicate with BTS and Arbiter
– Since only 1 CTS (S1) can exist, PTS initially surrenders role of CTS
• Has to assume that BTS has taken over as CTS
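Arbiter Assisted Recovery is in effect a two-party agreement that the CTS is unreachable. The sketch below captures only that agreement; the function and argument names are hypothetical, and it assumes the BTS can still reach the Arbiter.

```python
def bts_takes_over_with_arbiter(bts_reaches_cts: bool,
                                arbiter_reaches_cts: bool) -> bool:
    """Arbiter Assisted Recovery decision, assuming BTS and Arbiter can
    still communicate with each other.

    The BTS takes over as CTS (S1) only when neither it nor the Arbiter
    can communicate with the CTS on any established path.
    """
    return not bts_reaches_cts and not arbiter_reaches_cts
```

If the Arbiter still reaches the CTS, the loss is treated as a link problem on the BTS side rather than a CTS failure, so the BTS does not take over.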
CAR – BTS and Arbiter assigned
Console assisted recovery uses HMC/SE LAN to determine
– BTS can take over as CTS (initiated by BTS case below)
– PTS can retake role of CTS (initiated by PTS case below)
BTS initiates Console Assisted recovery process when:
– BTS has lost communication with the CTS, and
– BTS cannot communicate with the Arbiter to initiate the Arbiter Assisted recovery process.
– Works the same way as case when no Arbiter assigned
PTS initiates Console Assisted recovery process when:
– PTS has lost communication with the BTS and the Arbiter
– PTS initially surrenders role of CTS
• Has to assume that BTS has taken over as CTS
– Needs to determine if BTS failed or operational
– If BTS determined to have failed
• PTS retakes its role of CTS
– If BTS state good (BTS is capable of taking over as CTS) or “indeterminate”
• PTS either becomes a S3 server if a S2 clock source is available
or
• PTS becomes unsynchronized at end of Freewheel period
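The PTS-initiated case above has three possible outcomes, sketched here as a function of the CAR result. Names and return strings are hypothetical labels chosen for illustration.

```python
from enum import Enum


class BtsState(Enum):
    """BTS state as reported to the PTS by Console Assisted Recovery."""
    FAILED = "failed"
    GOOD = "good"
    INDETERMINATE = "indeterminate"


def pts_outcome_after_car(bts_state: BtsState,
                          s2_clock_source_available: bool) -> str:
    """Role of the PTS after it has surrendered the CTS role and run CAR."""
    if bts_state is BtsState.FAILED:
        return "retakes CTS"      # no other S1 can exist, safe to resume
    # BTS is good or indeterminate: it may already be the CTS.
    if s2_clock_source_available:
        return "becomes S3"       # synchronizes through an S2 server
    return "unsynchronized"       # at the end of the freewheel period
```

The asymmetry is deliberate: an "indeterminate" BTS is treated like a healthy one, because allowing the PTS back as CTS could otherwise create two Stratum 1 servers.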
STP-only CTN (Preferred, Backup and Arbiter assigned) CTS failure or power outage
BTS loses communication with CTS on all established paths
BTS does not invoke OLS recovery rules
BTS and Arbiter communicate to establish if Arbiter also cannot communicate with CTS
If both BTS and Arbiter cannot communicate with CTS
– BTS takes over as CTS (S1)
Since only 1 CTS (S1) can exist,
– PTS initially surrenders role of CTS
– PTS initiates “Console Assisted Recovery” to determine if BTS failed or operational
– If BTS determined to have failed
• PTS retakes its role of CTS
[Figure: P1, P2, P3 in Parallel Sysplex, before and after recovery; servers labeled SCZP101, SCZP901, and P000STP2. Before: z9 EC, PTS/CTS, S1; z990, BTS, S2; z890, Arbiter, S2; joined by coupling links, with an HMC. After recovery: the PTS is an S0 server, the BTS is an S1 server, and the Arbiter remains S2.]
STP-only CTN (Preferred, Backup and Arbiter assigned) CTS failure or power outage - continued
STP User Actions
– Repair CTS
STP does an automatic re-takeover
– SCZP101 joins as S2
– Retakes role of CTS after verification checks
– SCZP901 becomes S2
[Figure: P1, P2, P3 in Parallel Sysplex, original configuration after repair action. z9 EC, PTS/CTS, is an S1 server; z990, BTS, is an S2 server; z890, Arbiter, is S2; joined by coupling links, with an HMC.]
STP-only CTN (Preferred, Backup and Arbiter assigned) Last link failures
Last link between CTS and BTS
– BTS loses communication with CTS on all established paths
– BTS initiates Arbiter assisted recovery
– Arbiter still has connectivity to CTS
– BTS synchronizes to Arbiter and becomes S3
– STP User actions: None
Last link between CTS and Arbiter
– Arbiter synchronizes to BTS and becomes S3
– STP User actions: None
Last link between BTS and Arbiter
– BTS and Arbiter both stay as S2
– STP User actions: None
– Arbiter Assisted Recovery exposed for any subsequent loss of communication between BTS and CTS
[Figure: P1, P2, P3 in Parallel Sysplex; servers labeled SCZP101, SCZP901, and P000STP2. z9 EC, PTS/CTS, S1; z990, BTS, S2; z890, Arbiter, S2; joined by coupling links, with an HMC.]
Power Outage PTS/CTS with Internal Battery Feature (IBF)
With IBF on CEC1
– CEC1 power outage, enters IBF state
– CEC1 notifies CEC2 it is running on IBF
– CEC2 waits for 30 seconds to take action
• Could be a power glitch
• If notified within 30 seconds that CEC1 back to “normal power”, no further action
– If CEC1 in IBF state > 30 seconds,
• CEC2 takes over as the CTS
• CEC1 becomes S2 until IBF no longer functional and power drops
– CEC1 power resumes
• Automatic re-takeover as PTS/CTS
[Figure: two configurations, "CEC power outage in same data center" and "Site power outage – 2 data centers". In each, CEC1 (PTS/CTS, S1) and CEC2 (BTS, S2) are joined by coupling links, with HMCs attached.]
IBF is designed to enable PTS/CTS to reconfigure the BTS as the CTS if
– Power outage of PTS/CTS
– Power outage of site where PTS/CTS and Arbiter are located
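The 30-second wait described above can be sketched as a decision made by the BTS (CEC2). This is illustrative only: the names are hypothetical, the later automatic re-takeover by CEC1 is not modeled, and treating exactly 30 seconds as still inside the wait is an assumption.

```python
IBF_WAIT_S = 30.0  # time CEC2 waits before acting, from the slide


def cec2_action(ibf_duration_s: float, normal_power_restored: bool) -> str:
    """What CEC2 (BTS) does after CEC1 (PTS/CTS) reports it is on IBF.

    ibf_duration_s        -- how long CEC1 has been in the IBF state
    normal_power_restored -- True if CEC1 has reported "normal power" again
    """
    if ibf_duration_s > IBF_WAIT_S:
        return "take over as CTS"   # outage outlasted the glitch window
    if normal_power_restored:
        return "no action"          # power glitch, CEC1 stays PTS/CTS
    return "wait"                   # still inside the 30-second window
```

The wait exists so that a brief power glitch does not trigger a role change, while a real outage still leaves CEC1 enough battery time to hand the CTS role over in an orderly way.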
Power outage of data center (Site 1) with PTS and Arbiter
With Internal Battery Feature (IBF) on CEC1 and CEC3
– Site 1 power outage
– CEC1 and CEC3 enter IBF state
– CEC1 and CEC3 notify CEC2 that they are running on IBF
– CEC2 waits for 30 seconds to take action
• Could be a power glitch
• If notified within 30 seconds that CEC1 back to “normal power”, no further action
– If CEC1 and CEC3 in IBF state > 30 seconds,
• CEC2 takes over as the CTS
  – Not dependent on Arbiter Assisted Recovery
• CEC1 becomes S2 until IBF no longer functional and power drops
– CEC1 power resumes
• Automatic re-takeover as PTS/CTS
[Figure: Site 1 and Site 2, each with an HMC. CEC1 (PTS/CTS, S1), CEC2 (BTS, S2), CEC3 (Arbiter, S2), and CEC4 (S2), joined by coupling links.]
IBF Recommendations
Single data center
– IBF only protects for server power outage
– CTN with 2 servers, install IBF on at least the PTS/CTS
• Also recommend IBF on BTS to provide recovery protection when BTS is the CTS
– CTN with 3 or more servers, IBF not required to recover from CTS power outage, if Arbiter configured
Two data centers
– IBF protects for both server and site power outage scenarios
– CTN with 2 servers (one in each data center), install IBF on at least the PTS/CTS
• Also recommend IBF on BTS to provide recovery protection when BTS is the CTS
– CTN with 3 or more servers, install IBF on CTS and Arbiter (in same site as CTS)
• Also recommend IBF on BTS to provide recovery protection when BTS is the CTS
STP-only CTN (Preferred and Backup assigned) Site 1 Failure
[Figure: P1, P2 in Parallel Sysplex. Site 1: SCZP101 (PTS/CTS, S1); Site 2: SCZP901 (BTS, S2); coupling links between the sites, with an HMC at each site.]
BTS (SCZP901) loses all communication with CTS (SCZP101)
– BTS most probably does not receive OLS
– BTS initiates “Console assisted recovery”
– Results of “Console assisted recovery”
• CTS state most probably indeterminate
– BTS eventually becomes unsynchronized at end of Freewheel Interval
– z/OS systems (STPMODE YES) in site 2 post WTOR (IEA394A)
STP User actions
– Reassign BTS as CTS
– Respond with Retry to WTOR
STP-only CTN (Preferred and Backup assigned) Site 2 failure
[Figure: P1, P2 in Parallel Sysplex. Site 1: SCZP101 (PTS/CTS, S1); Site 2: SCZP901 (BTS, S2); coupling links between the sites, with an HMC at each site.]
PTS (SCZP101) continues role of CTS
z/OS systems in Site 1 requiring STPMODE YES not affected
STP User actions
– Restore Site 2
STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 1 Failure – Arbiter in same site as BTS
[Figure: P1, P2, P3, P4 in Parallel Sysplex across Site 1 and Site 2, each with an HMC. SCZP101 (PTS/CTS, S1) in Site 1; SCZP901 (BTS, S2) and P000STP2 (Arbiter, S2) in Site 2; STP1 (S2); joined by coupling links.]
BTS (SCZP901) loses all communication with CTS (SCZP101)
BTS and Arbiter communicate to establish if Arbiter also cannot communicate with CTS
– Both cannot communicate
BTS takes over as CTS (S1)
z/OS systems in Site 2 requiring STPMODE YES not affected
STP User actions
– None
STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 2 Failure – Arbiter in same site as BTS
PTS/CTS (SCZP101) loses communication with both BTS and Arbiter
PTS surrenders role of CTS
PTS initiates “Console assisted recovery” to determine if BTS failed or operational
Results of “Console assisted recovery”
– BTS state most probably indeterminate
PTS CANNOT retake role of CTS
All z/OS systems in site 1 post WTOR (IEA394A)
STP User actions
– Reassign PTS as CTS
– Respond with Retry to WTOR
[Figure: P1, P2, P3, P4 in Parallel Sysplex across Site 1 and Site 2, each with an HMC. SCZP101 (PTS/CTS, S1) in Site 1; SCZP901 (BTS, S2) and P000STP2 (Arbiter, S2) in Site 2; STP1 (S2); joined by coupling links.]
STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 1 Failure – Arbiter located in same site as PTS/CTS
[Figure: Site 1 and Site 2, each with an HMC. SCZP101 (PTS/CTS, S1, P1) and P000STP2 (Arbiter, S2, P3) in Site 1; SCZP901 (BTS, S2, P2) in Site 2; STP1 (S2); joined by coupling links.]
BTS loses all communication with CTS
BTS cannot communicate with Arbiter
BTS initiates “Console assisted recovery”
Results of “Console assisted recovery”
– CTS state most probably indeterminate
BTS CANNOT take over as S1
BTS eventually becomes unsynchronized
– z/OS systems (STPMODE YES) in site 2 post WTOR (IEA394A)
• Similar to case with only PTS and BTS assigned
STP User actions
– Reassign BTS as CTS
– Respond with Retry to WTOR
STP-only CTN (Preferred, Backup, and Arbiter assigned) - Reconfiguration after Site 1 Failure
Select System (Sysplex) Time task of SCZP901
– Server that will become the new CTS after reconfiguration
Select Network Configuration tab
Assign SCZP901 as PTS and CTS
Assign STP1 as BTS
Select “Force configuration”
– Since starting from Stratum 0
Respond “Retry” to each WTOR (IEA394A) posted
– Note that after responding to the first WTOR, the remaining WTORs in the Sysplex have to be responded to within approximately 4 minutes if up to 8 z/OS images (additional 30 secs per image if more than 8 images)
STP-only CTN (Preferred, Backup, and Arbiter assigned) Site 2 Failure – Arbiter located in same site as PTS/CTS
CTS loses communication with only the BTS
CTS maintains communication with Arbiter
PTS maintains role of CTS (S1)
STP-only CTN servers in Site 1 stay synchronized to CTS (S1)
z/OS systems in Site 1 requiring STPMODE YES not affected
STP User actions
– None
[Figure: Site 1 and Site 2, each with an HMC. SCZP101 (PTS/CTS, S1) and P000STP2 (Arbiter, S2) in Site 1; SCZP901 (BTS, S2) in Site 2; z9 BC (S2); joined by coupling links.]
Multi-site CTN Rules and Recommendations
Provide redundant routes for fiber links between sites
Use only qualified DWDMs
If 3 or more servers in CTN, assign BTS and Arbiter
– Locate the Arbiter in same site as PTS
• Provides better recovery for scenarios when:
  – OLS may not be sent from CTS or
  – OLS may not be received by BTS
ETS Recovery - DISCLAIMER
The following section is intended to provide ONLY a basic overview of ETS Recovery
For more detailed recovery information and the actions that must be taken in response to various failures, please see the ETS recovery information in
– STP Planning Guide, SG24-7280
– STP Implementation Guide, SG24-7281
ETS Recovery introduction
External time source in an STP-only CTN can be provided by:
– Using dial-out on the HMC
– Using an NTP server (LAN connection)
– Using an NTP server with a pulse per second output option (LAN connection and coaxial cable to the PPS port of an ETR card)
Limited recovery actions when ETS configured to use dial-out
– HMC attempts to redial if line is busy
– Option to have more than one HMC act as a phone server
Regardless of the ETS option selected, failures associated with ETS do not affect the capability of servers in a CTN to stay synchronized with each other.
– As long as the timing state of the servers remains synchronized, z/OS images that depend on synchronization are not affected.
The only effect of unsuccessful recovery for an ETS failure is that the CTN will slowly drift away from ETS time
NTP Server Redundancy Recommendations
At least one NTP server must be configured on the PTS/CTS
– Only the Current Time Server (CTS) makes time adjustments based on information from the NTP Server
Also recommended to configure at least one NTP server on the BTS
– Allows continuous NTP server access when BTS becomes the CTS
– Allows time adjustments to the STP-only CTN when the PTS/CTS cannot access any of its NTP servers
If two NTP servers are configured, user is responsible for selecting preferred NTP server
– This NTP server is called the selected NTP server;
– The other NTP server is called the non-selected NTP server.
Recommendations apply when using NTP servers with or without PPS
ETS Recovery design using NTP Servers
Configured NTP servers on the PTS/CTS are accessed once every 10 minutes by the SNTP client.
– Once every hour, assuming a successful access of the selected NTP server, the SNTP client sends a CST adjustment to the STP facility.
– Normally, the SNTP client on the CTS uses the time information from the selected NTP server to perform the time adjustment.
• The time information from the non-selected NTP server is only used when there is a failure associated with accessing time information from the selected NTP server.
Configured NTP servers on the BTS are also accessed once every 10 minutes.
– The BTS calculates a value for time adjustment based on this access, and communicates the information to the PTS over the coupling links.
If the PTS/CTS cannot access either of its configured NTP servers, it will switch over to using the timing information sent from the BTS to steer the STP-only CTN.
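The access and adjustment cadence described on this slide can be pictured with a small sketch. It is an illustration only, not IBM's implementation: the function name `sntp_actions` and the simple minute-counter model are assumptions made for the example.

```python
POLL_INTERVAL_MIN = 10    # NTP servers are accessed once every 10 minutes
ADJUST_INTERVAL_MIN = 60  # a CST adjustment is sent once every hour


def sntp_actions(minute, selected_ok):
    """Return the actions taken at a given minute (hypothetical model).

    minute      -- minutes elapsed since the SNTP client started
    selected_ok -- True if the selected NTP server was accessed successfully
    """
    actions = []
    if minute % POLL_INTERVAL_MIN == 0:
        actions.append("poll NTP servers")
    # The hourly CST adjustment assumes a successful access of the selected server.
    if minute % ADJUST_INTERVAL_MIN == 0 and selected_ok:
        actions.append("send CST adjustment")
    return actions
```

In this model, a failed access at the top of the hour means no adjustment is sent for that hour, which is what starts the recovery counting covered on the next slide.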
Order of Recovery actions - ETS using NTP Servers
After two unsuccessful attempts (two hours) at sending a CST adjustment value based on the selected NTP server,
– SNTP client will switch to sending timing adjustment information based on the non-selected NTP server
After two unsuccessful attempts (two hours) at sending a CST adjustment value based on the non-selected NTP server,
– STP will steer the CTN using the calculation from the BTS
– BTS information could be based on:
• Selected NTP server at the BTS, or
• Non-selected NTP server, if valid data cannot be accessed from the selected NTP server
When STP is not able to switch to any operational NTP server, automatic base steering continues
– Base steering allows STP to compensate for the drift characteristics of the oscillator, thereby maintaining relatively good time accuracy at the Current Time Server, even if an ETS is not available.
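The recovery order above can be summarized as a simple fallback chain. The following is a minimal sketch under stated assumptions, not the actual STP logic: the function `steering_source`, its parameters, and the returned labels are hypothetical names chosen for illustration.

```python
MAX_FAILED_ATTEMPTS = 2  # two unsuccessful hourly attempts (two hours)


def steering_source(selected_failures, non_selected_failures, bts_data_valid):
    """Pick the time source used to steer the CTN (hypothetical model).

    selected_failures     -- consecutive failed attempts using the selected NTP server
    non_selected_failures -- consecutive failed attempts using the non-selected NTP server
    bts_data_valid        -- True if the BTS is supplying a usable adjustment value
    """
    if selected_failures < MAX_FAILED_ATTEMPTS:
        return "selected NTP server"
    if non_selected_failures < MAX_FAILED_ATTEMPTS:
        return "non-selected NTP server"
    if bts_data_valid:
        return "BTS calculation"
    # No operational NTP server anywhere: only automatic base steering remains.
    return "base steering only"
```

For example, `steering_source(2, 0, True)` selects the non-selected NTP server, and `steering_source(2, 2, False)` falls through to base steering.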
Possible failures - ETS using NTP Servers
[Figure labels: NTP server (Stratum 1, selected), Ethernet switch, SNTP client, PTS & CTS or BTS, System z HMC; failure points 1 and 2 marked]
1. Loss of LAN connectivity between the Support Element and the NTP server
2. Complete NTP server failure or bad NTP data from the NTP server
Scenario 1 - Redundant NTP Servers on PTS/CTS
[Figure labels: PTS/CTS (S1), SNTP client, Ethernet switch, NTP server 1 (Stratum 1, selected), NTP server 2 (HMC NTP server, Stratum 2, non-selected), corporate network, NTP server (Stratum 1); failure points 1 and 2 marked]
Possible failures
1. Loss of LAN connectivity between the Support Element and the NTP server
2. Complete NTP server failure or bad NTP data from the NTP server
Recovery
If the selected NTP server becomes unavailable, BUT the non-selected NTP server is still available (failure 2), the SNTP client will use the non-selected NTP server as its ETS, and will continue steering the CTN using timing information received from NTP server 2.
Failure
If the failure is a LAN failure (failure 1), NO recovery is possible, and the CTN continues to use automatic base steering.
Scenario 2 - Redundant NTP Servers on PTS and BTS
Compared to Scenario 1: This configuration provides an additional degree of continuous availability of NTP servers
Suitable for a dual site implementation, with PTS and BTS in different sites.
Recovery
If PTS/CTS is not able to access NTP server 1 for two hours
– Will start using time adjustment information sent by BTS approximately an hour later to steer the CTN.
If BTS is not able to access NTP server 2 for two hours
– NO recovery action.
– However, problem should be corrected as soon as possible to maintain ETS redundancy.
[Figure labels: Coordinated Timing Network, PTS/CTS, BTS, NTP server 1 (Stratum 1, selected@PTS), NTP server 2 (Stratum 1, selected@BTS), SNTP clients, Ethernet switches, System z HMCs]
Continuous NTP server availability - Enhanced Configuration
[Figure labels: IBM System z Coordinated Timing Network, site 1, site 2, PTS/CTS (S1), BTS (S2), NTP server 1 (Stratum 1, selected@PTS), NTP server 2 (Stratum 1, selected@BTS), System z HMC with NTP server enabled (Stratum 2, non-selected@PTS), NTP server (Stratum 1) on the corporate network, SNTP clients, Ethernet switches, System z HMC]
To provide even more redundancy, also consider configuring an additional NTP server on the HMC
– The NTP server on the HMC is the non-selected NTP server at the PTS/CTS.
– If the selected NTP server fails at the PTS/CTS, the non-selected NTP server takes over the ETS role and provides the time information.
– In case both NTP servers in site 1 are not accessible for a certain period of time (for example because of LAN problems), the time adjustment information sent by the BTS will be used
ETS Recovery design using NTP Servers with PPS
Configured NTP servers on PTS/CTS are accessed once a minute by the SNTP client.
– Once every 10 minutes, assuming successful access of both NTP servers, the SNTP client sends time adjustment information based on both NTP servers to the STP facility.
Configured NTP servers on BTS are also accessed once a minute by the SNTP client
– Once every 10 minutes, time adjustment information based on both NTP servers is sent to the STP facility on BTS.
– Normally, STP facility on BTS uses the time information in conjunction with the PPS signal from the selected NTP server to calculate a time adjustment.
• BTS then communicates this information to the PTS over the coupling links.
– Adjustment calculation based on time information and PPS signal from the non-selected NTP server on BTS is only used when there is a failure associated with accessing time information or PPS signals from the selected NTP server.
If the PTS/CTS cannot access either of its configured NTP servers, it will switch over to using the timing information sent from the BTS to steer the STP-only CTN.
Possible Failures - ETS using NTP Servers with PPS
[Figure labels: NTP server (Stratum 1) with PPS out, Ethernet switch, SNTP client, ETR card PPS port 0, PTS/CTS or BTS, System z HMC; failure points 1, 2 and 3 marked]
Possible failures
1. Loss of LAN connectivity between SE and NTP server or bad NTP data
2. PPS signal not received by PPS port on the ETR card.
3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.
Order of Recovery actions - ETS using NTP Servers w/PPS
[Figure labels: Coordinated Timing Network, PTS/CTS (S1), BTS (S2), NTP server 1 (Stratum 1, PPS out, selected@PTS), NTP server 2 (Stratum 1, PPS out, selected@BTS), SNTP clients, Ethernet switches, System z HMCs, ETR card PPS ports 0 and 1 on each server]
Possible failures
1. Loss of LAN connectivity between SE and NTP server or bad NTP data
2. PPS signal not received by PPS port on the ETR card.
3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.
If failure type 1, STP will continue using PPS signals received on the PPS port from the selected NTP server on the PTS/CTS.
If failure type 2 or 3, STP will switch to using time adjustment information received from BTS.
Note: Refer to SG24-7280 and SG24-7281 when the NTP server with PPS configuration is different
Order of Recovery actions - ETS using NTP Servers w/PPS (continued)
Regardless of the specific redundancy provided by an NTP server with PPS configuration
– If PPS signals are not received from any of the configured NTP servers on the PTS/CTS and the BTS, BUT valid NTP data is available,
• STP will continue using the NTP data for steering the CTN following the same recovery flow described in the previous “ETS recovery using NTP servers” section
– When STP is not able to switch to any operational NTP server, the automatic base steering continues.
• Base steering allows STP to compensate for drift characteristics of the oscillator, thereby maintaining relatively good time accuracy at the Current Time Server, even if an ETS is not available.
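The three PPS failure types and the corresponding action at the PTS/CTS can be expressed as a small lookup. This is a hedged sketch for the configuration with one NTP server with PPS on the PTS and one on the BTS; the enum `Failure` and the function `pts_recovery_action` are hypothetical names used only for illustration.

```python
from enum import Enum


class Failure(Enum):
    LAN_OR_BAD_NTP_DATA = 1  # loss of LAN connectivity between SE and NTP server, or bad NTP data
    PPS_NOT_RECEIVED = 2     # PPS signal not received by the PPS port on the ETR card
    NTP_SERVER_DOWN = 3      # complete failure: both NTP data and PPS output are lost


def pts_recovery_action(failure):
    """Return the action at the PTS/CTS for a failure on its selected NTP server."""
    if failure is Failure.LAN_OR_BAD_NTP_DATA:
        # The PPS signal is still arriving, so steering continues from it.
        return "continue using PPS signals from the selected NTP server"
    # Failure types 2 and 3 both mean the PPS signal is gone at the PTS/CTS.
    return "switch to time adjustment information from the BTS"
```

The split follows the PPS signal itself: only failure type 1 leaves it intact, so only that case needs no switch.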
Scenario 1 - Redundant NTP Servers with PPS on PTS/CTS
[Figure labels: PTS/CTS (S1), SNTP client, Ethernet switch, System z HMC, ETR card PPS ports 0 and 1, NTP server 1 (Stratum 1, PPS out, selected@PTS), NTP server 2 (Stratum 1, PPS out, non-selected@PTS)]
Possible failures
1. Loss of LAN connectivity between SE and NTP server or bad NTP data
2. PPS signal not received by PPS port on the ETR card.
3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.
Recovery
If NTP server 1 is not accessible by the SNTP client on the SE (failure 1), BUT the PPS signal is still received on PPS port 0
– NO recovery is required because STP will continue to steer the CTN using the PPS signals from NTP server 1.
For failures 2 and 3 on NTP server 1, STP will switch to using the time information and the PPS signals from the non-selected server, NTP server 2.
Scenario 2 - Redundant NTP Servers with PPS on PTS and BTS
[Figure labels: Coordinated Timing Network, PTS/CTS (S1), BTS (S2), NTP server 1 (Stratum 1, PPS out, selected@PTS), NTP server 2 (Stratum 1, PPS out, selected@BTS), SNTP clients, Ethernet switches, System z HMCs, ETR card PPS ports 0 and 1 on each server]
Possible failures
1. Loss of LAN connectivity between SE and NTP server or bad NTP data
2. PPS signal not received by PPS port on the ETR card.
3. Complete NTP server failure affecting both NTP data and PPS output of NTP server.
Recovery
If NTP server 1 is not accessible by the SNTP client on the SE (failure 1), BUT the PPS signal is still received on PPS port 0
– NO recovery is required because STP will continue to steer the CTN using the PPS signals from NTP server 1.
For failures 2 and 3 on NTP server 1, the PTS/CTS will start using the time adjustment information received from the BTS, which is based on NTP server 2 and its PPS signals.
For failures 1, 2 and 3 on NTP server 2
– NO recovery required
Summary - Mixed CTN
Configure for link redundancy
Attach (synchronize) at least 2 STP-configured servers to the Sysplex Timers in an Expanded Availability configuration
– Multiple S1s allowed in Mixed CTN
For configuration across two sites
– Locate Sysplex Timers in different sites
• Intermediate site may be required to locate second Sysplex Timer if two sites separated by 100 km
– Provide redundant routes for fiber links between sites
Summary - STP-only CTN
Configure for link redundancy
Initialize configuration with the PTS assigned as the Current Time Server
– PTS, CTS must be assigned
Assign at least a Backup Time Server
– Can take over as CTS (active S1)
If 3 or more servers in CTN, assign BTS and Arbiter
For configuration across 2 sites
– Provide redundant routes for fiber links between sites
– Use only qualified DWDMs
– Locate the Arbiter in same site as PTS
• Provides better recovery for scenarios when OLS may not be sent from CTS or OLS may not be received by BTS
Summary - ETS Recovery
Failures associated with ETS and possible recovery actions do not affect the capability of servers in a CTN to stay synchronized with each other.
The Current Time Server (CTS) is the only server that adjusts the Coordinated Server Time (CST) by steering it to the time obtained from an external time source (ETS). Either the PTS or the BTS can be the CTS.
It is recommended to configure at least one unique NTP server or NTP server with PPS on the PTS and the BTS. Configuring an NTP server on the BTS provides two benefits:
– Access to an NTP server when the BTS becomes the CTS as the result of planned or unplanned recovery
– Time adjustments to the STP-only CTN when the PTS/CTS cannot access any of its NTP servers
Multi-site CTN configurations do not have any specific ETS redundancy considerations, other than the general recommendation to configure an NTP server both on the PTS and the BTS.
The CTS assignment does not change as a consequence of an ETS failure.
Additional Information
Redbooks®
– Server Time Protocol Planning Guide SG24-7280
– Server Time Protocol Implementation Guide SG24-7281
– Server Time Protocol Recovery Guide SG24-7380
Education
– Introduction to Server Time Protocol (STP)
• Available on Resource Link™
• www.ibm.com/servers/resourcelink/hom03010.nsf?OpenDatabase
STP Web site
– www.ibm.com/systems/z/pso/stp.html
Systems Assurance
– The IBM team is required to complete a Systems Assurance Review (SAPR Guide SA06-012) and to complete the Systems Assurance Confirmation Form via Resource Link
Techdocs and WSC Flashes
– http://www-03.ibm.com/support/techdocs/atsmastr.nsf/Web/Techdocs
• Search on “STP”
IBM Implementation Services for System z - Server Time Protocol (6948-J56)
Offering Description
• This offering is designed to assist clients to quickly and safely implement Server Time Protocol within their existing environments. STP provides clients with the capability to efficiently manage time synchronization within their multi-server infrastructure. Following best practices and using detailed planning services, IBM helps clients identify various implementation models and engage in the appropriate configuration required to effectively support STP for driving a more responsive business and IT infrastructure.
Program, Play, Industry Alignment
• Infrastructure Improvement; Energy Efficiency; Better performance and lower operational cost
Client Value (enables customers to...)
• Swift and secure implementation of STP for improved availability, integrity and performance
• Improves multi-server time synchronization without interrupting operations
• Enables integration with next generation of System z infrastructure
Target Audience
• Primarily core, Large Enterprise customers.
• Existing z midrange clients
Key Competitors
• In house staff
Competitive Differentiation
• Leverages best practices with secure implementation
• Short implementation time, lower risk
• Provides support and facilitates knowledge sharing through IBM’s mainframe expertise
Proof Points & Claims for Client Value / Differentiation
• Need to safely implement a reliable replacement for Sysplex Timer® while maintaining continuous operations
• Cost of providing and maintaining hardware, floor space and solution support for additional Sysplex Timer intermediate site
• Lack of in-house expertise, skills and resources for implementing Server Time Protocol
Engagement Portfolio
• http://spimweb1.boulder.ibm.com/services/sosf/dyno.wss?oid=50423&loc=All&langcd=en-US#1
Offering Manager
• Anna Lee/Southbury/IBM, 512-590-8914, T/L: 268-9318
IBM Announces - IBM Implementation Services for System z - Server Time Protocol
Implementation of STP for improved availability and performance
Offering: Assist clients to quickly and safely implement Server Time Protocol within their existing environments. IBM helps clients identify various implementation models and engage in the appropriate configuration required to effectively support STP for driving a more responsive business and IT infrastructure.
Customer Value:
- Improves multi-server time synchronization without interrupting operations
- Enables integration with next generation of System z infrastructure
- Swift and secure implementation of STP for improved availability, integrity, and performance
- Reduces hardware maintenance and power costs while eliminating intermediate site requirements for Sysplex Timer
Leverages IBM’s knowledge and best practices to help implementation of Server Time Protocol
Reference Material - Terminology
APAR    Authorized Program Analysis Report
ARB     Arbiter
BTS     Backup Time Server
CF      Coupling Facility
CTS     Current Time Server
CTN     Coordinated Timing Network
DWDM    Dense Wavelength Division Multiplexer
ETR     External Time Reference
ETS     External Time Source
FC      Feature Code
HMC     Hardware Management Console
HCA     Host Channel Adapter
ICB     Integrated Cluster Bus
IPL     Initial Program Load
ISC     InterSystem Coupling Channel
LAN     Local Area Network
LIC     Licensed Internal Code
LPAR    Logical Partition
NTP     Network Time Protocol
PR/SM   Processor Resource / Systems Manager
PSIFB   Parallel Sysplex InfiniBand
PTF     Program Temporary Fix
PTS     Preferred Time Server
SW      Software (programs and operating systems)
SE      Support Element
TPF     Operating System
UTC     Coordinated Universal Time
z/OS    Operating System
z/VM    Operating System
z/VSE   Operating System
Questions?
Thank You
ETR network failures
ETR link or CEC ETR port failure
– When Sysplex Timer signals are not received by the server's active ETR port, the server switches to the alternate ETR port
Single CLO link failure
– Both Sysplex Timers stay in synch; continue to transmit to attached servers
Both CLO links failure
– Primary timer continues to transmit when communication between Timers is lost
– Secondary timer stops transmitting when communication between Timers is lost
[Figure labels: ETR Network ID = 15, Sysplex Timers 9037 A (Primary) and 9037 B (Secondary, stops transmitting), CLO links, ETR links (active and alternate), servers z900 (P2), z990 (P1) and z890 (P3), ISC-3 links in Peer Mode, ICB-3 links]
ETR Network Failures (continued)
Primary Sysplex Timer fails or power outage
– OLS received by Secondary ST
– Secondary ST becomes Primary ST
– z/OS systems (ETRMODE YES) on all servers not affected
Primary Sysplex Timer in Site 1, Secondary Sysplex Timer in Site 2
– Site 1 fails
– Secondary ST most probably does not receive OLS
– Secondary ST stops transmitting
– z/OS systems in Site 2 (ETRMODE YES) post WTOR (IEA015)
[Figure labels: ETR Network ID = 15, Sysplex Timers 9037 A (Primary) and 9037 B (Secondary, stops transmitting), CLO links, ETR links (active and alternate), servers z900 (P2), z990 (P1) and z890 (P3), ISC-3 links in Peer Mode, ICB-3 links]
System Failure Handling in a sysplex
To help get the sick/dead system out of the way as quickly as possible, IBM introduced the Sysplex Failure Management (SFM) component of XCF.
SFM can (under installation control) automatically partition a system from the sysplex if:
– The Failure Detection Interval has been reached AND
– No heartbeat has been received AND
– The apparently dead system is not sending any XCF signals.
The Failure Detection Interval (prior to z/OS 1.11) defaults to either
– 25 seconds (LPAR with dedicated CPs) or
– 85 seconds (LPAR with shared CPs) or
– It can be overridden in the COUPLExx member
To encourage customers to use SFM, health checks were provided to ensure that there is an active SFM policy and that the policy specifies the ISOLATETIME option.
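The partitioning conditions and the pre-z/OS 1.11 defaults on this slide can be combined into a short sketch. It is only a model of the rule as stated here, not XCF code: the function names and parameters are hypothetical, and "no heartbeat received" is approximated by the time elapsed since the last heartbeat.

```python
def default_fdi_seconds(dedicated_cps):
    """Default Failure Detection Interval prior to z/OS 1.11, per this slide."""
    return 25 if dedicated_cps else 85


def sfm_may_partition(seconds_since_heartbeat, sending_xcf_signals,
                      dedicated_cps, fdi_override=None):
    """Return True if all three conditions listed on the slide hold.

    fdi_override models a value set in the COUPLExx member (assumption:
    expressed in seconds, like the defaults).
    """
    fdi = fdi_override if fdi_override is not None else default_fdi_seconds(dedicated_cps)
    # FDI reached with no heartbeat received in that time.
    fdi_reached = seconds_since_heartbeat >= fdi
    return fdi_reached and not sending_xcf_signals
```

With these assumptions, a silent system on dedicated CPs becomes a candidate after 25 seconds, while the same silence on shared CPs is tolerated until 85 seconds unless the interval is overridden.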
SFM pre-z/OS 1.11
If SFM NOT active,
– Operator would eventually be prompted with message IXC402D, asking him to RESET the LPAR, then reply DOWN.
If SFM active and ISOLATETIME specified (as recommended),
– System would (eventually) attempt to automatically fence the problem system and partition it out of the sysplex.
• Required a Coupling Facility
• Without the CF (in a base sysplex, for example) there is no ability to fence a system.
If the operator observed messages indicating that a system appeared non-responsive,
– Could check the system status on the HMC and take manual action if the system was in fact dead.
z/OS 1.11 introduces some fundamental changes to this philosophy...