Troubleshooting High Availability
• Node State Definitions
• Node States, Causes and Recommended Actions
Node State Definitions
The following table describes the different node states and the associated reasons. You can view the state of an existing node by viewing either the node details or the subcluster details on the Cluster Topology interface.
Note: These fields are displayed on the Cluster Topology interface only if you turn on High Availability in a subcluster.
State: Description

Initializing: This is the initial (transition) state when the Cisco Server Recovery Manager service starts; it is a temporary state.

Idle: IM and Presence Service is in Idle state when failover occurs and services are stopped. In Idle state, the IM and Presence Service node does not provide any availability or Instant Messaging services. In Idle state, you can manually initiate a fallback to this node from the Cluster Topology interface.

Normal: This is a stable state. The IM and Presence Service node is operating normally. In this state, you can manually initiate a failover to this node from the Cluster Topology interface.

Running in Backup Mode: This is a stable state. The IM and Presence Service node is acting as the backup for its peer node. Users have moved to this (backup) node.

Taking Over: This is a transition state. The IM and Presence Service node is taking over for its peer node.
Configuration and Administration of IM and Presence Service on Cisco Unified Communications Manager, Release 9.0(1)
Failing Over: This is a transition state. The IM and Presence Service node is being taken over by its peer node.

Failed Over: This is a stable state. The IM and Presence Service node has failed over, but no critical services are down. In this state, you can manually initiate a fallback to this node from the Cluster Topology interface.

Failed Over with Critical Services Not Running: This is a stable state. Some of the critical services on the IM and Presence Service node have either stopped or failed.

Falling Back: This is a transition state. The system is falling back to this IM and Presence Service node from the node running in Backup Mode.

Taking Back: This is a transition state. The failed IM and Presence Service node is taking back over from its peer.

Running in Failed Mode: An error occurs during the transition states or Running in Backup Mode state.

Unknown: State unknown.
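
The state definitions above can be captured in a small lookup structure, which is convenient when scripting checks against values read from the Cluster Topology interface. The following is a minimal sketch, not part of the product: the `NodeState` enum, the helper names, and the whitespace-tolerant parsing are all assumptions made for illustration. Only the state labels and the transition and manual-fallback classifications come from the table.

```python
from enum import Enum


class NodeState(Enum):
    """Node states listed in the table above (labels as shown in the UI)."""
    INITIALIZING = "Initializing"
    IDLE = "Idle"
    NORMAL = "Normal"
    RUNNING_IN_BACKUP_MODE = "Running in Backup Mode"
    TAKING_OVER = "Taking Over"
    FAILING_OVER = "Failing Over"
    FAILED_OVER = "Failed Over"
    FAILED_OVER_CRITICAL_DOWN = "Failed Over with Critical Services Not Running"
    FALLING_BACK = "Falling Back"
    TAKING_BACK = "Taking Back"
    RUNNING_IN_FAILED_MODE = "Running in Failed Mode"
    UNKNOWN = "Unknown"


# States the table explicitly calls transition (temporary) states.
TRANSITION_STATES = frozenset({
    NodeState.INITIALIZING,
    NodeState.TAKING_OVER,
    NodeState.FAILING_OVER,
    NodeState.FALLING_BACK,
    NodeState.TAKING_BACK,
})

# States in which the table says a manual fallback to this node is possible.
MANUAL_FALLBACK_STATES = frozenset({NodeState.IDLE, NodeState.FAILED_OVER})

# Hypothetical helper index: lower-cased label -> enum member.
_BY_LABEL = {state.value.lower(): state for state in NodeState}


def parse_state(label):
    """Map a displayed label to a NodeState, tolerating odd spacing and case.

    Unrecognized labels map to NodeState.UNKNOWN (an assumption of this sketch).
    """
    normalized = " ".join(label.split()).lower()
    return _BY_LABEL.get(normalized, NodeState.UNKNOWN)


def is_transition(state):
    """True if the table describes the state as a transition state."""
    return state in TRANSITION_STATES


def can_manually_fall_back(state):
    """True if the table allows a manual fallback to a node in this state."""
    return state in MANUAL_FALLBACK_STATES
```

For example, `parse_state("Running in  Backup Mode")` resolves to `NodeState.RUNNING_IN_BACKUP_MODE`, and `can_manually_fall_back(NodeState.IDLE)` is true because the Idle entry states that a fallback can be initiated from that state.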
Node States, Causes and Recommended Actions
The following table describes the node states, reasons, causes, and recommended actions for failed states.
Table 1: Node High Availability States, Causes and Recommended Actions
Columns: Node 1 State, Node 1 Reason, Node 2 State, Node 2 Reason, Cause/Recommended Actions
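
Node 1 State: Normal
Node 1 Reason: Normal
Node 2 State: Normal
Node 2 Reason: Normal
Cause/Recommended Actions: High Availability is running on both nodes in the subcluster. The subcluster is running normally (it is in non-failover mode). The critical services on both nodes in the subcluster are running.

Node 1 State: Failing Over
Node 1 Reason: On Admin Request
Node 2 State: Taking Over
Node 2 Reason: On Admin Request
Cause/Recommended Actions: The administrator initiates a manual failover from node 1 to node 2. The manual failover is in progress.

Node 1 State: Idle
Node 1 Reason: On Admin Request
Node 2 State: Running in Backup Mode
Node 2 Reason: On Admin Request
Cause/Recommended Actions: The manual failover from node 1 to node 2 (initiated by the administrator) is complete.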

Node 1 State: Taking Back
Node 1 Reason: On Admin Request
Node 2 State: Falling Back
Node 2 Reason: On Admin Request
Cause/Recommended Actions: The administrator initiates a manual fallback from node 2 to node 1. The manual fallback is in progress.
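
The rows for an administrator-initiated failover and fallback form a simple cycle of (node 1, node 2) state pairs. The sketch below lists that cycle so a monitoring script could tell which phase a subcluster is in. It is illustrative only: `MANUAL_CYCLE`, `describe_phase`, and `next_pair` are hypothetical names, the sketch ignores the Reason columns (the same state pairs also occur with other reasons, such as an automatic fallback), and the return to Normal after fallback follows the table's statement that both nodes are in Normal state once taking back completes.

```python
# (Node 1 state, Node 2 state) pairs, in the order the table lists them for a
# manual failover from node 1 to node 2 followed by a manual fallback.
MANUAL_CYCLE = [
    (("Normal", "Normal"), "subcluster running normally"),
    (("Failing Over", "Taking Over"), "manual failover in progress"),
    (("Idle", "Running in Backup Mode"), "manual failover complete"),
    (("Taking Back", "Falling Back"), "manual fallback in progress"),
]


def describe_phase(node1_state, node2_state):
    """Return the phase name for a state pair, or None if it is not in the cycle."""
    for pair, phase in MANUAL_CYCLE:
        if pair == (node1_state, node2_state):
            return phase
    return None


def next_pair(node1_state, node2_state):
    """Return the state pair that follows the given one in the manual cycle.

    Raises ValueError if the pair is not part of the cycle. After the fallback
    phase the cycle wraps around to ("Normal", "Normal").
    """
    pairs = [pair for pair, _ in MANUAL_CYCLE]
    index = pairs.index((node1_state, node2_state))
    return pairs[(index + 1) % len(pairs)]
```

A pair that `describe_phase` does not recognize (for example, node 1 in Failed Over while node 2 runs in Backup Mode) is outside the manual cycle and needs to be looked up in the remaining rows of the table.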

Node 1 State: Idle
Node 1 Reason: Initialization
Node 2 State: Running in Backup Mode
Node 2 Reason: On Admin Request
Cause/Recommended Actions: The administrator restarts the SRM service on node 1 while node 1 is in Idle state.

Node 1 State: Idle
Node 1 Reason: Initialization
Node 2 State: Running in Backup Mode
Node 2 Reason: Initialization
Cause/Recommended Actions: The administrator restarts both nodes in the subcluster, or restarts the SRM service on both nodes in the subcluster, while the subcluster was in manual failover mode (failover initiated by the administrator).

Node 1 State: Idle
Node 1 Reason: On Admin Request
Node 2 State: Running in Backup Mode
Node 2 Reason: Initialization
Cause/Recommended Actions: The administrator restarts the SRM service on node 2 while node 2 is running in backup mode, but before the heartbeat on node 1 times out.

Node 1 State: Failing Over
Node 1 Reason: On Admin Request
Node 2 State: Taking Over
Node 2 Reason: Initialization
Cause/Recommended Actions: The administrator restarts the SRM service on node 2 while node 2 is taking over, but before the heartbeat on node 1 times out.

Node 1 State: Taking Back
Node 1 Reason: Initialization
Node 2 State: Falling Back
Node 2 Reason: On Admin Request
Cause/Recommended Actions: The administrator restarts the SRM service on node 1 while taking back, but before the heartbeat on node 2 times out. After the taking back process is complete, both nodes are in Normal state.

Node 1 State: Taking Back
Node 1 Reason: Automatic Fallback
Node 2 State: Falling Back
Node 2 Reason: Automatic Fallback
Cause/Recommended Actions: Automatic Fallback has been initiated from node 2 to node 1 and is currently in progress.

Node 1 State: Failed Over
Node 1 Reason: Initialization or Critical Services Down
Node 2 State: Running in Backup Mode
Node 2 Reason: Critical Service Down
Cause/Recommended Actions: Node 1 transitions to Failed Over state when:
• Critical service(s) come back up due to reboot of node 1, or
• The administrator starts critical service(s) on node 1 while node 1 is in "Failed Over with Critical Services Not Running" state
When node 1 transitions to Failed Over state the node is ready for the administrator to perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 State: Failed Over with Critical Services not Running
Node 1 Reason: Critical Service Down
Node 2 State: Running in Backup Mode
Node 2 Reason: Critical Service Down
Cause/Recommended Actions: A critical service is down on node 1. IM and Presence performs an automatic failover to node 2.
Recommended Actions:
1 Check which critical services are down on node 1, and try to start these services manually.
2 If the critical services on node 1 do not start, reboot node 1.
3 After the reboot and when all the critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 State: Failed Over with Critical Services not Running
Node 1 Reason: Database Failure
Node 2 State: Running in Backup Mode
Node 2 Reason: Database Failure
Cause/Recommended Actions: A database service is down on node 1. IM and Presence performs an automatic failover to node 2.
Recommended Actions:
1 Reboot node 1.
2 After the reboot and when all the critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 State: Running in Failed Mode
Node 1 Reason: Start of Critical Services Failed
Node 2 State: Running in Failed Mode
Node 2 Reason: Start of Critical Services Failed
Cause/Recommended Actions: Critical services fail to start while a node in the subcluster is taking back from the other node.
Recommended Actions (on the node that is taking back):
1 Check which critical services are down on the node. To start these services manually, select Recovery on the subcluster details screen.
2 If the critical services do not start, reboot the node.
3 After the reboot and when all the critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.
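
The recommended actions for the three critical-service and database rows above follow the same escalation pattern: try to start the services (where the table offers that step), reboot if that fails, then fall back manually once the services are running. The following sketch encodes that pattern as a single function. The function name, its parameters, and the returned strings are hypothetical. Treating a Database Failure as always needing a reboot first follows the table, and the final "not covered" branch is an assumption, because the table does not say what to do if services stay down after a reboot.

```python
# First step the table recommends before rebooting, keyed by the Reason column.
# None means the table goes straight to a reboot.
MANUAL_START_STEP = {
    "Critical Service Down": "start the critical services manually",
    "Start of Critical Services Failed":
        "select Recovery on the subcluster details screen",
    "Database Failure": None,
}


def next_action(reason, services_running, manual_start_tried=False, rebooted=False):
    """Suggest the next recommended action for a failed node.

    reason: value of the Reason column for the failed node.
    services_running: True if all critical services are currently running.
    manual_start_tried: True if the manual start / Recovery step was attempted.
    rebooted: True if the node has already been rebooted.
    """
    if reason not in MANUAL_START_STEP:
        raise ValueError(f"reason not covered by this sketch: {reason!r}")

    start_step = MANUAL_START_STEP[reason]

    # Database Failure: the table lists the reboot as the first action.
    if start_step is None and not rebooted:
        return "reboot the node"
    if services_running:
        return "perform a manual fallback"
    if start_step is not None and not manual_start_tried:
        return start_step
    if not rebooted:
        return "reboot the node"
    # Assumption: the table gives no further step for this situation.
    return "not covered by the table"
```

For instance, `next_action("Critical Service Down", services_running=False)` suggests starting the services manually, and the same call with `manual_start_tried=True` escalates to a reboot.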

Node 1 State: Running in Failed Mode
Node 1 Reason: Critical Service Down
Node 2 State: Running in Failed Mode
Node 2 Reason: Critical Service Down
Cause/Recommended Actions: Critical services go down while a node in the subcluster is running in backup mode for the other node.
Recommended Actions:
1 Check which critical services are down on the backup node. To start these services manually, select Recovery on the subcluster details screen.
2 If the critical services do not start, reboot the subcluster.

Node 1 State and Reason: Node 1 is down due to loss of network connectivity, or the SRM service is not running.
Node 2 State: Running in Backup Mode
Node 2 Reason: Peer Down
Cause/Recommended Actions: Node 2 has lost its heartbeat with node 1. IM and Presence performs an automatic failover to node 2.
Recommended Action:
(If node 1 is up)
1 Check and repair the network connectivity between nodes in the subcluster. When you reestablish the network connection between the nodes, the node may go into a failed state. Select Recovery on the subcluster details screen to restore the nodes in the subcluster to Normal state.
2 Start the SRM service, and perform a manual fallback to restore the nodes in the subcluster to Normal state.
(If node 1 is down)
3 Repair/Power up node 1.
4 When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.
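
The Peer Down row above branches on whether node 1 is still up. A minimal sketch of that branch is shown here, assuming a hypothetical helper `peer_down_steps` that returns the applicable steps as text. The step wording paraphrases the table, and how to determine whether node 1 is up is left to the caller.

```python
def peer_down_steps(node1_is_up):
    """Return the recommended steps for the Peer Down row, in order.

    node1_is_up: True if node 1 is powered on and reachable by other means
    (how this is determined is outside the scope of this sketch).
    """
    if node1_is_up:
        return [
            "Check and repair the network connectivity between the nodes "
            "in the subcluster.",
            "If a node goes into a failed state after the connection is "
            "restored, select Recovery on the subcluster details screen.",
            "Start the SRM service and perform a manual fallback.",
        ]
    return [
        "Repair or power up node 1.",
        "When node 1 is up and all critical services are running, "
        "perform a manual fallback.",
    ]


if __name__ == "__main__":
    # Print the checklist for the case where node 1 is down.
    for number, step in enumerate(peer_down_steps(node1_is_up=False), start=1):
        print(f"{number}. {step}")
```

Both branches end with a manual fallback, which matches the table: the subcluster returns to Normal state only after the administrator falls back to node 1.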

Node 1 State and Reason: Node 1 is down (due to possible power down, hardware failure, shutdown, or reboot).
Node 2 State: Running in Backup Mode
Node 2 Reason: Peer Reboot
Cause/Recommended Actions: IM and Presence performs an automatic failover to node 2 due to a possible hardware failure, power down, restart, or shutdown of node 1.
Recommended Action:
1 Repair/Power up node 1.
2 When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 State: Failed Over with Critical Services not Running OR Failed Over
Node 1 Reason: Initialization
Node 2 State: Backup Mode
Node 2 Reason: Peer Down During Initialization
Cause/Recommended Actions: Node 2 does not see node 1 during startup.
Recommended Action:
When node 1 is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 State: Running in Failed Mode
Node 1 Reason: Cisco Server Recovery Manager Take Over Users Failed
Node 2 State: Running in Failed Mode
Node 2 Reason: Cisco Server Recovery Manager Take Over Users Failed
Cause/Recommended Actions: User move fails during the taking over process.
Recommended Action:
Possible database error. Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.

Node 1 State: Running in Failed Mode
Node 1 Reason: Cisco Server Recovery Manager Take Back Users Failed
Node 2 State: Running in Failed Mode
Node 2 Reason: Cisco Server Recovery Manager Take Back Users Failed
Cause/Recommended Actions: User move fails during the falling back process.
Recommended Action:
Possible database error. Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.

Node 1 State: Running in Failed Mode
Node 1 Reason: Unknown
Node 2 State: Running in Failed Mode
Node 2 Reason: Unknown
Cause/Recommended Actions: The SRM on a node restarts while the SRM on the other node is in a failed state, or an internal system error occurs.
Recommended Action:
Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.

Node 1 State: Backup Activated
Node 1 Reason: Auto Recover Database Failure
Node 2 State: Failover Affected Services
Node 2 Reason: Auto Recovery Database Failure
Cause/Recommended Actions: The database goes down on the backup node. The peer node is in failover mode and can take over for all users in the subcluster. Auto-recovery operation automatically occurs and all users are moved over to the primary node.

Node 1 State: Backup Activated
Node 1 Reason: Auto Recover Database Failure
Node 2 State: Failover Affected Services
Node 2 Reason: Auto Recover Critical Service Down
Cause/Recommended Actions: A critical service goes down on the backup node. The peer node is in failover mode and can take over for all users in the subcluster. Auto-recovery operation automatically occurs and all users are moved over to the peer node.