RAS - Reliability, Availability, Serviceability



Page 1: RAS -  Reliability, Availability, Serviceability

RAS - Reliability, Availability, Serviceability

Product Support Engineering

VMware Confidential

Page 2: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 2

Module 2 Lessons

Lesson 1 – vCenter Server High Availability

Lesson 2 – vCenter Server Distributed Resource Scheduler

Lesson 3 – Fault Tolerance Virtual Machines

Lesson 4 – Enhanced vMotion Compatibility

Lesson 5 – DPM - IPMI

Lesson 6 – vApps

Lesson 7 – Host Profiles

Lesson 8 – Reliability, Availability, Serviceability ( RAS )

Lesson 9 – Web Access

Lesson 10 – vCenter Server Update Manager

Lesson 11 – Guided Consolidation

Lesson 12 – Health Status

Page 3: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 3

Module 2-8 Lessons

Lesson 1 – Overview of RAS

Lesson 2 – RAS objectives

Lesson 3 – Networking vProbs

Lesson 4 – Storage vProbs

Lesson 5 – VMFS vProbs

Lesson 6 – Migration vProb

Page 4: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 4

Introduction

The long-term goal of the ESX RAS project is to make ESX more Reliable, Available and Serviceable.

To do so, the VMkernel needs to detect, report, recover from, diagnose, and repair or react to hardware and software problems that occur in the system.

ESX RAS 1.0 will focus on detecting asynchronous hardware and synchronous software observations and reporting them.

Page 5: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 5

RAS Objectives

The ESX RAS team's objective is to increase the reliability, availability and serviceability of the vmkernel. This includes:

Hardening of vmkernel drivers (hardware errors): CPU, Memory, PCI(-X/Express), SCSI, Networking.

Hardening of vmkernel facilities (software errors): SCSI, Networking, VMotion, DMotion, etc.

Developing a standardized method of reporting observations from software and hardware error handlers.

Developing a method to diagnose a given stream of observations, down to one or more problems which may have caused them.

Developing a method for predicting the failure of a given (sub-)system and feeding the analysis to consumers (DRS, DPM, FT, HA).

Gathering and writing service actions that correspond to the problem or set of problems that may be present.

Developing automated policies for certain problems that can be handled without user action.

Maintaining and improving the logging, coredump, and PSOD infrastructure in the vmkernel.

Page 6: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 6

RAS Terms

RAS: Reliability, Availability, Serviceability.

Reliability: The ability of a system to perform and maintain its functions, in the face of hostile or unexpected circumstances.

Availability: The proportion of time a system is in a functioning condition.
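As a worked illustration of the availability definition, the proportion can be computed from observed uptime and downtime. This is a minimal sketch; the function name and the figures are hypothetical and do not come from the slides.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return the fraction of observed time the system was functioning."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical figures: one year (8760 hours) with 8.76 hours of downtime.
print(f"{availability(8751.24, 8.76):.3%}")
```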

Serviceability: The ability to debug or perform root cause analysis in pursuit of solving a problem with a product.

Hardening: To enhance a (sub-)system so that it can detect, report and handle errors which may be encountered, whether hardware or software related. Handling may involve panicking and/or attempting recovery from a given error or stream of errors.

VProb: A VProb is an automatically generated problem report.

Page 7: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 7

RAS Categories

The framework defines the following use cases for vSphere 4.0:

Each use case links to a KB article that describes where the error happened (i.e. the affected vmnic#, port group, vSwitch, storage path, etc.) and provides troubleshooting tips to fix the issue.

Networking:

vprob.net.connectivity.lost

vprob.net.redundancy.lost

vprob.net.redundancy.degraded

vprob.net.e1000.tso6.notsupported

Storage:

vprob.storage.connectivity.lost

vprob.storage.redundancy.lost

vprob.storage.redundancy.degraded

Page 8: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 8

RAS Categories

VMFS specific:

vprob.vmfs.nfs.server.disconnect

vprob.vmfs.nfs.server.restored

vprob.vmfs.heartbeat.timedout

vprob.vmfs.heartbeat.recovered

vprob.vmfs.heartbeat.unrecoverable

vprob.vmfs.lock.corruptiondisk

vprob.vmfs.resource.corruptiondisk

vprob.vmfs.volume.locked

Migration specific:

vprob.net.migrate.vmknic

The public KB articles will be available at GA time.
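The vProb identifiers listed on these two slides share a dotted naming scheme: the prefix `vprob`, then a subsystem, then the specific condition. The sketch below groups them by subsystem. The helper name is hypothetical, and splitting on the second field is an assumption drawn from the names themselves rather than a documented rule. Note that the migration vProb carries the `net` prefix even though the slide lists it separately.

```python
from collections import defaultdict

VPROB_IDS = [
    "vprob.net.connectivity.lost",
    "vprob.net.redundancy.lost",
    "vprob.net.redundancy.degraded",
    "vprob.net.e1000.tso6.notsupported",
    "vprob.storage.connectivity.lost",
    "vprob.storage.redundancy.lost",
    "vprob.storage.redundancy.degraded",
    "vprob.vmfs.nfs.server.disconnect",
    "vprob.vmfs.nfs.server.restored",
    "vprob.vmfs.heartbeat.timedout",
    "vprob.vmfs.heartbeat.recovered",
    "vprob.vmfs.heartbeat.unrecoverable",
    "vprob.vmfs.lock.corruptiondisk",
    "vprob.vmfs.resource.corruptiondisk",
    "vprob.vmfs.volume.locked",
    "vprob.net.migrate.vmknic",
]


def group_by_subsystem(ids):
    """Map each subsystem (second dotted field) to its list of conditions."""
    groups = defaultdict(list)
    for vprob_id in ids:
        parts = vprob_id.split(".")
        if len(parts) < 3 or parts[0] != "vprob":
            raise ValueError(f"not a vProb identifier: {vprob_id!r}")
        groups[parts[1]].append(".".join(parts[2:]))
    return dict(groups)


for subsystem, conditions in group_by_subsystem(VPROB_IDS).items():
    print(subsystem, conditions)
```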

Page 9: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 9

Networking VProb

vprob.net.connectivity.lost

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6122&communityID=2701

Connectivity to a physical network has been lost. The message lists all affected port groups (e.g. >Lost network connectivity on virtual switch "system". Physical NIC vmnic1 is down. Affected port groups: "cos", "VM Network".<)
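When triaging, it can help to pull the virtual switch, NIC, and port groups out of the message text. The following is a rough sketch that assumes the message always has the exact shape of the sample on this slide; the regular expression and function name are illustrative, not part of any VMware tool.

```python
import re

# Pattern modeled only on the sample message shown on this slide.
MESSAGE_RE = re.compile(
    r'Lost network connectivity on virtual switch "(?P<vswitch>[^"]+)"\. '
    r'Physical NIC (?P<nic>\S+) is down\. '
    r'Affected port groups: (?P<groups>.+?)\.?$'
)


def parse_connectivity_lost(message):
    """Return the vSwitch, NIC and port groups named in the message, or None."""
    match = MESSAGE_RE.search(message)
    if match is None:
        return None
    return {
        "vswitch": match.group("vswitch"),
        "nic": match.group("nic"),
        "port_groups": re.findall(r'"([^"]+)"', match.group("groups")),
    }


sample = (
    'Lost network connectivity on virtual switch "system". '
    'Physical NIC vmnic1 is down. Affected port groups: "cos", "VM Network".'
)
print(parse_connectivity_lost(sample))
```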

Page 10: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 10

Networking VProb

vprob.net.redundancy.lost

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6097&communityID=2701

Only one physical NIC is currently connected; one more failure will result in a loss of connectivity (e.g. >Lost uplink redundancy on virtual switch "system". Physical NIC vmnic0 is down. Affected port groups: "cos", "VM Network".<)

Page 11: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 11

Networking VProb

vprob.net.redundancy.degraded

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6098&communityID=2701

One of the physical NICs in your NIC team has gone down, but you still have n-1 NICs available (e.g. >Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic1 is down. 2 uplinks still up. Affected portgroups: "VM Network".<)
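The three networking vProbs described so far differ only in how many uplinks remain. The sketch below encodes that distinction as read from the slide descriptions; the function name and the exact thresholds are assumptions, not the VMkernel's actual logic.

```python
from typing import Optional


def uplink_vprob(total_uplinks: int, uplinks_up: int) -> Optional[str]:
    """Pick the vProb that matches the number of uplinks still up.

    Assumed mapping, taken from the slide text:
      0 up            -> connectivity lost
      exactly 1 up    -> redundancy lost (one more failure cuts connectivity)
      2 or more up    -> redundancy degraded
      all up          -> no vProb
    """
    if not 0 <= uplinks_up <= total_uplinks:
        raise ValueError("uplinks_up must be between 0 and total_uplinks")
    if uplinks_up == total_uplinks:
        return None
    if uplinks_up == 0:
        return "vprob.net.connectivity.lost"
    if uplinks_up == 1:
        return "vprob.net.redundancy.lost"
    return "vprob.net.redundancy.degraded"


# A three-NIC team with one NIC down, as in the sample message above.
print(uplink_vprob(total_uplinks=3, uplinks_up=2))
```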

Page 12: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 12

Networking VProb

vprob.net.e1000.tso6.notsupported (KB article)

The guest e1000 driver is misbehaving and sending TSO IPv6 packets, which are dropped. The vprob specifies the affected VM, and the KB article discusses ways to fix this.

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-7393

"Guest-initiated IPv6 TCP Segmentation Offload (TSO) packets ignored. Manually disable TSO inside the guest operating system in virtual machine "XYZ", or use a different virtual adapter."

Page 13: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 13

Storage VProb

vprob.storage.connectivity.lost

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701

Connectivity to a specific storage device has been lost (e.g. "Lost connectivity to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down")
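The path name in the sample message, vmhba35:C1:T0:L7, encodes the adapter, channel, target and LUN. A small sketch that splits it into those fields follows; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePath:
    adapter: str
    channel: int
    target: int
    lun: int


_PATH_RE = re.compile(r"^(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


def parse_path(name: str) -> StoragePath:
    """Split a runtime path name such as 'vmhba35:C1:T0:L7' into its fields."""
    match = _PATH_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized path name: {name!r}")
    adapter, channel, target, lun = match.groups()
    return StoragePath(adapter, int(channel), int(target), int(lun))


print(parse_path("vmhba35:C1:T0:L7"))
```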

Page 14: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 14

Storage VProb

vprob.storage.redundancy.lost

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6120&communityID=2701

Only one path is remaining to a device and you no longer have any redundancy (e.g. "Lost path redundancy to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down.")

Page 15: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 15

Storage VProb

vprob.storage.redundancy.degraded

http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701

One of your paths to a device has been lost but you still have n-1 paths remaining (e.g. "Path redundancy to storage device naa.60a9800043346534645a433967325334 degraded. Path vmhba35:C1:T0:L7 is down. 3 remaining active paths.")

Page 16: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 16

VMFS vProb

vprob.vmfs.nfs.server.disconnect

vprob.vmfs.nfs.server.restored

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm

Lost connection to server nfs-server mount point /share, mounted as 1264e433-5854ee53-0000-000000000000 ("nfs-share")

Page 17: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 17

VMFS vProb

vprob.vmfs.heartbeat.timedout

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm

VMFS Volume Connectivity Degraded 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

Page 18: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 18

VMFS vProb

vprob.vmfs.heartbeat.recovered

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm

VMFS Volume Connectivity Restored 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

Page 19: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 19

VMFS vProb

vprob.vmfs.heartbeat.unrecoverable

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm

VMFS Volume Connectivity lost 496befed-1c79c817-6beb-001ec9b60619 san-lun-100
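The three heartbeat vProbs read as state changes for a volume: timed out (connectivity degraded), recovered (restored), and unrecoverable (lost). The sketch below tracks the latest state from a stream of vProb identifiers. The state labels come from the sample messages on these slides; the function itself is a hypothetical illustration, not how the VMkernel tracks volumes.

```python
# State labels taken from the sample messages on the heartbeat slides.
STATE_BY_VPROB = {
    "vprob.vmfs.heartbeat.timedout": "degraded",
    "vprob.vmfs.heartbeat.recovered": "restored",
    "vprob.vmfs.heartbeat.unrecoverable": "lost",
}


def volume_state(events):
    """Return the connectivity state implied by the most recent heartbeat vProb."""
    state = "unknown"
    for vprob_id in events:
        # vProbs that are not heartbeat-related leave the state unchanged.
        state = STATE_BY_VPROB.get(vprob_id, state)
    return state


print(volume_state([
    "vprob.vmfs.heartbeat.timedout",
    "vprob.vmfs.heartbeat.recovered",
]))
```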

Page 20: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 20

VMFS vProb

vprob.vmfs.lock.corruptiondisk

vprob.vmfs.resource.corruptiondisk

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.corruptioncombined.htm

Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset O

Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster metadata corruption detected

Page 21: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 21

VMFS vProb

vprob.vmfs.volume.locked

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm

Volume on device naa.60060160b3c018009bd1e02f725fdd11:1 locked, possibly because remote host 10.17.211.73 encountered an error during a volume operation and couldn’t recover.

Page 22: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 22

Migration Specific

vprob.net.migrate.vmknic

http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.net.migrate.vmkernel.htm

The ESX advanced config option /Migrate/Vmknic is set to an invalid vmknic: vmk0. /Migrate/Vmknic specifies a vmknic that VMotion binds to for improved performance. Please update the config option with a valid vmknic or, if you don't want VMotion to bind to a specific vmknic, remove the invalid vmknic and leave the option blank.
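The rule in this message can be restated as a simple check: the option is acceptable when it is blank or names an existing vmknic. The sketch below expresses that check in isolation. The function and its inputs are hypothetical; it does not read or write any ESX configuration.

```python
from typing import Iterable, Optional


def check_migrate_vmknic(configured: str, existing_vmknics: Iterable[str]) -> Optional[str]:
    """Return None when the setting is acceptable, else a short problem description."""
    name = configured.strip()
    if not name:
        # A blank option means VMotion is not bound to a specific vmknic.
        return None
    if name in set(existing_vmknics):
        return None
    return f"/Migrate/Vmknic is set to an invalid vmknic: {name}"


# Hypothetical host where vmk0 does not exist.
print(check_migrate_vmknic("vmk0", ["vmk1", "vmk2"]))
```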

Page 23: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 23

Lesson 2-8 Summary

Understand what vProbs are

Learn how to troubleshoot vProbs

Page 24: RAS -  Reliability, Availability, Serviceability

VI4 - Mod 2-8 - Slide 24

Lesson 2-8 – Optional Lab 1

OPTIONAL

Lab 1 involves generating vProb scenarios