Troubleshooting Mastery with vRealize VMworld 2017 · Iwan ‘e1’ Rahabok MGT2508BU MGT2508BU...

Page 1:

Iwan ‘e1’ Rahabok

MGT2508BU

Troubleshooting Mastery with vRealize

VMworld 2017 Content: Not for publication or distribution

Page 2:

Disclaimer

• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Page 3:

Your speaker

Iwan ‘e1’ Rahabok · [email protected] · @e1_ang · 9119-9226

Page 4:

Troubleshooting Stories: Availability

Area | What was reported | Client | Final Cause
Network | Mass VDI Disconnect | Investment House | Network switch buffer overflow
App | Horizon View: View CS issue | Investment House | MS ADAM not synchronizing
Storage | Random VMs have high disk latency | Telco 1 | Bug in Array firmware

Page 5:

Troubleshooting Stories: Performance

What was reported | Client | Final Cause
Overall VDI slowness | Telco 2 | JavaScript in new apps
SQL Server hit a glitch | Holding Company | vMotion was long
VMs experience storage latency | UK Bank | Physical Oracle servers on shared array
vSAN performing slow | Cloud Provider | A VM did IOmeter
SAP experiences slow storage | State Oil Firm | Backup Schedule is different for Tier 1
Random VMs slow. No response to ping | Holding Company | Symantec EP Agent update
VMs slow when cluster utilization was 6% | UK Bank | Large Idle VMs
VDI grew slow over 5 months | Investment | VDI CPU demand undersized
Tier 3 clusters performing better than Tier 2 | Global Bank | Overcommit Ratio is misleading

Page 6:

Troubleshooting: Only 1st time is forgiven

[Diagram labels: Issue!, Troubleshoot + RCA, Alert Set, Alert Triggered]

• Same problem happens again: know within minutes.
• Similar problem happens: have a good understanding within 1 hour.

Page 7:

Availability Troubleshooting Process

[Diagram comparing Manual Process, Optimized Process, and Ideal Process. Step labels: Issue!, Upload Logs, Waiting for analysis, Analysis, Check the rest, Joint analysis, RCA Report, vRealize, Alert Setup, Call Home, Preventive Action. Annotations: Manual, Time loss….]

Page 8:

Performance Troubleshooting Process

[Diagram comparing Manual Process, Optimized Process, and Ideal Process. Step labels: Issue!, SLA breached, All Hands on Deck, Blame Storming, Ping Pong, No RCA, Shared Tools, Your Dashboard, Big Picture, Joint Analysis, RCA Report, vRealize, Alert Setup, Early Warning, Preventive Action. Annotation: Time loss….]

Page 9:

“A problem thoroughly understood is always fairly simple. Found your opinions on facts, not prejudices. We know too many things that are not true.”

Charles Kettering, inventor and philosopher, General Motors

Page 10:

Master the Architecture

[Diagram, example. Labels: Clusters, Tier 1 Clusters, Tier 2 Clusters, Cluster 1 to Cluster N, items numbered 1 to 10, Datastores, vSAN, Nutanix, Physical Array, Volume 1 to Vol 3, N/a.]

Page 11:

Master the Dependency

• Distributed services that are in-line can cause a performance problem somewhere else.
• If they are affected, the business VM may be affected.
  – One problem translates into another. This makes troubleshooting hard.

[Diagram: an ESXi Host (CPU, RAM, Disk, Net) running a Storage VM, a Network VM, a Security VM, and Business VMs.]

Page 12:

Master the point of monitoring

VM Disk

• For vmdk, use Datastore metric groups.
• For RDM, use Disk metric groups.
• Disk metric group is not relevant for NFS.

[Diagram: a VM with Disk 1, Disk 2, Disk 3 (scsi0:0, scsi0:1, scsi0:2), each a vDisk. Backing labels: RDM (Disk), VMFS Datastore, NFS Datastore.]

VM Memory

Layer | Utilization | Contention
Application | Depends on the app | Depends on the app
Guest OS | OS Free RAM, OS Commit Ratio | OS Page-in Rate
VM | VM Consumed | VM Contention

Page 13:

Hero to Zero. Over 5 months…

Page 14:

What was reported

IT received complaints from end users. Random users are affected, but most users are happy. Performance was good initially. It grew worse over the months.

Page 15:

Information Gathering

[Chart: Over 5 months. Axis ticks 200 to 550. Labels: Deployed, Full Capacity, Performance.]

No change in Horizon, vSphere. Same Win 7 golden image. new application,

Happens on both wired & wireless, on thin client, thick client, and iPad.

Page 16:

How Infra impacts VM performance

[Diagram labels: ESXi + physical infrastructure, Infrastructure Utilisation, VM, Guest OS, VM Utilisation, VM Contention.]

Each VM is 2 vCPU, 8 GB RAM. At peak period, average user is expected to use only 7% CPU.

Page 17:

Analysis

[Chart labels: ESXi utilization, Total Demand, Demand/User]

Plan: 674 MHz. Actual: ~1.8 GHz.

Plotting over time. CPU higher than expected during peak period.

Again, higher than plan.

Need to investigate this.

Page 18:

Windows idle. No hands on keyboard. MS Outlook running (online). CPU “Peak” Usage is around 40%.

Page 19:

Watching a YouTube video (company video). Both CPUs rose when we changed to full screen. Around 95%.

We need to cater for the concurrency. What % of staff are watching at the same time?

Working with a web-based app, then pause, then working on another web-based app. CPU usage is way above 7%.

Page 20:

Utilization Contention

• By itself, 40% of a 2 vCPU VM does not matter.
• But stack them up, add a monster VM, run it on a small ESXi, and you will have a problem.

[Diagram: Supply is an ESXi host with 8 cores + 8 cores. Demand is an 8 vCPU Storage VM plus eight 2 vCPU VMs.]
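The stack-up effect on the Utilization Contention slide can be checked with simple arithmetic. This is a minimal sketch, assuming a host with 16 physical cores, one 8 vCPU storage VM, and eight 2 vCPU desktops. The 40% desktop figure comes from the slide; the 50% utilisation of the storage VM is a made-up value for illustration only.

```python
# Hypothetical host: two sockets of 8 cores each.
PHYSICAL_CORES = 8 + 8

# (vCPU count, average utilisation as a fraction).
# The 0.5 for the storage VM is an assumed value, not from the slides.
vms = [(8, 0.5)] + [(2, 0.4)] * 8

configured_vcpus = sum(vcpu for vcpu, _ in vms)
busy_cores = sum(vcpu * util for vcpu, util in vms)

overcommit_ratio = configured_vcpus / PHYSICAL_CORES
host_utilisation = busy_cores / PHYSICAL_CORES

print(f"vCPU:pCPU overcommit = {overcommit_ratio:.2f}")
print(f"Estimated host CPU utilisation = {host_utilisation:.0%}")
```

Each desktop looks harmless on its own, yet together with the large VM the host is already more than half busy before any user does real work.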

Page 21:

Super Metrics

8 line charts:
• Max and Average.
• 4 Infra elements.

Each over 5 months, as we knew what was good.

CPU was the constraint:
• Average is good.
• Max is bad.
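Why "Average is good, Max is bad" matters can be shown with a few numbers. The sketch below is an illustration only: the VM names and contention values are hypothetical, and it uses plain Python rather than the vRealize super metric syntax.

```python
from statistics import mean

# Hypothetical CPU contention (%) per VM in one cluster at one timestamp.
contention_by_vm = {
    "vdi-001": 0.3,
    "vdi-002": 0.4,
    "vdi-003": 0.2,
    "vdi-004": 14.5,  # one user having a bad time
}

values = list(contention_by_vm.values())
avg_contention = mean(values)
max_contention = max(values)
worst_vm = max(contention_by_vm, key=contention_by_vm.get)

print(f"Average: {avg_contention:.2f}%  Max: {max_contention:.2f}% ({worst_vm})")
```

The average hides the one VM that is suffering, which is why both aggregates are plotted side by side.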

Page 22:

[Charts: Cluster CPU Utilization, Max VM CPU Contention, Avg VM CPU Contention]

We need to validate at individual level.

Page 23:

We went back 3 months for User A, as the performance was good 3 months ago.

CPU utilisation pattern was the same. So user demand didn’t go up.

This one proves that the VM experienced CPU contention, which built over 2+ months. It was then solved by moving it to a performing cluster.

[Chart annotation: The time the VM was migrated]

Page 24:

Zooming in: Before and After

Zooming into the period before and after VM migration. CPU Contention dropped right after migration. Notice it remains stable after that.

Zooming in further, so the comparison is clearer.

This proves the HCI vendor box was having difficulty serving its VMs well. We are at a low consolidation ratio, so something else is using up the physical cores.

Page 25:

VM CPU Latency in mid July. We went back 3 months, as the performance was good 3 months ago. This was 7 days of data, >2000 chances to hit high.

VM CPU Latency at end of Sep, after more VMs were added. It’s ~10x worse. This explains the performance degradation felt by the user.
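The ">2000 chances" figure is consistent with one sample every 5 minutes for 7 days. The interval is an assumption here, as the slide does not state it.

```python
# Assumption: one sample every 5 minutes (not stated on the slide).
COLLECTION_INTERVAL_MINUTES = 5
DAYS_OF_DATA = 7

samples = DAYS_OF_DATA * 24 * 60 // COLLECTION_INTERVAL_MINUTES
print(samples)  # 2016
```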

Page 26:

After migration: 20x better

After being migrated out of the HCI vendor box into ESXi with no overcommit nor large VM, it was well below 0.5%. The peak was 0.38%.

This one proves that the VM experienced CPU contention, which built over 2+ months. It was then solved by moving it to a cluster without contention.

Page 27:

Root Cause Analysis

What happened
• Sizing was too small.
• ESXi unable to cope.
• Monster VM + AV VM took half the box.

Lesson Learned
• Know the workload.
• Website matters in VDI.
• Know how VMkernel scheduler works.
• Define KPI for VDI.

Alert Setup
• Max VM CPU Contention in a cluster
• Max VM RAM Contention in a cluster
• Max VM Disk Latency in a cluster


Page 29:

Mass VDI Disconnect

Page 30:

What was reported

A fellow IT team member, seated in the same row as the VDI team, had his session disconnected at 7:42 am while he was working.

Head of IT Infra had issues trying to get his zero client to wake up.

Page 31:

This is Horizon Event DB. It’s hard to see as it’s not plotted across time. Need to manually scroll one by one. Easy to make a mistake.

Worse, it does not distinguish normal and abnormal disconnects.

Page 32:

This is Log Insight. We know who was affected. We also know the ESXi Host they were on. We can also group by View CS.

Page 33:

Mass disconnect happened

We know this is abnormal as it’s a high spike, and no one was working at that time.

Page 34:

Is there a pattern on the client side?

Only thin clients are affected. Thick clients (Windows PC) are not affected.

Checking the client logs, there is no error other than a network error. The client itself is working fine.

Page 35:

Analysis on affected VMs

Find something in common among the affected VMs.

Problems only happened on HCI clusters. The hosts were well spread.
• The hosts themselves had no errors.
• We knew these were the hosts, as there was no vMotion.

Conclusion: it does not look like a problem with the host, but with the HCI.

ESXi No | Cluster | Affected VM
5 | HCI cluster 1 | 1
7 | HCI cluster 1 | 4
8 | HCI cluster 1 | 2
9 | HCI cluster 1 | 2
10 | HCI cluster 1 | 2
11 | HCI cluster 1 | 4
12 | HCI cluster 1 | 2
13 | HCI cluster 1 | 3
14 | HCI cluster 1 | 1
15 | HCI cluster 1 | 2
16 | HCI cluster 1 | 3
17 | HCI cluster 1 | 1
18 | HCI cluster 1 | 2
21 | HCI cluster 1 | 2
22 | HCI cluster 1 | 4
23 | HCI cluster 2 | 1
26 | HCI cluster 2 | 1
27 | HCI cluster 1 | 2
28 | HCI cluster 1 | 4

Page 36:

What did the VMs experience?

It’s not a Windows issue, since the VMs resumed normally.

It’s not an application issue, since it happened on 42 OS at the same time.

Something must have frozen the VMs. Temporarily.

A VM consumes 4 resources:
• CPU
• Disk
• RAM
• Network

We will check each of the 4.

Page 37:

Average Disk Latency among Affected VMs.

The 42 users experienced a spike in Disk Latency at the time of the incident. The spike timing and length matched the situation.

We plotted the same for CPU, Network and RAM. No correlation at all.

Page 38:

Max latency rose to 543 ms. That’s very high as it’s sustained for 5 minutes. From here we can tell there is a storage issue.

Page 39:

Drilling into the 42 VMs. There is a noticeable spike across many VMs.

The green number shows the last value, not the peak value.

Page 40:

However, there is no high IOPS coming from the 42 VMs. This is expected at 7:42 am on a weekday.

So it is something from outside the VM. It indicates storage I/O was not being processed.

Page 41:

Conclusion: VMs were hit by disk latency

• At the time of incident (7:42 am), there was a distinct and unhealthy spike.
• Both before and after are very good. So it’s a short-lived issue.
• It’s also a one-time event. It did not occur within the 48 hours before or after. The spike is distinct.
• There was no spike in IOPS. Latency happens when Demand isn’t met by Supply. So Supply has dropped.
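The reasoning in the last bullet (latency up, IOPS flat, so supply dropped) can be written as a small rule. This is a sketch under stated assumptions: the function name, the baseline values, and the 3x factor are all hypothetical; only the 543 ms peak comes from the slides.

```python
def classify_latency_spike(latency_ms, iops, baseline_latency_ms, baseline_iops, factor=3.0):
    """Return a rough label for one sample compared with its baseline.

    A value counts as a spike when it exceeds `factor` times the baseline.
    """
    latency_spiked = latency_ms > factor * baseline_latency_ms
    iops_spiked = iops > factor * baseline_iops
    if not latency_spiked:
        return "healthy"
    if iops_spiked:
        return "demand rose"
    return "supply dropped"


# 543 ms is the peak from the slides; the other numbers are made up.
print(classify_latency_spike(543, 120, baseline_latency_ms=5, baseline_iops=100))
```

A latency spike with flat IOPS points away from the VMs themselves and towards whatever sits underneath them.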

Page 42:

Alert so IT knows right away.

Page 43:

Root Cause Analysis

What happened
• View Client initiated disconnect
• View Server not responding
• Windows IO freeze
• HCI Storage can’t sync
• Network buffer overflow

Lesson Learned
• VDI monitoring has to include storage and network
• Not all alerts are in the GUI. Some are buried in logs.

Alert Setup
• Network buffer overflow
• HCI unable to sync
• View Disconnect > 10 at the same time
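The "View Disconnect > 10 at the same time" alert amounts to counting disconnect events inside a short sliding window. The sketch below is one possible way to express it, not the Log Insight alert definition: the function name, the 1-minute window, and the sample date are assumptions.

```python
from datetime import datetime, timedelta


def find_mass_disconnects(timestamps, threshold=10, window=timedelta(minutes=1)):
    """Return the start time of each window holding more than `threshold` disconnects."""
    events = sorted(timestamps)
    bursts = []
    start = 0
    for end, ts in enumerate(events):
        # Move the window start forward until it is within `window` of ts.
        while ts - events[start] > window:
            start += 1
        count = end - start + 1
        if count > threshold and (not bursts or bursts[-1] != events[start]):
            bursts.append(events[start])
    return bursts


# Hypothetical data: 42 disconnects within seconds at 7:42 am, plus a few normal ones.
incident = datetime(2017, 1, 2, 7, 42)
events = [incident + timedelta(seconds=i) for i in range(42)]
events += [incident + timedelta(hours=h) for h in (1, 2, 3)]
print(find_mass_disconnects(events))
```

Isolated disconnects through the day stay below the threshold, so only the burst is reported.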


Page 46:

Start CMP Engagement with Assessments

Get your FREE reports

vSphere Optimization Assessment (VOA)
▪ Optimize SDDC and Hybrid Cloud
▪ Time to value in days
vmware.com/assessment/voa

Hybrid Cloud Assessment (HCA)
▪ Compare private and public cloud costs
▪ Time to value in < 1 hour
vmware.com/hybrid-cloud-assessment.html

Page 47:

Next Steps and Key Resources

Upgrade to vRealize Operations 6.6
Upgrade to vROps 6.6. Visit our new upgrade center: vmware.com/go/vrops/upgrade

Share Your Feedback
Receive a $10 Amazon gift card by completing a short survey about vRealize Operations: surveymonkey.com/r/vrops66
Offer valid during VMworld 2017.

Get Certified
Complete your online certification exam.
VMware Digital Badges: vmware.com/go/vROPS2017badge

Page 48:

Overall VDI slowness

• What was reported
  – IT Manager received complaints from end users that VDI is slow.
  – Customer is a telco in Singapore.
• Gathering additional info
  – No change in infrastructure.
  – No change in number of users or VDI VM master image.
  – A new application was rolled out that provides a front-end for 5 existing applications. Users are able to work with all 5 apps instead of 1 at a time. Productivity improves.
  – The app was developed on a laptop. Never tested in a VDI environment.

Page 49:

Analysis

• Sum up total Demand from all users
  – Take over 1 month, not 1 day.
  – Sum the total demand at peak period, then divide by # Users.
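The per-user calculation above can be sketched as follows. It assumes one VM per user and time-aligned demand samples in MHz; the function name and the sample values are hypothetical.

```python
def peak_demand_per_user(demand_mhz_by_vm):
    """Sum demand across all VMs per timestamp, take the peak, divide by user count.

    Assumes one VM per user and that every series has the same timestamps.
    """
    series = list(demand_mhz_by_vm.values())
    if not series:
        raise ValueError("no VMs supplied")
    totals = [sum(point) for point in zip(*series)]
    return max(totals) / len(series)


# Hypothetical samples; in practice the series would cover a month, not 3 points.
demand = {
    "user-a": [100, 900, 200],
    "user-b": [150, 700, 100],
}
print(peak_demand_per_user(demand))  # 800.0
```

Taking the peak of the summed series, rather than summing each user's own peak, keeps the figure tied to what the hosts had to supply at one moment.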

Page 50:

Random VMs slow

• What was reported
  – VMs do not respond to ping. There seems to be no network problem.
  – No time pattern to the problems. They can occur daily, or less often.
  – Different VMs are affected. No pattern among the affected VMs.
  – Problem disappears on its own.
  – Customer suspects a rogue VM is causing the problem.
• Gathering additional info
  – 700 VMs spread across 2 datacenters in Singapore, linked by a low-latency network, 2x 1Gb link.
  – EMC (VPLEX, linked by Cisco MDS) supporting vSphere HA (stretched cluster).
  – Half of the VMs are on VPLEX-replicated datastores.
  – vR Ops and Log Insight were not installed.
  – Network team noticed high spikes in link utilization that matched the time of the issue.
    • Since the link is dedicated to VPLEX, VPLEX replication must have gone up.

Page 51:

Analysis

• VPLEX is replicating changes
  – Write operations, not Read.
  – We plotted
• Install vR Ops
  – Created a group of all datastores backed by VPLEX.
  – Created a super metric that sums their Write Throughput.
  – Expected to see the pattern match. It did!
• Log in to some VMs and check Windows Events. They show a Symantec EP Agent signature update.
• Signature update: it appends only 4 KB, but it copies 512 MB, appends 4 KB, then deletes 512 MB.
• Discussed with Security Team. They agreed to randomize even more.
• Problem stopped happening.
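The super metric described above is a sum of write throughput across the VPLEX-backed datastores at each timestamp. The sketch below shows the same aggregation in plain Python, not in super metric syntax; the datastore names and values are hypothetical. It also works out the write amplification implied by the signature update figures.

```python
def sum_write_throughput(series_by_datastore):
    """Sum per-datastore write throughput (KBps) at each timestamp."""
    totals = {}
    for series in series_by_datastore.values():
        for timestamp, kbps in series.items():
            totals[timestamp] = totals.get(timestamp, 0) + kbps
    return totals


# Hypothetical datastores and samples.
vplex_datastores = {
    "vplex-ds-01": {"02:00": 100, "02:05": 5000},
    "vplex-ds-02": {"02:00": 50, "02:05": 7000},
}
totals = sum_write_throughput(vplex_datastores)
peak_time = max(totals, key=totals.get)
print(peak_time, totals[peak_time])

# Copying 512 MB to append 4 KB: data written per useful KB appended.
amplification = (512 * 1024) // 4
print(amplification)  # 131072
```

The peak of the summed series is what gets compared against the link utilization spikes seen by the network team.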

Page 52:

VMs slow when cluster utilization was 6%

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 53:

Tier 3 clusters performing better than Tier 2

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 54:

SQL Server performance

• What was reported
• Gathering additional info
• Analysis
• Guarding for future


Page 56:

VMs experience storage latency

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 57:

Which VMs hit high disk latency and how often?

Page 58:

Zooming into 1 VM. The pattern is obvious!

Page 59:

vSAN performing slow

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 60:

SAP slow performance

• What was reported
  – SAP team detects slowness. They are seeing high storage latency.
  – They check with the Storage team. Storage team saw high latency & IOPS across many LUNs in many arrays. Since the arrays are dedicated to VMware, they ask the Platform team what is causing the widespread demand.
  – Platform team is caught in between customer and supplier.
• Gathering additional info
  – Customer has 6 datacenters, 22 vCenters and 3000 VMs.
  – The problem happened during early morning, but the complaint only reached the VMware team at 12 pm.
• Analysis
• Guarding for future

Page 61:

Random VMs have high disk latency

Telco in India

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 62:

Windows VMs cannot boot

• What was reported
  – 3 VMs unable to boot. All are Windows, and all have physical RDM.
  – Customer is a large Asian property conglomerate based in Singapore.
  – Customer wanted to know what caused the corruption.
• Gathering additional info
  – MBR is only 512 bytes at the beginning of a drive. BIOS uses it to locate the start of the OS partition. There is a boot table.
• Analysis
  – MBR region corrupted. 3 bytes.
  – Microsoft restored the MBR region.
  – Not able to figure out why it was corrupted.
• Guarding for future
  – Microsoft provided a utility to back up the MBR region.

Page 63:

SQL Server slow but IOPS is low

• What was reported
• Gathering additional info
• Analysis
• Guarding for future

Page 64:

VM CPU counters

• Used vs Usage vs Demand
• Latency
• Ready
  – Include co-stop since 6.0?

Page 65:

Single VM Troubleshooting

The VM

Capacity issue:
• CPU Demand > 90%
• CPU Run Queue > 3 per vCPU
• CPU Swap Wait high, CPU IO Wait high
• RAM Free < 250 MB
• RAM Committed > 70%
• Page-In Rate is high
• Disk Queue Length > ___
• Disk IOPS or Throughput or OIO is high
• Low disk space
• Network Usage is high

Non-Capacity issue:
• Wrong driver (storage driver, network driver) or its settings
• Too many snapshots or large snapshots
• Tools not running
• VM vCPU Usage unbalanced
• App configured wrongly, not-indexed
• Memory Leak
• Network Latency is high or TCP retransmit
• VM too big, process ping-pong, high context switch
• NUMA effect
• Guest OS power setting

The Infra

Infra unable to cope:
• ESXi CPU insufficient: Demand > 90%, VM CPU Co-Stop > 1%, CPU Ready > 5%, number of cores too small for VM
• ESXi RAM insufficient: VM Balloon active, VM RAM Swap-in is high, NUMA migration
• ESXi Disk IOPS or Throughput is high
• ESXi vmkernel queue or latency is high
• Datastore latency is high
• ESXi vmnic usage is high

Other issue:
• VM was vMotioned
• ESXi vmnic dropped packets or generated errors
• ESXi wrong configuration: power management, multi-pathing, driver version, queue depth setting
• Hardware fault: disk soft error, bad sector, RAM error,
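The numeric thresholds in the checklist lend themselves to a simple automated first pass. The sketch below is illustrative only: the thresholds come from the slide, while the metric key names and the sample readings are hypothetical and do not correspond to actual vRealize Operations metric keys.

```python
# Thresholds taken from the slide; the key names are hypothetical.
VM_CAPACITY_CHECKS = {
    "cpu_demand_pct": lambda v: v > 90,
    "cpu_run_queue_per_vcpu": lambda v: v > 3,
    "ram_free_mb": lambda v: v < 250,
    "ram_committed_pct": lambda v: v > 70,
}
INFRA_CHECKS = {
    "esxi_cpu_demand_pct": lambda v: v > 90,
    "vm_cpu_costop_pct": lambda v: v > 1,
    "vm_cpu_ready_pct": lambda v: v > 5,
}


def failed_checks(metrics, checks):
    """Return the names of metrics that are present and breach their threshold."""
    return [
        name
        for name, breached in checks.items()
        if name in metrics and breached(metrics[name])
    ]


# Hypothetical readings for one VM.
metrics = {"cpu_demand_pct": 95, "ram_free_mb": 1024, "vm_cpu_ready_pct": 7}
print("VM:", failed_checks(metrics, VM_CAPACITY_CHECKS))
print("Infra:", failed_checks(metrics, INFRA_CHECKS))
```

Keeping the VM checks and the infra checks in separate tables mirrors the slide's split between "The VM" and "The Infra", so a breach immediately says which side to look at.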