Tracking Down Storage Performance Issues: A Customer’s Perspective
Keith Aasen, NetApp
Scott Elliott, Christie Digital
INF-STO1430
#vmworldinf
Agenda 1. Introduction and background
2. Storage problems and their effect on virtual infrastructure
3. Root cause analyses and resolution
4. Results and next steps
Who is Christie? About Christie
• A global visual technologies company
• Visual Solutions include:
• Media Walls
• Digital Cinema Projectors
• 3D Virtual Reality
• Simulation Projection Systems
Christie’s virtualization journey
Kitchener vCenter: 14 hosts
• IBM x3650: dual-processor, 4-core, 50 GB RAM
• IBM x3850: quad-processor, 8-core, 256 GB RAM
• NetApp v3140: 60 TB
• 250+ VMs (and growing)
The problem arises
[Timeline chart: disk latency across the deployment, with 20 ms and 40 ms thresholds marking the good, bad, and ugly ranges]
• Implementation: sustained latency spikes
• Increase in business demand: sustained latency spikes
• Deployed SCOM Plug-in
• Continued growth; high I/O introduced
• At first: sustained 30 ms, spikes of 100 ms; no application impact
• Eventually: sustained 40 ms or more, spikes of 6 seconds; significant application impact
List of issues
1. Most datastores had consistent disk latency of 40 ms or higher, with spikes lasting multiple seconds
2. ESXi hosts lost connectivity at seemingly random times; most occurred between midnight and 5:00 a.m.
3. Applications complained of disk time-outs; where applicable, they would automatically fail over to the DR site
The hunt begins
• Where to start?
• Oceans of data across multiple systems
• Need to correlate information and filter out distractions
• Specialized knowledge required to interpret the data
Timing is everything
• Coincidentally, a proof of concept (PoC) of NetApp OnCommand Balance was under way
• Additional diagnostic analysis and correlated data
• Supplemented SCOM and PerfStats
• Findings: a large number of misaligned VMs
• Most severe latencies happened between midnight and 5:00 a.m.
Intelligence instead of data
• OnCommand Balance: performance, capacity, and analytics
Misaligned VMs on a LUN
[Diagram: a VMDK's NTFS blocks stacked over VMFS blocks and WAFL blocks, with the MBR/starting offset shifting the guest blocks]
• The VMDK is aligned to the VMFS file system.
• The VMFS file system is aligned to the WAFL file system, so the VMFS blocks align to the WAFL blocks.
• The MBR starting offset causes the NTFS blocks to be misaligned with the WAFL blocks.
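The offset arithmetic behind the diagram can be sketched in a few lines. This is an illustrative check only (it assumes 4 KB WAFL blocks and 512-byte sectors; it is not a NetApp tool):

```python
# Illustrative alignment check: a guest partition is aligned when its
# starting byte offset is a multiple of the storage block size.
WAFL_BLOCK = 4096  # WAFL writes in 4 KB blocks

def is_aligned(start_offset_bytes: int, block_size: int = WAFL_BLOCK) -> bool:
    """True when the partition start falls on a storage block boundary."""
    return start_offset_bytes % block_size == 0

# Classic MBR layout: first partition starts at sector 63 (512-byte sectors)
legacy_offset = 63 * 512     # 32256 bytes: NOT a multiple of 4096 -> misaligned
modern_offset = 2048 * 512   # 1 MiB (Windows Server 2008+ default) -> aligned

print(is_aligned(legacy_offset))  # False
print(is_aligned(modern_offset))  # True
```

The legacy 63-sector offset is exactly the "MBR or starting offset" shown in the diagram that pushes every NTFS block out of step with the WAFL blocks beneath it.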
Properly aligned VMs on a LUN
[Diagram: with an aligned starting offset, the NTFS blocks, VMFS blocks, and WAFL blocks all line up one-to-one]
Properly aligned VM IO
[Diagram sequence: a guest write flowing through the NTFS, VMFS, and WAFL block layers]
• In a properly aligned configuration, each guest OS block (NTFS/EXT3) maps to exactly one block on the storage array.
• When a write occurs from the guest OS, the write is cached in NVRAM and then acknowledged back to the guest.
• The cached block is written to disk later, and the NVRAM copy is invalidated.
• Because of NetApp WAFL and NVRAM technology, NetApp controllers can write to disk very quickly, so NVRAM rarely fills up.
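The aligned write path described above can be modeled as a toy simulation. This is a deliberate simplification for illustration; real NVRAM and WAFL behavior is far more involved:

```python
# Toy model of the aligned write path: cache in NVRAM, acknowledge the
# guest immediately, flush to disk later at a consistency point.
class Controller:
    def __init__(self):
        self.nvram = []   # pending (block_no, data) writes
        self.disk = {}    # block number -> data

    def guest_write(self, block_no: int, data: bytes) -> str:
        self.nvram.append((block_no, data))  # cached in NVRAM
        return "ACK"                         # acknowledged back to the guest

    def consistency_point(self) -> None:
        for block_no, data in self.nvram:    # written to disk later
            self.disk[block_no] = data
        self.nvram.clear()                   # NVRAM entries invalidated

c = Controller()
print(c.guest_write(7, b"x" * 4096))  # ACK returned before any disk I/O
c.consistency_point()                 # block 7 now on disk, NVRAM empty
```

The key point from the slides is visible in the model: the guest sees only the NVRAM latency, never the disk latency, as long as consistency points keep up.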
Misaligned VM IO
[Diagram sequence: a guest write straddling two WAFL blocks]
• In a misaligned configuration, each Windows block is stored on multiple blocks on the storage array.
• A guest write therefore becomes two partial writes (half of each of two storage blocks). The storage controller caches the writes in NVRAM and acknowledges the guest as before.
• Because each partial write touches only part of two blocks, the controller still needs the other half of each block and must preserve it.
• To do this, it must first read the old blocks in. Normally this is done during a consistency point (CP), or when an entire write stripe is ready, since the rest of the block may still come in.
• Only then can the new blocks be built and written back out, and the NVRAM copies invalidated.
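The read-modify-write penalty follows directly from counting which storage blocks a guest write lands on. A small sketch (illustrative model, assuming 4 KB blocks and 512-byte sectors):

```python
# How many 4 KB storage blocks does a guest write touch?
# Aligned: one block per 4 KB write. Misaligned: two (partial writes).
BLOCK = 4096

def storage_blocks_touched(guest_offset: int, length: int,
                           partition_start: int) -> int:
    """Count the storage blocks covered by a guest write, given the
    partition's starting byte offset on the LUN."""
    lun_start = partition_start + guest_offset
    lun_end = lun_start + length - 1
    return lun_end // BLOCK - lun_start // BLOCK + 1

# One 4 KB guest write:
print(storage_blocks_touched(0, BLOCK, partition_start=2048 * 512))  # 1
print(storage_blocks_touched(0, BLOCK, partition_start=63 * 512))    # 2
```

With the legacy 63-sector offset, every 4 KB guest write spills across two WAFL blocks, forcing the controller to read both old blocks before it can rebuild and write them out, which is exactly the extra work the slides describe.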
Net effect
• This process causes consistency points to take longer
• Increases CPU load on the controller
• No effect on VM performance, if the controller can “keep up”
• If load increases, a dramatic spike in latency can occur
• Ultimately determines how many VMs can be hosted on a storage system
How to correct misalignment
• Adjust the MBR or boot sector with MBRalign or VMware Converter
  • Permanent solution
  • Requires downtime for the VM
• Create an “optimized datastore”
  • No downtime required for the VM
  • Few vendors offer this
  • Must be sure not to mix misaligned VMs and aligned VMs
Misaligned VMs on an optimized LUN
[Diagram: the same misaligned VMDK, but the LUN itself is offset so the guest blocks land on storage block boundaries]
• The VMDK is aligned to the VMFS file system.
• The VMFS file system is “improperly” aligned to the storage file system so that the NTFS blocks align to the storage blocks.
• This compensating offset causes the NTFS blocks to be aligned with the storage blocks.
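The idea behind the optimized datastore is a compensating shift at the LUN level. A sketch of that arithmetic (illustrative and vendor-specific; 4 KB blocks and 512-byte sectors assumed):

```python
# "Optimized LUN" idea: present the LUN with a compensating offset so
# a misaligned guest partition start falls on a storage block boundary.
BLOCK = 4096

def compensating_offset(partition_start: int, block: int = BLOCK) -> int:
    """LUN-level shift that moves the guest partition start onto a
    block boundary (0 when the partition is already aligned)."""
    return (-partition_start) % block

shift = compensating_offset(63 * 512)       # 512 bytes for the classic layout
print((63 * 512 + shift) % BLOCK == 0)      # True: NTFS blocks now align
print(compensating_offset(2048 * 512))      # 0: an aligned VM needs no shift
```

This also shows why aligned and misaligned VMs must not share such a datastore: a nonzero shift that fixes a misaligned VM would push an already-aligned VM off the block boundaries.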
Getting closer
• A latency spike remained late at night, with no corresponding IO.
• The time coincided with an aggregate-level snapshot.
• Aggregate snapshots are on by default on every system; usually there is no noticeable activity.
• If significant space is released, the snapshot triggers a disk cleanup process.
• The cleanup process was colliding with the SQL DB copy, causing the latency spike. (Its priority has since been adjusted.)
• Still had lingering, and seemingly random, spikes
• Used Veeam’s Management Pack for VMware
  • Agentless vSphere monitoring and management
  • System Center Operations Manager plug-in
• Used the report “Virtual Machines: Disk Performance History”
• Finding: the cumulative effect of client software
What did we learn?
• An underused storage subsystem can mask environment misconfigurations.
• Storage performance issues are rarely due to a single cause.
• In this case, there were three causes:
  1. VM alignment
  2. Storage resource contention from a background process
  3. Suboptimal antimalware configuration
Other lessons learned 1. Invest in monitoring tools to detect problems.
2. Fix misconfigurations before they become a problem.
3. Engage your vendor to assist with the troubleshooting process.