Tracking Down Storage Performance Issues: A Customer’s Perspective
Keith Aasen, NetApp
Scott Elliott, Christie Digital
INF-STO1430
#vmworldinf
Agenda 1. Introduction and background
2. Storage problems and their effect on virtual infrastructure
3. Root cause analyses and resolution
4. Results and next steps
Who is Christie? About Christie
• A global visual technologies company
• Visual Solutions include:
• Media Walls
• Digital Cinema Projectors
• 3D Virtual Reality
• Simulation Projection Systems
Christie’s virtualization journey
Kitchener vCenter: 14 hosts
• IBM x3650: dual-processor, 4-core, 50 GB RAM
• IBM x3850: quad-processor, 8-core, 256 GB RAM
• NetApp v3140: 60 TB
• 250+ VMs (and growing)
The problem arises
[Timeline chart: disk latency across the deployment, with 20 ms and 40 ms thresholds marking the good, bad, and ugly ranges]
• Implementation: sustained latency spikes
• Increase in business demand: sustained latency spikes
• Deployed SCOM Plug-in
• Continued growth; high I/O introduced
• At first: sustained 30 ms, spikes of 100 ms; no application impact
• Eventually: sustained 40 ms or more, spikes of 6 seconds; significant application impact
List of issues
1. Most datastores had consistent disk latency of 40 ms or higher, with spikes lasting multiple seconds
2. ESXi hosts lost connectivity at seemingly random times; most occurred between midnight and 5:00 a.m.
3. Applications complained of disk time-outs; where applicable, they would automatically fail over to the DR site
The hunt begins
• Where to start?
• Oceans of data across multiple systems
• Need to correlate information and filter out distractions
• Specialized knowledge required to interpret the data
Timing is everything
• Coincidentally, a proof of concept (PoC) of NetApp OnCommand Balance was under way
• Additional diagnostic analysis and correlated data
• Supplemented SCOM and PerfStats
• Findings: a large number of misaligned VMs
• Most severe latencies happened between midnight and 5:00 a.m.
Intelligence instead of data
• OnCommand Balance: performance, capacity, and analytics
Misaligned VMs on a LUN
[Diagram: a VMDK's NTFS blocks stacked over VMFS blocks and WAFL blocks, with the MBR/starting offset shifting the guest blocks]
• The VMDK is aligned to the VMFS file system.
• The VMFS file system is aligned to the WAFL file system, so the VMFS blocks align to the WAFL blocks.
• The MBR starting offset causes the NTFS blocks to be misaligned with the WAFL blocks.
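The offset arithmetic behind the diagram can be sketched in a few lines. This is an illustrative check only (it assumes 4 KB WAFL blocks and 512-byte sectors; it is not a NetApp tool):

```python
# Illustrative alignment check: a guest partition is aligned when its
# starting byte offset is a multiple of the storage block size.
WAFL_BLOCK = 4096  # WAFL writes in 4 KB blocks

def is_aligned(start_offset_bytes: int, block_size: int = WAFL_BLOCK) -> bool:
    """True when the partition start falls on a storage block boundary."""
    return start_offset_bytes % block_size == 0

# Classic MBR layout: first partition starts at sector 63 (512-byte sectors)
legacy_offset = 63 * 512     # 32256 bytes: NOT a multiple of 4096 -> misaligned
modern_offset = 2048 * 512   # 1 MiB (Windows Server 2008+ default) -> aligned

print(is_aligned(legacy_offset))  # False
print(is_aligned(modern_offset))  # True
```

The legacy 63-sector offset is exactly the "MBR or starting offset" shown in the diagram that pushes every NTFS block out of step with the WAFL blocks beneath it.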
Properly aligned VMs on a LUN
[Diagram: with an aligned starting offset, the NTFS blocks, VMFS blocks, and WAFL blocks all line up one-to-one]
Properly aligned VM IO
[Diagram sequence: a guest write flowing through the NTFS, VMFS, and WAFL block layers]
• In a properly aligned configuration, each guest OS block (NTFS/EXT3) maps to exactly one block on the storage array.
• When a write occurs from the guest OS, the write is cached in NVRAM and then acknowledged back to the guest.
• The cached block is written to disk later, and the NVRAM copy is invalidated.
• Because of NetApp WAFL and NVRAM technology, NetApp controllers can write to disk very quickly, so NVRAM rarely fills up.
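The aligned write path described above can be modeled as a toy simulation. This is a deliberate simplification for illustration; real NVRAM and WAFL behavior is far more involved:

```python
# Toy model of the aligned write path: cache in NVRAM, acknowledge the
# guest immediately, flush to disk later at a consistency point.
class Controller:
    def __init__(self):
        self.nvram = []   # pending (block_no, data) writes
        self.disk = {}    # block number -> data

    def guest_write(self, block_no: int, data: bytes) -> str:
        self.nvram.append((block_no, data))  # cached in NVRAM
        return "ACK"                         # acknowledged back to the guest

    def consistency_point(self) -> None:
        for block_no, data in self.nvram:    # written to disk later
            self.disk[block_no] = data
        self.nvram.clear()                   # NVRAM entries invalidated

c = Controller()
print(c.guest_write(7, b"x" * 4096))  # ACK returned before any disk I/O
c.consistency_point()                 # block 7 now on disk, NVRAM empty
```

The key point from the slides is visible in the model: the guest sees only the NVRAM latency, never the disk latency, as long as consistency points keep up.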
Misaligned VM IO
[Diagram sequence: a guest write straddling two WAFL blocks]
• In a misaligned configuration, each Windows block is stored on multiple blocks on the storage array.
• A guest write therefore becomes two partial writes (half of each of two storage blocks). The storage controller caches the writes in NVRAM and acknowledges the guest as before.
• Because each partial write touches only part of two blocks, the controller still needs the other half of each block and must preserve it.
• To do this, it must first read the old blocks in. Normally this is done during a consistency point (CP), or when an entire write stripe is ready, since the rest of the block may still come in.
• Only then can the new blocks be built and written back out, and the NVRAM copies invalidated.
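The read-modify-write penalty follows directly from counting which storage blocks a guest write lands on. A small sketch (illustrative model, assuming 4 KB blocks and 512-byte sectors):

```python
# How many 4 KB storage blocks does a guest write touch?
# Aligned: one block per 4 KB write. Misaligned: two (partial writes).
BLOCK = 4096

def storage_blocks_touched(guest_offset: int, length: int,
                           partition_start: int) -> int:
    """Count the storage blocks covered by a guest write, given the
    partition's starting byte offset on the LUN."""
    lun_start = partition_start + guest_offset
    lun_end = lun_start + length - 1
    return lun_end // BLOCK - lun_start // BLOCK + 1

# One 4 KB guest write:
print(storage_blocks_touched(0, BLOCK, partition_start=2048 * 512))  # 1
print(storage_blocks_touched(0, BLOCK, partition_start=63 * 512))    # 2
```

With the legacy 63-sector offset, every 4 KB guest write spills across two WAFL blocks, forcing the controller to read both old blocks before it can rebuild and write them out, which is exactly the extra work the slides describe.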
Net effect
• This process causes consistency points to take longer
• Increases CPU load on the controller
• No effect on VM performance, if the controller can “keep up”
• If load increases, a dramatic spike in latency can occur
• Ultimately determines how many VMs can be hosted on a storage system
How to correct misalignment
• Adjust the MBR or boot sector with MBRalign or VMware Converter
  • Permanent solution
  • Requires downtime for the VM
• Create an “optimized datastore”
  • No downtime required for the VM
  • Few vendors offer this
  • Must be sure not to mix misaligned VMs and aligned VMs
Misaligned VMs on an optimized LUN
[Diagram: the same misaligned VMDK, but the LUN itself is offset so the guest blocks land on storage block boundaries]
• The VMDK is aligned to the VMFS file system.
• The VMFS file system is “improperly” aligned to the storage file system so that the NTFS blocks align to the storage blocks.
• This compensating offset causes the NTFS blocks to be aligned with the storage blocks.
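The idea behind the optimized datastore is a compensating shift at the LUN level. A sketch of that arithmetic (illustrative and vendor-specific; 4 KB blocks and 512-byte sectors assumed):

```python
# "Optimized LUN" idea: present the LUN with a compensating offset so
# a misaligned guest partition start falls on a storage block boundary.
BLOCK = 4096

def compensating_offset(partition_start: int, block: int = BLOCK) -> int:
    """LUN-level shift that moves the guest partition start onto a
    block boundary (0 when the partition is already aligned)."""
    return (-partition_start) % block

shift = compensating_offset(63 * 512)       # 512 bytes for the classic layout
print((63 * 512 + shift) % BLOCK == 0)      # True: NTFS blocks now align
print(compensating_offset(2048 * 512))      # 0: an aligned VM needs no shift
```

This also shows why aligned and misaligned VMs must not share such a datastore: a nonzero shift that fixes a misaligned VM would push an already-aligned VM off the block boundaries.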
Getting closer
• A latency spike remained late at night, with no corresponding IO.
• The time coincided with an aggregate-level snapshot.
• Aggregate snapshots are on by default on every system; usually there is no noticeable activity.
• If significant space is released, the snapshot triggers a disk cleanup process.
• The cleanup process was colliding with the SQL DB copy, causing the latency spike. (Its priority has since been adjusted.)
• Still had lingering, and seemingly random, spikes
• Used Veeam’s Management Pack for VMware
  • Agentless vSphere monitoring and management
  • System Center Operations Manager plug-in
• Used the report “Virtual Machines: Disk Performance History”
• Finding: the cumulative effect of client software
What did we learn?
• An underused storage subsystem can mask environment misconfigurations.
• Storage performance issues are rarely due to a single cause.
• In this case, there were three causes:
  1. VM alignment
  2. Storage resource contention from a background process
  3. Suboptimal antimalware configuration
Other lessons learned 1. Invest in monitoring tools to detect problems.
2. Fix misconfigurations before they become a problem.
3. Engage your vendor to assist with the troubleshooting process.