Priti Mishra MTS, VMware Bing Tsai Sr. R&D Manager, VMware
Priti Mishra, MTS, VMware
Bing Tsai, Sr. R&D Manager, VMware
AP02
NFS & iSCSI: Performance Characterization and Best Practices in ESX 3.5
Housekeeping
Please turn off your mobile phones, blackberries and laptops
Your feedback is valued: please fill in the session evaluation form (specific to that session) & hand it to the room monitor / the materials pickup area at registration
Each delegate to return their completed event evaluation form to the materials pickup area will be eligible for a free evaluation copy of VMware’s ESX 3i
Please leave the room between sessions, even if your next session is in the same room as you will need to be rescanned
Topics
General Performance Data and Comparison
Improvements in ESX 3.5 over ESX 3.0.x
Performance Best Practices
Troubleshooting Techniques
Basic methodology
Tools
Case studies
Key performance improvements since ESX3.0.x (1 of 3)
NFS
Accurate CPU accounting further improves load balancing among multiple VMs
Optimized buffer and heap sizes
Improvements in TSO support
TSO (TCP segmentation offload) improves large writes
H/W iSCSI (with QLogic 405x HBA)
Improvements in PAE (large memory) support
Results in better multi-VM performance in large systems
Minimized NUMA performance overhead
This overhead exists in physical systems as well
Improved CPU cost per I/O
Key performance improvements since ESX3.0.x (2 of 3)
S/W iSCSI (S/W-based initiator in ESX)
Improvements in CPU costs per I/O
Accurate CPU accounting further improves load balancing among multiple VMs
Increased maximum transfer size
Minimizes iSCSI protocol processing cost
Reduces network overhead for large I/Os
Ability to handle more concurrent I/Os
Improved multi-VM performance
Key performance improvements since ESX3.0.x (3 of 3)
S/W iSCSI (continued)
Improvements in PAE (large memory) support
CPU efficiency much improved for systems with >4GB memory
Minimized NUMA performance overhead
Performance Experiment Setup (1 of 3)
Workload: Iometer
Standard set based on
Request size: 1k, 4k, 8k, 16k, 32k, 64k, 72k, 128k, 256k, 512k
Access mode: 50% read / 50% write
Access pattern: 100% sequential
1 worker, 16 outstanding I/Os
Cached runs
100MB data disks to minimize array/server disk activity
All I/Os served from server/array cache
Gives upper bound on performance
Performance Experiment Setup (2 of 3)
VM information
Windows 2003 Enterprise Edition
1 VCPU; 256 MB memory
No file system used in VM (Iometer sees disk as physical drive)
No caching done in VM
Virtual disks located on RDM device configured in physical mode
Note: VMFS-formatted volumes are used in some tests where noted
Performance Experiment Setup (3 of 3)
ESX Server
4-socket, 8 x 2.4GHz cores
32GB DRAM
2 x Gigabit NICs
One for vmkernel networking: used for NFS and software iSCSI protocols
One for general VM connectivity
Networking Configuration
Dedicated VLANs for data traffic isolated from general networking
How to read performance comparison charts
Throughput: higher is better; a positive difference means higher throughput
Latency: lower is better; a negative difference means lower response time
CPU cost: lower is better; a negative difference means reduced CPU cost
How does the CPU cost metric matter?
CPU Costs
Why is CPU cost data useful?
Determines how much I/O traffic the system CPUs can handle
How many I/O-intensive VMs can be consolidated in a host
How to compute CPU cost
Measure total physical CPU usage in ESX
esxtop counter: Physical Cpu(_Total)
Normalize to per I/O or per MBps
Example: MHz/MBps = {(physical CPU usage as a fraction of 100%) x (# of physical CPUs) x (CPU MHz rating)} / (throughput in MBps)
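The CPU cost formula above can be sketched in a few lines of Python. The function names and the sample inputs (25% usage, 100 MBps, 6000 IOps) are illustrative assumptions, not measurements from this talk; only the 8 x 2.4GHz host shape comes from the experiment setup.

```python
def cpu_cost_mhz_per_mbps(cpu_used_pct, num_cpus, cpu_mhz, throughput_mbps):
    """MHz of physical CPU consumed per MBps of I/O throughput."""
    if throughput_mbps <= 0:
        raise ValueError("throughput_mbps must be positive")
    # Total MHz in use = usage fraction x number of CPUs x MHz rating per CPU.
    mhz_used = (cpu_used_pct / 100.0) * num_cpus * cpu_mhz
    return mhz_used / throughput_mbps


def cpu_cost_mhz_per_iops(cpu_used_pct, num_cpus, cpu_mhz, iops):
    """Same normalization, but per I/O operation per second."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    mhz_used = (cpu_used_pct / 100.0) * num_cpus * cpu_mhz
    return mhz_used / iops


if __name__ == "__main__":
    # Hypothetical reading: Physical Cpu(_Total) at 25% on 8 cores of 2400 MHz.
    print(cpu_cost_mhz_per_mbps(25, 8, 2400, 100))   # MHz per MBps
    print(cpu_cost_mhz_per_iops(25, 8, 2400, 6000))  # MHz per IOps
```

Normalizing per MBps suits large-block workloads, while the per-I/O form is usually more telling for small I/O sizes, where the I/O rate rather than bandwidth drives CPU usage.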
Performance Data
First set: relative to baselines in ESX 3.0.x
Second set: comparison of storage options using Fibre Channel data as the baseline
Last: VMFS vs. RDM physical
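For the first set of charts, a minimal sketch of the percentage difference against a baseline follows. It assumes the charts plot (new - baseline) / baseline, which the slides do not state explicitly; the function names and sample numbers are hypothetical.

```python
# Direction of "better" for each metric, as stated on the chart-reading slide.
HIGHER_IS_BETTER = {"throughput": True, "latency": False, "cpu_cost": False}


def percent_difference(new_value, baseline_value):
    """Percentage change of new_value relative to baseline_value."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero")
    return 100.0 * (new_value - baseline_value) / baseline_value


def is_improvement(metric, pct_diff):
    """True when the sign of the difference is the 'better' direction."""
    if HIGHER_IS_BETTER[metric]:
        return pct_diff > 0
    return pct_diff < 0


if __name__ == "__main__":
    # Hypothetical numbers: latency drops from 2.0 ms (ESX 3.0.x) to 1.7 ms (ESX 3.5).
    diff = percent_difference(1.7, 2.0)
    print(round(diff, 1), is_improvement("latency", diff))
```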
Software iSCSI – Throughput Comparison to 3.0.x
[Chart: Sequential 50% Write Throughput Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -20% to 20%. Higher is better.]
Software iSCSI – Latency Comparison to 3.0.x
[Chart: Sequential 50% Write Latency Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -20% to 20%. Lower is better.]
Software iSCSI – CPU Cost Comparison to 3.0.x
[Chart: Sequential 50% Write CPU Efficiency Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -50% to 0%. Lower is better.]
Software iSCSI – Performance Summary
Lower CPU costs
Can lead to higher throughput for small IO sizes when CPU is pegged
CPU costs per IO also greatly improved for larger block sizes
Latency is lower
Especially for smaller data sizes
Read operations benefit most
Throughput levels
Dependent on workload
Mixed read-write patterns show most gain
Read I/Os show gains for small data sizes
Hardware iSCSI – Throughput Comparison to 3.0.x
[Chart: Sequential 50% Write Throughput Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -100% to 100%. Higher is better.]
Hardware iSCSI – Latency Comparison to 3.0.x
[Chart: Sequential 50% Write Latency Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -50% to 50%. Lower is better.]
Hardware iSCSI – CPU Cost Comparison to 3.0.x
[Chart: Sequential 50% Write CPU Efficiency Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -100% to 0%. Lower is better.]
Hardware iSCSI – Performance Summary
Lower CPU costs
Results in higher throughput levels for small IO sizes
CPU costs per IO are especially improved for larger data sizes
Latency is better
Smaller data sizes show the most gain
Mixed read-write and read I/Os benefit more
Throughput levels
Dependent on workload
Mixed read-write patterns show most gain for all block sizes
Pure read and write I/Os show gains for small block sizes
NFS – Performance Summary
Performance also significantly improved in ESX 3.5
Data not shown here in the interest of time
Protocol Comparison
Which storage option to choose?
IP Storage vs. Fibre Channel
How to read the charts?
All data is presented as a ratio to the corresponding 2Gb FC (Fibre Channel) data
If the ratio is 1, the FC and IP protocol data are identical; if < 1, the FC data value is larger
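The ratio reading described above can be sketched as follows. The helper names and sample values are hypothetical; the only rule taken from the slides is that each IP storage value is divided by the corresponding FC value, so the same ratio means opposite things for throughput and latency.

```python
def ratio_to_fc(ip_value, fc_value):
    """IP storage measurement expressed as a ratio to the FC measurement."""
    if fc_value == 0:
        raise ValueError("fc_value must be non-zero")
    return ip_value / fc_value


def favored(metric, ratio):
    """Return 'FC', 'IP', or 'tie' for a throughput or latency ratio."""
    if ratio == 1:
        return "tie"
    if metric == "throughput":
        # Ratio below 1 means the FC throughput value is larger (better).
        return "FC" if ratio < 1 else "IP"
    if metric == "latency":
        # Ratio above 1 means the IP response time is larger (worse).
        return "FC" if ratio > 1 else "IP"
    raise ValueError("unknown metric: " + metric)


if __name__ == "__main__":
    # Hypothetical measurements: 80 MBps over IP storage vs. 100 MBps over FC.
    r = ratio_to_fc(80, 100)
    print(r, favored("throughput", r))
```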
Comparison with FC: Throughput
[Chart: Comparison with FC, Throughput, Sequential 50% Write. X axis: IO size, 1k to 512k. Y axis: ratio to FC, 0.0 to 1.4. Series: H/W iSCSI, S/W iSCSI, NFS. If < 1, FC data value is larger.]
Comparison with FC: Latency
[Chart: Comparison with FC, Avg Response Time, Sequential 50% Write. X axis: IO size, 1k to 512k. Y axis: ratio to FC, 0.0 to 2.5. Series: H/W iSCSI, S/W iSCSI, NFS. Lower is better.]
VMFS vs. RDM
Which one has better performance?
Data shown as ratio to RDM physical
VMFS vs. RDM-physical: Throughput
[Chart: Sequential 50% Write Throughput Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -3.0 to 1.0. Higher is better.]
VMFS vs. RDM-physical: Latency
[Chart: Sequential 50% Write Latency Comparison. X axis: IO size (bytes), 1k to 512k. Y axis: % difference, -1.00 to 3.00. Lower is better.]
VMFS vs. RDM-physical: CPU Cost
[Chart: Sequential 50% Write CPU Cost Comparison. X axis: IO size, 1k to 512k. Y axis: % difference, 0 to 30. Lower is better.]
Topics
General Performance Data and Comparison
Improvements in ESX 3.5 over ESX 3.0.x
Performance Best Practices
Troubleshooting Techniques
Basic methodology
Tools
Case studies
Pre-Deployment Best Practices: Overview
Understand the performance capability of your
Storage server/array
Networking hardware and configurations
ESX host platform
Know your workloads
Establish performance baselines
Pre-Deployment Best Practices (1 of 4)
Storage server/array: a complex system by itself
Total spindle count
Number of spindles allocated for use
RAID level and stripe size
Storage processor specifications
Read/write cache sizes and caching policy settings
Read-Ahead, Write-Behind, etc.
Useful sources of information:
Vendor documentation: manuals, best practice guides, white papers, etc.
Third-party benchmarking reports
NFS-specific tuning information: SPEC-SFS disclosures in http://www.spec.org
Pre-Deployment Best Practices (2 of 4)
Networking
Routing topology and path configurations: # of links in between, etc.
Switch type, speed and capacity
NIC brand/model, speed and features
H/W iSCSI HBAs
ESX host
CPU: revision, speed and core count
Architecture basics
SMP or NUMA?
Disabling NUMA is not recommended
Bus speed, I/O subsystems, etc.
Memory configuration and size
Note: NUMA nodes may not have equal amounts of memory
Pre-Deployment Best Practices (3 of 4)
Workload characteristics
What are the smallest, largest and most common I/O sizes?
What is the read%? write%?
Is the access pattern sequential? random? mixed?
Is response time or aggregate throughput more important?
Is response time variance an issue?
Important: know the peak resource usage, not just the average
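To illustrate why the peak matters more than the average, here is a small sketch that summarizes a series of I/O rate samples. The sample values are invented for illustration, and the nearest-rank percentile is one of several common definitions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Average, 95th percentile, and peak of the samples."""
    return {
        "mean": mean(samples),
        "p95": percentile(samples, 95),
        "peak": max(samples),
    }


# Hypothetical IOps samples: mostly quiet, with two short bursts.
iops_samples = [200, 220, 210, 1500, 230, 205, 215, 1450, 225, 210]
summary = summarize(iops_samples)
print(summary)
```

In this made-up series the bursts are roughly three times the average, so a configuration sized for the mean alone would saturate during the bursts.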
Pre-Deployment Best Practices (4 of 4)
Establish performance baselines by running standardized benchmarks
What’s the upper-bound IOps for small I/Os?
What’s the upper-bound MBps?
What’s the average/worst case response time?
What’s the CPU cost of doing I/O?
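When recording baselines, it helps to convert between I/O rate and bandwidth for each request size. The sketch below assumes 1 MB = 2^20 bytes (benchmark tools differ on this), and the 100 MBps target is a hypothetical figure; the request sizes are the ones listed in the experiment setup.

```python
MIB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def mbps_from_iops(iops, io_size_bytes):
    """Bandwidth in MBps delivered by a given I/O rate and request size."""
    return iops * io_size_bytes / MIB


def iops_from_mbps(mbps, io_size_bytes):
    """I/O rate needed to sustain a given bandwidth at a given request size."""
    return mbps * MIB / io_size_bytes


if __name__ == "__main__":
    target_mbps = 100  # hypothetical bandwidth target
    for size_kb in (1, 4, 8, 16, 32, 64, 72, 128, 256, 512):
        needed = iops_from_mbps(target_mbps, size_kb * 1024)
        print(f"{size_kb:>3}k requests: {needed:,.0f} IOps for {target_mbps} MBps")
```

The table this prints makes the two baseline questions concrete: small requests are limited by achievable IOps long before bandwidth, while large requests hit the bandwidth ceiling at modest I/O rates.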
Additional Considerations (1 of 3)
NFS parameters
# of NFS mount points
Multiple VMs using multiple mount points may give higher aggregate throughput with slightly higher CPU cost
Export option on NFS server affects performance
iSCSI protocol parameters
Header digest processing: slight impact on performance
Data digest processing: turning off may result in
Improved CPU utilization
Slightly lower latencies
Minor throughput improvement
Actual outcome highly dependent on workload
Additional Considerations (2 of 3)
NUMA specific
If only one VM is doing heavy I/O, it may be beneficial to pin the VM and its memory to node 0
If CPU usage is not a concern, no pinning is necessary
On each VM reboot, ESX Server will place it on the next adjacent NUMA node
Minor performance implications for certain workloads
To avoid this movement, the VM should be affinitized using the VI client
SMP VMs
For I/O workloads within an SMP VM that migrate frequently between VCPUs
Pin the guest thread/process to a specific VCPU
Some versions of Linux have a KHz timer rate and may incur high overhead
Additional Considerations (3 of 3)
CPU headroom
Software-initiated iSCSI and NFS protocols can consume a significant amount of CPU in certain I/O patterns
Small I/O workloads require a large amount of CPU; ensure that CPU saturation does not restrict I/O rate
Networking
Avoid link over-subscription
Ensure that all networking parameters, and even the basic gigabit connection, are consistent across the full network path
Intelligent use of VLAN or zoning to minimize traffic interference
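The CPU headroom point can be turned into a rough capacity estimate: given a measured CPU cost per I/O, how high can the I/O rate go before the host CPUs saturate? This is a back-of-the-envelope sketch; the function name, the cost of 0.5 MHz per IOps, and the 80% usable fraction are hypothetical, and it assumes cost per I/O stays constant as load grows.

```python
def max_iops_before_cpu_saturation(num_cpus, cpu_mhz, mhz_per_iops,
                                   usable_fraction=1.0):
    """Upper bound on IOps imposed by CPU alone, given a per-I/O CPU cost."""
    if mhz_per_iops <= 0:
        raise ValueError("mhz_per_iops must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    # CPU budget in MHz that may be spent on I/O processing.
    budget_mhz = num_cpus * cpu_mhz * usable_fraction
    return budget_mhz / mhz_per_iops


if __name__ == "__main__":
    # Host shape from the experiment setup (8 cores at 2400 MHz);
    # cost per I/O and the 80% cap are made-up inputs.
    print(max_iops_before_cpu_saturation(8, 2400, 0.5, usable_fraction=0.8))
```

If the workload's peak I/O rate approaches this bound, CPU rather than the network or the array is likely to be the limiting resource.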
General Troubleshooting Tips (1 of 3)
Identify Components in the whole I/O path
Possible issues at each layer in the path
Check all hardware & software configuration parameters, in particular
Disk configurations and cache management policies on storage server/array
Network settings and routing topology
Design experiments to isolate problems, such as:
Cached runs
Use a small file or logical device, or a physical host configured with RAM-disks, to minimize physical disk effects
Indicate upper-bound throughput and I/O rate achievable
Run tests with single outstanding I/O
Easier for analysis on packet traces
Throughput entirely dependent on I/O response times
Micro benchmarking each layer in the I/O path
Compare to non-virtualized, native performance results
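The single-outstanding-I/O test above works because, with a fixed number of I/Os in flight, the I/O rate follows directly from response time (Little's law). A minimal sketch, with hypothetical function names and the 10 ms figure chosen only as an example:

```python
def iops_from_latency(outstanding_ios, latency_seconds):
    """Little's law: I/O rate = I/Os in flight / average response time."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return outstanding_ios / latency_seconds


def latency_from_iops(outstanding_ios, iops):
    """Average response time implied by an observed I/O rate."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops


if __name__ == "__main__":
    # One outstanding I/O at 10 ms per request cannot exceed this rate.
    print(iops_from_latency(1, 0.010))
    # The standard workload in this talk keeps 16 I/Os outstanding.
    print(iops_from_latency(16, 0.010))
```

Comparing the measured rate against this bound shows whether a throughput problem is simply a latency problem in disguise.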
Collect data
Guest OS data: but don’t trust the CPU%
esxtop data
Storage server/array data: cache hit ratio, storage processor busy%, etc.
Packet tracing with tools like tcpdump, Ethereal, Wireshark, etc.
General Troubleshooting Tips (2 of 3)
Analyze performance data
Do any stats, e.g., throughput or latency, change drastically over time?
Check esxtop data for anomalies, e.g., CPU spikes or excessive queueing
Server/array stats
Compare array stats with ESX stats
Is cache hit ratio reasonable? Storage processor overloaded?
Network trace analysis
Inspect packet traces to see if
NFS and iSCSI requests are processed in a timely manner
IO sizes issued by the guest match the transfer sizes over the wire
Block addresses are aligned to appropriate boundaries
General Troubleshooting Tips (3 of 3)
Isolating Performance Problems: Case Study#1 (1 of 3)
Symptoms
Throughput can reach Gigabit wire speed doing 128KB sequential reads from a 20GB LUN on an iSCSI array with 2GB cache
Throughput degrades for larger data sizes beyond 128KB
From esxtop data
CPU utilization also lower for I/O sizes larger than 128KB
CPU cost per I/O is in expected range for all I/O sizes
Isolating Performance Problems: Case Study#1 (2 of 3)
From esxtop or benchmark output
I/O response times in the 10 to 20ms range for the problematic IOs
Indicates constant physical disk activity required to serve the reads
From network packet traces
No retransmissions or packet loss observed, indicating no networking issue
Packet time stamps indicate the array takes 10ms to 20ms to respond to a read request, with no delay in the ESX host
From cached run results
No throughput degradation above 128KB!
Problem exists only for file sizes exceeding cache capacity
Array appears to have cache-management issues with large sequential reads
Isolating Performance Problems: Case Study#1 (3 of 3)
From native tests to same array
Same problem observed
From the administration GUI of the array
Read-ahead policies set to highly aggressive
Is the policy appropriate for the workload?
Solution
Understand performance characteristics of the array
Experiment with different read-ahead policies
Try turning off read-ahead entirely to get the baseline behavior
Isolating Performance Problems: Case Study#2 (1 of 4)
Symptoms
1KB random write throughput much lower (< 10%) than sequential writes to a 4GB vmdk file located on an NFS server
Even after extensive warm-up period
But very little difference in performance between random and sequential reads
From NFS server spec
3GB read/write cache
Most data should be in cache after warming up
Isolating Performance Problems: Case Study#2 (2 of 4)
From esxtop and application/benchmark data
CPU% utilization lower but CPU cost per I/O mostly same regardless of randomness
Not likely a client side (i.e., ESX host) issue
Random write latency in the 20ms range
Sequential write < 1ms
From NFS server stats
Cache hit% much lower for random writes, even after warm-up
Isolating Performance Problems: Case Study#2 (3 of 4)
From cached runs to a 100MB vmdk
Random write latency almost matches sequential write
Again, suggests that issue is not in ESX host
From native tests
Random and sequential write performance is almost the same
From network packet traces
Server responds to random writes in 10 to 20ms, sequential writes in <1ms
Offset in NFS WRITE requests is not aligned to power-of-2 boundary
Packet traces from native runs show correct alignment
Isolating Performance Problems: Case Study#2 (4 of 4)
Question
Why are sequential writes not affected?
NFS Server file system idiosyncrasies
Manages cache memory at 4KB granularity
Old blocks are not updated in place; writes go to new blocks
Each < 4KB write incurs a read from the old block
Aggressive read-ahead masks the read latency associated with sequential writes
Solution
Use disk alignment tool in the guest OS to align disk partition
Alternatively, use unformatted partition inside guest OS
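The alignment check used on the packet traces in this case study can be sketched as below. The 4KB block size comes from the server behavior described above; the function names are hypothetical, and the 63-sector partition start (32256 bytes) is an illustrative assumption based on a common legacy MBR layout, not a value taken from the traces.

```python
BLOCK_SIZE = 4096  # cache granularity of the NFS server in this case study


def is_aligned(offset, boundary=BLOCK_SIZE):
    """True when a byte offset falls exactly on the given boundary."""
    return offset % boundary == 0


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Number of server blocks that a write of `length` bytes overlaps."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return last_block - first_block + 1


if __name__ == "__main__":
    legacy_start = 63 * 512  # assumed partition start: sector 63
    print(is_aligned(legacy_start))            # misaligned partition start
    print(blocks_touched(legacy_start, 4096))  # one 4KB write spans two blocks
    print(blocks_touched(0, 4096))             # aligned write stays in one block
```

A write that only partly covers a block forces the server to fetch the rest of that block first, which matches the extra reads seen for random writes in this case.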
Summary and Takeaways
IP-based storage performance in ESX is being constantly improved; Key enhancements in ESX 3.5:
Overall storage subsystem
Networking
Resource scheduling and management
Optimized NUMA, multi-core, and large memory support
IP-based network storage technologies are maturing
Price/performance can be excellent
Deployment and troubleshooting could be challenging
Knowledge is key: server/array, networking, host, etc.
Stay tuned for further updates from VMware
Questions?
NFS & iSCSI – Performance Characterization and Best Practices in ESX 3.5
Priti Mishra & Bing Tsai, VMware