ashwin-pawar -
NetApp CPU Bottleneck Issues
Some help when dealing with CPU bottleneck issues
A general strategy for analyzing bottlenecks is to use both service metrics (protocol, volume, and LUN
latency) and component metrics (CPU, disk I/O, network I/O). Together they give a holistic view of the
system and reduce the chance of drawing a false conclusion.
First, though, it helps to understand how Data ONTAP makes use of multiple CPUs.
The Data ONTAP operating system implements coarse-grained symmetric multiprocessing (CSMP).
Data ONTAP divides its processes into different domains and spreads them across multiple CPUs.
The key point is that although different domains can run simultaneously on different processors,
each individual domain can only run on a single CPU at any one time. This is useful, because it
means that any domain showing 100% usage indicates a CPU bottleneck for that bundle of related
processes.
When you run 'sysstat -M 1' you can see CPU statistics across these domains:
Network
Protocol
Cluster
Storage
Raid
Target
Kahuna
WAFL_Ex(Kahu)
A domain bottleneck is reached when a single domain reaches 100% utilization (for example Network,
Storage, Raid, Target, or Kahuna).
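The rule above can be sketched in a few lines of Python. This is only a minimal illustration: the sample dictionary, its values, and the function name saturated_domains are hypothetical, and the sketch does not parse real sysstat -M output.

```python
# Hypothetical per-domain CPU percentages for one sampling interval.
# The values are invented for illustration, not taken from a real filer.
sample = {
    "Network": 12,
    "Protocol": 8,
    "Cluster": 0,
    "Storage": 35,
    "Raid": 100,
    "Target": 3,
    "Kahuna": 41,
}


def saturated_domains(domains, threshold=100):
    """Return the names of domains at or above the threshold, sorted."""
    return sorted(name for name, pct in domains.items() if pct >= threshold)


print(saturated_domains(sample))
```

With these invented numbers only the Raid domain is flagged, even though the other domains are far from busy.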
High CPU does not always indicate a problem on the filer. For example, on a multiprocessor filer the output of sysstat -x 1 can be quite deceiving because it does not show the AVG utilization percentage, which is a truer indicator of system performance.
What is processor utilization?
Processor utilization is simply the percentage of time the processor is busy.
For example, sysstat -x 1 may show a very high percentage,
whereas sysstat -m 1 shows rather normal figures.
USEFUL KBs
Block reclamation scanners cause kahuna bottleneck.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480
What is the 'wafl scan status' command?
https://kb.netapp.com/support/index?page=content&id=3011346
How does Data ONTAP make use of multiple CPUs?
https://kb.netapp.com/support/index?page=content&id=3010150
[Apparently this KB: 3010150 is removed from the NetApp Support site]
What causes High CPU during disk scrub although raid.scrub.perf_impact is set to low?
https://kb.netapp.com/support/index?page=content&id=3011323
Data ONTAP 8: sysstat shows high CPU utilization on multiple processor system
https://kb.netapp.com/support/index?page=content&id=2013653
How does Data ONTAP schedule work across multiple physical CPUs?
https://kb.netapp.com/support/index?page=content&id=3010118
[Apparently this KB: 3010150 is removed from the NetApp Support site]
If the filer acts as a SnapMirror destination, it will be busy running the deswizzler after a SnapMirror
update, which can cause high CPU usage. So what is the deswizzler, or deswizzling?
https://kb.netapp.com/support/index?page=content&actp=LIST&id=3011866
You can monitor the deswizzler work with the command wafl scan status:
https://kb.netapp.com/support/index?page=content&id=3011346
Diagnosing NetApp CPU Issues – Kahuna Bottlenecks
http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahuna-bottlenecks/
Nice to know
FACT: “High CPU on a storage controller does not always mean a CPU bottleneck or a performance
problem. In Data ONTAP, high CPU means only that the system is doing a lot of work. If the storage
controller is not busy with user protocol workload, it does background work such as deswizzling or
disk scrubbing. But if user workload is introduced into the system, Data ONTAP is able to throttle
this scanner work down in order to dedicate the CPU to the user workload.”
FACT: “During disk scrubbing, the system checks the disk blocks of all disks for media errors
and parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by
reconstructing the data from other disks and rewriting it, which is why CPU load is high during
that time. To minimize the performance impact, you can schedule the disk scrub for off-peak hours
or set the RAID scrub performance impact to low:”
filer>options raid.scrub.perf_impact low
WAFL SCAN
There are many background WAFL scans for internal file system maintenance. As a result, you might
see read/write activity in the sysstat -x 1 output. The scan shown below (active bitmap
rearrangement) is one of them; it is always on and is prioritized to run when the filer is idle.
Volume vol0:
Scan id    Type of scan                   progress
213        active bitmap rearrangement    fbn 1513 of 2230 w/ max_chain_len 3
This is normal!
NetApp performance Diagnosis commands
Note: Remember to turn on session logging in PuTTY, as the output will often exceed the screen
length. Also note that certain commands may not be available at the admin privilege level
[priv set admin]; you may have to switch to a higher level with [priv set advanced] or [priv set diag].
TIP: If you are not confident about running these commands on a production filer, keep a
simulator at hand. You can run the commands on the simulator first and build up confidence
before running them on production.
This command gives you overall stats per second [you can change the interval by providing a
different value such as 2, 3, 5, or 6; for example, sysstat -x 5]
filer>sysstat -x 1
Gives you a second-by-second readout of the filer's performance. In particular, look at the CP time
and CP type columns: if you are constantly hitting 100% CP time and the CP type shows lots of B's
(back-to-back CPs), the NVRAM is being flooded and the filer is struggling to write all the
incoming data quickly enough. A lowercase b denotes deferred back-to-back CPs (CP generated CP),
which probably indicates that the condition is getting worse.
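As a rough way to quantify "constantly", the check described above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name back_to_back_ratio and the sample readings are hypothetical, and the caller is assumed to have already extracted the CP time and CP type values from the sysstat output (parsing is not shown). Only the first character of the CP type is inspected, on the assumption that any trailing character is a phase indicator.

```python
def back_to_back_ratio(samples):
    """Fraction of samples that show 100% CP time with a B or b CP type.

    `samples` is a list of (cp_time_percent, cp_type) pairs that the caller
    has already extracted from sysstat output; parsing is out of scope here.
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for cp_time, cp_type in samples
        if cp_time >= 100 and cp_type[:1] in ("B", "b")
    )
    return hits / len(samples)


# Invented readings, one per second.
readings = [(100, "B"), (100, "B"), (62, "T"), (100, "b"), (100, "B")]
print(f"{back_to_back_ratio(readings):.0%} of samples show back-to-back CPs")
```

A ratio that stays close to 100% over a long capture is the pattern to worry about; a short burst is not.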
filer>priv set diag
filer>statit -b
Then wait 5 seconds, then:
filer>statit -e
This command gives detailed stats of filer disk performance. The first begins (-b) the performance
snapshot and the second ends (-e) it. The output can indicate which disks are being hammered.
You may also refer to following pdf [Monitoring Storage Performance using NetApp Operations
Manager]
http://media.netapp.com/documents/tr-4090.pdf
NetApp Storage Monitoring Using HP OpenView
http://www.netapp.com/us/media/tr-3688.pdf
Average CPU HIGH Bottleneck
To check how all the CPUs are doing:
filer>priv set diag
filer>sysstat -m 1
sysstat -m displays per-processor and average utilization.
The ANY column in sysstat -m output shows the percentage of the time that one or more CPUs were busy. In addition to this, the utilization of each individual processor is displayed, as well as the average (AVG).
As long as the average CPU is not at 100%, there is nothing to worry about. NetApp OnCommand Performance Advisor might consistently show CPU as high as 100%, but do not panic: it is just plotting the percentage of time that one or more CPUs were busy.
As you can see, the AVG CPU is pretty normal.
Only if you see the AVG CPU percentage at 100% consistently do you need to be concerned; in that case, talk to NetApp and check whether you are hitting a bug.
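The gap between ANY and AVG is easy to see with a small worked example in Python. The busy/idle grid below is invented purely for illustration (four CPUs, four time slices, exactly one CPU busy per slice), and the function name any_and_avg is hypothetical; it is not how sysstat itself computes its figures.

```python
# Each row is one time slice; each column says whether that CPU was busy.
# The pattern is invented: exactly one of four CPUs is busy in every slice.
busy = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def any_and_avg(slices):
    """Return (ANY %, AVG %) for a grid of per-CPU busy flags."""
    n_slices = len(slices)
    n_cpus = len(slices[0])
    # ANY: share of slices in which at least one CPU was busy.
    any_pct = 100 * sum(1 for row in slices if any(row)) / n_slices
    # AVG: busy CPU-slices divided by all CPU-slices.
    avg_pct = 100 * sum(sum(row) for row in slices) / (n_slices * n_cpus)
    return any_pct, avg_pct


any_pct, avg_pct = any_and_avg(busy)
print(f"ANY = {any_pct:.0f}%  AVG = {avg_pct:.0f}%")
```

Here ANY comes out at 100% while AVG is only 25%, which is why a tool that plots ANY can look alarming on a system with plenty of headroom.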
Kahuna bottleneck
A Kahuna bottleneck occurs when the sum of the Kahuna domain and the (Kahu) value from the WAFL_Ex
domain reaches 100% utilization.
To check how all the CPUs are doing across all domains:
filer>sysstat -M 1
In the example below, I have circled the Kahuna domain and squared (Kahu) to make it clear.
In this example, the Kahuna domain plus (Kahu) adds up to 95% and 96%, which is quite high but
has not reached the 100% mark yet.
IMP: Kahuna processes and (Kahu) processes cannot run simultaneously, so a potential Kahuna
bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.
It is important to keep watch on this domain percentage; it becomes a matter of concern if it
consistently remains at 100% for days on end. In most cases it returns to normal within a few
hours, so do not panic.
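The Kahuna rule, including the point that only a sustained 100% matters, can be sketched in Python as follows. The function names, the sample pairs, and the min_fraction cutoff are all hypothetical choices made for this illustration; NetApp does not publish such a cutoff, and the sketch assumes the Kahuna and (Kahu) percentages have already been read off the sysstat -M output.

```python
def kahuna_load(kahuna_pct, kahu_pct):
    """Combined load of the Kahuna domain and the (Kahu) part of WAFL_Ex."""
    return kahuna_pct + kahu_pct


def is_sustained_bottleneck(samples, threshold=100, min_fraction=0.9):
    """True when most samples have Kahuna + (Kahu) at or above threshold.

    `min_fraction` is an arbitrary illustrative cutoff, not a NetApp figure.
    """
    if not samples:
        return False
    hits = sum(1 for k, ku in samples if kahuna_load(k, ku) >= threshold)
    return hits / len(samples) >= min_fraction


# Invented (Kahuna, Kahu) pairs; the first two sum to 95 and 96.
samples = [(60, 35), (58, 38), (70, 30), (40, 20)]
print([kahuna_load(k, ku) for k, ku in samples])
print(is_sustained_bottleneck(samples))
```

In this invented capture only one sample reaches 100, so the check reports no sustained bottleneck, which matches the "do not panic" advice above.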
Reach Out to NetApp Support
If you are unable to make sense of all this, do not worry: just contact NetApp technical support by
phone or email; they are really good. In most cases, they will ask you to collect logs and upload
them to the NetApp Support site.
To help you do this, NetApp support will direct you to the following tools for log collection:
Tool: Perfstat
C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out
Download the Perfstat tool from the NetApp Support site:
http://support.netapp.com/NOW/download/tools/perfstat/
Tool: NSanity
Collects details of all SAN related components for end-to-end diagnosis.
For full command info check the NSanity page on the NOW site.
http://support.netapp.com/NOW/download/tools/nsanity/
How to upload a file to NetApp
https://kb.netapp.com/support/index?page=content&id=1010090
BUGs that are linked to HIGH CPU Utilization
IMPORTANT TIP: Whenever you open a bug page on the NetApp Support site, always follow the link at the bottom of the 'Fixed-In Version' section, titled: A complete list of releases where this bug
is fixed is available here. This is because the Fixed-In Version section may not contain the complete list of Data ONTAP versions in which the bug is fixed.
As shown in the figure below:
BUG: 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used'
scanners
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=648017
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798
[Note: BUG 648017 is fixed in release 8.1.2P3 and later, which indicates that the bug is
present in 8.1.2. That said, it does not mean that you are hitting this bug.]
BUG:91653: Volume SnapMirror source has high CPU usage
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=91653
BUG:110630: Wildcard searches from CIFS on large directories are CPU-intensive
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=110630
C-MODE BUG: 595957: High CPU utilization on Cluster-Mode storage systems that have a high
number of SAS shelves and disks
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957
BUG: 590193:WAFL background file system scanner may cause high CPU usage.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193
BUG:164124: Kerberos replay cache can cause high CPU usage
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=164124
Courtesy: NetApp
Jan, 2014