CPU bottleneck issues netapp



HIGH CPU Utilization Issues on NetApp Filer


Page 1: CPU bottleneck issues netapp

NetApp CPU Bottleneck Issues

Some help when dealing with CPU bottleneck issues

A general strategy for analyzing bottlenecks is to use both service metrics (protocol, volume, and LUN latency) and component metrics (CPU, disk I/O, network I/O). Together they provide a holistic view of the system and reduce the chance of drawing a false conclusion.

To begin with, it helps to understand how Data ONTAP makes use of multiple CPUs.

The Data ONTAP operating system implements coarse-grained symmetric multiprocessing (CSMP).

This means that Data ONTAP spreads processes across multiple CPUs, and these processes are grouped into domains. The key point is that although different domains can run simultaneously on different processors, each individual domain can only run on a single CPU at any one time. This is useful, because any domain showing 100% usage indicates a CPU bottleneck for that bundle of related processes.

When you run 'sysstat -M 1' you can see CPU statistics across these domains:

Network

Protocol

Cluster

Storage

Raid

Target

Kahuna

WAFL_Ex(Kahu)

A domain bottleneck is reached when a single domain reaches 100% utilization (for example Network, Storage, Raid, Target, or Kahuna).
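As a rough illustration of the rule above, the following sketch flags any domain at or above a threshold. It assumes the per-domain percentages have already been extracted from the 'sysstat -M 1' output into a dictionary; the function name, the dictionary layout, and the sample numbers are hypothetical and not part of any NetApp tool.

```python
def saturated_domains(sample, threshold=100):
    """Return the names of domains whose utilization is at or above threshold."""
    return sorted(name for name, pct in sample.items() if pct >= threshold)


# Hypothetical one-second sample; the numbers are made up for illustration.
sample = {
    "Network": 12,
    "Protocol": 3,
    "Storage": 100,
    "Raid": 41,
    "Target": 0,
    "Kahuna": 57,
}
print(saturated_domains(sample))  # ['Storage']
```

A single saturated sample is only a hint; the check becomes meaningful when it holds across many consecutive samples.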

Page 2: CPU bottleneck issues netapp

High CPU does not always indicate a problem on the filer. For example, on a multiprocessor filer the output of sysstat -x 1 can be quite deceiving because it does not show the average (AVG) utilization percentage, which is a truer indicator of system performance.

What is Processor utilization?

Processor utilization is simply the percentage of time the processor is busy.

For example, sysstat -x 1 may show a very high CPU percentage, whereas sysstat -m 1 on the same system shows rather normal figures.

Page 3: CPU bottleneck issues netapp

USEFUL KBs

Block reclamation scanners cause kahuna bottleneck.

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480

What is the 'wafl scan status' command?

https://kb.netapp.com/support/index?page=content&id=3011346

How does Data ONTAP make use of multiple CPUs?

https://kb.netapp.com/support/index?page=content&id=3010150

[Apparently this KB: 3010150 is removed from the NetApp Support site]

What causes High CPU during disk scrub although raid.scrub.perf_impact is set to low?

https://kb.netapp.com/support/index?page=content&id=3011323

Data ONTAP 8: sysstat shows high CPU utilization on multiple processor system

https://kb.netapp.com/support/index?page=content&id=2013653

How does Data ONTAP schedule work across multiple physical CPUs?

https://kb.netapp.com/support/index?page=content&id=3010118

[Apparently this KB: 3010150 is removed from the NetApp Support site]

If the filer acts as a SnapMirror destination, it is busy running the deswizzler after a SnapMirror update, which can cause high CPU usage. By the way, what is the deswizzler, or deswizzling?

https://kb.netapp.com/support/index?page=content&actp=LIST&id=3011866

You can monitor the deswizzler work with the command wafl scan status:

https://kb.netapp.com/support/index?page=content&id=3011346

Diagnosing NetApp CPU Issues – Kahuna Bottlenecks

http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahuna-bottlenecks/

Page 4: CPU bottleneck issues netapp

Nice to know

FACT: “A high CPU reading on a storage controller does not always mean a CPU bottleneck or a performance problem. In Data ONTAP, high CPU means only that the system is doing a lot of work. If the storage controller is not busy with user protocol workload, it does background work such as deswizzling or disk scrubbing. But if user workload is introduced, Data ONTAP is able to throttle this scanner work down in order to dedicate the CPU to the user workload.”

FACT: “During disk scrubbing, the system checks the disk blocks of all disks for media errors and parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by reconstructing the data from other disks and rewriting it, which is why CPU load is high at that time. To minimise the performance impact, you can schedule the disk scrub for non-peak hours or set the RAID scrub performance impact to low by using:”

filer>options raid.scrub.perf_impact low

WAFL SCAN

There are many background WAFL scans for internal file system maintenance. As a result, one might see read/write activity in the sysstat -x 1 output even without user load. WAFL scan is one such activity: it is always on and is prioritized to run when the filer is idle.

Volume vol0:

Scan id   Type of scan                  progress

    213   active bitmap rearrangement   fbn 1513 of 2230 w/ max_chain_len 3

This is normal!
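To get a feel for how far such a scan has progressed, the progress column can be turned into a percentage. The sketch below assumes the progress text has the form "fbn X of Y" seen in the sample line above; other scan types may report progress differently, and the function name is hypothetical.

```python
import re

# Matches progress text of the form "fbn <done> of <total>", as in the sample line.
FBN_PROGRESS = re.compile(r"fbn\s+(\d+)\s+of\s+(\d+)")


def scan_progress_percent(line):
    """Return scan progress as a percentage, or None if no 'fbn X of Y' text is found."""
    match = FBN_PROGRESS.search(line)
    if match is None:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return 100.0 * done / total


line = "213 active bitmap rearrangement fbn 1513 of 2230 w/ max_chain_len 3"
print(round(scan_progress_percent(line), 1))  # 67.8
```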

Page 5: CPU bottleneck issues netapp

NetApp performance Diagnosis commands

Note: Remember to enable session logging in PuTTY, because the output will often exceed the screen length. Also note that certain commands may not be available at the admin privilege level [priv set admin]; you may have to move to a higher level with [priv set advanced] or [priv set diag].

TIP: If you are not confident about running these commands on a production filer, keep a SIMULATOR running by your side. You can run the commands on the SIMULATOR first and build up your confidence before running them in production.

This command gives you overall stats per second. [You can change the interval by providing a different value such as 2, 3, 5, or 6; for example, sysstat -x 5]

filer>sysstat -x 1

This gives you a second-by-second readout of the filer's performance. In particular, look at the CP time and CP type columns: if you are constantly hitting 100% CP time and the CP type shows lots of B's (back-to-back CPs), the NVRAM cache is being flooded and the filer is struggling to write all the incoming data quickly enough. A related CP type, deferred back-to-back CPs (CP generated CP, shown as a lowercase b), probably indicates that the condition is getting worse.
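The check described above can be sketched as follows. It assumes the CP time and CP type columns have already been pulled out of the sysstat -x 1 output into (cp_time_percent, cp_type) pairs; the function name and the 80% cut-off used to approximate "constantly" are assumptions for illustration, not NetApp guidance.

```python
def back_to_back_pressure(samples, min_fraction=0.8):
    """Return True when most samples show 100% CP time together with CP type 'B'.

    samples: list of (cp_time_percent, cp_type) pairs, one per sysstat interval.
    min_fraction: assumed share of samples that counts as "constantly".
    """
    if not samples:
        return False
    hits = sum(
        1 for cp_time, cp_type in samples if cp_time >= 100 and cp_type == "B"
    )
    return hits / len(samples) >= min_fraction


# Hypothetical readings: four of five intervals are back-to-back at 100% CP time.
readings = [(100, "B"), (100, "B"), (100, "B"), (87, "F"), (100, "B")]
print(back_to_back_pressure(readings))  # True
```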

filer>priv set diag

filer>statit -b

Wait about 5 seconds, then run:

filer>statit -e

These commands give detailed stats of filer disk performance. The first begins (-b) the performance snapshot and the second ends (-e) it. The output can indicate which disks are being hammered.

You may also refer to the following PDF [Monitoring Storage Performance using NetApp Operations Manager]:

http://media.netapp.com/documents/tr-4090.pdf

NetApp Storage Monitoring Using HP OpenView

http://www.netapp.com/us/media/tr-3688.pdf

Page 6: CPU bottleneck issues netapp

Average CPU HIGH Bottleneck

To check how all the CPUs are doing:

filer>priv set diag

filer>sysstat -m 1

sysstat -m displays per-processor and average utilization.

The ANY column in sysstat -m output shows the percentage of the time that one or more CPUs were busy. In addition to this, the utilization of each individual processor is displayed, as well as the average (AVG).

As long as the average CPU is not at 100%, there is nothing to worry about. NetApp OnCommand Performance Advisor might show CPU consistently as high as 100%, but do not panic: it is only plotting the percentage of time that one or more CPUs were busy.

In the example output, AVG CPU is pretty normal.

Only if AVG CPU stays at 100% consistently do you need to be concerned. In that case, talk to NetApp and check whether you are hitting a bug.
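The difference between ANY and AVG is easy to see with a toy model. The sketch below is not how Data ONTAP measures utilization; it simply assumes time is cut into slices with a busy/idle flag per CPU, and computes both figures from the definitions given above. All names and data are hypothetical.

```python
def any_and_avg(busy_flags):
    """Compute (ANY %, AVG %) from per-slice, per-CPU busy flags.

    busy_flags: one list per time slice, holding a True/False busy flag per CPU.
    """
    slices = len(busy_flags)
    cpus = len(busy_flags[0])
    # ANY: share of slices in which at least one CPU was busy.
    any_pct = 100.0 * sum(1 for s in busy_flags if any(s)) / slices
    # AVG: share of all CPU-slices that were busy.
    avg_pct = 100.0 * sum(sum(s) for s in busy_flags) / (slices * cpus)
    return any_pct, avg_pct


# Four CPUs taking turns: exactly one CPU is busy in every time slice.
slices = [[cpu == t % 4 for cpu in range(4)] for t in range(8)]
print(any_and_avg(slices))  # (100.0, 25.0)
```

Here ANY reads 100% while each CPU is busy only a quarter of the time, which is the situation in which a single "CPU busy" figure looks alarming but AVG shows plenty of headroom.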

Page 7: CPU bottleneck issues netapp

Kahuna bottleneck

A Kahuna bottleneck occurs when the sum of the Kahuna domain and the (Kahu) portion of the WAFL_Ex domain reaches 100% utilization.

To check how all the CPUs are doing across all domains:

filer>sysstat -M 1

In this example, I have circled the Kahuna domain and squared (Kahu) to make it clear. Kahuna + (Kahu) adds up to 95% and 96%, which is quite high but not yet at the 100% mark.

IMP: Kahuna processes and (Kahu) processes cannot run simultaneously, so a potential Kahuna bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.

It is important to keep watch on this combined percentage; it is a matter of concern if it consistently remains at 100% for days on end. In most cases it returns to normal within a few hours, so do not panic.
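The Kahuna rule can be sketched as a small check over a series of samples. It assumes the Kahuna and (Kahu) percentages have already been read from the sysstat -M 1 output; the function name, the sample split between the two values, and the choice of three consecutive samples as "sustained" are hypothetical.

```python
def sustained_kahuna_bottleneck(samples, threshold=100, min_consecutive=3):
    """Return True if Kahuna + (Kahu) stays at or above threshold
    for at least min_consecutive samples in a row.

    samples: list of (kahuna_pct, kahu_pct) pairs, one per interval.
    """
    run = 0
    for kahuna_pct, kahu_pct in samples:
        if kahuna_pct + kahu_pct >= threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False


# Totals of 95 and 96, as in the example above: high, but not a bottleneck.
print(sustained_kahuna_bottleneck([(60, 35), (61, 35)]))  # False
```

In practice the window would be hours or days rather than a handful of samples, in line with the advice above not to react to short spikes.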

Page 8: CPU bottleneck issues netapp

Reach Out to NetApp Support

If you are unable to make sense of all this, do not worry: just contact NetApp technical support by phone or email. They are really good. In most cases, they will ask you to collect logs and upload them to the NetApp Support site.

To help you do this, NetApp support will direct you to the following tools for log collection:

Tool: Perfstat

C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out

Download the perfstat tool from the NetApp Support Site – Perfstat tool.

http://support.netapp.com/NOW/download/tools/perfstat/

Tool: NSanity

Collects details of all SAN related components for end-to-end diagnosis.

For full command info check the NSanity page on the NOW site.

http://support.netapp.com/NOW/download/tools/nsanity/

How to upload a file to NetApp

https://kb.netapp.com/support/index?page=content&id=1010090

Page 9: CPU bottleneck issues netapp

BUGs that are linked to HIGH CPU Utilization

IMPORTANT TIP: Whenever you open a bug page on the NetApp Support site, always follow the link at the bottom of the 'Fixed-In Version' section, titled: A complete list of releases where this bug is fixed is available here. This is because the Fixed-In Version section may not list every Data ONTAP version in which the bug is fixed.

As shown in the figure below:

BUG: 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used' scanners

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=648017

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798

[Note: BUG 648017 is fixed in release 8.1.2P3 and later, which indicates that the bug is present in 8.1.2. That said, it does not mean that you are hitting this bug.]

BUG:91653: Volume SnapMirror source has high CPU usage

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=91653

BUG:110630: Wildcard searches from CIFS on large directories are CPU-intensive

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=110630

C-MODE BUG: 595957: High CPU utilization on Cluster-Mode storage systems that have a high number of SAS shelves and disks

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957

Page 10: CPU bottleneck issues netapp

BUG: 590193:WAFL background file system scanner may cause high CPU usage.

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193

BUG:164124: Kerberos replay cache can cause high CPU usage

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=164124

Courtesy: NetApp


Jan, 2014