Copyright 2009 Velocity Software, Inc. All Rights Reserved.
Other products and company names mentioned herein may be trademarks of their respective owners.
Linux on z/VM: Understanding CPU Usage
Rob van der Heij
rob @ velocitysoftware.de
Velocity Software GmbH - http://www.velocitysoftware.com/
IBM System z
Technical Conference
Brussels, 2009
Session LX44
Introduction

Why would you care about CPU usage?
- CPU virtualization is the easiest part
  • System z CPU is designed for virtualization and sharing
- Sharing CPU resources raises questions as well
  • Is my workload held back because of CPU constraints?
  • Why did other workload get the resources that I want?
  • If more resources are available, why didn't I get them?
- Sharing the resources requires “social behavior”
  • Cycles wasted on useless work can't be used for real work
Introduction

What could you do about CPU usage?
- Understand where your CPU cycles are spent
- Measure and identify CPU usage
- Reduce peak CPU requirements
  • Find more economical ways to do the work
  • Avoid work that does not need to be done
  • Move some work to quiet “night shift” hours
- Pay attention to “idle load”

Not the same as benchmarks
- Top speed versus mileage
- Most economical way to do the work within the SLA
- Scalability of applications
Agenda

CPU Usage Breakdown
- LPAR
- z/VM
- Virtual Machine

CPU Accounting

Linux CPU Usage
  • What is that Penguin Doing?
  • Linux Server with High Overhead
  • Improving TSM Throughput
  • My Penguin can't Sleep

Performance data shown in the presentation was collected and processed with ESALPS.
CPU Usage Breakdown

Logical Partition Level
- LPARs more or less share the CPU resources
- Logical CPUs are defined as shared or dedicated
- For shared CPUs the LPAR weight may be important
- CPs and IFLs are not mixed
  • Exception: “VM Mode LPAR” - z/VM 5.4 on selected hardware

<---Partition---> VCPU <%Assigned>
Name     No. Type Addr Total Ovhd
-------- --- ---- ---- ----- ----
LP1        0 CP      0  49.2  0.4
                     1  48.6  0.3
                     2  49.2  0.4
LP2        1 IFL     0 100.1  0.0
                     1 100.1  0.0
                     2  99.9  0.0
LP3        2 CP      0   0.1  0.0

Notes: PR/SM overhead is normally pretty small. LP2 is the only IFL LPAR, with 3 dedicated CPUs - nobody to share with.
CPU Usage Breakdown

PR/SM Management Time
- Not attributed to one specific LPAR
  • Scheduling etc. - spread over all physical CPUs
  • Typically < 1% per CPU
  • Depends on the number of LPARs and logical CPUs

LPAR Overhead
- PR/SM work on behalf of a specific LPAR
  • Dispatching, QDIO, etc. - spread over the logical CPUs of the LPAR
  • Typically < 1% per logical CPU
  • Depends on workload

[Chart: physical CPU time divided among LPAR1, LPAR2, LPAR3, unused capacity, LPAR overhead, and LPAR management time; the productive work is done by the guest OS]

<-------Logical Partition--->
              Virt <%Assigned>
Name     Nbr CPUs  Total Ovhd
-------- --- ----  ----- ----
LP1        0    3   83.6  1.2
LP3        2    1    0.1  0.0

Physical CPU Management time:
CPU Percent
--- -------
  0   0.003
  4   0.414
  6   0.414
  9   0.416
 11   0.003
 13   0.003
    _______
Total:  1.254
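The per-CPU figures above should add up to the reported total. A quick sketch of the arithmetic (values transcribed from the report, which rounds to three decimals, so the sum drifts slightly):

```python
# Per-physical-CPU PR/SM management time (percent), from the report above.
mgmt = {0: 0.003, 4: 0.414, 6: 0.414, 9: 0.416, 11: 0.003, 13: 0.003}

total = sum(mgmt.values())
print(f"Total: {total:.3f}")   # reported total is 1.254; rounding explains the drift

# Management time is spread over all physical CPUs,
# typically well under 1% per CPU.
assert all(pct < 1.0 for pct in mgmt.values())
```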
CPU Usage Breakdown

Logical CPU View
- The guest OS dispatches workload over the logical CPUs
  • When there is no work to do, the CPU goes idle (wait state)
  • The LPAR recognizes the idle CPU - the “white space” is shared
    Exception: when “wait complete” is set for the LPAR

[Chart: utilization of logical CPUs 0-2 in LPAR1, including LPAR overhead]

              <-----CPU (percentages)----->
    Total  Emul  User   Sys  Idle
CPU  util  time ovrhd ovrhd  time
--  *----  ----- ----- ----- -----
00   40.9  30.8   6.9   3.3  58.6
01   40.4  32.4   5.9   2.1  59.1
02   40.2  31.9   6.0   2.3  59.3
CPU Usage Breakdown

z/VM View – Totals
- System Overhead - general CP work
  • Scheduling, Monitor, Accounting
- User Overhead - CP work on behalf of a specific user
  • I/O translation, instruction simulation, CP functions
- Emulation Time - productive work for the user
- z/VM metrics: True CPU%

              <-----CPU (percentages)----->
    Total  Emul  User   Sys  Idle
CPU  util  time ovrhd ovrhd  time
--  *----  ----- ----- ----- -----
00   40.9  30.8   6.9   3.3  58.6
01   40.4  32.4   5.9   2.1  59.1
02   40.2  31.9   6.0   2.3  59.3

[Diagram: Emulation Time is vtime; Emulation Time plus User Overhead is ttime; ttime divided by vtime is the T/V ratio; System Overhead and idle time complete the picture]
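The T/V ratio can be derived directly from such a table. A sketch using the CPU 00 row above (a simplification: the user-level T/V ratio relates emulation time plus user overhead to emulation time alone; system overhead is not charged to any one user):

```python
# Figures for CPU 00 from the table above (percent of one CPU).
emul, user_ovhd, sys_ovhd = 30.8, 6.9, 3.3

vtime = emul                 # productive work inside SIE
ttime = emul + user_ovhd     # plus CP work done on behalf of the user

tv_ratio = ttime / vtime
print(f"T/V ratio: {tv_ratio:.2f}")   # ~1.22: for each second of guest work,
                                      # CP spends ~0.22 s on its behalf

# System overhead is general CP work, not attributed to a user.
total_util = emul + user_ovhd + sys_ovhd
print(f"Total utilization: {total_util:.1f}%")
```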
CPU Usage Breakdown

z/VM View – Virtual Machines
- Virtual Time - Virtual Machine work (SIE)
- Total Time - Virtual Time plus User Overhead
- Some virtual machines are not real “users” but system functions
  • RACFVM, TCPIP, DIRMAINT

Screen: ESAUSP2            Velocity Software
1 of 3                     User Percent Utilization

         UserID   <Processor>
Time     /Class   Total Virt
-------- -------- ----- -----
16:13:00 SUSELNX2  3.64  3.59
         REDHAT04  2.89  2.80
         ORACLE    2.12  2.08
         VMRLNX    1.89  1.88
         DXT2LV    0.61  0.31
         ROBLX1    0.35  0.35
         TCPIP     0.28  0.13
         SUSELNX1  0.24  0.21
         REDHAT3   0.21  0.18
         SLES8     0.19  0.18
         ROBLX2    0.12  0.11

The sum of the per-user usage divided by the total gives the capture ratio.
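The capture ratio is the fraction of system-wide CPU usage that can be attributed to specific users. A sketch with the per-user figures from the screen above; the system total of 12.8% is an assumed value for illustration only:

```python
# Per-user CPU percentages from the ESAUSP2 screen above.
per_user = [3.64, 2.89, 2.12, 1.89, 0.61, 0.35, 0.28, 0.24, 0.21, 0.19, 0.12]

captured = sum(per_user)     # usage attributed to specific users
system_total = 12.8          # assumed system-wide total, for illustration

capture_ratio = captured / system_total
print(f"Captured: {captured:.2f}%  Capture ratio: {capture_ratio:.0%}")
# A capture ratio near 100% means almost all CPU time is accounted for.
```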
CPU Accounting

Mainframe operating systems do CPU accounting
- Required for charge-back of shared resources

z/VM account records
- Written at logoff or through a CP command
- Report resource usage per virtual machine
  • CPU usage
  • I/O operations
- Very simple to process
  • Easy to audit
- Do not tell you why
- Lack detail for Linux
CPU Accounting

Charge-back is meant to recover the total data center cost
- CPU is no longer the major cost factor
- CPU usage is traditionally still used for charge-back
  • CPU usage is considered representative of the amount of use
  • Total data center cost is divided by consumed CPU hours
  • The CPU tariff is based on estimated usage and the capacity plan

IFLs with Linux and z/VM break the model
- Installations add a substantial amount of MIPS
- Linux applications also consume a lot of CPU hours
  • A Linux Proof of Concept was charged much of the z/OS license cost

Charge-back motivates users to save resources
- Make sure to arrange a correct cost model for Linux
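The traditional tariff arithmetic is simple division. A minimal sketch with entirely made-up cost figures:

```python
# Hypothetical figures, for illustration only.
datacenter_cost = 1_200_000.0   # total cost to recover per year
consumed_cpu_hours = 40_000.0   # CPU hours consumed by all users per year

tariff = datacenter_cost / consumed_cpu_hours   # cost per CPU hour
print(f"Tariff: {tariff:.2f} per CPU hour")

# A Linux guest that burns many cheap IFL cycles is overcharged under
# this model, which is why IFL workloads need their own tariff.
linux_hours = 5_000.0
print(f"Linux charge: {linux_hours * tariff:,.0f}")
```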
CPU Accounting

CPU accounting for Linux on z/VM needs detail
- Just listing totals is not enough to convince customers
- Exceptional usage must be explained very clearly

A performance monitor can reveal the detail
- Collects CPU data along with many other metrics
  • ESALPS collects ~3500 unique metrics every minute; hundreds of them are repeated per device, per user, or per Linux process
  • Helps to understand the sequence of events causing a problem
  • Explains any excessive usage to the application owner
- Requires a detailed “performance history”
- Requires complete data - a capture ratio of 100%

A performance monitor also helps to validate the cost model
Visualization Techniques
Comparing Memory and CPU Usage
CPU Usage Breakdown

Linux System View
- Virtual Machine Emulation Time is what is available for Linux usage
- Steal Time: time when Linux does not know what the CPU was used for
- Linux administrators beware: “idle” ≠ “available for use”

[Diagram: CPU state categories reported by Linux]
- Linux 2.4: User (process usage), Nice (background process usage), System (kernel-related usage), I/O wait (CPU waiting for I/O), Idle (no CPU usage)
- Linux 2.6 adds: Interrupt and Soft-IRQ (first-level interrupt handlers) and Steal (CPU cycles “stolen” by the hypervisor)
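On Linux 2.6 these categories, including steal, appear as counters on the first line of /proc/stat (in USER_HZ ticks). A sketch that parses a sample line; the sample values are invented:

```python
# Sample first line of /proc/stat (invented values, units of USER_HZ ticks):
#            user nice system idle    iowait irq softirq steal
sample = "cpu  4705 356  584  3699176 23926  0   461     13946"

fields = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
ticks = dict(zip(fields, map(int, sample.split()[1:])))

total = sum(ticks.values())
for name in ("idle", "steal"):
    print(f"{name}: {100 * ticks[name] / total:.2f}%")

# "idle" only counts time the guest knows was idle; "steal" is time the
# hypervisor gave the real CPU to someone else - so idle != available.
```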
CPU Usage Breakdown

Linux Process View
- CPU resources are allocated to processes
- Each process accumulates some system time plus some user (or nice) time
- The processes should add up to the total system and user time
  • Capture ratio!

Linux CPU accounting
- Traditionally wrong due to virtualization
  • Linux tools would show numbers that are too high
- Modern kernels use virtual CPU accounting
  • Linux tools sometimes show wrong data

[Diagram: Linux processes mapped onto User, Nice, and System time]
Why can’t I use my Linux Tools?

Linux data is incomplete and sometimes incorrect
- Virtualization changes the rules of the game
  • CPU usage as perceived by Linux can be very wrong
  • Assumptions about used and available capacity no longer hold
- z/VM performance impacts Linux behavior
- You need to combine Linux and z/VM performance data

z/VM does not clone system administrators
- You may not have time to look when it happens
- Complex interactions make it hard to reproduce scenarios
- A multi-tier application involves multiple virtual servers
- Centralized data collection is easier to manage
- You may need to share data with others to understand it
What is that Penguin doing

High Level Overview
- Shows no real detail
- Sometimes enough for a quick check
Screen: ESAMAIN
1 of 3 System Overview
<---Users----> Transact. <Processor>
<-avg number-> per Avg. Utilization
Time On Actv In Q Sec. Time CPUs Total Virt.
*------- ---- ---- ---- ---- ---- ---- *---- -----
02:49:00 87 72 67.0 19.3 0.35 3 67.1 63.7
02:48:00 87 71 69.0 19.3 0.35 3 65.9 62.6
02:47:00 87 72 64.0 19.1 0.35 3 54.3 50.7
02:46:00 87 68 70.0 18.3 0.40 3 48.6 44.9
02:45:00 87 69 66.0 19.1 0.32 3 42.8 39.3
02:44:00 87 69 68.0 20.1 0.34 3 43.2 39.7
02:43:00 87 69 67.0 18.3 0.35 3 42.8 39.1
What is that Penguin doing

High Level Overview
- Shows a breakdown per class
- Zoom in on one class
Screen: ESAUSP2
1 of 3 User Percent Utilization
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
02:56:00 System: 118 112 19M 19M 142 19M 19M
*TheUsrs 116 110 18M 18M 92.00 18M 18M
LINUX 0.93 0.85 594K 594K 0.00 594K 594K
*Servers 0.68 0.61 11564 4096 0.00 11505 4050
TCPCTL 0.00 0.00 20814 20814 34.00 20672 20672
02:55:00 System: 151 144 19M 19M 140 19M 19M
*TheUsrs 148 142 18M 18M 90.00 18M 18M
LINUX 1.08 0.99 594K 594K 0.00 594K 594K
*Servers 0.66 0.59 11565 4390 0.00 11486 4315
TCPCTL 0.00 0.00 20814 20814 34.00 20672 20672
What is that Penguin doing

Usage Breakdown per user
- So one server used 25% of a CPU during the last minute
  • Is that good or bad?
  • Often you can’t really tell without knowing the behavior over time
Screen: ESAUSP2 ESAMON 3.7.0
1 of 3 User Percent Utilization CLASS *THEUS
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
02:57:00 DOMINOZ1 25.62 23.10 522K 522K 0.00 522K 522K
IBMTSM 21.51 19.35 522K 522K 29.00 522K 522K
TDIRTIM 12.30 12.24 1044K 1044K 0.00 1044K 1044K
EBIZ2 9.72 9.51 260K 260K 0.00 260K 260K
DB2-A1 6.83 6.79 522K 522K 0.00 522K 522K
ACME 4.73 4.53 190K 190K 0.00 190K 190K
EBIZDEV1 4.53 4.42 260K 260K 0.00 260K 260K
EBIZDEV2 4.27 4.18 260K 260K 0.00 260K 260K
EBIZ1 4.23 4.14 260K 260K 0.00 260K 260K
TDIRDB2 3.35 3.32 852K 852K 0.00 852K 852K
IBMRED2 2.70 2.64 260K 260K 0.00 260K 260K
What is that Penguin doing

Single User over Time
- Looking at usage in the recent past shows “when it started”
  • Frequently more productive than waiting until it stops
- For multi-tier applications you need to look at multiple servers
  • Arrange servers in classes for an “application view”
Screen: ESAUSP2 ESAMON 3.7.0
1 of 3 User Percent Utilization CLASS * USER
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
03:07:00 DOMINOZ1 34.75 31.13 522K 522K 0.00 522K 522K
03:06:00 DOMINOZ1 28.27 24.91 522K 522K 0.00 522K 522K
03:05:00 DOMINOZ1 26.57 23.97 522K 522K 0.00 522K 522K
03:04:00 DOMINOZ1 24.01 21.65 522K 522K 0.00 522K 522K
03:03:00 DOMINOZ1 24.07 21.50 522K 522K 0.00 522K 522K
03:02:00 DOMINOZ1 75.92 72.51 522K 522K 0.00 522K 522K
03:01:00 DOMINOZ1 33.35 30.01 522K 522K 0.00 522K 522K
03:00:00 DOMINOZ1 26.31 23.74 522K 522K 0.00 522K 522K
02:59:00 DOMINOZ1 22.17 19.47 522K 522K 0.00 522K 522K
What is that Penguin doing

Looking inside the Linux server
- Identify the Linux processes that consume the resources
Screen: ESALNXP ESAMON 3.7.0 03/27 03:
1 of 3 LINUX VSI Process Statistics Report NODE DOMINOZ1 LIMIT 2
<-Process Ident-> <-----CPU Percents----->
Time Node Name ID PPID GRP Tot sys user syst usrt
-------- -------- --------- ----- ----- ----- ---- ---- ---- ---- ----
03:02:00 dominoz1 clrepl 12194 2536 2483 0.1 0.0 0.1 0.0 0.0
updall 11500 2536 2483 7.9 3.7 4.2 0.0 0.0
smdemf 5209 2536 2483 0.2 0.0 0.1 0.0 0.0
sched 5181 2536 2483 4.7 2.9 1.8 0.0 0.0
update 5174 2536 2483 1.1 0.5 0.5 0.0 0.0
replica 5168 2536 2483 32.8 3.4 29.4 0.0 0.0
server 2536 2483 2483 24.5 4.4 20.1 0.0 0.0
snmpd 1768 1 1767 0.4 0.3 0.1 0.0 0.0
kjournal 1140 1 1 0.1 0.1 0.0 0.0 0.0
kswapd0 134 1 1 0.2 0.2 0.0 0.0 0.0
pdflush 133 8 0 0.1 0.1 0.0 0.0 0.0
*Totals* 0 0 0 72.5 15.8 56.6 0.0 0.0
CPU Overhead

CPU overhead can mean many different things
- Productive work for one is overhead for another
- Make sure your peer means the same thing
- You are only aware of it when you can measure it
- With System z and z/VM we can measure it
- Hardware support keeps overhead mostly low

Sometimes abnormal behavior increases overhead
- Spending resources on things other than the workload
- A performance monitor often helps to clarify things
Linux Server with High Overhead

Customer reports a Linux server with high CP cost
- The Linux server is using 25-30% of a CPU
- Almost half of that is “CP overhead”
  • T/V ratio of 1.8
  • Work that CP does on behalf of the virtual machine
- z/VM has plenty of CPU resources
  • The Linux guest does not appear to be held back

Question
- What is Linux doing?
- Why the high overhead?

                <---CPU time-->
       UserID   <(Percent)> T:V
       /Class   Total Virt  Rat
       -------- ----- ----- ---
12:50  LINUX806 25.84 14.30 1.8
12:51  LINUX806 26.44 14.73 1.8
12:52  LINUX806 28.25 15.37 1.8
12:53  LINUX806 27.78 15.26 1.8
12:54  LINUX806 28.20 15.51 1.8
12:55  LINUX806 29.95 16.52 1.8
12:55  LINUX806 27.01 14.94 1.8

Answer: doing nothing!
Linux Server with High Overhead

Review the Linux internal CPU statistics
- Linux reports a total usage of ~5-6%
- z/VM reports a total usage of ~25-30%
- Someone is off by a factor of 5

The server runs SLES 10
- Uses “virtual time accounting” to get “correct” numbers

Date/    Node     <Processor Pct Util>
Time              Total Syst User Idle
-------- -------- ----- ---- ---- ----
12:50:00 LINUX806   5.9  4.1  1.8  194
12:51:00 LINUX806   6.1  4.4  1.8  194
12:52:00 LINUX806   6.3  4.5  1.8  193
12:53:00 LINUX806   5.9  4.3  1.6  194
12:54:00 LINUX806   6.1  4.4  1.7  189
12:55:00 LINUX806   6.5  4.8  1.8  198
Linux Server with High Overhead

Per-process breakdown
- Many db2sysc processes
  • DB2 worker threads
- One db2fmcd with 1%

Only 2.7% is accounted for
- The remainder has disappeared
- Linux claims “idle”

node/     <-Process Ident-> Nice <------CPU Percents---->
Name         ID  PPID   GRP Valu  Tot  sys user syst usrt
--------- ----- ----- ----- ---- ---- ---- ---- ---- ----
12:51:00 LINUX806
              0     0     0    0 2.79 0.72 1.01 0.30 0.77
events/0      6     1     0   -5 0.02 0.02    0    0    0
kjournal    607     1     1    0 0.02 0.02    0    0    0
kjournal   1713     1     1    0 0.02 0.02    0    0    0
multipat   2607     1  2606    0 0.02 0.02    0    0    0
snmpd      2660     1  2659  -10 0.13 0.08 0.05    0    0
ha_logd    2664  2662  2588    0 0.02    0 0.02    0    0
heartbea   2775     1  2775    0 0.05 0.02 0.03    0    0
heartbea   2778  2775  2775    0 0.02 0.02    0    0    0
ntpd       2805     1  2805    0 0.02 0.02    0    0    0
nscd       2815     1  2815    0 0.02 0.02    0    0    0
cron       2839     1  2839    0 0.08    0    0 0.02 0.07
db2fmcd    3060     1  3060    0 1.01 0.03    0 0.28 0.70
db2fmd     4704     1  4703    0 0.18 0.02 0.17    0    0
db2fmd     5199     1  5198    0 0.18 0.02 0.17    0    0
db2fmp    10758 10736 10736    0 0.03 0.02 0.02    0    0
db2sysc   11154 10741 10736    0 0.02 0.02    0    0    0
db2sysc   11155 10741 10736    0 0.05 0.03 0.02    0    0
db2sysc   11156 10741 10736    0 0.02 0.02    0    0    0
db2sysc   13140 10741 10736    0 0.05 0.03 0.02    0    0
db2sysc   13141 10741 10736    0 0.02    0 0.02    0    0
db2sysc   13148 10741 10736    0 0.07 0.03 0.03    0    0
db2sysc   13152 10741 10736    0 0.02 0.02    0    0    0
db2sysc   13153 10741 10736    0 0.05 0.03 0.02    0    0
db2fmp    15485 15473 15473    0 0.02    0 0.02    0    0
db2sysc   15558 15478 15473    0 0.03 0.02 0.02    0    0
. . .

z/VM monitor: 30% - Linux statistics: 6% - Explained usage: 3%
Linux Server with High Overhead

The DB2 process ‘db2fmcd’ is suspicious
- It has no function with Linux on System z
  • Provided for compatibility with some other configurations
- Largest single source of CPU usage in the sample
  • Likely triggers the work done by the db2sysc processes
- Probably does something that creates high overhead

Reviewed CP trace data to understand the overhead
- Determine the cause of the SIE intercepts
- Normal behavior: Linux goes idle and wakes up again
- But it does that very often…
  • 100,000 SIE intercepts per second

      <CPU percents>  <-----Internal (per second)----->
      Totl Ovrhead Diag Inst    SIE Fast  Page
Time  Util Usr Sys nose  Sim intrcp path fault
----- ---- --- --- ---- ---- ------ ---- -----
12:50 1078  48  77  512  17K  95073  25K  23.9
12:51 1010  49  78 1042  17K 100235  43K  36.4
12:52 1018  50  84  503  16K 103837  21K  21.4
12:53  896  46  69  479  15K 103922  19K  12.2
12:54  909  46  71  506  15K 104306  18K  11.2
12:55  817  46  64  520  14K 111731  23K  15.3
Linux Server with High Overhead

The application requests frequent wake-ups
- Wake-up requests with a delay of less than 10 ms
  • This is polling - frowned upon in a shared environment
- Unclear whether this is a bug or a design failure

A kernel bug rounds the small delay to 0
- Introduced with “high resolution timer” support
- Rounded to 0 ms, i.e. an immediate wake-up

The timer interrupt is presented when enabled
- CP dispatches the virtual machine immediately
- Eventually the minor time slice is consumed
  • The scheduler reviews the queue and dispatches the machine later
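The arithmetic behind these rates: a sleep/wake polling loop generates one dispatch per delay interval, and rounding the delay to zero removes any bound on the rate. A sketch (the delays are illustrative):

```python
def wakeups_per_second(delay_ms: float) -> float:
    """Steady-state wake-up rate of a sleep/wake polling loop."""
    if delay_ms <= 0:
        return float("inf")   # immediate wake-up: CP redispatches at once
    return 1000.0 / delay_ms

print(wakeups_per_second(10.0))   # 100/s  - the classic 10 ms poll
print(wakeups_per_second(1.0))    # 1000/s - sub-10 ms polling
print(wakeups_per_second(0.0))    # inf    - the kernel bug rounds to 0 ms,
                                  # so only dispatch latency limits the rate
```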
Linux Server with High Overhead

Conclusion
- Something in the application is polling
  • Customer traces point at the process db2redom
  • The db2fmcd process was the biggest single consumer
  • Probably DB2 was confused in the recovery process
  • Most likely not productive processing
- High CP overhead due to a Linux kernel bug
  • It turns a short sleep into an immediate wake-up
  • The fix is upstream and will eventually reach the distributions
- Wrong Linux CPU accounting due to another bug
  • The fix is supposed to be in the pipeline
- Latency in z/VM prevented Linux from taking even more CPU
- You can’t always tell from CPU usage alone that a server is looping
Improving TSM Throughput

Customer Scenario
- Nightly backup of discrete servers to TSM on System z
- A dedicated OSA for the Linux server running TSM
- The bottleneck appears to be the physical GbE connection
- Limited CPU usage thanks to QEBSM

[Diagram: TSMSERV with a dedicated OSA; other guests on the VSWITCH]
Improving TSM Throughput

LACP: Link Aggregation Control Protocol
- Bundles multiple physical links into one logical path (IEEE 802.3ad)
- Connects the external switches to the VSWITCH
- Also provides the fail-over function
- Using 4 GbE links should give 4-fold throughput

[Diagram: TSMSERV on the VSWITCH, connected through LACP link aggregation]
Improving TSM Throughput

LACP VSWITCH – Real World Experience
- The potential 4-fold throughput is just theoretical
  • The discrete servers connect with a single GbE link
  • Sufficient servers are needed to provide the data
- Distribution over the physical paths is not balanced
  • Connections are spread over the paths by some hash function
  • In this scenario only 3-4 communication pairs are active
- Still achieved almost 50% improvement over a single fiber
  • Increased the QDIO buffers from 16 to 128

[Chart: Network Throughput - 19 Jan 2009; MB/s received over time (00:00-02:30) per OSA device 0D00-3D00, with an unexplained dip (“Huh?”)]
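The imbalance follows from how link aggregation works: a hash of the addresses picks one physical link per connection, not per frame. A sketch with a deliberately simplified hash (real switches use various hash inputs; the address octets are invented):

```python
# Simplified link selection: XOR of the last address octets, modulo the
# number of links. Real LACP hashes vary by switch, but the effect is the same.
def pick_link(src_octet: int, dst_octet: int, links: int = 4) -> int:
    return (src_octet ^ dst_octet) % links

# Only three active communication pairs (illustrative address octets):
pairs = [(10, 1), (11, 1), (12, 1)]
usage = [0] * 4
for src, dst in pairs:
    usage[pick_link(src, dst)] += 1

print(usage)   # [0, 1, 1, 1]
# With 3 pairs over 4 links, at least one link always stays idle,
# so the 4-fold potential is never reached.
assert min(usage) == 0
```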
Improving TSM Throughput

LACP VSWITCH – Real World Experience
- CP overhead has increased significantly - a T/V ratio of 1.3
- The dedicated OSA was replaced by a VNIC
  • No hardware support from QEBSM - CP simulates SIGA
- Strong correlation between bandwidth and user overhead
  • No strange things happening - a linear relation
  • Receiving 100 MB/s: Linux time ~65% of a CPU, user overhead ~22% of a CPU

[Chart: CP overhead of TSMSERV (CPU%) vs. VSWITCH throughput (MB/s); linear fit y = 0.2217x + 1.311]
[Chart: TSMSERV CPU usage (cp and emul) over time - 19 Jan 2009]
[Chart: TSMSERV emulation time, vtime (CPU%) vs. VSWITCH throughput (MB/s)]
Improving TSM Throughput

LACP VSWITCH – CPU Usage
- More than just the virtual machine
- Also a rather large System Overhead
  • Total CPU utilization of ~190%
- Other high-priority workload kicked in
  • Matches the dip in throughput
  • Throughput is now limited by CPU

[Chart: TSMSERV CPU usage (cp and emul) over time - 19 Jan 2009]
[Chart: CPU usage per user over time - TSMSERV, SAP000, SAP005, SAP025, others]
[Chart: CPU usage over time broken down into System, CP, and User]
Improving TSM Throughput

LACP VSWITCH – CPU Usage
- System overhead correlates with VSWITCH bandwidth
  • This is different from the CP overhead charged to TSMSERV
  • A pretty linear relation - about 24% of a CPU per 100 MB/s
- Probably work that CP does to receive the data
  • Decoding the LACP packets
  • Copying data from real QDIO buffers to VNIC buffers

[Chart: System overhead (CPU%) vs. VSWITCH throughput (MB/s); linear fit y = 0.2372x + 1.4176]

Receiving 100 MB/s:
  65% Linux internal work
  22% CP overhead for Linux
  24% System overhead
 111% Total
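The breakdown above adds up to more than a full engine; a sketch reproducing the arithmetic, with a cross-check against the fitted lines from the two scatter charts:

```python
# CPU cost of receiving 100 MB/s via the LACP VSWITCH (percent of one CPU).
linux_work = 65     # emulation time: Linux internal work
cp_overhead = 22    # CP overhead charged to the TSMSERV user
sys_overhead = 24   # system overhead, not charged to any user

total = linux_work + cp_overhead + sys_overhead
print(f"Total: {total}% of a CPU")   # 111%: more than one engine

# Cross-check against the fitted lines from the scatter charts:
cp_fit = 0.2217 * 100 + 1.311        # ~23.5, close to the 22% figure
sys_fit = 0.2372 * 100 + 1.4176      # ~25.1, close to the 24% figure
print(f"Fits predict: CP {cp_fit:.1f}%, system {sys_fit:.1f}%")
```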
Improving TSM Throughput

Ethernet Bonding in Linux
- The Linux implementation of LACP
- Requires exclusive OSA ports, just like the VSWITCH
  • Other ports provide VSWITCH fail-over

[Diagram: TSMSERV using Linux bonding over LACP, alongside the VSWITCH]
Improving TSM Throughput

Linux Bonding – Performance measurements
- Maximum throughput is slightly higher, using all 4 paths
- System Overhead has disappeared
  • CP has no inbound traffic for the VSWITCH anymore
- CP Overhead for TSMSERV is gone
  • CP is not even aware of the traffic - QEBSM handles it
- Linux CPU usage per MB has increased
  • The qeth code paths differ for QEBSM vs. SIGA

[Chart: throughput with Linux bonding (MB/s) over time, per OSA device]
[Chart: TSMSERV CPU usage with Linux bonding - System, CP, Emul]
Improving TSM Throughput

VSWITCH LACP versus Linux Bonding
- The VSWITCH solution provides flexibility and ease of use
  • At very high bandwidth there is a significant CPU cost
- The Linux bonding solution does not share interfaces among servers
  • Additional OSA and router ports may be required
  • Network routing becomes more complicated
- The throughput improvement was less than expected
  • There are still latencies to be discovered
- Not every application uses 100 MB/s
  • With lower bandwidth the CPU cost is less
  • But LACP is meant for high bandwidth
- It is not obvious what the CPU is used for
  • There may be options for improvement

[Chart: CPU usage at 100 MB/s - VSWITCH LACP vs. Linux bonding, broken down into System Overhead, TSMSERV CP Overhead, and TSMSERV Emulation Time]
My Penguin can't sleep

Linux servers without work should be idle
- Virtual machines drop from the queue at transaction end
  • CP considers a transaction complete after 300 ms of idle time
    (the queue drop delay is a bit more complicated than this)
- Linux servers tend to have some background work
  • Frequent CPU usage causes the server to stay in queue
  • CP is reluctant to take pages from in-queue virtual machines
  • No queue drop = a non-interactive, batch-like virtual machine
- In-queue idle servers impact scalability
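Whether a guest ever drops from queue depends on whether its wake-ups leave gaps longer than the (simplified) 300 ms threshold. A sketch that scans a list of wake-up timestamps; the timestamps are invented:

```python
# Invented wake-up timestamps (seconds); real data comes from tracing.
wakeups = [0.00, 0.01, 0.02, 0.50, 0.51, 2.60, 2.61, 2.62, 5.00]

THRESHOLD = 0.300   # CP's (simplified) idle threshold for transaction end

# Find the idle gaps long enough for CP to see the machine as idle.
gaps = [b - a for a, b in zip(wakeups, wakeups[1:])]
droppable = [g for g in gaps if g > THRESHOLD]

print(f"{len(droppable)} gaps > 300 ms out of {len(gaps)}")
# A guest that ticks every 10 ms never produces such a gap and
# therefore never drops from queue.
```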
My Penguin can't sleep

Example of an idle Linux server
- Found waiting for CPU resources 5% of the time
- Never found actually running
- Waiting for queue drop 95% of the time
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:00:00 LROBV1 0 0 5 0 0 0 0 0 0 95 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:00:00 LROBV1 1 1 1 0 1.00 0
95% of the time in “test idle” (waiting for queue drop)
No transactions, permanently in queue
My Penguin can't sleep

Frequent timer ticks keep the virtual machine in queue
- The work done by Linux on each tick is minimal
  • Total CPU usage is still limited
- Can be verified with CP TRACE EXT 1004
  • 1004 is the Clock Comparator external interrupt

Reason #1 - the Linux on-demand timer is not active
- Traditionally Linux has a 10 ms timer interrupt to check for work
lrobv1:~ # vmcp trace ext 1004 printer run
lrobv1:~ # vmcp sp prt purge ; sleep 10 ; vmcp sp prt close
PRT FILE 0046 SENT **** PURGED **** AS 0056 RECS 1718 CPY
PRT FILE 0047 SENT TO ROBV RDR AS 0057 RECS 1002 CPY
lrobv1:~ # vmcp trace end
Trace ended
1002 ticks / 10 sec ≈ 100 per second
My Penguin can't sleep

A more detailed trace of the timer interrupt
- Add a display of the TOD clock at each interrupt
- CMS Pipelines can convert the TOD clocks
lrobv1:~ # vmcp sp con purge
CON FILE 0045 SENT **** PURGED **** AS 0058 RECS 0005 CPY
lrobv1:~ # vmcp trace i r 10cb4e term run cmd d rde8.8
lrobv1:~ # vmcp trace goto niets
lrobv1:~ # vmcp sp con close robv
CON FILE 0049 SENT TO ROBV RDR AS 0059 RECS 3758 CPY
-> 000000000010CB4E' STCK B2050DE8 >>
R00000DE8 C067D666 6D01818A
-> 000000000010CB4E' STCK B2050DE8 >>
R00000DE8 C067D666 6F80F54A
[Chart: Timer Interrupts (with 10 ms timer) - time difference (ms) between ticks over 1 s of elapsed time; mostly 10 ms between ticks]
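The conversion the trace relies on can be done by hand: in the System z TOD clock, bit 51 represents one microsecond, so dividing by 4096 (a 12-bit shift) yields microseconds. A sketch using the two STCK values from the trace above:

```python
# Low-order words of two consecutive STCK values from the trace above
# (the high-order word C067D666 is identical, so it cancels out).
t1 = 0x6D01818A
t2 = 0x6F80F54A

# TOD clock format: bit 51 = 1 microsecond, so value / 4096 is microseconds.
delta_us = (t2 - t1) / 4096
print(f"{delta_us / 1000:.2f} ms between ticks")   # ~10.23 ms: the 10 ms timer
```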
My Penguin can't sleep

Linux on-demand timer – System #1
- Avoids the 10 ms timer ticks when otherwise idle
- Configured with /proc/sys/kernel/hz_timer = 0
- The default setting changed with various releases
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:22:00 LROBV1 0 0 0 0 0 0 0 0 0 100 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:22:00 LROBV1 1 1 1 0 0.02 42
100% of time “test idle”
42 transactions/min ≈ an average of 1.5 seconds idle
My Penguin can't sleep

Tracing on-demand timer ticks – System #1
- Timer interrupts depend on process wake-up requests
  • This is my server - yours will be different
  • Counted 78 timer ticks per minute
- Frequently periods of 2 seconds of “silence”
  • During the test interval, counted 45 periods of > 300 ms
  • Good enough to show CP that the virtual machine is idle
- Fewer timer ticks would be even better
  • More chance that CP finds the server not in queue

[Chart: Timer Interrupts (with on-demand timer) - time difference (ms) between ticks over 60 s of elapsed time]
My Penguin can't sleep

Anatomy of the average transaction
- Periods of activity with short wait times between them
  • Each starts with a timer interrupt (a simplification: ignoring I/O interrupts)
- A longer idle period, followed by queue drop

[Diagram: a transaction timeline of about 1.4 s (60 sec / 44 transactions): run and wait periods punctuated by timer interrupts (T), then “test idle” until queue drop, then dormant wait]
Screen: ESARATE ESAMON 3.7.0
1 of 2 Transaction Rates And Response Times CLASS * USER
<Triv+NonTriv> <-UP Trivial-->
UserID <-Users-> Transactions Tran Resp(sec)
Time /Class Actv In Q /min Response /min Qtim Trans
-------- -------- ---- ---- ----- -------- ---- ---- -----
05:26:00 LROBV1 1 0.0 44.00 0.043 39.0 0.18 0.009
Queue time = 180 ms, transaction time = 9 ms (87%)
My Penguin can't sleep

Show the z/VM scheduler that the virtual machine is idle
- The next activity is then considered a new transaction
- Avoids virtual machines ending up in Q3 with batch priority
- The z/VM scheduler classifies a virtual machine at queue drop time

Allow z/VM storage management to take resources away
- Free list replenishment is done by sampling
  • The dormant percentage determines whether CP finds the server idle
  • It pays to try to reduce the number of timer interrupts

The percentage dormant is mostly determined by timer interrupts
- CPU usage is minimal
- What matters is the frequency and distribution of timer interrupts
My Penguin can't sleep

Detailed measurement of timer requests
- Transactions are reported by the z/VM Monitor
  • It does not provide information about the timer interrupts
  • It does not tell which process requested the timer interrupt
- Virtual machine trace at the right place* in the kernel shows:
  • Timeout value (in ticks)
  • Process ID
  • TOD clock
-> 000000000033BCCC' LGR B904003B
G03=00000000000005DC
-> 000000000033BCD4' LG E3100DD8
G01=000000001FE86D88
V1FE86E8C 0000044C 0000044B
V00000DE8 C066F5C8 C8E8E5C2
Timeout value
Process ID
TOD clock
* The most convenient place to trace differs with each kernel level. Eventually the timer_stats patch from Ingo Molnar (2.6.20) should provide an even easier way to get this information.
My Penguin can't sleep

Timer Requests – System #1
- PID 1: init
  • 5 sec check for dead orphans
- PID 1086 / 1087: nscd
  • 15 sec to expire any cached items
- There are also timer interrupts for kernel threads and drivers
  • Visible in TRACE EXT 1004
  • Not something you tune yourself
- Timer interrupts are different from wake-up calls
  • There are multiple places where the call is made
  • Timer requests get merged
Request Time Timeout PID
2007-04-10 10:29:25.387323 500 1
2007-04-10 10:29:30.386720 500 1
2007-04-10 10:29:33.227057 1500 1086
2007-04-10 10:29:33.227057 1500 1087
2007-04-10 10:29:36.226925 500 1
2007-04-10 10:29:41.227045 500 1
2007-04-10 10:29:46.227236 500 1
2007-04-10 10:29:48.227111 1500 1086
2007-04-10 10:29:48.227111 1500 1087
2007-04-10 10:29:52.226871 500 1
2007-04-10 10:29:57.227049 500 1
2007-04-10 10:30:02.227412 500 1
2007-04-10 10:30:04.227001 1500 1086
2007-04-10 10:30:04.227001 1500 1087
2007-04-10 10:30:07.226932 500 1
2007-04-10 10:30:12.226734 500 1
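The traced timeout values are in kernel ticks; with HZ=100 (10 ms per tick, as on these kernels) the 500- and 1500-tick requests map onto the 5-second and 15-second intervals listed above. A sketch:

```python
HZ = 100   # kernel tick rate on these s390 kernels (10 ms per tick)

def timeout_seconds(ticks: int) -> float:
    """Convert a traced timeout value from kernel ticks to seconds."""
    return ticks / HZ

for ticks, owner in [(500, "init (PID 1)"), (1500, "nscd (PID 1086/1087)")]:
    print(f"{ticks} ticks = {timeout_seconds(ticks):.0f} s  <- {owner}")
# Matches the 5 s and 15 s intervals seen in the request log above.
```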
My Penguin can't sleep

Timer Requests – System #1
- Stopped the nscd process
- What remains is init at a 5 second interval
- Kernel interrupts
  • 2 sec reap_cache
  • 30 sec do_cache_clean

[Chart: Timer Interrupt Analysis - System #1; time between interrupts (s), showing dormant vs. test idle periods]
My Penguin can't sleep

PowerTOP
- Frequent wake-ups for nothing bother others too!
  • They prevent lowering the CPU frequency - reduces laptop battery life
- PowerTOP reveals what causes the wake-ups
PowerTOP 1.8 (C) 2007 Intel Corporation
Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 122.5 interval: 15.0s
Top causes for wakeups:
98.4% (120.5) java : schedule_timeout (process_timeout)
0.4% ( 0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
0.2% ( 0.2) init : schedule_timeout (process_timeout)
0.2% ( 0.2) <kernel core> : page_writeback_init (wb_timer_fn)
0.2% ( 0.2) <kernel module> : neigh_table_init_no_netlink (neigh_periodic_timer)
0.2% ( 0.2) nscd : schedule_timeout (process_timeout)
0.1% ( 0.1) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer)
java processes cause 120 wake-up calls per second (worse than 100 Hz timer)
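PowerTOP's "Top causes" lines can be parsed to pull out per-cause rates. A hedged Python sketch against a few lines of the output above (the parsing is my own; PowerTOP does not provide this itself):

```python
import re

# A few "Top causes for wakeups" lines from the PowerTOP 1.8 output above.
TOP = """\
 98.4% (120.5)       java : schedule_timeout (process_timeout)
  0.4% (  0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
  0.2% (  0.2)       init : schedule_timeout (process_timeout)
"""

LINE = re.compile(r"^\s*([\d.]+)%\s+\(\s*([\d.]+)\)\s+(.+?)\s*:\s*(\S+)")

def wakeup_rates(text):
    """Map each cause to its wakeups-from-idle per second."""
    return {m.group(3): float(m.group(2))
            for m in map(LINE.match, text.splitlines()) if m}

rates = wakeup_rates(TOP)
print(rates["java"])  # 120.5 wake-ups/s -- worse than the old 100 Hz tick
```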
My Penguin can't sleep
PowerTOP
▪ Wake-up calls disappear when the JVM is stopped
• This may not be a useful option in real life
PowerTOP 1.8 (C) 2007 Intel Corporation
Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 1.9 interval: 15.0s
Top causes for wakeups:
29.6% ( 0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
14.8% ( 0.3) <kernel module> : neigh_table_init_no_netlink (neigh_periodic_timer)
11.1% ( 0.2) init : schedule_timeout (process_timeout)
11.1% ( 0.2) <kernel core> : page_writeback_init (wb_timer_fn)
11.1% ( 0.2) nscd : schedule_timeout (process_timeout)
7.4% ( 0.1) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer)
3.7% ( 0.1) sshd : schedule_timeout (process_timeout)
3.7% ( 0.1) <kernel core> : sk_reset_timer (tcp_delack_timer)
3.7% ( 0.1) sshd : sk_reset_timer (tcp_write_timer)
3.7% ( 0.1) ip : __netdev_watchdog_up (dev_watchdog)
Requires 2.6.21 kernel; should work on SLES11
My Penguin can't sleep
Linux on-demand timer – System #2
▪ Virtual machine reported as 135% in-queue: virtual 2-way
• To be really idle, both virtual CPUs must be idle at the same time
• Makes it very hard for CP to find the virtual machine idle
• Not an easy candidate to take pages away
Screen: ESAXACT Marist OSDL ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
09:35:00 LNEALE1 0 0 0 0 0 0 0 0 0 100 0
Screen: ESAUSRQ Marist OSDL ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
09:35:00 LNEALE1 1 1 1 0 1.35 4
Counting Ext 1004: CPU 00: 97, CPU 01: 314
My Penguin can't sleep
Transactions in MP virtual machine
▪ Transaction time is very small compared to queue time
▪ Different timer events are not synchronized
▪ Using more CPUs spreads the timer events over the CPUs
• Using more CPUs means more transactions = more queue time
• Different in-queue periods are not synchronized
• Percentage truly dormant is low
[Diagram: "test idle" transactions for a 1-CPU versus a 3-CPU virtual machine; with 3 CPUs the unsynchronized in-queue periods leave almost no time when the virtual machine is truly dormant]
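The effect can be quantified with a toy model. This is an illustration only: it assumes the per-CPU idle periods are independent, which real (periodic) timer events are not, but it shows why the truly dormant fraction collapses as virtual CPUs are added:

```python
# Toy model of the MP-idle argument: if each virtual CPU is in-queue
# a fraction f of the time, and those periods are not synchronized,
# the virtual machine is dormant only when ALL virtual CPUs are out
# of queue at once. Independence is an assumption for illustration.

def dormant_fraction(in_queue_per_cpu, ncpus):
    """Fraction of time all ncpus virtual CPUs are idle simultaneously."""
    return (1.0 - in_queue_per_cpu) ** ncpus

for n in (1, 2, 3):
    print(f"{n} CPU(s): dormant {dormant_fraction(0.5, n):.1%} of the time")
# With 50% per-CPU in-queue time, a 3-way guest is dormant only
# ~12.5% of the time, versus 50% for a 1-way guest.
```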
My Penguin can't sleep
Linux on-demand timer – System #2
▪ Varied CPU 1 offline (one CPU is enough for an idle server)
▪ Reduces in-queue time to ~50% – much better
▪ Never more than 500 ms of silence
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:21:00 LNEALE1 3 0 0 0 0 0 0 0 0 97 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:21:00 LNEALE1 1 1 1 0 0.54 40
Ext 1004: 230/min
[Chart: Timer Interrupts (with on-demand timer) – System #2; x-axis: elapsed time (s), 0 to 60; y-axis: time difference (ms), 0 to 600]
My Penguin can't sleep
Timer requests – System #2
▪ Analyzed 400 timer requests (5 minutes)
▪ One task has frequent requests for a 510 ms wake-up
• This explains why we never see silence for more than 500 ms
• Linux "/proc" helps identify the processes requesting the wake-up:
afs_rxevent polling with 510 ms
httpd2-prefork polling with 1000 ms
• Checking twice per second may not be polling, but it is frowned upon
• Later AFS versions are supposed to fix this
Count Timeout Task Address
128 51 00000000fd465830
65 100 00000000fdff9090
...
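The "never more than 500 ms silence" observation follows directly from the fastest poller. A Python sketch that merges the wake-up instants of the two pollers found above and measures the longest silent gap (the common start at t=0 is a simplifying assumption):

```python
# Merge the wake-up times of the periodic pollers from the System #2
# analysis (510 ms afs_rxevent, 1000 ms httpd2-prefork) and find the
# longest interval with no wake-up at all.

def longest_gap(periods_ms, horizon_ms=10_000):
    """Longest silent interval (ms) between any two merged wake-ups."""
    events = sorted({t for p in periods_ms
                       for t in range(0, horizon_ms + 1, p)})
    return max(b - a for a, b in zip(events, events[1:]))

print(longest_gap([510, 1000]))  # the 510 ms poller caps every gap at 510 ms
```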
Using cpuplugd
Approach to get the best of both worlds – cpuplugd
▪ Maximum number of virtual CPUs during peaks
▪ Fewer virtual CPUs when the workload does not need more
• Reduces Linux internal overhead from MP effects
• Addresses z/VM issues with an idle virtual MP
▪ Daemon to set virtual CPUs online and offline
• Thresholds and rules define when to switch them on and off
No exact science – defaults mostly determined by trial & error
The idea is interesting and the approach is intuitive
It might make sense to look at steal time and idle time
▪ It does work, and varies excessive virtual CPUs offline
• CP emulates the SIGP to stop the CPU
▪ The daemon wakes up every 10 seconds to check
• Uses the "load average" – takes time to get CPUs online
• The cpuplugd daemon uses some resources itself
HOTPLUG="(loadavg > onumcpus + 0.75) & (idle < 10.0)"
HOTUNPLUG="(loadavg < onumcpus - 0.25) | (idle > 50)"
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01.
Linux speak:
"plug" – enable
"unplug" – disable
Virtual CPU SHARE distribution
Assigned SHARE is distributed over the virtual CPUs
▪ When business justifies multiple CPUs, also review the SHARE setting
▪ Surprises customers – changes one CPU into two half CPUs
▪ Probably poor design – but hard to change without breaking things
Detaching a virtual CPU speeds up the remaining CPUs
▪ No business justification – little work, so why must it be faster?
▪ Sharing resources requires "social behavior" rather than "grab what you can"
▪ Fortunately, detaching a CPU is a disruptive process
z/VM 5.4 enhancement: distribute SHARE also with stopped CPUs
▪ Makes it easier for Linux to "detach" CPUs and distribute the SHARE
▪ Motivated by different objectives
• Workload-based capacity planning
• Maximum single-server throughput
▪ Need an option in CP to avoid SHARE distribution
• Maybe with a more intuitive per-CPU setting: SHARE REL 100 DWIM* DNMWM*
* DWIM = Do What I Mean; DNMWM = Do Not Mess With Me
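The arithmetic behind "one CPU becomes two half CPUs" is simple division. A simplified sketch of the scheduler behavior described above (the stopped-CPU handling models the z/VM 5.4 enhancement, not the exact CP algorithm):

```python
# A relative SHARE is divided evenly over the virtual CPUs. With the
# z/VM 5.4 enhancement, the slice of a stopped virtual CPU is
# redistributed over the running ones (modeled here by excluding
# stopped CPUs); this is an illustration, not CP's actual code.

def per_cpu_share(rel_share, virtual_cpus, stopped_cpus=0):
    """Effective relative share per running virtual CPU."""
    running = virtual_cpus - stopped_cpus
    return rel_share / running

print(per_cpu_share(100, 2))                  # 50.0 per virtual CPU
print(per_cpu_share(100, 2, stopped_cpus=1))  # 100.0 for the remaining CPU
```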
My Penguin can't sleep
Conclusion
▪ z/VM memory management needs to see servers idle
• Queue drop allows the scheduler to see the transaction end
• A dormant period allows memory management to take pages away
▪ Adding extra virtual CPUs impacts scalability
• All virtual CPUs must drop from queue for the machine to be dormant
• The virtual machine is in Q3 even when idle – sluggish
• Linux CPU affinity is not the best choice for Linux on z/VM
• See where cpuplugd fits in your capacity plan – be aware
▪ Do not define more virtual CPUs than you need
• Not even more than the resources you expect to get
• The "equal to the number of real CPUs" rule is nonsense
When in doubt, one will do.
When you have measured, probably too.
Summary
Shared environment requires CPU resource management
▪ Don't waste resources that other servers can use better
▪ Charge-back motivates users to save resources
Many layers of CPU resource management
▪ Instrumentation required in all layers
▪ Linux CPU data must be combined with z/VM data
Performance monitor helps explain exceptional usage
▪ Data must be correct and complete to be useful
Linux on z/VM virtual CPU configuration
▪ When in doubt, one will do. When you have measured, probably too.
Idle Linux servers must drop from queue
▪ Use "powertop" to identify the cause of polling
Linux on z/VM – Understanding CPU Usage
Rob van der Heij
rob @ velocitysoftware.de
Velocity Software GmbH – http://www.velocitysoftware.com/
Big “Thank You” to our customers who let me work on their performance problems
If you have performance problems, just drop me a note
or catch me somewhere
IBM System z
Technical Conference
Brussels, 2009
Session LX44