Copyright 2009 Velocity Software, Inc. All Rights Reserved.
Other products and company names mentioned herein may be trademarks of their respective owners.
Linux on z/VM: Understanding CPU Usage
Rob van der Heij
rob @ velocitysoftware.de
Velocity Software GmbH - http://www.velocitysoftware.com/
IBM System z
Technical Conference
Brussels, 2009
Session LX44
Introduction

Why would you care about CPU usage?
- CPU virtualization is the easiest part
  • System z CPU is designed for virtualization and sharing
- Sharing CPU resources raises questions as well
  • Is my workload held back because of CPU constraints?
  • Why did other workload get the resources that I want?
  • If more resources are available, why didn't I get them?
- Sharing the resources requires “social behavior”
  • Cycles wasted on useless work can't be used for real work
Introduction

What could you do about CPU usage?
- Understand where your CPU cycles are spent
- Measure and identify CPU usage
- Reduce peak CPU requirements
  • Find more economical ways to do the work
  • Avoid work that does not need to be done
  • Move some work to quiet “night shift” hours
- Pay attention to “idle load”

Not the same as benchmarks
- Top speed versus mileage
- Most economical way to do the work within the SLA
- Scalability of applications
Agenda

CPU Usage Breakdown
- LPAR
- z/VM
- Virtual Machine

CPU Accounting

Linux CPU Usage
  • What is that Penguin Doing?
  • Linux Server with High Overhead
  • Improving TSM Throughput
  • My Penguin can't Sleep

Performance data shown in the presentation was collected and processed with ESALPS.
CPU Usage Breakdown

Logical Partition Level
- LPARs more or less share the CPU resources
- Logical CPUs are defined as shared or dedicated
- For shared CPUs the LPAR weight may be important
- CPs and IFLs are not mixed
  • Exception: “VM Mode LPAR” - z/VM 5.4 on selected hardware

<---Partition---> VCPU <%Assigned>
Name     No. Type Addr Total Ovhd
-------- --- ---- ---- ----- ----
LP1        0 CP      0  49.2  0.4
                     1  48.6  0.3
                     2  49.2  0.4
LP2        1 IFL     0 100.1  0.0
                     1 100.1  0.0
                     2  99.9  0.0
LP3        2 CP      0   0.1  0.0

Notes: PR/SM overhead is normally pretty small. LP2 is the only IFL LPAR, with 3 dedicated CPUs - nobody to share with.
CPU Usage Breakdown

PR/SM Management Time
- Not attributed to one specific LPAR
  • Scheduling etc. - spread over all physical CPUs
  • Typically < 1% per CPU
  • Depends on the number of LPARs and logical CPUs

LPAR Overhead
- PR/SM work on behalf of a specific LPAR
  • Dispatching, QDIO, etc. - spread over the logical CPUs of the LPAR
  • Typically < 1% per logical CPU
  • Depends on workload

[Chart: physical CPU time divided among LPAR1, LPAR2, LPAR3, unused capacity, LPAR overhead, and LPAR management time; the productive work is done by the guest OS]

<-------Logical Partition--->
              Virt <%Assigned>
Name     Nbr CPUs  Total Ovhd
-------- --- ----  ----- ----
LP1        0    3   83.6  1.2
LP3        2    1    0.1  0.0

Physical CPU Management time:
CPU Percent
--- -------
  0   0.003
  4   0.414
  6   0.414
  9   0.416
 11   0.003
 13   0.003
    _______
Total:  1.254
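The per-CPU figures above should add up to the reported total. A quick sketch of the arithmetic (values transcribed from the report, which rounds to three decimals, so the sum drifts slightly):

```python
# Per-physical-CPU PR/SM management time (percent), from the report above.
mgmt = {0: 0.003, 4: 0.414, 6: 0.414, 9: 0.416, 11: 0.003, 13: 0.003}

total = sum(mgmt.values())
print(f"Total: {total:.3f}")   # reported total is 1.254; rounding explains the drift

# Management time is spread over all physical CPUs,
# typically well under 1% per CPU.
assert all(pct < 1.0 for pct in mgmt.values())
```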
CPU Usage Breakdown

Logical CPU View
- The guest OS dispatches workload over the logical CPUs
  • When there is no work to do, the CPU goes idle (wait state)
  • The LPAR recognizes the idle CPU - the “white space” is shared
    Exception: when “wait complete” is set for the LPAR

[Chart: utilization of logical CPUs 0-2 in LPAR1, including LPAR overhead]

              <-----CPU (percentages)----->
    Total  Emul  User   Sys  Idle
CPU  util  time ovrhd ovrhd  time
--  *----  ----- ----- ----- -----
00   40.9  30.8   6.9   3.3  58.6
01   40.4  32.4   5.9   2.1  59.1
02   40.2  31.9   6.0   2.3  59.3
CPU Usage Breakdown

z/VM View – Totals
- System Overhead - general CP work
  • Scheduling, Monitor, Accounting
- User Overhead - CP work on behalf of a specific user
  • I/O translation, instruction simulation, CP functions
- Emulation Time - productive work for the user
- z/VM metrics: True CPU%

              <-----CPU (percentages)----->
    Total  Emul  User   Sys  Idle
CPU  util  time ovrhd ovrhd  time
--  *----  ----- ----- ----- -----
00   40.9  30.8   6.9   3.3  58.6
01   40.4  32.4   5.9   2.1  59.1
02   40.2  31.9   6.0   2.3  59.3

[Diagram: Emulation Time is vtime; Emulation Time plus User Overhead is ttime; ttime divided by vtime is the T/V ratio; System Overhead and idle time complete the picture]
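The T/V ratio can be derived directly from such a table. A sketch using the CPU 00 row above (a simplification: the user-level T/V ratio relates emulation time plus user overhead to emulation time alone; system overhead is not charged to any one user):

```python
# Figures for CPU 00 from the table above (percent of one CPU).
emul, user_ovhd, sys_ovhd = 30.8, 6.9, 3.3

vtime = emul                 # productive work inside SIE
ttime = emul + user_ovhd     # plus CP work done on behalf of the user

tv_ratio = ttime / vtime
print(f"T/V ratio: {tv_ratio:.2f}")   # ~1.22: for each second of guest work,
                                      # CP spends ~0.22 s on its behalf

# System overhead is general CP work, not attributed to a user.
total_util = emul + user_ovhd + sys_ovhd
print(f"Total utilization: {total_util:.1f}%")
```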
CPU Usage Breakdown

z/VM View – Virtual Machines
- Virtual Time - Virtual Machine work (SIE)
- Total Time - Virtual Time plus User Overhead
- Some virtual machines are not real “users” but system functions
  • RACFVM, TCPIP, DIRMAINT

Screen: ESAUSP2            Velocity Software
1 of 3                     User Percent Utilization

         UserID   <Processor>
Time     /Class   Total Virt
-------- -------- ----- -----
16:13:00 SUSELNX2  3.64  3.59
         REDHAT04  2.89  2.80
         ORACLE    2.12  2.08
         VMRLNX    1.89  1.88
         DXT2LV    0.61  0.31
         ROBLX1    0.35  0.35
         TCPIP     0.28  0.13
         SUSELNX1  0.24  0.21
         REDHAT3   0.21  0.18
         SLES8     0.19  0.18
         ROBLX2    0.12  0.11

The sum of the per-user usage divided by the total gives the capture ratio.
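The capture ratio is the fraction of system-wide CPU usage that can be attributed to specific users. A sketch with the per-user figures from the screen above; the system total of 12.8% is an assumed value for illustration only:

```python
# Per-user CPU percentages from the ESAUSP2 screen above.
per_user = [3.64, 2.89, 2.12, 1.89, 0.61, 0.35, 0.28, 0.24, 0.21, 0.19, 0.12]

captured = sum(per_user)     # usage attributed to specific users
system_total = 12.8          # assumed system-wide total, for illustration

capture_ratio = captured / system_total
print(f"Captured: {captured:.2f}%  Capture ratio: {capture_ratio:.0%}")
# A capture ratio near 100% means almost all CPU time is accounted for.
```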
CPU Accounting

Mainframe operating systems do CPU accounting
- Required for charge-back of shared resources

z/VM account records
- Written at logoff or through a CP command
- Report resource usage per virtual machine
  • CPU usage
  • I/O operations
- Very simple to process
  • Easy to audit
- Do not tell you why
- Lack detail for Linux
CPU Accounting

Charge-back is meant to recover the total data center cost
- CPU is no longer the major cost factor
- CPU usage is traditionally still used for charge-back
  • CPU usage is considered representative of the amount of use
  • Total data center cost is divided by consumed CPU hours
  • The CPU tariff is based on estimated usage and the capacity plan

IFLs with Linux and z/VM break the model
- Installations add a substantial amount of MIPS
- Linux applications also consume a lot of CPU hours
  • A Linux Proof of Concept was charged much of the z/OS license cost

Charge-back motivates users to save resources
- Make sure to arrange a correct cost model for Linux
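The traditional tariff arithmetic is simple division. A minimal sketch with entirely made-up cost figures:

```python
# Hypothetical figures, for illustration only.
datacenter_cost = 1_200_000.0   # total cost to recover per year
consumed_cpu_hours = 40_000.0   # CPU hours consumed by all users per year

tariff = datacenter_cost / consumed_cpu_hours   # cost per CPU hour
print(f"Tariff: {tariff:.2f} per CPU hour")

# A Linux guest that burns many cheap IFL cycles is overcharged under
# this model, which is why IFL workloads need their own tariff.
linux_hours = 5_000.0
print(f"Linux charge: {linux_hours * tariff:,.0f}")
```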
CPU Accounting

CPU accounting for Linux on z/VM needs detail
- Just listing totals is not enough to convince customers
- Exceptional usage must be explained very clearly

A performance monitor can reveal the detail
- Collects CPU data along with many other metrics
  • ESALPS collects ~3500 unique metrics every minute; hundreds of them are repeated per device, per user, or per Linux process
  • Helps to understand the sequence of events causing a problem
  • Explains any excessive usage to the application owner
- Requires a detailed “performance history”
- Requires complete data - a capture ratio of 100%

A performance monitor also helps to validate the cost model
Visualization Techniques
Comparing Memory and CPU Usage
CPU Usage Breakdown

Linux System View
- Virtual Machine Emulation Time is what is available for Linux usage
- Steal Time: time when Linux does not know what the CPU was used for
- Linux administrators beware: “idle” ≠ “available for use”

[Diagram: CPU state categories reported by Linux]
- Linux 2.4: User (process usage), Nice (background process usage), System (kernel-related usage), I/O wait (CPU waiting for I/O), Idle (no CPU usage)
- Linux 2.6 adds: Interrupt and Soft-IRQ (first-level interrupt handlers) and Steal (CPU cycles “stolen” by the hypervisor)
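On Linux 2.6 these categories, including steal, appear as counters on the first line of /proc/stat (in USER_HZ ticks). A sketch that parses a sample line; the sample values are invented:

```python
# Sample first line of /proc/stat (invented values, units of USER_HZ ticks):
#            user nice system idle    iowait irq softirq steal
sample = "cpu  4705 356  584  3699176 23926  0   461     13946"

fields = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
ticks = dict(zip(fields, map(int, sample.split()[1:])))

total = sum(ticks.values())
for name in ("idle", "steal"):
    print(f"{name}: {100 * ticks[name] / total:.2f}%")

# "idle" only counts time the guest knows was idle; "steal" is time the
# hypervisor gave the real CPU to someone else - so idle != available.
```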
CPU Usage Breakdown

Linux Process View
- CPU resources are allocated to processes
- Each process accumulates some system time plus some user (or nice) time
- The processes should add up to the total system and user time
  • Capture ratio!

Linux CPU accounting
- Traditionally wrong due to virtualization
  • Linux tools would show numbers that are too high
- Modern kernels use virtual CPU accounting
  • Linux tools sometimes show wrong data

[Diagram: Linux processes mapped onto User, Nice, and System time]
Why can’t I use my Linux Tools?

Linux data is incomplete and sometimes incorrect
- Virtualization changes the rules of the game
  • CPU usage as perceived by Linux can be very wrong
  • Assumptions about used and available capacity no longer hold
- z/VM performance impacts Linux behavior
- You need to combine Linux and z/VM performance data

z/VM does not clone system administrators
- You may not have time to look when it happens
- Complex interactions make it hard to reproduce scenarios
- A multi-tier application involves multiple virtual servers
- Centralized data collection is easier to manage
- You may need to share data with others to understand it
What is that Penguin doing

High Level Overview
- Shows no real detail
- Sometimes enough for a quick check
Screen: ESAMAIN
1 of 3 System Overview
<---Users----> Transact. <Processor>
<-avg number-> per Avg. Utilization
Time On Actv In Q Sec. Time CPUs Total Virt.
*------- ---- ---- ---- ---- ---- ---- *---- -----
02:49:00 87 72 67.0 19.3 0.35 3 67.1 63.7
02:48:00 87 71 69.0 19.3 0.35 3 65.9 62.6
02:47:00 87 72 64.0 19.1 0.35 3 54.3 50.7
02:46:00 87 68 70.0 18.3 0.40 3 48.6 44.9
02:45:00 87 69 66.0 19.1 0.32 3 42.8 39.3
02:44:00 87 69 68.0 20.1 0.34 3 43.2 39.7
02:43:00 87 69 67.0 18.3 0.35 3 42.8 39.1
What is that Penguin doing

High Level Overview
- Shows a breakdown per class
- Zoom in on one class
Screen: ESAUSP2
1 of 3 User Percent Utilization
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
02:56:00 System: 118 112 19M 19M 142 19M 19M
*TheUsrs 116 110 18M 18M 92.00 18M 18M
LINUX 0.93 0.85 594K 594K 0.00 594K 594K
*Servers 0.68 0.61 11564 4096 0.00 11505 4050
TCPCTL 0.00 0.00 20814 20814 34.00 20672 20672
02:55:00 System: 151 144 19M 19M 140 19M 19M
*TheUsrs 148 142 18M 18M 90.00 18M 18M
LINUX 1.08 0.99 594K 594K 0.00 594K 594K
*Servers 0.66 0.59 11565 4390 0.00 11486 4315
TCPCTL 0.00 0.00 20814 20814 34.00 20672 20672
What is that Penguin doing

Usage Breakdown per user
- So one server used 25% of a CPU during the last minute
  • Is that good or bad?
  • Often you can’t really tell without knowing the behavior over time
Screen: ESAUSP2 ESAMON 3.7.0
1 of 3 User Percent Utilization CLASS *THEUS
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
02:57:00 DOMINOZ1 25.62 23.10 522K 522K 0.00 522K 522K
IBMTSM 21.51 19.35 522K 522K 29.00 522K 522K
TDIRTIM 12.30 12.24 1044K 1044K 0.00 1044K 1044K
EBIZ2 9.72 9.51 260K 260K 0.00 260K 260K
DB2-A1 6.83 6.79 522K 522K 0.00 522K 522K
ACME 4.73 4.53 190K 190K 0.00 190K 190K
EBIZDEV1 4.53 4.42 260K 260K 0.00 260K 260K
EBIZDEV2 4.27 4.18 260K 260K 0.00 260K 260K
EBIZ1 4.23 4.14 260K 260K 0.00 260K 260K
TDIRDB2 3.35 3.32 852K 852K 0.00 852K 852K
IBMRED2 2.70 2.64 260K 260K 0.00 260K 260K
What is that Penguin doing

Single User over Time
- Looking at usage in the recent past shows “when it started”
  • Frequently more productive than waiting until it stops
- For multi-tier applications you need to look at multiple servers
  • Arrange servers in classes for an “application view”
Screen: ESAUSP2 ESAMON 3.7.0
1 of 3 User Percent Utilization CLASS * USER
<-------Main Storage-------->
UserID <Processor> <Resident-> Lock <-WSSize-->
Time /Class Total Virt Total Actv -ed Total Actv
-------- -------- ----- ----- ----- ----- ----- ----- -----
03:07:00 DOMINOZ1 34.75 31.13 522K 522K 0.00 522K 522K
03:06:00 DOMINOZ1 28.27 24.91 522K 522K 0.00 522K 522K
03:05:00 DOMINOZ1 26.57 23.97 522K 522K 0.00 522K 522K
03:04:00 DOMINOZ1 24.01 21.65 522K 522K 0.00 522K 522K
03:03:00 DOMINOZ1 24.07 21.50 522K 522K 0.00 522K 522K
03:02:00 DOMINOZ1 75.92 72.51 522K 522K 0.00 522K 522K
03:01:00 DOMINOZ1 33.35 30.01 522K 522K 0.00 522K 522K
03:00:00 DOMINOZ1 26.31 23.74 522K 522K 0.00 522K 522K
02:59:00 DOMINOZ1 22.17 19.47 522K 522K 0.00 522K 522K
What is that Penguin doing

Looking inside the Linux server
- Identify the Linux processes that consume the resources
Screen: ESALNXP ESAMON 3.7.0 03/27 03:
1 of 3 LINUX VSI Process Statistics Report NODE DOMINOZ1 LIMIT 2
<-Process Ident-> <-----CPU Percents----->
Time Node Name ID PPID GRP Tot sys user syst usrt
-------- -------- --------- ----- ----- ----- ---- ---- ---- ---- ----
03:02:00 dominoz1 clrepl 12194 2536 2483 0.1 0.0 0.1 0.0 0.0
updall 11500 2536 2483 7.9 3.7 4.2 0.0 0.0
smdemf 5209 2536 2483 0.2 0.0 0.1 0.0 0.0
sched 5181 2536 2483 4.7 2.9 1.8 0.0 0.0
update 5174 2536 2483 1.1 0.5 0.5 0.0 0.0
replica 5168 2536 2483 32.8 3.4 29.4 0.0 0.0
server 2536 2483 2483 24.5 4.4 20.1 0.0 0.0
snmpd 1768 1 1767 0.4 0.3 0.1 0.0 0.0
kjournal 1140 1 1 0.1 0.1 0.0 0.0 0.0
kswapd0 134 1 1 0.2 0.2 0.0 0.0 0.0
pdflush 133 8 0 0.1 0.1 0.0 0.0 0.0
*Totals* 0 0 0 72.5 15.8 56.6 0.0 0.0
CPU Overhead

CPU overhead can mean many different things
- Productive work for one is overhead for another
- Make sure your peer means the same thing
- You are only aware of it when you can measure it
- With System z and z/VM we can measure it
- Hardware support keeps overhead mostly low

Sometimes abnormal behavior increases overhead
- Spending resources on things other than the workload
- A performance monitor often helps to clarify things
Linux Server with High Overhead

Customer reports a Linux server with high CP cost
- The Linux server is using 25-30% of a CPU
- Almost half of that is “CP overhead”
  • T/V ratio of 1.8
  • Work that CP does on behalf of the virtual machine
- z/VM has plenty of CPU resources
  • The Linux guest does not appear to be held back

Question
- What is Linux doing?
- Why the high overhead?

                <---CPU time-->
       UserID   <(Percent)> T:V
       /Class   Total Virt  Rat
       -------- ----- ----- ---
12:50  LINUX806 25.84 14.30 1.8
12:51  LINUX806 26.44 14.73 1.8
12:52  LINUX806 28.25 15.37 1.8
12:53  LINUX806 27.78 15.26 1.8
12:54  LINUX806 28.20 15.51 1.8
12:55  LINUX806 29.95 16.52 1.8
12:55  LINUX806 27.01 14.94 1.8

Answer: doing nothing!
Linux Server with High Overhead

Review the Linux internal CPU statistics
- Linux reports a total usage of ~5-6%
- z/VM reports a total usage of ~25-30%
- Someone is off by a factor of 5

The server runs SLES 10
- Uses “virtual time accounting” to get “correct” numbers

Date/    Node     <Processor Pct Util>
Time              Total Syst User Idle
-------- -------- ----- ---- ---- ----
12:50:00 LINUX806   5.9  4.1  1.8  194
12:51:00 LINUX806   6.1  4.4  1.8  194
12:52:00 LINUX806   6.3  4.5  1.8  193
12:53:00 LINUX806   5.9  4.3  1.6  194
12:54:00 LINUX806   6.1  4.4  1.7  189
12:55:00 LINUX806   6.5  4.8  1.8  198
Linux Server with High Overhead

Per-process breakdown
- Many db2sysc processes
  • DB2 worker threads
- One db2fmcd with 1%

Only 2.7% is accounted for
- The remainder has disappeared
- Linux claims “idle”

node/     <-Process Ident-> Nice <------CPU Percents---->
Name         ID  PPID   GRP Valu  Tot  sys user syst usrt
--------- ----- ----- ----- ---- ---- ---- ---- ---- ----
12:51:00 LINUX806
              0     0     0    0 2.79 0.72 1.01 0.30 0.77
events/0      6     1     0   -5 0.02 0.02    0    0    0
kjournal    607     1     1    0 0.02 0.02    0    0    0
kjournal   1713     1     1    0 0.02 0.02    0    0    0
multipat   2607     1  2606    0 0.02 0.02    0    0    0
snmpd      2660     1  2659  -10 0.13 0.08 0.05    0    0
ha_logd    2664  2662  2588    0 0.02    0 0.02    0    0
heartbea   2775     1  2775    0 0.05 0.02 0.03    0    0
heartbea   2778  2775  2775    0 0.02 0.02    0    0    0
ntpd       2805     1  2805    0 0.02 0.02    0    0    0
nscd       2815     1  2815    0 0.02 0.02    0    0    0
cron       2839     1  2839    0 0.08    0    0 0.02 0.07
db2fmcd    3060     1  3060    0 1.01 0.03    0 0.28 0.70
db2fmd     4704     1  4703    0 0.18 0.02 0.17    0    0
db2fmd     5199     1  5198    0 0.18 0.02 0.17    0    0
db2fmp    10758 10736 10736    0 0.03 0.02 0.02    0    0
db2sysc   11154 10741 10736    0 0.02 0.02    0    0    0
db2sysc   11155 10741 10736    0 0.05 0.03 0.02    0    0
db2sysc   11156 10741 10736    0 0.02 0.02    0    0    0
db2sysc   13140 10741 10736    0 0.05 0.03 0.02    0    0
db2sysc   13141 10741 10736    0 0.02    0 0.02    0    0
db2sysc   13148 10741 10736    0 0.07 0.03 0.03    0    0
db2sysc   13152 10741 10736    0 0.02 0.02    0    0    0
db2sysc   13153 10741 10736    0 0.05 0.03 0.02    0    0
db2fmp    15485 15473 15473    0 0.02    0 0.02    0    0
db2sysc   15558 15478 15473    0 0.03 0.02 0.02    0    0
. . .

z/VM monitor: 30% - Linux statistics: 6% - Explained usage: 3%
Linux Server with High Overhead

The DB2 process ‘db2fmcd’ is suspicious
- It has no function with Linux on System z
  • Provided for compatibility with some other configurations
- Largest single source of CPU usage in the sample
  • Likely triggers the work done by the db2sysc processes
- Probably does something that creates high overhead

Reviewed CP trace data to understand the overhead
- Determine the cause of the SIE intercepts
- Normal behavior: Linux goes idle and wakes up again
- But it does that very often…
  • 100,000 SIE intercepts per second

      <CPU percents>  <-----Internal (per second)----->
      Totl Ovrhead Diag Inst    SIE Fast  Page
Time  Util Usr Sys nose  Sim intrcp path fault
----- ---- --- --- ---- ---- ------ ---- -----
12:50 1078  48  77  512  17K  95073  25K  23.9
12:51 1010  49  78 1042  17K 100235  43K  36.4
12:52 1018  50  84  503  16K 103837  21K  21.4
12:53  896  46  69  479  15K 103922  19K  12.2
12:54  909  46  71  506  15K 104306  18K  11.2
12:55  817  46  64  520  14K 111731  23K  15.3
Linux Server with High Overhead

The application requests frequent wake-ups
- Wake-up requests with a delay of less than 10 ms
  • This is polling - frowned upon in a shared environment
- Unclear whether this is a bug or a design failure

A kernel bug rounds the small delay to 0
- Introduced with “high resolution timer” support
- Rounded to 0 ms, i.e. an immediate wake-up

The timer interrupt is presented when enabled
- CP dispatches the virtual machine immediately
- Eventually the minor time slice is consumed
  • The scheduler reviews the queue and dispatches the machine later
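The arithmetic behind these rates: a sleep/wake polling loop generates one dispatch per delay interval, and rounding the delay to zero removes any bound on the rate. A sketch (the delays are illustrative):

```python
def wakeups_per_second(delay_ms: float) -> float:
    """Steady-state wake-up rate of a sleep/wake polling loop."""
    if delay_ms <= 0:
        return float("inf")   # immediate wake-up: CP redispatches at once
    return 1000.0 / delay_ms

print(wakeups_per_second(10.0))   # 100/s  - the classic 10 ms poll
print(wakeups_per_second(1.0))    # 1000/s - sub-10 ms polling
print(wakeups_per_second(0.0))    # inf    - the kernel bug rounds to 0 ms,
                                  # so only dispatch latency limits the rate
```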
Linux Server with High Overhead

Conclusion
- Something in the application is polling
  • Customer traces point at the process db2redom
  • The db2fmcd process was the biggest single consumer
  • Probably DB2 was confused in the recovery process
  • Most likely not productive processing
- High CP overhead due to a Linux kernel bug
  • It turns a short sleep into an immediate wake-up
  • The fix is upstream and will eventually reach the distributions
- Wrong Linux CPU accounting due to another bug
  • The fix is supposed to be in the pipeline
- Latency in z/VM prevented Linux from taking even more CPU
- You can’t always tell from CPU usage alone that a server is looping
Improving TSM Throughput

Customer Scenario
- Nightly backup of discrete servers to TSM on System z
- A dedicated OSA for the Linux server running TSM
- The bottleneck appears to be the physical GbE connection
- Limited CPU usage thanks to QEBSM

[Diagram: TSMSERV with a dedicated OSA; other guests on the VSWITCH]
Improving TSM Throughput

LACP: Link Aggregation Control Protocol
- Bundles multiple physical links into one logical path (IEEE 802.3ad)
- Connects the external switches to the VSWITCH
- Also provides the fail-over function
- Using 4 GbE links should give 4-fold throughput

[Diagram: TSMSERV on the VSWITCH, connected through LACP link aggregation]
Improving TSM Throughput

LACP VSWITCH – Real World Experience
- The potential 4-fold throughput is just theoretical
  • The discrete servers connect with a single GbE link
  • Sufficient servers are needed to provide the data
- Distribution over the physical paths is not balanced
  • Connections are spread over the paths by some hash function
  • In this scenario only 3-4 communication pairs are active
- Still achieved almost 50% improvement over a single fiber
  • Increased the QDIO buffers from 16 to 128

[Chart: Network Throughput - 19 Jan 2009; MB/s received over time (00:00-02:30) per OSA device 0D00-3D00, with an unexplained dip (“Huh?”)]
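The imbalance follows from how link aggregation works: a hash of the addresses picks one physical link per connection, not per frame. A sketch with a deliberately simplified hash (real switches use various hash inputs; the address octets are invented):

```python
# Simplified link selection: XOR of the last address octets, modulo the
# number of links. Real LACP hashes vary by switch, but the effect is the same.
def pick_link(src_octet: int, dst_octet: int, links: int = 4) -> int:
    return (src_octet ^ dst_octet) % links

# Only three active communication pairs (illustrative address octets):
pairs = [(10, 1), (11, 1), (12, 1)]
usage = [0] * 4
for src, dst in pairs:
    usage[pick_link(src, dst)] += 1

print(usage)   # [0, 1, 1, 1]
# With 3 pairs over 4 links, at least one link always stays idle,
# so the 4-fold potential is never reached.
assert min(usage) == 0
```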
Improving TSM Throughput

LACP VSWITCH – Real World Experience
- CP overhead has increased significantly - a T/V ratio of 1.3
- The dedicated OSA was replaced by a VNIC
  • No hardware support from QEBSM - CP simulates SIGA
- Strong correlation between bandwidth and user overhead
  • No strange things happening - a linear relation
  • Receiving 100 MB/s: Linux time ~65% of a CPU, user overhead ~22% of a CPU

[Chart: CP overhead of TSMSERV (CPU%) vs. VSWITCH throughput (MB/s); linear fit y = 0.2217x + 1.311]
[Chart: TSMSERV CPU usage (cp and emul) over time - 19 Jan 2009]
[Chart: TSMSERV emulation time, vtime (CPU%) vs. VSWITCH throughput (MB/s)]
Improving TSM Throughput

LACP VSWITCH – CPU Usage
- More than just the virtual machine
- Also a rather large System Overhead
  • Total CPU utilization of ~190%
- Other high-priority workload kicked in
  • Matches the dip in throughput
  • Throughput is now limited by CPU

[Chart: TSMSERV CPU usage (cp and emul) over time - 19 Jan 2009]
[Chart: CPU usage per user over time - TSMSERV, SAP000, SAP005, SAP025, others]
[Chart: CPU usage over time broken down into System, CP, and User]
Improving TSM Throughput

LACP VSWITCH – CPU Usage
- System overhead correlates with VSWITCH bandwidth
  • This is different from the CP overhead charged to TSMSERV
  • A pretty linear relation - about 24% of a CPU per 100 MB/s
- Probably work that CP does to receive the data
  • Decoding the LACP packets
  • Copying data from real QDIO buffers to VNIC buffers

[Chart: System overhead (CPU%) vs. VSWITCH throughput (MB/s); linear fit y = 0.2372x + 1.4176]

Receiving 100 MB/s:
  65% Linux internal work
  22% CP overhead for Linux
  24% System overhead
 111% Total
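The breakdown above adds up to more than a full engine; a sketch reproducing the arithmetic, with a cross-check against the fitted lines from the two scatter charts:

```python
# CPU cost of receiving 100 MB/s via the LACP VSWITCH (percent of one CPU).
linux_work = 65     # emulation time: Linux internal work
cp_overhead = 22    # CP overhead charged to the TSMSERV user
sys_overhead = 24   # system overhead, not charged to any user

total = linux_work + cp_overhead + sys_overhead
print(f"Total: {total}% of a CPU")   # 111%: more than one engine

# Cross-check against the fitted lines from the scatter charts:
cp_fit = 0.2217 * 100 + 1.311        # ~23.5, close to the 22% figure
sys_fit = 0.2372 * 100 + 1.4176      # ~25.1, close to the 24% figure
print(f"Fits predict: CP {cp_fit:.1f}%, system {sys_fit:.1f}%")
```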
Improving TSM Throughput

Ethernet Bonding in Linux
- The Linux implementation of LACP
- Requires exclusive OSA ports, just like the VSWITCH
  • Other ports provide VSWITCH fail-over

[Diagram: TSMSERV using Linux bonding over LACP, alongside the VSWITCH]
Improving TSM Throughput

Linux Bonding – Performance measurements
- Maximum throughput is slightly higher, using all 4 paths
- System Overhead has disappeared
  • CP has no inbound traffic for the VSWITCH anymore
- CP Overhead for TSMSERV is gone
  • CP is not even aware of the traffic - QEBSM handles it
- Linux CPU usage per MB has increased
  • The qeth code paths differ for QEBSM vs. SIGA

[Chart: throughput with Linux bonding (MB/s) over time, per OSA device]
[Chart: TSMSERV CPU usage with Linux bonding - System, CP, Emul]
Improving TSM Throughput

VSWITCH LACP versus Linux Bonding
- The VSWITCH solution provides flexibility and ease of use
  • At very high bandwidth there is a significant CPU cost
- The Linux bonding solution does not share interfaces among servers
  • Additional OSA and router ports may be required
  • Network routing becomes more complicated
- The throughput improvement was less than expected
  • There are still latencies to be discovered
- Not every application uses 100 MB/s
  • With lower bandwidth the CPU cost is less
  • But LACP is meant for high bandwidth
- It is not obvious what the CPU is used for
  • There may be options for improvement

[Chart: CPU usage at 100 MB/s - VSWITCH LACP vs. Linux bonding, broken down into System Overhead, TSMSERV CP Overhead, and TSMSERV Emulation Time]
My Penguin can't sleep

Linux servers without work should be idle
- Virtual machines drop from the queue at transaction end
  • CP considers a transaction complete after 300 ms of idle time
    (the queue drop delay is a bit more complicated than this)
- Linux servers tend to have some background work
  • Frequent CPU usage causes the server to stay in queue
  • CP is reluctant to take pages from in-queue virtual machines
  • No queue drop = a non-interactive, batch-like virtual machine
- In-queue idle servers impact scalability
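Whether a guest ever drops from queue depends on whether its wake-ups leave gaps longer than the (simplified) 300 ms threshold. A sketch that scans a list of wake-up timestamps; the timestamps are invented:

```python
# Invented wake-up timestamps (seconds); real data comes from tracing.
wakeups = [0.00, 0.01, 0.02, 0.50, 0.51, 2.60, 2.61, 2.62, 5.00]

THRESHOLD = 0.300   # CP's (simplified) idle threshold for transaction end

# Find the idle gaps long enough for CP to see the machine as idle.
gaps = [b - a for a, b in zip(wakeups, wakeups[1:])]
droppable = [g for g in gaps if g > THRESHOLD]

print(f"{len(droppable)} gaps > 300 ms out of {len(gaps)}")
# A guest that ticks every 10 ms never produces such a gap and
# therefore never drops from queue.
```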
My Penguin can't sleep

Example of an idle Linux server
- Found waiting for CPU resources 5% of the time
- Never found actually running
- Waiting for queue drop 95% of the time
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:00:00 LROBV1 0 0 5 0 0 0 0 0 0 95 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:00:00 LROBV1 1 1 1 0 1.00 0
95% of the time in “test idle” (waiting for queue drop)
No transactions, permanently in queue
My Penguin can't sleep

Frequent timer ticks keep the virtual machine in queue
- The work done by Linux on each tick is minimal
  • Total CPU usage is still limited
- Can be verified with CP TRACE EXT 1004
  • 1004 is the Clock Comparator external interrupt

Reason #1 - the Linux on-demand timer is not active
- Traditionally Linux has a 10 ms timer interrupt to check for work
lrobv1:~ # vmcp trace ext 1004 printer run
lrobv1:~ # vmcp sp prt purge ; sleep 10 ; vmcp sp prt close
PRT FILE 0046 SENT **** PURGED **** AS 0056 RECS 1718 CPY
PRT FILE 0047 SENT TO ROBV RDR AS 0057 RECS 1002 CPY
lrobv1:~ # vmcp trace end
Trace ended
1002 ticks / 10 sec ≈ 100 per second
My Penguin can't sleep

A more detailed trace of the timer interrupt
- Add a display of the TOD clock at each interrupt
- CMS Pipelines can convert the TOD clocks
lrobv1:~ # vmcp sp con purge
CON FILE 0045 SENT **** PURGED **** AS 0058 RECS 0005 CPY
lrobv1:~ # vmcp trace i r 10cb4e term run cmd d rde8.8
lrobv1:~ # vmcp trace goto niets
lrobv1:~ # vmcp sp con close robv
CON FILE 0049 SENT TO ROBV RDR AS 0059 RECS 3758 CPY
-> 000000000010CB4E' STCK B2050DE8 >>
R00000DE8 C067D666 6D01818A
-> 000000000010CB4E' STCK B2050DE8 >>
R00000DE8 C067D666 6F80F54A
[Chart: Timer Interrupts (with 10 ms timer) - time difference (ms) between ticks over 1 s of elapsed time; mostly 10 ms between ticks]
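The conversion the trace relies on can be done by hand: in the System z TOD clock, bit 51 represents one microsecond, so dividing by 4096 (a 12-bit shift) yields microseconds. A sketch using the two STCK values from the trace above:

```python
# Low-order words of two consecutive STCK values from the trace above
# (the high-order word C067D666 is identical, so it cancels out).
t1 = 0x6D01818A
t2 = 0x6F80F54A

# TOD clock format: bit 51 = 1 microsecond, so value / 4096 is microseconds.
delta_us = (t2 - t1) / 4096
print(f"{delta_us / 1000:.2f} ms between ticks")   # ~10.23 ms: the 10 ms timer
```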
My Penguin can't sleep

Linux on-demand timer – System #1
- Avoids the 10 ms timer ticks when otherwise idle
- Configured with /proc/sys/kernel/hz_timer = 0
- The default setting changed with various releases
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:22:00 LROBV1 0 0 0 0 0 0 0 0 0 100 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:22:00 LROBV1 1 1 1 0 0.02 42
100% of time “test idle”
42 transactions/min ≈ an average of 1.5 seconds idle
My Penguin can't sleep

Tracing on-demand timer ticks – System #1
- Timer interrupts depend on process wake-up requests
  • This is my server - yours will be different
  • Counted 78 timer ticks per minute
- Frequently periods of 2 seconds of “silence”
  • During the test interval, counted 45 periods of > 300 ms
  • Good enough to show CP that the virtual machine is idle
- Fewer timer ticks would be even better
  • More chance that CP finds the server not in queue

[Chart: Timer Interrupts (with on-demand timer) - time difference (ms) between ticks over 60 s of elapsed time]
My Penguin can't sleep

Anatomy of the average transaction
- Periods of activity with short wait times between them
  • Each starts with a timer interrupt (a simplification: ignoring I/O interrupts)
- A longer idle period, followed by queue drop

[Diagram: a transaction timeline of about 1.4 s (60 sec / 44 transactions): run and wait periods punctuated by timer interrupts (T), then “test idle” until queue drop, then dormant wait]
Screen: ESARATE ESAMON 3.7.0
1 of 2 Transaction Rates And Response Times CLASS * USER
<Triv+NonTriv> <-UP Trivial-->
UserID <-Users-> Transactions Tran Resp(sec)
Time /Class Actv In Q /min Response /min Qtim Trans
-------- -------- ---- ---- ----- -------- ---- ---- -----
05:26:00 LROBV1 1 0.0 44.00 0.043 39.0 0.18 0.009
Queue time = 180 ms, transaction time = 9 ms (87%)
My Penguin can't sleep

Show the z/VM scheduler that the virtual machine is idle
- The next activity is then considered a new transaction
- Avoids virtual machines ending up in Q3 with batch priority
- The z/VM scheduler classifies a virtual machine at queue drop time

Allow z/VM storage management to take resources away
- Free list replenishment is done by sampling
  • The dormant percentage determines whether CP finds the server idle
  • It pays to try to reduce the number of timer interrupts

The percentage dormant is mostly determined by timer interrupts
- CPU usage is minimal
- What matters is the frequency and distribution of timer interrupts
My Penguin can't sleep

Detailed measurement of timer requests
- Transactions are reported by the z/VM Monitor
  • It does not provide information about the timer interrupts
  • It does not tell which process requested the timer interrupt
- Virtual machine trace at the right place* in the kernel shows:
  • Timeout value (in ticks)
  • Process ID
  • TOD clock
-> 000000000033BCCC' LGR B904003B
G03=00000000000005DC
-> 000000000033BCD4' LG E3100DD8
G01=000000001FE86D88
V1FE86E8C 0000044C 0000044B
V00000DE8 C066F5C8 C8E8E5C2
Timeout value
Process ID
TOD clock
* The most convenient place to trace differs with each kernel level. Eventually the timer_stats patch from Ingo Molnar (2.6.20) should provide an even easier way to get this information.
My Penguin can't sleep

Timer Requests – System #1
- PID 1: init
  • 5 sec check for dead orphans
- PID 1086 / 1087: nscd
  • 15 sec to expire any cached items
- There are also timer interrupts for kernel threads and drivers
  • Visible in TRACE EXT 1004
  • Not something you tune yourself
- Timer interrupts are different from wake-up calls
  • There are multiple places where the call is made
  • Timer requests get merged
Request Time Timeout PID
2007-04-10 10:29:25.387323 500 1
2007-04-10 10:29:30.386720 500 1
2007-04-10 10:29:33.227057 1500 1086
2007-04-10 10:29:33.227057 1500 1087
2007-04-10 10:29:36.226925 500 1
2007-04-10 10:29:41.227045 500 1
2007-04-10 10:29:46.227236 500 1
2007-04-10 10:29:48.227111 1500 1086
2007-04-10 10:29:48.227111 1500 1087
2007-04-10 10:29:52.226871 500 1
2007-04-10 10:29:57.227049 500 1
2007-04-10 10:30:02.227412 500 1
2007-04-10 10:30:04.227001 1500 1086
2007-04-10 10:30:04.227001 1500 1087
2007-04-10 10:30:07.226932 500 1
2007-04-10 10:30:12.226734 500 1
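The traced timeout values are in kernel ticks; with HZ=100 (10 ms per tick, as on these kernels) the 500- and 1500-tick requests map onto the 5-second and 15-second intervals listed above. A sketch:

```python
HZ = 100   # kernel tick rate on these s390 kernels (10 ms per tick)

def timeout_seconds(ticks: int) -> float:
    """Convert a traced timeout value from kernel ticks to seconds."""
    return ticks / HZ

for ticks, owner in [(500, "init (PID 1)"), (1500, "nscd (PID 1086/1087)")]:
    print(f"{ticks} ticks = {timeout_seconds(ticks):.0f} s  <- {owner}")
# Matches the 5 s and 15 s intervals seen in the request log above.
```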
My Penguin can't sleep

Timer Requests – System #1
- Stopped the nscd process
- What remains is init at a 5 second interval
- Kernel interrupts
  • 2 sec reap_cache
  • 30 sec do_cache_clean

[Chart: Timer Interrupt Analysis - System #1; time between interrupts (s), showing dormant vs. test idle periods]
My Penguin can't sleep

PowerTOP
- Frequent wake-ups for nothing bother others too!
  • They prevent lowering the CPU frequency - reduces laptop battery life
- PowerTOP reveals what causes the wake-ups
PowerTOP 1.8 (C) 2007 Intel Corporation
Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 122.5 interval: 15.0s
Top causes for wakeups:
98.4% (120.5) java : schedule_timeout (process_timeout)
0.4% ( 0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
0.2% ( 0.2) init : schedule_timeout (process_timeout)
0.2% ( 0.2) <kernel core> : page_writeback_init (wb_timer_fn)
0.2% ( 0.2) <kernel module> : neigh_table_init_no_netlink (neigh_periodic_timer)
0.2% ( 0.2) nscd : schedule_timeout (process_timeout)
0.1% ( 0.1) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer)
java processes cause 120 wake-up calls per second (worse than 100 Hz timer)
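PowerTOP's "Top causes" lines can be parsed to pull out per-cause rates. A hedged Python sketch against a few lines of the output above (the parsing is my own; PowerTOP does not provide this itself):

```python
import re

# A few "Top causes for wakeups" lines from the PowerTOP 1.8 output above.
TOP = """\
 98.4% (120.5)       java : schedule_timeout (process_timeout)
  0.4% (  0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
  0.2% (  0.2)       init : schedule_timeout (process_timeout)
"""

LINE = re.compile(r"^\s*([\d.]+)%\s+\(\s*([\d.]+)\)\s+(.+?)\s*:\s*(\S+)")

def wakeup_rates(text):
    """Map each cause to its wakeups-from-idle per second."""
    return {m.group(3): float(m.group(2))
            for m in map(LINE.match, text.splitlines()) if m}

rates = wakeup_rates(TOP)
print(rates["java"])  # 120.5 wake-ups/s -- worse than the old 100 Hz tick
```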
My Penguin can't sleep
PowerTOP
▪ Wake-up calls disappear when the JVM is stopped
• This may not be a useful option in real life
PowerTOP 1.8 (C) 2007 Intel Corporation
Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 1.9 interval: 15.0s
Top causes for wakeups:
29.6% ( 0.5) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)
14.8% ( 0.3) <kernel module> : neigh_table_init_no_netlink (neigh_periodic_timer)
11.1% ( 0.2) init : schedule_timeout (process_timeout)
11.1% ( 0.2) <kernel core> : page_writeback_init (wb_timer_fn)
11.1% ( 0.2) nscd : schedule_timeout (process_timeout)
7.4% ( 0.1) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer)
3.7% ( 0.1) sshd : schedule_timeout (process_timeout)
3.7% ( 0.1) <kernel core> : sk_reset_timer (tcp_delack_timer)
3.7% ( 0.1) sshd : sk_reset_timer (tcp_write_timer)
3.7% ( 0.1) ip : __netdev_watchdog_up (dev_watchdog)
Requires 2.6.21 kernel; should work on SLES11
My Penguin can't sleep
Linux on-demand timer – System #2
▪ Virtual machine reported as 135% in-queue: virtual 2-way
• To be really idle, both virtual CPUs must be idle at the same time
• Makes it very hard for CP to find the virtual machine idle
• Not an easy candidate to take pages away
Screen: ESAXACT Marist OSDL ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
09:35:00 LNEALE1 0 0 0 0 0 0 0 0 0 100 0
Screen: ESAUSRQ Marist OSDL ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
09:35:00 LNEALE1 1 1 1 0 1.35 4
Counting Ext 1004: CPU 00: 97, CPU 01: 314
My Penguin can't sleep
Transactions in MP virtual machine
▪ Transaction time is very small compared to queue time
▪ Different timer events are not synchronized
▪ Using more CPUs spreads the timer events over the CPUs
• Using more CPUs means more transactions = more queue time
• Different in-queue periods are not synchronized
• Percentage truly dormant is low
[Diagram: "test idle" transactions for a 1-CPU versus a 3-CPU virtual machine; with 3 CPUs the unsynchronized in-queue periods leave almost no time when the virtual machine is truly dormant]
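The effect can be quantified with a toy model. This is an illustration only: it assumes the per-CPU idle periods are independent, which real (periodic) timer events are not, but it shows why the truly dormant fraction collapses as virtual CPUs are added:

```python
# Toy model of the MP-idle argument: if each virtual CPU is in-queue
# a fraction f of the time, and those periods are not synchronized,
# the virtual machine is dormant only when ALL virtual CPUs are out
# of queue at once. Independence is an assumption for illustration.

def dormant_fraction(in_queue_per_cpu, ncpus):
    """Fraction of time all ncpus virtual CPUs are idle simultaneously."""
    return (1.0 - in_queue_per_cpu) ** ncpus

for n in (1, 2, 3):
    print(f"{n} CPU(s): dormant {dormant_fraction(0.5, n):.1%} of the time")
# With 50% per-CPU in-queue time, a 3-way guest is dormant only
# ~12.5% of the time, versus 50% for a 1-way guest.
```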
My Penguin can't sleep
Linux on-demand timer – System #2
▪ Varied CPU 1 offline (one CPU is enough for an idle server)
▪ Reduces in-queue time to ~50% – much better
▪ Never more than 500 ms of silence
Screen: ESAXACT ESAMON 3.7.0
1 of 2 Transaction Delay Analysis CLASS * USER
<----------------Percent non-dormant-------
UserID E- D- T- Tst <As
Time /Class Run Sim CPU SIO Pag SVM SVM SVM CF Idl I/O
-------- ----------- --- --- --- --- --- --- --- --- --- ---
10:21:00 LNEALE1 3 0 0 0 0 0 0 0 0 97 0
Screen: ESAUSRQ ESAMON 3.7.0
1 of 3 User Queue and Load Analysis CLASS * USER
<----------User Load------------->
UserID Logged Non- Disc- Total Tran
Time /Class on Idle Active conn InQue /min
-------- -------- ------ ----- ------- ------ ----- ----
10:21:00 LNEALE1 1 1 1 0 0.54 40
Ext 1004: 230/min
[Chart: Timer Interrupts (with on-demand timer) – System #2; x-axis: elapsed time (s), 0 to 60; y-axis: time difference (ms), 0 to 600]
My Penguin can't sleep
Timer requests – System #2
▪ Analyzed 400 timer requests (5 minutes)
▪ One task has frequent requests for a 510 ms wake-up
• This explains why we never see silence for more than 500 ms
• Linux "/proc" helps identify the processes requesting the wake-up:
afs_rxevent polling with 510 ms
httpd2-prefork polling with 1000 ms
• Checking twice per second may not be polling, but it is frowned upon
• Later AFS versions are supposed to fix this
Count Timeout Task Address
128 51 00000000fd465830
65 100 00000000fdff9090
...
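The "never more than 500 ms silence" observation follows directly from the fastest poller. A Python sketch that merges the wake-up instants of the two pollers found above and measures the longest silent gap (the common start at t=0 is a simplifying assumption):

```python
# Merge the wake-up times of the periodic pollers from the System #2
# analysis (510 ms afs_rxevent, 1000 ms httpd2-prefork) and find the
# longest interval with no wake-up at all.

def longest_gap(periods_ms, horizon_ms=10_000):
    """Longest silent interval (ms) between any two merged wake-ups."""
    events = sorted({t for p in periods_ms
                       for t in range(0, horizon_ms + 1, p)})
    return max(b - a for a, b in zip(events, events[1:]))

print(longest_gap([510, 1000]))  # the 510 ms poller caps every gap at 510 ms
```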
Using cpuplugd
Approach to get the best of both worlds – cpuplugd
▪ Maximum number of virtual CPUs during peaks
▪ Fewer virtual CPUs when the workload does not need more
• Reduces Linux internal overhead from MP effects
• Addresses z/VM issues with an idle virtual MP
▪ Daemon to set virtual CPUs online and offline
• Thresholds and rules define when to switch them on and off
No exact science – defaults mostly determined by trial & error
The idea is interesting and the approach is intuitive
It might make sense to look at steal time and idle time
▪ It does work, and varies excessive virtual CPUs offline
• CP emulates the SIGP to stop the CPU
▪ The daemon wakes up every 10 seconds to check
• Uses the "load average" – takes time to get CPUs online
• The cpuplugd daemon uses some resources itself
HOTPLUG="(loadavg > onumcpus + 0.75) & (idle < 10.0)"
HOTUNPLUG="(loadavg < onumcpus - 0.25) | (idle > 50)"
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01.
Linux speak:
"plug" – enable
"unplug" – disable
Virtual CPU SHARE distribution
Assigned SHARE is distributed over the virtual CPUs
▪ When business justifies multiple CPUs, also review the SHARE setting
▪ Surprises customers – changes one CPU into two half CPUs
▪ Probably poor design – but hard to change without breaking things
Detaching a virtual CPU speeds up the remaining CPUs
▪ No business justification – little work, so why must it be faster?
▪ Sharing resources requires "social behavior" rather than "grab what you can"
▪ Fortunately, detaching a CPU is a disruptive process
z/VM 5.4 enhancement: distribute SHARE also with stopped CPUs
▪ Makes it easier for Linux to "detach" CPUs and distribute the SHARE
▪ Motivated by different objectives
• Workload-based capacity planning
• Maximum single-server throughput
▪ Need an option in CP to avoid SHARE distribution
• Maybe with a more intuitive per-CPU setting: SHARE REL 100 DWIM* DNMWM*
* DWIM = Do What I Mean; DNMWM = Do Not Mess With Me
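The arithmetic behind "one CPU becomes two half CPUs" is simple division. A simplified sketch of the scheduler behavior described above (the stopped-CPU handling models the z/VM 5.4 enhancement, not the exact CP algorithm):

```python
# A relative SHARE is divided evenly over the virtual CPUs. With the
# z/VM 5.4 enhancement, the slice of a stopped virtual CPU is
# redistributed over the running ones (modeled here by excluding
# stopped CPUs); this is an illustration, not CP's actual code.

def per_cpu_share(rel_share, virtual_cpus, stopped_cpus=0):
    """Effective relative share per running virtual CPU."""
    running = virtual_cpus - stopped_cpus
    return rel_share / running

print(per_cpu_share(100, 2))                  # 50.0 per virtual CPU
print(per_cpu_share(100, 2, stopped_cpus=1))  # 100.0 for the remaining CPU
```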
My Penguin can't sleep
Conclusion
▪ z/VM memory management needs to see servers idle
• Queue drop allows the scheduler to see the transaction end
• A dormant period allows memory management to take pages away
▪ Adding extra virtual CPUs impacts scalability
• All virtual CPUs must drop from queue for the machine to be dormant
• The virtual machine is in Q3 even when idle – sluggish
• Linux CPU affinity is not the best choice for Linux on z/VM
• See where cpuplugd fits in your capacity plan – be aware
▪ Do not define more virtual CPUs than you need
• Not even more than the resources you expect to get
• The "equal to the number of real CPUs" rule is nonsense
When in doubt, one will do.
When you have measured, probably too.
Summary
Shared environment requires CPU resource management
▪ Don't waste resources that other servers can use better
▪ Charge-back motivates users to save resources
Many layers of CPU resource management
▪ Instrumentation required in all layers
▪ Linux CPU data must be combined with z/VM data
Performance monitor helps explain exceptional usage
▪ Data must be correct and complete to be useful
Linux on z/VM virtual CPU configuration
▪ When in doubt, one will do. When you have measured, probably too.
Idle Linux servers must drop from queue
▪ Use "powertop" to identify the cause of polling
Linux on z/VM – Understanding CPU Usage
Rob van der Heij
rob @ velocitysoftware.de
Velocity Software GmbH – http://www.velocitysoftware.com/
Big “Thank You” to our customers who let me work on their performance problems
If you have performance problems, just drop me a note
or catch me somewhere
IBM System z
Technical Conference
Brussels, 2009
Session LX44