Problem Reporting and Analysis Linux on System z - How to survive a Linux Critical Situation

Sven Schuetz
Linux on System z Development and Service
[email protected]

© 2011 IBM Corporation

IBM Live Virtual Class – Linux on System z

Agenda

Introduction
How to help us to help you
Systems monitoring
How to dump a Linux on System z
Real Customer cases

Introductory remarks

Problem analysis looks straightforward on the charts, but it might have taken weeks to get it done.

A problem does not necessarily show up at its place of origin

The more information is available, the sooner the problem can be solved, because gathering and submitting additional information again and again usually introduces delays.

This presentation can only introduce some tools and how they can be used; comprehensive information on their capabilities is to be found in the documentation of the corresponding tool.

Do not forget to update your systems

Describe the problem

Get as much information as possible about the circumstances:

– What is the problem?

– When did it happen? (date and time, important to dig into logs )

– Where did it happen? One or more systems, production or test environment?

– Is this a first time occurrence?

– If occurred before: how frequently does it occur?

– Is there any pattern?

– Was anything changed recently?

– Is the problem reproducible?

Write down as much information as possible about the problem!

Describe the environment

Machine Setup
– Machine type (z196, z10, z9, ...)
– Storage Server (ESS800, DS8000, other vendors' models)
– Storage attachment (FICON, ESCON, FCP, how many channels)
– Network (OSA (type, mode), Hipersocket) ...

Infrastructure setup
– Clients
– Other Computer Systems
– Network topologies
– Disk configuration

Middleware setup
– Databases, web servers, SAP, TSM (including version information)

Trouble Shooting First-Aid Kit (1/2)

Install packages required for debugging
– s390-tools/s390-utils
  • dbginfo.sh
– sysstat
  • sadc/sar
  • iostat
– procps
  • vmstat, top, ps
– net-tools
  • netstat
– dump tools crash / lcrash
  • lcrash (lkcdutils) available with SLES9 and SLES10
  • crash available on SLES11
  • crash in all RHEL distributions

Trouble Shooting First-Aid Kit (2/2)

Collect dbginfo.sh output
– Proactively in healthy system
– When problems occur – then compare with healthy system

Collect system data
– Always archive syslog (/var/log/messages)
– Start sadc (System Activity Data Collection) service when appropriate (please include disk statistics)
– Collect z/VM MONWRITE Data if running under z/VM when appropriate

When System hangs
– Take a dump
  • Include System.map, Kerntypes (if available) and vmlinux file
– See “Using the dump tools” book on http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26ddt02.pdf

Enable extended tracing in /sys/kernel/debug/s390dbf for subsystem

dbginfo Script (1/2)

dbginfo.sh is a script that collects various system-related files for debugging purposes. It generates a tar archive which can be attached to PMRs / Bugzilla entries

Part of the s390-tools package in SUSE and recent Red Hat distributions
– dbginfo.sh gets continuously improved by service and development

Can be downloaded directly from the developerWorks website

http://www.ibm.com/developerworks/linux/linux390/s390-tools.html

It is similar to the Red Hat tool sosreport or supportconfig from Novell

root@larsson:~> dbginfo.sh
Create target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057
Change to target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057
[...]

dbginfo Script (2/2)

Linux Information:– /proc/[version, cpu, meminfo, slabinfo, modules, partitions, devices ...]

– System z specific device driver information: /proc/s390dbf (RHEL 4 only) or /sys/kernel/debug/s390dbf

– Kernel messages /var/log/messages

– Reads configuration files in directory /etc/ [ccwgroup.conf, modules.conf, fstab]

– Uses several commands: ps, dmesg

– Query setup scripts

• lscss, lsdasd, lsqeth, lszfcp, lstape

– And much more

z/VM information:

– Release and service Level: q cplevel

– Network setup: q [lan, nic, vswitch, v osa]

– Storage setup: q [set, v dasd, v fcp, q pav ...]

– Configuration/memory setup: q [stor, v stor, xstore, cpus...]

– When the system runs as z/VM guest, ensure that the guest has the appropriate privilege class authorities to issue the commands

SADC/SAR

Capture Linux performance data with sadc/sar
– CPU utilization
– Disk I/O overview and on device level
– Network I/O and errors on device level
– Memory usage/Swapping
– … and much more
– Reports statistics data over time and creates average values for each item

SADC example (for more see man sadc)
– System Activity Data Collector (sadc) --> data gatherer
– /usr/lib64/sa/sadc [options] [interval [count]] [binary outfile]
– /usr/lib64/sa/sadc 10 20 sadc_outfile

SADC/SAR (cont'd)

– /usr/lib64/sa/sadc -d 10 sadc_outfile

– -d option: statistics for disk

– Should be started as a service during system start

SAR example (for more see man sar)

– System Activity Report (sar) command --> reporting tool

– sar -A

– -A option: reports all the collected statistics

– sar -A -f sadc_outfile >sar_outfile

Please include the binary sadc data and sar -A output when submitting SADC/SAR information to IBM support

CPU utilization

Per CPU values: watch out for
– system time (kernel time)
– iowait time (slow I/O subsystem)
– steal time (time taken by other guests)
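The three values named above can also be read from the aggregate "cpu" line of /proc/stat. The following is only a minimal sketch: cpu_shares is a hypothetical helper (not part of sysstat), it assumes the usual field order user, nice, system, idle, iowait, irq, softirq, steal, and it reports averages since boot rather than the per-interval values that sar shows.

```shell
#!/bin/sh
# cpu_shares: hypothetical helper, not part of sysstat.
# Input:  one "cpu ..." line in /proc/stat layout
#         (cpu user nice system idle iowait irq softirq steal ...).
# Output: share of system, iowait and steal time in percent.
cpu_shares() {
    echo "$1" | awk '{
        total = 0
        for (i = 2; i <= 9; i++) total += $i   # user .. steal
        if (total == 0) { print "no data"; exit }
        printf "system=%.1f iowait=%.1f steal=%.1f\n",
               100 * $4 / total, 100 * $6 / total, 100 * $9 / total
    }'
}

# Usage on a live system (values are cumulative since boot):
#   cpu_shares "$(grep '^cpu ' /proc/stat)"
```

A persistently high steal share would point at contention with other guests, which is the situation the slide warns about; for interval data, sar remains the better source.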

Disk I/O rates

read/write operations
- per I/O device
- tps: transactions
- rd/wr_secs: sectors

Is your I/O balanced? Maybe you should stripe your LVs

Linux on System z dump tools

DASD dump tool
– Writes dump directly on DASD partition
– Uses s390 standalone dump format
– ECKD and FBA DASDs supported
– Single volume and multiple volume (for large systems) dump possible
– Works in z/VM and in LPAR

SCSI dump tool
– Writes dump into filesystem
– Uses lkcd dump format
– Works in z/VM and in LPAR

VMDUMP
– Writes dump to VM spool space (VM reader)
– z/VM specific dump format, dump must be converted
– Only available when running under z/VM

Tape dump tool
– Writes dump directly on ESCON/FICON Tape device
– Uses s390 standalone dump format

DASD dump tool – general usage

1. Format and partition dump device

root@larsson:~> dasdfmt -f /dev/dasd<x> -b 4096
root@larsson:~> fdasd /dev/dasd<x>

2. Prepare dump device in Linux

root@larsson:~> zipl -d /dev/dasd<x1>

3. Stop all CPUs

4. Store Status

5. IPL dump device

6. Copy dump to Linux

root@larsson:~> zgetdump /dev/<x1> > dump_file

DASD dump under z/VM

Prepare dump device under Linux, if possible on 64Bit environment:

root@larsson:~> zipl -d /dev/dasd<x1>

After Linux crash issue these commands on 3270 console:

#cp cpu all stop
#cp cpu 0 store status
#cp i <dasd_devno>

Wait until dump is saved on device:

00: zIPL v1.6.0 dump tool (64 bit)
00: Dumping 64 bit OS
00: 00000087 / 00000700 MB 0...
00: Dump successful

Only disabled wait PSW on older Distributions

Attach dump device to a Linux system with dump tools installed

Store dump to Linux file system from dump device (e.g. zgetdump)

DASD dump on LPAR (1/2)

DASD dump on LPAR (2/2)

Multi volume dump

zipl can now dump to multiple DASDs, so system images that are larger than a single DASD can be dumped.

– You can specify up to 32 ECKD DASD partitions for a multi-volume dump

What are dumps good for?

– Full snapshot of system state taken at any point in time (e.g. after a system has crashed, or of a running system)
– Can be used to analyse system state beyond messages written to the syslog
– Obtain messages which have not been written to the syslog due to a crash
– Internal data structures not exported to anywhere

Multi volume dump (cont'd)

How to prepare a set of ECKD DASD devices for a multi-volume dump? (64-bit systems only)

– We use two DASDs in this example:

root@larsson:~> dasdfmt -f /dev/dasdc -b 4096
root@larsson:~> dasdfmt -f /dev/dasdd -b 4096

– Create the partitions with fdasd. The sum of the partition sizes must be sufficiently large (the memory size + 10 MB):

root@larsson:~> fdasd /dev/dasdc
root@larsson:~> fdasd /dev/dasdd

– Create a file called sample_dump_conf containing the device nodes (e.g. /dev/dasdc1) of the two partitions, separated by one or more line feed characters

– Prepare the volumes using the zipl command.

root@larsson:~> zipl -M sample_dump_conf
[...]
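The sizing rule for the partitions (memory size + 10 MB) can be checked before running zipl. The sketch below only illustrates that arithmetic; required_dump_mb and dump_space_ok are hypothetical helper names, and all sizes are assumed to be given in MB by the caller.

```shell
#!/bin/sh
# required_dump_mb MEM_MB: memory size plus the 10 MB the slide asks for.
required_dump_mb() {
    echo $(( $1 + 10 ))
}

# dump_space_ok MEM_MB PART_MB...: succeeds if the listed partitions
# together are large enough for a dump of MEM_MB megabytes of memory.
dump_space_ok() {
    mem_mb=$1; shift
    sum=0
    for part in "$@"; do
        sum=$(( sum + part ))
    done
    [ "$sum" -ge "$(required_dump_mb "$mem_mb")" ]
}

# Usage: a guest with 4096 MB on two partitions of 2300 MB each
#   dump_space_ok 4096 2300 2300 && echo "partitions are large enough"
```

On a running system the memory size could be taken from the MemTotal line of /proc/meminfo (which is reported in kB); how the partition sizes are obtained is left to the caller.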

Multi volume dump (cont'd)

To obtain a dump with the multi-volume DASD dump tool, perform the following steps:
– Stop all CPUs, Store status on the IPL CPU.
– IPL the dump tool using one of the prepared volumes, either 4711 or 4712.
– After the dump tool is IPLed, you'll see a message that indicates the progress of the dump. Then you can IPL Linux again

#cp cpu all stop
#cp cpu 0 store status
#cp ipl 4711

Copying a multi-volume dump to a file
– Use zgetdump without any option to copy the dump parts to a file:

root@larsson:~> zgetdump /dev/dasdc > mv_dump_file

Multi volume dump (cont'd)

Display information of the involved volumes:

root@larsson:~> zgetdump -d /dev/dasdc
'/dev/dasdc' is part of Version 1 multi-volume dump,
which is spread along the following DASD volumes:
0.0.4711 (online, valid)
0.0.4712 (online, valid)
[...]

Display information about the dump itself:

root@larsson:~> zgetdump -i /dev/dasdc
Dump device: /dev/dasdc
>>>  Dump header information  <<<
Dump created on: Fri Aug  7 15:12:41 2009
[...]
Multi-volume dump: Disk 1 (of 2)
Reading dump contents from 0.0.4711.................................
Dump ended on:   Fri Aug  7 15:12:52 2009
Dump End Marker found: this dump is valid.

SCSI dump tool – general usage

1. Create partition with PCBIOS disk-layout (fdisk)

2. Format partition with ext2 or ext3 filesystem

3. Install dump tool:
– mount and prepare disk:

root@larsson:~> mount /dev/sda1 /dumps
root@larsson:~> zipl -D /dev/sda1 -t dumps

– Optional: /etc/zipl.conf:

dumptofs=/dev/sda1
target=/dumps

4. Stop all CPUs

5. Store Status

6. IPL dump device

Dump tool creates dumps directly in filesystem

SCSI dump supported for LPARs and as of z/VM 5.4

SCSI dump under z/VM

SCSI dump from z/VM is supported as of z/VM 5.4

Issue SCSI dump

#cp cpu all stop
#cp cpu 0 store status
#cp set dumpdev portname 47120763 00ce93a7 lun 47120000 00000000 bootprog 0
#cp ipl 4b49 dump

To access the dump, mount the dump partition

SCSI dump on LPAR

Select CPC image for LPAR to dump
Go to Load panel
Issue SCSI dump
– FCP device
– WWPN
– LUN

VMDUMP

The only method to dump NSSes or DCSSes under z/VM

Works nondisruptively

Create dump:

#cp vmdump to cmsguest

Receive dump:
– Store the dump from the reader into CMS dump file:

#cp dumpload

– Transfer the dump to Linux from CMS, e.g. FTP
– NEW: vmur device driver:

root@larsson:~> vmur rec <spoolid> vmdump

Linux tool to convert vmdump to lkcd format:

root@larsson:~> vmconvert vmdump linux.dump

Problem: Dump process relatively slow

How to obtain information about a dump

Display information of the involved volume:

root@larsson:~> zgetdump -d /dev/dasdb
'/dev/dasdb' is Version 0 dump device.
Dump size limit: none

Display information about the dump itself:

root@larsson:~> zgetdump -i /dev/dasdb1
Dump device: /dev/dasdb1

Dump created on: Thu Oct  8 15:44:49 2009

Magic number:  0xa8190173618f23fd
Version number:  3
Header size:  4096
Page size:  4096
Dumped memory:  1073741824
Dumped pages:  262144
Real memory:  1073741824
cpu id:  0xff00012320978000
System Arch:  s390x (ESAME)
Build Arch:  s390x (ESAME)
>>>  End of Dump header  <<<

Dump ended on:  Thu Oct  8 15:45:01 2009
Dump End Marker found: this dump is valid.
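The header output shown above can be checked mechanically before a dump is shipped. This is a rough sketch, assuming the output layout shown on this slide (field names and the end-marker wording may differ between s390-tools versions); check_dump_info is a hypothetical helper that reads the text from standard input.

```shell
#!/bin/sh
# check_dump_info: hypothetical helper. Reads "zgetdump -i" style text
# on stdin and succeeds only if
#   - the end-marker line is present, and
#   - Page size * Dumped pages equals Dumped memory.
check_dump_info() {
    awk -F': *' '
        /^Page size/     { page  = $2 }
        /^Dumped pages/  { pages = $2 }
        /^Dumped memory/ { mem   = $2 }
        /Dump End Marker found: this dump is valid/ { marker = 1 }
        END { exit !(marker && mem > 0 && page * pages == mem) }
    '
}

# Usage (device node taken from the slide):
#   zgetdump -i /dev/dasdb1 | check_dump_info && echo "dump looks complete"
```

For the values on the slide, 4096 * 262144 = 1073741824, which matches the reported dumped memory.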

How to obtain information about a dump (cont'd)

To display information about the dump itself (dump header) and check whether the dump is valid, use lcrash with the options ’-i’ and ’-d’.

root@larsson:~> lcrash -i -d /dev/dasdb1
       Dump Type: s390 standalone dump
         Machine: s390x (ESAME)
          CPU ID: 0xff00012320978000
    Memory Start: 0x0
      Memory End: 0x40000000
     Memory Size: 1073741824
    Time of dump: Thu Oct  8 15:44:49 2009
 Number of pages: 262144
Kernel page size: 4096
  Version number: 3
    Magic number: 0xa8190173618f23fd
Dump header size: 4096
      Dump level: 0x4
      Build arch: s390x (ESAME)
Time of dump end: Thu Oct  8 15:45:01 2009

End Marker found! Dump is valid!

Automatic dump on panic (SLES 10/11, RHEL 5/6): dumpconf

The dumpconf tool configures a dump device that is used for automatic dump in case of a kernel panic.

–The command can be installed as service script under /etc/init.d/dumpconf or can be called manually.

–Start service: # service dumpconf start

–It reads the configuration file /etc/sysconfig/dumpconf.

–Example configuration for CCW dump device (DASD) and reipl after dump:

ON_PANIC=dump_reipl
DUMP_TYPE=ccw
DEVICE=0.0.4711

Automatic dump on panic (SLES 10/11, RHEL 5): dumpconf (cont'd)

– Example configuration for FCP dump device (SCSI disk):

ON_PANIC=dump
DUMP_TYPE=fcp
DEVICE=0.0.4714
WWPN=0x5005076303004712
LUN=0x4047401300000000
BOOTPROG=0
BR_LBA=0

– Example configuration for re-IPL without taking a dump, if a kernel panic occurs:

ON_PANIC=reipl

– Example of executing a CP command, and rebooting from device 4711 if a kernel panic occurs:

ON_PANIC=vmcmd
VMCMD_1="MSG <vmguest> Starting VMDUMP"
VMCMD_2="VMDUMP"
VMCMD_3="IPL 4711"

Get dump and send it to service organization

DASD/Tape:
– Store dump to Linux file system from dump device:

root@larsson:~> zgetdump /dev/<device node> > dump_file

– Alternative: lcrash (Compression possible)

root@larsson:~> lcrash -d /dev/dasdxx -s <dir>

SCSI:
– Get dump from filesystem

Additional files needed for dump analysis:
– SUSE (lcrash tool): /boot/System.map-xxx and /boot/Kerntypes-xxx
– Red Hat & SUSE (crash tool): vmlinux file (kernel with debug info) contained in debug kernel rpms:
  • Red Hat: kernel-debuginfo-xxx.rpm and kernel-debuginfo-common-xxx.rpm
  • SUSE: kernel-default-debuginfo-xxx.rpm

Handling large dumps

Compress the dump and split it into parts of 1 GB

root@larsson:~> zgetdump /dev/dasdc1 | gzip | split -b 1G

Several compressed files such as xaa, xab, xac, .... are created

Create md5 sums of the compressed files

root@larsson:~> md5sum xa* > dump.md5

Upload all parts together with the md5 information

Verification of the parts for a receiver

root@larsson:~> md5sum -c dump.md5
xaa: OK
[....]

Merge the parts and uncompress the dump

root@larsson:~> cat xa* | gunzip -c > dump
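The sender and receiver steps above can be wrapped into two small functions so that the round trip can be rehearsed on an ordinary file before a real dump is involved. This is a sketch under assumptions: pack_dump and unpack_dump are hypothetical names, the input is a regular file standing in for the zgetdump output, and the glob x?? is used instead of xa* so that more than 26 parts are also covered.

```shell
#!/bin/sh
# pack_dump SRC OUTDIR [PART_SIZE]
# Compress SRC, split it into parts (default 1G) inside OUTDIR and
# write the md5 sums of all parts to OUTDIR/dump.md5.
pack_dump() {
    src=$1; outdir=$2; size=${3:-1G}
    mkdir -p "$outdir" || return 1
    gzip -c "$src" | ( cd "$outdir" && split -b "$size" - ) || return 1
    ( cd "$outdir" && md5sum x?? > dump.md5 )
}

# unpack_dump PARTDIR DEST
# Verify the parts against dump.md5, then merge and uncompress them.
unpack_dump() {
    partdir=$1; dest=$2
    ( cd "$partdir" && md5sum -c dump.md5 > /dev/null ) || return 1
    cat "$partdir"/x?? | gunzip -c > "$dest"
}

# Usage with a stand-in file:
#   pack_dump dump_file parts 1G
#   unpack_dump parts dump_restored && cmp dump_file dump_restored
```

Running the checksum verification before merging means a damaged part is reported by name instead of surfacing later as a gunzip error.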

Transferring dumps

Transferring single volume dumps with ssh

root@larsson:~> zgetdump /dev/dasdc1 | ssh user@host "cat > dump_file_on_target_host"

Transferring multi-volume dumps with ssh

root@larsson:~> zgetdump /dev/dasdc | ssh user@host "cat > multi_volume_dump_file_on_target_host"

Transferring a dump with ftp
– Establish an ftp session with the target host, login and set the transfer mode to binary
– Send the dump to the host

ftp> put |"zgetdump /dev/dasdc1" <dump_file_on_target_host>

Dump tool summary

See “Using the dump tools” book on http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

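Tool: stand-alone tools (DASD, Tape, SCSI) and VMDUMP

Environment
– DASD, Tape: VM & LPAR
– SCSI: VM & LPAR
– VMDUMP: VM

Preparation
– DASD, Tape: zipl -d /dev/<dump_dev>
– SCSI: mkdir /dumps/mydumps, zipl -D /dev/sda1 ...
– VMDUMP: ---

Creation
– DASD, Tape, SCSI: Stop CPU & Store status, ipl <dump_dev_CUU>
– VMDUMP: vmdump

Dump medium
– DASD: ECKD or FBA
– Tape: Tape cartridges
– SCSI: LINUX file system on a SCSI disk
– VMDUMP: VM reader

Copy to filesystem
– DASD, Tape: zgetdump /dev/<dump_dev> > dump_file
– SCSI: ---
– VMDUMP: Dumpload, ftp ..., vmconvert ...

Viewing
– All tools: lcrash or crash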

Customer Cases

Availability: Guest spontaneously reboots

Configuration:
– Oracle RAC server or other HA solution under z/VM

Problem Description:
– Occasionally guests spontaneously reboot without any notification or console message

Tools used for problem determination:
– cp instruction trace of (re)IPL code
– Crash dump taken after trace was hit

[Figure: HA cluster with two Linux guests (Linux 1, Linux 2), each running an Oracle RAC Server, with communication between them and an Oracle RAC Database]

Availability: Guest spontaneously reboots - Steps to find root cause

Question: Who rebooted the system?

Step 1
– Find out address of (re)ipl code in the system map
– Use this address to set instruction trace

cd /boot
grep machine_restart System.map-2.6.16.60-0.54.5-default
000000000010c364 T machine_restart
00000000001171c8 t do_machine_restart
0000000000603200 D _machine_restart
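Turning the System.map entry into the trace command used in the next step is a small text transformation. The sketch below assumes the entry of interest is the global text symbol (type T) named machine_restart and reuses the ".4" length from the presentation's own trace command; restart_trace_cmd is a hypothetical helper that only prints the CP command and does not issue it.

```shell
#!/bin/sh
# restart_trace_cmd SYSTEM_MAP
# Print the CP instruction-trace command for machine_restart, built
# from the address found in the given System.map file.
restart_trace_cmd() {
    awk '$2 == "T" && $3 == "machine_restart" {
        addr = toupper($1)
        sub(/^0+/, "", addr)        # drop the leading zeros
        print "CP CPU ALL TR IN R " addr ".4"
        found = 1
    }
    END { exit !found }' "$1"
}

# Usage:
#   restart_trace_cmd /boot/System.map-2.6.16.60-0.54.5-default
```

Matching on the exact symbol name avoids picking up do_machine_restart or _machine_restart, which a plain grep also returns.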

Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)

Step 2
– Set CP instruction trace on the reboot address
– System is halted at that address, when a reboot is triggered

CP CPU ALL TR IN R 10C364.4
HCPTRI1027I An active trace set has turned RUN off

CP Q TR

NAME  INITIAL     (ACTIVE)
  1     INSTR   PSWA  0010C364-0010C367
        TERM    NOPRINT  NORUN SIM
        SKIP 00000  PASS 00000 STOP 00000  STEP 00000
        CMD  NONE

 -> 000000000010C364'  STMF    EBCFF0780024 >> 000000003A557D48     CC 2

Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)

Step 3
– Take a dump, when the (re)ipl code is hit

cp cpu all stop
cp store status
Store complete.
cp i 4fc6
Tracing active at IPL
HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
zIPL v1.6.3-0.24.5 dump tool (64bit)
Dumping 64 bit OS
00000128 / 00001024 MB
......
00001024 / 00001024 MB
Dump successful
HCPIR450W CP entered, disabled wait PSW 00020000 80000000 00000000 00000000

Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)

Step 4
– Save dump in a file

zgetdump /dev/dasdb1  > dump_file
Dump device: /dev/dasdb1

>>>  Dump header information  <<<
Dump created on: Wed Oct 27 12:00:40 2010
Magic number:  0xa8190173618f23fd
Version number:  4
Header size:  4096
Page size:  4096
Dumped memory:  1073741824
Dumped pages:  262144
Real memory:  1073741824
cpu id:  0xff00012320948000
System Arch:  s390x (ESAME)
Build Arch:  s390x (ESAME)
>>>  End of Dump header  <<<

Reading dump content ................................
Dump ended on:  Wed Oct 27 12:00:52 2010
Dump End Marker found: this dump is valid.

Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)

Step 5
– Use (l)crash to find out which process has triggered the reboot

 STACK:
 0 start_kernel+950 [0x6a690e]
 1 _stext+32 [0x100020]
================================================================
TASK HAS CPU (1): 0x3f720650 (oprocd.bin):
 LOWCORE INFO:
  -psw      : 0x0704200180000000 0x000000000010c36a
  -function : machine_restart+6
  -prefix   : 0x3f438000
  -cpu timer: 0x7fffffff 0xff9e6c00
  -clock cmp: 0x00c6ca69 0x22337760
  -general registers:

<snip>

 STACK:
 0 __handle_sysrq+248 [0x361240]
 1 write_sysrq_trigger+98 [0x2be796]
 2 sys_write+392 [0x225a68]
 3 sysc_noemu+16 [0x1179a8]

/var/opt/oracle/product/crs/bin/oprocd.bin

Availability: Guest spontaneously reboots (cont'd)

Problem Origin:
– HA component erroneously detected a system hang
  • hangcheck_timer module did not receive timer IRQ
  • z/VM 'time bomb' switch
  • TSA monitor

z/VM cannot guarantee 'real-time' behavior if overloaded
– Longest 'hang' observed: 37 seconds(!)

Solution:
– Offload HA workload from overloaded z/VM
  • e.g. use separate z/VM
  • or: run large Oracle RAC guests in LPAR

Network: network connection is too slow

Configuration:
– z/VSE running CICS, connection to DB2 in Linux on System z
– Hipersocket connection from Linux to z/VSE
– But also applies to hipersocket connections between Linux and z/OS

Problem Description:
– When CICS transactions were monitored, some transactions took a couple of seconds instead of milliseconds

Tools used for problem determination:
– dbginfo.sh
– s390 debug feature
– sadc/sar
– CICS transaction monitor

Network: network connection is too slow (cont'd)

s390 debug feature
– Check for qeth errors:

cat /sys/kernel/debug/s390dbf/qeth_qerr
00 01282632346:099575 2 - 00 0000000180b20218  71 6f 75 74 65 72 72 00 | qouterr.
00 01282632346:099575 2 - 00 0000000180b20298  20 46 31 35 3d 31 30 00 |  F15=10.
00 01282632346:099576 2 - 00 0000000180b20318  20 46 31 34 3d 30 30 00 |  F14=00.
00 01282632346:099576 2 - 00 0000000180b20390  20 71 65 72 72 3d 41 46 |  qerr=AF
00 01282632346:099576 2 - 00 0000000180b20408  20 73 65 72 72 3d 32 00 |  serr=2.

dbginfo file
– Check for buffer count:

cat /sys/devices/qeth/0.0.1e00/buffer_count
16

Problem Origin:
– Too few inbound buffers


Network: network connection is too slow (cont'd)

Solution:
– Increase inbound buffer count (default: 16, max: 128)
– Check actual buffer count with 'lsqeth -p'
– Set the inbound buffer count in the appropriate config file:

• SUSE SLES10:
  - in /etc/sysconfig/hardware/hwcfg-qeth-bus-ccw-0.0.F200
  - add QETH_OPTIONS="buffer_count=128"

• SUSE SLES11:
  - in /etc/udev/rules.d/51-qeth-0.0.f200.rules
  - add ACTION=="add", SUBSYSTEM=="ccwgroup", KERNEL=="0.0.f200", ATTR{buffer_count}="128"

• Red Hat:
  - in /etc/sysconfig/network-scripts/ifcfg-eth0
  - add OPTIONS="buffer_count=128"
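The per-distribution settings above can be captured in a small helper, which is handy when many guests need the same change. This is a sketch under assumptions: the distribution keys ("sles10", "sles11", "rhel"), the function names and the default interface name eth0 are illustrative; only the file paths and option strings come from the slide.

```python
from pathlib import Path

MAX_BUFFERS = 128  # maximum inbound buffer count stated on the slide


def read_buffer_count(bus_id, sysfs_root="/sys/devices/qeth"):
    """Read the current inbound buffer count of a qeth device from sysfs."""
    return int((Path(sysfs_root) / bus_id / "buffer_count").read_text().strip())


def persistent_setting(distro, bus_id, count=MAX_BUFFERS, interface="eth0"):
    """Return (config file, line to add) for one of the listed distributions."""
    if count > MAX_BUFFERS:
        raise ValueError("buffer count must not exceed %d" % MAX_BUFFERS)
    if distro == "sles10":
        return ("/etc/sysconfig/hardware/hwcfg-qeth-bus-ccw-%s" % bus_id,
                'QETH_OPTIONS="buffer_count=%d"' % count)
    if distro == "sles11":
        return ("/etc/udev/rules.d/51-qeth-%s.rules" % bus_id,
                'ACTION=="add", SUBSYSTEM=="ccwgroup", KERNEL=="%s", '
                'ATTR{buffer_count}="%d"' % (bus_id, count))
    if distro == "rhel":
        return ("/etc/sysconfig/network-scripts/ifcfg-%s" % interface,
                'OPTIONS="buffer_count=%d"' % count)
    raise ValueError("unknown distribution key: %r" % distro)
```

Calling `read_buffer_count("0.0.1e00")` on the affected guest would return the value that `cat` printed earlier (16), and `persistent_setting("sles11", "0.0.f200")` returns the udev rule to add.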


High Disk response times

Configuration:
– z10, HDS Storage Server (Hyper PAV enabled)
– z/VM, Linux with Oracle Database
– VM controlled Minidisks attached to Linux, LVM on top

Problem description:
– I/O throughput not matching expectations
– Oracle Database shows poor performance because of that
– One LVM volume showing significant stress

Tools used for problem determination:
– dbginfo.sh
– sadc/sar
– z/VM Monitor data


High Disk response times (cont'd)

Observation in Linux

dm-9 0.00 0.00 49.75 0.00 19790.50 0.00 795.56 17.89 15.79 2.01 100.00

Slide callouts on this output: Throughput, Response time, Utilization

Conclusion

– PAV not being utilized

– No Hyper PAV support in SLES10 SP2

– Static PAV not possible with current setup (VM controlled minidisks)

– Need to look for other ways for more parallel I/O

• Link same minidisk multiple times to a guest

• Use smaller minidisks and increase striping in Linux
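The dm-9 line above is easier to read once the eleven values are named. The sketch below parses such a line and flags devices that are saturated; the column names are an assumption (they follow the extended iostat layout with kilobyte units, which is consistent with the numbers shown), and the 95 percent limit is an arbitrary illustrative threshold.

```python
# Assumed column names: extended iostat output in kB (iostat -xk style).
COLUMNS = ("rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util")


def parse_iostat_line(line):
    """Split one device line into (device name, {column: value})."""
    fields = line.split()
    values = [float(v) for v in fields[1:]]
    if len(values) != len(COLUMNS):
        raise ValueError("unexpected number of columns: %r" % line)
    return fields[0], dict(zip(COLUMNS, values))


def saturated(lines, util_limit=95.0):
    """Return (device, await, read kB/s) for devices at or above util_limit percent."""
    result = []
    for line in lines:
        name, stats = parse_iostat_line(line)
        if stats["%util"] >= util_limit:
            result.append((name, stats["await"], stats["rkB/s"]))
    return result
```

Applied to the line above, the device is reported as fully utilized while delivering under 20 MB/s of reads, which is the mismatch the slide points at.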


High Disk response times (cont'd)

Initial and proposed setup

[Figure: physical disk, logical disk(s) in VM and logical disk(s) in Linux for three setups: 1. Initial setup; 2. Link minidisks to guest multiple times; 3. Smaller disks, more stripes. Diagram labels: multipath, striping, LVM / Device mapper, Link multiple times.]


High Disk response times (cont'd)

New Observation in Linux

– Response times stay equal
– Throughput equal
– No PAV being used!!


High Disk response times (cont'd)

Solution: check PAV setup in VM


High Disk response times (cont'd)

Finally:

dasdaen    133.33   0.00   278.11   0.00   12298.51   0.00    88.44    0.79    2.83   1.48   41.29
dasdcbt    161.19   0.00   248.26   0.00   12260.70   0.00    98.77    0.91    3.47   1.88   46.77
dasdfwc    149.75   0.00   266.17   0.00   12374.13   0.00    92.98    1.88    7.07   2.54   67.66
dasdael    162.19   0.00   250.25   0.00   12483.58   0.00    99.77    1.90    7.57   2.86   71.64
dasddyz    134.83   0.00   277.61   0.00   12431.84   0.00    89.56    0.75    2.71   1.68   46.77
dasdaem    151.24   0.00   266.17   0.00   12595.02   0.00    94.64    2.01    7.61   2.82   75.12
dasdcbr    169.65   0.00   242.79   0.00   12386.07   0.00   102.03    1.72    7.05   2.83   68.66
dasdfwd    162.69   0.00   249.25   0.00   12348.26   0.00    99.08    1.92    7.70   2.83   70.65
dasddyy    157.21   0.00   259.70   0.00   12409.95   0.00    95.57    2.58    9.96   3.05   79.10
dasddyx    174.63   0.00   237.81   0.00   12374.13   0.00   104.07    1.76    7.38   2.93   69.65
dasdcbs    144.78   0.00   272.14   0.00   12264.68   0.00    90.14    2.53    9.31   2.89   78.61
dasda        0.00   0.00     0.00   1.00       0.00   3.98     8.00    0.01   10.00   5.00    0.50
dasdq        0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.00    0.00   0.00    0.00
dasdss       0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.00    0.00   0.00    0.00
dasdadx      0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.
dasdawh      0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.00    0.00   0.00    0.00
dasdamk      0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.00    0.00   0.00    0.00
dasdaek    160.70   0.00   255.22   0.00   12382.09   0.00    97.03    2.27    8.95   2.88   73.63
dasdcbq    148.76   0.00   265.67   0.00   12372.14   0.00    93.14    2.14    8.01   2.85   75.62
dasddyw    162.19   0.00   254.23   0.00   12384.08   0.00    97.42    2.12    8.40   2.90   73.63
dasdfwe    146.27   0.00   271.64   0.00   12419.90   0.00    91.44    2.63    9.71   2.80   76.12
dasdfwf    162.19   0.00   249.75   0.00   12455.72   0.00    99.75    0.71    2.83   1.79   44.78
dasdb        0.00   0.00     0.00   0.00       0.00   0.00     0.00    0.00    0.00   0.00    0.00
dm-0         0.00   0.00  1646.77   0.00   49494.53   0.00    60.11    5.08    3.04   0.36   59.70
dm-1         0.00   0.00  1665.17   0.00   49482.59   0.00    59.43   15.00    9.04   0.56   93.53
dm-2         0.00   0.00  1660.70   0.00   49432.84   0.00    59.53   13.46    8.11   0.55   90.55
dm-3         0.00   0.00  1647.26   0.00   49490.55   0.00    60.09   12.05    7.32   0.53   87.56
dm-4         0.00   0.00  1646.77   0.00   49494.53   0.00    60.11    5.08    3.04   0.36   59.70
dm-5         0.00   0.00  1665.17   0.00   49482.59   0.00    59.43   15.00    9.04   0.56   93.53
dm-6         0.00   0.00  1660.70   0.00   49432.84   0.00    59.53   13.46    8.11   0.55   90.55
dm-7         0.00   0.00  1647.26   0.00   49490.55   0.00    60.09   12.06    7.32   0.53   87.56
dm-8         0.00   0.00     0.00   1.99       0.00   7.96     8.00    0.00    0.00   0.00    0.00
dm-9         0.00   0.00   497.51   0.00  197900.50   0.00   795.56    7.89   15.79   2.01  100.00
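With this many devices, a quick summary of how evenly the load is spread helps more than reading every row. The following sketch counts the active devices of one kind and reports their lowest and highest utilization; it assumes that each complete row has a device name plus eleven values and that the last value is the utilization in percent. The function name is illustrative.

```python
def util_spread(lines, prefix="dasd"):
    """Return (active devices, lowest %util, highest %util) for matching devices.

    Skips truncated rows and idle devices (utilization of 0).
    """
    utils = []
    for line in lines:
        fields = line.split()
        if len(fields) != 12 or not fields[0].startswith(prefix):
            continue
        util = float(fields[-1])
        if util > 0.0:
            utils.append(util)
    if not utils:
        return 0, 0.0, 0.0
    return len(utils), min(utils), max(utils)
```

Run over the DASD rows above, a narrow range between the lowest and highest value indicates that the stripes are being served in parallel rather than by a single busy volume.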


Bonding throughput not matching expectations

Configuration:
– SLES10 system, connected via OSA card and using bonding driver

Problem Description:
– Bonding only working with 100 Mbps
– FTP also slow

Tools used for problem determination:
– dbginfo.sh, netperf

Problem Origin:
– ethtool cannot determine line speed correctly because qeth does not report it

Solution:
– Ignore the 100 Mbps message; upgrade to SLES11

Kernel message:
bonding: bond1: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full
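To see which bonds and slave interfaces are affected, the kernel log can be scanned for this warning. The sketch below matches the exact wording quoted above; the group names and the function name are illustrative, and other kernel versions may phrase the message differently.

```python
import re

# Pattern for the bonding warning as quoted on the slide.
WARNING = re.compile(
    r"bonding: (?P<bond>\S+): Warning: failed to get speed and duplex "
    r"from (?P<slave>\S+), assumed to be (?P<speed>\d+)Mb/sec and (?P<duplex>\w+)"
)


def assumed_speeds(log_lines):
    """Return {(bond, slave): assumed speed in Mb/s} for each bonding warning found."""
    found = {}
    for line in log_lines:
        match = WARNING.search(line)
        if match:
            found[(match["bond"], match["slave"])] = int(match["speed"])
    return found
```

An entry in the result means the reported speed is only the driver's fallback assumption, not a measured link speed.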


Availability: Unable to mount file system after LVM changes

Configuration:
– Linux HA cluster with two nodes
– Accessing same DASDs, which are exported via OCFS2

Problem Description:
– Added one node to cluster, brought Logical Volume online
– Unable to mount the file system from any node after that

Tools used for problem determination:
– dbginfo.sh

[Figure: Linux 1 and Linux 2 share dasda to dasdf, which form one Logical Volume carrying OCFS2.]


Availability: Unable to mount file system after LVM changes (cont'd)

Problem Origin:
– LVM metadata was overwritten when adding 3rd node
– e.g. superblock not found

Solution:
– Extract metadata from a running node (/etc/lvm/backup) and write it to disk again
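One common way to write backed-up metadata to disk again is LVM's vgcfgrestore command. The slide does not say which command was used, so the following is a sketch under that assumption; the volume group name passed in is hypothetical, and the backup file is assumed to have been copied from a node whose copy in /etc/lvm/backup is still intact.

```python
import subprocess


def restore_command(vg_name, backup_dir="/etc/lvm/backup", dry_run=True):
    """Build the vgcfgrestore call that rewrites VG metadata from a backup file."""
    command = ["vgcfgrestore", "--file", "%s/%s" % (backup_dir, vg_name)]
    if dry_run:
        command.append("--test")  # LVM test mode: metadata is not updated
    command.append(vg_name)
    return command


def restore(vg_name, dry_run=True):
    """Run the restore; needs root privileges on a cluster node."""
    return subprocess.run(restore_command(vg_name, dry_run=dry_run), check=True)
```

Running it with `dry_run=True` first shows whether LVM accepts the backup file before anything is written. If the physical volume labels themselves were destroyed, they may have to be recreated first (pvcreate with its --uuid and --restorefile options) before the volume group can be restored.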

[Figure: Linux 1, Linux 2 and the added Linux 3 access dasda to dasdf, which form one Logical Volume carrying OCFS2. Diagram label: {pv|vg|lv}create.]


Kernel panic: Low address protection

Configuration:
– z10 only
– High work load
– More likely the more multithreaded applications are running

Problem Description:
– Concurrent access to pages to be removed from the page table

Tools used for problem determination:
– crash/lcrash

Problem Origin:
– Race condition in memory management in firmware

Solution:
– Upgrade to latest firmware!
– Upgrade to latest kernels (fix to be integrated in all supported distributions)