Troubleshooting E1 Kernels


Including:
    Types of Kernel Problems
    Kernel Error Troubleshooting Procedure
    Getting and Using an OS Core File
    OS Tools for Obtaining a Call Stack from Running Code

Copyright Oracle 2011. All rights reserved


Table of Contents

Chapter 1 - Introduction
    Intended Audience
    Structure of this Document
    Related Materials

Chapter 2 - Types of Kernel Problems
    Hung Kernel with Low CPU
    Hung Kernel with High CPU
    Zombie Process / Zombie Kernel
    Out of Memory Kernel / Memory Leak Kernel

Chapter 3 - Kernel Error Troubleshooting Procedure
    General Troubleshooting Philosophy
    Troubleshooting Procedure
    Identify Product Area of Problem
    Interactive Problems
    Enterprise Server Problem / Batch Problem
    Batch Problem

Chapter 4 - Zombie Kernels
    Call Object Kernels (COBK)
    Metadata Kernel

Chapter 5 - Hung Kernels with High CPU

Chapter 6 - Hung Kernels with Low CPU
    Is a Package Deployment Currently Underway?
    Troubleshooting Low-CPU Hung Kernels

Chapter 7 - Out of Memory / Memory Leak Kernels
    Memory Leaks
    Overly-Aggressive Caching
    Troubleshooting Out-of-Memory Issues


5/18/2011

Appendix A Validation and Feedback
    Customer Validation
    Field Validation

Appendix B Glossary

Appendix C Getting and Using an OS Core File
    Windows
    AS400 iSeries
    UNIX
    HP
    LINUX
    AIX
    SUN

Appendix D OS Tools for Obtaining a Call Stack from Running Code
    Unix
    Windows
    AS400


Chapter 1 - Introduction

JD Edwards EnterpriseOne Kernels consist of several types of processes. The process definitions can be found in JDE.INI. On the enterprise server, two process names are registered: JDENET_N and JDENET_K. The JDENET_N process services incoming and outgoing requests for the JDENET_K processes.

The number of JDENET_N processes needed on an EnterpriseOne server can be calculated based on the number of connections and the maximum number of net processes. For a detailed JDENET calculation, please refer to the JD Edwards EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The calculation is described in the section "Understanding the jde.ini File Settings, [JDENET]". For example, the base guides for 8.98 are located here: http://download.oracle.com/docs/cd/E13780_01/jded/html/docset.html

The minimum and maximum numbers of each type of JDENET_K process are defined in JDE.INI. For each type of JDENET_K kernel, there is a section titled [JDENET_KERNEL_DEF#], where # stands for 1, 2, etc. As of tools release 8.97, there are 32 JDENET_KERNEL_DEF definitions. (Two new definitions, JDENET_KERNEL_DEF31 and JDENET_KERNEL_DEF32, were introduced in 8.97, and they correspond to the XMLPublisher and Management Kernels respectively.) For detailed definitions of the JDENET_K processes, please refer to the JD Edwards EnterpriseOne Tools #### System Administration Guide, where #### refers to the tools GA release. The necessary calculations are described in the section "Understanding the jde.ini File Settings, [JDENET_KERNEL_DEF#]".
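As a rough illustration of how the [JDENET_KERNEL_DEF#] sections can be inspected, the following Python sketch lists each kernel definition with its auto-start and maximum process counts. The sample text, the key names (krnlName, numberOfAutoStartProcesses, maxNumberOfProcesses), and all values are assumptions made for this example; compare them with your own JDE.INI and the System Administration Guide before relying on them.

```python
import configparser

# Hypothetical JDE.INI excerpt. Key names and values are assumptions for
# illustration only; check them against your own JDE.INI.
SAMPLE_INI = """
[JDENET_KERNEL_DEF6]
krnlName=CALL OBJECT KERNEL
maxNumberOfProcesses=10
numberOfAutoStartProcesses=2

[JDENET_KERNEL_DEF31]
krnlName=XML PUBLISHER KERNEL
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0
"""


def kernel_definitions(ini_text):
    """Return {section: (name, auto_start, maximum)} for each kernel definition."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case exactly as written in the file
    parser.read_string(ini_text)
    result = {}
    for section in parser.sections():
        if section.startswith("JDENET_KERNEL_DEF"):
            block = parser[section]
            result[section] = (
                block.get("krnlName", ""),
                block.getint("numberOfAutoStartProcesses", fallback=0),
                block.getint("maxNumberOfProcesses", fallback=0),
            )
    return result


if __name__ == "__main__":
    for section, (name, low, high) in kernel_definitions(SAMPLE_INI).items():
        print(f"{section}: {name} (auto-start {low}, max {high})")
```

Reading the file as text and passing it to the function keeps the sketch independent of where JDE.INI lives on a given platform.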

INTENDED AUDIENCE

This document is intended for use by three different groups: Customers, Consultants, and Oracle Global Customer Support (GCS). This document is primarily concerned with debugging kernel issues for tools releases prior to 8.98.3.0. Tools release 8.98.3.0 introduces several new utilities to aid in troubleshooting kernel issues. While the information in this document will still be correct when applied to releases beyond 8.98.3.0, it provides only minimal coverage of the improved troubleshooting utilities and methodologies that are available in newer tools releases.

STRUCTURE OF THIS DOCUMENT

This document provides guidance for self-diagnosing kernel issues based on the pre-KRM methodology (pre-898_2.0). The KRM documentation is available here:
    OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384
    Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

Keep in mind that Oracle updates this document as needed so that it reflects the most current feedback we receive from the field. Therefore, the structure, headings, content, and length of this document are likely to vary with each posted version. To see if the document has been updated since you last downloaded it, compare the date of your version to the date of the version posted on My Oracle Support.

RELATED MATERIALS


We assume that our readers are experienced IT professionals with a good understanding of JD Edwards EnterpriseOne. To take full advantage of the information covered in this document, we recommend that you have a basic understanding of system administration, basic Internet architecture, relational database concepts/SQL, and how to use Oracle JD Edwards applications.

This document is not intended to replace the documentation delivered with the CRM PeopleBooks. We recommend that before you read this document, you read the PIA-related information in the PeopleTools PeopleBooks to ensure that you have a well-rounded understanding of our PIA technology.

Note: Much of the information in this document will eventually be incorporated into subsequent versions of the PeopleBooks.

Many of the fundamental concepts related to PIA are discussed in the following PeopleSoft PeopleBooks:
    PeopleSoft Internet Architecture Administration (PeopleTools|Administration Tools|PeopleSoft Internet Architecture Administration)
    Application Designer (Development Tools|Application Designer)
    Application Messaging (Integration Tools|Application Messaging)
    PeopleCode (Development Tools|PeopleCode Reference)

Customers using tools release 8.98.3.0 or newer should also read the KRM documentation. It describes additional troubleshooting techniques that are available in those releases and that supplement the techniques described in this document.

KRM Docs:
    OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384
    Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT


Chapter 2 - Types of Kernel Problems

This document refers to several specific types of kernel issues that a customer may encounter. The most important categories of kernel problems are explained below.

HUNG KERNEL WITH LOW CPU

Definition: A hung kernel with low CPU refers to a kernel that has stopped functioning correctly but whose process continues to run with very little CPU activity. Generally, this points to a root cause related to deadlock.

HUNG KERNEL WITH HIGH CPU

Definition: A hung kernel with high CPU refers to a kernel that has stopped functioning correctly but whose process continues to run with significant CPU activity. Generally, this points to a root cause related to an infinite loop.

ZOMBIE PROCESS / ZOMBIE KERNEL

Definition: When an E1 server process crashes due to a programming error in some piece of code that it is running, the kernel stops running from the perspective of the OS. The process is flagged as a zombie kernel within the E1 Enterprise Server, where some of the process IPC data is saved in shared memory. The process is listed in Server Manager as a zombie process. There are many potential causes of a zombie process, including but not limited to null or invalid pointer dereferences, heap memory corruption, stack memory corruption, and race conditions.

OUT OF MEMORY KERNEL / MEMORY LEAK KERNEL

Definition: An out of memory kernel is a kernel that has crashed because its memory footprint exceeded the maximum amount it is allowed to allocate. Generally, this points to a memory leak or the caching of overly large quantities of data.


Chapter 3 - Kernel Error Troubleshooting Procedure

GENERAL TROUBLESHOOTING PHILOSOPHY

Oracle JD Edwards EnterpriseOne is a highly complex system with many interacting components. The remainder of this chapter and the chapters that follow group similar problems together into a few broad categories and provide generalized techniques to handle any problem in one of these categories. However, in many cases, a more specific troubleshooting procedure may be necessary for a complex problem/issue.

Whenever a problem is encountered, the very first action on the part of the troubleshooter should be to examine any relevant logfiles. Generally speaking, this means consulting jde_####.log, where #### is the Process ID (PID) of the relevant jdenet_k and/or jdenet_n, and also jas.log. If there is a clear error message at or near the end of any of these logfiles, acting on that message may be more efficient than following the procedure below.

Similarly, the procedure below is designed to guide a troubleshooter until he or she finds something that reveals the root cause of the problem. If, at any point while following this procedure, the troubleshooter finds some clue to the root cause that is too specific to be discussed below, he or she should go off-script and pursue that clue; if this search results in a dead end, the troubleshooter may resume the scripted procedure where he or she left off.
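Checking the end of a kernel logfile is easy to script. The following Python sketch returns the last few lines of jde_####.log for a given PID. The function name, the log directory argument, and the default of 20 lines are assumptions made for this example; the location of the log folder depends on your Enterprise Server installation.

```python
from collections import deque
from pathlib import Path


def tail_of_kernel_log(log_dir, pid, lines=20):
    """Return the last `lines` lines of jde_<pid>.log, or [] if the file is absent."""
    log_path = Path(log_dir) / f"jde_{pid}.log"
    if not log_path.is_file():
        return []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with log_path.open(errors="replace") as handle:
        # A bounded deque keeps only the final lines without loading the whole file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

A troubleshooter would read the returned lines for a clear error message before starting the scripted procedure.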

TROUBLESHOOTING PROCEDURE

IDENTIFY PRODUCT AREA OF PROBLEM

There are several types of issues that can cause an E1 user to receive a time-out message or a Web Exception. The following sections provide a question-and-answer decision tree to help identify the root cause of the problem. First, the E1 admin needs to determine whether the problem is an Interactive Problem, an Enterprise Server Problem, or a Batch Problem.

INTERACTIVE PROBLEMS

General:
1) Did the user receive a Web Exception with the following message: "There was a problem with the server while running business function"?
   Yes: Continue
   No: Go to Transaction Processing
2) Get the jas.log file.
   a. Search within the jas logfile for the phrase "Associated kernel not found", which appears together with the process ID of the COBK (shown as #### in the steps below).
   b. Does the jas logfile contain the above phrase?
      Yes: Continue
      No: Go to Transaction Processing
3) Log in to Server Manager (SM) and go to the Management Dashboard.
4) Select the Enterprise Server from the list of Managed Instances.
5) Select Runtime Metrics->Process Detail.
6) Does the process ID #### exist in the process detail list for the Enterprise Server?
   Yes: Continue
   No: Go to COBK Zombies
7) From SM, does the process ID #### (COBK) have a status of zombie?
   Yes: Continue
   No: Go to Transaction Processing
8) Is the process ID #### (COBK) the only kernel with a status of zombie?
   Yes: Go to COBK Zombies
   No: Go to Multiple COBK Zombies

Transaction Processing:
1) Did the user receive a Transaction Rollback message?
   Yes: Go to Chapter 6 - Hung Kernels with Low CPU
   No: Go to High CPU

High CPU:
1) Determine how much CPU the COBK process is using. Platform-specific instructions follow. (Note that, beginning in Tools Release 8.98.2.0, this information is also available from Server Manager in the Runtime Metrics->Process Detail page for the Enterprise Server.)
   a. Windows
      i. Launch Windows Task Manager.
      ii. On the Performance tab, there is a graph showing overall CPU activity.
      iii. To see CPU activity specific to the COBK process, first select the Processes tab.
      iv. Go to View->Select Columns and check the box for the PID column if it is not already enabled. (The CPU Usage column should already be enabled, but if it is not, check that box as well.) Click OK, and when you return to the table of processes, click on the PID column to sort by that value.
      v. Find the PID of the COBK, and check the value of the CPU Usage for that row.
   b. AS/400 iSeries
      From the terminal, type the command wrkactjob. This will show a table of processes running on that machine. If you know the name of the specific library/subsystem, you may view relevant processes only via the command wrkactjob sbs(), supplying the name of the appropriate library as the sbs() argument.
   c. Unix
      SSH to the machine hosting the Enterprise Server and type the command top -p followed by the Process ID (PID) of the COBK. Consult the %CPU column.
2) Is the COBK to which the user is connected using significant CPU?
   Yes: Go to Chapter 5 - Hung Kernels with High CPU
   No: Continue to Memory Leaks

Memory Leaks:
1) Answer yes if any of the following are true:
   - The process's memory usage keeps increasing. (This can be observed by using any OS-supplied tool, such as Perfmon in Windows or Glance in HP-UX.)
   - The process's amount of allocated memory is already extremely large.
   - An out-of-memory error has been observed.
   Yes: Go to Chapter 7 - Out of Memory / Memory Leak Kernels
   No: Continue to Metadata Kernel
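Whether memory usage "keeps increasing" is a judgment call. One way to make it repeatable is to record the process's memory at regular intervals with an OS tool and test the readings, as in this Python sketch. The function name, the minimum of three readings, and the 10% growth threshold are assumptions chosen for illustration, not values taken from this document.

```python
def memory_keeps_increasing(samples_kb, min_growth_ratio=0.10):
    """True if usage never drops between readings and grows overall.

    samples_kb: memory readings in KB, oldest first.
    min_growth_ratio: assumed threshold for overall growth (10% by default).
    """
    if len(samples_kb) < 3:
        return False  # too few readings to call it a trend
    never_drops = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    grew = samples_kb[-1] >= samples_kb[0] * (1 + min_growth_ratio)
    return never_drops and grew
```

A kernel whose usage rises and then falls back, as normal caching often does, returns False here; only steady growth answers "yes" to the first condition.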

Metadata Kernel:
1) Are there any Metadata Zombie Kernels listed in Server Manager?
   Yes: Go to Chapter 4 - Zombie Kernels :: Metadata Kernel
   No: Go to Chapter 4 - Zombie Kernels :: Call Object Kernels

ENTERPRISE SERVER PROBLEM / BATCH PROBLEM

1) Are there any outstanding requests for jdenet_k or jdenet_n from SM or NetWM? (If this is a UBE problem, or if this is a multi-threaded kernel, answer no.)
   Yes: Go to Outstanding Requests
   No: Continue
2) Are there one or more COBK / RUNBATCH zombies?
   Yes: Go to Chapter 4 - Zombie Kernels, COBK Zombies
   No: Continue
3) Is the process using a significant amount of CPU?
   Yes: Go to Chapter 5 - Hung Kernels with High CPU
   No: Continue
4) Is the process's memory usage continuously and steadily increasing?
   Yes: Go to Chapter 7 - Out of Memory / Memory Leak Kernels
   No: Continue
5) Is the process's memory usage constant but extremely large?
   Yes: Go to Chapter 7 - Out of Memory / Memory Leak Kernels
   No: Continue
6) Is the process otherwise hanging or not responding?
   Yes: Go to Chapter 6 - Hung Kernels with Low CPU
   No: Continue
7) It appears you have a very unusual issue. Contact Oracle GCS with as much information as is available. Especially make sure to include any of the following that are available:
   a) steps to reproduce the issue
   b) jde_####.log for the kernel
   c) jde_####.log for the kernel's jdenet_n parent process
   d) jdedebug_####.log for the kernel
   e) jdedebug_####.log for the kernel's jdenet_n parent process
   f) dumpfile, core file, or callstack
   g) jas log
   h) java logs for enterprise server

Outstanding Requests
1) Is the number of processed requests increasing over time?
   Yes: The kernel is still processing requests, but it is unable to keep up with the rate at which new requests are coming in, resulting in a backlog of queued operations. There may be a misconfiguration, or your hardware resources may be insufficient to meet the demands of your userbase.
   No: Continue
2) Observe the trend in the number of outstanding requests over time. Is the number increasing, decreasing, or constant? Return to Step 2 of Enterprise Server Problem above, but include this information if you end up contacting Oracle GCS.
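The Enterprise Server decision steps above are evaluated strictly in order, so the first "yes" decides where to go next. The following Python sketch encodes steps 1 through 7 as a single function. The function and parameter names are invented for this example, and the caller is assumed to have already answered each question as true or false.

```python
def triage_enterprise_server(outstanding_requests, zombies, high_cpu,
                             memory_growing, memory_large, hanging):
    """Return the next destination, checking steps 1-7 in order."""
    if outstanding_requests:              # step 1
        return "Outstanding Requests"
    if zombies:                           # step 2
        return "Chapter 4 - Zombie Kernels"
    if high_cpu:                          # step 3
        return "Chapter 5 - Hung Kernels with High CPU"
    if memory_growing or memory_large:    # steps 4 and 5
        return "Chapter 7 - Out of Memory / Memory Leak Kernels"
    if hanging:                           # step 6
        return "Chapter 6 - Hung Kernels with Low CPU"
    return "Contact Oracle GCS"           # step 7
```

Because the order matters, a zombie kernel that was also using high CPU is routed to Chapter 4 first, exactly as the numbered steps would route it.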

BATCH PROBLEM

Refer to the corresponding knowledge experts or documentation for the Batch area.


Chapter 4 - Zombie Kernels

There are a myriad of programming errors that can cause a kernel to crash (resulting in a zombie kernel), including but not limited to null or invalid pointer dereferences, heap memory corruption, stack memory corruption, and race conditions. Furthermore, the crash may not occur until some time after the code containing the logic error executes. The main focus of this chapter will be on localizing the crash to a specific business function (BSFN) containing the error. Once the BSFN has been identified, the code can be examined for any programming errors.

CALL OBJECT KERNELS (COBK)

Determining the cause of the zombie status:

COBK Zombies:
1) Open the log file for the COBK/UBE to which the user is connected. Prior to tools release 8.98.3.0, this file will be named jde_####.log, where #### is either the Process ID (Windows and Unix) or the Job ID (iSeries) of the relevant COBK/UBE. From tools release 8.98.3.0 onward, you will be looking for a file with a name of the form jde_*_dmp.log. (This file is created when a kernel crashes, and * represents the PID of the kernel and the timestamp of the crash.)
2) Go to the end of the log file. Is there a call stack?
   Yes: Continue
   No: Go to JDENET_N Parent Process Log
3) Does the call stack show the BSFN?
   Yes: Continue
   No: Go to JDENET_N Parent Process Log
4) Can the issue be reproduced?
   Yes: Go to Reproducing the Issue
   No: Continue to JDENET_N Parent Process Log

JDENET_N Parent Process Log
1) Obtain the jde_####.log where #### is the PID of the parent jdenet_n that spawned the zombie COBK/UBE. If you need instructions on finding the file, consult Obtaining the Logfile for the Parent JDENET_N Process.
2) Search the logfile for the keywords zombie and died. (If there are no hits on either search term, try searching for the Process ID of the COBK/UBE.)
3) Is there a callstack associated with any of the search terms?
   No: Go to Getting an OS Core File
   Yes: Continue
4) Does the call stack contain a BSFN?
   No: Go to Getting an OS Core File
   Yes: Continue
5) Can the issue be reproduced?
   No: Go to Multiple COBK Zombies
   Yes: Continue to Reproducing the Issue

Reproducing the Issue
1) Turn on dynamic debugging before reproducing the issue.
2) Can the issue be reproduced with debugging turned on?
   No: Go to Check Tools Release
   Yes: Continue
3) Reproduce the problem with debugging on.
4) Open the resulting debug logfile (jdedebug_####.log) and scroll to the end of the file.
5) Search upwards for the string BSFNLevel. This should tell you the last BSFN to run before the kernel crashed. Continue to Trouble with a Specific BSFN.

Trouble with a Specific BSFN
1) Is this a customized BSFN?
   Yes: Go to Trouble Involving a Customized BSFN
   No: Continue
2) Is there an ESU for this BSFN?
   Yes: Apply the ESU. Generally, this will resolve the issue. If it persists, go to Contacting Oracle GCS.
   No: Go to Contacting Oracle GCS
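Step 5 of Reproducing the Issue, searching upwards from the end of the debug log for BSFNLevel, amounts to finding the last line in the file that contains that string. A minimal Python sketch follows. The function name is invented for this example, and the code makes no assumption about what else appears on the matching line.

```python
def last_bsfn_level_line(debug_log_path):
    """Return the last line containing 'BSFNLevel', or None if there is none."""
    last_match = None
    # Debug logs can be large, so read line by line and remember the latest hit.
    with open(debug_log_path, errors="replace") as handle:
        for line in handle:
            if "BSFNLevel" in line:
                last_match = line.rstrip("\n")
    return last_match
```

The returned line is the one to read for the name of the last BSFN that ran before the crash.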

Trouble Involving a Customized BSFN
1) Is it possible to try replacing the BSFN with the original code from the release?
   Yes: Continue
   No: Consult with the developers who customized the BSFN for your purposes.
2) Try replacing the BSFN with the original code from the release. Does the problem disappear?
   Yes: Consult with the developers who customized the BSFN for your purposes.
   No: Continue
3) Is there an ESU for this BSFN?
   Yes: Continue
   No: Go to Contacting Oracle GCS
4) When the ESU is applied, does the problem go away?
   Yes: You will need to merge the changes you made to the original BSFN into the version of the BSFN supplied by the ESU.
   No: Go to Contacting Oracle GCS

Contacting Oracle GCS
1) Contact Oracle GCS with as much information as is available. Especially make sure to include any of the following that are available:
   a) the name of the BSFN
   b) whether the BSFN is customized
   c) whether there are any ESUs for the BSFN
   d) what tools release is in use
   e) steps to reproduce the issue
   f) jde_####.log for the kernel
   g) jde_####.log for the kernel's jdenet_n parent process
   h) jdedebug_####.log for the kernel
   i) jdedebug_####.log for the kernel's jdenet_n parent process
   j) dumpfile, core file, or callstack
   k) jas log
   l) java logs for enterprise server

Multiple COBK Zombies:
1) Open the jde_####.log files for all jdenet_n parent processes. There are two ways to do this:
   a) Option 1: If you have easy access to the machine hosting the Enterprise Server.
      i) On the hosting machine, navigate to the log folder for your Enterprise Server.
      ii) Grep (search within the text of these files) for the strings zombie and died.
      iii) Open any files that contain either of these expressions.
   b) Option 2: If you have easy access to the Server Manager for your Enterprise Server.
      i) Log in to SM and go to the Management Dashboard.
      ii) Select the Enterprise Server from the list of Managed Instances.
      iii) Select Runtime Metrics->Process Detail.
      iv) Sort by Process Name.
      v) For any jdenet_n (Network Listener) processes, click the link in the JDELOG File Size column for that row to view the logfile.
2) In each jde_####.log for a jdenet_n, locate the Business Function (BSFN) call stack.
3) Does one BSFN stand out more than the others across the call stacks?
   Yes: Continue
   No: Go to Consult the OS Core File
4) Can the issue be reproduced?
   Yes: Go to Reproducing the Issue
   No: Go to Consult the OS Core File
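Where grep is not available on the hosting machine, the search in Option 1 can be approximated with a short script. The Python sketch below scans every jde_*.log file in a folder for the search terms and reports the matching lines per file. The function name and the case-insensitive matching are assumptions made for this example.

```python
from pathlib import Path


def find_in_logs(log_dir, terms=("zombie", "died")):
    """Map each jde_*.log file name to the lines that contain any search term."""
    lowered = [term.lower() for term in terms]
    hits = {}
    for log_path in sorted(Path(log_dir).glob("jde_*.log")):
        with log_path.open(errors="replace") as handle:
            matching = [line.rstrip("\n") for line in handle
                        if any(term in line.lower() for term in lowered)]
        if matching:
            hits[log_path.name] = matching
    return hits
```

The same function can be pointed at a PID instead of the keywords, for example find_in_logs(folder, terms=("1234",)) with a hypothetical PID of 1234, when looking for the jdenet_n logfile that mentions a particular zombie COBK.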

Check Tools Release
1) Is the customer on a supported release?
   Yes: Continue
   No: The customer should upgrade to a supported release or provide a compelling reason why this is not possible.
2) Is the customer on the current release?
   Yes: Skip to step 4
   No: Continue
3) Can the customer upgrade to the current release?
   Yes: The customer should upgrade to the current release and see if the problem is resolved. If the problem persists, then continue.
   No: Continue
4) Is there a Solution Document or announcements document in the My Oracle Support knowledge base for the customer's issue?
   Yes: Follow the instructions in the document for resolving the issue.
   No: Go to Contacting Oracle GCS

Obtaining the Logfile for the Parent JDENET_N Process
1) If a COBK kernel has crashed, and there is no useful information in its log, there may be helpful information in the logfile for the parent JDENET_N process. This section provides instructions on obtaining the file.
2) Log in to Server Manager and go to the Management Dashboard.
3) Select your Enterprise Server from the list of Managed Instances.
4) Select Runtime Metrics->Process Detail.
5) Is the zombie COBK listed?
   Yes: Continue
   No: The list of zombies has already been cleared. Skip to step 10.
6) Click the name (CALL OBJECT KERNEL) of the COBK that has crashed (the zombie COBK).
7) Under General Information, find Parent Process ID. Is the Parent PID non-zero?
   Yes: Continue
   No: Skip to step 10
8) Return to the Runtime Metrics->Process Detail page, and find the JDENET_N process whose PID matches the Parent PID. Click on the size of its log file (the entry under JDELOG File Size for that row) to view the logfile.
9) Return to JDENET_N Parent Process Log.
10) If there is more than one JDENET_N, you will have to find all JDENET_N logfiles and grep (search within the text of these files) for the PID of the zombie COBK to determine the appropriate logfile. If you have access to the machine hosting the Enterprise Server, the easiest way to do this is to connect to that machine, navigate to the log folder for the Enterprise Server, and search within jde_*.log. Alternatively, the JDENET_N logfiles can be accessed one at a time from the Runtime Metrics->Process Detail page of Server Manager by clicking on the JDELOG File Size for each process that is a Network Listener.
11) Once you have identified the correct logfile, return to JDENET_N Parent Process Log.

Consult the OS Core File

If it has proven impossible to obtain a (useful) callstack from any of the EnterpriseOne log files, it may still be possible to obtain a callstack from an OS-generated core file. If you are unfamiliar with generating and working with OS core dumps on your platform, information on doing so is available in Appendix C Getting and Using an OS Core File. Once you have examined the callstack, if you can determine which BSFN was running at the time of the crash, go to Trouble with a Specific BSFN above. If you cannot isolate a specific BSFN, you should consult Oracle GCS.

METADATA KERNEL

There are historical issues with the Metadata Kernel, particularly in terms of out-of-memory errors and UBE-not-processing errors. It is believed that these issues were all resolved by Tools Release 8.98.2.0. If a customer is experiencing crashes of the Metadata Kernel, the customer should attempt to upgrade to a newer tools release. If the customer is already running a recent release, or an upgrade is not practical, the customer should contact Oracle GCS. It will be helpful to Oracle GCS to have:
   - Any available logfiles for the kernel
   - Steps to reproduce the issue
   - A copy of the Java heap dump (see Enabling a Java Heap Dump)

Enabling a Java Heap Dump

The instructions for enabling a Java heap dump are specific to the JDK and the OS. Because newer and better methods are being created at a rapid pace, it is best to contact Kernel Support or the Dev SMEs for the latest means of creating a Java heap dump.


Chapter 5 - Hung Kernels with High CPU

A non-responsive kernel with high CPU has not crashed per se. While the kernel is no longer performing its required duties, code continues to execute, most likely in some form of infinite loop. The first step in resolving this issue is to identify where in the code the execution is taking place. One can determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack and cstack. These tools are discussed in Appendix D OS Tools for Obtaining a Call Stack from Running Code. Note that customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

It is important to note that, while a high-CPU hung kernel is most likely engaged in some sort of infinite loop, that loop will generally not be contained in the innermost executing function of the callstack you obtain. Rather, the innermost functions are likely to be called from within the infinite loop. Therefore, it is necessary to repeat the process of obtaining a callstack several (five to ten) times. The outermost entries in the callstack will remain the same across all the callstacks collected, while the innermost entries will vary. The infinite loop most likely resides at the level of the innermost function that is common to all of the collected callstacks.
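Comparing five to ten callstacks by eye is error-prone, so the comparison can be scripted once the stacks have been captured. The Python sketch below assumes each callstack has already been reduced to a list of function names ordered from outermost to innermost; the function names in the sample data are hypothetical. It returns the frames shared by every sample, the last of which is the most likely home of the loop.

```python
def common_outer_frames(callstacks):
    """Return the frames shared by every stack, counted from the outermost frame.

    Each stack is a list of function names ordered outermost first.
    """
    if not callstacks:
        return []
    shared = []
    # zip stops at the shortest stack, so frames are compared position by position.
    for frames in zip(*callstacks):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


# Hypothetical samples collected from the same hung kernel.
samples = [
    ["main", "dispatch", "ProcessOrders", "ReadLine", "fetch"],
    ["main", "dispatch", "ProcessOrders", "WriteLine"],
    ["main", "dispatch", "ProcessOrders", "ReadLine", "parse"],
]
suspect = common_outer_frames(samples)[-1]
print(suspect)  # prints "ProcessOrders"
```

In this sample the innermost frames differ from one capture to the next, while ProcessOrders is the deepest function present in all three, so that is where the investigation would start.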


Chapter 6 - Hung Kernels with Low CPU

IS A PACKAGE DEPLOYMENT CURRENTLY UNDERWAY?

When a package is currently being deployed to the Enterprise Server, the kernels temporarily suspend normal operation, mimicking the behavior of a hung kernel with low CPU usage. Generally, package deployments are fairly quick to complete, but under certain circumstances, deployments can require extended time. Once the package deployment completes or times out, normal kernel operations will resume. If a package deployment is not underway, proceed to the next section.

TROUBLESHOOTING LOW-CPU HUNG KERNELS

Similar to a hung kernel with high CPU, a non-responsive kernel with low CPU has also not crashed in the traditional sense. Although the kernel is no longer performing its required duties, the process is still running, most likely stuck in some form of deadlock. A program is said to be in deadlock when two or more operations are each waiting for the other to finish, creating a situation in which neither operation ever completes and both wait forever. Though not technically deadlock, a situation with similar symptoms can arise when a single operation is waiting to obtain a lock on a resource, but that lock was not properly released when a previous operation finished using the resource.

While UBE kernels are not multi-threaded, it is important to note that they are not immune from deadlock. Two separate UBEs executing simultaneously (or, more likely, the same UBE being executed multiple times simultaneously) can compete for locks on shared resources and end up in deadlock.

As in the previous chapter, the first step in resolving this issue is to identify where in the code the execution is. One can determine what code is running by examining a callstack. Since the kernel has not crashed in the sense of encountering a fatal error, there will NOT be a callstack written out to a file. Instead, a callstack can be obtained using OS tools such as procstack and cstack. The tools are discussed in Appendix D OS Tools for Obtaining a Call Stack from Running Code. Note that customers running tools release 8.98.3.0 and beyond can obtain such a callstack through Server Manager.

After obtaining a call stack for all low-CPU hung kernels, the troubleshooter should examine the executing code to identify what resource locks are currently held and what locks are pending. The troubleshooter should then study the remainder of the code to determine where else these locks are obtained / released, and where the logical flaw resides.
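Once the held and pending locks have been written down for each hung process, checking for a circular wait is mechanical. The Python sketch below is one way to do it, under the assumption that each process waits for at most one lock and each lock has at most one holder; the function name, process names, and lock names are invented for illustration. It follows the chain "process waits for lock, lock is held by process" until the chain ends or returns to a process already visited.

```python
def find_deadlock(held, wanted):
    """Return the processes forming a wait cycle, or [] if there is none.

    held:   {process: set of locks it currently holds}
    wanted: {process: the single lock it is waiting for}
    """
    owner = {lock: proc for proc, locks in held.items() for lock in locks}
    for start in wanted:
        path = []
        current = start
        # Follow the chain of "waits for the holder of" links.
        while current in wanted and current not in path:
            path.append(current)
            current = owner.get(wanted[current])
        if current in path:
            return path[path.index(current):]
    return []
```

For example, find_deadlock({"UBE_A": {"L1"}, "UBE_B": {"L2"}}, {"UBE_A": "L2", "UBE_B": "L1"}) reports both processes, because each is waiting for a lock the other holds. A lock that nobody holds any longer but was never released does not appear as a cycle; that is the similar-looking, non-deadlock case described above and has to be found by reading the code.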


Chapter 7 - Out of Memory / Memory Leak Kernels MEMORY LEAKSGenerally speaking, a kernel suffering from a memory leak is discovered after it has crashed. The kernel crashes when a memory allocation attempt fails because the process has reached its maximum allowed memory.1 Sometimes examining the callstack at the time of the crash can indicate where this failed memory allocation occurred, but that may or may not provide useful information. Often, the failed memory allocation is merely the unrelated victim of a programming error elsewhere in the code that prevents no-longer-needed memory from being recycled.

OVERLY-AGGRESSIVE CACHING

An out-of-memory error does not necessarily imply the existence of a memory leak per se. Misuse of the JDB cache is a common source of out-of-memory errors. The JDB cache can be used to store the result of a frequent database query in memory for improved performance. However, if the cache is used too liberally with large tables, free memory will fill up with JDB cache entries. Overly-aggressive caching can be an issue with call object kernels, but it more often causes problems in batch jobs, simply due to the much higher volume of data batch jobs generally manipulate.

If an out-of-memory error is encountered, the troubleshooter should investigate what information is being stored in the JDB cache and verify that no unreasonably large queries are being cached. There are two ways that a query result may be stored in the JDB cache.

1. If the table over which the query is made has been registered in the F98613 table, then the query result will be placed in the JDB cache. To check which tables' queries are being cached through this method, examine the F98613 table.
2. A BSFN can use the JDB_AddTableToDBCache API to have a table's query results added to the cache. To check whether this has happened, debug logging must be enabled, and the debug log should be searched for messages of the form: Entering JDB_AddTableToDBCache (Table =)

Small, unchanging tables such as company constants are prime candidates for caching in the JDB cache. Except in very unusual circumstances, tables containing business data should never be cached.
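The debug-log search described in the second method above can be automated. The following is a minimal sketch, assuming only that the debug log is a plain text file and that the message begins with the text quoted above. The function name `find_cache_registrations` and the log path are hypothetical, and the sketch does not try to parse the table name, since the exact layout of the rest of the message is not given here.

```python
def find_cache_registrations(log_path, marker="Entering JDB_AddTableToDBCache"):
    """Return (line_number, line_text) for every log line containing the marker."""
    matches = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(log_path, "r", errors="replace") as log:
        for line_number, line in enumerate(log, start=1):
            if marker in line:
                matches.append((line_number, line.rstrip("\n")))
    return matches

if __name__ == "__main__":
    import sys
    # Usage (hypothetical path): python find_cache.py /path/to/jdedebug.log
    for number, text in find_cache_registrations(sys.argv[1]):
        print(f"{number}: {text}")
```

Each reported line identifies a point where a BSFN asked for a table to be cached; the tables named on those lines are the ones to review against the guidance above.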

TROUBLESHOOTING OUT-OF-MEMORY ISSUES

If an out-of-memory error does not appear to be related to overly-aggressive caching, the best way to troubleshoot a kernel that is running out of memory is to recreate the issue while using a memory profiling tool such as Purify, Valgrind, or Pex. (Customers using tools release 8.98.3.0 and beyond have the additional options of using BMD or Jade.) Memory profiling tools such as these show the user what memory has been allocated and never freed (reclaimed).

It is important to note that using any of the above profiling tools incurs a heavy performance penalty. If at all possible, this should be done on a non-production server.

[1] Even when there is plentiful total free memory, an attempt to allocate a large block of memory will still fail if there is no adequately large block of contiguous free memory.
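What such a profiler reports can be illustrated in miniature. The sketch below is only an analogy: it uses Python's standard `tracemalloc` module rather than any of the tools named above, and `leaky_operation` is a hypothetical stand-in for code that keeps memory it no longer needs. Comparing a snapshot taken before the operation with one taken after shows which source line is responsible for memory that was allocated and not released.

```python
import tracemalloc

_retained = []  # hypothetical stand-in for memory that is never released

def leaky_operation(count=1000, size=1000):
    """Allocate buffers and keep references to them, so they are never freed."""
    for _ in range(count):
        _retained.append(bytearray(size))

def report_growth(operation, top=3):
    """Run operation and return the source lines whose memory use changed most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    operation()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:top]

if __name__ == "__main__":
    for stat in report_growth(leaky_operation):
        print(stat)
```

The first entry points at the line inside `leaky_operation` that allocates the buffers, which mirrors how a native profiler points at the allocation site of leaked memory.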


Appendix A Validation and Feedback

This section documents the real-world validation that this document has received.

CUSTOMER VALIDATION

Oracle is working with PeopleSoft customers to get feedback and validation on this document. Lessons learned from these customer experiences will be posted here.

FIELD VALIDATION

Oracle Consulting has provided feedback and validation on this document. Additional lessons learned from field experience will be posted here.


Appendix B Glossary

Term                  Definition
BSFN                  Business Function
COBK                  Call Object Kernel
E1                    Oracle JD Edwards EnterpriseOne
ESU                   Electronic Software Update
GCS                   Global Customer Support
MDK                   Metadata Kernel
PID                   Process Identifier (Process ID)
SAR                   Software Action Request
SM                    Server Manager
NetWM                 Network Work Management standalone utility shipped with Enterprise Server that shows queues, outstanding requests, etc.
Callstack             A list of currently executing functions organized hierarchically to show parent (caller) to child (callee) relationships
UBE                   Universal Batch Engine
OS                    Operating System
Infinite Loop         A program is said to be in an infinite loop when it continues to execute the same section of code repeatedly forever.
Deadlock              A program is said to be in deadlock when two or more operations are each waiting for the other to finish, creating a situation where neither operation ever completes and both wait forever. While not technically deadlock, a situation with similar symptoms can arise when a single operation is waiting to obtain a lock on a resource and that lock was not properly released when a previous operation finished with the resource.
Management Dashboard  The entry page to Server Manager (SM). The page has the title Managed Homes and Managed Instances and can be reached by clicking a link in the upper left corner of most SM pages.


Appendix C Getting and Using an OS Core File

In Tools Release 8.98.3.0, several new features were added to streamline the debugging of kernel issues. This document is primarily intended for users of Tools Releases in the 8.98.2 family and earlier. Users of Tools Release 8.98.3 and beyond will find a simpler, platform-independent set of instructions in the KRM documentation, which is available here:

OU Recording: http://oukc.oracle.com/static09/opn/login/?t=checkusercookies|r=-1|c=839298384
Documentation: https://support.oracle.com/CSP/main/article?cmd=show&id=1090646.1&type=NOT

This chapter provides instructions for obtaining a call stack and a dump file on the following platforms:
- Windows Server
- AS400 - iSeries
- UNIX

WINDOWS

Pre-requisite (this is for the Windows platform only)

1) The machine should have Debugging Tools for Windows installed. If it is not installed, download and install it from the following URL:

http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

PS: The above package installs windbg. Note the path of windbg.exe; it will be used to capture the crash dump.

2) Have the customer download this version:

Current Release version 6.11.1.402 - February 6, 2009
Install 32-bit version 6.6.7.5 [15.2 MB]

Steps to install UserDump:

1. Download Site (version 8.1)
   http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF6928AC08570B7E&displaylang=en&displaylang=en
   a) Click Download

   b) Click Run
   c) After the download completes, a new folder, C:\kktools\userdump8.1, is created.

2. Setup (http://support.microsoft.com/kb/241215)
   a) In C:\kktools\userdump8.1\x86, click setup.exe
   b) A folder C:\WINDOWS\system32\kktools is created after the setup.

3. Capturing E1 COBK
   a) Go to Control Panel->Process Dumper

   b) Click New
   c) Enter jdenet_k.exe and click OK
   d) Click Rules
   e) Select Use custom rules
      - Point the Dump file folder to a folder that is easily accessible. Make sure the folder exists.
      - Keep all the settings as seen.
      - Check Kill process after dumping
      - Click OK
   f) Optional (unless instructed):
      1) Check All Exceptions, OR
      2) Select specific exceptions:
         i) Access violation
         ii) Array bounds exceeded
         iii) Stack Overflow
         iv) Invalid handle
         v) Overflow
         vi) Stack Check
   g) Click Apply or OK


Getting Page Heap (Optional)

http://support.microsoft.com/kb/267802

1. From the command line, go to the drive where the Debugging Tools for Windows folder is installed.
2. From the command line:
   >gflags /p /enable runbatch.exe /full
   /full = full page heap; this will use a lot of memory and resources.
3. Targeting specific DLLs:
   >gflags /p /enable jdenet_k.exe /dlls callbsfn.dll cruntime.dll
4. From the GUI interface of GFLAGS:
   a) Go to Start > All Programs
   b) Under Debugging Tools for Windows, select Global Flags
   c) Click on the Image File tab page
   d) Enter an executable name and TAB OUT - DO NOT HIT ENTER

      - Check the options as seen
   e) To remove the settings, follow instructions 4a through 4d but uncheck all options.

AS400 ISERIES

When a C2M1211 or C2M1212 message is generated from a single-level store heap routine, the code checks for a *DTAARA named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists, the program stack is dumped. If the data area does not exist, no dump is performed.

Set up a data area to capture the call stack for the C2M1212 heap error message:

CRTDTAARA DTAARA(QGPL/QC2M1212) TYPE(*CHAR) LEN(1)

Set up a data area to capture the call stack for the C2M1211 heap error message. Setting up the C2M1211 data area requires PTFs SI27412 and SI28640 on V5R4.

CRTDTAARA DTAARA(QGPL/QC2M1211) TYPE(*CHAR) LEN(1)

Once the data area is in place, a spool file named QPRINT is created with dump information for every C2M1211 or C2M1212 message. The C2M1211 output can be read to determine which tools, application, or OS API is causing the memory overwrite; the C2M1212 output may be something IBM can read. The spool file is created for the user running the job that gets the message. For example, if the job getting the C2M1211 or C2M1212 message is a server job or batch job running under userid ABC123, then the spool file is created in the output queue for userid ABC123. Once the spool files containing stack tracebacks are obtained, the data area can be removed and the tracebacks analyzed. To disable the dumps, delete the data area(s). For further information, read "Diagnosing and Debugging Memory Problems: C2M1211 and C2M1212 Messages" on the IBM website.

When a C2M1211 or C2M1212 message is generated from a teraspace heap routine, the code checks for a *DTAARA named QGPL/QC2M1211 or QGPL/QC2M1212. If the data area exists and contains at least 50 characters of data, a 50-character string is retrieved from the data area. If the string within the data area matches one of the following strings, special behavior is triggered:

_C_TS_dump_stack
_C_TS_dump_stack_vfy_heap
_C_TS_dump_stack_vfy_heap_wabort
_C_TS_dump_stack_vfy_heap_wsleep

If the data area does not exist, no dump or heap verification is performed. For further information, read "Enablement for teraspace heap memory managers" on the IBM website.

Here is an example of how to create a data area that causes _C_TS_malloc_debug to be called to verify the heap whenever a C2M1211 or C2M1212 message is generated. On IBM i 6.1 (with PTF SI33945) and IBM i 7.1, the following values can be used for the data area:

CRTDTAARA DTAARA(QGPL/QC2M1211) TYPE(*CHAR) LEN(50) VALUE('_C_TS_dump_stack_vfy_heap_wabort')
CRTDTAARA DTAARA(QGPL/QC2M1212) TYPE(*CHAR) LEN(50) VALUE('_C_TS_dump_stack_vfy_heap_wabort')

This re-validates the heap and aborts the job if it detects memory corruption.


Caution: This should be used in a test environment, as it can start throwing a lot of errors and exceptions, and with the abort option you will see more zombie processes.

UNIX

1) In the JDE.INI config file, under the [JDENET] section, set the following: HandleKrnlSignals=0 and krnlCoreDump=1. This will cause a core file to be dumped, provided the operating system allows it.
2) If the Oracle client is being used to connect to an Oracle database, log in as the oracle userid that owns the Oracle Client install. Add the following line to the $ORACLE_HOME/network/admin/sqlnet.ora file:
   DIAG_SIGHANDLER_ENABLED=false
3) Next, ensure that the operating system allows the creation of core files.
   a) On the command line, type the command: ulimit -c. This shows the current maximum size for core files.
   b) If the size is 0 (or very small), then no core file will be created.
   c) To change the size for the core file, on the command line, type: ulimit -c <size>, where <size> is the size in bytes.
   d) Confirm the ulimit change by rerunning ulimit -c on the command line. If the value from step c above is not displayed, the hard limit may need to be raised by the root user. Changes to the /etc/security/limits
   e) If E1 Enterprise Server services are to be started from the command line using RunOneWorld.sh, start the E1 Enterprise Server services from a login session where ulimit -c was run. The ulimit command has to be run for each new login session on the server that is used to run the RunOneWorld.sh script. If the E1 Enterprise Server needs to be stopped and restarted often, adding the ulimit -c command to the bottom of the $SYSTEM/bin32/toolsenv.sh script will ensure the ulimit command is run each time a new login session is opened.
   f) If the E1 Enterprise Server is to be stopped and restarted remotely via Server Manager, the Server Manager client on the Enterprise Server must be restarted from a login session where ulimit -c has been run. Run the ulimit command, then go to the jde_home/bin directory and run the command: restartAgent
   g) Test that core files are being created properly by selecting a jdenet_k process PID and running the following command: kill -15 <PID>. This should generate a core file.
4) When a core file is generated, it is written to the $SYSTEM/bin32 directory under the same name each time, unless the operating system is actively managing core file names and locations. The server may already be configured to put all core files in a central location. If so, the server may be reconfigured, or the core files can be copied to the $SYSTEM/bin32 directory to be read. The following options generate the core file with a unique name:
   a) On Sun Solaris, put the coreadm command in the user profile:
      coreadm -p core.%f.%p $$
      The above command generates the core file with a name in the following format: core.<executable name>.<PID>
   b) On Linux, log in as root, edit the /etc/sysctl.conf file, and add the following line:
      kernel.core_uses_pid = 1
      Anytime the /etc/sysctl.conf file is changed, the root user must run the following command to make the change effective immediately:
      sysctl -p
      Once this is run, every new login session will get the new settings. Stop and restart E1 following the directions in step 3e or 3f.
   c) If no other core naming options are available, create a script to detect the core file and rename it. See the following sample. Run the script from the $SYSTEM/bin32 directory in the background with nohup using this command:
      nohup rename_core &


rename_core script sample

#!/bin/ksh
# This script just hangs around waiting for a core file to appear, and if
# one does, renames it to a name based on the current date and time.
while true
do
    sleep 30
    if [ -f core ]
    then
        cname="core.$(date +%Y%m%d%H%M%S)"
        echo "renaming core to $cname"
        mv core "$cname"
    fi
done

5) Once the core files are captured, the core files must be opened at the customer site to get the call stack.
6) Which platform is the customer using? HP, LINUX, AIX, or SUN. Follow the matching section below.
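The check in step 3 above (whether the operating system will allow a core file at all) can also be made from inside a process. The sketch below is an illustration only and assumes a UNIX host with Python available. It uses the standard `resource` module, and the function names are hypothetical. Because the limit is inherited from the login session, it reports the value of the session it is run from, which is why steps 3e and 3f insist on starting services from a session where the limit has been raised.

```python
import resource

def describe_core_limit(soft_limit):
    """Classify a core-file size limit as 'disabled', 'limited', or 'unlimited'."""
    if soft_limit == resource.RLIM_INFINITY:
        return "unlimited"
    if soft_limit == 0:
        return "disabled"
    return "limited"

def current_core_limit():
    """Return (description, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return describe_core_limit(soft), soft, hard

if __name__ == "__main__":
    description, soft, hard = current_core_limit()
    print(f"core files: {description} (soft={soft}, hard={hard})")
```

A result of "disabled" corresponds to the case in step 3b where no core file will be created. A soft limit that cannot be raised past the hard limit matches the situation in step 3d where the root user has to intervene.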

HP

1) Do you know which executable created the core file? If yes, go to step 4. If no, continue with step 2.
2) On the command line type: file <core file>
3) The above command gives the executable name to be used in Get HP Callstack (#4).

Get HP Callstack

4) Getting the callstack
   Command line: >gdb <executable> <core file>
   Example: >gdb jdenet_k core.xxxx.12345
   Once the core file is open, do the following:
   >info thread     Lists the threads that were created within the jdenet_k process
   >thread #        Opens thread number #
   >where           Lists the callstack within that thread
   >quit            Exits gdb


LINUX

Linux core files generally must be read on the same server on which they were created. Displaying the core file on a different server can produce incorrect output.

1) Do you know which executable created the core file? If yes, go to step 4. If no, continue with step 2.
2) On the command line type: file <core file>
3) The above command gives the executable name to be used in Get Linux Callstack (#4).

Get Linux Callstack

4) Getting the callstack
   Command line: >gdb <executable> <core file>
   Example: >gdb jdenet_k core.12345
   Once the core file is open, do the following:
   >info thread     Lists the threads that were created within the jdenet_k process
   >thread #        Opens thread number #
   >where           Lists the callstack within that thread
   >quit            Exits gdb

Some optional information can be collected along with the stack:
   show charset       Shows the effective character set when the process crashed.
   show environment   Shows the environment variables when the process crashed.
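The interactive gdb session above can be run unattended, which is convenient when several core files have to be read. The sketch below is one possible wrapper, not part of the E1 tooling. It relies on gdb's `-batch` and `-ex` options and on `thread apply all where` to print every thread's stack in a single pass; the function names are hypothetical.

```python
import shutil
import subprocess

def gdb_stack_command(executable, core_file, extras=False):
    """Build a non-interactive gdb command line that reads a core file."""
    commands = ["info threads", "thread apply all where"]
    if extras:
        # Optional details mentioned above.
        commands += ["show charset", "show environment"]
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]
    return argv + [executable, core_file]

def collect_stacks(executable, core_file):
    """Run gdb and return its output as text; raises if gdb is unavailable."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb is not installed or not on PATH")
    result = subprocess.run(
        gdb_stack_command(executable, core_file, extras=True),
        capture_output=True, text=True, check=False,
    )
    return result.stdout

if __name__ == "__main__":
    # File names taken from the example above.
    print(" ".join(gdb_stack_command("jdenet_k", "core.12345")))
```

As with the interactive procedure, this should be run on the server where the core file was created.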

AIX

1) Do you know which executable created the core file? If yes, go to step 4. If no, continue with step 2.
2) On the command line type: file <core file>
3) The above command gives the executable name to be used in Get AIX Callstack (#4).

Get AIX Callstack

4) Getting the callstack
   Command line: dbx prog
   This brings up the dbx prompt. The user has to press the Enter or Return key several times.
   >where           Lists the callstack


SUN

1) Simply type the following on the command line:
   Command line: pstack <core file>
   This lists the callstack.


Appendix D OS Tools for Obtaining a Call Stack from Running Code

The following procstack / pstack commands are to be used when a process is either hung or running with high CPU usage. Note that they are intended for systems running tools releases earlier than 8.98.3.x; in 8.98.3.x and beyond, the same call stacks can be obtained from CPU Diagnostics in Server Manager (simply press CPU Diagnostics in Server Manager).

Caution: This document may contain information, software, products, or services which are not supported by Oracle Support Services and are being provided as is without warranty. Please refer to the following site for My Oracle Support Terms of Use: https://support.oracle.com/CSP/ui/TermsOfUse.html

UNIX

The following should be run on the various UNIX platforms to dump call stacks:
- HP-UX: /usr/ccs/bin/pstack
- AIX: /usr/bin/procstack
- SUN: /usr/bin/pstack
- LINUX: /usr/bin/pstack

More information on procstack can be found in the IBM documentation for the procstack command.
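The platform list above lends itself to a small helper that picks the right tool for the host it runs on. The following is a sketch under stated assumptions: the tool paths are the ones listed above, the platform names are the values Python's `platform.system()` reports on those systems, and each tool is given the process ID as its only argument. The function name `stack_command` is hypothetical.

```python
import platform
import subprocess

# Tool paths as listed above, keyed by the value of platform.system().
STACK_TOOLS = {
    "HP-UX": "/usr/ccs/bin/pstack",
    "AIX": "/usr/bin/procstack",
    "SunOS": "/usr/bin/pstack",
    "Linux": "/usr/bin/pstack",
}

def stack_command(pid, system=None):
    """Return the argv list that dumps the call stack of a running process."""
    system = system or platform.system()
    tool = STACK_TOOLS.get(system)
    if tool is None:
        raise ValueError(f"no call stack tool known for platform {system!r}")
    return [tool, str(int(pid))]

if __name__ == "__main__":
    import sys
    # Usage: python dump_stack.py <PID of the hung kernel>
    completed = subprocess.run(stack_command(sys.argv[1]),
                               capture_output=True, text=True, check=False)
    print(completed.stdout or completed.stderr)
```

Taking several stack dumps a few seconds apart and comparing them helps distinguish a process that is stuck in one place from one that is still making progress.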

WINDOWS

Use the ADPlus tool to collect the call stack information on the Windows platform. For more information on how to use the tool, see the Microsoft article How to use ADPlus to troubleshoot "hangs" and "crashes".

AS400

The process below can be used to retrieve the program stack for a job with a single thread or the first thread of a multithreaded job.

cmd: ADDLIBLE E900SYS
cmd: SAW | Option 2 Work with Server Processes | Option 3 Display OneWorld Processes


The following creates a spool file containing the program stack (call stack):

cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OUTPUT(*PRINT) OPTION(*PGMSTK)

Without OUTPUT(*PRINT), the same command shows the program stack (call stack) on the display when run interactively:

cmd: DSPJOB JOB(072347/ONEWORLD/JDENET_K) OPTION(*PGMSTK)


1. Create a library and output queue to move the previously generated spool file items.

   cmd: CRTLIB JDETEMP
   cmd: CRTOUTQ JDETEMP/JDETEMP

2. Copy the items found in output queue JDETEMP/JDETEMP (WRKOUTQ JDETEMP/JDETEMP) via iSeries Navigator to a local Windows folder.

   a. Expand the host name node. Log in to the system. Expand the Basic Operations node. Right-click Printer Output, highlight Customize this View, and select Include.


Change the Users value to All. Type JDETEMP/JDETEMP in the Output queues field as shown below.


   b. Highlight all of the spool files found in the right-hand window pane. Press Ctrl-C (to copy) and paste these files into a local Windows Explorer folder, e.g. SND2DENVER.
