HND202 – Using NSD: A Practical Guide
Rob Gearhart – Domino Quality & Serviceability Engineer
Elliott Harden – Field Support Engineer
Agenda
What is NSD?
NSD Major Sections
  - Call Stacks
  - Memcheck
NSD Checklist
Case Studies
What is NSD?
NSD (Notes System Diagnostic) is one of the primary FFDC diagnostics used for Lotus Domino products:
  - Domino/Notes
  - QuickPlace / DomDoc / Domino Workflow
  - Sametime
FFDC = First Failure Data Capture
Used for troubleshooting Notes/Domino and companion products:
  - Crashes
  - Hangs
  - Severe performance problems (not a good tool for mild or moderate performance problems)
NSD applies equally well to the Notes Client and the Domino Server
What is NSD?
Used on all platforms (except Mac)
  - ND6 - iSeries NSD is a different animal (output-wise)
  - ND7 - iSeries NSD matches other platforms more closely
  - In 6.5.4, NSD on iSeries includes Memcheck (must be configured via environment variable)
On Unix – NSD is a shell script (nsd.sh)
  - Memcheck is a separate compiled binary
On W32 – NSD is a compiled binary (nsd.exe)
  - Memcheck is built into nsd.exe
NSD Update Strategy
NSD is one of the primary focuses for Serviceability and has undergone continual improvement
IBM has implemented the NSD Update Strategy to periodically compile the newest improvements for NSD for the most recent existing versions of Domino (ND6 & ND7)
Supported by:
  - The addition of versioning information for NSD
  - Special hotfix installer for NSD (does not conflict with the regular hotfix installer)
  - Periodic re-sync of NSD source from ND8 back into MRs for ND6 & ND7
Current NSD Build is 2382 – equivalent to ND7.0.2 version – back ported for numerous versions of Notes/Domino.
NSD Update Strategy
NSD updates are available to customers through 2 methods:
  - Contact Support or PSM to get the hotfix installer
  OR
  - Download from the website
For more information, see
  - Technote # 1233676 – NSD Fix List and NSD Update Strategy (Fixlist here)
    http://www.ibm.com/support/docview.wss?uid=swg21233676
  - Technote # 4013182 – Updated NSD For Domino Releases (Downloads here)
    http://www.ibm.com/support/docview.wss?uid=swg24013182
NSD Major Sections
Process Information (Call Stacks)
  - tells Support the code path involved in the problem
Memcheck (Domino Memory Objects)
  - tells Support the resources (databases, files, views, users) involved in the problem
System Information
  - tells Support the OS configuration (patches, etc.) – NOT IN THIS LAB
Environment Info
  - tells Support the execution environment (Notes INI, etc.) – NOT IN THIS LAB
NSD - Call Stacks
Dump of thread stacks for all Domino processes
  - including external applications that call into the Notes API
Provides insight into the code path where a crash or hang occurs
NSD - Call Stacks
W32 - for the fatal thread, NSD makes 3 passes
  1) Dumps the complete call stack (divided into "before" and "after" frames)
  2) Granular breakdown of stack frames, showing arguments, return address, and basic register information
  3) Function parameters that are pointers are de-referenced
UNIX – provides one pass for the call stack
  - no breakdown of stack frames
  - register information on limited platforms (AIX, Linux & OS390)
  - on AIX, arguments may show as "???", meaning the code was not compiled with debug levels
W32 Call Stacks
NSD takes three passes of the fatal call stack
  - Pass ONE – dumps stack trace summary, but no frame info
  - Pass TWO – dumps contents of stack frames (along with ASCII equivalent)
  - Pass THREE – de-references pointer parameters, meaning we can see the contents of pointer arguments passed to a function
W32 Call Stacks - Pass ONE
Pass ONE - there are two halves of the call stack
  - First half of the call stack is everything that happened AFTER the fatal (i.e. this is what the thread did to handle the exception).
  - Often you will see 'JVM_FindSignal' near the lower portion. Ignore this; it is not significant.
############################################################
### thread 5/21: [ nIMAP:07b4:06cc]
### FP=07e4e208, PC=77f83786, SP=07e4e1e4, stkbase=07d50000, stksize=262144
############################################################
[ 1] 0x77f83786 ntdll.ZwWaitForSingleObject+11 (560,36ee80,0,601a7c06)
[ 2] 0x77e87837 KERNEL32.WaitForSingleObject+15 (7e4e5a0,77e8ae88,7e4ec0c,0)
@[ 3] 0x601a7046 nnotes._OSFaultCleanup@12+342 (0,0,0,7e4ec0c)
@[ 4] 0x601b07b1 nnotes._OSNTUnhandledExceptionFilter@4+145 (7e4ec0c,7e4ec0c,6ef1ab5,7e4ec0c)
[ 5] 0x1000e596 jvm._JVM_FindSignal@4+180 (7e4ec0c,77ea18a5,7e4ec14,0)
[ 6] 0x77ea8e90 KERNEL32.CloseProfileUserMapping+161 (0,0,0,0)
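The frame lines in a Pass ONE dump like the one above follow a regular shape (frame number, address, module.function+offset, arguments), so module and function names can be pulled out mechanically. The following is a minimal sketch, assuming every frame line looks like the sample; the names `Frame`, `parse_frame` and `modules_on_stack` are illustrative and not part of NSD. The leading `@` that appears on some frames is kept as a flag only, and the offset is kept as text because the slides do not say how either should be interpreted.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Pattern modelled on the sample lines only, e.g.
#   @[ 3] 0x601a7046 nnotes._OSFaultCleanup@12+342 (0,0,0,7e4ec0c)
FRAME_RE = re.compile(
    r"^(?P<mark>@?)\[\s*(?P<number>\d+)\]\s+"
    r"0x(?P<addr>[0-9a-fA-F]+)\s+"
    r"(?P<module>[^.\s]+)\.(?P<func>\S+)\+(?P<offset>\d+)\s+"
    r"\((?P<args>[^)]*)\)\s*$"
)


@dataclass(frozen=True)
class Frame:
    number: int        # position of the frame in the dump
    address: int       # program counter for the frame
    module: str        # e.g. "nnotes", "KERNEL32"
    function: str      # symbol name as printed
    offset: str        # text after '+', radix not assumed
    args: List[str]    # raw argument words as printed
    marked: bool       # True when the line starts with '@'


def parse_frame(line: str) -> Optional[Frame]:
    """Return a Frame for a stack line, or None for any other line."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return Frame(
        number=int(m.group("number")),
        address=int(m.group("addr"), 16),
        module=m.group("module"),
        function=m.group("func"),
        offset=m.group("offset"),
        args=m.group("args").split(","),
        marked=bool(m.group("mark")),
    )


def modules_on_stack(lines: List[str]) -> List[str]:
    """List module names in first-seen order, skipping non-frame lines."""
    seen: List[str] = []
    for line in lines:
        frame = parse_frame(line)
        if frame is not None and frame.module not in seen:
            seen.append(frame.module)
    return seen
```

Feeding the thread dump above to `modules_on_stack` gives a quick answer to the "what modules are on the stack" question before reading individual frames.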
W32 Call Stacks - Pass ONE (cont)
Second half is what you are interested in (the real meat of the crash). Look for module names and function names
Look for the FATAL thread
  - Fatal, panic, halt, access violation
  - Should pair this with console output (e.g. PANIC message)
W32 - NSD demangles C++ functions in the call stack, meaning it provides the class name and function name (in that order)
UNIX Call Stacks
On Unix, there is only one pass (no dump of stack frame contents)
Upper portion of the call stack is the part of the stack that deals with the fatal condition
Look at the portion of the stack below the "fatal", "raise.raise", "signal handler", "abort", or "terminate" line
Which of these lines appears depends on the platform and the nature of the fatal
On Unix, C++ function names are mangled (except zSeries)
NSD - Memcheck
Analyzes Domino Objects
Steps through shared and private pools allocated by Domino Memory Manager
Summarizes Memory Usage
Dumps information about Open Databases, Views and Documents, and Open Files (in a nutshell)
Memory Usage does NOT include externally allocated memory, such as LotusScript, Java, or third-party code
Will need OS diagnostics to determine the total memory usage
NSD - Memcheck
Memcheck can be thought of as 3 major sections
  - Shared Memory
  - Private Memory
  - Resource Usage Summary
Shared Memory Includes
Summary of Shared Pools
  - KEYWORD "Shared Memory" - Total Shared Memory Usage should be around 1.1 GB
  - KEYWORD "Top 10" - largest block type should be UBM (0x82cd) at 750 MB
OS Package Info
  - ND6 KEYWORD "Shared OS Field"
  - ND7 KEYWORD "MM/OS Structure Information"
  - Indicates thread ID of crashing thread and PANIC message (if any)
NSF Package Info
  - KEYWORD "Open Databases" (lists db name, db handle)
  - KEYWORD "Open Documents" (lists noteIDs and database handles)
NIF Package Info
  - KEYWORD "NIF Collections" (lists open views)
  - KEYWORD "NIF Collection Users" (lists [thread] users of those views)
Shared Memory Pool Summary
<@@ ---- Notes Memory Analyzer (memcheck) -> Shared Memory Stats (Time 17:45:55) ---- @@>
TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD %used %free
Static-DPOOL: 35 125829120 116479928 9334512 0 19408 92% 7%
Overall : 35 125829120 116479928 9334512 0 19408 92% 7%
Note – Size shows the overall amount of memory allocated by the Domino MM; Alloc shows what is actually in use (or sub-allocated). You WANT %used to be high.
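The relationship between the SIZE, ALLOC and FREE columns and the two percentage columns can be checked with a few lines of code. This is a minimal sketch, assuming the column order shown in the stats line above; `PoolStats`, `parse_pool_line`, `percent_used` and `percent_free` are illustrative names. Truncating integer division reproduces the percentages printed in the samples in this deck, but whether memcheck computes them exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStats:
    pool_type: str   # e.g. "Static-DPOOL", "Overall"
    pools: int       # the "Count" column
    size: int        # bytes allocated by the Domino memory manager
    alloc: int       # bytes actually in use (sub-allocated)
    free: int        # bytes free inside the pools
    frag: int
    overhead: int


def parse_pool_line(line: str) -> PoolStats:
    """Parse one 'TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD ...' row."""
    name, _, rest = line.partition(":")
    fields = rest.split()
    pools, size, alloc, free, frag, overhead = (int(x) for x in fields[:6])
    return PoolStats(name.strip(), pools, size, alloc, free, frag, overhead)


def percent_used(stats: PoolStats) -> int:
    """ALLOC as a whole-number percentage of SIZE (truncated)."""
    return stats.alloc * 100 // stats.size


def percent_free(stats: PoolStats) -> int:
    """FREE as a whole-number percentage of SIZE (truncated)."""
    return stats.free * 100 // stats.size
```

A low `percent_used` on a large SIZE means the memory manager is holding memory that is not currently sub-allocated, which is why a high %used is the desirable case.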
Top 10 Shared
<@@ ------ Notes Memory Analyzer (memcheck)...-> Top 10 Shared Memory Block Usage ... ------ @@>
BY SIZE
Type TotalSize Handles Typename
-----------------------------------------------------------
0x82cd 637673472 162 BLK_UBMBUFFER
0x8252 20971520 20 BLK_NSF_POOL
0x834a 18350080 18 BLK_GB_CACHE
0x82cc 10511340 161 BLK_UBMBCB
0x824b 9810466 160 BLK_OPENED_NOTE
0x8a03 6760070 1604 BLK_NETBUFFER
0x8311 5242880 5 BLK_NIF_POOL
0x890b 4578420 70 BLK_EXECPOOL
0x8a05 2460000 1 BLK_NET_SESSION_TABLE
0x8a01 2289210 35 BLK_NETPOOL
-----------------------------------------------------------
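A Top 10 listing like the one above is easier to compare against the rules of thumb in this deck once the byte counts are converted to megabytes. The sketch below assumes each data row has exactly the four columns shown (Type, TotalSize, Handles, Typename); `parse_block_usage` and `to_mb` are illustrative names, and a megabyte is taken here as 1024 * 1024 bytes, which is an assumption about the units used in the slides.

```python
import re
from typing import Dict, List

# Data rows look like: 0x82cd 637673472 162 BLK_UBMBUFFER
BLOCK_RE = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\S+)$")


def parse_block_usage(lines: List[str]) -> List[Dict[str, object]]:
    """Return one dict per data row; headers and rulers are skipped."""
    rows = []
    for line in lines:
        m = BLOCK_RE.match(line.strip())
        if m:
            rows.append({
                "type": int(m.group(1), 16),
                "total_size": int(m.group(2)),
                "handles": int(m.group(3)),
                "typename": m.group(4),
            })
    return rows


def to_mb(size_in_bytes: int) -> float:
    """Convert bytes to megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return size_in_bytes / (1024 * 1024)


def report(rows: List[Dict[str, object]]) -> List[str]:
    """Format each row as 'TYPENAME  nnn.n MB in N handles'."""
    return [
        "%-24s %8.1f MB in %d handles"
        % (r["typename"], to_mb(r["total_size"]), r["handles"])
        for r in rows
    ]
```

Dividing `total_size` by `handles` for a row also gives the average block size, which can help tell one large allocation apart from many small ones.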
MM/OS Section
<@@ ------ Notes Memory Analyzer (memcheck) -> MM/OS Structure Information (Time 13:15:45) ------ @@>
Start Time = 12/13/2005 01:15:02 PM
Crash Time = 12/13/2005 01:15:32 PM
Error Message = PANIC: LookupHandle: handle out of range
SharedDPoolSize = 4194304
FaultRecovery = 0x00010013
Cleanup Script Timeout= 300
Crash Limits = 3 crashes in 5 minutes
StaticHang = [ nhttp: 2752: 10]/[ nhttp: 2752: 3500] (0xac0/0xa/0xdac)
ConfigFileSem = ( SEM:#0:0x010d) n=0, wcnt=-1, Users=-1, Owner=[0:0]
FDSem = ( RWSEM:#11:0x410f) rdcnt=-1, refcnt=0 Writer=[0:0], n=11, wcnt=-1
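Most of the MM/OS section is simple "name = value" text, so the PANIC message and the two timestamps can be extracted without reading the whole block by eye. This is a minimal sketch, assuming the field names and the date format shown in the sample above; `parse_key_values` and `seconds_between_start_and_crash` are illustrative names. The code does not interpret what "Start Time" refers to, it only measures the gap between the two printed values.

```python
from datetime import datetime
from typing import Dict, List

# Matches timestamps such as "12/13/2005 01:15:02 PM" (format assumed
# from the sample; other locales may print dates differently).
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_key_values(lines: List[str]) -> Dict[str, str]:
    """Split each 'name = value' line at the first '='."""
    info = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def seconds_between_start_and_crash(info: Dict[str, str]) -> float:
    """Elapsed seconds from the 'Start Time' field to 'Crash Time'."""
    start = datetime.strptime(info["Start Time"], TIME_FORMAT)
    crash = datetime.strptime(info["Crash Time"], TIME_FORMAT)
    return (crash - start).total_seconds()
```

The "Error Message" value from `parse_key_values` is the text to pair with the console output when confirming which PANIC the NSD belongs to.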
Open Databases
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Databases (Time 11:45:21) ------ @@>
D:\Lotus\DominoR65\Data\events4.nsf
Version = 43.0
SizeLimit = 0, WarningThreshold = 0
ReplicaID = 86256ae0:02697903
bContQueue = NSFPool [ 0: 48836]
FDGHandle = 0xf01c0098, RefCnt = 10, Dirty = N
DB Sem = (FRWSEM:0x0244) state=0, nlrdrs=0 Writer=[]
SemContQueue = (RWSEM:#0:0x029d) rdcnt=-1, Writer=[] Owner=[]
By: [ nevent:0e2c: 2] DBH= 3, User=CN=Sithlord/O=SET
By: [ nevent:0e2c: 2] DBH= 16, User=CN=Sithlord/O=SET
By: [ nevent:0e2c: 2] DBH= 18, User=CN=Sithlord/O=SET
By: [ nevent:0e2c: 2] DBH= 20, User=CN=Sithlord/O=SET
Note: edited for clarity – some info is missing
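The "By:" lines in the Open Databases section tie each database handle to a thread and a user, which is the link needed when starting from a thread ID. The sketch below assumes the line shape shown in the edited sample above and treats any line ending in ".nsf" as the start of a new database entry, which is a simplification (other file extensions would need to be added). `handles_by_thread` is an illustrative name.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Sample shape: By: [ nevent:0e2c: 2] DBH= 3, User=CN=Sithlord/O=SET
BY_RE = re.compile(
    r"By:\s*\[\s*(\w+):\s*([0-9a-fA-F]+):\s*([0-9a-fA-F]+)\]"
    r"\s*DBH=\s*(\d+),\s*User=(.*)$"
)

ThreadKey = Tuple[str, str, str]             # (process, pid, thread), as printed
Entry = Tuple[Optional[str], int, str]       # (database path, DBH, user)


def handles_by_thread(lines: List[str]) -> Dict[ThreadKey, List[Entry]]:
    """Group open database handles by the thread that holds them."""
    result: Dict[ThreadKey, List[Entry]] = defaultdict(list)
    current_db: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        m = BY_RE.match(line)
        if m:
            process, pid, tid, dbh, user = m.groups()
            result[(process, pid, tid)].append(
                (current_db, int(dbh), user.strip())
            )
        elif line.lower().endswith(".nsf"):
            # Assumed: a bare path line opens a new database entry.
            current_db = line
    return dict(result)
```

The process, PID and thread fields are kept as printed text, because the samples in this deck show both hexadecimal and decimal forms in different sections.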
Open Documents
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Documents (BLK_OPENED_NOTE): total=352 ...------ @@>
DBH NOTEID HANDLE CLASS FLAGS IsProf #Pools #Items Size Database
531 7330 0x24ff 0x0001 0x0200 Yes 1 4 2984 d:\notedata\drmail\jsmith.nsf
.
Open By: CN=John Smith/O=ACME/C=US
Flags2 = 0x0404
Flags3 = 0x0000
OrigHDB = 531
First Item = [ 9471: 836]
Last Item = [ 9471: 1228]
Non-pool size : 0
Member Pool handle=0x24ff, size=2984
.
Note Classes
Note Type                   Note Class Value
Replication Formula Note    0x0800
Agent Note                  0x0200
ACL Note                    0x0040
View Note                   0x0008
Form Note                   0x0004
Data Note - document        0x0001
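The CLASS column of an Open Documents row can be translated back to a note type with a small lookup built from the table above. This sketch uses only the six values listed on the slide; treating the value as a set of bit flags is an assumption based on every listed value being a single bit, and `describe_note_class` is an illustrative name.

```python
from typing import List

# Values copied from the Note Classes table on this slide only.
NOTE_CLASSES = {
    0x0001: "Data Note - document",
    0x0004: "Form Note",
    0x0008: "View Note",
    0x0040: "ACL Note",
    0x0200: "Agent Note",
    0x0800: "Replication Formula Note",
}


def describe_note_class(value: int) -> List[str]:
    """Return the names of all listed classes whose bit is set in value."""
    names = [name for bit, name in sorted(NOTE_CLASSES.items()) if value & bit]
    return names or ["unknown (0x%04x)" % value]


def class_from_column(text: str) -> List[str]:
    """Decode the CLASS column as printed by memcheck, e.g. '0x0001'."""
    return describe_note_class(int(text, 16))
```

For example, the Open Documents sample row shown earlier has CLASS 0x0001, which decodes to a data note; a row with CLASS 0x0200 would point at an agent note.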
Open Views – NIF Collections
<@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collections (Time 12:48:35) ------ @@>
CollectionVB ViewNoteID UNID OBJID RefCnt Flags Options Corrupt Deleted Temp NS Entries ViewTitle
------------ ---------- -------- ------ ------ ------ -------- ------- ------- ---- --- ------- ------------
[ 0020e005] 1518 1356a8 358710 1 0x0000 00000008 NO NO NO NO 0 MyNotices
CIDB = [ 0253cc05]
CollSem (FRWSEM:0x030b) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[ : 0000]
NumCollations = 2
bCollationBlocks = [ 001e72e5]
bCollation[0] = [ 00117005]
bCollation[1] = [ 001a2205]
CollIndex = [ 00012a09]
Collation 0:BufferSize 26,Items 1,Flags 0
0: Ascending, by KEY, "StartDateTime", summary# 2
CollIndex = [ 00012c09]
Collation 1:BufferSize 26,Items 1,Flags 0
0: Descending, by KEY, "StartDateTime", summary# 2
ResponseIndex [ 0010e4b6]
NoteIDIndex [ 0010e385]
UNIDIndex [ 0010e5e7]
Open Views – NIF Collection Users
<@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collection Users (hash) (Time 12:48:33) ------ @@>
CollUserVB ... CollectionVB Remote OFlags ViewNoteID Data HDB/Full View HDB/Full ... Open By
------------ ... ------------ ------ ------ ---------- ------------- ------------- ... --------------
[ 00239805] ... [ 0023d005] NO 0x0082 786 1219/1874 1219/1874 ... [ nserver:09d8:04ca]
CurrentCollation = 0
[ 0013a805] ... [ 00136005] NO 0x00c2 11122 886/785 886/785 ... [ nserver:09d8:0266]
CurrentCollation = 0
[ 0028d805] ... [ 0020e005] NO 0x00c2 1518 551/1432 551/1432 ... [ nserver:09d8:03b0]
CurrentCollation = 0
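The CollectionVB value is the link between the two NIF sections: it identifies a view in "NIF Collections" and names the view each user has open in "NIF Collection Users". The sketch below joins the two on that value. It assumes the column layout of the samples above, including that the first two bracketed values on a user row are CollUserVB and CollectionVB and the last one is the "Open By" thread; the samples are elided with "...", so real output may need a stricter parser. All function names are illustrative.

```python
import re
from typing import Dict, List, Tuple

BRACKET_RE = re.compile(r"\[\s*([^\]]*?)\s*\]")
COLLECTION_ROW_RE = re.compile(r"^\[\s*([0-9a-fA-F]{8})\]\s+\d+\s+(.*)$")


def parse_collections(lines: List[str]) -> Dict[str, str]:
    """Map CollectionVB -> ViewTitle from 'NIF Collections' rows."""
    titles = {}
    for line in lines:
        m = COLLECTION_ROW_RE.match(line.strip())
        if not m:
            continue
        # After ViewNoteID: UNID OBJID RefCnt Flags Options Corrupt
        # Deleted Temp NS Entries ViewTitle (title may contain spaces).
        rest = m.group(2).split(None, 10)
        if len(rest) == 11:
            titles[m.group(1)] = rest[10]
    return titles


def parse_collection_users(lines: List[str]) -> List[Dict[str, str]]:
    """Extract CollUserVB, CollectionVB and 'Open By' from user rows."""
    users = []
    for line in lines:
        ids = BRACKET_RE.findall(line)
        if len(ids) >= 3 and ":" in ids[-1]:
            users.append({
                "coll_user_vb": ids[0],
                "collection_vb": ids[1],
                "opened_by": ids[-1],
            })
    return users


def views_by_thread(collection_lines: List[str],
                    user_lines: List[str]) -> List[Tuple[str, str]]:
    """Pair each 'Open By' thread with the title of the view it has open."""
    titles = parse_collections(collection_lines)
    return [
        (u["opened_by"], titles.get(u["collection_vb"], "<not listed>"))
        for u in parse_collection_users(user_lines)
    ]
```

With the two samples above, the user row whose CollectionVB is 0020e005 resolves to the MyNotices view; the other rows refer to collections that are not part of the edited sample.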
Private Memory Includes
Info for each process
  - KEYWORD "Attach to process [procname:PID]" - to find the beginning of info for each process
TLS Mapping
  - KEYWORD "TLS Mapping" - shows map of physical thread to virtual thread (it's a Support "thing")
Open Documents
  - KEYWORD "Open Documents" - lists documents opened in private memory (if any)
Private Pools Allocated through Domino Memory Manager
  - KEYWORD "Process Heap Memory" - total size across all private pools (should be below 100 MB, with a few exceptions like server & http)
  - KEYWORD "Top 10" - shows highest block type used
TLS Mapping
------ TLS Mapping -----
NativeTID VirtualTID PrimalTID
[ nSERVER:0514:0510] [ nSERVER:0514:0002] [ nSERVER:0514:0002]
[ nSERVER:0514:0504] [ nSERVER:0514:0004] [ nSERVER:0514:0004]
[ nSERVER:0514:05d4] [ nSERVER:0514:0005] [ nSERVER:0514:0005]
[ nSERVER:0514:0600] [ nSERVER:0514:0006] [ nSERVER:0514:0006]
[ nSERVER:0514:0604] [ nSERVER:0514:0007] [ nSERVER:0514:0007]
[ nSERVER:0514:0608] [ nSERVER:0514:0008] [ nSERVER:0514:0008]
Memcheck prints the virtual thread ID in most places. We need to be able to map this to the physical thread ID from the call stack. The TLS Mapping section does this quite nicely!
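Turning the TLS Mapping table into a lookup makes it straightforward to go from a virtual thread ID seen in memcheck to the native thread ID used in the call stacks. This is a minimal sketch, assuming each data row holds exactly three bracketed IDs in the order NativeTID, VirtualTID, PrimalTID as in the sample above; `parse_tls_mapping` is an illustrative name.

```python
import re
from typing import Dict, List, Tuple

# Matches IDs such as [ nSERVER:0514:0510]; fields are kept as printed.
THREAD_RE = re.compile(r"\[\s*(\w+):\s*([0-9a-fA-F]+):\s*([0-9a-fA-F]+)\]")

ThreadId = Tuple[str, str, str]  # (process, pid, thread)


def parse_tls_mapping(lines: List[str]) -> Dict[ThreadId, ThreadId]:
    """Map each virtual thread ID to its native (physical) thread ID."""
    mapping = {}
    for line in lines:
        ids = THREAD_RE.findall(line)
        if len(ids) == 3:
            native, virtual, _primal = ids
            mapping[virtual] = native
    return mapping
```

Looking up `("nSERVER", "0514", "0002")` in the result for the table above returns `("nSERVER", "0514", "0510")`, the thread to search for in the call stack section.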
Open Documents (Private)
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Documents (BLK_OPENED_NOTE): total=352 ...------ @@>
DBH NOTEID HANDLE CLASS FLAGS IsProf #Pools #Items Size Database
531 7330 0x24ff 0x0001 0x0200 Yes 1 4 2984 d:\notedata\drmail\jsmith.nsf
.
Open By: CN=John Smith/O=ACME/C=US
Flags2 = 0x0404
Flags3 = 0x0000
OrigHDB = 531
First Item = [ 9471: 836]
Last Item = [ 9471: 1228]
Non-pool size : 0
Member Pool handle=0x24ff, size=2984
.
Top 10 Process Memory
<@@ ----- Notes Memory Analyzer (memcheck)...-> Top 10 [ nSERVER: 09d8] Memory Block Usage... ------ @@>
BY SIZE
Type TotalSize Handles Typename
-----------------------------------------------------------
0x4129 20447232 39 BLK_LOCAL
0x0a04 10595772 162 BLK_NET
0x028b 3327954 53 BLK_FOLDERREPLOPS
0x0910 1999180 1126 BLK_SRV_NAMES_LIST
0x093c 1219526 242 BLK_SRV_HASH_TBL
0x024b 1131334 19 BLK_OPENED_NOTE
0x0221 930818 96 BLK_NEW_NOTE
0x0130 562418 1545 BLK_TLA
0x0149 548834 101 BLK_PHTCHUNK
0x030a 319190 1 BLK_LOOKUP_THREAD
-----------------------------------------------------------
Process Heap Memory
<@@ ------ Notes Memory Analyzer (memcheck) -> Process Heap Memory Stats (Time 17:46:00) ------ @@>
TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD %used %free
Static-DPOOL: 12 6291456 3795788 2489080 0 9486 60% 39%
VPOOL : 2 130808 8994 117628 0 4210 6% 89%
POOL : 3 86348 58790 24468 0 3114 68% 28%
Overall : 12 6291456 3653692 2631176 0 16810 58% 41%
Resource Usage Summary
Provides great value (one-stop shopping)
  - KEYWORD "Resource Usage" - easy-to-read summary of open resources listed by process and thread (physical/virtual)
Lists resources in use by each thread
  - Open Databases (name and handle)
  - Open Views (name and handle)
  - Open Documents (noteID)
  - Open Files (OS file descriptor)
Search on the Physical Thread ID in question
  - KEYWORD "VThread [ ] Mapped To: PTHREAD [ ]"
Resources Per Thread
** VThread [ ndiiop:0904: 14]
.Mapped To: PThread [ ndiiop:0904: 1508]
.. using: Primal Thread [ ndiiop:0904: 7]
.. SOBJ: addr=0x4cd7261c, h=0xf010404d t=f982 (PKG_NSF9+386)
.. SOBJ: addr=0x51cd039c, h=0xf0104043 t=c275 (BLK_NSFT)
.. SOBJ: addr=0x4d4aba64, h=0xf010404c t=c130 (BLK_TLA)
.. Database: e:\notes\data\a_dir\archive.nsf
.... DBH: 3401, By: CN=John Smith/OU=New York/O=ACME
.... DBH: 3664, By: CN=John Smith/OU=New York/O=ACME
...... view: hCol=3666, cg=N, noteID=798, (archiveLookup)|archiveLookup
.... DBH: 3665, By: CN=John Smith/OU=New York/O=ACME
.. file: fd: 2388, e:\notes\data\a_dir\software_functions.nsf
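A per-thread block like the one above can be reduced to the databases, views and files a given physical thread was using. The sketch below assumes the prefixes shown in the sample ("** VThread", "Mapped To: PThread", "Database:", "view:", "file:") and ignores every other line; `parse_resource_usage` and `find_by_pthread` are illustrative names.

```python
import re
from typing import Dict, List, Optional

THREAD_RE = re.compile(r"\[\s*(\w+):\s*(\w+):\s*(\w+)\]")


def _thread_id(text: str) -> Optional[str]:
    """Normalise '[ ndiiop:0904: 14]' to 'ndiiop:0904:14'."""
    m = THREAD_RE.search(text)
    return ":".join(m.groups()) if m else None


def parse_resource_usage(lines: List[str]) -> List[Dict[str, object]]:
    """Build one record per '** VThread' block."""
    threads: List[Dict[str, object]] = []
    current = None
    for raw in lines:
        # Drop the leading dots used for indentation in the dump.
        line = raw.lstrip(". ").strip()
        if raw.lstrip().startswith("** VThread"):
            current = {"vthread": _thread_id(raw), "pthread": None,
                       "databases": [], "views": [], "files": []}
            threads.append(current)
        elif current is None:
            continue
        elif line.startswith("Mapped To: PThread"):
            current["pthread"] = _thread_id(line)
        elif line.startswith("Database:"):
            current["databases"].append(line.split(":", 1)[1].strip())
        elif line.startswith("view:"):
            current["views"].append(line.split(":", 1)[1].strip())
        elif line.startswith("file:"):
            # Keep the path that follows 'fd: <number>,'.
            current["files"].append(line.partition(",")[2].strip())
    return threads


def find_by_pthread(threads: List[Dict[str, object]],
                    pthread: str) -> List[Dict[str, object]]:
    """Return the records mapped to a physical thread, e.g. 'ndiiop:0904:1508'."""
    return [t for t in threads if t["pthread"] == pthread]
```

Running `find_by_pthread` with the physical thread ID taken from the crashing call stack gives the short list of resources to examine first.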
Resource Usage Summary
Allows Support to quickly isolate any potential patterns regarding databases and/or documents
CAVEATS
  - Problem may not be directly attributable to a specific database/view/document
  - Resource Usage is not guaranteed to be a silver bullet
  - Just because a crash occurs on a database or document does NOT mean it's the database/document's fault (don't assume database corruption)
  - Never look at an NSD in a vacuum - must know the nature of the problem first, then use NSD to fill in the gaps
  - Must use insight from the call stacks and other key factors to know whether this will determine a pattern
NSD Checklist - Call Stacks
Find the call stack for the crashing process/thread
  - KEYWORDs FATAL, CHILD_DIED, HALT, PANIC
  - What is the physical thread ID? (you will need this later)
  - What was the crash point? (you will need Support's assistance)
  - What modules are on the stack? (NNOTES, NLSXBE, etc.)
  - Is third-party code involved? (LSX, DSAPI, RDBMS Kernel, etc.)
Symptoms
  - What is the flow of events?
  - When did the crashes start?
  - What changed since it began?
  - How does the crash manifest itself? (access violation, PANIC, etc.)
  - What do users experience?
  - How many servers/users are affected?
  - What do OS diagnostics show? (CPU, disk, memory, etc.)
  - You know the drill
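The first checklist step, searching the NSD for the crash keywords, can be scripted so that every hit is listed with its line number. This is a minimal sketch using the four keywords named on this slide; `find_keyword_lines` is an illustrative name, and matching without regard to case is an assumption made here to avoid missing variants.

```python
from typing import Iterable, List, Tuple

# Keywords taken from the checklist on this slide.
KEYWORDS = ("FATAL", "CHILD_DIED", "HALT", "PANIC")


def find_keyword_lines(
    lines: Iterable[str],
    keywords: Tuple[str, ...] = KEYWORDS,
) -> List[Tuple[int, List[str], str]]:
    """Return (line number, matched keywords, line text) for each hit."""
    hits = []
    for number, line in enumerate(lines, start=1):
        upper = line.upper()
        matched = [k for k in keywords if k in upper]
        if matched:
            hits.append((number, matched, line.rstrip()))
    return hits
```

Substring matching will also flag unrelated text that happens to contain a keyword, so the hits are a starting list to review rather than a verdict.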
NSD Checklist - Memcheck
Shared Memory
  - Total Shared Memory Usage
  - Top 10 Shared Block Usage
  - Open Databases
  - Open Documents
Private (look at appropriate process)
  - Process Heap Memory
  - Top 10 Process Block Usage
  - TLS Mapping (if needed)
Resource Usage (find appropriate Physical Thread)
  - Databases/Views/Documents
Agenda
Scenario 1 – Agent Manager Crash (find agent note)
Scenario 2 – Domino Server Crash (find out why)
Scenario 3 – HTTP Crash (examine call stacks)
Scenario 4 – HTTP Performance Problem (look at SEMDEBUG)
Scenario 5 (BONUS) – Domino Server crash (find out why)
Discussion/Questions
NSD Knowledge Collection
  - Technote # 7007508 – Knowledge Collection: NSD for ND 6 & 7
    http://www.ibm.com/support/docview.wss?uid=swg27007508
For NSD Update Strategy, see
  - Technote # 1233676 – NSD Fix List and NSD Update Strategy (Fixlist here)
    http://www.ibm.com/support/docview.wss?uid=swg21233676
  - Technote # 4013182 – Updated NSD For Domino Releases (Downloads here)
    http://www.ibm.com/support/docview.wss?uid=swg24013182