AdminCamp June 2012, in Gelsenkirchen – Daniel Nashed, Nash!Com
Hands-On Lotus Domain Monitoring
About the presenter
● Daniel Nashed – Nash!Com - IBM/Lotus Advanced Business Partner/ISV
– Member of The Penumbra group
● An international consortium of selected Business Partners pooling their talent and resources
– focused on Cross-Platform C-API, Domino® Infrastructure, Lotus Traveler, Administration, Integration and Troubleshooting
– Platform Focus: W32, xLinux, zLinux, AIX® and Solaris®
– Regular speaker at International Conferences
– DNUG Enthusiast IBM Lotus® Domino Infrastructure
– Author of the Domino Start Script for Linux and Unix
● http://www.nashcom.de
Agenda
● Introduction / other Log Options
● Domino Domain Monitoring (DDM)
● Hands On DDM
● Q&A – Any time
Important Server Setting
● notes.ini EVENT_POOL_SIZE=41943040
– Set this value on every server
– Not just if you use DDM
– But it is even more important when you use DDM probes
– Size is in bytes (41943040 bytes = 40 MB)
– The maximum size is 100 MB
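The recommended value is simply 40 MB expressed in bytes. A quick arithmetic check (Python, purely illustrative):

```python
# EVENT_POOL_SIZE is specified in bytes; the recommended value 41943040
# is exactly 40 MB (40 * 1024 * 1024).
recommended_bytes = 40 * 1024 * 1024
print(recommended_bytes)  # 41943040

# The documented ceiling is 100 MB
max_bytes = 100 * 1024 * 1024
assert recommended_bytes <= max_bytes
```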
Log.nsf and Events
● Domino generates errors and messages
– Error … “Object Store Manager: File does not exist”
– Message … “Index update process started”
● All logged errors and messages are raised as events
– Log.nsf records errors and messages sequentially
● Log.nsf
– Pro: Great for maintaining a record of all errors and messages
– Pro: Great for debugging, if you know what you are looking for
– Con: A lot of unfiltered data
Event Monitoring
● Event Monitoring was introduced in Notes 4.5
● Events and the Event task are the foundation beneath all logging
– Events4.nsf contains all log messages, including categorization and severity
– Event Monitoring can generate its own events, such as statistic events
– Event Handlers are still useful for automation
● Multiple trigger types: Log to database, Run an Agent, Run Task, SNMP Trap, Send via Mail, Log to system console, ...
– Most of the functionality is now available in DDM
● But events are still the base for the log messages consolidated in DDM
– DDM configuration is in events4.nsf
Domino Domain Monitoring (DDM)
● Comprehensive Monitoring
– ddm.nsf contains focused monitoring results
– Detailed error messages including names of resources
– Suggestions for problem resolution, including actions!
● Based on the foundation built by event monitoring
– Event categorization and severity are defined in events4.nsf
● Additional built-in probes in the code
– Replication (detailed reporting for failing replication)
– Agent Manager (long running agents, high memory/CPU usage, ...)
● You can also leverage statistics and platform statistics
Log/Event Monitoring vs. DDM
● As of Domino 7, all events are cached and tracked by DDM
– DDM.NSF is the on-disk version and a superset of the event cache
– DDM.NSF records a set of associated events into a single report document (not just a single event)
● Pro
– great for recording problem context
– great for tracking and organizing problems
– great for exposing knowledge about problems
– great for resolving problems
● Con
– lousy for tracking the sequential order of problems
– lousy for extracting ad hoc data from reports (use statistics and statrep.nsf for ad hoc data)
The Five C’s of DDM
● Consolidate
– Track multiple, related errors in a single event report
● Check
– Assess the health of functional areas
● Collect
– Access all domain-wide reports from a single database
● Correlate
– Locate related reports from other servers
● Correct
– Access a knowledge base of explanations and recommendations
– Click to resolve issues
DDM Enhancements in Domino 8
● DDM was introduced in Domino 7
– Great options were already available
● DDM 8 is a fit-and-finish release of what was introduced in D7
– More and enhanced views
– More options for corrective actions
– Common actions
– More probes
– More possible solutions and corrections
– ... and many more details
● More features planned for next releases
Event Reports
● Like log.nsf, all events are recorded into ddm.nsf
● Unlike log.nsf, each set of events has its own report document
● Each set of events has an associated unique ID (a PUID)
– Every event with the same PUID is mapped into the same ddm.nsf report document
– A PUID is built using the enhanced event context data
● Target database, target user, target agent, etc.
● DDM.nsf Reports are presented in a variety of views
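The PUID mechanism can be pictured as grouping events by a key derived from their context. This is an illustrative sketch only; the real PUID algorithm is internal to DDM, and the field names below are my own:

```python
from collections import defaultdict

def puid(event):
    # Hypothetical context fields; the actual PUID inputs differ
    return (event["type"], event["target_db"], event["target_user"])

def group_into_reports(events):
    # Every event with the same PUID lands in the same report document
    reports = defaultdict(list)
    for ev in events:
        reports[puid(ev)].append(ev)
    return reports

events = [
    {"type": "Replication", "target_db": "names.nsf", "target_user": None},
    {"type": "Replication", "target_db": "names.nsf", "target_user": None},
    {"type": "Agent", "target_db": "mail.nsf", "target_user": "CN=Jane"},
]
reports = group_into_reports(events)
print(len(reports))  # 2 report documents for 3 events
```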
DDM Architecture
Simple and Enhanced Events
● Events can have one or two associated errors
– When there are two errors, the “root cause” is the second error
– 1 error: Event: Could not locate view 'xyz'
– 2 errors: Object Store Manager: File does not exist
● Simple events are legacy events which include these attributes …
– Time stamp, originating server, resource strings and IDs, severities, types
● Enhanced events also include the following attributes …
– New types, subtypes, target server
● Enhanced events include one or more of these attributes
– Target database, target user, target UNID, extra target data (i.e. string blobs), rich text, call stacks, correlation codes, and the NOTEID of a config doc responsible for firing the event
– This “target” data is some of the new event “context”
Event Report State
● Open = The event has a non-Normal severity and has not been manually closed
– Auto-Clearing Events = Some events can be automatically closed by DDM if the problem is reported as resolved with a Normal severity event
– Many events must be closed manually
– Simple events never auto-close (only enhanced events do)
● Closed = The event has a Normal severity or has been manually closed
– Automatically reopened if a severity change is detected
● Permanently Closed = Used by admins to say “I don’t care about this problem, keep it out of my sight.”
– Not automatically reopened on a Normal severity, but the problem continues to be tracked and recorded in its reports
DDM Probe Configuration
● Probes are used to automatically check your environment
● There are different kinds of probe types
– Probes are already defined in the configuration
– Most probes just need to be enabled
● Some probes need additional configuration
DDM Probe Configuration
● Three flavors of probes (some probes have more than one):
– Scheduled probes run according to a configurable schedule; defaults are supplied
– Embedded probes run inside a feature area and catch problems/issues as they occur
– Listening probes run when particular error codes are logged
● Probe configuration is quick and flexible:
– Default probe configuration documents are supplied with “out-of-the-box” values
– The “special target servers” concept allows out-of-the-box probing without having to specify named servers
– Thresholds and result content are highly customizable
● what the probe will actually check
● probe sensitivity (when it will generate an event)
● what severity of event the probe will generate
– Schedule is highly customizable for schedulable probes
– Probes can be enabled/disabled per server/server group
Probe Categories
● There are different probe categories with various individual probes
DDM Filters
● Event filters control what and how much information is reported to ddm.nsf. Why?
– The initial flood of events is striking, and many of them have always been there
– Over time, administrators will want to “adjust the volume”, seeing more or less of certain events
– Enhanced and simple events can be filtered.
– Filters can target specific servers and filter out events by event type/area and severity
● A default filter is supplied and enabled for simple events to reduce the initial “noise”.
Server Collection Hierarchy
● Used to control replication of DDM events in your environment– Based on automated selective replication formulas
● You can have multiple Server Collection Hierarchies based on your Notes Server Topology and Administration Topology
● The collecting server will receive all DDM events for its servers
– Helps to further focus information into a single instance of DDM
● Easy to set up once you are familiar with the configuration dialogs
Advanced Event Handlers
● Run an Agent when events occur
– An in-memory copy of the event document is passed in the document context
● It is not written into statrep after the agent is executed
● Sample Agent
Dim session As New NotesSession
Dim doc As NotesDocument
Dim eventtext As String
Set doc = session.DocumentContext
Print "EventText : >" + doc.EventText(0) + "<"
Print "ErrorCode : >" + doc.ErrorCode(0) + "<"
Advanced Event Handlers
● Run a Program – Parameters are passed to the command-line
● Sample Parameter List– All options enabled in event handler
Process name: >C:\Lotus\Domino85\nnsharg.EXE<
NshArg: Arg[0]: >C:\Lotus\Domino85\nnsharg.EXE<
NshArg: Arg[1]: >7<
NshArg: Arg[2]: >3<
NshArg: Arg[3]: >02/25/2010 12:00:00 CET<
NshArg: Arg[4]: >Event Monitor0x33C5<
NshArg: Arg[5]: >CN=nsh-d85-win-01/OU=Srv/O=NashComLab<
NshArg: Arg[6]: >Access to server nsh-d85-win-01/Srv/NashComLab is slow. <
NshArg: Arg[7]: >CN=nsh-d85-win-01/OU=Srv/O=NashComLab<
NshArg: Arg[8]: >NAMES.NSF<
OSTerm> Terminating, name: C:\Lotus\Domino85\nnsharg.EXE
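A program started by an event handler receives the enabled options as positional command-line arguments, in the order of the sample list above. A minimal handler sketch in Python; the field labels are my own guesses at each position, not official names:

```python
import sys

# Hypothetical labels for the positional arguments, following the order
# of the sample parameter list (type/severity appear as numeric codes).
FIELDS = ["program", "type_code", "severity_code", "timestamp",
          "originator", "server", "event_text", "target_server", "database"]

def parse_event_args(argv):
    # Map positional command-line arguments to labeled fields
    return dict(zip(FIELDS, argv))

if __name__ == "__main__":
    event = parse_event_args(sys.argv)
    print(event.get("event_text", ""))
```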
Lotus Domino Statistics
● Valuable resource of information
– Combines Domino statistics and platform statistics
– Platform statistics depend on the OS platform but are largely unified between platforms
– Check events4.nsf for a description of each platform stat per platform
● You should collect server stats at least every 15 minutes (the default is 90 minutes)
– Enable the Collect task and configure settings in events4.nsf
– Configure statistic events for important stats with the right thresholds
– Keep long term data to compare current and historic data
– You can also leverage SNMP to query stats
● Limitation: Only works for 1 partition per OS instance on all platforms
Top Statistics – NSF Buffer Pool
● Used for Buffering Database I/O
● Check Server Stats
– Database.Database.BufferPool.Maximum.Megabytes
– Database.Database.BufferPool.PercentReadsInBuffer
● Interpretation
– PercentReadsInBuffer below 90% is bad; above 98% is perfect
● Tune: notes.ini NSF_Buffer_Pool_Size_MB=n (in MB)
– Default: 512 MB
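The interpretation scale above can be expressed as a small classifier (Python; thresholds taken from the slide, function name mine):

```python
def buffer_pool_rating(percent_reads_in_buffer):
    # Thresholds from the slide: below 90% is bad, above 98% is perfect
    if percent_reads_in_buffer < 90.0:
        return "bad"
    if percent_reads_in_buffer > 98.0:
        return "perfect"
    return "acceptable"

print(buffer_pool_rating(85.0))   # bad
print(buffer_pool_rating(99.2))   # perfect
```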
Top Statistics – NSF Cache
● Used for Caching Open Databases
● Check Server Stats– Database.DbCache.HighWaterMark
– Database.DbCache.CurrentEntries
– Database.DbCache.MaxEntries
– Database.DbCache.OvercrowdingRejections
● Interpretation– Good = HighWaterMark < MaxEntries
– Good = 0 OvercrowdingRejections
● Tune: notes.ini NSF_DbCache_MaxEntries=n
– Default: NSF buffer pool size (in MB) multiplied by 3
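The default derivation above is a simple multiplication; a quick check (Python, purely illustrative):

```python
def default_dbcache_max_entries(nsf_buffer_pool_size_mb):
    # Default per the slide: buffer pool size (in MB) multiplied by 3
    return nsf_buffer_pool_size_mb * 3

print(default_dbcache_max_entries(512))  # 1536
```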
Top Statistics – (Cluster) Replication
● Use to check Cluster Replicator Performance
● Check Server Stats– Replica.Cluster.Failed
– Replica.Cluster.SecondsOnQueue
– Replica.Cluster.WorkQueueDepth
● Interpretation
– SecondsOnQueue below 10 is perfect; above 15 is bad
– WorkQueueDepth below 10 is perfect; above 15 is bad
● Tune: – Add more cluster replicators
– optimize cluster server usage (e.g. Split active users between cluster mates)
Top Statistics – Transactions
● Use for Indication of Server Load
● Check Server Stats– Server.Trans.PerMinute
● Interpretation:
– Trans.PerMinute per user above 30 is heavy; below 10 is light
● Tune: Analyze Heavy users and try to avoid load
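The per-user interpretation above can be sketched as a simple classifier (Python; thresholds from the slide, function name mine):

```python
def load_rating(trans_per_minute, active_users):
    # Thresholds from the slide, applied per user:
    # above 30 transactions/minute is heavy, below 10 is light
    per_user = trans_per_minute / active_users
    if per_user > 30:
        return "heavy"
    if per_user < 10:
        return "light"
    return "normal"

print(load_rating(12000, 300))  # 40 per user -> heavy
print(load_rating(1500, 300))   # 5 per user -> light
```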
Top Statistics – Concurrent Tasks
● Use to check simultaneous active database connections
● Check Server Stats– Server.ConcurrentTasks
– Server.ConcurrentTasks.Waiting
● Interpretation– Waiting should be ZERO
● Tune:
– Server_Pool_Tasks = n (e.g. 80)
– Server_Max_Concurrent_Trans = m (e.g. Server_Pool_Tasks * number of ports)
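The rule of thumb above is a simple product; a quick check (Python, purely illustrative):

```python
def suggested_max_concurrent_trans(server_pool_tasks, num_ports):
    # Rule of thumb from the slide: Server_Pool_Tasks times number of ports
    return server_pool_tasks * num_ports

print(suggested_max_concurrent_trans(80, 2))  # 160
```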
Top Statistics – Platform Memory
● Used to check memory allocated via memory pools and sub-allocations
● Check Server Stats– Mem.Allocated
– Mem.Allocated.Process
– Mem.Allocated.Shared
● Interpretation– Memory Leaks when increasing over days / weeks
● Tune
– Via several parameters (buffer pool, cache, namelookup, …)
● Note
– Be careful interpreting this statistic … not all memory might be included
Top Statistics – Platform CPU
● Used to check CPU Utilization on Server
● Check Server Stats– Platform.System.PctCombinedCpuUtil
– Platform.System.PctTotalPrivilegedCpuUtil
– Platform.System.PctTotalUserCpuUtil
● Interpretation:
– CombinedCpuUtil below 90% is OK; above 90% is too high
● Tune– Many Root Causes Possible
Top Statistics – Paging File
● Use to check Server Memory Swapping to Disk
● Check Server Stats– Platform.PagingFile.Total.PctUtil
● Interpretation
– PctUtil.Avg below 10% is OK; above 10% is bad
● Tune– OS Level tuning, Check Memory
Top Statistics – Platform Disk
● Used to check Disk Performance
● Check Server Stats– Platform.LogicalDisk.1.AvgQueueLen
– Platform.LogicalDisk.1.PctUtil
● Interpretation
– AvgQueueLen below 2 is good; above 5 is bad
– Good = PctUtil < 80%
● Tune
– Via several parameters (buffer pool, cache, namelookup, …)
● Note– Platform.LogicalDisk.1.AssignedName=C points to the disk
Q&A
● I hope you enjoyed the workshop
● Questions now or later?
● Contact– [email protected]
– http://www.nashcom.de
– http://blog.nashcom.de