AdminCamp June 2012, in Gelsenkirchen – Daniel Nashed, Nash!Com
Hands-On Lotus Domain Monitoring
About the presenter
● Daniel Nashed – Nash!Com - IBM/Lotus Advanced Business Partner/ISV
– Member of The Penumbra group
● An international consortium of selected Business Partners pooling their talent and resources
– focused on Cross-Platform C-API, Domino® Infrastructure, Lotus Traveler, Administration, Integration and Troubleshooting
– Platform Focus: W32, xLinux, zLinux, AIX® and Solaris®
– Regular speaker at International Conferences
– DNUG Enthusiast IBM Lotus® Domino Infrastructure
– Author of the Domino Start Script for Linux and Unix
● http://www.nashcom.de
Agenda
● Introduction / other Log Options
● Domino Domain Monitoring (DDM)
● Hands On DDM
● Q&A – Any time
Important Server Setting
● notes.ini EVENT_POOL_SIZE=41943040
– Set this value on every server
– Not just if you use DDM
– But it is even more important when you use DDM probes
– Size is in bytes (41943040 bytes = 40 MB)
– The maximum size is 100 MB
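The recommended value is simply 40 MB expressed in bytes. A quick arithmetic check (Python, purely illustrative):

```python
# EVENT_POOL_SIZE is specified in bytes; the recommended value 41943040
# is exactly 40 MB (40 * 1024 * 1024).
recommended_bytes = 40 * 1024 * 1024
print(recommended_bytes)  # 41943040

# The documented ceiling is 100 MB
max_bytes = 100 * 1024 * 1024
assert recommended_bytes <= max_bytes
```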
Log.nsf and Events
● Domino generates errors and messages
– Error … “Object Store Manager: File does not exist”
– Message … “Index update process started”
● All logged errors and messages are raised as events
– Log.nsf records errors and messages sequentially
● Log.nsf
– Pro: Great for maintaining a record of all errors and messages
– Pro: Great for debugging, if you know what you are looking for
– Con: A lot of unfiltered data
Event Monitoring
● Event Monitoring was introduced in Notes 4.5
● Events and the Event task are the foundation beneath all logging
– Events4.nsf contains all log messages, including categorization and severity
– Event Monitoring can generate its own events, such as statistic events
– Event Handlers are still useful for automation
● Multiple trigger types: Log to database, Run an Agent, Run Task, SNMP Trap, Send via Mail, Log to system console, ...
– Most of the functionality is now available in DDM
● But events are still the base for the log messages consolidated in DDM
– DDM configuration is in events4.nsf
Domino Domain Monitoring (DDM)
● Comprehensive Monitoring
– ddm.nsf contains focused monitoring results
– Detailed error messages including names of resources
– Suggestions for problem resolution, including actions!
● Based on the foundation built by event monitoring
– Event categorization and severity are defined in events4.nsf
● Additional built-in probes in the code
– Replication (detailed reporting for failing replication)
– Agent Manager (long running agents, high memory/CPU usage, ...)
● You can also leverage statistics and platform statistics
Log/Event Monitoring vs. DDM
● As of Domino 7, all events are cached and tracked by DDM
– DDM.NSF is the on-disk version and a superset of the event cache
– DDM.NSF records a set of associated events into a single report document (not just a single event)
● Pro
– great for recording problem context
– great for tracking and organizing problems
– great for exposing knowledge about problems
– great for resolving problems
● Con
– lousy for tracking the sequential order of problems
– lousy for extracting ad hoc data from reports (use statistics and statrep.nsf for ad hoc data)
The Five C’s of DDM
● Consolidate
– Track multiple, related errors in a single event report
● Check
– Assess the health of functional areas
● Collect
– Access all domain-wide reports from a single database
● Correlate
– Locate related reports from other servers
● Correct
– Access a knowledge base of explanations and recommendations
– Click to resolve issues
DDM Enhancements in Domino 8
● DDM was introduced in Domino 7
– Great options were already available
● DDM 8 is a fit-and-finish release of what was introduced in D7
– More and enhanced views
– More options for corrective actions
– Common actions
– More probes
– More possible solutions and corrections
– ... and many more details
● More features planned for next releases
Event Reports
● Like log.nsf, all events are recorded into ddm.nsf
● Unlike log.nsf, each set of events has its own report document
● Each set of events has an associated unique ID (a PUID)
– Every event with the same PUID is mapped into the same ddm.nsf report document
– A PUID is built using the enhanced event context data
● Target database, target user, target agent, etc.
● DDM.nsf Reports are presented in a variety of views
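The PUID mechanism can be pictured as grouping events by a key derived from their context. This is an illustrative sketch only; the real PUID algorithm is internal to DDM, and the field names below are my own:

```python
from collections import defaultdict

def puid(event):
    # Hypothetical context fields; the actual PUID inputs differ
    return (event["type"], event["target_db"], event["target_user"])

def group_into_reports(events):
    # Every event with the same PUID lands in the same report document
    reports = defaultdict(list)
    for ev in events:
        reports[puid(ev)].append(ev)
    return reports

events = [
    {"type": "Replication", "target_db": "names.nsf", "target_user": None},
    {"type": "Replication", "target_db": "names.nsf", "target_user": None},
    {"type": "Agent", "target_db": "mail.nsf", "target_user": "CN=Jane"},
]
reports = group_into_reports(events)
print(len(reports))  # 2 report documents for 3 events
```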
DDM Architecture
Simple and Enhanced Events
● Events can have one or two associated errors
– When there are two errors, the “root cause” is the second error
– 1 error: Event: Could not locate view 'xyz'
– 2 errors: Object Store Manager: File does not exist
● Simple events are legacy events which include these attributes …
– Time stamp, originating server, resource strings and IDs, severities, types
● Enhanced events also include the following attributes …
– New types, subtypes, target server
● Enhanced events include one or more of these attributes
– Target database, target user, target UNID, extra target data (i.e. string blobs), rich text, call stacks, correlation codes, and the NOTEID of a config doc responsible for firing the event
– This “target” data is some of the new event “context”
Event Report State
● Open = The event has a non-Normal severity and has not been manually closed
– Auto-Clearing Events = Some events can be automatically closed by DDM if the problem is reported as resolved with a Normal severity event
– Many events must be closed manually
– Simple events never auto-close (only enhanced events do)
● Closed = The event has a Normal severity or has been manually closed
– Automatically reopened if a severity change is detected
● Permanently Closed = Used by admins to say “I don’t care about this problem, keep it out of my sight.”
– Not automatically reopened on a Normal severity, but the problem continues to be tracked and recorded in its reports
DDM Probe Configuration
● Probes are used to automatically check your environment
● There are different kinds of probe types
– Probes are already defined in the configuration
– Most probes just need to be enabled
● Some probes need additional configuration
DDM Probe Configuration
● Three flavors of probes (some probes have more than one):
– Scheduled probes run according to a configurable schedule; defaults are supplied
– Embedded probes run inside a feature area and catch problems/issues as they occur
– Listening probes run when particular error codes are logged
● Probe configuration is quick and flexible:
– Default probe configuration documents are supplied with “out-of-the-box” values
– The “special target servers” concept allows out-of-the-box probing without having to specify named servers
– Thresholds and result content are highly customizable
● what the probe will actually check
● probe sensitivity (when it will generate an event)
● what severity of event the probe will generate
– Schedule is highly customizable for schedulable probes
– Probes can be enabled/disabled per server/server group
Probe Categories
● There are different probe categories with various individual probes
DDM Filters
● Event filters control what and how much information is reported to ddm.nsf. Why?
– The initial flood of events is striking, and many of them have always been there
– Over time, administrators will want to “adjust the volume”, seeing more or less of certain events
– Enhanced and simple events can be filtered.
– Filters can target specific servers and filter out events by event type/area and severity
● A default filter is supplied and enabled for simple events to reduce the initial “noise”.
Server Collection Hierarchy
● Used to control replication of DDM events in your environment– Based on automated selective replication formulas
● You can have multiple Server Collection Hierarchies based on your Notes Server Topology and Administration Topology
● The collecting server will receive all DDM events for its servers
– Helps to further focus information into a single instance of DDM
● Easy to set up once you are familiar with the configuration dialogs
Advanced Event Handlers
● Run an Agent when events occur
– An in-memory copy of the event document is passed in the document context
● It is not written into statrep after the agent is executed
● Sample Agent
Dim session As New NotesSession
Dim doc As NotesDocument
Dim eventtext As String
Set doc = session.DocumentContext
Print "EventText : >" + doc.EventText(0) + "<"
Print "ErrorCode : >" + doc.ErrorCode(0) + "<"
Advanced Event Handlers
● Run a Program – Parameters are passed to the command-line
● Sample Parameter List– All options enabled in event handler
Process name: >C:\Lotus\Domino85\nnsharg.EXE<
NshArg: Arg[0]: >C:\Lotus\Domino85\nnsharg.EXE<
NshArg: Arg[1]: >7<
NshArg: Arg[2]: >3<
NshArg: Arg[3]: >02/25/2010 12:00:00 CET<
NshArg: Arg[4]: >Event Monitor0x33C5<
NshArg: Arg[5]: >CN=nsh-d85-win-01/OU=Srv/O=NashComLab<
NshArg: Arg[6]: >Access to server nsh-d85-win-01/Srv/NashComLab is slow. <
NshArg: Arg[7]: >CN=nsh-d85-win-01/OU=Srv/O=NashComLab<
NshArg: Arg[8]: >NAMES.NSF<
OSTerm> Terminating, name: C:\Lotus\Domino85\nnsharg.EXE
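A program started by an event handler receives the enabled options as positional command-line arguments, in the order of the sample list above. A minimal handler sketch in Python; the field labels are my own guesses at each position, not official names:

```python
import sys

# Hypothetical labels for the positional arguments, following the order
# of the sample parameter list (type/severity appear as numeric codes).
FIELDS = ["program", "type_code", "severity_code", "timestamp",
          "originator", "server", "event_text", "target_server", "database"]

def parse_event_args(argv):
    # Map positional command-line arguments to labeled fields
    return dict(zip(FIELDS, argv))

if __name__ == "__main__":
    event = parse_event_args(sys.argv)
    print(event.get("event_text", ""))
```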
Lotus Domino Statistics
● Valuable resource of information
– Combines Domino statistics and platform statistics
– Platform statistics depend on the OS platform but are largely unified between platforms
– Check events4.nsf for a description of each platform stat per platform
● You should collect server stats at least every 15 minutes (the default is 90 minutes)
– Enable the Collect task and configure settings in events4.nsf
– Configure statistic events for important stats with the right thresholds
– Keep long term data to compare current and historic data
– You can also leverage SNMP to query stats
● Limitation: Only works for 1 partition per OS instance on all platforms
Top Statistics – NSF Buffer Pool
● Used for Buffering Database I/O
● Check Server Stats
– Database.Database.BufferPool.Maximum.Megabytes
– Database.Database.BufferPool.PercentReadsInBuffer
● Interpretation
– PercentReadsInBuffer below 90% is bad; above 98% is perfect
● Tune: notes.ini NSF_Buffer_Pool_Size_MB=n (in MB)
– Default: 512 MB
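The interpretation scale above can be expressed as a small classifier (Python; thresholds taken from the slide, function name mine):

```python
def buffer_pool_rating(percent_reads_in_buffer):
    # Thresholds from the slide: below 90% is bad, above 98% is perfect
    if percent_reads_in_buffer < 90.0:
        return "bad"
    if percent_reads_in_buffer > 98.0:
        return "perfect"
    return "acceptable"

print(buffer_pool_rating(85.0))   # bad
print(buffer_pool_rating(99.2))   # perfect
```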
Top Statistics – NSF Cache
● Used for Caching Open Databases
● Check Server Stats– Database.DbCache.HighWaterMark
– Database.DbCache.CurrentEntries
– Database.DbCache.MaxEntries
– Database.DbCache.OvercrowdingRejections
● Interpretation– Good = HighWaterMark < MaxEntries
– Good = 0 OvercrowdingRejections
● Tune: notes.ini NSF_DbCache_MaxEntries=n
– Default: NSF buffer pool size (in MB) multiplied by 3
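The default derivation above is a simple multiplication; a quick check (Python, purely illustrative):

```python
def default_dbcache_max_entries(nsf_buffer_pool_size_mb):
    # Default per the slide: buffer pool size (in MB) multiplied by 3
    return nsf_buffer_pool_size_mb * 3

print(default_dbcache_max_entries(512))  # 1536
```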
Top Statistics – (Cluster) Replication
● Use to check Cluster Replicator Performance
● Check Server Stats– Replica.Cluster.Failed
– Replica.Cluster.SecondsOnQueue
– Replica.Cluster.WorkQueueDepth
● Interpretation
– SecondsOnQueue below 10 is perfect; above 15 is bad
– WorkQueueDepth below 10 is perfect; above 15 is bad
● Tune: – Add more cluster replicators
– optimize cluster server usage (e.g. Split active users between cluster mates)
Top Statistics – Transactions
● Use for Indication of Server Load
● Check Server Stats– Server.Trans.PerMinute
● Interpretation:
– Trans.PerMinute per user above 30 is heavy; below 10 is light
● Tune: Analyze Heavy users and try to avoid load
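The per-user interpretation above can be sketched as a simple classifier (Python; thresholds from the slide, function name mine):

```python
def load_rating(trans_per_minute, active_users):
    # Thresholds from the slide, applied per user:
    # above 30 transactions/minute is heavy, below 10 is light
    per_user = trans_per_minute / active_users
    if per_user > 30:
        return "heavy"
    if per_user < 10:
        return "light"
    return "normal"

print(load_rating(12000, 300))  # 40 per user -> heavy
print(load_rating(1500, 300))   # 5 per user -> light
```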
Top Statistics – Concurrent Tasks
● Use to check simultaneous active database connections
● Check Server Stats– Server.ConcurrentTasks
– Server.ConcurrentTasks.Waiting
● Interpretation– Waiting should be ZERO
● Tune:
– Server_Pool_Tasks = n (e.g. 80)
– Server_Max_Concurrent_Trans = m (e.g. Server_Pool_Tasks * number of ports)
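The rule of thumb above is a simple product; a quick check (Python, purely illustrative):

```python
def suggested_max_concurrent_trans(server_pool_tasks, num_ports):
    # Rule of thumb from the slide: Server_Pool_Tasks times number of ports
    return server_pool_tasks * num_ports

print(suggested_max_concurrent_trans(80, 2))  # 160
```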
Top Statistics – Platform Memory
● Used to check memory allocated via memory pools and sub-allocations
● Check Server Stats– Mem.Allocated
– Mem.Allocated.Process
– Mem.Allocated.Shared
● Interpretation– Memory Leaks when increasing over days / weeks
● Tune
– Via several parameters (buffer pool, cache, namelookup, …)
● Note
– Be careful interpreting this statistic … not all memory might be included
Top Statistics – Platform CPU
● Used to check CPU Utilization on Server
● Check Server Stats– Platform.System.PctCombinedCpuUtil
– Platform.System.PctTotalPrivilegedCpuUtil
– Platform.System.PctTotalUserCpuUtil
● Interpretation:
– CombinedCpuUtil below 90% is OK; above 90% is too high
● Tune– Many Root Causes Possible
Top Statistics – Paging File
● Use to check Server Memory Swapping to Disk
● Check Server Stats– Platform.PagingFile.Total.PctUtil
● Interpretation
– PctUtil.Avg below 10% is OK; above 10% is bad
● Tune– OS Level tuning, Check Memory
Top Statistics – Platform Disk
● Used to check Disk Performance
● Check Server Stats– Platform.LogicalDisk.1.AvgQueueLen
– Platform.LogicalDisk.1.PctUtil
● Interpretation
– AvgQueueLen below 2 is good; above 5 is bad
– Good = PctUtil < 80%
● Tune
– Via several parameters (buffer pool, cache, namelookup, …)
● Note– Platform.LogicalDisk.1.AssignedName=C points to the disk
Q&A
● I hope you enjoyed the workshop
● Questions now or later?
● Contact– [email protected]
– http://www.nashcom.de
– http://blog.nashcom.de