Jeff Sly - Case Study Nagios @ Nu Skin
8/11/2019 Jeff Sly - Case Study Nagios @ Nu Skin
1/58
Jeff Sly
Principal IT Architect
Case Study
Nagios @ Nu Skin
-
Who is in the Audience?
How many of you are:
Suppliers of Nagios or some value add-on for Nagios?
Customers using Nagios?
Just implementing Nagios or expanding implementation?
Using NagiosXI?
-
Who is Nu Skin?
-
Our Technology Footprint
Ecommerce: Home grown
Applications: Java, EJB, ABAP, .Net
Databases: Oracle, MySQL, MSSQL
OS: HPUX, Redhat, Windows, VMWare
ERP: SAP Supply Chain, CRM, FI
Datacenters: 6 locations in 6 countries
Offices: 50 Countries
-
Monitoring Goals
Monitoring presents operations with a completely integrated global view.
Good monitoring is proactive; it helps teams prevent problems from becoming outages.
Good monitoring helps minimize outage downtime, quickly identify root cause, and contact the correct people.
-
Centralized Monitoring System
-
Our Monitoring History
We tried for 10 years
-
Do it all in One Tool Projects
One Monitoring Tool to rule them all:
Mercury SiteScope
Remedy Help Desk
HP OpenView
Quest Foglight
Home grown (several)
One monitoring person
He decided to quit!
-
Could never get everything
All Failed
We always gave up! Why?
Servers and agents that were proprietary
Huge footprint, inefficient performance
Steep learning curve
Very expensive
Updates costly and very time consuming
System Administrators like their own scripts, where they can see what they are doing
-
Resulting Monitoring Issues
Tried to make Operations the clearing house for all warnings and alerts from 10+ tools
Operations was overwhelmed
Took 4 process steps and lots of software to notify of critical failures
Most Administrators set up their own private monitoring to receive warnings
Many false notifications
Late notifications
-
As Is (start of project)
Our Business Customers were Unhappy
-
Old Monitoring Work Flow
Four steps to notify system administrator
-
Step 1: Everything Emails Operations
[Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk Error]
-
Step 2: Operations Opens Email
[Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk Error]
-
Step 3: Operations Checks Source
[Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk Error]
-
Step 4: Operations Calls Admin
[Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk Error]
-
Inventory of Existing Checks
Regular Expression found on Web Page Monitoring
HTTP Check - Up or Down
Ping Host Up or Down
PORT monitoring
FTP checking
SMTP checking
SNMP monitoring - no trap catching yet
Radius
DNS monitoring
Disk Space monitoring
CPU and Load Average monitoring
Memory Monitoring
-
Inventory of Existing Checks
Service monitoring
Transaction monitoring - page load times performance graph
Website click through (Webinject not working)
Log File monitor - parse for Errors
Java HEAP, Thread, Threadlock monitoring
Apache thread and worker count monitors
Ecommerce shop monitors
Email can send and receive
SQL query ODBC (catalog ODBC had bugs)
-
To Be
Happy Customers
-
Key Ideas
1. MoM
2. Tool Requirements
3. Shared Ownership
4. Lowest Level
5. Nagios Monitor Method
-
Idea 1: MoM
Our first breakthrough was the idea that even though we needed a centralized view for all monitoring, that did not mean all monitoring had to be done by one monitoring tool.
We had to pick a Manager of the Monitors (MoM) to bring together the best-of-breed monitoring.
-
MoM - according to Gartner
-
Idea 2: Tool Requirements
Open: not proprietary and closed
Mainstream: wanted good native support and strong community
Interface: to 3rd Party Monitoring
Flexible: adapt to many types of monitoring
Efficient: minimal footprint on production servers, not chatty on network
Notification: granular control
Reliable: good clean architecture
Usability: GUI interface, reporting
-
Idea 3: Shared Ownership
Core team
Operation of Monitoring Environment: backups, upgrades, & custom plug-ins
Monitoring Experts
Training
Monitoring leads in Development & Admin teams:
Set up own monitors
Keep own monitors current
Adjust monitors
If something is not monitored, it is not the core team's fault
-
Operations Owned Monitoring
[Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk Error]
-
Team Leads Own Monitoring
[Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]
-
How to Guides
-
How to Setup NRPE - HPUX
-
Idea 4: Lowest Level
Handle alerts at the lowest possible level in the organization
Only forward alerts if not handled at lower levels before they become critical
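The routing rule above can be sketched in a few lines of Python. This is only an illustration, not the configuration used at Nu Skin: the `Alert` class, the `route` function, the team name, and the 30-minute escalation window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    owner_team: str            # team that set up the monitor (hypothetical field)
    minutes_open: int          # minutes since the alert first fired
    acknowledged: bool = False


def route(alert: Alert, escalate_after_minutes: int = 30) -> str:
    """Return who should be notified for this alert right now."""
    if alert.acknowledged:
        return "nobody"            # already being handled at the lowest level
    if alert.minutes_open < escalate_after_minutes:
        return alert.owner_team    # the owning team gets the first chance
    return "Operations"            # still unhandled, so forward it upward


print(route(Alert("shop-http", "Web", minutes_open=5)))
print(route(Alert("shop-http", "Web", minutes_open=45)))
print(route(Alert("shop-http", "Web", minutes_open=45, acknowledged=True)))
```

In Nagios itself this kind of rule is normally expressed with notification and escalation settings rather than custom code; the sketch only shows the decision being made.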
-
Handle events at lowest level
[Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]
-
Only forward unhandled alerts
[Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database]
-
Idea 5: Nagios Monitor Method
Choose the Nagios Monitoring Method
Active Check from Nagios Server (normal)
Active Check performed by remote client: NRPE, NSClient
Passive Check: Listen to 3rd party monitors, NSCA
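Whichever method runs the check, a Nagios plugin reports its result the same way: through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output, optionally followed by performance data after a "|". A minimal sketch of that convention follows; the function names, the "DISK" label, and the thresholds are made up for illustration.

```python
from typing import Tuple

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: float, warn: float, crit: float) -> int:
    """Map a measured value onto a Nagios state using upper thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_output(name: str, value: float, warn: float, crit: float) -> Tuple[int, str]:
    """Build the exit code and the single output line a plugin would print."""
    code = evaluate(value, warn, crit)
    # Text before "|" is the status line; text after it is performance data.
    line = f"{name} {LABELS[code]} - {value:.1f}% used | used={value:.1f}%;{warn};{crit}"
    return code, line


code, line = plugin_output("DISK", 91.5, warn=80, crit=90)
print(code, line)
# A real plugin would end with sys.exit(code) so Nagios sees the state.
```

Run on the Nagios server, this would be a local active check; the same script invoked through NRPE would be a remote active check.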
-
Active Local Check
[Diagram labels: Nagios, HTTP or Ping, DB, DB Monitor, Web, Unix, Win]
-
Active Remote Check - UX
[Diagram labels: Nagios, CPU, RAM (NRPE), DB, DB Monitor, Web, Unix, Win]
-
Active Remote Check - Win
[Diagram labels: Nagios, CPU, RAM (NSClient), DB, DB Monitor, Web, Unix, Win]
-
Passive 3rd Party Alert
[Diagram labels: Nagios, 3rd Party Alert NSCA, 3rd Party Check DB, DB, DB Monitor, Web, Unix, Win]
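For a passive check, the third-party tool hands Nagios a finished result instead of being polled. The send_nsca client reads service results as one tab-separated line per result: host name, service description, return code, and plugin output. A small sketch of building such a line is below; the host name, service description, and message are hypothetical.

```python
def nsca_line(host: str, service: str, return_code: int, output: str) -> str:
    """Format one passive service check result for send_nsca."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    # Tabs and newlines delimit fields and records, so keep them out of the text.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


line = nsca_line("db01", "Third Party DB Alert", 2, "Tablespace USERS 97% full")
print(repr(line))
```

A line like this would typically be piped into send_nsca on the machine running the third-party monitor, and the matching service on the Nagios side needs passive checks enabled.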
-
Bonus Idea - Tune
Tune the database
Add Ram Drive
-
Tune the Database
Modify contents of the /etc/my.cnf [mysqld] section.
tmp_table_size=524288000
max_heap_table_size=524288000
table_cache=768
set-variable=max_connections=100
wait_timeout=7800
query_cache_size = 12582912
query_cache_limit=80000
thread_cache_size = 4
join_buffer_size = 128K
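The raw byte counts in these settings are easier to review in binary units. The short Python sketch below converts the four byte-valued settings from the slide; the helper name `human` is just an illustration.

```python
settings = {
    "tmp_table_size": 524288000,
    "max_heap_table_size": 524288000,
    "query_cache_size": 12582912,
    "query_cache_limit": 80000,
}


def human(n: int) -> str:
    """Render a byte count in MiB or KiB (1 KiB = 1024 bytes)."""
    for unit, size in (("MiB", 1024 ** 2), ("KiB", 1024)):
        if n >= size:
            return f"{n / size:g} {unit}"
    return f"{n} B"


for name, value in settings.items():
    print(f"{name:22s} {human(value)}")
```

So the two temporary-table limits are 500 MiB each and the query cache is 12 MiB.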
Info on MySQL Tuning, Nagios Tuning: http://web3us.com
-
RAM Drive
Create a RAM disk for Nagios temporary files
I created a ramdisk by adding the following entry to the /etc/fstab file:
none /mnt/ram tmpfs size=500M 0 0
Mount the disk using the following commands
# mkdir -p /mnt/ram; mount /mnt/ram
Verify the disk was mounted and created
# df -k
Modify the /usr/local/nagios/etc/nagios.cfg file with the following tuned parameters
temp_file=/mnt/ram/nagios.tmp
temp_path=/mnt/ram
status_file=/mnt/ram/status.dat
precached_object_file=/mnt/ram/objects.precache
object_cache_file=/mnt/ram/objects.cache
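After editing nagios.cfg it can be worth checking that all five settings really point at the RAM disk. The following sketch parses nagios.cfg-style text and lists any of the five keys that are missing or point elsewhere. The function names and the sample text are illustrative only, and the mount point /mnt/ram is taken from the slide.

```python
from pathlib import PurePosixPath

RAM_KEYS = ("temp_file", "temp_path", "status_file",
            "precached_object_file", "object_cache_file")


def parse_cfg(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def not_on_ramdisk(text: str, mount: str = "/mnt/ram") -> list:
    """Return the tuned keys that are missing or not located under the mount."""
    cfg = parse_cfg(text)
    root = PurePosixPath(mount)
    missing = []
    for key in RAM_KEYS:
        value = cfg.get(key)
        if value is None:
            missing.append(key)
            continue
        path = PurePosixPath(value)
        if root != path and root not in path.parents:
            missing.append(key)
    return missing


sample = """\
temp_file=/mnt/ram/nagios.tmp
temp_path=/mnt/ram
status_file=/usr/local/nagios/var/status.dat
"""
print(not_on_ramdisk(sample))
```

Note that a tmpfs mount is emptied on reboot, so only files that Nagios can regenerate should live there.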
-
Implementation Methodology
Site Survey
Inventory existing monitors
Proof of concept
Build new environment
Migrate monitors from each platform to Nagios, one at a time
Integrate OEM, and to send monitors to Nagios
-
Three Project Phases
Deliver something useful in each phase
Build a level at a time
-
Phase I
1. Set up a pilot of Nagios XI using Trial License.
2. Set up Foglight monitoring of JVM (Java Virtual Machine).
3. Purchase NagiosXI and Consulting Support
4. Bring in a consultant for two weeks to help set up the architecture and help us work with the system.
5. Documentation Web Site for Nagios learnings and How-to guides
6. Define a set of standards and guidelines to follow to help aid an effective monitoring process.
7. Backups running on Production Nagios Server
8. Set up services which aren't being caught right now and move a few of the important services over to the new Nagios XI monitoring system.
9. Test Nagios plugins and server performance
-
Phase II
1. Migrate off of Sitescope 6 and shut down
2. Migrate off of Sitescope 8 and shut down
3. Decommission Foglight
4. Clean up the old monitoring server
5. Migrate the network team from old Nagios to core NagiosXI system
6. Set up standby NagiosXI system, cron to replicate weekly
7. Research missing alerts and add them to the new NagiosXI system
-
Phase III
1. Implement Global Monitoring
Add monitors for existing international systems
Add monitors using JMX to monitor Java servers
Nagios Remote Plugin Executor (NRPE) to monitor remotely
Remote Monitoring for Windows Servers (NSClient++)
Implement notification and escalation of alerts
Add monitors for critical business functions
-
Phase III continued
2. Corporate Enhancements
Request recurring down time enhancement from Ethan Galstad
Automate refresh of NagiosXI standby system
Build Network Map
Retire Windows SiteScope
Add monitors for phone systems
Add monitors to data center (UPS, Temperature, Humidity)
Integrate to SAP Tidal monitoring tool
-
Phase III continued
3. Business
Business review and approve SLA (using business terms)
Monitor both the Business Functions and the individual point devices that provide the Business Function
Follow the Sun with Eyes on Glass.
Training
How to setup alerts
How to receive alerts
How to report on performance graphs
Create a new Dashboard for HelpDesk and International IT Staff
-
Inventory of Monitor Checks
Qty | Things we figured out how to do from Nagios | Solution
50 | Regular Expression found on Web Page Monitoring | HTTP Check
170 | HTTP Check - Up or down | HTTP Check
600 | Ping Host Up or down | Nagios Check alive
100 | PORT monitoring | Check TCP port #
10 | FTP checking | Nagios FTP plugin
8 | SMTP checking | Nagios SMTP plugin
5 | SNMP monitoring - no trap catching yet | Not Using
4 | Radius | Nagios plugin, difficult
16 | DNS monitoring | Nagios Check DNS
250 | Disk Space monitoring | NSClient, NRPE - Nagios Disk plugin
170 | CPU and Load Average monitoring | NSClient, NRPE - Custom Linux plugin
170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
-
Inventory continued
Qty | Things we figured out how to do from Nagios | Solution
170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
80 | Service monitoring | NSClient, NRPE with bash shell script
30 | Transaction monitoring - page load times - performance data graphs | Custom using Selenium Scripts
30 | Website click through (webinject not working) | Custom using mechanize
10 | Log File monitor - parse for Errors | NRPE - script parse log files
6 | Java HEAP, Thread, Threadlock monitoring | Java Management Extensions (JMX)
8 | Apache thread and worker count monitors | Custom plugin Apache statics
18 | ShopApp and SignupApp monitors | HTTP Check Custom app status page
5 | Email can send and receive | Custom Nagios plugin
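As an illustration of the "script parse log files" row, here is a minimal sketch of a log check that could be run through NRPE. It is not the script used at Nu Skin: the function name, the ERROR/FATAL pattern, and the sample lines are assumptions.

```python
import re


def check_log(lines, pattern=r"\b(ERROR|FATAL)\b", crit_at=1):
    """Return (nagios_exit_code, status_line) for an iterable of log lines."""
    regex = re.compile(pattern)
    hits = [ln.rstrip("\n") for ln in lines if regex.search(ln)]
    code = 2 if len(hits) >= crit_at else 0      # 2 = CRITICAL, 0 = OK
    status = "CRITICAL" if code == 2 else "OK"
    detail = hits[-1] if hits else "no matching lines"
    return code, f"LOG {status} - {len(hits)} match(es); last: {detail}"


sample_lines = [
    "INFO service started\n",
    "ERROR database connection timed out\n",
]
print(check_log(sample_lines))
```

A production version would also need to remember how far it read on the previous run, so that old errors do not alert again.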
-
Nagios XI Interface
-
Data Centers in 7 Countries
-
Goal: Quick Notification & Recovery from Outage
Type of Monitor: Notification of outages with details on which system is down, so we know who to contact
Solution: Migrate from Sitescope, Openview to NagiosXI
IT Operations
-
Goal: Prevention of outage
Type of Monitor: Warnings about conditions before outages occur, allow for corrective actions that will prevent likely outages
Solution: Migrate from Sitescope, Openview to NagiosXI, Integrate OEM SAP and Scripts with Nagios
IT Team Managers
-
Summary
1. MoM ~ Manager of Managers, allow specialized tools
2. Tool Requirements, enough but not all
3. Ownership for implementation, shared
4. Handle alerts, lowest level in organization
5. Choose Nagios monitoring method
-
Tips, Tricks & Demos
Nagios XI Large Implementation
Day 3, 2:00 Track 3 (Nate Broderick)
3 Demos
Performance challenges and solutions
Integrating monitoring solutions Oracle
Migrating from BAC & Foglight
Customization
Graphing, and more.