Machine Data to Readable Reports - System Monitoring, Alerting and Reporting - Ashley Fisher,...
Machine Data to Readable Reports
System Monitoring, Alerting and Reporting
Ashley Fisher, University of the Sunshine Coast, Queensland.
2
Welcome
Ashley Fisher
Business Systems Analyst
University of the Sunshine Coast
Sunshine Coast, Queensland, Australia.
3
• System Health
• Monitoring
• Alerting
• Reporting
4
Microsoft Windows Ahead
While this presentation focusses on Microsoft Windows Server and associated technologies, the concepts and implementation of these systems are similar in other operating environments.
5
Underlying Infrastructure
• USC is Microsoft centric
• Servers are running on Windows Server 2008 R2
• Authentication through Active Directory
• Currently running Microsoft SQL Server 2008
6
Blackboard Infrastructure
• 5 Environments
• Total 12 Application Servers
• 3 Dedicated Batch Servers
• 4 SQL Clusters, 1 Standalone MSSQL Installation
• 7 F5 BigIP Pools
• 7 TB File Share Storage
• Approx. 12,000 Successful Logins per Day.
7
Mediasite Infrastructure
• 2 Environments
• Total 12 Application Servers
• 2 SQL Clusters
• 8 F5 BigIP Pools
• 9.5 TB File Share Storage
• 380 Recorded Presentations per Week
• Approximately 1,100 hours of content viewed per Day
8
Monitoring Systems In Place
• Nagios – Monitoring Server Availability
• Zabbix (Pictured Left)
– Monitoring Server Availability and Performance
– Currently Proof of Concept
• Splunk – Log Monitoring
9
Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
Splunk has a mission of making machine data accessible across an organization by identifying data patterns, providing metrics, diagnosing problems and providing intelligence for business operations. Splunk is a horizontal technology used for application management, security and compliance, as well as business and web analytics.
https://en.wikipedia.org/wiki/Splunk
10
Splunk Interface
11
Blackboard Logging
• 67 Log files on each Blackboard host
» A lot of information we can use, and are using.
» A lot we’re potentially missing.
• Daily rotation of important logs
» Troubleshooting issues across multiple days is frustrating.
• Logs archived Monday morning, weekly
» As above, however we need to unzip the archived logs to get access to the contained information.
12
Blackboard Database
• The activity_accumulator table retains a transcript of user activity.
• We can join this table to related tables to track user login times, course access times, and individual content item interactions.
• USC rotates its activity_accumulator table data into a backup database every 180 days.
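As a rough illustration of the kind of join described above, the sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server. The column names (user_pk1, course_pk1, event_type, timestamp), the event type values, the users table and the sample rows are all assumptions made for illustration, not a confirmed copy of the Blackboard schema.

```python
import sqlite3


def user_activity(conn, user_id):
    """Return (timestamp, event_type, course_pk1) rows for one user, oldest first."""
    query = """
        SELECT aa.timestamp, aa.event_type, aa.course_pk1
        FROM activity_accumulator AS aa
        JOIN users AS u ON u.pk1 = aa.user_pk1
        WHERE u.user_id = ?
        ORDER BY aa.timestamp
    """
    return conn.execute(query, (user_id,)).fetchall()


def build_demo_db():
    """Create a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (pk1 INTEGER PRIMARY KEY, user_id TEXT);
        CREATE TABLE activity_accumulator (
            pk1 INTEGER PRIMARY KEY,
            user_pk1 INTEGER,
            course_pk1 INTEGER,
            event_type TEXT,
            timestamp TEXT
        );
        INSERT INTO users VALUES (1, 'student_a'), (2, 'student_b');
        INSERT INTO activity_accumulator VALUES
            (1, 1, NULL, 'LOGIN_ATTEMPT',  '2015-05-01 08:59:00'),
            (2, 1, 42,   'COURSE_ACCESS',  '2015-05-01 09:01:00'),
            (3, 2, 42,   'COURSE_ACCESS',  '2015-05-01 09:05:00'),
            (4, 1, 42,   'CONTENT_ACCESS', '2015-05-01 09:02:30');
    """)
    return conn


if __name__ == "__main__":
    for row in user_activity(build_demo_db(), "student_a"):
        print(row)
```

Against a real installation the same query shape would run through a SQL Server driver rather than sqlite3, and the table and column names would need checking against the local schema first.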
13
Student Contesting Late Submission Penalty
Students are penalised by a percentage of their received grade for late assignment submissions, and they do contest the penalty from time to time.
• Traditional Method of Investigation
– Database Query (activity_accumulator)
– Individual Host Log Interrogation (Repeat)
• Lots of Steps
• Time Consuming
• Room for Error or Misinterpretation
14
Student Contesting Late Submission Penalty
Students are penalised by a percentage of their received grade for late assignment submissions, and they do contest the penalty from time to time.
• Intermediate Method
– Database Query (activity_accumulator)
– Log Into Splunk
– Search string:
index="blackboard_prod" "_userpk1_"
• Few Steps
• Easy Training
• Now Dashboarded
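The search string from the intermediate method can be generated rather than typed by hand. The helper below is a minimal sketch that assumes the "_userpk1_" placeholder means the user's numeric pk1 wrapped in underscores; that reading, and the function name, are assumptions rather than something the slide states.

```python
def build_user_search(user_pk1, index="blackboard_prod"):
    """Build a Splunk search string for one user's pk1 (assumed token format)."""
    if not isinstance(user_pk1, int) or user_pk1 <= 0:
        raise ValueError("user_pk1 must be a positive integer")
    # Quote both terms so Splunk treats each one as a literal phrase.
    return f'index="{index}" "_{user_pk1}_"'


if __name__ == "__main__":
    # 12345 is a made-up pk1 used only for demonstration.
    print(build_user_search(12345))
```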
15
16
Zabbix is an enterprise open source monitoring solution for networks and applications(…) It is designed to monitor and track the status of various network services, servers, and other network hardware.
• Simple checks can verify the availability and responsiveness of standard services such as SMTP or HTTP without installing any software on the monitored host.
• A Zabbix agent can also be installed on UNIX and Windows hosts to monitor statistics such as CPU load, network utilization, disk space, etc.
• As an alternative to installing an agent on hosts, Zabbix includes support for monitoring via SNMP, TCP and ICMP checks, as well as over IPMI, JMX, SSH, Telnet and using custom parameters(…)
https://en.wikipedia.org/wiki/Zabbix
17
Zabbix Interface
Overview/Landing Page
18
Our Zabbix Environment
• Zabbix holds a very template-centred view of deployment.
• The approach we’ve taken is to have ‘opt-in’ templates available for hosts.
• CPU Load, Memory Use, Network Traffic/Bandwidth and HDD Space checks are in a template added to all hosts with an agent installed.
19
Zabbix Templates
• Example Template: ‘Core Infrastructure Connectivity’.
When this template is applied to a host, the Zabbix agent on that host pings the template's end-points from the host itself. We can then see whether an individual host cannot connect to the time servers, domain controllers or our LDAP servers.
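The idea of checking connectivity from each host's own point of view can be sketched outside Zabbix. The example below is not Zabbix configuration and swaps the ping for a plain TCP connection attempt, since that needs no special privileges. The function name and the endpoint addresses in the demo are hypothetical.

```python
import socket


def check_endpoints(endpoints, timeout=2.0):
    """Map each (host, port) pair to True if a TCP connection succeeds."""
    results = {}
    for host, port in endpoints:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            results[(host, port)] = False
    return results


if __name__ == "__main__":
    # Hypothetical end-points: an LDAP server and a domain controller.
    targets = [("ldap.example.edu", 389), ("dc01.example.edu", 445)]
    for target, reachable in check_endpoints(targets).items():
        print(target, "OK" if reachable else "UNREACHABLE")
```

Running the same check on every host is what makes the result useful: one unreachable end-point seen from a single host points at that host, while the same failure seen from all hosts points at the end-point.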
20
Blackboard and Zabbix
• We have multiple Blackboard-specific templates. One is in line with the previous example, but it watches the availability and response times of external connectors, such as SafeAssign and Collaborate.
21
Blackboard and Zabbix
22
Blackboard and Zabbix
• One very powerful tool we have is JMX monitoring, which pulls information about the Blackboard application itself.
23
Zabbix Environment Mapping
Zabbix allows you to map relationships between nodes, showing where problems lie and what they impact.
For example, if there was a problem with file03, the line between bbdev01 and file03 would turn red and file03’s status would change from OK to Problem. This is an easy way to assess what the problem will impact.
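The impact assessment that the map gives visually can also be expressed as a walk over a dependency graph. The sketch below is only an illustration of that idea, not how Zabbix stores its maps. bbdev01 and file03 come from the slide, while sql01 and f5_pool are hypothetical nodes added to make the example non-trivial.

```python
from collections import deque


def impacted_nodes(depends_on, failed):
    """Return every node that directly or indirectly depends on `failed`."""
    # Invert the map: for each node, record which nodes rely on it.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    # Breadth-first walk outwards from the failed node.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for node in dependents.get(current, ()):
            if node not in impacted:
                impacted.add(node)
                queue.append(node)
    return impacted


# node -> the nodes it depends on (sql01 and f5_pool are hypothetical)
DEPENDS_ON = {
    "f5_pool": ["bbdev01"],
    "bbdev01": ["file03", "sql01"],
}

if __name__ == "__main__":
    print(sorted(impacted_nodes(DEPENDS_ON, "file03")))
```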
24
Mediasite and Zabbix
• Mediasite is really at the forefront of our monitoring through Zabbix.
• In Nagios, we currently have 5 checks per recorder in production.
• In Zabbix so far, I have 26 individual checks per recorder.
25
26
The Platforms in Collaboration
• The graph below shows the available space on our production Blackboard file share for the incident.
• Emergency maintenance was carried out on the 15th to increase the allocated disk space.
27
The Platforms in Collaboration
• An alert was set up in Splunk to let us know, in real time, when a student submits an assessment larger than 200 MB.
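The condition behind that alert is a simple threshold test. The real alert lives in Splunk as a search, which is not reproduced here. The Python sketch below only shows the logic, and assumes hypothetical event records with a size_bytes field and a limit expressed in binary megabytes.

```python
# Assumption: "200mb" is read as 200 * 1024 * 1024 bytes.
LIMIT_BYTES = 200 * 1024 * 1024


def large_submissions(events, limit_bytes=LIMIT_BYTES):
    """Return the events whose size is strictly greater than the limit."""
    return [event for event in events if event["size_bytes"] > limit_bytes]


if __name__ == "__main__":
    # Hypothetical sample events; field names are made up for illustration.
    sample = [
        {"user": "student_a", "size_bytes": 5 * 1024 * 1024},
        {"user": "student_b", "size_bytes": 350 * 1024 * 1024},
    ]
    for event in large_submissions(sample):
        print("ALERT:", event["user"], event["size_bytes"])
```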
28
Self-Healing?
https://mediasite.usc.edu.au/Mediasite/Play/4af80791a9784f0bb418be531d7e31671d
The video above is the only way I could think of to present this particular part.
In the video, I have the Zabbix monitoring platform on one side and a camera feed of the remote Mediasite recorder on the other.
As illustrated in the previous slide, a few checks are deemed “self-healing”, and this is one such scenario. If the Mediasite scheduler service fails or stops, Zabbix detects that something is not right and sends a command to the recorder to shut the software down and force a restart of the recorder.
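The control flow of such a self-healing check can be sketched as follows. Zabbix handles this through its own trigger and action configuration, which is not shown. The function below takes the health check and the remediation as plain callables, both of which are hypothetical stand-ins for the real service check and restart command.

```python
def self_heal(is_service_running, restart_recorder):
    """Run the remediation only when the check fails; return the action taken."""
    if is_service_running():
        return "ok"
    # The check failed, so request the remediation (shutdown and restart).
    restart_recorder()
    return "restart_requested"


if __name__ == "__main__":
    # Stand-ins: pretend the scheduler service has stopped.
    outcome = self_heal(
        is_service_running=lambda: False,
        restart_recorder=lambda: print("restart command sent"),
    )
    print(outcome)
```

Keeping the check and the remediation separate means each can be tested on its own, and the remediation never runs while the service is healthy.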
29
Questions?