Machine Data to Readable Reports - System Monitoring, Alerting and Reporting - Ashley Fisher,...

Post on 21-Mar-2017



Machine Data to Readable Reports

System Monitoring, Alerting and Reporting

Ashley Fisher, University of the Sunshine Coast, Queensland.

2

Welcome

Ashley Fisher

Business Systems Analyst

University of the Sunshine Coast

Sunshine Coast, Queensland, Australia.

3

• System Health

• Monitoring

• Alerting

• Reporting

4

Microsoft Windows Ahead

While this presentation focusses on Microsoft Windows Server and associated technologies, the concepts and implementation of these systems are similar in other operating environments.

5

Underlying Infrastructure

• USC is Microsoft centric

• Servers are running on Windows Server 2008 R2

• Authentication through Active Directory

• Currently running Microsoft SQL Server 2008

6

Blackboard Infrastructure

• 5 Environments

• Total 12 Application Servers

• 3 Dedicated Batch Servers

• 4 SQL Clusters, 1 Standalone MSSQL Installation

• 7 F5 BigIP Pools

• 7 TB File Share Storage

• Approx. 12,000 Successful Logins per Day.

7

Mediasite Infrastructure

• 2 Environments

• Total 12 Application Servers

• 2 SQL Clusters

• 8 F5 BigIP Pools

• 9.5 TB File Share Storage

• 380 Recorded Presentations per Week

• Approximately 1,100 hours of content viewed per Day

8

Monitoring Systems In Place

• Nagios – Monitoring Server Availability

• Zabbix (Pictured Left)

– Monitoring Server Availability and Performance

– Currently Proof of Concept

• Splunk – Log Monitoring

9

Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.

Splunk has a mission of making machine data accessible across an organization by identifying data patterns, providing metrics, diagnosing problems and providing intelligence for business operations. Splunk is a horizontal technology used for application management, security and compliance, as well as business and web analytics.

https://en.wikipedia.org/wiki/Splunk

10

Splunk Interface

11

Blackboard Logging

• 67 Log files on each Blackboard host

» A lot of information we can use, and are using.

» A lot we’re potentially missing.

• Daily rotation of important logs

» Troubleshooting issues across multiple days is frustrating.

• Logs archived Monday morning, weekly

» As above, however we need to unzip the archived logs to get access to the contained information.
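One way to avoid unzipping the weekly archives by hand is to search them in place. The sketch below is an illustration only, not the tooling used at USC: the function name is hypothetical, and it assumes the archives are standard zip files containing plain-text logs.

```python
import zipfile
from typing import Iterator, Tuple


def search_archived_logs(archive_path: str, needle: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (member name, line number, line) for every line containing needle.

    Assumes archive_path is a zip file of plain-text log files (an assumption;
    the slides do not say how the archives are laid out).
    """
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as handle:
                for number, raw in enumerate(handle, start=1):
                    line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                    if needle in line:
                        yield member, number, line
```

Because the function is a generator, a caller can stop after the first few matches without reading the whole archive.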

12

Blackboard Database

• The activity_accumulator table retains a transcript of user activity.

• We can use table joins behind the scenes to track user login times, course access times, and individual content item interactions.

• USC rotates our activity_accumulator table data into a backup database every 180 days.
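The kind of join described above can be sketched as follows. This is a minimal illustration under stated assumptions: the schema is heavily simplified, the sample rows and event type names are made up, and an in-memory SQLite database stands in for the Microsoft SQL Server instance so the example can run anywhere.

```python
import sqlite3

# Simplified, hypothetical schema: the real Blackboard tables have many more columns.
SCHEMA = """
CREATE TABLE users (pk1 INTEGER PRIMARY KEY, user_id TEXT);
CREATE TABLE course_main (pk1 INTEGER PRIMARY KEY, course_id TEXT);
CREATE TABLE activity_accumulator (
    pk1 INTEGER PRIMARY KEY,
    user_pk1 INTEGER,
    course_pk1 INTEGER,
    event_type TEXT,
    timestamp TEXT
);
"""

# Join the activity rows to the user and (optional) course they relate to.
QUERY = """
SELECT u.user_id, c.course_id, a.event_type, a.timestamp
FROM activity_accumulator AS a
JOIN users AS u ON u.pk1 = a.user_pk1
LEFT JOIN course_main AS c ON c.pk1 = a.course_pk1
WHERE u.user_id = ?
ORDER BY a.timestamp
"""


def user_activity(conn: sqlite3.Connection, user_id: str) -> list:
    """Return (user_id, course_id, event_type, timestamp) rows for one user."""
    return conn.execute(QUERY, (user_id,)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'jsmith')")
    conn.execute("INSERT INTO course_main VALUES (10, 'ABC101')")
    conn.executemany(
        "INSERT INTO activity_accumulator VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, None, "LOGIN_ATTEMPT", "2017-03-01 08:59:00"),
            (2, 1, 10, "COURSE_ACCESS", "2017-03-01 09:00:30"),
        ],
    )
    return conn
```

The LEFT JOIN keeps events such as logins that are not tied to a course, which is why their course_id comes back empty.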

13

Student Contesting Late Submission Penalty

Students are penalised by a percentage of their received grade for late assignment submissions, and students do contest the penalty from time to time.

• Traditional Method of Investigation

– Database Query (activity_accumulator)

– Individual Host Log Interrogation (Repeat)

• Lots of Steps

• Time Consuming

• Room for Error or Misinterpretation

14

Student Contesting Late Submission Penalty

Students are penalised by a percentage of their received grade for late assignment submissions, and students do contest the penalty from time to time.

• Intermediate Method

– Database Query (activity_accumulator)

– Log Into Splunk

– Search string:

index="blackboard_prod" "_userpk1_"

• Few Steps

• Easy Training

• Now Dashboarded
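The search string can be generated rather than typed by hand. The helper below is a hypothetical sketch: it assumes that "_userpk1_" in the slide stands for the user's pk1 value wrapped in underscores, and it only builds the string; it does not talk to Splunk.

```python
def build_user_search(user_pk1: int, index: str = "blackboard_prod") -> str:
    """Build the search string for one user.

    Assumption: the placeholder "_userpk1_" means the numeric pk1 of the user
    (taken from the database query) with an underscore on either side.
    """
    if user_pk1 <= 0:
        raise ValueError("user_pk1 must be a positive integer")
    # Straight double quotes are required; curly quotes pasted from a slide
    # would not be treated as quoting characters.
    return f'index="{index}" "_{user_pk1}_"'
```

The resulting string can be pasted into the Splunk search bar.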

15

16

Zabbix is an enterprise open source monitoring solution for networks and applications (…) It is designed to monitor and track the status of various network services, servers, and other network hardware.

• Simple checks can verify the availability and responsiveness of standard services such as SMTP or HTTP without installing any software on the monitored host.

• A Zabbix agent can also be installed on UNIX and Windows hosts to monitor statistics such as CPU load, network utilization, disk space, etc.

• As an alternative to installing an agent on hosts, Zabbix includes support for monitoring via SNMP, TCP and ICMP checks, as well as over IPMI, JMX, SSH, Telnet and using custom parameters (…)

https://en.wikipedia.org/wiki/Zabbix

17

Zabbix Interface

Overview/Landing Page

18

Our Zabbix Environment

• Zabbix holds a very template-centred view of deployment.

• The approach we’ve taken is to have ‘opt-in’ templates available for hosts.

• CPU Load, Memory Use, Network Traffic/Bandwidth and HDD Space checks are in a template added to all hosts with an agent installed.

19

Zabbix Templates

• Example Template: ‘Core Infrastructure Connectivity’.

When this template is applied to a host, the Zabbix agent on the host will ping those end-points locally. We can see if an individual host cannot connect to the time servers, domain controllers or our LDAP servers.
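The idea of checking connectivity from the host itself can be sketched as follows. This is not Zabbix configuration: it swaps the ping described above for a plain TCP connection attempt, and the function names and endpoint mapping are hypothetical.

```python
import socket
import time
from typing import Dict, Optional, Tuple


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the endpoint is unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None


def check_all(
    endpoints: Dict[str, Tuple[str, int]], timeout: float = 3.0
) -> Dict[str, Optional[float]]:
    """Check every named (host, port) pair and map each name to its result."""
    return {
        name: check_endpoint(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

A caller would pass a mapping such as a label for an LDAP server to its host name and port 389; any entry that comes back as None is an endpoint this particular host cannot reach.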

20

Blackboard and Zabbix

• We have multiple Blackboard-specific templates. One is in line with the last example, but it watches the availability and response times of external connectors, for example SafeAssign and Collaborate.

21

Blackboard and Zabbix

22

Blackboard and Zabbix

• One very powerful tool we have is JMX monitoring, which pulls information about the Blackboard application itself.

23

Zabbix Environment Mapping

Zabbix allows you to map relationships between nodes, showing where problems lie and what they impact.

For example, if there was a problem with file03, the line between bbdev01 and file03 would turn red, and file03’s status would change from OK to Problem. This is an easy way to assess what the problem will impact.
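The impact assessment that the map gives visually can also be expressed as a walk over a dependency graph. The sketch below is an illustration only and does not use Zabbix: bbdev01 and file03 come from the example above, while the function name and any other node names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def impacted_by(dependencies: Dict[str, Iterable[str]], failed: str) -> Set[str]:
    """Return every node that depends, directly or indirectly, on the failed node.

    dependencies maps each node to the nodes it relies on,
    e.g. {"bbdev01": ["file03"]} means bbdev01 needs file03.
    """
    # Invert the mapping: for each node, which nodes rely on it?
    dependants: Dict[str, Set[str]] = {}
    for node, needs in dependencies.items():
        for needed in needs:
            dependants.setdefault(needed, set()).add(node)

    # Breadth-first walk outwards from the failed node.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for node in dependants.get(current, ()):
            if node != failed and node not in impacted:
                impacted.add(node)
                queue.append(node)
    return impacted
```

With a mapping in which bbdev01 relies on file03, a failure of file03 reports bbdev01 along with anything that in turn relies on bbdev01.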

24

Mediasite and Zabbix

• Mediasite is really at the forefront of our monitoring through Zabbix.

• In Nagios, we currently have 5 checks per recorder in production.

• In Zabbix so far, I have 26 individual checks per recorder.

25

26

The Platforms in Collaboration

• The graph below shows the available space on our production Blackboard file share during the incident.

• Emergency maintenance was carried out on the 15th to increase the allocated disk space.

27

The Platforms in Collaboration

• An alert was set up in Splunk to let us know, in real time, when a student submits an assessment submission greater than 200 MB.
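The logic behind a size-based alert on submissions can be sketched in a few lines. This is not the Splunk alert definition: the event fields are hypothetical, and the threshold assumes that 200 MB is counted as 200 × 1024 × 1024 bytes.

```python
from typing import Dict, Iterable, List

# Assumption: "200 MB" is treated as 200 * 1024 * 1024 bytes.
THRESHOLD_BYTES = 200 * 1024 * 1024


def large_submissions(
    events: Iterable[Dict], threshold: int = THRESHOLD_BYTES
) -> List[Dict]:
    """Return the submission events whose size exceeds the threshold.

    Each event is assumed to be a dict with a "size_bytes" key (hypothetical
    field name); other keys are passed through untouched.
    """
    return [event for event in events if event["size_bytes"] > threshold]
```

Catching large uploads as they happen gives early warning of the kind of file share growth that led to the emergency maintenance described earlier.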

28

Self-Healing?

https://mediasite.usc.edu.au/Mediasite/Play/4af80791a9784f0bb418be531d7e31671d

The above video is the only way I could think of to present this particular part.

In the video, I have the Zabbix monitoring platform on one side, and a camera feed of the remote Mediasite recorder on the other.

As illustrated in the previous slide, a few checks are deemed “self-healing”, and this is one such scenario. If the Mediasite scheduler service fails or stops, Zabbix picks it up and recognises that something is not right. I have it send a command to the recorder to shut the software down and force a restart of the recorder.
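The check-then-remediate pattern described above can be sketched generically. This is not the Zabbix action configuration used at USC: the function name is hypothetical, and the status check and restart command are passed in as callables so that no particular service API is assumed.

```python
from typing import Callable


def self_heal(
    is_running: Callable[[], bool],
    remediate: Callable[[], None],
    max_attempts: int = 1,
) -> str:
    """Run a remediation when a service is down and report the outcome.

    Returns "ok" if the service was already running, "healed" if a
    remediation attempt brought it back, and "failed" otherwise.
    """
    if is_running():
        return "ok"
    for _ in range(max_attempts):
        remediate()  # e.g. stop the software and restart the recorder
        if is_running():
            return "healed"
    return "failed"
```

A "failed" result is the point at which a person should be alerted, since the automated restart did not bring the service back.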

29

Questions?