Nick LeRoy Computer Sciences Department University of Wisconsin-Madison [email protected] Hawkeye.
Nick LeRoy
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor/hawkeye
Hawkeye
www.cs.wisc.edu/condor
What is Hawkeye?
› A monitoring and management tool for distributed systems
› That's great, but... What does that mean? What can Hawkeye do for me?
What does that mean?
› Hawkeye is a tool that can be used to monitor various aspects of your computers
› Examples:
  › System load monitoring
  › Watching for runaway processes
  › Monitoring the health of your Condor pool
What can Hawkeye do?
› Hawkeye can alert you when things go wrong. For example, Hawkeye can:
  › Alert you when virtually any condition is found
  › Alert you when various Condor problems are identified
  › Allow you to specify your own custom alerts
Why Hawkeye?
› Make system administration easier
› Make Condor pool maintenance easier
Hawkeye Architecture
[Figure: architecture diagram. Labeled components: Hawkeye Module (three), Hawkeye Monitoring Agent (two), Hawkeye Manager, Condor Pool, Grid]
Hawkeye Matchmaking
› Hawkeye alerts are done using ClassAd matchmaking.
[Figure: a MachineAd and a TriggerAd are matched; a match produces an alert]
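The matchmaking idea can be sketched in a few lines of Python. This is only an illustration, not Hawkeye's implementation: plain dicts and a lambda stand in for ClassAds, the machine names (node1, node2) and the "Low memory" condition are hypothetical, and the RAM_* attribute names are borrowed from the example ClassAd snippet in this talk.

```python
# Illustrative only: dicts and lambdas stand in for real ClassAds.
machine_ads = [
    {"Name": "node1", "RAM_MemFree": 841932800, "RAM_MemTotal": 1055367168},
    {"Name": "node2", "RAM_MemFree": 10485760, "RAM_MemTotal": 1055367168},
]

trigger_ads = [
    {
        "Name": "Low memory",
        # Hypothetical condition: less than 5% of RAM is free.
        "AlertTrigger": lambda ad: ad["RAM_MemFree"] < 0.05 * ad["RAM_MemTotal"],
    },
]


def match_alerts(machines, triggers):
    """Return (trigger name, machine name) for every trigger that matches."""
    return [
        (t["Name"], m["Name"])
        for m in machines
        for t in triggers
        if t["AlertTrigger"](m)
    ]


alerts = match_alerts(machine_ads, trigger_ads)
print(alerts)
```

Every trigger is tested against every machine ad, and each successful match becomes one alert.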
Hawkeye ClassAds
› Hawkeye uses ClassAds to represent collected data
  › Schema-free data representation
  › Provides a matching mechanism
  › Represent whatever data you gather in a way that works best for you
Hawkeye ClassAds
› Example ClassAd “snippet”:
RAM_MemFree = 841932800
RAM_MemShared = 0
RAM_MemTotal = 1055367168
RAM_SwapCached = 0
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
Hawkeye ClassAds
› Example ClassAd “snippet” #2:
Condor_NumExecs = 2
Condor_NumMasters = 1
Condor_NumRunaway = 2
Condor_NumSchedds = 0
Condor_NumShadows = 0
Condor_NumStartds = 1
Condor_NumStarters = 2
Condor_RunawayPids = "3214,8753"
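Because the data is schema-free "Name = value" pairs, snippets like the two above are easy to post-process. The following Python sketch is an assumption-laden illustration, not part of Hawkeye: the parse_attributes helper is hypothetical and handles only integers and double-quoted strings, not the full ClassAd expression language.

```python
def parse_attributes(text):
    """Parse 'Name = value' lines into a dict (integers and quoted strings only)."""
    attrs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition("=")
        value = value.strip()
        if value.startswith('"') and value.endswith('"'):
            attrs[name.strip()] = value[1:-1]
        else:
            attrs[name.strip()] = int(value)
    return attrs


snippet = """
RAM_MemFree = 841932800
RAM_MemTotal = 1055367168
Condor_NumRunaway = 2
Condor_RunawayPids = "3214,8753"
"""

ad = parse_attributes(snippet)
# Derived values computed from the collected attributes.
free_fraction = ad["RAM_MemFree"] / ad["RAM_MemTotal"]
runaway_pids = [int(p) for p in ad["Condor_RunawayPids"].split(",")]
print(f"{free_fraction:.1%} of RAM free; runaway PIDs: {runaway_pids}")
```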
Sample Alert Trigger
[
AlertTrigger = ( MyType == "Pool" && Absent.count > 5 );
AlertSeverity = ( Absent.count > 5 ) ? 1 : 0;
Name = "Absent Nodes";
AlertText = StrCat(Absent.count,
                   " machines are missing in ",
                   Name)
]
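To make the three expressions in the sample trigger concrete, here is a rough Python rendering of how they might evaluate against one ad. It is a sketch under stated assumptions: the nested dict standing in for Absent.count, the example pool_ad, and the choice to resolve Name to the trigger's own Name attribute are illustrative guesses, not Hawkeye's actual evaluation rules.

```python
def evaluate_trigger(ad):
    """Evaluate the sample 'Absent Nodes' trigger against one ad (a dict)."""
    name = "Absent Nodes"            # the trigger's Name attribute
    absent = ad["Absent"]["count"]   # stands in for Absent.count
    fired = ad["MyType"] == "Pool" and absent > 5   # AlertTrigger
    severity = 1 if absent > 5 else 0               # AlertSeverity (ternary)
    text = f"{absent} machines are missing in {name}"  # AlertText via StrCat
    return {"AlertTrigger": fired, "AlertSeverity": severity, "AlertText": text}


# Hypothetical pool ad with seven absent machines.
pool_ad = {"MyType": "Pool", "Absent": {"count": 7}}
result = evaluate_trigger(pool_ad)
print(result["AlertText"])
```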
Hawkeye at UW
› Currently at UW, we're using Hawkeye:
  › To monitor our Condor cluster
  › To aid in detecting and correcting cluster problems
  › To monitor the US/CMS testbed health
Customizing Hawkeye
› Hawkeye allows you to run your own custom “modules” to gather data.
› Hawkeye allows you to set your own custom “alerts” on attributes generated by “standard” and “custom” modules.
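A custom module might look roughly like the Python sketch below. Everything about the interface here is an assumption: the talk does not specify how modules hand data to Hawkeye, so the sketch simply prints "Name = value" lines in the style of the example ClassAd snippets. The Disk_Total and Disk_Free attribute names are hypothetical.

```python
import shutil


def collect_disk_attributes(path="/"):
    """Gather disk-space figures for one filesystem as attribute/value pairs."""
    usage = shutil.disk_usage(path)
    return {
        "Disk_Total": usage.total,  # bytes
        "Disk_Free": usage.free,    # bytes
    }


def format_attributes(attrs):
    """Render attributes as 'Name = value' lines (assumed output format)."""
    return "\n".join(f"{name} = {value}" for name, value in attrs.items())


attrs = collect_disk_attributes()
print(format_attributes(attrs))
```

An alert could then be defined on Disk_Free in the same way as on any attribute from a standard module.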
What is the status of Hawkeye?
› Hawkeye 1.0 Release Candidate 1 (RC1)
› Current module library includes modules to monitor system load, users, disk space, Condor, and more
› Available from http://cs.wisc.edu/condor/hawkeye