05 Monitoring the Cluster



Transcript of 05 Monitoring the Cluster


• System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
• Each Hadoop daemon running on a machine produces two logfiles.
• The first is the log output written via log4j. This file, which ends in .log, should be the first port of call when diagnosing problems, since most application log messages are written here.
• Old .log files are never deleted, so you should arrange for them to be periodically deleted or archived, so as not to run out of disk space on the local node.
• The second logfile is the combined standard output and standard error log. This file, which ends in .out, usually contains little or no output, since Hadoop uses log4j for logging. It is rotated when the daemon is restarted, and only the last five are retained. Old files are suffixed with a number between 1 and 5, with 5 being the oldest.
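The log directory and the identity string used in the log file names are set in hadoop-env.sh in the configuration directory. A minimal sketch, assuming you want logs under /var/log/hadoop (the path and identity value are assumptions; adjust for your installation):

# hadoop-env.sh
# Keep logs outside the install directory so they survive upgrades (assumed path)
export HADOOP_LOG_DIR=/var/log/hadoop
# Used in log file names, e.g. hadoop-hdfs-namenode-host1.log
export HADOOP_IDENT_STRING=hdfs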


Setting the Log Levels

Each Hadoop daemon exposes a /logLevel page in its web UI for changing the level of any log4j logger while the daemon is running. For example, to enable debug logging for the JobTracker class, go to http://localhost:50030/logLevel and set the log name org.apache.hadoop.mapred.JobTracker to the DEBUG log level, or run:

% hadoop daemonlog -setlevel jobtracker-host:50030 \
    org.apache.hadoop.mapred.JobTracker DEBUG

Log levels changed in this way are reset when the daemon restarts, which is usually what you want. However, to make a persistent change to a log level, change the log4j.properties file in the configuration directory. In this case, the line to add is:

log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
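The same tool can also read back the level currently in effect, which is handy for confirming that a change took hold. A quick check, assuming the JobTracker web UI is on port 50030 as in the example above:

% hadoop daemonlog -getlevel jobtracker-host:50030 \
    org.apache.hadoop.mapred.JobTracker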


The HDFS and MapReduce daemons collect information about events and measurements that are collectively known as metrics.

Metrics belong to a context, and Hadoop currently uses the “dfs”, “mapred”, “rpc”, and “jvm” contexts. Hadoop daemons usually collect metrics under several contexts.

You can view the raw metrics gathered by a particular Hadoop daemon by connecting to its /metrics web page, which is handy for debugging. For example, you can view jobtracker metrics in plain text at http://jobtracker-host:50030/metrics. To retrieve the metrics in JSON format, use http://jobtracker-host:50030/metrics?format=json.
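From the command line, the same pages can be fetched with any HTTP client; a minimal sketch using curl, with the host and port taken from the example above:

% curl http://jobtracker-host:50030/metrics
% curl 'http://jobtracker-host:50030/metrics?format=json'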

FileContext

FileContext writes metrics to a local file. It exposes two configuration properties: fileName, which specifies the absolute name of the file to write to, and period, the time interval (in seconds) between file updates. FileContext can be useful on a local system for debugging purposes, but it is unsuitable on a larger cluster since the output files are spread across the cluster, which makes analyzing them difficult.
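Contexts are wired up in hadoop-metrics.properties in the configuration directory. A minimal sketch that sends the “dfs” context to a local file (the file path is an assumption; adjust for your system):

# hadoop-metrics.properties
# Route "dfs" metrics to a local file, updated every 10 seconds
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.fileName=/tmp/dfsmetrics.log
dfs.period=10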


GangliaContext

Ganglia (http://ganglia.info/) is an open source distributed monitoring system for very large clusters. It is designed to impose very low resource overheads on each node in the cluster. Ganglia itself collects metrics, such as CPU and memory usage; by using GangliaContext, you can inject Hadoop metrics into Ganglia.
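A sketch of the corresponding hadoop-metrics.properties entries, assuming a Ganglia 3.1+ gmond listening on its default port 8649 (use GangliaContext instead of GangliaContext31 for older Ganglia releases, and substitute your own server address):

# Send the "dfs" and "mapred" contexts to Ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.servers=ganglia-host:8649
dfs.period=10
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.servers=ganglia-host:8649
mapred.period=10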

CompositeContext

CompositeContext allows you to output the same set of metrics to multiple contexts, such as a FileContext and a GangliaContext.
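A sketch of a CompositeContext configuration that writes the “jvm” context both to a local file and to Ganglia (the arity property declares the number of sub-contexts; the host and file path are assumptions):

jvm.class=org.apache.hadoop.metrics.spi.CompositeContext
jvm.arity=2
jvm.sub1.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/metrics_jvm.log
jvm.sub2.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.servers=ganglia-host:8649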


Java Management Extensions (JMX) is a standard Java API for monitoring and managing applications. Hadoop includes several managed beans (MBeans), which expose Hadoop metrics to JMX-aware applications. There are MBeans that expose the metrics in the “dfs” and “rpc” contexts.

Many third-party monitoring and alerting systems (such as Nagios or Hyperic) can query MBeans, making JMX the natural way to monitor your Hadoop cluster from an existing monitoring system.
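To query the MBeans from a remote monitoring host, the daemon's JVM must have remote JMX enabled. A minimal, unauthenticated sketch for the namenode in hadoop-env.sh (the port number and the choice to disable SSL and authentication are assumptions, suitable only for a trusted network):

# hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8004 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"

With this in place, you can point jconsole (or a JMX-capable Nagios plugin) at namenode-host:8004 to browse the Hadoop MBeans.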


It’s common to use Ganglia in conjunction with an alerting system like Nagios for monitoring a Hadoop cluster. Ganglia is good at efficiently collecting a large number of metrics and graphing them, whereas Nagios and similar systems are good at sending alerts when a critical threshold is reached in any of a smaller set of metrics.

Demo
