005 cluster monitoring
Transcript of 005 cluster monitoring
Cluster Monitoring
2012/07/26
Scott Miao
2
Agenda
• Course Credit
• Introduction
• Metrics Framework
• Tools
  • Tools on wiki: http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List
3
Course Credit
• Show up: 30 points
• Ask a question: each question earns 5 points
• Hands-on: 40 points
• 70 points are needed to pass this course
• Course credit is calculated once for each course finished
• The course credit will be sent to you and your supervisor by mail
4
Introduction – (1/2)
• Using a cluster without monitoring and metrics is the same as driving a car while blindfolded
• It is great to run load tests against your HBase cluster, but you need to correlate the cluster’s performance with what the system is doing under the hood
5
Introduction – (2/2)
• Graphing
  • Captures the exposed metrics of a system and displays them in visual charts
  • A picture speaks a thousand words
  • Good for historical, quantitative data
• Monitoring
  • With graphs alone it is still difficult to see what a system is doing right now
  • Qualitative data is needed, which is handled by monitoring systems
  • Sends out emails to various recipients
  • Sends SMS messages to telephones
  • Runs customized scripts
6
The Metrics Framework – Basic Classes from Hadoop
7
The Metrics Framework – Extended Classes in HBase
8
The Metrics Framework – Classes Collaboration
9
The Metrics Framework – Metric Types – (1/3)
Metric Type | Description
Integer value (IV) | An integer counter. Only updated when the value changes
Long value (LV) | A long counter. Only updated when the value changes
Rate (R) | A float value representing a rate. When the metric is polled: (1) the rate is calculated as number of operations / elapsed time in seconds; (2) the rate is stored in the previous value field; (3) the internal counter is reset to zero; (4) the last polled timestamp is set to the current time; (5) the computed rate is returned to the caller
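The five polling steps listed for the Rate type can be sketched in a few lines. This is only an illustration in Python, not the Hadoop or HBase implementation: the class name `RateMetric`, its methods, and the `FakeClock` helper are hypothetical, and the fake clock exists only so the example runs without waiting for real time to pass.

```python
import time


class RateMetric:
    """Illustrative rate metric (hypothetical; not the Hadoop/HBase class)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable time source, in seconds
        self._count = 0              # operations seen since the last poll
        self._last_poll = clock()    # timestamp of the last poll
        self.previous_value = 0.0    # rate computed at the last poll

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_poll
        # 1. rate = number of operations / elapsed time in seconds
        rate = self._count / elapsed if elapsed > 0 else 0.0
        # 2. store the rate in the previous value field
        self.previous_value = rate
        # 3. reset the internal counter to zero
        self._count = 0
        # 4. set the last polled timestamp to the current time
        self._last_poll = now
        # 5. return the computed rate to the caller
        return rate


class FakeClock:
    """Hand-driven clock so the demo below is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
requests = RateMetric(clock)
requests.inc(50)        # 50 operations arrive...
clock.now = 10.0        # ...over 10 seconds
print(requests.poll())  # 5.0
```

Polling again without new operations returns 0.0, because step 3 cleared the counter.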
10
The Metrics Framework – Metric Types – (2/3)
Metric Type | Description
String (S) | Static, text-based information that is never reset or changed, e.g., the HBase version number, build date, and so on
Time varying integer (TVI) | The context keeps aggregating the value. When the value is polled it returns the accrued integer value and resets to zero, then accumulates again until the next poll
Time varying long (TVL) | Same as TVI, but uses a long
11
The Metrics Framework – Metric Types – (3/3)
Metric Type | Description
Time varying rate (TVR) | The number of operations or events and the time they required to complete. The values for operation count and time accrued are reset once the metric is polled
Persistent time varying rate (PTVR) | Same as TVR, but NOT reset on every poll
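The difference between TVR and PTVR comes down to whether polling clears the accumulated values. A minimal sketch of that behavior, assuming a hypothetical `TimeVaryingRate` class with a `reset_on_poll` flag (neither name comes from HBase):

```python
class TimeVaryingRate:
    """Tracks an operation count and the total time those operations took.

    reset_on_poll=True behaves like a TVR, False like a PTVR.
    Hypothetical illustration, not the HBase class.
    """

    def __init__(self, reset_on_poll=True):
        self.reset_on_poll = reset_on_poll
        self.ops = 0
        self.total_time_ms = 0

    def inc(self, num_ops, time_ms):
        """Record num_ops operations that together took time_ms."""
        self.ops += num_ops
        self.total_time_ms += time_ms

    def poll(self):
        """Return (operation count, average time per operation in ms)."""
        ops = self.ops
        avg = self.total_time_ms / ops if ops else 0.0
        if self.reset_on_poll:
            self.ops = 0
            self.total_time_ms = 0
        return ops, avg


tvr = TimeVaryingRate(reset_on_poll=True)
ptvr = TimeVaryingRate(reset_on_poll=False)
for metric in (tvr, ptvr):
    metric.inc(num_ops=4, time_ms=200)

print(tvr.poll(), tvr.poll())    # (4, 50.0) (0, 0.0)
print(ptvr.poll(), ptvr.poll())  # (4, 50.0) (4, 50.0)
```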
12
The Metrics Framework – Master Metrics
The master process exposes all metrics relating to its role in a cluster
Metric | Property Name | Description
Cluster requests (R) | hbase.master.cluster_requests | The total number of requests to the cluster, aggregated across all region servers
Split time (PTVR) | hbase.master.splitTime | The time it took to split the write-ahead log files after a restart
Split size (PTVR) | hbase.master.splitSize | The total size of the write-ahead log files that were split
13
The Metrics Framework – Region Server Metrics
• A substantial number of metrics here
• Includes details about different parts of the overall architecture inside the server
• Organized into the following groups
  • Block cache metrics
  • Compaction metrics
  • Memstore metrics
  • Store metrics
  • I/O metrics
  • Miscellaneous metrics
14
Region Server Metrics – Block cache metrics – (1/2)
Metric | Property Name | Description
count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
size (LV) | hbase.regionserver.blockCacheSize | The Java heap space occupied by the blocks currently in the cache
free (LV) | hbase.regionserver.blockCacheFree | Remaining heap for the cache
evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints
15
Region Server Metrics – Block cache metrics – (2/2)
Metric | Property Name | Description
cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of block cache hits
miss (LV) | hbase.regionserver.blockCacheMissCount | The number of block cache misses
hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
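Since the hit ratio is an integer value (IV), one plausible reading is a whole-number percentage derived from the hit and miss counts. The sketch below assumes exactly that; the function name is hypothetical, and treating total requests as hits plus misses is an assumption, not something the slides state.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Return cache hits as a whole-number percentage of all cache requests.

    Assumes total requests = hits + misses (an assumption for illustration).
    """
    total = hit_count + miss_count
    if total == 0:
        return 0  # no requests yet, so report 0 instead of dividing by zero
    return 100 * hit_count // total


print(block_cache_hit_ratio(900, 100))  # 90
```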
16
Region Server Metrics – Compaction metrics
Metric | Property Name | Description
compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
compaction time (PTVR) | hbase.regionserver.compactionTime | How long that operation took
compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)
Note: compaction size and compaction time are reported after a completed compaction run
17
Region Server Metrics – Memstore metrics
Metric | Property Name | Description
memstore size MB (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server, in megabytes
flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that will be flushed next (recommended for monitoring)
flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
flush time (PTVR) | hbase.regionserver.flushTime | The total time it took to flush the memstore
18
Region Server Metrics – Store metrics
Metric | Property Name | Description
store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
store file index size MB (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files, in megabytes
19
Region Server Metrics – I/O metrics
Metric | Property Name | Description
fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem
All numbers are in milliseconds
20
Region Server Metrics – Miscellaneous metrics
Metric | Property Name | Description
read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
requests (R) | hbase.regionserver.requests | The actual request rate per second
regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server
21
The Metrics Framework – RPC Metrics
Metric | Property Name | Description
RPC Process Time | rpc.metrics.RpcProcessingTime | The average time it took to process the RPCs on the server side
RPC Queue Time | rpc.metrics.RpcQueueTime | The time between when a call arrived and when it was actually processed, i.e., the queue time (recommended for monitoring)
22
The Metrics Framework – JVM Metrics
• Tuning the JVM settings is part of optimizing your HBase setup
• To do that, you need to know what is going on in the cluster
• Organized into the following groups
  • Memory usage metrics
  • Garbage collection metrics
  • Thread metrics
  • System event metrics
23
JVM Metrics – Memory usage metrics
Metric | Property Name
Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM
Description (all four): for what used versus committed memory means, see http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html
24
JVM Metrics – Garbage collection metrics
Metric | Property Name | Description
gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection
• The garbage collection process causes so-called stop-the-world pauses in certain steps
• These pauses are difficult to handle when a system is bound by tight SLAs
• Problems arise when the pauses approach the multiminute range, because a region server can then miss its ZooKeeper lease renewal, forcing the master to take evasive actions
• This is the so-called “Juliet Pause”
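To spot growing garbage collection cost before pauses get that long, a script can compare two samples of gc count and gc time millis. The sketch below assumes both values are running totals and that the samples are taken a known interval apart; the helper name `gc_overhead` and the sample numbers are made up. The average it returns is GC time per collection, which is only a rough stand-in for pause length.

```python
def gc_overhead(prev, curr, interval_ms):
    """Summarize GC activity between two samples.

    prev and curr are (gc_count, gc_time_millis) pairs, assumed to be
    running totals sampled interval_ms milliseconds apart.
    Returns (collections, fraction of the interval spent in GC,
             average GC time per collection in ms).
    """
    collections = curr[0] - prev[0]
    gc_ms = curr[1] - prev[1]
    fraction = gc_ms / interval_ms
    avg_ms = gc_ms / collections if collections else 0.0
    return collections, fraction, avg_ms


# Two hypothetical samples taken 60 seconds apart.
print(gc_overhead((100, 5000), (104, 8000), 60000))  # (4, 0.05, 750.0)
```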
25
JVM Metrics – Thread metrics
Metric | Property Name
new state | jvm.RegionServer.metrics.threadsNew
runnable state | jvm.RegionServer.metrics.threadsRunnable
blocked state | jvm.RegionServer.metrics.threadsBlocked
waiting state | jvm.RegionServer.metrics.threadsWaiting
timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
terminated state | jvm.RegionServer.metrics.threadsTerminated
Description (all six): the count for each possible thread state, including new, runnable, blocked, and so on. For details, refer to the following docs:
http://www.programcreek.com/2009/03/thread-status/
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
26
JVM Metrics – System event metrics
Metric | Property Name
log fatal | jvm.RegionServer.metrics.logFatal
log error | jvm.RegionServer.metrics.logError
log warn | jvm.RegionServer.metrics.logWarn
log info | jvm.RegionServer.metrics.logInfo
Description (all four): system event metrics provide counts for various log-level events, e.g., the log error metric provides the number of log events that occurred on the error level
27
The Metrics Framework – Info Metrics
• Only accessible through JMX
28
The Metrics Framework
• If you find other metrics not listed here, please refer to the API docs directly: http://hbase.apache.org/apidocs/index.html?overview-summary.html
29
Tools - Ganglia
A distributed, scalable monitoring system suitable for large cluster systems
HBase inherits its native support for Ganglia directly from Hadoop
30
Ganglia – Three components
• Ganglia monitoring daemon (gmond)
  • Runs on every machine that is monitored
  • Collects the local data and prepares the statistics to be polled by other systems
• Ganglia meta daemon (gmetad)
  • Is installed on a central node
  • Acts as the federation node to the entire cluster
  • Polls one or more monitoring daemons to receive the current cluster status
• Ganglia PHP web frontend
  • Retrieves the combined statistics from the meta daemon and presents them as HTML
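gmond prepares its statistics as an XML document that pollers such as gmetad read over TCP (typically port 8649 in a default setup). A minimal sketch of pulling metric values out of such a document follows. The sample XML is hand-written and heavily trimmed, and the cluster name, host names, metric names, and values in it are hypothetical; reading from the socket is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the general shape of gmond output.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="hbase-training">
    <HOST NAME="rs1.example.com" IP="10.0.0.1">
      <METRIC NAME="hbase.regionserver.compactionQueueSize" VAL="3" TYPE="int32" UNITS=""/>
      <METRIC NAME="hbase.regionserver.flushQueueSize" VAL="0" TYPE="int32" UNITS=""/>
    </HOST>
    <HOST NAME="rs2.example.com" IP="10.0.0.2">
      <METRIC NAME="hbase.regionserver.compactionQueueSize" VAL="7" TYPE="int32" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Return {host name: {metric name: value string}} from gmond-style XML."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result


for host, metrics in metrics_by_host(SAMPLE_XML).items():
    print(host, metrics.get("hbase.regionserver.compactionQueueSize"))
```

Values arrive as strings; the `TYPE` attribute indicates how a consumer would convert them.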
31
Ganglia - Installation
http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start
32
Tools - Nagios
• Polls current metrics on a regular basis and compares them with given thresholds
• Once the thresholds are exceeded, it will start evasive actions
  • Ranging from sending out emails or SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
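The compare-against-thresholds step can be sketched as a small check function. The status codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name and the threshold values are hypothetical, and how the metric value is fetched is left out.

```python
# Conventional Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(name, value, warn, crit):
    """Compare a metric value with warning and critical thresholds.

    Returns (status code, one-line status message). A value of None
    means the metric could not be read.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {name} is not available"
    if value >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    return status, f"{label} - {name}={value} (warn={warn}, crit={crit})"


# Hypothetical thresholds for the compaction queue size.
code, message = check_threshold("compactionQueueSize", 12, warn=10, crit=20)
print(code, message)  # 1 WARNING - compactionQueueSize=12 (warn=10, crit=20)
```

A real plugin would print the message and exit with the status code, which is what Nagios uses to decide whether to notify anyone.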
33
Tools - JMX
• Java Management Extensions technology
• The standard for Java applications to export their status
• Also has the ability to provide operations
• Common tools for JMX
  • JConsole
  • JMXToolkit
• http://hbase.apache.org/metrics.html
34
Hands-on
• Use the Ganglia “Aggregate Graphs” feature
  • Title it with your name
  • Include 5 hosts
  • Use any two metrics
  • Crop the image file, just like this sample
• Put the image file into Git
  • YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
  • mkdir ${YOUR_HOME}
  • Put your hands-on into ${YOUR_HOME}