005 cluster monitoring

Cluster Monitoring
2012/07/26
Scott Miao

Agenda
• Course Credit
• Introduction
• Metrics Framework
• Tools
  • Tools on wiki: http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List

Course Credit
• Show up: 30 points
• Ask a question: each question earns 5 points
• Hands-on: 40 points
• 70 points are needed to pass this course
• Course credit is calculated once for each course finished
• The course credit will be sent to you and your supervisor by mail

Introduction – (1/2)
• Using a cluster without monitoring and metrics is the same as driving a car while blindfolded
• It is great to run load tests against your HBase cluster, but you need to correlate the cluster’s performance with what the system is doing under the hood

Introduction – (2/2)
• Graphing
  • Captures the exposed metrics of a system and displays them in visual charts
  • A picture speaks a thousand words
  • Good for historical, quantitative data
• Monitoring
  • With graphs alone it is still difficult to see what a system is doing right now
  • Qualitative data is needed, which is handled by monitoring systems
  • Sends out emails to various recipients
  • Sends SMS messages to telephones
  • Runs customized scripts


The Metrics Framework – Basic Classes from Hadoop


The Metrics Framework – Extended Classes in HBase


The Metrics Framework – Classes Collaboration


The Metrics Framework – Metric Types – (1/3)

Metric Type | Description
Integer value (IV) | An integer counter. Only updated when the value changes.
Long value (LV) | A long counter. Only updated when the value changes.
Rate (R) | A float value representing a rate. When the metric is polled: (1) the rate is calculated as number of operations / elapsed time in seconds; (2) the rate is stored in the previous value field; (3) the internal counter is reset to zero; (4) the last polled timestamp is set to the current time; (5) the computed rate is returned to the caller.
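
The five polling steps of the rate metric can be sketched in a few lines of Python. This is only a toy model to make the sequence concrete, not the actual Hadoop or HBase class; the names RateMetric, inc, poll, and the injectable clock are assumptions made for illustration.

```python
import time


class RateMetric:
    """Toy model of a rate (R) metric; hypothetical, not the Hadoop class."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable clock, handy for testing
        self._count = 0                # operations seen since the last poll
        self._last_polled = clock()    # timestamp of the last poll
        self.previous_value = 0.0      # rate computed by the last poll

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_polled
        # Step 1: rate = number of operations / elapsed time in seconds
        rate = self._count / elapsed if elapsed > 0 else 0.0
        self.previous_value = rate     # Step 2: store in previous value field
        self._count = 0                # Step 3: reset the internal counter
        self._last_polled = now        # Step 4: update last polled timestamp
        return rate                    # Step 5: return the computed rate


# Usage with a fake clock: 50 operations over 10 seconds
ticks = iter([0.0, 10.0])
metric = RateMetric(clock=lambda: next(ticks))
metric.inc(50)
print(metric.poll())  # 5.0
```

Passing the clock in as a parameter is a design choice of this sketch so that the elapsed time can be controlled in the example.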

The Metrics Framework – Metric Types – (2/3)

Metric Type | Description
String (S) | Static, text-based information that is never reset or changed, e.g., HBase version number, build date, and so on.
Time varying integer (TVI) | The context keeps aggregating the value. When the value is polled, it returns the accrued integer value and resets to zero until it is polled again.
Time varying long (TVL) | Same as TVI, but uses a long.
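
The accumulate-then-reset behavior of a time varying integer can be illustrated with a small sketch. The class and method names here are hypothetical and chosen only for this example; the real framework classes may differ.

```python
class TimeVaryingInt:
    """Toy model of a time varying integer (TVI); names are hypothetical."""

    def __init__(self):
        self._value = 0

    def inc(self, n=1):
        """Keep aggregating the value between polls."""
        self._value += n

    def poll(self):
        """Return the accrued value and reset it to zero."""
        value, self._value = self._value, 0
        return value


tvi = TimeVaryingInt()
tvi.inc(3)
tvi.inc(4)
print(tvi.poll())  # 7
print(tvi.poll())  # 0, nothing accrued since the previous poll
```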

The Metrics Framework – Metric Types – (3/3)

Metric Type | Description
Time varying rate (TVR) | Tracks the number of operations or events and the time they required to complete. The values for operation count and time accrued are reset once the metric is polled.
Persistent time varying rate (PTVR) | Same as TVR, but NOT reset on every poll.
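
The difference between TVR and PTVR is only whether a poll resets the accrued values. The following sketch models both with one hypothetical class and a persistent flag; returning the operation count together with the average time per operation is an assumption of this example, not a statement about the real classes.

```python
class TimeVaryingRate:
    """Toy model of TVR and PTVR; class and field names are hypothetical."""

    def __init__(self, persistent=False):
        self.persistent = persistent   # True models PTVR, False models TVR
        self.num_ops = 0               # operations accrued so far
        self.total_time_ms = 0         # time those operations took

    def inc(self, num_ops, time_ms):
        """Record num_ops operations that took time_ms in total."""
        self.num_ops += num_ops
        self.total_time_ms += time_ms

    def poll(self):
        """Return (operation count, average time per operation in ms)."""
        ops, total = self.num_ops, self.total_time_ms
        avg = total / ops if ops else 0.0
        if not self.persistent:
            # TVR: operation count and accrued time are reset on every poll
            self.num_ops = 0
            self.total_time_ms = 0
        return ops, avg


tvr = TimeVaryingRate()
ptvr = TimeVaryingRate(persistent=True)
for m in (tvr, ptvr):
    m.inc(2, 30)
    m.inc(1, 30)
print(tvr.poll(), tvr.poll())    # (3, 20.0) (0, 0.0)
print(ptvr.poll(), ptvr.poll())  # (3, 20.0) (3, 20.0)
```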

The Metrics Framework – Master Metrics

The master process exposes all metrics relating to its role in a cluster

Metric | Property Name | Description
Cluster requests (R) | hbase.master.cluster_requests | The total number of requests to the cluster, aggregated across all region servers
Split time (PTVR) | hbase.master.splitTime | The time it took to split the write-ahead log files after a restart
Split size (PTVR) | hbase.master.splitSize | The total size of the write-ahead log files that were split

The Metrics Framework – Region Server Metrics
• A substantial number of metrics here
• Includes details about different parts of the overall architecture inside the server
• Divided into the following groups
  • Block cache metrics
  • Compaction metrics
  • Memstore metrics
  • Store metrics
  • I/O metrics
  • Miscellaneous metrics

Region Server Metrics – Block cache metrics – (1/2)

Metric | Property Name | Description
count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
size (LV) | hbase.regionserver.blockCacheSize | The total size of the blocks currently in the cache, i.e., the Java heap space they occupy
free (LV) | hbase.regionserver.blockCacheFree | The remaining heap space available for the cache
evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints

Region Server Metrics – Block cache metrics – (2/2)

Metric | Property Name | Description
cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of block cache hits
miss (LV) | hbase.regionserver.blockCacheMissCount | The number of block cache misses
hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
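
As a worked example of the hit ratio, the sketch below relates the hit count to the total number of cache requests (hits plus misses). Because the metric is listed as an integer value, the sketch assumes it is expressed as a whole-number percentage; that assumption and the function name are for illustration only.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Hits as a whole-number percentage of all cache requests.

    Assumes total requests = hits + misses and a percentage scale.
    """
    total = hit_count + miss_count
    if total == 0:
        return 0  # no requests yet, so there is nothing to relate
    return 100 * hit_count // total


print(block_cache_hit_ratio(900, 100))  # 90
```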

Region Server Metrics – Compaction metrics

Metric | Property Name | Description
compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
compaction time (PTVR) | hbase.regionserver.compactionTime | How long the compaction took. Both the compaction size and time are reported after a completed compaction run
compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)

Region Server Metrics – Memstore metrics

Metric | Property Name | Description
memstore size MB (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server, in megabytes
flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that are going to be flushed next (recommended for monitoring)
flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
flush time (PTVR) | hbase.regionserver.flushTime | The total time it took to flush the memstore

Region Server Metrics – Store metrics

Metric | Property Name | Description
store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
store file index size MB (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files, in megabytes

Region Server Metrics – I/O metrics

Metric | Property Name | Description
fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem

All numbers are in milliseconds

Region Server Metrics – Miscellaneous metrics

Metric | Property Name | Description
read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
requests (R) | hbase.regionserver.requests | The actual request rate per second
regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server

The Metrics Framework – RPC Metrics

Metric | Property Name | Description
RPC Process Time | rpc.metrics.RpcProcessingTime | The average time it took to process the RPCs on the server side
RPC Queue Time | rpc.metrics.RpcQueueTime | The time between when the call arrived and when it is actually processed, i.e., the queue time (recommended for monitoring)

The Metrics Framework – JVM Metrics
• Tuning the JVM settings to optimize your HBase setup
• You need to know what is going on in the cluster
• Divided into the following groups
  • Memory usage metrics
  • Garbage collection metrics
  • Thread metrics
  • System event metrics

JVM Metrics – Memory usage metrics

Metric | Property Name
Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM

For what used versus committed memory means, see http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html




JVM Metrics – Garbage collection metrics

Metric | Property Name | Description
gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection

• The garbage collection process causes so-called stop-the-world pauses in certain steps
• These pauses are difficult to handle when a system is bound by tight SLAs
• They are especially problematic when they approach the multiminute range, because a region server can then miss its ZooKeeper lease renewal, forcing the master to take evasive actions
  • The so-called “Juliet Pause”

http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
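
One way to read the two garbage collection metrics is to compare two consecutive samples of the accumulated GC time. The sketch below is a hedged illustration: the function names are hypothetical, the session timeout is a value you would supply yourself, and the GC time accrued between two samples is only an upper bound on the longest single pause.

```python
def gc_share(prev_gc_millis, curr_gc_millis, interval_millis):
    """Fraction of the sampling interval that was spent in GC."""
    if interval_millis <= 0:
        raise ValueError("interval_millis must be positive")
    return (curr_gc_millis - prev_gc_millis) / interval_millis


def may_miss_lease(prev_gc_millis, curr_gc_millis, session_timeout_millis):
    """True if GC time accrued between two samples reaches the timeout.

    The accrued time may be spread over several shorter pauses, so a True
    result is a hint to investigate, not proof of a missed lease renewal.
    """
    return (curr_gc_millis - prev_gc_millis) >= session_timeout_millis


# Two samples of gc time millis taken 60 seconds apart (made-up numbers)
print(gc_share(1000, 4000, 60000))         # 0.05
print(may_miss_lease(1000, 4000, 180000))  # False
```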

JVM Metrics – Thread metrics

Metric | Property Name
new state | jvm.RegionServer.metrics.threadsNew
runnable state | jvm.RegionServer.metrics.threadsRunnable
blocked state | jvm.RegionServer.metrics.threadsBlocked
waiting state | jvm.RegionServer.metrics.threadsWaiting
timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
terminated state | jvm.RegionServer.metrics.threadsTerminated

These metrics provide the count for each possible thread state, including new, runnable, blocked, and so on. For details, refer to the following docs:
http://www.programcreek.com/2009/03/thread-status/
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html




JVM Metrics – System event metrics

Metric | Property Name
log fatal | jvm.RegionServer.metrics.logFatal
log error | jvm.RegionServer.metrics.logError
log warn | jvm.RegionServer.metrics.logWarn
log info | jvm.RegionServer.metrics.logInfo

System event metrics provide counts for various log-level events, e.g., the log error metric provides the number of log events that occurred on the error level.

The Metrics Framework – Info Metrics
• Only accessible through JMX

The Metrics Framework
• If you find other metrics not listed here, please refer to the API docs directly
  • http://hbase.apache.org/apidocs/index.html?overview-summary.html



Tools - Ganglia
• A distributed, scalable monitoring system suitable for large cluster systems
• HBase inherits its native support for Ganglia directly from Hadoop

Ganglia – Three components
• Ganglia monitoring daemon (gmond)
  • Runs on every machine that is monitored
  • Collects the local data and prepares the statistics to be polled by other systems
• Ganglia meta daemon (gmetad)
  • Is installed on a central node
  • Acts as the federation node to the entire cluster
  • Polls one or more monitoring daemons to receive the current cluster status
• Ganglia PHP web frontend
  • Retrieves the combined statistics from the meta daemon and presents them as HTML


Ganglia - Installation

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start



Tools - Nagios
• Polls current metrics on a regular basis and compares them with given thresholds
• Once the thresholds are exceeded, it will start evasive actions
  • Ranging from sending out emails and SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
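
The threshold comparison described above can be sketched as a small check function. The return values follow the usual Nagios plugin exit-code convention (0 for OK, 1 for WARNING, 2 for CRITICAL); the function name and the threshold numbers are hypothetical, and the example value stands in for a metric such as the compaction queue size.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # exit codes used by Nagios plugins


def check_threshold(value, warn, crit):
    """Compare a polled metric value with warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Hypothetical thresholds for a queue-size style metric
status = check_threshold(value=12, warn=10, crit=20)
print(status)  # 1, i.e. WARNING
```

A real check would end with sys.exit(status) so that Nagios can read the result and decide which action to trigger.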

Tools - JMX
• Java Management Extensions technology
  • The standard for Java applications to export their status
  • Also has the ability to provide operations
• Common tools for JMX
  • JConsole
  • JMXToolkit
• http://hbase.apache.org/metrics.html

Hands-on
• Use the Ganglia “Aggregate Graphs” feature
  • Title the graph with your name
  • Include 5 hosts
  • Use any two metrics
  • Crop the image file, just like the sample image
• Put the image file into Git
    YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
    mkdir ${YOUR_HOME}
  • Put your hands-on into ${YOUR_HOME}
