Simple practices in performance monitoring and evaluation

Page 1: Simple practices in performance monitoring and evaluation

Simple Practices in Performance Monitoring and Evaluation

Schubert Zhang 2016.3.24

Page 2: Simple practices in performance monitoring and evaluation

SLA

Service Level Agreements

https://en.wikipedia.org/wiki/Service-level_agreement

SLAs commonly include segments to address: a definition of services, performance measurement, problem management, customer duties, warranties, disaster recovery, termination of agreement.

Page 3: Simple practices in performance monitoring and evaluation

• API / IM SLA

• Performance

• Performance-oriented SLA

Page 4: Simple practices in performance monitoring and evaluation

Metrics for the SLA / Performance SLA

Performance Metrics

e.g. 1: API

• (99%)

e.g. 2: Call Center

• Abandonment Rate: Percentage of calls abandoned while waiting to be answered.

• ASA (Average Speed to Answer): Average time it takes for a call to be answered by the service desk.

• TSF (Time Service Factor): Percentage of calls answered within a definite timeframe, e.g., 80% in 20 seconds.

• FCR (First-Call Resolution): Percentage of incoming calls that can be resolved without the use of a callback or without having the caller call back the helpdesk to finish resolving the case.

• TAT (Turn-Around Time): Time taken to complete a certain task.

Page 5: Simple practices in performance monitoring and evaluation

Benchmarking

The quality of a service must be measured, evaluated, and benchmarked, and we must have a set of approaches for benchmarking.

Page 6: Simple practices in performance monitoring and evaluation

Metrics to be monitored

Page 7: Simple practices in performance monitoring and evaluation

Throughput

QPS / TPS / CPS

counted per second, per minute, per hour …

Page 8: Simple practices in performance monitoring and evaluation

Concurrency

Page 9: Simple practices in performance monitoring and evaluation

Latency

Response Time, Round-Trip Time (RTT), …

Average, Median, Min., Max., Percentile, …

Page 10: Simple practices in performance monitoring and evaluation

Quantile / Percentile

see the Google Sawzall paper
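Not in the slides, but as a minimal sketch of the idea: the nearest-rank method picks the p-th percentile out of sorted latency samples (the class and method names here are invented for illustration).

import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort, then take the value at ceil(p/100 * n) - 1.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] latenciesMs = {12, 7, 45, 9, 300, 11, 8, 10, 13, 9};
        System.out.println("p50 = " + percentile(latenciesMs, 50) + " ms"); // 10 ms
        System.out.println("p99 = " + percentile(latenciesMs, 99) + " ms"); // 300 ms
    }
}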

Page 11: Simple practices in performance monitoring and evaluation

A Summary of these Concepts

[Diagram: Client-1 … Client-N send requests to the server's pool of work threads; the picture is annotated with Throughput, Latency, and Concurrency between Clients and Server.]

Page 12: Simple practices in performance monitoring and evaluation

A Real-World Example

Page 13: Simple practices in performance monitoring and evaluation

Example 1: The Amazon Dynamo paper

Page 14: Simple practices in performance monitoring and evaluation
Page 15: Simple practices in performance monitoring and evaluation
Page 16: Simple practices in performance monitoring and evaluation

[Figures from the Dynamo paper: latency plots annotated with the average and the 99.9% quantile.]

Page 17: Simple practices in performance monitoring and evaluation

Example 2: An evaluation report for a NoSQL DB (Cassandra)

Page 18: Simple practices in performance monitoring and evaluation

Benchmark for Write API: cluster overview

• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-numbers and puts them into DaStor one by one.
  – Key space: 50 million
  – Size of a CDR: Thrift-compacted encoding, ~200 bytes

✓ Throughput: average ~80K ops/s; per node: average ~9K ops/s
✓ Latency: average ~0.5 ms
• Bottleneck: network (and memory)

[Charts: cluster throughput and latency over time]

Page 19: Simple practices in performance monitoring and evaluation

Benchmark for Read API

• Each node runs 8 clients (threads), 72 clients in total.
• Each client randomly picks a user-id/phone-number out of the 50-million space to get its recent 20 CDRs (one page) from DaStor.
• All clients read CDRs of a same day/bucket.

[Chart: percentage of read ops per 100 ms latency bucket, annotated with the average and the 97% quantile]

✓ Throughput: average ~140 ops/s; per node: average ~16 ops/s
✓ Latency: average ~500 ms, 97% < 2 s (SLA)
• Bottleneck: disk I/O (random seek); CPU load is very low

Page 20: Simple practices in performance monitoring and evaluation

Total & Delta

Total: the cumulative value of a counter since startup.
Delta: the change in that counter during the latest monitoring interval.
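A monitor usually reports deltas rather than raw totals. Not from the slides, just a minimal Java sketch of the pattern (the class name is made up):

import java.util.concurrent.atomic.AtomicLong;

public class TotalDeltaCounter {
    private final AtomicLong total = new AtomicLong(); // cumulative since startup
    private long lastSnapshot = 0;                     // total at the previous interval

    public void increment() {
        total.incrementAndGet();
    }

    // Called once per monitoring interval: ops that happened in the last interval.
    public synchronized long delta() {
        long now = total.get();
        long d = now - lastSnapshot;
        lastSnapshot = now;
        return d;
    }

    public long total() {
        return total.get();
    }
}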

Page 21: Simple practices in performance monitoring and evaluation

Generate the metrics and monitor them

Page 22: Simple practices in performance monitoring and evaluation

• On the server side

• Add an operation count and record the time cost for every client call

• For every monitoring interval, pull and push the current throughput and latency to the monitoring tool (Ganglia/Zabbix) or the console.

• Throughput = sum of count / time interval

• Latency = average (sum of latency / sum of count), max, min, quantile … (see the sketch below)

Code in GitLab and Gerrit
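The real code lives in GitLab/Gerrit as the slide says; what follows is only a hedged sketch of the two formulas above (CallMetrics and its method names are invented for illustration):

import java.util.concurrent.atomic.AtomicLong;

public class CallMetrics {
    private final AtomicLong count = new AtomicLong();         // operations in this interval
    private final AtomicLong timeCostNanos = new AtomicLong(); // summed per-call latency

    // Record one client call: increment the operation count, add its time cost.
    public void record(long elapsedNanos) {
        count.incrementAndGet();
        timeCostNanos.addAndGet(elapsedNanos);
    }

    // Called by the monitor every interval; resets the counters (delta semantics).
    public String reportAndReset(long intervalMillis) {
        long c = count.getAndSet(0);
        long t = timeCostNanos.getAndSet(0);
        double throughput = c * 1000.0 / intervalMillis;  // sum of count / time interval
        double avgLatencyMs = (c == 0) ? 0 : t / 1e6 / c; // sum of latency / sum of count
        return String.format("throughput=%.1f ops/s, latency(avg)=%.2f ms",
                throughput, avgLatencyMs);
    }
}

Max, min, and quantiles need the individual samples (or a histogram) rather than two counters; see the quantile/percentile notes on Page 10.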

Page 23: Simple practices in performance monitoring and evaluation

Code for Spring Project
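The transcript omits the code itself. One plausible shape for a Spring MVC (4.x era) project is a HandlerInterceptor that times every request; this sketch assumes the hypothetical CallMetrics class from the previous page:

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

public class MetricsInterceptor extends HandlerInterceptorAdapter {
    private static final String START_ATTR = "metrics.startNanos";
    private final CallMetrics metrics = new CallMetrics();

    @Override
    public boolean preHandle(HttpServletRequest req, HttpServletResponse resp,
                             Object handler) {
        req.setAttribute(START_ATTR, System.nanoTime()); // stamp the request
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest req, HttpServletResponse resp,
                                Object handler, Exception ex) {
        Long start = (Long) req.getAttribute(START_ATTR);
        if (start != null) {
            metrics.record(System.nanoTime() - start);   // count + time-cost per call
        }
    }
}

The interceptor would be registered through addInterceptors(...) in the MVC configuration, and a scheduled task would call reportAndReset(...) once per monitoring interval.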

Page 24: Simple practices in performance monitoring and evaluation

• Java

• JMX (Java Management Extensions, a simple example at https://github.com/schubertzhang/jsketch; a minimal sketch also follows this list)

• javaagent (java -javaagent:jarpath[=options]; the agent's premain method runs before main)

• jmxetric (use JMX and javaagent to display metrics to Ganglia, https://github.com/schubertzhang/jmxetric)

• Ganglia

• Zabbix

• …
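As a minimal sketch in the spirit of the jsketch example (the MBean interface, attributes, and domain name are assumptions, not taken from that repository): metrics exposed through a standard MBean can be polled by jmxetric, Ganglia, or jconsole.

import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// Standard MBean convention: the interface name must be <Impl> + "MBean".
interface PerfStatsMBean {
    long getOpCount();
    double getAvgLatencyMs();
}

public class PerfStats implements PerfStatsMBean {
    private volatile long opCount;
    private volatile double avgLatencyMs;

    public long getOpCount() { return opCount; }
    public double getAvgLatencyMs() { return avgLatencyMs; }

    public void update(long count, double avgMs) {
        opCount = count;
        avgLatencyMs = avgMs;
    }

    public static void main(String[] args) throws Exception {
        PerfStats stats = new PerfStats();
        // Register under a made-up ObjectName; any JMX client can now read the attributes.
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(stats, new ObjectName("demo:type=PerfStats"));
        stats.update(1000, 0.5);
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive for JMX clients
    }
}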

Page 25: Simple practices in performance monitoring and evaluation

Ganglia, Zabbix, etc.

Page 26: Simple practices in performance monitoring and evaluation

Performance Benchmark Programming

Demo: test and evaluate the throughput and latency of http://www.fangdd.com
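The demo source is not part of this transcript; below is a rough sketch of such a benchmark (the thread count and duration are arbitrary assumptions). It runs concurrent client threads against the target and reports exactly the two metrics defined earlier, throughput and average latency:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

public class HttpBench {
    static final AtomicLong ops = new AtomicLong();   // completed requests
    static final AtomicLong nanos = new AtomicLong(); // summed request latency

    public static void main(String[] args) throws Exception {
        final String target = "http://www.fangdd.com"; // from the slide
        final int threads = 8;                         // assumption
        final long durationMs = 10_000;                // assumption
        final long deadline = System.currentTimeMillis() + durationMs;

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (System.currentTimeMillis() < deadline) {
                    long start = System.nanoTime();
                    try {
                        HttpURLConnection conn =
                                (HttpURLConnection) new URL(target).openConnection();
                        try (InputStream in = conn.getInputStream()) {
                            while (in.read() != -1) { /* drain the body */ }
                        }
                        ops.incrementAndGet();
                        nanos.addAndGet(System.nanoTime() - start);
                    } catch (Exception e) {
                        // a real benchmark would count errors separately
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();

        long count = ops.get();
        System.out.printf("throughput = %.1f ops/s, avg latency = %.1f ms%n",
                count * 1000.0 / durationMs,
                count == 0 ? 0.0 : nanos.get() / 1e6 / count);
    }
}

Per-request samples would additionally be kept (or bucketed into a histogram) to report the median and percentiles, as in the screenshots that follow.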

Page 27: Simple practices in performance monitoring and evaluation

Demo Time …

Page 28: Simple practices in performance monitoring and evaluation

demo screenshots

Page 29: Simple practices in performance monitoring and evaluation

demo screenshots

[Chart from the demo: latency distribution, annotated with the average and the 95% quantile]

The long tail …

Page 30: Simple practices in performance monitoring and evaluation

Statistical Monitoring for Outliers

usually for troubleshooting

Page 31: Simple practices in performance monitoring and evaluation

Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.

The magic matrix:

Page 32: Simple practices in performance monitoring and evaluation

• Redis / Memcache

• Just add it at one point; very low-cost

• Very …

• Logs: ELK

Page 33: Simple practices in performance monitoring and evaluation

Heavy Logs & ELK

It’s another topic!

Page 34: Simple practices in performance monitoring and evaluation

Thank You!