Monitoring Swift · Monitoring Swift OpenStack Summit, Austin 2016 Adam Takvam, Sr. Systems...

Monitoring Swift OpenStack Summit, Austin 2016 Adam Takvam, Sr. Systems Engineer Martin Lanner, Engagement Manager April 28, 2016




Overview

• Problems
  - Usage intelligence
  - Capacity planning
  - Operational health
  - Audit trails

• Background
  - Methods: logs + system metrics
  - Interpretation of metrics
  - Actions: thresholds + alerting

• Swift key monitoring concepts
  - What to monitor?
  - How to monitor

• Monitoring methods - demos
  - Logging: ELK
  - Trending/Forecasting: Prometheus + Grafana
  - System monitoring: Zabbix


It’s Linux!


Properties of Swift

• Distributed system

• Extremely durable through replication or erasure coding

• No single point of failure

• Even distribution of data

• Resilient

• Self-healing capabilities

• Can take a lot of abuse and negligence


Anatomy of a Monitoring Solution

• Agent: Gathers metrics on a host and either pushes or advertises them
  - Logstash
  - Prometheus Node Exporter
  - Zabbix Agent
  - Nagios NRPE

• Aggregation engine: Collects metrics from agents and provides an API with access to aggregated metric values
  - Nagios
  - Zabbix
  - Elasticsearch
  - Prometheus

• Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
  - Kibana
  - Grafana

• Alerting: Uses metric thresholds to trigger alerts when metrics fall out of an acceptable range
  - AlertManager
  - PagerDuty
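How the agent, aggregation, and alerting roles fit together can be sketched in a few lines of Python. This is only an illustration: the names Agent, Aggregator, and out_of_range are hypothetical and are not the API of Logstash, Prometheus, Zabbix, or Nagios. The agent here pushes its metrics; an agent that advertises them would be polled by the aggregation engine instead.

```python
from collections import defaultdict
from statistics import mean


class Aggregator:
    """Hypothetical stand-in for an aggregation engine: stores samples, serves aggregates."""

    def __init__(self):
        self._series = defaultdict(list)  # (host, metric) -> list of values

    def push(self, host, metric, value):
        self._series[(host, metric)].append(value)

    def average(self, metric):
        """Average of one metric across all hosts and samples."""
        values = [v for (_, m), vs in self._series.items() if m == metric for v in vs]
        return mean(values)  # raises StatisticsError if nothing was pushed


class Agent:
    """Hypothetical stand-in for a host agent that pushes the metrics it gathers."""

    def __init__(self, host, collectors, aggregator):
        self.host = host
        self.collectors = collectors  # metric name -> zero-argument callable
        self.aggregator = aggregator

    def run_once(self):
        for metric, collect in self.collectors.items():
            self.aggregator.push(self.host, metric, collect())


def out_of_range(value, low, high):
    """Alerting rule: True when a metric falls outside [low, high]."""
    return not (low <= value <= high)
```

A visualizer would read from the same aggregation API that the alerting rule uses, which is why the aggregation engine sits in the middle of the stack.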


Developing a Monitoring Strategy

Forms of Monitoring

• System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing

• Performance: Transaction latency

• Errors: Invalid requests or states

• Outages: Service failures

• Feature usage: Understand CRUD operations and traffic patterns

• Audit trail: Who did what when?

Monitoring Lifecycle

• Measurement

• Reporting

• Characterization

• Thresholds

• Alerting

• Root cause analysis

• Remediation
  - Manual
  - Automated
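The characterization, thresholds, and alerting steps of the lifecycle can be sketched in Python. The rule used here (baseline mean plus k standard deviations) is an assumption chosen for illustration, not a method taken from the talk, and the function names are hypothetical.

```python
from statistics import mean, stdev


def characterize(baseline):
    """Characterization: summarize what 'normal' looks like for a metric."""
    return mean(baseline), stdev(baseline)


def threshold(baseline, k=3.0):
    """Thresholds: assumed rule, flag anything above mean + k * stddev."""
    mu, sigma = characterize(baseline)
    return mu + k * sigma


def breaches(samples, limit):
    """Alerting: return the samples that exceed the threshold."""
    return [s for s in samples if s > limit]
```

For example, with baseline latencies of 10, 12, 11, 9, and 8 ms, the threshold lands a little under 15 ms, so a 20 ms sample would be flagged while a 14 ms sample would not.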


Examples of monitoring methods

• ELK: Usage intelligence
  - Who?
  - Agents
  - HTTP response codes
  - Errors
  - Audit trails

• Prometheus: Capacity planning
  - Data growth
  - Trending analytics

• Zabbix: Operational health
  - Network
  - CPU
  - RAM


Key concepts for monitoring Swift

• Cluster full
  - df
  - Data growth
  - Capacity planning

• Networking
  - Availability
  - Saturation

• Proxy state
  - CPU
  - /healthcheck

• Auditing cycles

• Replication cycle timing
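The "cluster full" check amounts to running df over the storage mount points and adding up the results. A minimal Python sketch, assuming the drives are mounted under /srv/node (the prefix used in the alerting example in this deck); the function names are hypothetical, and the usage parameter exists only so the logic can be exercised without real drives.

```python
import shutil


def fill_ratios(mount_points, usage=shutil.disk_usage):
    """Per-device fill ratio (used / total) for each mount point."""
    ratios = {}
    for path in mount_points:
        u = usage(path)  # object with .total and .used, in bytes
        ratios[path] = u.used / u.total
    return ratios


def cluster_fill(mount_points, usage=shutil.disk_usage):
    """Overall fill ratio: total used bytes over total capacity."""
    total = used = 0
    for path in mount_points:
        u = usage(path)
        total += u.total
        used += u.used
    if total == 0:
        raise ValueError("no capacity found under the given mount points")
    return used / total
```

On a storage node this might be called as cluster_fill(glob.glob("/srv/node/*")). Watching the per-device ratios as well as the overall figure is worthwhile, since one drive can fill up well before the cluster average looks alarming.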


Load balancer health checks against Swift proxy servers

• Most load balancers run ICMP checks against all IPs in their pools by default

• Also, consider configuring the load balancer to run TCP checks against Swift’s /healthcheck endpoint

Example:

demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
OK
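What a check against /healthcheck verifies can be sketched in Python: the proxy must answer with HTTP 200 and the body OK, as in the curl output on this slide. A load balancer would normally do this itself; the function names and the two-second timeout below are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(status, body):
    """A proxy counts as healthy only if it returns 200 with the body OK."""
    return status == 200 and body.strip() == b"OK"


def probe(url, timeout=2.0):
    """Fetch a /healthcheck URL; any HTTP error, refusal, or timeout means unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

Against the demo cluster this would be called as probe("http://swift.swiftstack.oss/healthcheck"). Unlike an ICMP check, it fails when the host is up but the proxy service is not answering requests.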


Audit trails with ELK


Object size distribution


Distribution of CRUD operations over time
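The aggregation behind a chart like this can be sketched in Python: bucket requests by hour and by CRUD class. The input is assumed to be already-parsed (timestamp, HTTP method) pairs; how those are extracted from the proxy logs is not shown. The method-to-CRUD mapping is a simplification (a PUT in Swift can also overwrite an existing object), and the names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed, simplified mapping from HTTP method to CRUD class.
CRUD = {
    "PUT": "create",
    "GET": "read",
    "HEAD": "read",
    "POST": "update",
    "DELETE": "delete",
}


def crud_per_hour(requests):
    """Count requests per (hour, CRUD class); other methods are ignored."""
    counts = Counter()
    for ts, method in requests:
        op = CRUD.get(method.upper())
        if op is None:
            continue
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, op)] += 1
    return counts
```

Each (hour, class) count corresponds to one point on a series in the chart.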


Zabbix triggers for Swift


Zabbix node memory usage


Zabbix drive utilization events


Disk I/O


Object Replicator Operations


Prometheus + Grafana trending and forecasting


Alerting

Example:

ALERT StorageCritical24Hours
  IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600))
    < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
  FOR 1h
  LABELS {
    group="storage_admin",
    severity="critical"
  }

Translation: Send a critical alert to all members of the storage_admin group if the total available storage capacity is projected to be less than 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour, recalculating every 5 minutes (per server config / not shown).
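The forecast behind this rule can be sketched in Python: fit a least-squares line through each device's recent free-space samples, extend it 24 hours ahead, and compare the summed projection with 20% of total capacity. This mirrors the idea of predict_linear rather than Prometheus's exact implementation; the function names and the input layout are hypothetical, and the FOR 1h hold-down is not modeled.

```python
def predict_linear(samples, horizon_s, now):
    """Least-squares line through (timestamp, value) samples, evaluated at now + horizon_s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return slope * (now + horizon_s) + intercept


def storage_critical(devices, now, horizon_s=24 * 3600, min_free_fraction=0.2):
    """True if projected free space across all devices drops below the given fraction.

    devices: list of dicts with 'size' (bytes) and 'free_samples' [(timestamp, free_bytes), ...].
    """
    projected_free = sum(predict_linear(d["free_samples"], horizon_s, now) for d in devices)
    total_size = sum(d["size"] for d in devices)
    return projected_free < total_size * min_free_fraction
```

Summing the per-device projections, as the rule does, alerts on the cluster as a whole; a single drive filling quickly would need its own per-device rule.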


Q&A/Demo


Thank you!
