ZMON: Monitoring Zalando's Engineering Platform
Uploaded by: zalando-technology | Category: Technology
![Page 1: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/1.jpg)
ZMON - Monitoring Our Platform
DevOps Meetup Dublin | September 3, 2015 | [email protected] | @JanMussler
![Page 2: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/2.jpg)
ONE of EUROPE’S LARGEST ONLINE FASHION RETAILERS
● 15 countries
● 3 fulfillment centers
● 16+ million active customers
● 2.2+ billion € revenue 2014
● 130+ million visits per month
● 8,000+ employees
Visit us: tech.zalando.com
![Page 3: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/3.jpg)
Zalando’s Technology History
![Page 4: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/4.jpg)
(Some!) Technologies We Use
![Page 5: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/5.jpg)
Monitoring Situation Until Late 2013
ICINGA plus custom frontend (ZMON 1)
Did not scale with growth:
● Our UI became too slow
● Too many systems to check
● The number of teams that wanted checks grew
● Every request had to go through a single team
![Page 6: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/6.jpg)
Goals of new ZMON development
● Improve performance and throughput
● Autonomy for individual teams
● Flexibility and extendability
● Integration into tooling (CMDB, DeployCtl …)
![Page 7: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/7.jpg)
The basic terminology ...
Entity: Anything you may want to monitor. Can be used as a "dimension".
Checks: Runnable Python snippet fetching data
Alert on Check: Python expression yielding true or false
![Page 8: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/8.jpg)
Zalando Tech - 24x7 team setup
[Diagram labels: Incident Team; Incident Team Alerts; Database Team Alerts; 2nd Level Database; "observes"; "Calls if help needed"; "Inheritance with custom thresholds"; "SMS / E-Mail"]
![Page 9: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/9.jpg)
Customizable ZMON dashboards
![Page 10: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/10.jpg)
Customizable ZMON dashboards
![Page 11: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/11.jpg)
Customizable ZMON dashboards
![Page 12: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/12.jpg)
Display historic data using Grafana
![Page 13: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/13.jpg)
ZMON’s core components
[Architecture diagram: Scheduler (JVM), Redis, Redis Slave, Workers (Python), Controller (Java), PostgreSQL, KairosDB (Java), Cassandra, Frontend (Angular), CLI (Python). Labels: "Queue/State", "Check/Alert definition", "Entity data"]
![Page 14: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/14.jpg)
Entities
● hosts, databases, applications, instances ...
● generic key value object
● 4000+ entities in our deployment
Entity "node01:8080":
{
  "id": "node01:8080",
  "type": "instance",
  "host": "node01",
  "ports": {"8080": 8080, "8181": 8181},
  "application_id": "zmon",
  "application_version": "0.1.0",
  "dc": "dc1"
}
![Page 15: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/15.jpg)
Database Entity
Entity "customer-live-slave":
{
  "id": "customer-live-slave",
  "type": "database",
  "role": "slave",
  "environment": "live",
  "shards": {
    "customer1": "customer1.db:5432/customer1",
    "customer2": "customer2.db:5432/customer2",
    "customer3": "customer3.db:5432/customer3",
    "customer4": "customer4.db:5432/customer4"
  }
}
![Page 16: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/16.jpg)
Entity Service
Integrated easy-to-use entity store with REST API
>zmon entities push local-postgres.yaml
local-postgres.yaml:
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"
![Page 17: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/17.jpg)
Checks
● select subset of entities
● executes Python expression
  ○ powerful using eval with custom context
  ○ Builtins: HTTP, PostgreSQL, MySQL, Cloudwatch, Redis, SNMP, tcp, SOAP, Scalyr...
● returns "value" object
  ○ Before long, every check returned "dicts"
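As a rough illustration of "eval with custom context", the sketch below evaluates a check command string against a namespace that holds the entity and one builtin. The `run_check` helper and the `fake_http` stub are hypothetical stand-ins for illustration, not ZMON's actual worker code.

```python
def fake_http(url):
    """Hypothetical stand-in for a builtin such as HTTP; returns canned data."""
    return {"url": url, "status": 200}


def run_check(command, entity):
    """Evaluate a check command string inside a custom namespace."""
    context = {
        "__builtins__": {},   # hide Python's default builtins from the command
        "entity": entity,     # the entity the check is executed against
        "http": fake_http,    # functions offered to check authors
        "len": len,
    }
    return eval(command, context)


entity = {"id": "node01:8080", "type": "instance", "host": "node01"}
value = run_check("http('http://' + entity['host'] + ':8080/health')", entity)
print(value)
```

Emptying `__builtins__` only narrows what a command can reach; it should not be read as a security sandbox.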
![Page 18: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/18.jpg)
Managing checks
REST API to update / auto-import from SCM
zmon check-definitions update select-1-check.yaml
select-1-check.yaml:
name: "Select 1"
owning_team: "Team 1"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgres
interval: 15
description: "test connection"
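The `entities` section of a check definition selects which entities the check runs against. A minimal sketch of that selection follows, assuming a filter matches when every one of its key/value pairs equals the entity's value; `matches` and `select_entities` are illustrative names, and the real matching rules may be richer.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter equals the entity's value."""
    return all(entity.get(key) == val for key, val in entity_filter.items())


def select_entities(entities, filters):
    """An entity is selected if it matches at least one filter."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


entities = [
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
    {"id": "node01:8080", "type": "instance", "host": "node01"},
]
check = {"name": "Select 1", "entities": [{"type": "postgres"}], "interval": 15}

selected = select_entities(entities, check["entities"])
print([e["id"] for e in selected])
```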
![Page 19: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/19.jpg)
![Page 20: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/20.jpg)
Alerts
● Executes using a check’s value, bound to single check
● Defines team and responsible team
● Allows inheritance from other alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities and tags
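To make the True/False model concrete, here is a small sketch that evaluates an alert condition against a check's value. The `evaluate_alert` helper and the `threshold` parameter are assumptions for illustration; they show how an inheriting alert might reuse a condition with its own threshold, not how ZMON implements it.

```python
def evaluate_alert(condition, value, parameters=None):
    """Evaluate a Python expression on a check value; the result is only True or False."""
    context = {"__builtins__": {}, "value": value}
    context.update(parameters or {})   # e.g. thresholds overridden by a child alert
    return bool(eval(condition, context))


value = {"load1": 5, "load5": 3, "load15": 2}
condition = "value['load1'] > threshold"

base = evaluate_alert(condition, value, {"threshold": 10})    # original alert
strict = evaluate_alert(condition, value, {"threshold": 4})   # stricter child alert
print(base, strict)
```

Because the result is coerced to a boolean, there is no room for an intermediate "WARNING" state; a team that wants an earlier signal would define a second alert with a lower threshold.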
![Page 21: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/21.jpg)
![Page 22: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/22.jpg)
![Page 23: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/23.jpg)
Trial Run - Quick feedback and download YAML
![Page 24: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/24.jpg)
Sharing and reuse of alerts and checks
● Anyone can add alerts to checks
● Alerts are owned by team
● Monitor application boundaries/dependencies
● Make use of inheritance to customize
![Page 25: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/25.jpg)
ZMON Core + UI + KairosDB
[Architecture diagram: Scheduler (JVM), Redis, Redis Slave, Workers (Python), Controller (Java), PostgreSQL, KairosDB (Java), Cassandra, Frontend (Angular), CLI (Python). Labels: "Queue/State", "Check/Alert definition", "Entity data"]
![Page 26: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/26.jpg)
Vagrant Box deploys Docker images
![Page 27: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/27.jpg)
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
![Page 28: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/28.jpg)
Extendability - Check and Alert functions
● Improve user experience through provided functions
![Page 29: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/29.jpg)
Extendability - Check and Alert functions
● Improve user experience through function wrappers
![Page 30: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/30.jpg)
The Microservices World
![Page 31: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/31.jpg)
Key Metrics for your service?
● Request rates
● Response rates by HTTP status code
● Latency
![Page 32: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/32.jpg)
Expose your data
{
  "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
  "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18076110580284566,
  "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.1518180485219247,
  "zmon.response.200.GET.checks.all-active-check-definitions.meanRate": 0.06792011610723951,
  "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512398137982051,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.75thPercentile": 1173,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.95thPercentile": 1233,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.98thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.999thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.max": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.mean": 1170,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.min": 1114,
  "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.stdDev": 42
}
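A check consuming such a flat metrics document will often want it grouped per endpoint. The sketch below assumes the key layout visible in the sample (`zmon.response.<status>.<method>.<path segments>.<metric>`, where a metric is either a single name or `snapshot.<name>`); `group_metrics` is a hypothetical helper, not part of any ZMON library.

```python
from collections import defaultdict


def group_metrics(flat):
    """Group flat 'zmon.response.*' keys by (method, path, status)."""
    grouped = defaultdict(dict)
    for key, val in flat.items():
        parts = key.split(".")
        if parts[:2] != ["zmon", "response"] or len(parts) < 6:
            continue  # not in the assumed layout
        status, method, rest = parts[2], parts[3], parts[4:]
        # Metric is the last element, or the last two for "snapshot.<name>".
        n = 2 if len(rest) >= 3 and rest[-2] == "snapshot" else 1
        path = "/".join(rest[:-n])
        metric = ".".join(rest[-n:])
        grouped[(method, path, int(status))][metric] = val
    return dict(grouped)


flat = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
    "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161,
}
stats = group_metrics(flat)[("GET", "checks/all-active-check-definitions", 200)]
print(stats["count"], stats["snapshot.99thPercentile"])
```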
![Page 33: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/33.jpg)
Start tracking your metrics
![Page 34: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/34.jpg)
Display application statistics
![Page 35: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/35.jpg)
Application metrics
![Page 36: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/36.jpg)
Continued ...
![Page 37: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/37.jpg)
Reuse of check
![Page 38: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/38.jpg)
Libraries available for
● Spring Boot: https://github.com/zalando/zmon-actuator
● Clojure: https://github.com/zalando-stups/friboo/
● Play (done, to be released)
![Page 39: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/39.jpg)
Multi DC / AWS
![Page 40: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/40.jpg)
ZMON in AWS Setup
[Diagram: *.foo.example.org (Team "Foo") and *.bar.example.org (Team "Bar"). Labels: EC2 Instance (several), ELB (two), ZMON Appliance (two), ZMON Data Service, KairosDB]
![Page 41: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/41.jpg)
Multi DC / Zone deployment possible
● Scheduler supports queue filters by entity
  ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
  ○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
  ○ Redis (we use this across DCs)
  ○ HTTPS (AWS->DC)
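The base filter and queue filters can be pictured as a small routing step in the scheduler. The following is a sketch under assumptions: the queue names and the `route` function are invented for illustration, and a filter is taken to match when all of its key/value pairs equal the entity's.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter equals the entity's value."""
    return all(entity.get(key) == val for key, val in entity_filter.items())


def route(entity, base_filter, queue_filters, default_queue="zmon:queue:default"):
    """Return the queue for an entity, or None if this scheduler ignores it."""
    if not matches(entity, base_filter):
        return None                      # outside this scheduler's responsibility
    for queue, entity_filter in queue_filters.items():
        if matches(entity, entity_filter):
            return queue                 # first matching queue filter wins
    return default_queue


# Hypothetical queue names, one per data center.
queue_filters = {
    "zmon:queue:dc1": {"dc": "dc1"},
    "zmon:queue:dc2": {"dc": "dc2"},
}

node = {"id": "node01:8080", "type": "instance", "dc": "dc1"}
print(route(node, {}, queue_filters))             # no base filter
print(route(node, {"dc": "dc2"}, queue_filters))  # scheduler restricted to dc2
```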
![Page 42: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/42.jpg)
ZMON AWS Agent
Uses Amazon API to fetch:
● ELBs
● EC2 instances
● RDS instances
Pushes enriched entities to entity service
![Page 43: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/43.jpg)
Prometheus?
read "text" result
![Page 44: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/44.jpg)
Kubernetes Example: Exports in Prometheus text format
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
![Page 45: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/45.jpg)
Yields a usable nested dictionary
{
  "list_images": {"0.9": "120252", "0.99": "120252", "0.5": "120252"},
  "version": {"0.9": "1281", "0.99": "2183", "0.5": "873"},
  "list_containers": {"0.9": "9740", "0.99": "23378", "0.5": "3717"},
  "inspect_container": {"0.9": "9602", "0.99": "18367", "0.5": "4419"}
}
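A nested dictionary of this shape can be derived from the Prometheus text lines with a few lines of parsing. The sketch below is a simplified parser written for this example (it does not handle escaped quotes or braces inside label values), and `parse_prometheus` is a hypothetical name rather than a ZMON builtin.

```python
import re

# metric_name{label="value",...} sample_value
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_prometheus(text, metric, outer, inner):
    """Build {outer_label_value: {inner_label_value: sample}} for one metric."""
    result = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m.group("name") != metric:
            continue  # comments, blank lines, other metrics
        labels = dict(LABEL.findall(m.group("labels")))
        if outer in labels and inner in labels:
            # Values are kept as strings, as in the dictionary above.
            result.setdefault(labels[outer], {})[labels[inner]] = m.group("value")
    return result


text = """
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
"""
nested = parse_prometheus(
    text, "kubelet_docker_operations_latency_microseconds", "operation_type", "quantile"
)
print(nested)
```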
![Page 46: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/46.jpg)
Internals
![Page 47: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/47.jpg)
ZMON’s basic data flow
Scheduler (JVM) -> Redis
{
  "check": {
    "id": 1,
    "entity": {"host": "monitor01"},
    "command": "snmp().load()",
    "alerts": [
      {"id": 100, "condition": "value['load1'] > 10"}
    ]
  }
}
![Page 48: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/48.jpg)
ZMON’s basic data flow
Worker (Python) -> Redis
-- store check result of "snmp().load()"
lpush zmon:check:1:monitor01 '{"load1":5,"load5":3,"load15":2}'
-- keep last 20 results (for dashboard charts)
ltrim zmon:check:1:monitor01 0 19
-- alert active?
sadd zmon:alert:100 monitor01
-- alert inactive?
srem zmon:alert:100 monitor01
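The worker side of this flow can be sketched end to end without a running Redis. In the example below, `FakeRedis` is a tiny in-memory stand-in that implements only the four commands used (method names chosen to mirror a typical Python Redis client), and `process` is a hypothetical helper; the key names follow the slide, everything else is an assumption.

```python
import json


class FakeRedis:
    """In-memory stand-in for the four Redis commands used below."""

    def __init__(self):
        self.lists, self.sets = {}, {}

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)

    def ltrim(self, key, start, stop):
        # Only non-negative indexes are supported in this sketch.
        self.lists[key] = self.lists.get(key, [])[start:stop + 1]

    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)

    def srem(self, key, member):
        self.sets.get(key, set()).discard(member)


def process(redis, task, value):
    """Store a check result, keep the last 20, and update alert membership."""
    check = task["check"]
    entity_id = check["entity"]["host"]
    key = "zmon:check:%d:%s" % (check["id"], entity_id)
    redis.lpush(key, json.dumps(value))
    redis.ltrim(key, 0, 19)
    for alert in check["alerts"]:
        context = {"__builtins__": {}, "value": value}
        alert_key = "zmon:alert:%d" % alert["id"]
        if bool(eval(alert["condition"], context)):
            redis.sadd(alert_key, entity_id)   # alert active for this entity
        else:
            redis.srem(alert_key, entity_id)   # alert inactive for this entity


task = {"check": {"id": 1, "entity": {"host": "monitor01"},
                  "command": "snmp().load()",
                  "alerts": [{"id": 100, "condition": "value['load1'] > 10"}]}}
redis = FakeRedis()
process(redis, task, {"load1": 5, "load5": 3, "load15": 2})
process(redis, task, {"load1": 12, "load5": 6, "load15": 3})
print(redis.sets["zmon:alert:100"])
```

The set per alert ID holds the entities for which the alert is currently active, so a dashboard only needs to read set membership rather than re-evaluate conditions.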
![Page 49: ZMON: Monitoring Zalando's Engineering Platform](https://reader036.fdocuments.in/reader036/viewer/2022081514/587f06711a28abc26f8b4e7d/html5/thumbnails/49.jpg)
ZMON Vagrant Box: https://github.com/zalando/zmon
ZMON Homepage: https://zalando.github.io/zmon
Zalando Tech: https://tech.zalando.com