Monitoring In Motion - events.static.linuxfound.org · Monitoring In Motion Challenges in...
Transcript of Monitoring In Motion - events.static.linuxfound.org · Monitoring In Motion Challenges in...
Monitoring In MotionChallenges in monitoring kubernetes, containers, and dynamic infrastructure.
ContainerCon TorontoAug 24, 2016
Ilan Rabinovitch Director, Technical Community Datadog
$ finger ilan@datadog
[datadoghq.com]Name: Ilan RabinovitchRole: Director, Technical CommunityInterests: * Monitoring and Metrics * Large scale web operations * FL/OSS Community Events
• SaaS based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting • We’re hiring! (www.datadoghq.com/careers/)
Datadog Overview
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...
Monitor Everything
$ cat ~/.plan
1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads
Our Focus Area
Culture
Automation
Metrics
Sharing
Damon Edwards and John Willis DevOps Day LA
Culture
“organizations which design systems ... are constrained to produce designs
which are copies of the communication structures of these organizations”
- Melvin E. Conway
Follow @honest_update on Twitter
Collecting data is cheap; not having it when you need it can be expensive
Instrument all the things!
SharingLooping Back on Culture
Describe the problem as your “enemy” not each other
Learn Together
Sharing
Using and Sharing the same metrics and measurements
across teams is key to avoiding misunderstandings.
Source: http://bit.ly/1SvvbuP
Source: http://bit.ly/1RQRsXW
Operational Complexity Increases with..
• Number of things to measure
• Velocity of change
https://www.datadoghq.com/docker-adoption/
How much we measure?1 instance
• 10 metrics from cloud providers1 operating system (e.g., Linux)
• 100 metrics50~ metrics per application
How much we measure?1 instance
• 10 metrics from cloud providers1 operating system (e.g., Linux)
• 100 metrics50~ metrics per application N containers
• 150*N metrics
Operational Complexity
100instances
500containers
Operational Complexity: Scale
160metrics per host
800metrics per host
Assuming 5 containers per host
Operational Complexity: Scale
100instances
80,000metrics
Assuming 5 containers per host
How much we measure?1 instance
• 10 metrics from cloud providers1 operating system (e.g., Linux)
• 100 metrics50~ metrics per application N containers
• 150*N metricsMetrics Overload!
Operational Complexity Increases with..
• Number of things to measure
• Velocity of change
Source: Datadog
Source: http://bit.ly/1qFylWK
Operational Complexity Increases with..
• Number of things to measure
• Velocity of change
Open Questions
• Where is my container running? • What is the capacity of my cluster? • What port is my app running on? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx error per container?
Source: http://bit.ly/1YxJ7Jy
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101
Finding Signal - Categorizing Your Metrics
Examples: NGINX - Metrics
Work Metrics:
• Requests Per Second • Request Time • Error Rates (4xx or 5xx) • Success (2xx)
Resource Metrics:
• Disk I/O • Memory • CPU • Queue Length
Examples: NGINX - Events
• Configuration Change • Code Deployment • Service Started / Stopped
Examples: Events
When to let a sleeping engineer lie?
When to alert?
Recurse until you find root cause
What to demand from our monitoring tooling?
Cryptic Alerts
WHAT?
EVERY ALERT MUST BE ACTIONABLE
Host Centric
Service Centric
Static configurations tracking dynamic infrastructure are not a recipe for success.
Static vs Dynamic
Query Based Monitoring“What’s the average throughput of application:nginx per version ?”
“Alert me when one of my pod from replication
controller:foo is not behaving like the others?”
“Show me rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2….”
Getting at the metrics…
Resource MetricsUtilization: • CPU (user + system) • memory • i/o • network traffic
Saturation • throttling • swap
Error • Network Errors
(receive vs transmit)
Container Events
• Starting / Stopping Containers • Scaling Events for Underlying Instances • Deploying a new container build
How do we get at the upper layers?
Getting at the Metrics
CPU METRICS MEMORY METRICS I/O METRICS NETWORK METRICS
pseudo-files Yes Yes Some Yes, in 1.6.1+
stats command Basic Basic No Basic
API Yes Yes Some Yes
Pseudo-files
• Provide visibility into container metrics via the file system. • Generally under: /cgroup/<resource>/docker/$CONTAINER_ID/ or /sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/
Pseudo-files: CPU Metrics$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat > user 2451 # time spent running processes since boot > system 966 # time spent executing system calls since boot
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat > nr_periods 565 # Number of enforcement intervals that have elapsed > nr_throttled 559 # Number of times the group has been throttled > throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)
Pseudo-files: CPU Throttling
Docker API• Detailed streaming metrics as JSON HTTP socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
STATS Command
# Usage: docker stats CONTAINER [CONTAINER...] $ docker stats $CONTAINER_ID CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O ecb37227ac84 0.12% 71.53 MiB/490 MiB 14.60% 900.2 MB/275.5 MB 266.8 MB/872.7 MB
Side Car Containers
Aren’t we still missing a layer?
Open Questions
• What is the capacity of my cluster? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx error per container? • Where is my container running? what port?
Service Discovery
Docker API Orchestrator
Monitoring Agent Container
A O A O
Containers List & Metadata
Additional Metadata (Tags, etc)
Config Backend Integration Configurations
Host Level Metrics
Custom Metrics
• Instrument custom applications
• You know your key transactions best.
• Use async protocols like Etys’ STATSD or DogstatsD
Source: http://bit.ly/1NoW6aj
ResourcesMonitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption https://www.datadoghq.com/docker-adoption/