Monitoring In Motion - events.static.linuxfound.org · Monitoring In Motion Challenges in...

Monitoring In MotionChallenges in monitoring kubernetes, containers, and dynamic infrastructure.

ContainerCon TorontoAug 24, 2016

Ilan Rabinovitch Director, Technical Community Datadog

$ finger ilan@datadog

[datadoghq.com]Name: Ilan RabinovitchRole: Director, Technical CommunityInterests: * Monitoring and Metrics * Large scale web operations * FL/OSS Community Events

• SaaS based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting • We’re hiring! (www.datadoghq.com/careers/)

Datadog Overview

http://www.datadoghq.com/careers/

Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

Monitor Everything

$ cat ~/.plan

1. Intro: The Importance of Monitoring

2. The Challenge: Monitoring Dynamic Infrastructure

3. Finding the Signal: How do we know what to monitor?

4. Implementation: Applying it to Containerized Workloads

Our Focus Area

Culture

Automation

Metrics

Sharing

Damon Edwards and John Willis DevOps Day LA

Culture

“organizations which design systems ... are constrained to produce designs

which are copies of the communication structures of these organizations”

- Melvin E. Conway

Follow @honest_update on Twitter

Collecting data is cheap; not having it when you need it can be expensive

Instrument all the things!

SharingLooping Back on Culture

Describe the problem as your “enemy” not each other

Learn Together

Sharing

Using and Sharing the same metrics and measurements

across teams is key to avoiding misunderstandings.

Source: http://bit.ly/1SvvbuP

http://bit.ly/1SvvbuP

Source: http://bit.ly/1RQRsXW

http://bit.ly/1RQRsXW

Operational Complexity Increases with..

• Number of things to measure

• Velocity of change

https://www.datadoghq.com/docker-adoption/

How much we measure?1 instance

• 10 metrics from cloud providers1 operating system (e.g., Linux)

• 100 metrics50~ metrics per application



• 100 metrics50~ metrics per application N containers

• 150*N metrics

Operational Complexity

100instances

500containers

Operational Complexity: Scale

160metrics per host

800metrics per host

Assuming 5 containers per host

Operational Complexity: Scale

100instances

80,000metrics

Assuming 5 containers per host



• 100 metrics50~ metrics per application N containers

• 150*N metricsMetrics Overload!

Source: Datadog

Source: http://bit.ly/1qFylWK

http://bit.ly/1qFylWK

Open Questions

• Where is my container running? • What is the capacity of my cluster? • What port is my app running on? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx error per container?

Source: http://bit.ly/1YxJ7Jy

http://bit.ly/1YxJ7Jy

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101

Finding Signal - Categorizing Your Metrics

Examples: NGINX - Metrics

Work Metrics:

• Requests Per Second • Request Time • Error Rates (4xx or 5xx) • Success (2xx)

Resource Metrics:

• Disk I/O • Memory • CPU • Queue Length

Examples: NGINX - Events

• Configuration Change • Code Deployment • Service Started / Stopped

Examples: Events

When to let a sleeping engineer lie?

When to alert?

Recurse until you find root cause

What to demand from our monitoring tooling?

Cryptic Alerts

WHAT?

EVERY ALERT MUST BE ACTIONABLE

Host Centric

Service Centric

Static configurations tracking dynamic infrastructure are not a recipe for success.

Static vs Dynamic

Query Based Monitoring“What’s the average throughput of application:nginx per version ?”

“Alert me when one of my pod from replication

controller:foo is not behaving like the others?”

“Show me rate of HTTP 500 responses from nginx”

“… across all data centers”

“… running my app version 2….”

Getting at the metrics…

Resource MetricsUtilization: • CPU (user + system) • memory • i/o • network traffic

Saturation • throttling • swap

Error • Network Errors

(receive vs transmit)

Container Events

• Starting / Stopping Containers • Scaling Events for Underlying Instances • Deploying a new container build

How do we get at the upper layers?

Getting at the Metrics

CPU METRICS MEMORY METRICS I/O METRICS NETWORK METRICS

pseudo-files Yes Yes Some Yes, in 1.6.1+

stats command Basic Basic No Basic

API Yes Yes Some Yes

Pseudo-files

• Provide visibility into container metrics via the file system. • Generally under: /cgroup/<resource>/docker/$CONTAINER_ID/ or /sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/

Pseudo-files: CPU Metrics$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat > user 2451 # time spent running processes since boot > system 966 # time spent executing system calls since boot

$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat > nr_periods 565 # Number of enforcement intervals that have elapsed > nr_throttled 559 # Number of times the group has been throttled > throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)

Pseudo-files: CPU Throttling

Docker API• Detailed streaming metrics as JSON HTTP socket

$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats

http://localhost/containers/28d7a95f468e/stats

STATS Command

# Usage: docker stats CONTAINER [CONTAINER...] $ docker stats $CONTAINER_ID CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O ecb37227ac84 0.12% 71.53 MiB/490 MiB 14.60% 900.2 MB/275.5 MB 266.8 MB/872.7 MB

Side Car Containers

Aren’t we still missing a layer?

Open Questions

• What is the capacity of my cluster? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx error per container? • Where is my container running? what port?

Service Discovery

Docker API Orchestrator

Monitoring Agent Container

A O A O

Containers List & Metadata

Additional Metadata (Tags, etc)

Config Backend Integration Configurations

Host Level Metrics

Custom Metrics

• Instrument custom applications

• You know your key transactions best.

• Use async protocols like Etys’ STATSD or DogstatsD

Source: http://bit.ly/1NoW6aj

http://bit.ly/1NoW6aj

ResourcesMonitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/

Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/

The Power of Tagged Metrics https://www.datadoghq.com/blog/the-docker-monitoring-problem/

How to Collect Docker Metrics https://www.datadoghq.com/blog/how-to-collect-docker-metrics/

8 surprising facts about Docker Adoption https://www.datadoghq.com/docker-adoption/

Monitoring In Motion - events.static.linuxfound.org · Monitoring In Motion Challenges in...

Documents

Transcript of Monitoring In Motion - events.static.linuxfound.org · Monitoring In Motion Challenges in...