@snehainguva · 15 kubernetes clusters 12 data centers 300+ production applications....
@snehainguva
prometheus everything, observing kubernetes in the cloud
digitalocean.com
about me
software engineer @DigitalOcean
former delivery, currently observability
kubernetes, prometheus
Some stats
15 kubernetes clusters
12 data centers
300+ production applications
2 promethei + 1 alertmanager per cluster
1.5 million+ timeseries
99218 samples/sec (note: data-center-wide scraping is at 550k samples/sec)
the plan:
● the pre-kubernetes days
● kubernetes at DigitalOcean (aka docc)
● prometheus + alertmanager and kubernetes
● alerting in action: examples
● potential pitfalls
● next steps
pre-kubernetes:
service owners write an application
provision a server with chef or ansible
use a CI/CD pipeline, bash scripts, or other tools to deploy and update the application on a VM
pre-kubernetes:
use nagios + various plugins to monitor
use collectd + application metrics + statsd + graphite
push data to openTSDB
pre-kubernetes:
took longer to provision a host than to write the actual service
blackbox monitoring NOT insightful
whitebox monitoring services NOT easily queryable
docc: DigitalOcean Command Center
A tool for deploying containerized, stateless applications
What is kubernetes?
Container orchestration tool from Google
What is docc?
An abstraction layer on top of kubernetes
[diagram: CLI, doccserver, deployment → pods, service]
post-docc:
service owners write an application
service owner dockerizes application
describe application in json manifest file
deploy!
post-docc:
deployments and updates take minutes, not hours
view running applications
get application logs
easily scale, update, or restart applications
But what about monitoring?
Let’s use prometheus + alertmanager
[diagram: docc, deployment → pods, service, promconfig, alertconfig, alertmanager]
1. instrument your application
use prometheus golang client
expose metrics endpoint
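The slide names the Prometheus Go client. As a rough, language-neutral illustration of what "expose metrics endpoint" produces, here is a stdlib-only Python sketch that renders a counter in the Prometheus text exposition format. The `Counter` class and the `http_requests_total` metric are hypothetical and are not the real client API.

```python
class Counter:
    """A cumulative metric that only goes up (hypothetical helper, not a real client API)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = {}  # maps a sorted tuple of (label, value) pairs to a float

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0.0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{selector} {value}")
        return "\n".join(lines) + "\n"


# Assumed example metric: count handled requests by status code.
requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(code="200")
requests_total.inc(code="200")
requests_total.inc(code="500")
exposition = requests_total.render()
```

A real service would serve this text over HTTP on its metrics path. The sketch skips label-value escaping, which a real client library handles.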
2. specify metrics, ports, alerts in your manifest file
Which metrics endpoint should be scraped?
Which container port needs to be exposed?
Specify alerting rule, duration interval, and channel.
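The slides do not show the manifest schema, so the following is only a guess at its shape. Every field name here (`metrics`, `containers`, `alerts`, `rule`, `duration`, `channel`) is hypothetical; the sketch just checks that the three questions above have answers.

```python
import json

# Hypothetical manifest: field names are illustrative, not the real docc schema.
MANIFEST_JSON = """
{
  "name": "example-service",
  "containers": [{"image": "example/service:1.0", "port": 8080}],
  "metrics": {"path": "/metrics", "port": 8080},
  "alerts": [
    {
      "name": "ExampleHighErrorRate",
      "rule": "sum(rate(http_requests_total{code=\\"500\\"}[5m])) > 1",
      "duration": "5m",
      "channel": "#example-team-alerts"
    }
  ]
}
"""


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest is usable."""
    problems = []
    metrics = manifest.get("metrics", {})
    if "path" not in metrics:
        problems.append("metrics.path is required")
    exposed_ports = {c.get("port") for c in manifest.get("containers", [])}
    if metrics.get("port") not in exposed_ports:
        problems.append("metrics.port must be an exposed container port")
    for alert in manifest.get("alerts", []):
        for field in ("name", "rule", "duration", "channel"):
            if field not in alert:
                problems.append(f"alert is missing {field}")
    return problems


manifest = json.loads(MANIFEST_JSON)
problems = validate_manifest(manifest)
```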
3. use docc CLI to deploy your application
$ docc deploy manifest.json
annotations contain rules and receiver info
[diagram: CLI, doccserver, deployment → pods, service]
4. prometheus talks to the kubernetes api and grabs the metrics endpoint and port information
[diagram: promconfig, service]
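A minimal sketch of that discovery step, assuming pods opt in through the commonly used `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` annotations. The slides do not say which annotation keys docc writes, and the `pod` dictionary below is a simplified stand-in for a Kubernetes API object. In Prometheus itself this logic lives in `kubernetes_sd_configs` plus relabeling rules.

```python
def scrape_target(pod):
    """Return 'ip:port/path' for a pod that opts in to scraping, else None."""
    annotations = pod.get("annotations", {})
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Defaults are assumptions for this sketch.
    port = annotations.get("prometheus.io/port", "80")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"{pod['ip']}:{port}{path}"


annotated_pod = {
    "ip": "10.0.0.5",
    "annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "9102"},
}
plain_pod = {"ip": "10.0.0.6", "annotations": {}}

targets = [t for t in map(scrape_target, [annotated_pod, plain_pod]) if t]
```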
5. promconfig grabs alert information and rewrites the prometheus rules file
[diagram: promconfig, service]
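How promconfig renders rules is not shown in the slides. As a sketch only, the function below emits one rule in the Prometheus 1.x `ALERT ... IF ...` syntax that appears later in the deck. The helper name `render_rule`, the alert name, and the label values are assumptions, and quoting inside label values is not escaped.

```python
def render_rule(name, expr, duration, labels, annotations):
    """Render one alerting rule in Prometheus 1.x rule-file syntax."""

    def label_set(pairs):
        body = ", ".join(f'{k} = "{v}"' for k, v in sorted(pairs.items()))
        return "{ " + body + " }"

    return "\n".join([
        f"ALERT {name}",
        f"  IF {expr}",
        f"  FOR {duration}",
        f"  LABELS {label_set(labels)}",
        f"  ANNOTATIONS {label_set(annotations)}",
    ])


rule = render_rule(
    name="DoccserverDown",  # hypothetical alert name
    expr='absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0',
    duration="5m",
    labels={"severity": "page", "team": "delivery"},
    annotations={"summary": "doccserver is down or not scrapeable"},
)
```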
6. alertconfig grabs alert routes and rewrites the alertmanager configuration file
[diagram: service, alertmanager, alertconfig]
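A rough sketch of what such a rewrite could produce, assuming alerts carry a `team` label and each team has one receiver. The `route`, `routes`, `match`, and `receivers` keys are Alertmanager configuration fields (newer releases prefer `matchers` over `match`). The team and receiver names are made up, and real receivers would also need Slack or PagerDuty settings.

```python
import json


def build_alertmanager_config(team_receivers, default_receiver):
    """Build a routing tree that sends each team's alerts to its own receiver."""
    routes = [
        {"match": {"team": team}, "receiver": receiver}
        for team, receiver in sorted(team_receivers.items())
    ]
    names = sorted({default_receiver, *team_receivers.values()})
    return {
        "route": {"receiver": default_receiver, "routes": routes},
        "receivers": [{"name": name} for name in names],
    }


config = build_alertmanager_config(
    {"delivery": "delivery-slack", "observability": "observability-pagerduty"},
    default_receiver="default-slack",
)
# JSON is also valid YAML, so this text could be written out as a config file.
config_text = json.dumps(config, indent=2)
```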
What should we monitor?
4 Golden Signals
latency
traffic
error
saturation
request-based system metrics
Request
Errors
Duration
Brendan Gregg’s USE-ful metrics
Utilization
Saturation
Error
“Solves 80% of server issues with 5% of the effort.”
prom metrics types
counters: cumulative metric that only increases
gauges: single metric that goes up or down
histograms: sample observations and count them in buckets
summaries: sample observations and report specified quantiles
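To make the histogram type concrete, here is a small stdlib-only sketch of bucketed observations. The `Histogram` class is hypothetical and not a client library API. The part it illustrates is that Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its upper bound.

```python
import bisect


class Histogram:
    """Counts observations into buckets defined by upper bounds (illustrative only)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per bound plus a final slot for the implicit +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value (the "le" bucket).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (le, count) pairs where each count includes all smaller buckets."""
        result, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
buckets = latency.cumulative()
```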
Putting it all together...
service metric: traffic
how much demand is placed on the system
loadbalancer backend traffic
sum(rate(haproxy_backend_bytes_out_total{
  kubernetes_name="loadbalancer",
  backend="tls_default_neptune_nyc3_internal_digitalocean_com"}[1m])) BY (backend)
fxn: rate() and sum()
metric type: counter
labels: kubernetes_name, backend
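What `rate()` does to a counter can be approximated in a few lines. This is a simplified sketch: it computes the per-second increase over a window of `(timestamp, value)` samples and treats a drop as a counter reset. The real PromQL `rate()` also extrapolates to the window boundaries, which is left out here.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) counter samples, tolerating resets."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value in full.
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


steady = simple_rate([(0, 0), (30, 60), (60, 120)])
with_reset = simple_rate([(0, 100), (30, 160), (60, 20)])
```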
cluster metric: utilization
average time resource is busy servicing work
cluster CPU utilization
(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m]))
  / sum(machine_cpu_cores))
fxn: sum() and rate()
metric type: counter
How should we alert?
Threshold alerts
Do any of the aforementioned metrics fall below a lower bound or exceed an upper bound?
Threshold alerts
Are more than 80% of cluster CPU cores being utilized?
(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores)) * 100 > 80
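The arithmetic behind that expression, as a small sketch with made-up numbers: each entry in `usage` stands for one node's `rate(container_cpu_usage_seconds_total{id="/"}[5m])`, which is CPU-seconds used per second, and each entry in `cores` for that node's `machine_cpu_cores`.

```python
def cluster_cpu_utilization_percent(cpu_seconds_per_second, cores_per_node):
    """Share of all cluster cores that are busy, as a percentage."""
    return sum(cpu_seconds_per_second) / sum(cores_per_node) * 100


# Illustrative values only: three 4-core nodes.
cores = [4, 4, 4]
quiet = cluster_cpu_utilization_percent([2.5, 3.0, 3.5], cores)
busy = cluster_cpu_utilization_percent([3.5, 3.5, 3.0], cores)

quiet_alert = quiet > 80
busy_alert = busy > 80
```

With 9 of 12 cores busy the cluster sits at 75% and stays quiet; at 10 of 12 it crosses the 80% threshold.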
State-based alerts
Is there a divergence between expected state and actual state of a service?
State-based alerts
Is my service up and/or scrape-able?
absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0
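The two halves of that expression cover different failures, which a short sketch makes explicit. `up_values` is an assumed list of the current `up` samples for the service: an empty list corresponds to `absent(...)` (no targets discovered at all), and an all-zero list corresponds to `sum(...) == 0` (targets exist but every scrape fails).

```python
def service_down(up_values):
    """Mirror `absent(up{...}) or sum(up{...}) == 0` for a list of up samples."""
    no_series = len(up_values) == 0          # absent(): nothing to scrape
    all_failing = sum(up_values) == 0        # every instance reports up == 0
    return no_series or all_failing


never_discovered = service_down([])
all_instances_failing = service_down([0, 0])
one_instance_healthy = service_down([1, 0])
```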
Common pitfalls
Pitfall #1: Alerting fatigue
Solution: Slack and/or PagerDuty
send only the most urgent production alerts to PagerDuty
try out different PromQL queries to get less spiky metrics
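One way a query becomes less spiky is by averaging over a wider window, for example with a longer range in `rate()` or with `avg_over_time()`. The slide does not say which queries were tried, so the sketch below only illustrates the effect with made-up utilization values and a trailing average.

```python
def trailing_average(values, window):
    """Average of each value with up to window - 1 values before it."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def indices_above(values, threshold):
    """Evaluation points at which a threshold alert would fire."""
    return [i for i, v in enumerate(values) if v > threshold]


# Illustrative values only: one short spike, then a sustained rise.
cpu_percent = [50, 95, 60, 85, 90, 92, 70]
raw_alerts = indices_above(cpu_percent, 80)
smoothed_alerts = indices_above(trailing_average(cpu_percent, 3), 80)
```

The raw series fires on the isolated spike at index 1, while the smoothed series fires only once the load has stayed high. The price is that it reacts later and keeps firing briefly after the load drops.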
Pitfall #2: Who owns what?
Solution: opinionated manifest file
service owners must include maintainer information
alerts themselves include descriptions and summaries with several labels
alerts must include team-specific receivers
Pitfall #3: Meta-monitoring
Solution: Duplicate promethei and HA alertmanager
[diagram: three alertmanager instances]
Solution: Deadman’s switch
ALERT JustKeepSwimming IF vector(1)
elastalert
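`vector(1)` is always true, so the `JustKeepSwimming` alert fires continuously and works as a heartbeat: if it stops arriving, the monitoring pipeline itself is broken. The slides name elastalert as the watcher but do not show how it is wired, so the sketch below is only the generic receiving-side logic. The `DeadMansSwitch` class and its timeout are hypothetical.

```python
class DeadMansSwitch:
    """Trips when no heartbeat has been seen within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record that the always-firing alert was received at time `now`."""
        self.last_heartbeat = now

    def is_tripped(self, now):
        # Treat "never heard from" as tripped so a dead pipeline is noticed at startup.
        if self.last_heartbeat is None:
            return True
        return now - self.last_heartbeat > self.timeout_seconds


switch = DeadMansSwitch(timeout_seconds=120)
tripped_before_first_heartbeat = switch.is_tripped(now=0)
switch.heartbeat(now=10)
tripped_while_healthy = switch.is_tripped(now=100)
tripped_after_silence = switch.is_tripped(now=200)
```

Times are passed in explicitly to keep the sketch deterministic; a real watcher would read a clock.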
#1: Automated alerts
utilize user-defined memory and cpu limits for threshold alerts
automatic state-based alerts
#2: Leverage metrics for autopilot
users trust our custom controllers and schedulers
collect metrics and build a model of resource usage over time
adjust limits and alerts accordingly
#3: Leverage metrics for autoscaling
services based on resource usage, # of connections, etc.
loadbalancers based on # of frontend and backend connections
# of worker nodes based on memory and cpu capacity metrics
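The slides list autoscaling as a next step without saying how it would be computed. One possibility, shown here only as an illustration, is the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The metric values below are made up.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Replica count that would bring the per-replica metric back to its target."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Illustrative values: average connections per replica against a target of 60.
scale_up = desired_replicas(4, current_metric=90, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
round_up = desired_replicas(3, current_metric=100, target_metric=80)
```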
a brave new world of container orchestration
prometheus + alertmanager are awesome!
extensibility
thanks!
@snehainguva
● The best prometheus tutorials you will ever read, Julius Volz
● Actual Prometheus Website
● Kubernetes Project