
Migrating from Nagios to Prometheus

NOV 07, 2019

Runtastic Infrastructure

● Technologies
○ Linux (Ubuntu)
○ SDN (Cisco)
○ Chef
○ Terraform

● Base
○ Linux KVM
○ OpenNebula
○ 3600 CPU Cores
○ 20 TB Memory
○ 100 TB Storage

● Virtualization
○ Physical
○ Hybrid

● Big Core DBs
● Really a lot of open source

Our Monitoring back in 2017...

● Nagios
○ Many Checks for all Servers
○ Checks for NewRelic

● Pingdom
○ External HTTP Checks
○ Specific Nagios Alerts
○ Alerting via SMS

● NewRelic
○ Error Rate
○ Response Time

Configuration hell…


Alert overflow...


Goals for our new Monitoring system

● Make On Call as comfortable as possible
● Automate as much as possible
● Make use of graphs
● Rework our alerting
● Make it scalable!

Starting with Prometheus...

Prometheus


Our Prometheus Setup

● 2x Bare Metal
● 8 Core CPU
● Ubuntu Linux
● 7.5 TB of Storage
● 7 months of Retention time
● Internal TSDB

Automation

Our Goals for Automation

● Roll out Exporters on new servers automatically
○ using Chef

● Use Service Discovery in Prometheus
○ using Consul

● Add HTTP Healthcheck for a new Microservice
○ using Terraform

● Add Silences with 30d duration
○ using Terraform

Consul

● Consul for our Terraform State
● Agent Rollout via Chef
● One Service definition per Exporter on each Server


What Labels do we need?


● What's the Load of all workers of our Newsfeed service?
○ node_load1{service="newsfeed", role="workers"}

● What's the Load of a specific Leaderboard server?
○ node_load1{hostname="prd-leaderboard-server-001"}

...and how we implemented them in Consul

{
  "service": {
    "name": "prd-sharing-server-001-mongodbexporter",
    "tags": [
      "prometheus",
      "role:trinidad",
      "service:sharing",
      "exporter:mongodb"
    ],
    "port": 9216
  }
}

Scrape Configuration

- job_name: prd
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,prometheus,.*
      action: keep
    - source_labels: [__meta_consul_node]
      target_label: hostname
    - source_labels: [__meta_consul_tags]
      regex: .*,service:([^,]+),.*
      replacement: '${1}'
      target_label: service
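Prometheus exposes a node's Consul tags as one comma-joined string with a leading and trailing comma, and relabel regexes must match the whole string, which is why the patterns above are written as `.*,…,.*`. A minimal Python sketch of how the `service` label gets extracted (the tag string comes from the Consul example earlier; `extract_service` is a hypothetical helper, not part of Prometheus):

```python
import re

# __meta_consul_tags as Prometheus presents it: tags joined by commas,
# with a leading and trailing comma.
meta_consul_tags = ",prometheus,role:trinidad,service:sharing,exporter:mongodb,"

# Relabel regexes are anchored: they must match the WHOLE string,
# like re.fullmatch in Python.
service_re = re.compile(r".*,service:([^,]+),.*")

def extract_service(tags: str):
    """Return the service tag's value, or None if the regex does not match."""
    m = service_re.fullmatch(tags)
    return m.group(1) if m else None

print(extract_service(meta_consul_tags))  # sharing
```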

External Health Checks

● 3x Blackbox Exporters
● Accessing SSL Endpoints
● Checks for
○ HTTP Response Code
○ SSL Certificate
○ Duration

Add Healthcheck via Terraform

resource "consul_service" "health_check" {
  name = "${var.srv_name}-healthcheck"
  node = "blackbox_aws"

  tags = [
    "healthcheck",
    "url:https://status.runtastic.com/${var.srv_name}",
    "service:${var.srv_name}",
  ]
}

Job Config for Blackbox Exporters

- job_name: blackbox_aws
  metrics_path: /probe
  params:
    module: [http_health_monitor]
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,healthcheck,.*
      action: keep
    - source_labels: [__meta_consul_tags]
      regex: .*,url:([^,]+),.*
      replacement: '${1}'
      target_label: __param_target
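Relabeling a value into `__param_target` makes Prometheus append it as the `target=` query parameter when it scrapes the blackbox exporter's `/probe` endpoint. A sketch of the resulting scrape URL (the exporter hostname and the concrete service URL are assumptions; 9115 is the blackbox exporter's default port):

```python
from urllib.parse import urlencode

def probe_scrape_url(exporter: str, module: str, target: str) -> str:
    """Build the URL Prometheus ends up scraping on a blackbox exporter:
    __param_target becomes the ?target= query parameter."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"

url = probe_scrape_url(
    "blackbox-aws:9115",                      # hypothetical exporter host, default port
    "http_health_monitor",
    "https://status.runtastic.com/newsfeed",  # hypothetical service healthcheck URL
)
print(url)
```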

Add Silence via Terraform

resource "null_resource" "prometheus_silence" {
  provisioner "local-exec" {
    command = <<EOF
${var.amtool_path} silence add 'service=~SERVICENAME' \
  --duration='30d' \
  --comment='Silence for the newly deployed service' \
  --alertmanager.url='http://prd-alertmanager:9093'
EOF
  }
}
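Under the hood, `amtool silence add` posts a silence object to Alertmanager's v2 API (`POST /api/v2/silences`). A sketch of the payload the command above effectively creates; the `createdBy` value and the concrete service regex are assumptions for illustration:

```python
import json
from datetime import datetime, timedelta, timezone

def silence_payload(service_regex: str, duration_days: int, comment: str) -> dict:
    """Build the JSON body for Alertmanager's /api/v2/silences endpoint
    (a sketch of what amtool does, not amtool itself)."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # service=~<regex>  ->  a regex matcher on the `service` label
            {"name": "service", "value": service_regex, "isRegex": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(days=duration_days)).isoformat(),
        "comment": comment,
        "createdBy": "terraform",  # arbitrary identifier, an assumption
    }

payload = silence_payload("newsfeed", 30, "Silence for the newly deployed service")
print(json.dumps(payload, indent=2))
```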

OpsGenie

Our Initial Alerting Plan

● Alerts with Low Priority
○ Slack Integration

● Alerts with High Priority (OnCall)
○ Slack Integration
○ OpsGenie

...why not forward all Alerts to OpsGenie?


Define OpsGenie Alert Routing

● Prometheus OnCall Integration
○ High Priority Alerts (e.g. Service DOWN)
○ Call the poor On Call Person
○ Post Alerts to Slack #topic-alerts

● Prometheus Ops Integration
○ Low Priority Alerts (e.g. failed Chef client runs)
○ Disable Notifications
○ Post Alerts to Slack #prometheus-alerts

Setup Alertmanager Config

- receiver: 'opsgenie_oncall'
  group_wait: 10s
  group_by: ['...']
  match:
    oncall: 'true'

- receiver: 'opsgenie'
  group_by: ['...']
  group_wait: 10s

...and its receivers

- name: "opsgenie_oncall"
  opsgenie_configs:
    - api_url: "https://api.eu.opsgenie.com/"
      api_key: "ourapitoken"
      priority: "{{ range .Alerts }}{{ .Labels.priority }}{{ end }}"
      message: "{{ range .Alerts }}{{ .Annotations.title }}{{ end }}"
      description: "{{ range .Alerts }}\n{{ .Annotations.summary }}\n\n{{ if ne .Annotations.dashboard \"\" -}}\nDashboard:\n{{ .Annotations.dashboard }}\n{{- end }}{{ end }}"
      tags: "{{ range .Alerts }}{{ .Annotations.instance }}{{ end }}"

Why we use group_by: ['...']

● Alert deduplication is handled by OpsGenie
● Alerts are grouped in OpsGenie instead of Alertmanager
● Keeps an overview of all alerts

Example Alerting Rule for On Call

- alert: HTTPProbeFailedMajor
  expr: max by(instance, service)(probe_success) < 1
  for: 1m
  labels:
    oncall: "true"
    priority: "P1"
  annotations:
    title: "{{ $labels.service }} DOWN"
    summary: "HTTP Probe for {{ $labels.service }} FAILED.\nHealth Check URL: {{ $labels.instance }}"
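With three blackbox exporters probing each URL, `max by(instance, service)(probe_success) < 1` only fires when every exporter's probe fails: as long as one probe succeeds, the max is 1. A Python sketch of that aggregation (series labels and values are illustrative, and the `exporter` label is an assumption for distinguishing the three probers):

```python
from collections import defaultdict

# Illustrative probe_success samples from three blackbox exporters.
samples = [
    ({"instance": "https://status.runtastic.com/newsfeed", "service": "newsfeed", "exporter": "bb-1"}, 0),
    ({"instance": "https://status.runtastic.com/newsfeed", "service": "newsfeed", "exporter": "bb-2"}, 0),
    ({"instance": "https://status.runtastic.com/newsfeed", "service": "newsfeed", "exporter": "bb-3"}, 0),
    ({"instance": "https://status.runtastic.com/sharing", "service": "sharing", "exporter": "bb-1"}, 0),
    ({"instance": "https://status.runtastic.com/sharing", "service": "sharing", "exporter": "bb-2"}, 1),
]

def max_by(samples, by):
    """Mimic PromQL `max by(<labels>)(...)`: group on the listed labels, take the max."""
    groups = defaultdict(list)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        groups[key].append(value)
    return {key: max(values) for key, values in groups.items()}

# Only newsfeed fires: all of its probes failed, while one sharing probe
# still succeeded (max == 1, so `< 1` is false).
firing = {key for key, v in max_by(samples, ["instance", "service"]).items() if v < 1}
print(firing)
```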

Example Alerting Rule with Low Priority

- alert: MongoDB-ScannedObjects
  expr: max by(hostname, service)(rate(mongodb_mongod_metrics_query_executor_total[30m])) > 500000
  for: 1m
  labels:
    priority: "P3"
  annotations:
    title: "MongoDB - Scanned Objects detected on {{ $labels.service }}"
    summary: "High value of scanned objects on {{ $labels.hostname }} for service {{ $labels.service }}"
    dashboard: "https://prd-prometheus.runtastic.com/d/oCziI1Wmk/mongodb"
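`rate(...[30m])` is the per-second increase of the counter over the last 30 minutes, so the `> 500000` threshold means "more than 500k scanned objects per second, sustained". A simplified sketch of that calculation, ignoring the counter resets and extrapolation that real PromQL handles (the sample numbers are invented):

```python
def simple_rate(samples):
    """Approximate PromQL rate(): per-second increase between the first and
    last counter sample in the window. Ignores resets and extrapolation."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Invented counter samples: (timestamp_seconds, cumulative scanned objects)
# spanning a 30-minute (1800 s) window.
samples = [(0, 0), (900, 500_000_000), (1800, 1_000_000_000)]
per_second = simple_rate(samples)
print(per_second > 500_000)  # compare against the alert threshold above
```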

Alert Management via Slack


Setting up the Heartbeat

groups:
- name: opsgenie.rules
  rules:
    - alert: OpsGenieHeartBeat
      expr: vector(1)
      for: 5m
      labels:
        heartbeat: "true"
      annotations:
        summary: "Heartbeat for OpsGenie"

...and its Alertmanager Configuration

- receiver: 'opsgenie_heartbeat'
  repeat_interval: 5m
  group_wait: 10s
  match:
    heartbeat: 'true'

- name: "opsgenie_heartbeat"
  webhook_configs:
    - url: 'https://api.eu.opsgenie.com/v2/heartbeats/prd_prometheus/ping'
      send_resolved: false
      http_config:
        basic_auth:
          password: "opsgenieAPIkey"
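Because `vector(1)` always fires, Alertmanager pings the OpsGenie heartbeat every `repeat_interval`; OpsGenie alerts when the pings stop, turning the whole Prometheus-to-OpsGenie pipeline into a dead man's switch. A sketch of the receiving side's logic (the grace factor and timings are illustrative assumptions, not OpsGenie's actual implementation):

```python
def heartbeat_expired(last_ping_ts, now_ts, interval_seconds, grace_factor=2):
    """Dead man's switch: report failure once no ping has arrived within
    a grace window (here: two missed intervals)."""
    return (now_ts - last_ping_ts) > interval_seconds * grace_factor

# Pings arrive every 5 minutes (300 s); the last one was 11 minutes ago,
# so the heartbeat is considered expired and OpsGenie would alert.
print(heartbeat_expired(last_ping_ts=0, now_ts=660, interval_seconds=300))  # True
```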

CI/CD Pipeline

Goals for our Pipeline

● Put all Alerting and Recording Rules into a Git Repository
● Automatically test for syntax errors
● Deploy master branch on all Prometheus servers
● Merge to master → Deploy on Prometheus

How it works

● Jenkins
○ running promtool against each .yml file

● Bitbucket sending HTTP calls when master branch changes

● Ruby based HTTP Handler on Prometheus Servers
○ Accepting HTTP calls from Bitbucket
○ Git pull
○ Prometheus reload
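The deploy handler described above is Ruby in the original setup; here is a Python sketch of the same idea, with the rules path as a hypothetical assumption. Reloading via SIGHUP is one of Prometheus' two supported mechanisms (the other is `POST /-/reload` with `--web.enable-lifecycle`). The command runner is injectable so the logic can be exercised without touching the system:

```python
import subprocess

# Commands run on each push notification from Bitbucket. The repository
# path is hypothetical; `--ff-only` keeps the checkout strictly tracking master.
DEPLOY_STEPS = [
    ["git", "-C", "/etc/prometheus/rules", "pull", "--ff-only"],
    ["pkill", "-HUP", "prometheus"],  # SIGHUP makes Prometheus reload its config
]

def handle_push(run=subprocess.check_call):
    """Run every deploy step in order; returns the number of steps executed.
    In the real handler this is called from the HTTP endpoint the webhook hits."""
    for step in DEPLOY_STEPS:
        run(step)
    return len(DEPLOY_STEPS)
```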

Verify Builds for each Branch


THANK YOU

Niko Dominkowitsch
Infrastructure Engineer
niko.dominkowitsch@runtastic.com
runtastic.com