Monitoring InfluxCloud with InfluxDB, Grafana, Telegraf, and Kapacitor Paul Dix CTO & cofounder of InfluxData @pauldix

Page 1:

Monitoring InfluxCloud with InfluxDB, Grafana, Telegraf, and Kapacitor

Paul Dix CTO & cofounder of InfluxData

@pauldix

Page 2:
Page 3:

Who am I?

Page 4:

What is InfluxCloud?

Page 5:

Cost to monitor with SaaS > $10,000/month

Page 6:

We do it for < $700/month

Page 7:

Raise your hand if you use InfluxDB

Page 8:

Raise your hand if you use Telegraf

Page 9:

Raise your hand if you use Kapacitor

Page 10:

Two kinds of time series data…

Page 11:

InfluxDB Intro (optional)

Page 12:

What is time series data?

Page 13:

Stock trades and quotes

Page 14:

Metrics

Page 15:

Analytics

Page 16:

Events

Page 17:

Sensor data

Page 18:

Regular time series

t0 t1 t2 t3 t4 t6 t7

Samples at regular intervals

Page 19:

Irregular time series

t0 t1 t2 t3 t4 t6 t7

Events whenever they come in
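
The difference between the two kinds of series can be sketched in a few lines of Python. This is only an illustration: the function name `is_regular`, the `tolerance` parameter, and the sample timestamps are assumptions made for the example, not part of the talk.

```python
from typing import Sequence


def is_regular(timestamps: Sequence[int], tolerance: int = 0) -> bool:
    """Return True if consecutive timestamps are evenly spaced (within tolerance)."""
    if len(timestamps) < 3:
        # Fewer than two intervals: nothing to compare, so treat as regular.
        return True
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max(deltas) - min(deltas) <= tolerance


# Metrics sampled every 10 seconds form a regular series.
samples = [0, 10, 20, 30, 40]
# Events (for example, individual requests) arrive whenever they happen.
events = [0, 3, 4, 19, 40]

print(is_regular(samples))
print(is_regular(events))
```

The `tolerance` argument is there because real collectors rarely sample at perfectly even intervals.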

Page 20:

API - 2 endpoints

Page 21:

API - 2 endpoints

Write: HTTP POST /write?db=mydb&rp=foo

Page 22:

API - 2 endpoints

Write: HTTP POST /write?db=mydb&rp=foo

Read: HTTP GET /query?db=mydb&rp=foo
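
As a rough sketch, the two requests could be built in Python as shown here. The base URL (a local instance on port 8086), the helper names, and the sample query are assumptions for illustration. The `db` and `rp` parameters mirror the slide, and the statement for `/query` is passed in the `q` parameter. Nothing is sent over the network; the code only constructs the requests.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8086"  # assumption: a local InfluxDB instance


def write_request(db: str, rp: str, body: str) -> Request:
    """Build (but do not send) an HTTP POST to /write with a line protocol body."""
    url = f"{BASE_URL}/write?{urlencode({'db': db, 'rp': rp})}"
    return Request(url, data=body.encode("utf-8"), method="POST")


def query_request(db: str, rp: str, statement: str) -> Request:
    """Build (but do not send) an HTTP GET to /query; the statement goes in `q`."""
    url = f"{BASE_URL}/query?{urlencode({'db': db, 'rp': rp, 'q': statement})}"
    return Request(url, method="GET")


w = write_request("mydb", "foo", "cpu_load,host=server01 value=0.64")
r = query_request("mydb", "foo", "SELECT * FROM cpu_load")
print(w.get_method(), w.full_url)
print(r.get_method(), r.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the actual call against a running server.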

Page 23:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Page 24:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Line Protocol

Page 25:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Measurement name

Page 26:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Tags (should be sorted by key, value)

Page 27:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Tags (values are always strings)

Page 28:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=23422.0 1422568543702900257

Fields

Page 29:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
some_measurement value=23422.0,context="asdf jkl" 1422568543702900257

Unlimited Fields

Page 30:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
some_measurement value=23422 1422568543702900257

Fields (types: int64, float64, string, bool)

Page 31:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=0i 1422568543702900257

Fields (ints must have trailing i)

Page 32:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=0.0 1422568543702900257

Fields (types must remain consistent)
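
The consistency rule can be illustrated with a small Python sketch. The `FieldTypeRegistry` class and `FieldTypeConflict` exception are hypothetical names invented for this example. It shows the idea of "first type wins" and is not a description of how InfluxDB implements the check.

```python
class FieldTypeConflict(Exception):
    """Raised when a field is written with a different type than before."""


class FieldTypeRegistry:
    """Remembers the first type seen for each (measurement, field) pair."""

    def __init__(self) -> None:
        self._types = {}  # maps (measurement, field) -> Python type

    def check(self, measurement: str, field: str, value: object) -> None:
        key = (measurement, field)
        seen = self._types.setdefault(key, type(value))
        if seen is not type(value):
            raise FieldTypeConflict(
                f"{measurement}.{field} is {seen.__name__}, got {type(value).__name__}"
            )


registry = FieldTypeRegistry()
registry.check("network", "value", 23422.0)  # first write fixes the type as float
registry.check("network", "value", 0.0)      # float again: accepted
try:
    registry.check("network", "value", 0)    # int where float is expected
except FieldTypeConflict as err:
    print("rejected:", err)
```

This is why the slide writes the zero value as `0.0` rather than `0i`: the `value` field of `network` was first written as a float.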

Page 33:

Writing Data

/write?db=mydb&rp=foo

BODY:

cpu_load,host=server01,region=us-west value=0.64
cpu_load,host=server02,region=us-west value=0.55 1422568543702900257
network,direction=in,host=server01,region=us-west value=0.0 1422568543702900257

Timestamp (nanosecond epoch)
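
Putting the pieces of the line protocol together, a minimal encoder might look like the following Python sketch. The function names are made up for this example, and the escaping covers only the common cases (commas, equals signs, and spaces in keys and tag values; quotes and backslashes in string fields). In practice an existing client library is the safer choice.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry a trailing i
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def encode_point(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Encode one point: measurement, sorted tags, fields, optional timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags sorted by key
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(name, ',= ')}={_format_field(value)}"
        for name, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # nanoseconds since the epoch
    return line


print(encode_point("cpu_load", {"region": "us-west", "host": "server02"},
                   {"value": 0.55}, 1422568543702900257))
```

Several points are sent in one request by joining the encoded lines with newline characters.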

Page 34:

Our setup

Page 35:

Telegraf on all hosts

system metrics, InfluxDB metrics

Page 36:

InfluxDB

monitoring instance

m4.4xlarge

Page 37:

Kapacitor

monitoring instance

same instance as InfluxDB

Page 38:
Page 39:

Grafana

Page 40:

Grafana = Visibility

Page 41:

Only part of the monitoring story

Page 42:

hundreds or thousands of time series per server

Page 43:

~ 1M total series

Page 44:

You can’t visualize that!

Page 45:

Kapacitor

time series processing engine

Page 46:

Kapacitor

• Open Source

• MIT

• Go

• TICKscript

• Send alerts to log, web hook, email, exec, Sensu, HipChat, Alerta, Slack, OpsGenie, VictorOps, PagerDuty, Talk, Telegram

Page 47:

Stream Processing

Page 48:

Batch Processing

Page 49:

Some of our scripts

Page 50:

// detect system down
var period = 2m
var every = 30s

// select the stream
var sys_data = stream
    |from()
        .database('telegraf')
        .measurement('system')
        .where(lambda: "host" =~ /tot.*/ OR "host" =~ /prod.*/)
        .groupBy('host', 'cluster_id')
    |window()
        .period(period)
        .every(every)

Page 51:

var msg = 'Node {{ index .Tags "host" }} {{ if ne .Level "OK" }}is down!{{ else }}is back online.{{ end }}'

// alert if no data in period
sys_data
    |where(lambda: "host" != '')
    |deadman(0.0, period)
        .id('deadman/{{ index .Tags "host" }}')
        .message(msg)
        .details('')
        .stateChangesOnly()
        .post('http://localhost:8080/health-check')
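
The logic of this deadman alert can be approximated in plain Python. This is a simplified sketch and not Kapacitor's implementation: the function name, the per-window point counts, and the host name `prod-1` are assumptions for the example. It treats a window with no points as CRITICAL and emits a message only when the level changes, mirroring `.stateChangesOnly()`.

```python
from typing import Optional, Tuple


def deadman_message(
    host: str,
    points_in_period: int,
    previous_level: str,
    threshold: float = 0.0,
) -> Tuple[str, Optional[str]]:
    """Return (new_level, message); message is None when the level is unchanged."""
    level = "CRITICAL" if points_in_period <= threshold else "OK"
    if level == previous_level:
        return level, None  # state did not change, so stay quiet
    state = "is back online." if level == "OK" else "is down!"
    return level, f"Node {host} {state}"


# Point counts seen in four consecutive windows for one (hypothetical) host.
level = "OK"
for count in [4, 0, 0, 3]:
    level, message = deadman_message("prod-1", count, level)
    if message is not None:
        print(message)
```

Only the second and fourth windows produce output, because those are the only windows where the level changes.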

Page 52:

// Alert when disks are this % full
var alert_threshold = 80

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 1m

batch
    |query('''
        SELECT max(used_percent)
        FROM "telegraf"."default".disk
        WHERE ("host" =~ /prod-.*/ OR "host" =~ /tot-.*/) AND "path" = '/influxdb/conf'
    ''')
        .period(period)
        .every(every)
        .groupBy('host')
    |alert()
        .crit(lambda: "max" > alert_threshold)
        .message('Host {{ index .Tags "host" }}\'s disk is {{ index .Fields "max" }}% full!')
        .details('')
        .stateChangesOnly()
        .slack()
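
What this batch task computes per window can be sketched in Python. The function name, the row format, and the host names are assumptions for illustration. The sketch groups the rows by host, takes the maximum `used_percent`, and compares it to the threshold with a strict "greater than", as the `.crit` lambda does.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ALERT_THRESHOLD = 80  # percent, as in the script


def disk_levels(rows: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[str, float]]:
    """Map each host to (level, max used_percent) for one query window.

    `rows` holds (host, used_percent) pairs, standing in for the query result.
    """
    max_by_host = defaultdict(float)  # percentages are non-negative, so 0.0 is a safe start
    for host, used_percent in rows:
        max_by_host[host] = max(max_by_host[host], used_percent)
    return {
        host: ("CRITICAL" if peak > ALERT_THRESHOLD else "OK", peak)
        for host, peak in max_by_host.items()
    }


window = [("prod-1", 71.5), ("prod-1", 83.2), ("tot-2", 40.0)]
for host, (level, peak) in disk_levels(window).items():
    print(f"{host}: {level} ({peak}% full)")
```

A host sitting at exactly 80% stays OK, since the comparison is strict.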

Page 53:

// Alert when this percentage of writes fail over the specified period
var error_threshold = 80
var period = 2m
var every = 30s

var error_delta = stream
    |from()
        .database('telegraf')
        .retentionPolicy('default')
        .measurement('influxdb_httpd')
        .where(lambda: "host" =~ /tot.*/ OR "host" =~ /prod.*/)
    |groupBy('host', 'cluster_id')
    |window()
        .period(period)
        .every(every)
    |default()
        .field('serverError', 0)
    |derivative('serverError')
        .nonNegative()
        .as('derivative')
        .unit(period)
    |eval(lambda: "derivative")
        .as('derivative')

Page 54:

var ok_delta = stream
    |from()
        .database('telegraf')
        .retentionPolicy('default')
        .measurement('influxdb_httpd')
        .where(lambda: "host" =~ /tot.*/ OR "host" =~ /prod.*/)
    |groupBy('host', 'cluster_id')
    |window()
        .period(period)
        .every(every)
    |default()
        .field('req', 0)
    |derivative('req')
        .nonNegative()
        .as('derivative')
        .unit(period)
    |eval(lambda: "derivative")
        .as('derivative')

Page 55:

error_delta
    |join(ok_delta)
        .as('error', 'ok')
    |eval(lambda: (float("error.derivative") / float("error.derivative" + "ok.derivative")) * 100.0)
        .as('error_percent')
    |alert()
        .id('…')
        .message('…')
        .crit(lambda: "error_percent" >= error_threshold)
        .stateChangesOnly()
        .slack()
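
The arithmetic behind `error_percent` can be worked through in Python. This is a simplified sketch: the counter values are invented, the derivative is reduced to a plain difference between two samples (ignoring `.unit(period)` scaling), and returning 0.0 when both deltas are zero is an assumption the script itself does not state.

```python
from typing import Optional

ERROR_THRESHOLD = 80  # percent, as in the script


def non_negative_delta(previous: int, current: int) -> Optional[int]:
    """Counter increase between two samples, or None if the counter went backwards."""
    delta = current - previous
    return delta if delta >= 0 else None  # a reset counter yields no value


def error_percent(error_delta: float, ok_delta: float) -> float:
    """Same formula as the eval node: error / (error + ok) * 100."""
    total = error_delta + ok_delta
    if total == 0:
        return 0.0  # assumption: no traffic is treated as no errors
    return error_delta / total * 100.0


# Hypothetical counter samples at the start and end of one window.
errors = non_negative_delta(100, 135)  # serverError grew by 35
oks = non_negative_delta(200, 205)     # req grew by 5
percent = error_percent(errors, oks)
print(percent, "CRITICAL" if percent >= ERROR_THRESHOLD else "OK")
```

The zero-total guard matters in practice: an idle host produces two zero deltas, and the division would otherwise fail.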

Page 56:

User Defined Functions

Page 57:

some Grafana dashboards…

Page 58:

Questions?

Paul Dix @pauldix