Evolution of Observability Tools at Pinterest · © 2019 Pinterest. All rights reserved.

Post on 24-Jun-2020

Evolution of Observability Tools at Pinterest

Naoman Abbas, Engineering Manager, Observability

Naoman Abbas
Engineering Manager, Observability
● Operational Metrics
● Alerting
● Log Search
● Distributed Tracing

Pinterest

Helping people discover and do what they love
● 300M+ monthly active users
● 200B+ pins saved
● 2,000+ employees

Agenda

1. Why Observability?
2. Observability Tools
3. Evolution
4. Lessons Learned

Why Observability?

Microservice Architecture

Diagram: Client -> CDN -> API

Change Cycle

Diagram: Change -> Deploy -> Monitor -> Issue found -> Mitigate -> Root cause -> Change

Observability Tools

Statsboard

Operational metrics
● Service dashboards
● Alerts
● Debugging
● Performance tuning
● 2,000 dashboards
● 16,000 alerts

Logsearch

Debug logs
● Root cause analysis
● Alerting

Pintrace

Distributed tracing
● Performance tuning
● Root cause analysis

Evolution

Observability Needs

Reliable

Easy to use

Automated

Tools Reliability

Usage Growth

Tool Architecture

● Metrics Storage
    ○ Graphite -> OpenTSDB -> Sharding -> Goku (in-memory storage)

● Metrics Processing
    ○ Storm -> Spark Streaming -> Job Stream (custom streaming)

Data Reduction

Metrics Aggregation by Service

Host1

Host2
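The reduction on this slide can be sketched as a roll-up: datapoints reported by individual hosts for the same metric are combined into one series per service, so the storage layer keeps one series instead of one per host. The following is a minimal Python sketch under stated assumptions. The `DataPoint` shape, the service and metric names, and the choice of a sum as the aggregate are all illustrative and are not taken from Statsboard.

```python
from collections import defaultdict
from typing import NamedTuple


class DataPoint(NamedTuple):
    # Hypothetical datapoint shape, for illustration only.
    service: str
    host: str
    metric: str
    timestamp: int  # epoch seconds
    value: float


def aggregate_by_service(points):
    """Sum per-host values into one value per (service, metric, timestamp)."""
    totals = defaultdict(float)
    for p in points:
        # Dropping the host dimension is what reduces the number of series.
        totals[(p.service, p.metric, p.timestamp)] += p.value
    return dict(totals)


points = [
    DataPoint("api", "host1", "requests", 60, 120.0),
    DataPoint("api", "host2", "requests", 60, 80.0),
    DataPoint("api", "host1", "requests", 120, 100.0),
    DataPoint("api", "host2", "requests", 120, 90.0),
]
service_series = aggregate_by_service(points)
```

Here four per-host datapoints collapse into two service-level datapoints. A sum suits counters; gauges or latency percentiles would need a different aggregate, which this sketch does not cover.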

Data Reduction

Chargeback

Tools Usability

TScript: Scripting Language for Time-series

Before

divideSeries(abs(diffSeries(timeShift(tc.kafka.stats.kafka.server.brokertopicmetrics.mbytesinpersec.perTopic.OneMinuteRate.metrics07{topic=pinlogger},'7d'),tc.kafka.stats.kafka.server.brokertopicmetrics.mbytesinpersec.perTopic.OneMinuteRate.metrics07{topic=pinlogger})), 100)

After

today = metric
one_week = today.timeShift(Week)
return today.pctDiff(one_week).abs()
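To make the "After" script concrete, here is a rough Python sketch of what the chained calls might compute. It is not TScript and not Pinterest's implementation. The `Series` class, the method names `time_shift` and `pct_diff`, and the assumed semantics (percentage change relative to the shifted series) are hypothetical stand-ins that mirror the names on the slide.

```python
WEEK = 7 * 24 * 3600  # seconds; stands in for the `Week` constant on the slide


class Series:
    """Hypothetical time series: a mapping of epoch seconds to values."""

    def __init__(self, points):
        self.points = dict(points)

    def time_shift(self, seconds):
        # Move every point forward so older data lines up with current timestamps.
        return Series({ts + seconds: v for ts, v in self.points.items()})

    def pct_diff(self, other):
        # Assumed semantics: percentage change relative to `other`, computed
        # only where both series have a point and `other` is non-zero.
        return Series({
            ts: (v - other.points[ts]) / other.points[ts] * 100.0
            for ts, v in self.points.items()
            if other.points.get(ts)
        })

    def abs(self):
        return Series({ts: abs(v) for ts, v in self.points.items()})


def week_over_week(metric):
    today = metric
    one_week = today.time_shift(WEEK)
    return today.pct_diff(one_week).abs()


now = 10 * WEEK
metric = Series({now - WEEK: 200.0, now: 150.0})
result = week_over_week(metric)
```

With a value of 200 a week ago and 150 now, the result holds a single point at `now` with an absolute change of 25 percent. Naming the intermediate series is what makes the scripted form easier to read than the nested expression.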

Integrated Alerts

Quick Dash

RunDash

Tools Automation

"Roughly 70% of outages are due to changes in a live system" (Google SRE Book)

Automated Canary Analysis

Diagram: Clients -> Router -> {Baseline (v1.0), Canary (v1.1), Production (v1.0)}; Metrics -> Canary Analysis
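The comparison step in this setup can be sketched as follows: metrics from the baseline (current version) and the canary (new version) are compared, and the rollout fails if the canary is noticeably worse. This is a simplified Python sketch, not the analysis Pinterest runs. The function name, the use of a plain mean, and the 10% threshold are assumptions chosen for illustration.

```python
from statistics import mean


def canary_verdict(baseline, canary, max_relative_increase=0.10):
    """Return "pass" or "fail" by comparing mean metric values.

    `baseline` and `canary` are samples of a metric where higher is worse
    (for example, error rate). The 10% threshold is an arbitrary placeholder.
    """
    base_mean = mean(baseline)
    canary_mean = mean(canary)
    if base_mean == 0:
        # No baseline signal to compare against; fail only if the canary has any.
        return "fail" if canary_mean > 0 else "pass"
    relative_increase = (canary_mean - base_mean) / base_mean
    return "fail" if relative_increase > max_relative_increase else "pass"


baseline_error_rate = [0.010, 0.012, 0.011, 0.009]  # v1.0 samples
canary_error_rate = [0.018, 0.020, 0.019, 0.021]    # v1.1 samples
verdict = canary_verdict(baseline_error_rate, canary_error_rate)
```

In this example the canary's error rate is well above the baseline's, so the verdict is "fail". A common reason for comparing against a separate baseline group rather than the whole production fleet is that baseline and canary hosts start at the same time and receive similar traffic, which keeps the comparison fair; a production analysis would typically also use a statistical test rather than a fixed threshold on the mean.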

Automated Canary Analysis

Automated Root Cause Analysis

Diagram: Client -> CDN -> API -> services A, B, C

Automated Root Cause Analysis

Automated Root Cause Analysis: Trace Analyzer

Automated Root Cause Analysis: Trace Analyzer

Downstream Services Latency

Service              First traces   Second traces   Difference
smartfeedservice     99.3975        118.445         19.0478
pinacle_p2p          88.9196        92.2861         3.36652
smartfeedgenerator   67.0345        89.5526         22.5181
tenansix             65.3212        52.185          -13.1362
instantpfyservice    10.4688        10.6408         0.172027
pinlaterservice      10.4142        10.8673         0.453135
usercontextservice   9.80483        10.8275         1.02263
pinlinks             9.07438        8.88087         -0.193505
metatron             7.60317        8.72428         1.12111

Downstream Services Calls

Service              First traces   Second traces   Difference
metatron             764            1350            586
pinandboardservice   5526           4891            -635
pinacle_p2p          1924           1723            -201
terrapinthrift       596            400             -196
pinacle2             530            236             -294
usercontextservice   79             65              -14
anticlimaxservice    34             29              -5
asterix              32             26              -6
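The comparison shown in these tables can be sketched as a per-service difference between two batches of traces, sorted so the largest regression comes first. The Python below is an illustrative sketch, not the Trace Analyzer itself. The function name and the dictionary inputs are assumptions, and the numbers are a subset of the latency rows from the slide.

```python
def diff_by_service(first, second):
    """Return (service, first, second, difference) rows, largest increase first."""
    rows = []
    for service in first.keys() & second.keys():
        delta = second[service] - first[service]
        rows.append((service, first[service], second[service], delta))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Latency values copied from the slide (subset of rows).
first_latency = {
    "smartfeedservice": 99.3975,
    "smartfeedgenerator": 67.0345,
    "tenansix": 65.3212,
    "metatron": 7.60317,
}
second_latency = {
    "smartfeedservice": 118.445,
    "smartfeedgenerator": 89.5526,
    "tenansix": 52.185,
    "metatron": 8.72428,
}

ranked = diff_by_service(first_latency, second_latency)
top_regression = ranked[0][0]
```

For this subset, `smartfeedgenerator` ranks first because its latency grew the most between the two batches, while `tenansix` ranks last because its latency dropped. The same function works for the call-count table.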

Automation Roadmap

● Anomaly detection
● Auto remediation
● Error budgets (SLO/SLI)
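As background for the error-budget item, the arithmetic behind a request-based budget can be sketched in a few lines of Python. This is the general SLO calculation, not a description of Pinterest's roadmap work. The function name, the 99.9% target, and the request counts are made-up example values.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget; any failure exhausts it.
        remaining = 0.0 if failed_requests else 1.0
    else:
        remaining = 1.0 - failed_requests / allowed_failures
    return allowed_failures, remaining


allowed, remaining = error_budget(
    slo_target=0.999,          # SLO: 99.9% of requests succeed
    total_requests=1_000_000,  # requests in the evaluation window
    failed_requests=250,       # SLI numerator: observed failures
)
```

With these numbers the budget allows roughly 1,000 failed requests in the window, and about three quarters of it is still unspent. A negative remaining fraction would mean the budget is overspent.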

Lessons Learned

Lessons Learned

● Avoid tool fragmentation

● Observability is challenging

● Observability is expensive

● Great ROI

Observability Team

Brian Overstreet

Colin Probasco

Dai Nguyen

Humsheen Geo

Wei Zhu

Peter Kim

Acknowledgements

● Storage and Caching team (HBase, Goku)
● Logging team (Kafka)
● BDP team (Compute platform for Spark)

hiring-srecon@pinterest.com

We’re hiring! Come work with us!

@deathtocss

Questions
