Database Reliability Engineering -...

Database Reliability Engineering Velocity China, Dec 2016 Laine Campbell, DB Architect/Entrepreneur [email protected] @lainevcampbell September, 2015

Transcript of Database Reliability Engineering -...

Page 1: Database Reliability Engineering - O'Reillyvelocity.oreilly.com.cn/2016/ppts/DatabaseReliability...O.G. Devops 5 DEV OPS DBE shared goals, tools and processes Database Engineering

Database Reliability Engineering

Velocity China, Dec 2016

Laine Campbell, DB Architect/Entrepreneur

@lainevcampbell

September, 2015


engineering, not administration


yesterday’s DBA


gatekeeper

master builder

superhero

siloed

specialized


Reliability Engineering


Ben Treynor, VP of Engineering at Google, says the following about reliability engineering:

"fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor."


Database Engineers; O.G. Devops

DEV OPS

DBE

shared goals, tools and processes


Database Engineering


Guiding Principles

• Protect the Data

• Self-service for Scale

• Elimination of Toil

• Databases are not Special Snowflakes

• Eliminate the Barriers between Software and Operations


Protect the Data


• Responsibility of data protection shared by cross-functional teams.

• Standardization and automation with resiliency over expensive/complicated infrastructures.

• Durability and integrity baked into every part of the architecture and software development lifecycle


Self Service for Scale


• Metrics knowledge base and automated discovery/collection

• Backup/recovery utilities and APIs auto-deployed for new builds

• Reference architectures and configs for data store deployments.

• Security standards for data store deployments.

• Safe deployment patterns and tests for database changesets


DBs are not Special Snowflakes


• Cattle not Pets, Bill Baker

• DBs are the last holdouts against commoditization


A day in the life


A database’s hierarchy of needs


survival and safety

love and belonging

self-actualization

with loving credit and glory to Charity F. Majors


Survival and Safety


• Is your database alive?

• Is your data safe?

• Do you have a scaling plan?

Note: Scaling Patterns


Love and Belonging

• Are DB practices the same as those of the SWE and SRE teams?

• Are SWEs empowered to make/push DB migrations?

• Is there cross-functional sharing and teamwork?

Example: Guardrails at Etsy


Self-Actualization


• Database workflows accelerate velocity rather than hindering it.

• Developer work is safe and not impacting availability.

• Safe and quiet, databases do not take away excessive resources from other work.


the craft

Ops Core Competencies

• Service Level Mgmt.

• Operational Visibility

• Infrastructure Engineering and Mgmt.

• Release and Change Mgmt.

• Backup and Recovery


Service Level Management


Service Level Objectives

This Stuff is hard!

Social science rather than computation

Drives behaviors for the entire organization

Must reflect customer experience


Service Level Management

Why does this matter to the DBRE?

• This drives instrumentation plans

• Helps understanding of the DB’s role in the greater picture

• This drives your architectural decisions


Service Level Management

Service Level Indicators

• Latency

• Availability

• Throughput

• Durability

• Cost/Efficiency

Note: Latency vs. Response Time


Service Level Management


Defining Service Level Objectives

use distributions over monolithic averages

multimodal workloads require tiered objectives

target resiliency over robustness

consider impacts of percentages rather than whole populations

use your downtime budgets strategically
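To illustrate why distributions say more than a single average, here is a minimal Python sketch. The latency samples and the `percentile` helper are invented for illustration and are not from the talk; the helper uses the nearest-rank method, which is only one of several percentile definitions.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.
    `pct` is an integer between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical query latencies in milliseconds: 95 fast calls, 5 slow ones.
latencies_ms = [10] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

With these made-up numbers the mean (109.5 ms) describes neither the typical request (10 ms) nor the slow tail (2000 ms), which is the case for stating objectives against percentiles, possibly tiered per workload.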


Service Level Management


Sample mature SLOs - Availability

• 99.9% availability averaged over 1 week.

• No single incident greater than 10.08 minutes.

• Downtime is called if > 5% of users are impacted.

• One annual 4-hour downtime allowed, if:

  • Communicated to users >= 2 weeks ahead of time.

  • Impacts <= 10% of users at a time.
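The 10.08 minute figure above matches the full downtime budget implied by 99.9% availability over one week. A small sketch of that arithmetic follows; the function name `downtime_budget_minutes` is hypothetical, and `Decimal` is used only to keep the result exact.

```python
from decimal import Decimal


def downtime_budget_minutes(availability_pct, window_days):
    """Minutes of allowed downtime for an availability target over a window.

    `availability_pct` is passed as a string (for example "99.9") so that
    Decimal can represent it exactly.
    """
    window_minutes = Decimal(window_days) * 24 * 60
    return window_minutes * (Decimal(100) - Decimal(availability_pct)) / 100


# One week is 7 * 24 * 60 = 10080 minutes; 0.1% of that is the budget.
weekly_budget = downtime_budget_minutes("99.9", 7)
print(weekly_budget)
```

Under this reading, a single incident of 10.08 minutes would consume the entire weekly budget on its own.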


Operational Visibility


What is it for?

Break/fix and Alerting

Performance and Behavior Analysis

Capacity Planning

Debugging and Post-Mortems

Business Analysis

Situational Analysis


Operational Visibility


The New Rules

Metrics and Events are a BI system

Distributed/Ephemeral environments are trending towards the norm

High resolution is becoming standard

Greater opportunity for more noise than signal


Operational Visibility

How does this impact the DBRE?

• Storage for metrics becomes a significant part of your responsibility.

• Abstraction layers for data are required as specific instances/servers go in and out of service.

• Proficiency with analysis of distributions becomes critical.

• Focus on key metrics for alerting is mandatory to avoid pager fatigue.

• Recognizing opportunities for automated remediation becomes a high priority.

Note: Signal to Noise ratios and management


Operational Visibility


DBRE Responsibility

Defining metrics and events that must be stored for each persistence store

Educating other teams on analysis of metrics and events data

Ensuring SLOs for monitoring stores are met

Identification of patterns and issues that require depth of knowledge


Operational Visibility


What data do you need?

USE - Utilization/Saturation/Errors - Brendan Gregg

POV - From app (user context) and from inside the DB (details)

• DB Connection Layer Metrics

• DB Internal Metrics

• Database objects Metrics

• Database calls/queries

• Events - logs, external input


Operational Visibility


A day in the life

• Defining and encoding domain knowledge and best practices into the operational visibility stack.

• Advanced forensics and DB Context when problems surpass generalist knowledge.

• Supporting SWE in proper instrumentation and analysis

• Working with SRE on issue identification, resolution playbooks and automation.


Infrastructure


The Landscape

Virtual hosts

Cloud computing

Local and remote Storage

Containers

DBaaS


Infrastructure


How does the DBRE fit in?

Abstracted storage requires even more paranoia about data loss

Databases must become cattle

Advances in performance require continued testing and benchmarking of the DBMS

DBaaS reduces toil, allowing for focus on high-value operations


Virtualized Infrastructure

• Data integrity testing becomes more critical.

• Performance characteristics often require horizontal design and scale out.

• Dataset portability becomes a common and significant problem.

• Automation must be utilized as more moving parts are inevitable.

• Unpredictability of latency requires new design profiles for storage access.


Containerized Infrastructure

Data attachment and bootstrapping often make this model infeasible

Network and IO needs often impact ability to thrive in shared host models

Excellent tool for prototyping, integration and testing.


Database as a Service


• Reduces toil / work from DBRE resources

• Impacts visibility at OS, network, hardware

• Black box infrastructures create surprises, particularly around durability.

• When used well, DBRE can focus on high value tasks.

• When used without DBRE input, vendor lock-in, lack of forward vision and minimal knowledge of DB internals can impact latency, availability, performance and durability.


Infrastructure Mgmt.


Version control

Componentizing

Building from Configuration Mgmt.

Maintaining configuration

Infrastructure Definition and Orchestration

Service Discovery


Infrastructure


Paradigm Shift: Building

• Decisions on baked images vs. frying at build time

• Maintaining Configuration

• Idempotent changes (flexible, but prone to drift)

• Immutable infrastructures (simple, predictable, recoverable)

• Enforcement
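As a rough illustration of what an idempotent configuration change means, the sketch below applies a desired state to a host's settings and reports what changed. The function `apply_settings` and the setting names and values are illustrative assumptions, not taken from the talk or from any particular configuration management tool.

```python
def apply_settings(current, desired):
    """Return (new_config, changes) after moving `current` towards `desired`.

    Only keys whose values differ are touched, so running the function
    again on its own output produces no further changes.
    """
    new_config = dict(current)
    changes = {}
    for key, value in desired.items():
        if new_config.get(key) != value:
            # Record (old value, new value) for reporting or auditing.
            changes[key] = (new_config.get(key), value)
            new_config[key] = value
    return new_config, changes


# Illustrative desired state and an existing host that differs from it.
desired = {"max_connections": 500, "slow_query_log": "ON"}
host_config = {"max_connections": 151}

host_config, first_run = apply_settings(host_config, desired)
host_config, second_run = apply_settings(host_config, desired)

print(first_run)   # two settings changed on the first run
print(second_run)  # nothing left to change on the second run
```

The second run is a no-op, which is the property that makes repeated enforcement safe. An immutable approach would instead rebuild the host from an image rather than converge it in place.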


Infrastructure


Paradigm Shift: Orchestration

• Entire clusters and services in version control

• Great power, great responsibility

• High potential for data loss and integrity issues if mistakes occur.


Infrastructure


Paradigm Shift: Service Catalogs

• Extensive and complex infrastructures outgrow manual, crafted approaches.

• Failover, sharing and state management become responsibilities of the service catalog.


Infrastructure


Day in the Life

• Changes to configuration definitions as required.

• Testing, integration and deployment

• Launching of new clusters via Terraform

• Rolling upgrades of significant changes via service catalog updates.

• Durability and recovery tests of new storage solution from AWS

• Setting up tests in build system with Docker for integration


Backup and Recovery


Really recovery…

one of the most crucial processes in the organization

a culture of durability and recovery integrates these processes into everything

backup is merely a means to the end


Backup and Recovery


The New Rules

Dataset portability is king

Potential data loss scenarios are legion; defense in depth is required

Detection becomes critical

Where possible, recovery must be automated and used extensively


Backup and Recovery


Considerations

Service level objectives must dictate all choices

Workflows and event-driven architectures create dependency webs

Rapid development organizations can change data structures before a recovery need is even discovered


Backup and Recovery


Planned Usage to Exercise Process

New production nodes and clusters

Building different environments

ETL and downstream data stores

Operational tests


Backup and Recovery


Unplanned Scenarios

User Errors

Application Errors

Infrastructure Services

Operating Systems and Hardware Errors

Hardware Failures

Datacenter Failures


Backup and Recovery


Building Blocks of Strategy

Detection

Tiered Storage

A Varied Toolbox

Continuous Testing


Backup and Recovery


A day in the life

• Reviewing recovery tests, comparing to SLO needs

• Working with engineers to build data validation tests

• Working with engineers to review schema evolutions with an eye towards integrity requirements

• Reviewing downstream processes and testing for impact of failure scenarios
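One possible shape for a data validation test after a recovery exercise is to compare a fingerprint of each table in the source with the same table in the restored copy. The sketch below uses in-memory SQLite databases as stand-ins; the `accounts` table, the helper names and the sample rows are all hypothetical, and a real data store would likely need chunked or sampled comparison instead of a full scan.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key):
    """Return (row_count, sha256 hex digest) over all rows, ordered by `key`.

    `table` and `key` are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    digest = hashlib.sha256()
    count = 0
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"'):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def make_db(rows):
    """Build a throwaway in-memory database standing in for a real store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", rows)
    conn.commit()
    return conn


rows = [(1, "a@example.com"), (2, "b@example.com")]
source = make_db(rows)
restored = make_db(rows)      # stands in for a successful restore
damaged = make_db(rows[:1])   # stands in for a restore that lost a row

expected = table_fingerprint(source, "accounts", "id")
print(table_fingerprint(restored, "accounts", "id") == expected)
print(table_fingerprint(damaged, "accounts", "id") == expected)
```

A check like this can run after every automated restore, so that a silent loss is detected by the test rather than by a user.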


Release Management


Software Development

developer support

integration and testing

deployment


Release Management


The New Rules

DBREs must become enablers, not gatekeepers

Engineer training and collaboration are critical

DBREs should be intervening during challenging migrations and large impact decisions

Integration, build and deploy become part of the regular SWE lifecycle and toolset


Release Management


SWE Education

Become a funnel

Foster conversations

Domain specific knowledge

Collaboration


Release Management


Integration Prerequisites

Version Control System

Database Build Automation

Test Data

Database Migrations and Packaging

CI Server and Test Framework


Release Management


Deployment

Migrations and Versioning

Impact Analysis

Migration Patterns

Manual and Automated Deployments
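A minimal sketch of versioned migrations follows, assuming a `schema_version` bookkeeping table and a hard-coded migration list. Both are illustrative conventions, not something the talk prescribes, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical ordered migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN created_at TEXT"),
]


def applied_version(conn):
    """Return the highest migration version recorded, or 0 if none."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, migrations=MIGRATIONS):
    """Apply every migration newer than the recorded version, in order.

    Returns the list of versions applied by this call.
    """
    current = applied_version(conn)
    applied = []
    for version, statement in sorted(migrations):
        if version <= current:
            continue  # already applied on an earlier run
        conn.execute(statement)
        # Record the version only after the statement has succeeded.
        conn.execute(
            "INSERT INTO schema_version (version) VALUES (?)", (version,)
        )
        conn.commit()
        applied.append(version)
    return applied


conn = sqlite3.connect(":memory:")
first = migrate(conn)
second = migrate(conn)
print(first, second)
```

Because the applied version is stored in the database itself, the same `migrate` call can run in CI, in a manual deployment or in an automated one, and it skips whatever has already been applied.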


Release Management


A day in the life

Brown bag sessions and knowledge sharing of new features, CVEs and benchmarks/use cases

Reviewing the previous day’s tests and commits

Updating deployment patterns for SWEs

Migration planning for a rolling data change

building bridges


software engineering


bring DBREs into your development processes

integrate DBREs with software versioning system

teach DBREs the testing frameworks

DBREs study the language, the framework, the drivers and the ORMs


systems engineering


Collaborate on infrastructure and operating systems standards

Integrate DBREs with config mgmt. and orchestration

Automate recovery and teach it to everyone

Teach SREs about DB forensics and repair


further deep dives


understand the statistics and math around distributions, anomaly detection and correlation

write and push code!

answer the customer service phones

dive into your network layers

teach everyone about the data