database reliability engineering what. why. how. Percona Live, Dublin, 2017 Laine Campbell, Sr. Dir, Production Engineering, Fastly

we made a book!

yesterday’s DBA

gatekeeper

master builder

superhero

siloed

specialized

today’s DBRE

educator and mentor

platform builder

force multiplier

operations expert

t-shaped

database reliability engineering

develop reliable and resilient datastores and shared components for provisioning

Provide patterns and knowledge to support other teams’ processes and facilitate their work

understand and teach data access and storage nuances to ensure all service level objectives can be met

anchor teams with expertise for troubleshooting, recovery and other tasks requiring depth, not breadth

why DBRE?

reliability is only a smidgen of the operational mandate

reliability isn’t always required in operations

the closer to the data, the more reliability is necessary

data requires paranoia, not chaos (@jessietron, https://blog.codeship.com/growing-tech-stack-say-no/)

the path to DBRE

Who's your DBA? If you don’t know the answer, it’s probably you.

the DBA, continuing to evolve their skillset

the ops engineer, taking data ownership

software engineers developing data driven applications

paradigm shifts

polyglot persistence

relational is not the end of the line

new and shiny brings risk

One engine is less likely to suit all cases

we vet, find edge cases, learn patterns

Integration between engines is crucial

The data must flow

virtualization and cloud

commodity hardware for DBs

forces designing for resilience

new storage paradigms

the power of rapid provisioning

DBaaS availability reduces toil

infrastructure as code

forces componentization

no special snowflakes

requires coding chops

changes become deployments

Infrastructure can be versioned

continuous delivery

we must be teachers, not gatekeepers

testing and compliance become top priorities

we build patterns to replace reviews

build guardrails and tools

database engineers: O.G. devops

dev, ops and DBE: shared goals, tools and processes

dbre guiding principles

● protect the data

● self-service for scale

● elimination of toil

● databases are not special snowflakes

● reduce the barriers between software and operations

protect the data

● responsibility for data protection is shared by cross-functional teams.

● standardization and automation, with simplicity over expensive/complicated infrastructures.

● durability and integrity baked into every part of the architecture and software development lifecycle.

self service for scale

● metrics knowledge base and automated discovery/collection

● backup/recovery utilities and APIs auto-deployed for new builds

● reference architectures and configs for data store deployments.

● security standards for data store deployments.

● safe deployment patterns and tests for database changesets

dbs are not special snowflakes

● moving databases into the “cattle, not pets” paradigm a la Bill Baker

● databases are the last holdouts against commoditization

● keep it simple, stupid

reduce barriers between teams

● Use the same code review and deploy processes

● Use the same provisioning and config management

● Data-land should feel familiar and intuitive, not alienating or “special”

dbre core competencies

choose boring tech if possible

testing consistency guarantees

benchmarking performance

building gold configurations

testing operability

datastore decision making

Spend innovation tokens only on key differentiators

If your company is very young, optimize for velocity and developer productivity.

The more mature your company is, the more operational impact over the long term trumps all.

recoverability

build effective tiered backup solutions - the right backup for the right dataset

work with SWE to build data validation pipelines

integrate recovery into daily activities

build recovery testing automation
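
As an illustration of recovery testing automation, here is a minimal sketch that takes a backup, treats it as the restored copy, and compares every table against the source. It assumes SQLite purely so the example is self-contained; the helper names (`table_fingerprint`, `list_tables`, `verify_restore`, `backup_and_verify`) are hypothetical, and a real datastore would need its own backup and restore tooling in place of `Connection.backup`.

```python
import hashlib
import sqlite3


def list_tables(conn):
    """Return the user tables in a SQLite database, sorted by name."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    return [row[0] for row in cur]


def table_fingerprint(conn, table):
    """Return (row_count, sha256) for a table, independent of row order."""
    rows = sorted(repr(r) for r in conn.execute(f'SELECT * FROM "{table}"'))
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return len(rows), digest


def verify_restore(source, restored):
    """Return the tables that differ; an empty list means the restore checks out."""
    tables = list_tables(source)
    if tables != list_tables(restored):
        return ["<table list differs>"]
    return [
        t for t in tables
        if table_fingerprint(source, t) != table_fingerprint(restored, t)
    ]


def backup_and_verify(source):
    """Back up into a fresh in-memory database and verify it against the source."""
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # sqlite3.Connection.backup (Python 3.7+)
    return restored, verify_restore(source, restored)


# Usage: build a tiny source database (illustrative data), back it up, verify.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
src.commit()
restored, mismatches = backup_and_verify(src)
```

Run on a schedule, a check like this makes recovery a daily activity rather than something first attempted during an incident.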

data distribution

determine effective number of copies

distribute broadly based on requirements

replication, partitioning and sharding
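
One common way to shard is to hash a key and take it modulo the shard count. The sketch below shows that idea only; it is not a scheme from the talk, and `shard_for` and the key format are hypothetical. Plain modulo hashing remaps most keys when the shard count changes, which is one of the constraints worth weighing before choosing it.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard index in [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Use a stable hash: Python's built-in hash() of str varies per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Usage: the same key always routes to the same shard.
shard = shard_for("user:42", 8)
```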

failover and availability

understand availability strategies for each datastore

Evaluate limitations and consequences of failover (cold caches, data loss, etc.)

practice and document failover in controlled circumstances

continued training of all staff on the process to get it in “muscle memory”

scaling paths

plan for and test up to 10x growth

keep an eye out for constraints that will impact the future (aka don’t screw your future self)
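
To put a date on 10x growth, a small calculation helps. The sketch below assumes steady compound monthly growth, which is a simplification, and `months_until_multiple` is a hypothetical helper name.

```python
import math


def months_until_multiple(monthly_growth_rate, multiple=10.0):
    """Months until load reaches `multiple` times today's, at compound growth.

    monthly_growth_rate is a fraction, e.g. 0.10 for 10% per month.
    """
    if monthly_growth_rate <= 0:
        raise ValueError("growth rate must be positive")
    if multiple <= 1:
        raise ValueError("multiple must be greater than 1")
    return math.log(multiple) / math.log(1.0 + monthly_growth_rate)


# Usage: at 10% growth per month, 10x arrives in roughly two years.
months = months_until_multiple(0.10)
```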

build guard rails

● backups, restores. verified.

● unit tests

● migration and fallback testing

● DDL migration patterns

● migration heuristic analysis

● boring failovers

● shared on call rotations

● network isolation between prod & other

● don’t get pissy when people mess up, help them fix it

● don’t swoop in and do it all yourself

automation

● rolling DDL

● scaling up or down

● failovers and traffic shifts

● detecting or killing bad queries
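
A minimal sketch of the detection half of that last item, assuming the running queries have already been fetched into simple records. The `RunningQuery` shape, the 30-second threshold and the protected user names are all hypothetical. The sketch only emits MySQL-style `KILL QUERY` statements for review and does not execute anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningQuery:
    query_id: int
    user: str
    seconds_running: float
    sql: str


def find_bad_queries(queries, max_seconds=30.0,
                     protected_users=("replication", "backup")):
    """Return queries running longer than max_seconds, longest first.

    Queries from protected users (assumed to be long-running by design)
    are never flagged.
    """
    offenders = [
        q for q in queries
        if q.seconds_running > max_seconds and q.user not in protected_users
    ]
    return sorted(offenders, key=lambda q: q.seconds_running, reverse=True)


def kill_statements(offenders):
    """Render MySQL-style KILL QUERY statements for a human or tool to review."""
    return [f"KILL QUERY {q.query_id};" for q in offenders]
```

Keeping detection separate from the kill step lets the same logic drive an alert first and an automated kill later, once the thresholds have earned trust.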

build for services

● observability

● instrumentation

● graceful degradation

build for people

● tools for debugging

● checklists, documentation

● tractable levels of graphs and alerting

release management

● develop patterns for migrations

● heuristics to surface risky changes

● migration and fallback testing

● tiered dataset sizes for testing (development, integration, full)

● continued developer training and education
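
Heuristics to surface risky changes can start very simply. The sketch below flags a few obviously dangerous statements in a migration script; the rule set and the `risky_changes` helper are hypothetical, and the statement splitting is naive (it breaks on semicolons inside string literals), so treat it as a starting point rather than a parser.

```python
import re

# (reason, pattern) pairs; each pattern is matched against one statement.
RISK_RULES = [
    ("drops a table",
     re.compile(r"^\s*DROP\s+TABLE\b", re.IGNORECASE)),
    ("drops a column",
     re.compile(r"^\s*ALTER\s+TABLE\b.*\bDROP\s+COLUMN\b",
                re.IGNORECASE | re.DOTALL)),
    ("truncates a table",
     re.compile(r"^\s*TRUNCATE\b", re.IGNORECASE)),
    ("UPDATE or DELETE without WHERE",
     re.compile(r"^\s*(UPDATE|DELETE)\b(?!.*\bWHERE\b)",
                re.IGNORECASE | re.DOTALL)),
]


def risky_changes(migration_sql):
    """Return (reason, statement) pairs for statements that match a risk rule."""
    findings = []
    for statement in migration_sql.split(";"):  # naive split, see note above
        statement = statement.strip()
        if not statement:
            continue
        for reason, pattern in RISK_RULES:
            if pattern.search(statement):
                findings.append((reason, statement))
    return findings
```

A check like this can run in the deploy pipeline so that only flagged migrations need a database engineer's review, in line with building patterns to replace reviews.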

self-actualized databases

● your data is safe and recoverable

● resilient to common errors

● understandable, debuggable

● shares processes and tooling with the rest of the stack

● empowers you to achieve your mission.

In conclusion...

Don’t be scared. Mind your damn backups. Everything else will probably be ok.