database reliability engineering: what. why. how.
Percona Live, Dublin, 2017
Laine Campbell, Sr. Dir, Production Engineering, Fastly
database reliability engineering
5
develop reliable and resilient datastores and shared components for provisioning
Provide patterns and knowledge to support other teams’ processes and facilitate their work
understand and teach data access and storage nuances to ensure all service level objectives can be met
anchor teams with expertise for troubleshooting, recovery and other tasks requiring depth, not breadth
why DBRE?
6
reliability is only a smidgen of the operational mandate
reliability isn’t always required in operations
the closer to the data, the more reliability is necessary
data requires paranoia, not chaos (@jessietron, https://blog.codeship.com/growing-tech-stack-say-no/)
the path to DBRE
7
Who's your DBA? If you don’t know the answer, it’s probably you.
the DBA, continuing to evolve their skillset
the ops engineer, taking data ownership
software engineers developing data driven applications
polyglot persistence
9
relational is not the end of the line
new and shiny brings risk
One engine is less likely to suit all cases
we vet, find edge cases, learn patterns
Integration between engines is crucial
The data must flow
virtualization and cloud
10
commodity hardware for DBs
forces designing for resilience
new storage paradigms
the power of rapid provisioning
DBaaS availability reduces toil
infrastructure as code
11
forces componentization
no special snowflakes
requires coding chops
changes become deployments
Infrastructure can be versioned
continuous delivery
12
we must be teachers, not gatekeepers
testing and compliance become top priorities
we build patterns to replace reviews
build guardrails and tools
dbre guiding principles
14
● protect the data
● self-service for scale
● elimination of toil
● databases are not special snowflakes
● reduce the barriers between software and operations
protect the data
15
● responsibility of data protection shared by cross-functional teams.
● standardization and automation with simplicity over expensive/complicated infrastructures.
● durability and integrity baked into every part of the architecture and software development lifecycle
self service for scale
16
● metrics knowledge base and automated discovery/collection
● backup/recovery utilities and APIs auto-deployed for new builds
● reference architectures and configs for data store deployments.
● security standards for data store deployments.
● safe deployment patterns and tests for database changesets
dbs are not special snowflakes
17
● moving databases into the “cattle, not pets” paradigm a la Bill Baker
● databases are the last holdouts against commoditization
● keep it simple, stupid
18
reduce barriers between teams
● Use the same code review and deploy processes
● Use the same provisioning and config management
● Data-land should feel familiar and intuitive, not alienating or “special”
20
choose boring tech if possible
testing consistency guarantees
benchmarking performance
building gold configurations
testing operability
datastore decision making
21
Spend innovation tokens only on key differentiators
If your company is very young, optimize for velocity and developer productivity.
The more mature your company is, the more long-term operational impact trumps all else.
recoverability
22
build effective tiered backup solutions: the right backup for the right dataset
work with SWE to build data validation pipelines
integrate recovery into daily activities
build recovery testing automation
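
Recovery testing automation can start as small as comparing a restored copy against a digest recorded at backup time. The sketch below is illustrative only: `table_digest` and `verify_restore` are hypothetical names, rows are assumed to be plain tuples already fetched from the source and the restored datastore, and a real pipeline would stream rows rather than hold them in memory.

```python
import hashlib


def table_digest(rows):
    """Order-independent digest of an iterable of row tuples."""
    # Hash each row, then sort the hashes so row order does not matter.
    row_hashes = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def verify_restore(source_rows, restored_rows):
    """Return True when the restored copy matches the source snapshot."""
    return table_digest(source_rows) == table_digest(restored_rows)


# Hypothetical sample data standing in for a snapshot and two restores.
source = [(1, "alice"), (2, "bob")]
restored_ok = [(2, "bob"), (1, "alice")]  # same rows, different order
restored_bad = [(1, "alice")]             # a row went missing

print(verify_restore(source, restored_ok))
print(verify_restore(source, restored_bad))
```

Running a check like this on every scheduled restore is one way to make recovery part of daily activities instead of an emergency-only procedure.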
data distribution
24
determine effective number of copies
distribute broadly based on requirements
replication, partitioning and sharding
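
As a rough illustration of how sharding and the number of copies interact, the sketch below assigns a key to a shard with a stable hash and places a configurable number of replicas on consecutive nodes. The function names, the node list, and the consecutive-placement rule are assumptions made for the example, not a scheme taken from the talk.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def replicas_for(shard, nodes, copies):
    """Place `copies` replicas of a shard on consecutive nodes."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    return [nodes[(shard + i) % len(nodes)] for i in range(copies)]


nodes = ["db-a", "db-b", "db-c"]  # hypothetical node names
shard = shard_for("customer:42", shard_count=3)
print(shard, replicas_for(shard, nodes, copies=2))
```

Plain modulo hashing moves most keys when `shard_count` changes, so a production design would usually prefer consistent hashing or a lookup table.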
failover and availability
25
understand availability strategies for each datastore
Evaluate limitations and consequences of failover (cold caches, data loss, etc.)
practice and document failover in controlled circumstances
continued training of all staff on the process to get it into “muscle memory”
scaling paths
26
plan for and test up to 10x growth
keep an eye out for constraints that will impact the future (aka don’t screw your future self)
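
A 10x plan can begin with simple arithmetic: multiply current usage by the growth factor and list whatever exceeds a known limit. The sketch below assumes linear growth, and the resource names and limits are invented for illustration; `headroom_report` is a hypothetical helper.

```python
def headroom_report(current, limits, growth_factor=10):
    """Return resources whose projected usage exceeds their limit.

    Assumes usage scales linearly with growth, which is a simplification.
    """
    projected = {name: value * growth_factor for name, value in current.items()}
    return {
        name: value for name, value in projected.items() if value > limits[name]
    }


# Hypothetical numbers for a single datastore.
current = {"disk_gb": 200, "qps": 1500}
limits = {"disk_gb": 4000, "qps": 10000}

print(headroom_report(current, limits))
```

Here disk stays within its limit at 10x, while query throughput is the constraint that would hurt the future self.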
build guard rails
27
● backups, restores. verified.
● unit tests
● migration and fallback testing
● DDL migration patterns
● migration heuristic analysis
● boring failovers
● shared on call rotations
● network isolation between prod & other
● don’t get pissy when people mess up, help them fix it
● don’t swoop in and do it all yourself
automation
28
● rolling DDL
● scaling up or down
● failovers and traffic shifts
● detecting or killing bad queries
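
Detecting or killing bad queries can be sketched as a filter over the list of running queries plus an optional kill step. Everything below is hypothetical: `RunningQuery`, the 30-second threshold, and the `kill` callback stand in for whatever a given datastore exposes, and the dry-run default is a design choice so that the tool reports before it acts.

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    query_id: int
    seconds: float
    sql: str


def find_bad_queries(queries, max_seconds=30.0):
    """Return the queries that have run longer than the threshold."""
    return [q for q in queries if q.seconds > max_seconds]


def kill_bad_queries(queries, kill, max_seconds=30.0, dry_run=True):
    """Report long-running query ids; call `kill` on each unless dry_run."""
    bad = find_bad_queries(queries, max_seconds)
    if not dry_run:
        for q in bad:
            kill(q.query_id)
    return [q.query_id for q in bad]


# Hypothetical snapshot of running queries.
queries = [
    RunningQuery(1, 2.0, "SELECT 1"),
    RunningQuery(2, 95.0, "SELECT * FROM big_table"),
]
killed = []
print(kill_bad_queries(queries, killed.append))  # dry run: nothing is killed
```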
build for people
30
● tools for debugging
● checklists, documentation
● tractable levels of graphs and alerting
release management
31
● develop patterns for migrations
● heuristics to surface risky changes
● migration and fallback testing
● tiered dataset sizes for testing (development, integration, full)
● continued developer training and education
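
One possible shape for heuristics that surface risky changes is a small set of patterns run over each statement in a migration. The patterns and labels below are illustrative assumptions, not a complete rule set, and splitting on `;` is a simplification that would break on semicolons inside string literals.

```python
import re

# Hypothetical heuristics: label -> pattern that marks a statement as risky.
RISK_PATTERNS = {
    "drops a table": re.compile(r"\bDROP\s+TABLE\b", re.I),
    "drops a column": re.compile(r"\bDROP\s+COLUMN\b", re.I),
    "truncates a table": re.compile(r"^\s*TRUNCATE\b", re.I),
    "unbounded delete or update": re.compile(
        r"^\s*(DELETE|UPDATE)\b(?!.*\bWHERE\b)", re.I | re.S
    ),
}


def risky_changes(sql_script):
    """Return (label, statement) pairs for statements that match a heuristic."""
    findings = []
    statements = [s.strip() for s in sql_script.split(";") if s.strip()]
    for statement in statements:
        for label, pattern in RISK_PATTERNS.items():
            if pattern.search(statement):
                findings.append((label, statement))
    return findings


migration = """
ALTER TABLE users ADD COLUMN age INT;
DELETE FROM sessions;
UPDATE users SET active = 1 WHERE id = 2;
DROP TABLE old_logs;
"""

for label, statement in risky_changes(migration):
    print(f"{label}: {statement}")
```

A flagged change would go to a human for review, while unflagged changes could follow the standard deployment path.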
33
self-actualized databases
● your data is safe and recoverable
● resilient to common errors
● understandable, debuggable
● shares processes and tooling with the rest of the stack
● empowers you to achieve your mission.