
Page 1:

Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine

Mike Svoboda
Staff Systems and Automation Engineer
www.linkedin.com/in/mikesvoboda
msvoboda@linkedin.com
https://github.com/linkedin/sysops-api

Page 2:

My Background with LinkedIn / CFEngine

Hired at LinkedIn into System Operations in 2010

When I started, our server count was 300 machines

Implemented CFEngine automation in 2010

Since then, we have grown 100 times that size

Created our Redis API in 2012 to provide visibility

Page 3:

What is Redis?

• Redis is an in-memory key-value store, similar to Memcached, with additional features
• Offers on-disk persistence (snapshots to disk), so you can use it as a real database instead of just a volatile cache
• Offers simple data structures out of the box, with commands to work with them natively: dictionaries, lists, sets, sorted sets, etc.
• Highly scalable data store: a single Redis server can satisfy hundreds of thousands of requests per second
• Supports transactions: group commands together so they are executed as a single transaction

Page 4:

What is CFEngine?

CFEngine is an IT infrastructure automation framework that helps manage infrastructure throughout its lifecycle:

• Builds, deploys, and manages systems
• Provides auditing
• Maintains infrastructure by enforcing intended system state for compliance
• Runs on the smallest embedded devices, servers, desktops, mainframes, and big iron
• Easily supports tens of thousands of hosts and provides horizontal scalability

Page 5:

How CFEngine works

Page 6:

CFEngine reduces operational costs

Using CFEngine automation is more effective than hiring additional headcount

• Stop fighting fires every day
• Allow operations to focus on tomorrow’s problems
• Stay ahead of the curve
• Keeping the lights on is automated
• Respond to outages rapidly

Page 7:

Why LinkedIn chose CFEngine

• Very mature codebase
• Not dependent on underlying virtual machines like Ruby, Python, Perl, etc.
• Flexible architecture
  • Easily scales upward to support thousands of machines
  • Just as simple to support smaller environments
• Zero reported security vulnerabilities
• Lightweight footprint

Page 8:

What CFEngine has done for LinkedIn

Since implementing CFEngine:

• Operations has become extremely agile
  • Quickly respond to and resolve outages
• System administration workload has been reduced, even with 100x the number of servers
• Have built a new datacenter in minutes with little effort
• Real-time visibility after creating our Redis infrastructure, driven by CFEngine execution
  • Can answer any question imaginable about all of our servers in seconds
  • Know every action that happens on our machines

Page 9:

How LinkedIn uses CFEngine

Functions we have automated:

• Hardware failure detection
• Account administration
• Privilege escalation
• Software deployment
• O/S configuration management
• Process / service management
• System monitoring

You never need to log into a machine to manage it

Page 10:

Two problems still existed for LinkedIn that automation didn’t address

• The company wanted to be able to answer any question imaginable about production.
• We didn’t want to break production by pushing new automation changes.

To solve both problems, we needed visibility.

Page 11:

Problem #1: The company wants questions answered. STAT!

Management and engineers want questions answered immediately, and they ask several times a day, interrupting your work.

Page 12:

LinkedIn was hunting for data

Page 13:

What LinkedIn sysadmins were doing

Thousands of network connections were made from a single host to remote machines to fetch data.

Did I get results from everything?

Parse results after collection

• Questions about infrastructure were answered by sysadmins SSHing to machines to hunt for data.
• As our scale increased, we used a remote execution tool, some variant of SSH / DSH, to parallelize the collection.

Page 14:

Forcing command execution on remote machines doesn’t scale

Machines were missed, data wasn’t collected:

• Firewalls mangled packets
• SSHD was offline or didn’t spawn on the remote host
• Depended on system accounts being valid
• Network connections to the remote machine failed

Data collection shouldn’t be complicated

We were unsure whether we had collected all of the necessary data.

Page 15:

Problem #2: We didn’t want to break production by pushing new automation changes.

Ops was hesitant to use automation because they didn’t know where things would break

When automation was expanded, we didn’t know where systems needed alternative behavior to work correctly (or where they had been modified by developers with root access)

Ops had to be agile and work fast. The business needed us to modify production multiple times a day, but we had to make changes without breaking it

Page 16:

Automation changes were happening in the blind

Sysadmins were under pressure from:

• large ticket queues
• numerous change requests
• business needs to scale

Automation changes were being performed without fully understanding their impact before they were executed

We realized that this could lead to mistakes, disasters, outages, and pink slips. To keep this from happening, I built our Redis API to provide visibility.

Page 17:

To provide visibility, we had to scale data collection

We had to build a reliable, extremely fast system that could give us the results of remote command execution from tens of thousands of systems in seconds

Querying this data could not put load on production systems

The cache needed to be publicly available to the company via an API so people could answer their own questions

We needed to quickly add new data into the cache before pushing automation changes, so we could view the production impact.

Page 18:

We built a cache and populated it with data to answer arbitrary questions

Instead of executing commands remotely, we have CFEngine populate the cache with commonly queried data

CFEngine executes expensive commands like lshw or dmidecode once and makes the output available for everybody to use

Data collection becomes a scheduled event that happens once a day. This data collection becomes a cost of doing business

With the same data being gathered on all machines, it becomes trivial to compare two or more pieces of hardware

Page 19:

Architecture of the Cache

Step 1: Rely on CFEngine execution to drive data insertion

Step 2: Shard your data

Step 3: Use software load balancing!

Page 20:

Step 1: CFEngine drives data insertion

Leverage automation to change what you insert or remove from the cache

Page 21:

The cache is a simple dictionary, sharded over multiple Redis servers.
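
As a rough illustration of that layout, the sketch below models each Redis server as a plain Python dict and writes one value to several of them. The "hostname#path" key shape is taken from the sample output later in this deck; the shard names, the host name, and the insert() helper are hypothetical and are not the sysops-api code.

```python
from typing import Dict, Iterable

# Each "shard" is a plain dict standing in for one Redis server (assumption:
# the real system talks to Redis on CFEngine policy servers instead).
Shard = Dict[str, bytes]


def make_key(hostname: str, path: str) -> str:
    """Build a 'hostname#path' key, the shape seen in the deck's sample output."""
    return f"{hostname}#{path}"


def insert(shards: Dict[str, Shard], targets: Iterable[str],
           hostname: str, path: str, data: bytes) -> str:
    """Write one value to every target shard so any replica can serve reads."""
    key = make_key(hostname, path)
    for name in targets:
        shards[name][key] = data
    return key


# Hypothetical shard names: two replicas that both receive the same entry.
shards = {"policy-a": {}, "policy-b": {}}
key = insert(shards, ["policy-a", "policy-b"],
             "host01.example.com", "/etc/passwd",
             b"root:x:0:0:root:/root:/bin/bash\n")
print(key)
```

Writing the same key to more than one server is what later allows a reader to pick any replica at random.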

Page 22:

Step 2: Extract Sharded Data

Determine scope. How much data do I need to answer my question?

For each CFEngine policy server running Redis, search Redis for matching keys in the dictionary

For each key we find from a search, perform the relevant data extraction:

• Contents
• Md5sum
• os.stat()
• Wordcount
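
The search-then-extract flow above can be sketched as follows. This is an illustration only: a dict stands in for one Redis server, fnmatch stands in for Redis's glob-style key matching, the function names are made up, and "wordcount" is read here as a whitespace-separated word count, which is an assumption. The os.stat() mode is left out because it depends on metadata the sketch does not store.

```python
import fnmatch
import hashlib
from typing import Dict, Iterator, Tuple

Cache = Dict[str, bytes]  # stand-in for one Redis server's dictionary


def search(cache: Cache, pattern: str) -> Iterator[str]:
    """Yield cache keys matching a shell-style glob, in sorted order."""
    for key in sorted(cache):
        if fnmatch.fnmatchcase(key, pattern):
            yield key


def extract(cache: Cache, pattern: str,
            mode: str = "contents") -> Iterator[Tuple[str, object]]:
    """Yield (key, extracted value) pairs for every key matching the pattern."""
    extractors = {
        "contents": lambda data: data,
        "md5sum": lambda data: hashlib.md5(data).hexdigest(),
        "wordcount": lambda data: len(data.split()),  # assumed meaning
    }
    if mode not in extractors:
        raise ValueError(f"unknown mode: {mode}")
    fn = extractors[mode]
    for key in search(cache, pattern):
        yield key, fn(cache[key])


# Hypothetical cache contents for two hosts.
cache = {
    "host01.example.com#/etc/passwd": b"root:x:0:0:root:/root:/bin/bash\n",
    "host02.example.com#/etc/passwd": b"root:x:0:0:root:/root:/bin/sh\n",
    "host02.example.com#/etc/hosts": b"127.0.0.1 localhost\n",
}
for key, value in extract(cache, "*#/etc/passwd", "md5sum"):
    print(key, value)
```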

Page 23:

Step 3: Use Software Load Balancing!

Have clients populate multiple Redis servers on insertion

• Pick a Redis server at random on extraction (load balancing)
• If we don’t get a response from our first choice, pick another Redis server at random (failover)

Find randomized CFEngine policy servers with Redis from each level in the scope

• If the CFEngine policy server responds, push it into a list of machines we need to query for data
• If the CFEngine policy server doesn’t respond, pick another one at random (failover)
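
A minimal sketch of the random-pick-with-failover idea, under stated assumptions: is_alive is a hypothetical health check (in a real deployment it might be a Redis PING with a short timeout), and the level and server names are invented for the example.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def pick_responsive(servers: Sequence[str],
                    is_alive: Callable[[str], bool],
                    rng: Optional[random.Random] = None) -> Optional[str]:
    """Try servers in random order; return the first that responds, else None."""
    rng = rng or random.Random()
    candidates = list(servers)
    rng.shuffle(candidates)          # random order gives load balancing
    for server in candidates:
        if is_alive(server):         # a dead server means: try the next one
            return server
    return None


def servers_to_query(levels: Dict[str, Sequence[str]],
                     is_alive: Callable[[str], bool],
                     rng: Optional[random.Random] = None) -> List[str]:
    """Pick one responsive server per level in the scope, skipping dead levels."""
    chosen = []
    for servers in levels.values():
        server = pick_responsive(servers, is_alive, rng)
        if server is not None:
            chosen.append(server)
    return chosen


# Hypothetical topology: two levels, with some servers marked as down.
down = {"a1", "b1"}
levels = {"site-a": ["a1", "a2"], "site-b": ["b1"]}
print(servers_to_query(levels, lambda s: s not in down))
```

Because every server in a level holds the same data, it does not matter which responsive one is chosen; a level with no responsive server is simply left out of the query list.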

Page 24:

Local Scope

Page 25:

Example: Local cache extraction

    $ time extract_sysops_cache.py \
        --search /etc/passwd \
        --contents | grep msvoboda | wc -l
    487

    real 0m1.813s
    user 0m1.484s
    sys  0m0.087s

Page 26:

Site (datacenter) Scope

Page 27:

Example: Site cache extraction

    $ time extract_sysops_cache.py \
        --site lva1 \
        --search /etc/passwd \
        --contents | grep msvoboda | wc -l
    8687

    real 0m19.169s
    user 0m30.286s
    sys  0m1.271s

Page 28:

Global Scope

Page 29:

Example: Global cache extraction

    $ time extract_sysops_cache.py \
        --scope global \
        --search /etc/passwd \
        --contents | grep msvoboda | wc -l
    27344

    real 0m44.827s
    user 1m39.532s
    sys  0m4.288s

Page 30:

Make it fast! Become Multithreaded
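
The slide gives no implementation detail, so the following is only one plausible shape for it: query every shard from its own worker thread and merge the partial results. Dicts stand in for Redis servers, and query_shard, query_all, and the sample data are hypothetical names.

```python
import fnmatch
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Shard = Dict[str, bytes]  # stand-in for one Redis server


def query_shard(shard: Shard, pattern: str) -> Dict[str, bytes]:
    """Return the entries of one shard whose keys match the glob pattern."""
    return {k: v for k, v in shard.items() if fnmatch.fnmatchcase(k, pattern)}


def query_all(shards: List[Shard], pattern: str,
              max_workers: int = 8) -> Dict[str, bytes]:
    """Query all shards concurrently and merge the partial results."""
    merged: Dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each shard is handled by a worker thread; with real network calls
        # the threads spend most of their time waiting on I/O, so they overlap.
        for partial in pool.map(lambda s: query_shard(s, pattern), shards):
            merged.update(partial)
    return merged


# Hypothetical data: two shards, each holding entries for different hosts.
shard_list = [
    {"host01#/etc/passwd": b"a", "host01#/etc/hosts": b"b"},
    {"host02#/etc/passwd": b"c"},
]
print(sorted(query_all(shard_list, "*#/etc/passwd")))
```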

Page 31:

Make it faster! Build a Redis pipeline

Page 32:

Cache extraction with a pipeline
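
To show why pipelining helps, the sketch below uses an in-memory stand-in that counts network round trips. FakeRedis and Pipeline are illustrative classes written for this example, not a real client library; real clients such as redis-py expose a comparable queue-then-execute interface.

```python
from typing import Dict, List, Optional


class FakeRedis:
    """In-memory stand-in that counts round trips (not a real Redis client)."""

    def __init__(self, data: Dict[str, bytes]) -> None:
        self.data = dict(data)
        self.round_trips = 0

    def get(self, key: str) -> Optional[bytes]:
        self.round_trips += 1            # one request, one response
        return self.data.get(key)

    def execute_batch(self, keys: List[str]) -> List[Optional[bytes]]:
        self.round_trips += 1            # many commands, one round trip
        return [self.data.get(k) for k in keys]


class Pipeline:
    """Queue GET commands locally, then send them all at once."""

    def __init__(self, server: FakeRedis) -> None:
        self.server = server
        self.queued: List[str] = []

    def get(self, key: str) -> "Pipeline":
        self.queued.append(key)          # nothing is sent yet
        return self

    def execute(self) -> List[Optional[bytes]]:
        keys, self.queued = self.queued, []
        return self.server.execute_batch(keys)


keys = ["h1#/etc/passwd", "h2#/etc/passwd", "h3#/etc/passwd"]
server = FakeRedis({k: b"data" for k in keys})

naive = [server.get(k) for k in keys]
naive_trips = server.round_trips

server.round_trips = 0
pipe = Pipeline(server)
for k in keys:
    pipe.get(k)
batched = pipe.execute()
print(naive_trips, server.round_trips)
```

The results are identical either way; only the number of round trips changes, which matters most when the Redis server is in a remote datacenter.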

Page 33:

Extracting the Cache for Fun and Profit

    [msvoboda@esv4-infra01 ~]$ extract_sysops_cache.py \
        --scope local \
        --search mps*cm.conf \
        --md5sum \
        --prefix-hostnames

    esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf 12721673715de3ee6b9dec487529355e
    esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf 56b03a16c69e5b246a565dbcda44ba28
    esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf 11e20e28ec60ac6c71cbb71b0a6c9b35
    esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf 55402eda02e7f5c17dc7535455adc097
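
One way to put output like this to work is to group hosts by checksum, so that machines whose copy of a file differs stand out before a change is pushed. The sketch below is an illustration, not part of the sysops-api tooling; group_by_checksum is a hypothetical helper, and the input lines are the four shown on this slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_checksum(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each md5sum to the hostnames that reported it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, checksum = line.rsplit(None, 1)     # "host#path" and md5sum
        hostname, _, _path = key.partition("#")  # split at the first '#'
        groups[checksum].append(hostname)
    return dict(groups)


output = """\
esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf 12721673715de3ee6b9dec487529355e
esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf 56b03a16c69e5b246a565dbcda44ba28
esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf 11e20e28ec60ac6c71cbb71b0a6c9b35
esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf 55402eda02e7f5c17dc7535455adc097
"""
groups = group_by_checksum(output.splitlines())
for checksum, hosts in groups.items():
    print(checksum, len(hosts))
```

In this sample every host reports a different checksum, so each group holds exactly one machine.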

Page 34:

Make it fastest! Compression is significant!

• Less network overhead on cache insertion
• Less network overhead on cache extraction
• More stuff we can put into the cache
• Less network I/O = faster results delivered
• Less CPU usage on extraction
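
The deck does not say which compression library is used, so the sketch below assumes zlib from the Python standard library purely for illustration. It compresses a value before insertion and decompresses it after extraction; pack, unpack, and the sample data are hypothetical.

```python
import zlib


def pack(data: bytes, level: int = 6) -> bytes:
    """Compress a value before inserting it into the cache."""
    return zlib.compress(data, level)


def unpack(blob: bytes) -> bytes:
    """Decompress a value after extracting it from the cache."""
    return zlib.decompress(blob)


# Hypothetical payload: repetitive text, similar in spirit to config files.
sample = b"msvoboda:x:1000:1000::/home/msvoboda:/bin/bash\n" * 200
blob = pack(sample)
print(len(sample), len(blob))
```

Text such as configuration files and command output tends to be repetitive, which is why it usually compresses well; the actual ratio depends on the data.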

Page 35:

Seconds for cache insertion

Page 36:

CPU cycles for cache insertion

Page 37:

Data size in megabytes of the cache for an entire datacenter

Page 38:

Time for cross country complete datacenter cache extraction

Page 39:

Drink from the firehose

Page 40:

With the Redis API, you can now be confident in pushing automation changes

You know what systems will be affected before a change

You aren’t hit with surprises in production

You have added visibility

You don’t have to log into machines to modify or update them

Page 41:

Summary

Before / after implementation of CFEngine & Redis API at LinkedIn

Headcount
  Before: 6 people supporting a few hundred machines
  After:  6 people supporting tens of thousands of machines

Time spent
  Before: Hours to build a single machine
  After:  Build complete datacenters in minutes

Productivity
  Before: Hours spent collecting data before change, change itself causing outages
  After:  Can focus on building infrastructure, team became proactive to fix future problems, not reactive / firefighting

Ease of scaling server deployment
  Before: Incredibly difficult to respond to change, low visibility into production
  After:  Superior administration, rapid response to changing needs, complete system visibility

Page 42:

Open Source

Questions?

msvoboda@linkedin.com

www.linkedin.com/in/mikesvoboda

You can download the code from this presentation here:

https://github.com/linkedin/sysops-api