Rescuing Resque

Backgrounding Overhaul: Overview and Results

Credit where credit is due

● This has been a cross-team effort
  ○ Development
  ○ QA
  ○ Operations
  ○ L3
● Lots of people have helped
● This includes management (no suckup)

What are background jobs?

● Tasks to be performed in the background (duh)

● May be handed off by the web
● May be handed off by other jobs
● May be scheduled at regular intervals
● Are typically expensive

At PeopleAdmin backgrounding is...

● Resque (Ruby API)
● Redis (Middleware)
● Jobs are put in queues
● Workers look at queues for work
● Workers are grouped into pools
● We have 1 pool per worker server
● We have many worker servers
● Resque scheduler puts jobs into queues at their scheduled run time
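As a rough illustration of "jobs are put in queues, workers look at queues for work", here is a minimal in-memory sketch. Only the job-class convention (an `@queue` class instance variable plus a `self.perform` method) mirrors Resque; `TinyBackend`, `SendEmail` and the queue names are hypothetical stand-ins, and a plain Hash takes the place of Redis.

```ruby
# Minimal in-memory stand-in for the Resque/Redis flow described above.
# TinyBackend, SendEmail and the queue names are hypothetical.

class TinyBackend
  def initialize
    # queue name => FIFO list of [job class, args]
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  # Put a job on the queue its class declares via @queue.
  def enqueue(job_class, *args)
    queue = job_class.instance_variable_get(:@queue)
    @queues[queue] << [job_class, args]
  end

  # A worker looks at its queues in order and runs the first job it finds.
  # Returns the job's result, or nil when every queue is empty.
  def work_one(queue_names)
    queue_names.each do |name|
      job_class, args = @queues[name].shift
      return job_class.perform(*args) if job_class
    end
    nil
  end

  def size(queue)
    @queues[queue].size
  end
end

# Resque-style job class: declares its queue and a class-level perform.
class SendEmail
  @queue = :emails

  def self.perform(address)
    "sent to #{address}"
  end
end

backend = TinyBackend.new
backend.enqueue(SendEmail, "applicant@example.com")
result = backend.work_one([:keyword_indexes, :emails, :imports])
```

With the real library, the enqueue step would be `Resque.enqueue(SendEmail, "applicant@example.com")` and the worker processes would be started separately.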

So what do we use backgrounding for?

EVERYTHING

Specifically...

● Transitions of postings, applications, hiring proposals, etc...
● Emails
● Keyword indexing (search)
● Import jobs
● Export jobs
● Report generation (EEO)
● Employment task lifecycle
● Onboarding task lifecycle
● Marketplace integrations (job boards, background checks)
● Chore notifications
● Clearing cached data
● Promoting changes between customer environments
● Employer stats collection
● Et cetera
● Et cetera
● Et cetera

So, uh... If everything relies on this, wouldn’t changes be dangerous?

YES

But we are smart and daring (sometimes)

So what were/are the problems?

● Visibility
● Performance
● Job Contention
● Technology limitations
● Technology reliability
● Deployment interruption
● Others...

No Visibility

● Resque was a black box
● Operations, L3 & Development had no view into production
● Ability to diagnose problems was limited
● Also had no way to know if we were creating more problems

No Visibility

● Instrumented jobs with Splunk
● Gave us sophisticated querying ability and graphing of results
● Gave us view into life of each job
● Allowed view into usage patterns, time in queue, time to perform and other metrics
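A sketch of what per-job instrumentation could look like: one key=value log line per job, a format that a log indexer such as Splunk can search and graph. The helper name, the field names and the timestamps are all assumptions for illustration, not the production instrumentation.

```ruby
require "time"

# Hypothetical instrumentation helper: builds one key=value log line per job.
# Field names are illustrative, not the ones used in production.
def job_log_line(job:, queue:, enqueued_at:, started_at:, finished_at:)
  time_in_queue   = started_at - enqueued_at   # seconds waiting for a worker
  time_to_perform = finished_at - started_at   # seconds spent doing the work
  fields = {
    job: job,
    queue: queue,
    enqueued_at: enqueued_at.utc.iso8601,
    time_in_queue: time_in_queue.round(3),
    time_to_perform: time_to_perform.round(3),
    # what the user experiences: waiting plus working
    perceived_time: (time_in_queue + time_to_perform).round(3)
  }
  fields.map { |key, value| "#{key}=#{value}" }.join(" ")
end

# Arbitrary example timestamps: 4 seconds in queue, 9 seconds to perform.
enqueued = Time.utc(2013, 1, 1, 12, 0, 0)
line = job_log_line(
  job: "EmailJob", queue: "emails",
  enqueued_at: enqueued, started_at: enqueued + 4, finished_at: enqueued + 13
)
puts line
```

Logging both durations separately makes it possible to tell a slow job apart from a job that merely waited a long time for a worker.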


Performance

● Perceived performance is time in queue + time to perform

● Some individual jobs were particularly slow to perform
  ○ emails
  ○ system events

● These affected the system as a whole

Performance

● Emails & system events targeted for performance improvements

● Perform time for emails down from 23 seconds to 9 seconds

● Perform time for system events down from 32 to 8 seconds

Job Contention

● Non-prod jobs interfered with production jobs


Job Contention


● So we separated prod & non-prod queues

● Still have a few issues...


Job Contention

● Jobs of different types in the same queue would contend for workers

● So we reallocated jobs into fine-grained queues

Technology Limitations

● Resque & Resque-Pool work, but are simple
● We are not simple
  ○ Multiple customers
  ○ Multiple groups
  ○ User activity dynamics
  ○ Flood possibility

● Best illustrated by example...

Technology Limitations

[Diagram: three queues, left to right: Keyword Indexes, Emails, Imports, with one worker]

Jobs enter the queues

Workers prioritize queues from left to right

Worker proceeds down list of queues until it finds a job to be processed

If no jobs are available, workers start back at the left of the list



[Diagram: Keyword Indexes, Emails and Imports queues during a flood of jobs]

Sometimes we get floods of jobs

Workers are dumb, they always start at left and move right

Queues with a lower priority than the flooded queue get lonely

Net result is a customer waiting while a job sits in a queue
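The starvation effect can be sketched with plain arrays in place of Redis. The queue names follow the slides; the job counts, the number of workers per round and the `reserve` helper are made up for illustration.

```ruby
# Sketch of strict left-to-right polling, using plain arrays in place of Redis.
# Queue names follow the slides; the job counts are made up.

# Take the next job from the first non-empty queue in the given order.
def reserve(queues, order)
  order.each do |name|
    job = queues[name].shift
    return [name, job] if job
  end
  nil
end

queues = {
  keyword_indexes: Array.new(9) { |i| "index-#{i}" },  # the flood
  emails:          ["welcome-email"],                   # a customer is waiting on this
  imports:         []
}
order = [:keyword_indexes, :emails, :imports]  # same list for every worker

# Three workers each take one job per round.
rounds_until_email = nil
(1..10).each do |round|
  picked = Array.new(3) { reserve(queues, order) }.compact
  if picked.any? { |name, _job| name == :emails }
    rounds_until_email = round
    break
  end
end
```

Because all three workers share one ordering, the single email is only picked up in round 4, after the nine keyword-index jobs have drained.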

[Animation: the three workers pick up jobs from the flooded queues]


● There was no existing solution to this problem within the Resque ecosystem.

● Our options
  ○ Migrate to a different technology
  ○ Contribute enhancements to our current technology

● We opted for the latter (Qtrix)


Qtrix says, “Your priority is…”

Our central Qtrix orchestrator tells each worker what their queue priorities are

Workers are still dumb, but the lists are intelligently shuffled

Every queue is the top priority of at least one worker

Higher priority queues appear to the left more often than lower priority queues

Worker 1: Keyword Indexes, Emails, Imports
Worker 2: Emails, Imports, Keyword Indexes
Worker 3: Imports, Keyword Indexes, Emails
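One way to produce per-worker lists with those two properties is sketched below. This is not Qtrix's actual algorithm: the weights, the method names and the "pin the first few workers, weighted-shuffle the rest" approach are all assumptions made for illustration.

```ruby
# Hypothetical sketch of per-worker queue orderings with the two properties
# from the slide. This is NOT Qtrix's actual algorithm; the weights and the
# method names are assumptions.

# Order queues by weighted random draw without replacement, so heavier
# queues tend to land further left.
def weighted_order(weights, rng)
  remaining = weights.dup
  order = []
  until remaining.empty?
    target = rng.rand * remaining.values.sum
    name, _weight = remaining.find { |_n, w| (target -= w) < 0 } || remaining.first
    order << name
    remaining.delete(name)
  end
  order
end

# Build one ordering per worker. Worker i (for i < number of queues) is pinned
# to queue i as its top priority; the rest of each list is weighted-shuffled.
def orderings(weights, worker_count, rng: Random.new)
  names = weights.keys
  Array.new(worker_count) do |i|
    if i < names.size
      head = names[i]
      [head] + weighted_order(weights.reject { |n, _| n == head }, rng)
    else
      weighted_order(weights, rng)
    end
  end
end

weights = { keyword_indexes: 5, emails: 3, imports: 1 }
lists = orderings(weights, 6, rng: Random.new(42))
lists.each_with_index { |list, i| puts "Worker #{i + 1}: #{list.join(', ')}" }
```

The pinning step only guarantees that every queue heads a list when there are at least as many workers as queues. The workers themselves stay unchanged: each one still polls its own list from left to right.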


[Animation: the flooded queues drain as Workers 1, 2 and 3 each follow their own ordering]


Qtrix also gives us...

● The ability to create different priority configurations for different scenarios
● The ability to change to those configurations on the fly
● The ability to script these changes in reaction to different events
● The ability to have this work elastically

We are not taking advantage of all of these things yet…

Technology Reliability

● Redis is memory bound
● Resque would leave a mess
● Redis was a single point of failure
● Solutions
  ○ Automated memory cleanup
  ○ Added Redis AOF backups
  ○ Added data replication but not failover (yet)

Deployment Interruption

● Jobs would be terminated
● Jobs sit idle while workers restart
● Scheduler would go down and execution times missed
● Ditto employer method jobs, plus hung locks

Deployment Interruption

● Now…
  ○ All jobs finish gracefully
  ○ There is no delay time where jobs are not getting worked (includes employer methods jobs)
  ○ Scheduler is not brought down during deploys
  ○ Employer method job locks are still a problem
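"All jobs finish gracefully" can be sketched as a worker loop that only checks for a stop request between jobs. Resque workers behave this way when sent QUIT (finish the current job, then exit); the class below is a simplified, hypothetical stand-in rather than Resque's implementation or the actual deploy tooling.

```ruby
# Sketch of a worker loop that finishes its current job before stopping.
# GracefulWorker is a hypothetical stand-in, not Resque's implementation.
class GracefulWorker
  attr_reader :completed

  def initialize(jobs)
    @jobs = jobs          # array of callables standing in for queued jobs
    @completed = []
    @stopping = false
  end

  # Called from a signal handler (or a deploy script) to request shutdown.
  def stop!
    @stopping = true
  end

  # Run jobs one at a time; the stop flag is only checked between jobs,
  # so a job that has started is never cut off part-way through.
  def run
    until @stopping || @jobs.empty?
      job = @jobs.shift
      @completed << job.call(self)
    end
  end

  # Jobs left in the queue for the next worker to pick up.
  def pending
    @jobs.size
  end
end

jobs = [
  ->(worker) { worker.stop!; "job 1 finished" },  # deploy starts mid-job
  ->(_worker) { "job 2 finished" },
  ->(_worker) { "job 3 finished" }
]
worker = GracefulWorker.new(jobs)
# In a real process: Signal.trap("QUIT") { worker.stop! }
worker.run
```

Here the first job completes even though the stop request arrives while it is running, and the two remaining jobs stay queued for whichever worker is running after the deploy.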

We have gained

● Diagnostic ability
● Performance metrics
● Better performance
● Less long-term & catastrophic risk
● Lowered resource needs
● Lower customer pain

And here we are...

Still issues

● Redis is a single point of failure
● Resque scheduler reliability
● Scaling elastically
● Tidying up

The numbers tell the story

Since June...

● Total time waiting on jobs decreased 31%
  ○ SystemEventWorker time decreased 72%
● Total time jobs enqueued decreased 68%
  ○ Production jobs enqueued time decreased 74%
● Redis memory use decreased ~70%
● “Stuck jobs” during floods decreased 100%
● Eliminated 1 worker server

Thanks!

● For the opportunity to work on these fun, challenging problems
● For the help along the way
● For the trust to be allowed to work unrestrained
● For the patience & understanding when things didn’t go according to plan

Questions?
