Gem updates and scaling features #11

Open
danielevans wants to merge 9 commits into malomalo:master from danielevans:feature/scaling

Conversation

@danielevans

Hi! 👋

We are using this gem in several large applications to prevent the exhaustion of third-party rate limits. We have encountered a few issues due to the scale of our queues.

First, with a sufficiently high rate limit (our main example is at: 4000, per: 60), the amount of time spent garbage collecting the queue grows; for us it has reached 5s per attempt.

Second, given the size of the rate limits, the number of commands per check, and a large number of Resque boxes (the same example has 25 boxes with 8 workers each), the pressure on our Redis CPU became extreme.

Third, when a Resque worker exits uncleanly in any way that prevents an ensure block from executing, the uuid is left in the set and the rate limit is effectively permanently reduced by 1. This can happen during an out-of-memory kill or any other abrupt termination of the worker.

To fix this we are:

  1. Adding the capability to centralize garbage collection, eliminating the worker slowdown and Redis CPU issues.
  2. Adding an optional max_duration option which causes garbage collection of tasks that have gone on so long that they are considered dead.
  3. Logging and resolving any situation where the Redis hash entry is missing but the set still contains the uuid.
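A hypothetical sketch of GC rules 2 and 3 above, using plain Ruby structures in place of the gem's Redis set and hash; the function name, data layout, and max_duration keyword are illustrative assumptions, not the PR's actual implementation:

```ruby
require "set"

# Sketch of the two GC rules: reclaim tasks running longer than max_duration,
# and reclaim uuids left in the set with no matching hash entry (e.g. after an
# unclean worker exit). running_set stands in for the Redis set of in-flight
# uuids; started_at_by_uuid stands in for the Redis hash of start times.
def gc_stale_entries(running_set, started_at_by_uuid, max_duration:, now: Time.now)
  reclaimed = []
  running_set.each do |uuid|
    started_at = started_at_by_uuid[uuid]
    if started_at.nil?
      # Rule 3: orphaned uuid -- log it and reclaim the rate-limit slot.
      warn "reclaiming orphaned uuid #{uuid}"
      reclaimed << uuid
    elsif now - started_at > max_duration
      # Rule 2: the task has exceeded max_duration; consider it dead.
      reclaimed << uuid
    end
  end
  reclaimed.each do |uuid|
    running_set.delete(uuid)
    started_at_by_uuid.delete(uuid)
  end
  reclaimed
end
```

For example, with max_duration: 60, a task started 900 seconds ago is reclaimed as dead, and a uuid with no start-time entry is reclaimed as orphaned, while a task started 10 seconds ago is left running.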

In addition, this change unpins the Resque version and updates the tests to work with Resque 2.0, and removes and .gitignores the Gemfile.lock, which is conventional for Ruby gems.

@malomalo
Owner

I like where this is going, but I don't like having to bring up another process unless it's necessary.

Might a better option be to use a mutex with Redis using set(key, nx: true, ex: ?)?

If it gets the mutex it does the GC and then removes the mutex key; if not, it continues, assuming the queue has hit its rate limit. This would allow only 1 GC per job queue, and only when it needs to be run.

I'm not sure what to set the expiration of the mutex to. It would probably be some function of the at option; my first guess is at/500 based on your results, though it should be faster when only 1 client is GCing.

It wouldn't be proactive GC, which I can see as a benefit of your approach, but for that situation we could have another Resque job that triggers on a schedule.
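The lock-gated GC suggested above could look roughly like this. FakeRedis is a minimal in-memory stand-in for the Redis SET NX EX pattern so the sketch is self-contained; real code would call redis-rb's set(key, value, nx: true, ex: seconds). The lock key name and the ttl default are assumptions:

```ruby
# Minimal in-memory stand-in for Redis SET key value NX EX; only tracks expiry.
class FakeRedis
  def initialize(now: -> { Time.now })
    @store = {}
    @now = now
  end

  # Returns true only if the key was absent or expired, i.e. the lock was won.
  def set_nx_ex(key, value, ttl)
    expires_at = @store[key]
    return false if expires_at && expires_at > @now.call
    @store[key] = @now.call + ttl
    true
  end

  def del(key)
    @store.delete(key)
  end
end

# Lock-gated GC: only the client that wins the mutex sweeps the queue; every
# other client skips the sweep and proceeds as if the queue is at its limit.
# The ttl default is arbitrary here; the comment above suggests deriving it
# from the at option (e.g. at/500).
def gc_with_lock(redis, queue, ttl: 10)
  return :skipped unless redis.set_nx_ex("#{queue}:gc-lock", "1", ttl)
  begin
    :collected # placeholder for the actual garbage collection sweep
  ensure
    redis.del("#{queue}:gc-lock")
  end
end
```

Releasing the lock in an ensure block mirrors the failure mode from the PR description: if the sweep raises, the lock is still removed, and if the process dies before the ensure runs, the EX expiry reclaims the lock anyway.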

Thoughts?

@danielevans
Author

A distributed lock approach would probably have prevented this from ever causing problems, and I like it in general. However, knowing what's going on, I still prefer a sidecar process for our case.

It is more predictable, gives proactive GC, and makes the process easier to understand and monitor. It moves as much of the burden as possible out of the workers, allowing them to remain entirely dedicated to performing work.

The infrastructure for managing and monitoring processes already exists thanks to resque-scheduler, and the sidecar process is a much simpler approach.

And you are correct: we have already switched to a centralized process using a monkey-patched version, and we immediately saw a ~25% drop in GC time, an 80% drop in Redis CPU usage, and a 60% drop in our Resque worker CPU usage.

@malomalo
Owner

malomalo commented Apr 4, 2020

Cool, I'll give this a spin sometime this week and hopefully get it into master soon after

@parikshit223933

Hey, I have refactored the logic and tested it on a real production application. It seems to be working fine with no issues.
Please check this out: #20

Some of the problems mentioned here are handled in this logic.
