Adding storage-throttle module to address "over capacity" issues #2502

Merged: 2 commits, May 10, 2019

Conversation

Logic-32
Contributor

@Logic-32 Logic-32 commented Apr 17, 2019

This feature allows the server to automatically queue and back off when resource-related exceptions occur.

To test this, take the most recent build from jitpack like so and add the throttle variable.

$ curl -sSL https://jitpack.io/com/github/Logic-32/zipkin/zipkin-server/issue-2481-SNAPSHOT/zipkin-server-issue-2481-SNAPSHOT-exec.jar > zipkin.jar
$ STORAGE_THROTTLE_ENABLED=true STORAGE_TYPE=elasticsearch java -jar zipkin.jar 

Configuration

These settings can be used to help tune the rate at which Zipkin flushes data to another, underlying StorageComponent (such as Elasticsearch):

* `STORAGE_THROTTLE_ENABLED`: Enables throttling
* `STORAGE_THROTTLE_MIN_CONCURRENCY`: Minimum number of Threads to use for writing to storage.
* `STORAGE_THROTTLE_MAX_CONCURRENCY`: Maximum number of Threads to use for writing to storage.  In order to avoid configuration drift, this value may override other, storage-specific values such as Elasticsearch's `ES_MAX_REQUESTS`.
* `STORAGE_THROTTLE_MAX_QUEUE_SIZE`: How many messages to buffer while all Threads are writing data before abandoning a message (0 = no buffering).
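
As a rough illustration of how the three numeric settings interact, here is a minimal sketch that maps them onto a plain `java.util.concurrent.ThreadPoolExecutor`. This is an assumption made for explanation only, not the code in this PR: the class `ThrottleSettingsSketch` and its methods are hypothetical, and the real throttle also adjusts its limit dynamically.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of the settings; not the implementation from this PR. */
class ThrottleSettingsSketch {
  /** Builds a pool whose bounds mirror min/max concurrency and max queue size. */
  static ThreadPoolExecutor newPool(int minConcurrency, int maxConcurrency, int maxQueueSize) {
    // A queue size of 0 means "no buffering": hand off to a thread directly or reject.
    BlockingQueue<Runnable> queue = maxQueueSize == 0
        ? new SynchronousQueue<>()
        : new LinkedBlockingQueue<>(maxQueueSize);
    // The default AbortPolicy throws RejectedExecutionException once the pool is saturated.
    return new ThreadPoolExecutor(minConcurrency, maxConcurrency, 60, TimeUnit.SECONDS, queue);
  }

  /** Submits tasks that block until one is rejected; returns how many were accepted. */
  static int acceptedBeforeRejection(int min, int max, int queueSize) {
    ThreadPoolExecutor pool = newPool(min, max, queueSize);
    CountDownLatch release = new CountDownLatch(1);
    int accepted = 0;
    try {
      while (true) {
        pool.execute(() -> {
          try {
            release.await(); // simulate a slow write to storage
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        accepted++;
      }
    } catch (RejectedExecutionException e) {
      return accepted; // all threads busy and the buffer is full: the message is abandoned
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }
}
```

In this sketch, the number of messages accepted while storage is stalled is the maximum concurrency plus the queue size, which is the intuition behind sizing `STORAGE_THROTTLE_MAX_QUEUE_SIZE`.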

Change Details:

* Adding storage-throttle module/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time.
* Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled.
* Making sure RejectedExecutionExceptions are "first class" citizens since they are used to reduce the throttle.
* Removing HttpCall's Semaphore in favor of the throttle (same purpose, different implementations).
* Inspired by work done on #2169.
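
To make the "RejectedExecutionExceptions reduce the throttle" idea concrete, here is a minimal sketch of a wrapper that shrinks a concurrency limit when the wrapped call is rejected and grows it slowly on success. The class `AdaptiveLimitSketch` and its halve-on-rejection rule are assumptions for illustration; they are a simplified stand-in, not the limiter algorithm this PR uses.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Supplier;

/** Hypothetical stand-in for an adaptive limit; not the limiter used in this PR. */
class AdaptiveLimitSketch {
  private final int min;
  private final int max;
  private int limit;

  AdaptiveLimitSketch(int min, int max, int initial) {
    this.min = min;
    this.max = max;
    this.limit = Math.max(min, Math.min(max, initial));
  }

  synchronized int limit() {
    return limit;
  }

  /** Additive increase: allow one more concurrent request, up to max. */
  synchronized void onSuccess() {
    limit = Math.min(max, limit + 1);
  }

  /** Multiplicative decrease: halve the limit, but never go below min. */
  synchronized void onRejected() {
    limit = Math.max(min, limit / 2);
  }

  /** Wraps a storage call so that rejections feed back into the limit. */
  <V> V call(Supplier<V> storageCall) {
    try {
      V result = storageCall.get();
      onSuccess();
      return result;
    } catch (RejectedExecutionException e) {
      onRejected();
      throw e; // still propagate so the caller knows the message was dropped
    }
  }
}
```

The rejection is rethrown rather than swallowed, so callers can still count dropped messages while the limit backs off.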

Fixes #2481

@codefromthecrypt
Member

Thanks for the hard work here!

@codefromthecrypt
Member

NOTE: in-memory storage needs to be changed to not perform work at assembly of the call.


// Make sure we throttle
Future<V> future = executor.submit(() -> {
try (AutoCloseable nameReverter = updateThreadName(call.toString())) {
Contributor Author

I went ahead and added naming to the Threads here (and in enqueue()) based on discussion from the workshop. Let me know if this doesn't work for some reason.
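
For readers unfamiliar with the pattern, here is a minimal sketch of what a helper like `updateThreadName` could look like: it renames the current thread and returns a handle that restores the old name when closed. The body below and the `NameReverter` interface are assumptions for illustration; only the call shape `try (AutoCloseable nameReverter = updateThreadName(...))` comes from the diff above.

```java
/** Hypothetical sketch of the thread-naming helper; the PR's actual body may differ. */
class ThreadNameSketch {
  /** Narrows AutoCloseable so close() declares no checked exception (an assumption). */
  interface NameReverter extends AutoCloseable {
    @Override
    void close();
  }

  /** Renames the current thread and returns a handle that restores the old name. */
  static NameReverter updateThreadName(String name) {
    Thread current = Thread.currentThread();
    String original = current.getName();
    current.setName(name);
    return () -> current.setName(original);
  }

  /** Shows the name a task would see in thread dumps while it runs. */
  static String nameWhileRunning(String taskName) {
    try (NameReverter ignored = updateThreadName(taskName)) {
      return Thread.currentThread().getName();
    } // the original name is restored here, even if the body throws
  }
}
```

Note that when the resource is typed as plain `AutoCloseable`, as in the diff, the enclosing code has to handle the checked `Exception` that `close()` declares.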

@Logic-32
Contributor Author

Rebased off remote/master and updated my licenses. Things should be good now unless I missed something.

Member

@codefromthecrypt codefromthecrypt left a comment

This is the first time I reviewed this top to bottom. Thanks and I learned some things!

There is a fair amount of code with experiential notes in it. There aren't enough tests for some of this, especially for things like pool resizing: we should have tests, or notes in the README on how people are going to resize with Spring Boot, for example. I made some comments about code to remove or downsize a bit.

Functionality-wise, this looks OK. We need to up the coverage a bit and also add a multithreaded case, like a pool of 10 consumers, to prove throttling works under contention. This could help smoke out any thread-safety bugs.

Most importantly, I think the time spent polishing and filling in tests is worthwhile, as the impl looks like something we'd want and can accept! Thanks!
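
One possible shape for the multithreaded case, as a minimal sketch: run a pool of consumers against a concurrency limit and record the peak number of calls in flight. The `Semaphore` here is a stand-in for the throttle, and `ContentionSketch` with its parameters is hypothetical; a real test would exercise the throttled storage component instead.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical contention check; a Semaphore stands in for the real throttle. */
class ContentionSketch {
  /** Runs the consumers against the limit and returns the peak observed concurrency. */
  static int peakConcurrency(int consumers, int limit, int callsPerConsumer) {
    Semaphore permits = new Semaphore(limit);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(consumers);
    for (int i = 0; i < consumers; i++) {
      pool.execute(() -> {
        for (int j = 0; j < callsPerConsumer; j++) {
          permits.acquireUninterruptibly();
          try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.yield(); // give other consumers a chance to pile in
          } finally {
            inFlight.decrementAndGet();
            permits.release();
          }
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("consumers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return peak.get();
  }
}
```

A test built on this would assert that the peak never exceeds the limit, for example that `peakConcurrency(10, 3, 1000)` is at most 3. The exact peak depends on scheduling, so only the upper bound is safe to assert.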

@Logic-32
Contributor Author

I will try to address these items early next week! The testing will be the hardest part because concurrency. But I'll see what I can do :)

@codefromthecrypt
Member

codefromthecrypt commented Apr 26, 2019 via email

@codefromthecrypt
Member

codefromthecrypt commented Apr 27, 2019 via email

@Logic-32
Contributor Author

There is a fair amount of code with experiential notes in it. There aren't enough tests for some of this, especially for things like pool resizing: we should have tests, or notes on how people are going to resize...

Functionality-wise, this looks OK. We need to up the coverage a bit and also add a multithreaded case, like a pool of 10 consumers, to prove throttling works under contention. This could help smoke out any thread-safety bugs.

A majority of the tests currently reside in ThrottledCall. Testing ThrottledStorageComponent is proving to be quite difficult due to how Netflix's concurrency-limits library works. Reviewing some of their tests, it looks like a certain amount of "latency" is required for the limiter to function as expected. So I can't simply submit tasks that either pass or fail and expect the limit to go up or down.

If the additional tests in ThrottledCall are not sufficient then I'd have to request we do some brainstorming of some kind. The only route I see to testing ThrottledStorageComponent more would involve some significant refactoring to inject mocks/etc. which I'm not sure I can dedicate the time to :(

@codefromthecrypt
Member

There is a release that will be cut today. If you are OK with it, I can take the polish stuff after the fact and get this in. Otherwise it will wait until the next release, which could be a few weeks from now.

@codefromthecrypt
Member

I don't think it's a good idea to rush this, or the refactoring, in. Let's proceed with the release @zeagord and we can polish this up for the next one. This will also give folks time to test it manually (with a snapshot or jitpack).

@Logic-32
Contributor Author

Logic-32 commented May 2, 2019

I agree on not rushing. This has drifted from what we're using in production ourselves so some additional test deployments would be appreciated :)

Good news is that our queue size is at 8000 messages and we have seen zero heap issues! Drop rate is still consistently 2% or lower per day (previously was peaking at about 15%).

Member

@codefromthecrypt codefromthecrypt left a comment

thanks, some minor points

} catch (RuntimeException e) {
  limitListener.onIgnore();
  throw e; // E.g. RejectedExecutionException
} catch (Exception e) {
Member

Roger. I guess the fact that this is catching Exception is what threw me. Did you want to catch (RuntimeException|Error) here, or is something actually declared to throw Exception?
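
For reference, a minimal sketch of the multi-catch alternative mentioned here, under the assumption that the wrapped work declares no checked exceptions (modelled below with `Supplier`). The `Listener` interface is a hypothetical stand-in for the limiter's callback, and `callAndReport` is not a method from this PR.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of the multi-catch form; names are not from this PR. */
class ListenerSketch {
  /** Stand-in for the limiter callback used in the diff above. */
  interface Listener {
    void onSuccess();
    void onIgnore();
  }

  /** Simple listener that counts callbacks so the behavior can be checked. */
  static class CountingListener implements Listener {
    int successes;
    int ignored;

    @Override public void onSuccess() { successes++; }
    @Override public void onIgnore() { ignored++; }
  }

  /** Runs the work and tells the listener how it ended. */
  static <V> V callAndReport(Supplier<V> work, Listener listener) {
    V result;
    try {
      result = work.get();
    } catch (RuntimeException | Error e) {
      // Supplier.get() declares no checked exceptions, so this covers every failure.
      listener.onIgnore();
      throw e; // precise rethrow: still unchecked, no "throws Exception" needed
    }
    listener.onSuccess();
    return result;
  }
}
```

If the `try` block also closes a resource typed as `AutoCloseable`, the compiler requires handling the checked `Exception` from `close()`, which is one reason a broader `catch (Exception e)` can be needed.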

@codefromthecrypt
Member

PS sorry about the CI meltdown. As of now, if you rebase, it should pass tests and not die on timeouts anymore.

@codefromthecrypt
Member

rebased and force pushed to get a green build so that folks can test this (especially with elasticsearch 7). Will post instructions shortly

@codefromthecrypt
Member

updated the description with how to test this.

@Logic-32
Contributor Author

Logic-32 commented May 9, 2019

Thanks! If I can't get to the additional feedback tomorrow I'll get to it early next week. Sorry for being sluggish but at least it gives people time to test :)

@codefromthecrypt
Member

codefromthecrypt commented May 9, 2019 via email

Logic-32 and others added 2 commits May 10, 2019 15:19
Adding ThrottledStorageComponent/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time.
Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled.
Inspired by work done on openzipkin#2169.
@codefromthecrypt codefromthecrypt merged commit b3eefbe into openzipkin:master May 10, 2019
codefromthecrypt pushed a commit that referenced this pull request May 10, 2019
Before this, there was some extra code in the throttle package handling
a bug in our in memory storage. This fixes that and removes the extra
code.

See #2502
ZipkinElasticsearchStorageProperties(
    @Value("${zipkin.storage.throttle.enabled:false}") boolean throttleEnabled,
    @Value("${zipkin.storage.throttle.maxConcurrency:200}") int throttleMaxConcurrency) {
Contributor Author

Does maxConcurrency need to be updated to max-concurrency here?

Member

shouldn't be required, but it is nice to do this. good catch

(ps I'm knackered so will deal with the other PR and any comments in the morning)

Contributor Author

Kk. Thank you for finishing this up and sorry again for the lag on my end! Work got in the way :(

@codefromthecrypt
Member

codefromthecrypt commented May 10, 2019 via email

codefromthecrypt pushed a commit that referenced this pull request May 11, 2019
Before this, there was some extra code in the throttle package handling
a bug in our in memory storage. This fixes that and removes the extra
code.

See #2502
abesto pushed a commit to abesto/zipkin that referenced this pull request Sep 10, 2019
…nzipkin#2502)

Adding ThrottledStorageComponent/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time.

Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled.

Inspired by work done on openzipkin#2169.
abesto pushed a commit to abesto/zipkin that referenced this pull request Sep 10, 2019
Before this, there was some extra code in the throttle package handling
a bug in our in memory storage. This fixes that and removes the extra
code.

See openzipkin#2502
@Logic-32 Logic-32 deleted the issue-2481 branch November 29, 2019 23:24
Development

Successfully merging this pull request may close these issues.

Buffer some requests in order to reduce "over capacity" errors without also killing Elasticsearch
4 participants