Buffer some requests in order to reduce "over capacity" errors without also killing Elasticsearch #2481
I think you mentioned earlier a separate issue, which was a flaw in the semaphore as it was implemented. I still think that should be fixed independently and at the call abstraction. One interesting piece of food for thought: if this includes some concerns about the OkHttp dispatcher queue, it is similar to ones we had in Brave (where traces get lost on backlog). To solve this, we intercept and use Call.Factory as opposed to OkHttpClient. This technique might be irrelevant to the issue. openzipkin/brave#292
On the direct issue you are discussing here, it sounds like you want a buffer because there is some absolute number of in-flight requests, a surge will go over that, but you think this surge is short-term and ES will quiesce. Put another way, you think we can do a better job backlogging than Elasticsearch can (because the queue that is overloaded is a configurable length in Elasticsearch). Is that correct? In your experience, how big does this queue need to be to be useful (to outlast the capacity deficit), in terms of spans, messages, and bytes? You can look at collector metrics for an idea of how much you are dropping, as that would likely tell you.
Follow-up from the conversation in #2023.
After further comments from you, I realize you're talking about AsyncReporter, which is in Zipkin, yes, but used in client-side instrumentations for actually reporting to Zipkin. I just want to confirm that we are using this on the client side, but I am not sure how we'd actually make use of it "in Zipkin." Though I do see how it could resolve my issues if we were able to.
Yes. But rather than fixing it independently, it is my intended focus for this case. And from what I've been experimenting with, a Call abstraction is definitely helping :)
Yes, we do think the surge is short-term. But even if it is an excessive surge, or longer than anticipated, we will still drop messages as needed to keep Zipkin from going OOM. We're not sure how big the queue will need to be at the moment; unfortunately, Prometheus doesn't seem to track that.

From Collector surge and error handling:
I should further elaborate on my 200-server example: in our case, each of those 200 servers will actually be starting/stopping various, independent JVM processes. Meaning that, unlike a single, long-lived web server process, there is no central control point to rate-limit requests to storage. So, while we are using AsyncReporter, there are actually something more like 600 reporters doing the reporting. Lots of chaos, not a lot of control :\ But zipkin-server, at least for us, does have some heap to spare. Since it is our central point, it seems like we can do flow control better there instead of having to stand up/support Kafka.
Note that ByteBoundedQueue (the backend) is fully intended to stop OOM, as it is bounded. We chose to drop instead of block, so the problem is less OOM and more wanting more than you allocated, but I'm not sure that doesn't defeat the point of a bounded queue :P
@Logic-32 One thing discussed offline: basically, I think you want the advantage of a queue, but you don't want to run Kafka, RabbitMQ, or another queue? The problem is that this implies exactly the same complexity here, and we are a smaller community than Kafka, Elasticsearch, etc. It isn't necessarily fair to try to make Zipkin also a bounded-queue implementor. One alternative you could consider: we have demand for ActiveMQ (#2466), which is easy to run and embeddable. It is possible that we could invert the flow (push to poll) without you needing a custom queue implementation. Can you think about this option?
For a custom implementation here, I think the closest match to what we can support is re-using what we do client-side somehow. The problem space is exactly the same as the client side, and it took years to get the async reporter correct (to the degree it is). However, the client reporter also drops spans on problems, so any other handling would need to be considered independently, as already mentioned (evaluating under which conditions one should push a message back, to avoid putting a poison one back on the queue).
I feel as though we're on slightly different pages in some respects. To hopefully help clarify: yes, I think using Kafka (or RMQ, which our company does use already) would be a better solution overall. However, as #2023 showed, even pushing with Kafka right now can cause the "over capacity" message, as a result of not being able to push back on the MQ and tell it to slow down. Even ignoring that case, there are probably others using RPC that have this issue as well. My goal here is not to reinvent the MQ wheel but simply to allow for a little more tolerance in how quickly the "over capacity" error is thrown. How familiar with ExecutorServices are you? They actually have a very convenient feature for this behavior. Take ThreadPoolExecutor, for instance; looking at the constructor arguments:
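The argument list was elided from the comment above, so here is a minimal sketch of the idea, with illustrative sizes rather than Zipkin's actual settings: a bounded ThreadPoolExecutor caps concurrency the way the current semaphore does, its bounded work queue absorbs short spikes, and the AbortPolicy turns saturation into a RejectedExecutionException (the "over capacity" signal).

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThrottleSketch {
  public static void main(String[] args) {
    // corePoolSize/maximumPoolSize bound how many storage calls run at once
    // (the role ES_MAX_REQUESTS / the semaphore plays today), while the bounded
    // workQueue absorbs short spikes. Once both are full, AbortPolicy throws
    // RejectedExecutionException -- the "over capacity" signal.
    ThreadPoolExecutor throttle = new ThreadPoolExecutor(
        64,                                  // corePoolSize: steady-state concurrency
        64,                                  // maximumPoolSize: hard concurrency cap
        0L, TimeUnit.MILLISECONDS,           // keepAliveTime for threads above core (unused here)
        new LinkedBlockingQueue<>(200),      // bounded buffer for spike load
        new ThreadPoolExecutor.AbortPolicy() // reject (don't block) when saturated
    );

    throttle.execute(() -> {
      // e.g. issue the bulk-index call to storage here
    });
    throttle.shutdown();
  }
}
```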
Given that review, I want to restate that my goal is not necessarily to have any kind of excessively reliable queue, but merely a moderate buffer to reduce issues from spike load. "Over capacity" errors (or RejectedExecutionExceptions, in this case) will still get thrown, but hopefully less frequently. No retries will be attempted (though I believe this would give us a hook for retries if that is something we wanted to pursue later). I also understand this may not be a feature the Zipkin team wants to maintain internally, and I am fully comfortable having a pull request rejected.
You mean have Zipkin poll for messages from something instead of having RPC requests issued directly to it? Zipkin already has support for various MQs, so I would suggest simply switching from RPC to an AMQ protocol before going down that road. Or did I misunderstand the question? Lastly, I'm definitely more than happy to discuss design options here, as this discussion has already yielded a solution I'm personally happier with. However, since I think communication sometimes works better in code, I would at least like to send you a patch/pull request for review before the idea is wholesale rejected.
Based on previous discussion here and the issue you linked above (#2023), my opinion (and I think partially Adrian's) is that we should solve this at the library and configuration level, where it applies to all collectors that are pull-based and all storage options. The HTTP collector is mostly an outlier here; most collectors are pull-based. The intent is that all storage engines can produce a well-known error that can then be taken as a signal at the collector level to both retry the request that was dropped (i.e., don't advance our cursor) and slow down.
So focus my changes on ZipkinHttpCollector instead of ElasticsearchSpanConsumer? That definitely seems like a plausible thing to attempt. I'm thinking of just having it wrap whatever Collector it gets in one which still uses the ExecutorService to throttle things. Then put the AutoConfig properties for queue size/etc. under zipkin.collector.http? The only catch with doing that is, I'm not sure how I'd make sure the number of requests the Collector can attempt doesn't get out of sync with zipkin.elasticsearch.max-requests?
I take that as: if we are "over capacity", make sure Callback.onError() sets a status of 429 (Too Many Requests) on the HttpServerExchange (instead of 500) or something similar. Does that sound reasonable?
If we had a reliable signal, and we were acting on it in the server, and there was a smart way to flip the state back to green, we could push it back, yes. Keep in mind, the POST endpoint is async: the response is sent prior to the storage consumer. So there would be some non-synchronous delay sending that back (e.g. an atomic state variable checked before consuming the request).
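As a rough sketch of that "atomic state variable" idea (hypothetical names and shape, not the actual collector code): a gate checked synchronously, before the request is handed to the async storage consumer, so the response can still carry a back-off status such as 429.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical capacity gate; names and wiring are illustrative only. */
final class CapacityGate {
  private final AtomicInteger inFlight = new AtomicInteger();
  private final int limit;

  CapacityGate(int limit) {
    this.limit = limit;
  }

  /**
   * Checked synchronously, before the request is handed to the async storage
   * consumer, so a 429 can still be written to this request's response.
   * Returns true if the caller may proceed with storage.
   */
  boolean tryAcquire() {
    if (inFlight.incrementAndGet() > limit) {
      inFlight.decrementAndGet();
      return false; // caller responds 429 Too Many Requests
    }
    return true;
  }

  /** Called from the storage callback's onSuccess/onError once work finishes. */
  void release() {
    inFlight.decrementAndGet();
  }
}
```

Usage would be: if `tryAcquire()` returns false, write the 429 and skip storage; otherwise dispatch the async store and call `release()` from its callback.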
@Logic-32 Any chance you can help us spend some synchronous time (pun intended) at our upcoming workshop to review any work you've done here? Even though the focus is UI, we can carve out time (just grab a slot that works for you). https://cwiki.apache.org/confluence/display/ZIPKIN/2019-04-17+UX+workshop+and+Lens+GA+planning+at+LINE+Fukuoka
For you specifically, I know you are only interested in the HTTP collector. As a project, we have to maintain code for all collectors, so the collector library is literally the best place to put relevant code, IMHO.

The main concern is again exactly the same as the asynchronous reporter: you have a bundling concern (which translates to how many spans per bulk request), and above that a concern of how many simultaneous bulk requests (which the semaphore should control). You also want some other things, like error handling that doesn't drop spans, etc.

I would recommend splitting the problem into parts. This avoids a lot of rehashing; as you notice, the issue is just like many others, and some of this takes longer to rehash than to code.
PS: if it helps you unlock — in case you don't want to solve this at the abstraction that is important for the project, you can disable our HTTP collector and write your own component. You can see how the autoconfig works for this by looking at Scribe here, or at zipkin-aws for the Kinesis or SQS collectors.

For here, we have a lot of unfinished work to complete, and you can see changes related to what you discuss in our PR queue, for example the Netflix rate limiter PR. Many changes like this become abandoned, which is why I am trying to make a wiki this time to properly inventory things, because sometimes people forget common problems and explaining across N issues takes our time away.
True, it is async. But the determination of whether we're over capacity (at least according to the current Semaphore implementation) is made before things actually go async. So sending a 429 is still possible.
I'm UTC-6 and I see you have 10:30 picked out for this? 15:30 would be the earliest I could do, and I'd have to hope that work doesn't get in my way. So basically, I'd like to help but would need to be awake to do so ;)
Thank you for reminding me to go have a look at those! I didn't realize work was already being done on this, since nothing was called out in the other issue. Both #2169 and #2166 are definitely "stepping on my toes" (or me theirs, but the point remains). I left a comment on 3ba76c2#r272783016; hopefully we can work to come up with a mutual solution.
If it helps ease your concerns, I could always have the queue default to a size of 0 (sketched below). Then the ExecutorService would act exactly like the Semaphore, with the exception of it not living in HttpCall but at a higher layer that is probably more appropriate.

Outside of that, I definitely agree there are potentially many issues here. Hopefully collaborating on one of the above-mentioned pull requests will result in something more palatable. Either way, I will try to help with the wiki page you put together for this and with the workshop as best I can. But I'm afraid I don't have the capacity to assist with much outside of the spike-load issue (meaning no push-back to try and throttle things on the Reporter side), beyond assisting in integration with the other existing pull requests.
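For concreteness, a sketch of what a queue size of 0 means in ThreadPoolExecutor terms (the pool size of 64 is illustrative): with a zero-capacity queue, submissions are rejected exactly when all workers are busy, which is the same behavior as a tryAcquire-style semaphore, just surfaced one layer above HttpCall.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ZeroQueueThrottle {
  public static void main(String[] args) {
    // With a zero-capacity (synchronous) queue, execute() is rejected as soon
    // as all 64 workers are busy -- equivalent to Semaphore.tryAcquire() with
    // 64 permits, without any buffering of spike load.
    ThreadPoolExecutor semaphoreLike = new ThreadPoolExecutor(
        64, 64,
        0L, TimeUnit.MILLISECONDS,
        new SynchronousQueue<>(),            // "queue size 0"
        new ThreadPoolExecutor.AbortPolicy() // reject instead of blocking
    );
    semaphoreLike.execute(() -> { /* storage call */ });
    semaphoreLike.shutdown();
  }
}
```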
Thanks for all the notes. I wasn't sure about the timezone, so I just chose arbitrarily; we can move it to later. When you get a wiki ID, reply back and I'll give you edit access. Thanks for taking the time to survey things and respond back so thoroughly.
@adriancole, I just signed up as "logic32" :)
Thanks, you now have access!
@Logic-32 PS: your account is somewhat... anonymous :) As this repository will shortly move to the Apache Software Foundation (ASF), and I suspect you will end up with non-trivial code to merge, you should probably prepare by sending a contributor agreement to the ASF. If you have already done this in the past, there is no work to do; it is a once-in-a-lifetime thing. If you haven't already, the easiest way is to download the ICLA template. After filling in icla.pdf with your personal information, print, sign, scan, and send it by mail as an attachment to the secretary ([email protected]).
This is the internet, not a job interview :P No offense of course; if there is strong motivation to change, I can :) I've seen the plans for moving to ASF but haven't been following them. Would it be acceptable to open/review a pull request before filling that out? Hate to go through the effort of signing up for that if my work doesn't make the cut ;)

Also, while I'm here, one issue that will come up in the review is that of overriding other settings. Is there any precedent for, say, having a "global" max-concurrency setting which would override zipkin.storage.elasticsearch.max-requests, for instance? I see there are other "max-active" and similar settings. If we're going to have something that controls how many items get access to storage at a given instant, then it makes sense that the settings for them remain in sync. Though there is certainly no obligation to do so. Thoughts/opinions?
> This is the internet, not a job interview :P No offense of course; if there is strong motivation to change, I can :)

Good point... naming yourself on GitHub will unleash the recruiting bots :)

> I've seen the plans for moving to ASF but haven't been following them. Would it be acceptable to open/review a pull request before filling that out? Hate to go through the effort of signing up for that if my work doesn't make the cut ;)

You can decide whether the scan-and-sign is too much effort regardless of this. It is once in a lifetime for the whole foundation, so it might come in handy regardless. The way you are working so far, I don't anticipate anything will change in the code here.

> Is there any precedent for, say, having a "global" max-concurrency setting which would override zipkin.storage.elasticsearch.max-requests (https://github.com/openzipkin/zipkin/blob/38d2148b914329c7ad0bbe45457e1969c6383797/zipkin-server/src/main/resources/zipkin-server-shared.yml#L105) for instance? ... Thoughts/opinions?

Some people use Spring Cloud Config or other tools for global configuration. Once (you can look in openzipkin-attic) we had a ZooKeeper-based sampler which provided a coordinated rate setting based on group membership. Some things are not great to stick in the box but could be supplied externally. For example, one project is downsampling to match storage by using a buffer like VoltDB, which can easily absorb all the traffic it can fit in memory, allowing slower export to more rigid storage. There will be some settings there that need coordination, and we might utilize the internal ZK for that: https://github.com/adriancole/zipkin-voltdb
Attaching some very early results of running my changes in production (wanted to make sure these were up in time for the LINE workshop).

Before:
![before](https://user-images.githubusercontent.com/25107222/56247242-625f0780-6061-11e9-82ed-eff6afe1b8cb.png)

After:
![after](https://user-images.githubusercontent.com/25107222/56247247-64c16180-6061-11e9-9892-0d67109fe7ba.png)

Results are 24 hours apart. New metrics:

- Throttle concurrency/limit = how many threads Netflix thinks we can run.
- Throttle in-flight requests = concurrency + queue size; should be the same most of the time, but is mostly just a check to make sure we're resolving all our LimitListeners and not leaking anything.

You can see the Drop Rate spikes are not only fewer and farther apart but also thinner, with no appreciable impact on Response Times or Heap.
awesome job
Adding a storage-throttle module/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time. Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled. Making sure RejectedExecutionExceptions are "first class" citizens, since they are used to reduce the throttle. Removing HttpCall's Semaphore in favor of the throttle (same purpose, different implementations). Inspired by work done on openzipkin#2169.
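As a rough illustration of the wrapping idea (the interfaces below are simplified stand-ins, not Zipkin's actual StorageComponent/SpanConsumer API): span consumption for any storage backend is funneled through one bounded executor, so the concurrency cap and the buffer live in a single place.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

/** Simplified stand-in for a storage span consumer; not Zipkin's real API. */
interface SpanConsumer {
  void accept(List<String> encodedSpans);
}

/** Wraps any SpanConsumer so its work flows through one bounded executor. */
final class ThrottledSpanConsumer implements SpanConsumer {
  private final SpanConsumer delegate;     // e.g. the Elasticsearch consumer
  private final ExecutorService throttle;  // bounded pool + bounded queue

  ThrottledSpanConsumer(SpanConsumer delegate, ExecutorService throttle) {
    this.delegate = delegate;
    this.throttle = throttle;
  }

  @Override public void accept(List<String> encodedSpans) {
    // execute() throws RejectedExecutionException when both the pool and its
    // queue are full; the collector can translate that into "over capacity"
    // (e.g. HTTP 429) instead of the storage layer silently dropping spans.
    throttle.execute(() -> delegate.accept(encodedSpans));
  }
}
```

The executor here would be the same kind of bounded ThreadPoolExecutor sketched earlier, so a full pool plus a full queue surfaces as a RejectedExecutionException.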
Just a heads-up: I've got the ICLA signed but am waiting on feedback from my company's legal department to make sure I don't violate condition 4.
thanks for the update and wish us luck they agree ;)
ICLA signed and emailed :)
Adding a storage-throttle module/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time. Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled. Inspired by work done on openzipkin#2169.
Adding ThrottledStorageComponent/etc. to contain logic for wrapping other storage implementations and limiting the number of requests that can go through to them at a given time. Elasticsearch storage's maxRequests can be overridden by throttle properties if the throttle is enabled. Inspired by work done on openzipkin#2169.
#2502 now includes test instructions; please give it a try.
Feature:
Update the feature from #1760 to allow for a bounded-queue implementation that can buffer some requests, so that a spike in requests doesn't cause the majority of them to be dropped but also still doesn't cause OutOfMemory errors.
Rationale
From #2023:
Example Scenario
200 different servers all decide to report a reasonable number of spans to Zipkin in the same instant. If ES_MAX_REQUESTS is left at its default value, then only 64 of the requests will succeed. If ES_MAX_REQUESTS is configured higher, then Elasticsearch could reject the bulk indexing request as a result of its own queue filling up (especially if multiple zipkin-server instances are handling the load). This creates an awkward balancing act that can be solved by having a moderately sized, bounded queue in Zipkin, so that ES_MAX_REQUESTS can stay at a value that won't kill ES while the queue keeps us from dropping ~70% of the requests or causing OOM errors.
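To make that ~70% figure concrete, a back-of-the-envelope calculation (the queue capacity is purely illustrative, not a proposed default):

```java
public class DropMath {
  public static void main(String[] args) {
    int requests = 200;      // simultaneous reporting servers in the scenario
    int maxRequests = 64;    // default ES_MAX_REQUESTS
    int queueCapacity = 100; // hypothetical bounded buffer in zipkin-server

    int droppedToday = requests - maxRequests;                                   // 136
    int droppedWithQueue = Math.max(0, requests - maxRequests - queueCapacity);  // 36

    // without a queue: 136 of 200 dropped (~68%, the "70%" above)
    // with a queue:     36 of 200 dropped (~18%), and the bound prevents OOM
    System.out.printf("without queue: %d dropped (~%d%%)%n",
        droppedToday, 100 * droppedToday / requests);
    System.out.printf("with queue:    %d dropped (~%d%%)%n",
        droppedWithQueue, 100 * droppedWithQueue / requests);
  }
}
```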