fix: AsyncAbstractResponse might lose part of send buffer #316
base: main
Conversation
BTW, isn't that the same problem as #242?
It looks like the same indeed! We can ask the user to test with this fix...
Force-pushed 2387abc to 226c4a5
src/WebResponses.cpp
Outdated
--_in_flight_credit; // take a credit
#endif
request->client()->send();
_send_buffer.erase(_send_buffer.begin(), _send_buffer.begin() + written);
@vortigont : could this call be expensive ?
I'm not sure; depending on the implementation it could be compiler-optimized to something like memmove, but I'm not sure how this is done in Espressif's toolchain. Actually I do not expect this part to run frequently under normal conditions, since the buffer should be aligned with the available space.
Another option could be to add a member variable and do index-offset calculations instead (see the sketch below).
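Something along those lines, perhaps (a minimal sketch; SendWindow, buf and head are illustrative names, not existing members of AsyncAbstractResponse):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: instead of erasing the sent prefix (which shifts the
// remaining bytes), keep a head index into the buffer and reset it only once
// everything has been consumed.
struct SendWindow {
  std::vector<uint8_t> buf;   // stands in for _send_buffer
  std::size_t head = 0;       // hypothetical new member: index of first unsent byte

  std::size_t pending() const { return buf.size() - head; }   // bytes still to send
  const uint8_t *data() const { return buf.data() + head; }   // start of unsent data

  // Record that `written` bytes were accepted by the socket.
  void consume(std::size_t written) {
    head += written;
    if (head == buf.size()) {  // fully drained: cheap reset, no memmove
      buf.clear();
      head = 0;
    }
  }
};
```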
Actually I do not like this buffer approach at all - it's too heavy in general to create a buffer matching the window size, then copy data there, then copy from the buffer to TCP's pcbs. The default Arduino window is 5.7k, but with a custom-built lwIP it could become a memory hog. We should think of something else - maybe a small fixed-size circular buffer, or another type of handler for objects that could avoid copying.
I agree. That's what I saw also - adding a vector field to handle this situation. I was wondering if the same thing could be done without it. I was also going to propose a circular buffer, because it cannot be larger than the pcb space anyway, right?
I don't think there's an alternative solution at the architectural level -- the interface of AsyncAbstractResponse requires that it consume bytes from the implementation only once, and we can't know for sure how many bytes the socket will accept until we send it some; so to be correct, AsyncAbstractResponse is going to have to cache the bytes it couldn't send. Since the API requires it to have a temporary buffer anyway, "just keep the buffer until we've sent it all" is the least bad solution.
Performance-wise, std::vector<> does hurt a bit though - it both (a) insists on zeroing the memory, and (b) doesn't have easy-to-use release/reallocate semantics. I tried using a default_init_allocator<> to speed it up (a generic sketch of that adaptor follows below), but it didn't help much. Ultimately, in the solution I put together for the fork I've been maintaining for WLED, I ended up making a more explicit buffer data structure. I also wound up doing some gymnastics to avoid allocating so much memory that LwIP couldn't allocate a packet buffer.
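For context, a default_init_allocator is roughly the following generic adaptor (a textbook sketch, not the exact code used in the WLED fork): it makes vector::resize() default-initialize new elements instead of zero-filling them.

```cpp
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor that turns value-initialization into default-initialization,
// so resize() on a std::vector<uint8_t, ...> no longer zero-fills new bytes.
template <class T, class A = std::allocator<T>>
struct default_init_allocator : A {
  using A::A;
  template <class U>
  struct rebind {
    using other = default_init_allocator<U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
  };
  template <class U>
  void construct(U *p) noexcept(std::is_nothrow_default_constructible<U>::value) {
    ::new (static_cast<void *>(p)) U;  // default-init: no zeroing for trivial types
  }
  template <class U, class... Args>
  void construct(U *p, Args &&...args) {
    std::allocator_traits<A>::construct(static_cast<A &>(*this), p, std::forward<Args>(args)...);
  }
};

using RawBuffer = std::vector<uint8_t, default_init_allocator<uint8_t>>;
```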
See my attempt at a whole solution here: https://github.com/Aircoookie/ESPAsyncWebServer/blob/39b830c852054444ea12a8b5d7fcb4fa004a89d7/src/WebResponses.cpp#L317
Sorry I'm a bit backlogged on pulling this out and pushing it forward...
Some design notes:
- I opted to release the assembly buffer as soon as the data was sent, and reallocate on every _ack; this keeps the "static" memory usage down and lets it better multiplex between many connections when under memory pressure.
- If I was doing it again, I'd give serious thought to capping the buffer at TCP_MSS and looping over _fillBufferAndProcessTemplates -> client()->write() (a rough sketch of that loop follows below). The upside is a guarantee that it'd never be buffering more than one packet; the downside is that it would make ArduinoJSON very sad...
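The capped loop could look roughly like this (an illustrative sketch only: kTcpMss, fill and write are stand-ins for TCP_MSS, _fillBufferAndProcessTemplates and the client write call, and leftover handling is glossed over):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::size_t kTcpMss = 1460;  // typical TCP_MSS; the real value comes from lwIP

// Fill at most one MSS-sized chunk at a time and hand it straight to the
// socket, so the response never buffers more than a single packet's worth.
std::size_t sendCapped(const std::function<std::size_t(uint8_t *, std::size_t)> &fill,
                       const std::function<std::size_t(const uint8_t *, std::size_t)> &write,
                       std::size_t sockSpace) {
  uint8_t chunk[kTcpMss];
  std::size_t total = 0;
  while (sockSpace > 0) {
    std::size_t want = std::min(sockSpace, sizeof(chunk));
    std::size_t got = fill(chunk, want);       // content source may return less than requested
    if (got == 0) break;                       // nothing more to send right now
    std::size_t accepted = write(chunk, got);  // socket may accept less than offered
    total += accepted;
    sockSpace -= accepted;
    if (accepted < got) break;                 // the un-accepted tail would still need caching
  }
  return total;
}
```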
@willmmiles : I understand a bit more. I was wondering why we needed to add a buffer instead of just using indexes, but the fact is that this implementation, being in the abstract class like you say, has to work with and without a content buffer. Thanks!
@vortigont : FYI, I opened PR #317 to add an example in the project that we did not have about large responses. I was hoping to get the opportunity to reproduce these 2 issues, but no, everything goes fine. Questions:
Thanks!
I have verified that the bug has been fixed.
I do not understand the code, but I pointed out some typos in the changes.
thanks! I'll fix typos :)
Force-pushed 92da8c7 to 0f6f725
@vortigont @willmmiles : I did some testing of this PR compared to main. I am using the new example merged yesterday:
for i in {1..20}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;
main: => OK: everything works fine and I receive all 20x 16000 characters.
this PR: => CRASH at _send_buffer.resize(std::min(space, _contentLength - _sentLength));
So I tried reducing the concurrency to 16 (the lwip connection limit):
for i in {1..16}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;
And I am not able to reproduce anymore, except if I keep going and going. But curl + bash like that do not do as good a job as autocannon... So I am spawning it: 32 requests, 16 threads and 16 concurrent connections (so the lwip limit):
autocannon -w 16 -c 16 -a 32 http://192.168.4.1/2
=> CRASH
Strangely the crash also happens when using a lower number of connections:
autocannon -w 16 -c 5 -a 32 http://192.168.4.1/2
So as long as the threads are correctly aligned and the requests executed pretty much at the same time, the buffer allocations / resizes are then done pretty much at the same time too, I think. That explains why it is easier to reproduce with autocannon than with curl. So that's not good, because it kills the concurrency level of the library. @willmmiles : how did you solve that in your fork? You might have the same issue if you are buffering? Is that what you are solving thanks to your
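For what it's worth, the "don't allocate so much that LwIP can't get a packet buffer" idea mentioned above could look roughly like this; this is a hypothetical guard, not something this PR or the fork necessarily does, and reserveForLwip is an assumed value:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical guard: never grow a per-response send buffer beyond a share of
// the currently free heap, so lwIP keeps headroom for its packet buffers.
// The caller would pass e.g. ESP.getFreeHeap() as freeHeapBytes.
std::size_t cappedBufferSize(std::size_t wanted, std::size_t freeHeapBytes) {
  const std::size_t reserveForLwip = 8 * 1024;      // assumed headroom for pbufs
  if (freeHeapBytes <= reserveForLwip) return 0;    // too tight: skip this round, retry on next _ack
  return std::min(wanted, (freeHeapBytes - reserveForLwip) / 2);
}
```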
@vortigont : follow-up from #317 (comment). I just pushed the MRE in the PR and tested it: 51f4472 => 16000 OK. Console:
In the main branch, I receive only 15572 bytes indeed:
❯ curl http://192.168.4.1/3 | grep -o '.' | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15572    0 15572    0     0  18329      0 --:--:-- --:--:-- --:--:-- 18320
   5332 A
   4308 B
   5760 C
    172 D
=> 15572. Console:
So we have a MRE in the project showing that this is fixed 👍 The only issue now remaining is to fix the crash with concurrent requests...
that is... unexpected 8-0, not that it crashes on alloc but that it does not drop reqs on the main branch. Here are my results and those are interesting. Yes, this PR version crashes at high concurrency levels, but somehow it is much faster when not crashing.
== this PR ==
== main ==
So it is more than 2.5 times faster, do not ask me how :) That crappy thing... I'll think about replacing it.
yeah, I'm using Autocannon, but it does not give me any readable stats at all; it just runs requests then quits with zeroes in the output. Works, but quite useless for any analysis.
@vortigont just to clarify:
examples/LargeResponse
I did my testing:
- In main: the /2 handler works (16000 bytes).
- I did not use the /3 handler for my testing: it just acts as a MRE to reproduce the issue.
But in both cases, it is random. I have to restart ab several times to trigger it, while I get to reproduce it more easily with autocannon.
examples/PerfTests
- For request serving: this is insanely fast! We were barely reaching 13 req/s before on average! This is more like 5-6 times faster!
❯ autocannon -c 16 -w 16 -d 20 --renderStatusCodes http://192.168.4.1
Running 20s test @ http://192.168.4.1
16 connections
16 workers
┌─────────┬────────┬─────────┬──────────┬──────────┬────────────┬────────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼──────────┼──────────┼────────────┼────────────┼──────────┤
│ Latency │ 206 ms │ 4246 ms │ 11578 ms │ 12129 ms │ 4749.38 ms │ 3246.55 ms │ 14444 ms │
└─────────┴────────┴─────────┴──────────┴──────────┴────────────┴────────────┴──────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 44 │ 44 │ 69 │ 80 │ 68.16 │ 7.75 │ 44 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 193 kB │ 193 kB │ 302 kB │ 350 kB │ 298 kB │ 33.9 kB │ 193 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200 │ 1363 │
└──────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 320
3k requests in 20.03s, 5.97 MB read
200 errors (0 timeouts)
Yes, it is bad at interpreting the response, I guess because of the way the response is crafted with this subclass. But anyway, this is not important. What's important is a tool that triggers concurrent requests with workers/threads.
it is important to understand how fast it works, otherwise I would not have noticed :)
If you run the PerfTest example and trigger 16 SSE clients:
@vortigont : serving a slow chunk (what was fixed) is now crashing. The PerfTest example also crashes with:
I suspect this is because of the while loop added in the new _ack implementation?
Force-pushed 02db8d8 to b70c8d5
@vortigont : I finally had to rebase and force-push this branch to fix a conflict with main: please do a
nice catch. Those benchmark tools seem like they do not respect
ugh... sorry, I might have messed it up :( I was wondering why it rejects my newer commits.
yeah... with iterative refill we can't accept such delays each cycle. Well... what can I say - it's an async lib, do not do this or the watchdog will bite :))
I agree with you, but releasing that could break users doing, for example, slow SD card serving or listing like we saw in the past... The only way would be for them to switch to SSE or websocket, or to set
I am personally fine with that. @me-no-dev @willmmiles : what are your thoughts? OK with that too?
@vortigont : and were you able to have a look at the SSE failure? I suspect websocket might have the same. I wonder if it is caused by the change around the _ack method in the super class, but I find it weird... I do not see why the code would be wrong.
AsyncAbstractResponse::_ack could allocate a temp buffer with a size larger than the available sock buffer (i.e. to fit headers) and eventually lose the remainder on transfer due to not checking whether the complete data was added to the sock buff. Refactor the code in favor of a dedicated std::vector object acting as an accumulating buffer, with more careful control of the amount of data actually copied to the sockbuff. Closes #315
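In outline, the approach that commit message describes looks roughly like this (a simplified sketch, not the actual WebResponses.cpp code; fill and write stand in for _fillBufferAndProcessTemplates and the AsyncClient add/send calls):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Keep unsent bytes in an accumulating vector, top it up from the content
// source, copy only as much as the socket actually has room for, and keep the
// remainder for the next _ack instead of dropping it.
struct ResponseState {
  std::vector<uint8_t> send_buffer;  // bytes produced but not yet accepted by the socket
  std::size_t sent = 0;

  std::size_t onAck(std::size_t sockSpace,
                    const std::function<std::size_t(uint8_t *, std::size_t)> &fill,
                    const std::function<std::size_t(const uint8_t *, std::size_t)> &write) {
    // Top up the buffer so there is something to offer the socket.
    if (send_buffer.size() < sockSpace) {
      std::size_t old = send_buffer.size();
      send_buffer.resize(sockSpace);
      std::size_t got = fill(send_buffer.data() + old, sockSpace - old);
      send_buffer.resize(old + got);
    }
    // Copy only what the socket accepts; whatever is left stays buffered.
    std::size_t written = write(send_buffer.data(), std::min(send_buffer.size(), sockSpace));
    send_buffer.erase(send_buffer.begin(), send_buffer.begin() + written);
    sent += written;
    return written;
  }
};
```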
Force-pushed b70c8d5 to f7ec43d
@vortigont :
no, not yet, allow me some time to check on this.
@mathieucarbou |
src/ESPAsyncWebServer.h
Outdated
 *
 * @return AsyncClient* pointer to released connection object
 */
AsyncClient* clientRelease();
should this be made private or protected and eventually use friend if needed ?
I was thinking the same, but then I thought that if any other implementation would need it (like my new websocket thing) it would be difficult to introduce friends each time. Added a note to explain what it does.
this will explicitly release ownership of the AsyncClient* object. Make it more clear on ownership change for SSE/WebSocket
Just noticed this place 8-0 in AsyncWebSocket::_newClient:
AsyncWebSocketClient *AsyncWebSocket::_newClient(AsyncWebServerRequest *request) {
  _clients.emplace_back(request, this);
  // we've just detached AsyncTCP client from AsyncWebServerRequest,
  // *request was destructed along with response object in AsyncWebSocketClient's c-tor
  _handleEvent(&_clients.back(), WS_EVT_CONNECT, nullptr /* request */, NULL, 0);
  return &_clients.back();
}
why is that request pointer passed to the user's callback, any idea? I mean, is there any use case for it? Otherwise I would have to keep
The request ptr is not passed to the public api. One of the useful things is:
I might not be understanding your question?
yeah, sorry, I mean here:
the 3rd arg was
I see! Use cases:
This makes me think that I forgot to add it in AsyncWebSocketMessageHandler. The onConnect callback should be:
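Presumably something along these lines (a guess at the intended signature; the existing parameter list is assumed rather than taken from the header):

```cpp
#include <functional>

// Forward declarations of the library types involved (for illustration only).
class AsyncWebSocket;
class AsyncWebSocketClient;
class AsyncWebServerRequest;

// Guess at the intended change: also pass the originating request to the
// connect callback, so user code can read HTTP headers (auth, negotiated
// extensions, ...) before the request object goes away.
using WsConnectHandler = std::function<void(AsyncWebSocket *server,
                                            AsyncWebSocketClient *client,
                                            AsyncWebServerRequest *request)>;
```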
yeah, you're right, that makes sense to get extensions from headers.
This is only for connect, yes... After that, the arg is cast to an AwsFrameInfo for data events.
…t is executed user code might use HTTP headers information from the request
should be good this way :)
I will have a look, test, and also update the callback to add the request: this class was added recently by me as a way to simplify WS usage, so it is not used a lot and this is an acceptable API break I think, provided we can bump the next version to 3.9.0 considering all the things that will be released.