fix: AsyncAbstractResponse might lose part of send buffer #316
base: main
Conversation
BTW, isn't that the same problem as #242?
It looks like the same indeed! We can ask the user to test with this fix...
Force-pushed 2387abc to 226c4a5
src/WebResponses.cpp
Outdated
--_in_flight_credit; // take a credit
#endif
request->client()->send();
_send_buffer.erase(_send_buffer.begin(), _send_buffer.begin() + written);
@vortigont : could this call be expensive ?
I'm not sure; depending on the implementation it could be compiler-optimized to something like memmove, but I'm not sure how this is done in Espressif's toolchain. Actually I do not expect this part to run frequently under normal conditions, since the buffer should be aligned with the available space.
Another option could be to add a member variable and do index-offset calculations instead (see the sketch below).
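Something along those lines, perhaps (a minimal sketch; SendWindow, buf and head are illustrative names, not existing members of AsyncAbstractResponse):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: instead of erasing the sent prefix (which shifts the
// remaining bytes), keep a head index into the buffer and reset it only once
// everything has been consumed.
struct SendWindow {
  std::vector<uint8_t> buf;   // stands in for _send_buffer
  std::size_t head = 0;       // hypothetical new member: index of first unsent byte

  std::size_t pending() const { return buf.size() - head; }   // bytes still to send
  const uint8_t *data() const { return buf.data() + head; }   // start of unsent data

  // Record that `written` bytes were accepted by the socket.
  void consume(std::size_t written) {
    head += written;
    if (head == buf.size()) {  // fully drained: cheap reset, no memmove
      buf.clear();
      head = 0;
    }
  }
};
```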
Actually I do not like this buffer approach at all - it's too heavy in general to create a buffer matching the window size, then copy data there, then copy from the buffer to TCP's pcbs. The default Arduino window is 5.7k, but with a custom-built lwIP it could become a memory hog. We should think of something else - maybe a small fixed-size circular buffer, or another type of handler for objects that could avoid copying.
I agree. That's what I saw also - adding a vector field to handle this situation. I was wondering if the same thing could be done without it. I was also going to propose a circular buffer, because it cannot be larger than the pcb space anyway, right?
I don't think there's an alternative solution at the architectural level -- the interface of AsyncAbstractResponse requires that it consume bytes from the implementation only once, and we can't know for sure how many bytes the socket will accept until we send it some; so to be correct, AsyncAbstractResponse is going to have to cache the bytes it couldn't send. Since the API requires it to have a temporary buffer anyway, "just keep the buffer until we've sent it all" is the least bad solution.
Performance-wise, std::vector<> does hurt a bit though - it both (a) insists on zeroing the memory, and (b) doesn't have easy-to-use release/reallocate semantics. I tried using a default_init_allocator<> to speed it up (a generic sketch of that adaptor follows below), but it didn't help much. Ultimately, in the solution I put together for the fork I've been maintaining for WLED, I ended up making a more explicit buffer data structure. I also wound up doing some gymnastics to avoid allocating so much memory that LwIP couldn't allocate a packet buffer.
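For context, a default_init_allocator is roughly the following generic adaptor (a textbook sketch, not the exact code used in the WLED fork): it makes vector::resize() default-initialize new elements instead of zero-filling them.

```cpp
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor that turns value-initialization into default-initialization,
// so resize() on a std::vector<uint8_t, ...> no longer zero-fills new bytes.
template <class T, class A = std::allocator<T>>
struct default_init_allocator : A {
  using A::A;
  template <class U>
  struct rebind {
    using other = default_init_allocator<U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
  };
  template <class U>
  void construct(U *p) noexcept(std::is_nothrow_default_constructible<U>::value) {
    ::new (static_cast<void *>(p)) U;  // default-init: no zeroing for trivial types
  }
  template <class U, class... Args>
  void construct(U *p, Args &&...args) {
    std::allocator_traits<A>::construct(static_cast<A &>(*this), p, std::forward<Args>(args)...);
  }
};

using RawBuffer = std::vector<uint8_t, default_init_allocator<uint8_t>>;
```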
See my attempt at a whole solution here: https://github.com/Aircoookie/ESPAsyncWebServer/blob/39b830c852054444ea12a8b5d7fcb4fa004a89d7/src/WebResponses.cpp#L317
Sorry I'm a bit backlogged on pulling this out and pushing it forward...
Some design notes:
- I opted to release the assembly buffer as soon as the data was sent, and reallocate on every _ack; this keeps the "static" memory usage down and lets it better multiplex between many connections when under memory pressure.
- If I was doing it again, I'd give serious thought to capping the buffer at TCP_MSS and looping over _fillBufferAndProcessTemplates -> client()->write() (a rough sketch of that loop follows below). The upside is a guarantee that it'd never be buffering more than one packet; the downside is that it would make ArduinoJSON very sad...
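The capped loop could look roughly like this (an illustrative sketch only: kTcpMss, fill and write are stand-ins for TCP_MSS, _fillBufferAndProcessTemplates and the client write call, and leftover handling is glossed over):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::size_t kTcpMss = 1460;  // typical TCP_MSS; the real value comes from lwIP

// Fill at most one MSS-sized chunk at a time and hand it straight to the
// socket, so the response never buffers more than a single packet's worth.
std::size_t sendCapped(const std::function<std::size_t(uint8_t *, std::size_t)> &fill,
                       const std::function<std::size_t(const uint8_t *, std::size_t)> &write,
                       std::size_t sockSpace) {
  uint8_t chunk[kTcpMss];
  std::size_t total = 0;
  while (sockSpace > 0) {
    std::size_t want = std::min(sockSpace, sizeof(chunk));
    std::size_t got = fill(chunk, want);       // content source may return less than requested
    if (got == 0) break;                       // nothing more to send right now
    std::size_t accepted = write(chunk, got);  // socket may accept less than offered
    total += accepted;
    sockSpace -= accepted;
    if (accepted < got) break;                 // the un-accepted tail would still need caching
  }
  return total;
}
```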
@willmmiles : I understand a bit more. I was wondering why we needed to add a buffer instead of just using indexes, but the fact is that this implementation, being in the abstract class like you say, has to work with and without a content buffer. Thanks!
@vortigont : FYI, I opened PR #317 to add an example in the project that we did not have about large responses. I was hoping to get the opportunity to reproduce these 2 issues, but no, everything goes fine. Questions:
Thanks!
I have verified that the bug has been fixed.
I do not understand the code, but I pointed out some typos in the changes.
thanks! I'll fix typos :)
Force-pushed 92da8c7 to 0f6f725
@vortigont @willmmiles : I did some testing of this PR compared to main. I am using the new example merged yesterday:
for i in {1..20}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;
main: => OK: everything works fine and I receive all 20x 16000 characters.
this PR: => CRASH at _send_buffer.resize(std::min(space, _contentLength - _sentLength));
So I tried reducing the concurrency to 16 (the lwip connection limit):
for i in {1..16}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;
And I am not able to reproduce anymore, except if I keep going and going. But curl + bash like that do not do as good a job as autocannon... So I am spawning it: 32 requests, 16 threads and 16 concurrent connections (so the lwip limit):
autocannon -w 16 -c 16 -a 32 http://192.168.4.1/2
=> CRASH
Strangely the crash also happens when using a lower number of connections:
autocannon -w 16 -c 5 -a 32 http://192.168.4.1/2
So as long as the threads are correctly aligned and the requests executed pretty much at the same time, the buffer allocations / resizes are then done pretty much at the same time too, I think. That explains why it is easier to reproduce with autocannon than with curl. So that's not good, because it kills the concurrency level of the library. @willmmiles : how did you solve that in your fork? You might have the same issue if you are buffering? Is that what you are solving thanks to your
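For what it's worth, the "don't allocate so much that LwIP can't get a packet buffer" idea mentioned above could look roughly like this; this is a hypothetical guard, not something this PR or the fork necessarily does, and reserveForLwip is an assumed value:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical guard: never grow a per-response send buffer beyond a share of
// the currently free heap, so lwIP keeps headroom for its packet buffers.
// The caller would pass e.g. ESP.getFreeHeap() as freeHeapBytes.
std::size_t cappedBufferSize(std::size_t wanted, std::size_t freeHeapBytes) {
  const std::size_t reserveForLwip = 8 * 1024;      // assumed headroom for pbufs
  if (freeHeapBytes <= reserveForLwip) return 0;    // too tight: skip this round, retry on next _ack
  return std::min(wanted, (freeHeapBytes - reserveForLwip) / 2);
}
```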
@vortigont : follow-up from #317 (comment). I just pushed the MRE in the PR and tested it: 51f4472 => 16000 OK. Console:
In the main branch, I receive only 15572 bytes indeed:
❯ curl http://192.168.4.1/3 | grep -o '.' | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15572    0 15572    0     0  18329      0 --:--:-- --:--:-- --:--:-- 18320
   5332 A
   4308 B
   5760 C
    172 D
=> 15572. Console:
So we have a MRE in the project showing that this is fixed 👍 The only issue now remaining is to fix the crash with concurrent requests...
that is... unexpected 8-0, not that it crashes on alloc but that it does not drop reqs on the main branch. Here are my results and those are interesting. Yes, this PR version crashes at high concurrency levels, but somehow it is much faster when not crashing.
== this PR ==
== main ==
So it is more than 2.5 times faster, do not ask me how :) That crappy thing... I'll think about replacing it.
yeah, I'm using Autocannon, but it does not give me any readable stats at all; it just runs requests then quits with zeroes in the output. Works, but quite useless for any analysis.
@vortigont just to clarify:
examples/LargeResponse
I did my testing:
- In main: the /2 handler works (16000 bytes).
- I did not use the /3 handler for my testing: it just acts as a MRE to reproduce the issue.
But in both cases, it is random. I have to restart ab several times to trigger it, while I get to reproduce it more easily with autocannon.
examples/PerfTests
- For request serving: this is insanely fast! We were barely reaching 13 req/s before on average! This is more like 5-6 times faster!
❯ autocannon -c 16 -w 16 -d 20 --renderStatusCodes http://192.168.4.1
Running 20s test @ http://192.168.4.1
16 connections
16 workers
┌─────────┬────────┬─────────┬──────────┬──────────┬────────────┬────────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼──────────┼──────────┼────────────┼────────────┼──────────┤
│ Latency │ 206 ms │ 4246 ms │ 11578 ms │ 12129 ms │ 4749.38 ms │ 3246.55 ms │ 14444 ms │
└─────────┴────────┴─────────┴──────────┴──────────┴────────────┴────────────┴──────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 44 │ 44 │ 69 │ 80 │ 68.16 │ 7.75 │ 44 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 193 kB │ 193 kB │ 302 kB │ 350 kB │ 298 kB │ 33.9 kB │ 193 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200 │ 1363 │
└──────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 320
3k requests in 20.03s, 5.97 MB read
200 errors (0 timeouts)
Yes, it is bad at interpreting the response, I guess because of the way the response is crafted with this subclass. But anyway, this is not important. What's important is a tool that triggers concurrent requests with workers/threads.
it is important to understand how fast it works, otherwise I would not have noticed :)
If you run the PerfTest example and trigger 16 SSE clients:
@vortigont : serving a slow chunk (what was fixed) is now crashing. The PerfTest example also crashes with:
I suspect this is because of the while loop added in the new _ack implementation?
Force-pushed 02db8d8 to b70c8d5
@vortigont : I finally had to rebase and force-push this branch to fix a conflict with main: please do a
nice catch. Those benchmark tools seem like they do not respect
ugh... sorry, I might have messed it up :( I was wondering why it rejects my newer commits.
yeah... with iterative refill we can't accept such delays each cycle. Well... what can I say - it's an async lib, do not do this or the watchdog will bite :))
I agree with you, but releasing that could break users doing, for example, slow SD card serving or listing like we saw in the past... The only way would be for them to switch to SSE or websocket, or to set
I am personally fine with that. @me-no-dev @willmmiles : what are your thoughts? OK with that too?
@vortigont : and were you able to have a look at the SSE failure? I suspect websocket might have the same. I wonder if it is caused by the change around the _ack method in the super class, but I find it weird... I do not see why the code would be wrong.
AsyncAbstractResponse::_ack could allocate a temp buffer with a size larger than the available sock buffer (i.e. to fit headers) and eventually lose the remainder on transfer due to not checking whether the complete data was added to the sock buff. Refactor the code in favor of a dedicated std::vector object acting as an accumulating buffer, with more careful control of the amount of data actually copied to the sockbuff. Closes #315
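In outline, the approach that commit message describes looks roughly like this (a simplified sketch, not the actual WebResponses.cpp code; fill and write stand in for _fillBufferAndProcessTemplates and the AsyncClient add/send calls):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Keep unsent bytes in an accumulating vector, top it up from the content
// source, copy only as much as the socket actually has room for, and keep the
// remainder for the next _ack instead of dropping it.
struct ResponseState {
  std::vector<uint8_t> send_buffer;  // bytes produced but not yet accepted by the socket
  std::size_t sent = 0;

  std::size_t onAck(std::size_t sockSpace,
                    const std::function<std::size_t(uint8_t *, std::size_t)> &fill,
                    const std::function<std::size_t(const uint8_t *, std::size_t)> &write) {
    // Top up the buffer so there is something to offer the socket.
    if (send_buffer.size() < sockSpace) {
      std::size_t old = send_buffer.size();
      send_buffer.resize(sockSpace);
      std::size_t got = fill(send_buffer.data() + old, sockSpace - old);
      send_buffer.resize(old + got);
    }
    // Copy only what the socket accepts; whatever is left stays buffered.
    std::size_t written = write(send_buffer.data(), std::min(send_buffer.size(), sockSpace));
    send_buffer.erase(send_buffer.begin(), send_buffer.begin() + written);
    sent += written;
    return written;
  }
};
```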
Force-pushed b70c8d5 to f7ec43d
@vortigont :
no, not yet, allow me some time to check on this.
@mathieucarbou |
src/ESPAsyncWebServer.h
Outdated
 *
 * @return AsyncClient* pointer to released connection object
 */
AsyncClient* clientRelease();
should this be made private or protected and eventually use friend if needed ?
I was thinking the same, but then I thought that if any other implementation would need it (like my new websocket thing) it would be difficult to introduce friends each time. Added a note to explain what it does.
this will explicitly release ownership of the AsyncClient* object. Make it more clear on ownership change for SSE/WebSocket
Just noticed this place 8-0 in AsyncWebSocket::_newClient:
AsyncWebSocketClient *AsyncWebSocket::_newClient(AsyncWebServerRequest *request) {
  _clients.emplace_back(request, this);
  // we've just detached AsyncTCP client from AsyncWebServerRequest,
  // *request was destructed along with response object in AsyncWebSocketClient's c-tor
  _handleEvent(&_clients.back(), WS_EVT_CONNECT, nullptr /* request */, NULL, 0);
  return &_clients.back();
}
why is that request pointer passed to the user's callback, any idea? I mean, is there any use case for it? Otherwise I would have to keep
The request ptr is not passed to the public api. One of the useful things is:
I might not be understanding your question?
yeah, sorry, I mean here:
the 3rd arg was
I see! Use cases:
This makes me think that I forgot to add it in AsyncWebSocketMessageHandler. The onConnect callback should be:
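Presumably something along these lines (a guess at the intended signature; the existing parameter list is assumed rather than taken from the header):

```cpp
#include <functional>

// Forward declarations of the library types involved (for illustration only).
class AsyncWebSocket;
class AsyncWebSocketClient;
class AsyncWebServerRequest;

// Guess at the intended change: also pass the originating request to the
// connect callback, so user code can read HTTP headers (auth, negotiated
// extensions, ...) before the request object goes away.
using WsConnectHandler = std::function<void(AsyncWebSocket *server,
                                            AsyncWebSocketClient *client,
                                            AsyncWebServerRequest *request)>;
```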
yeah, you're right, that makes sense to get extensions from headers.
This is only for connect, yes... After that, the arg is cast to an AwsFrameInfo for data events.
…t is executed user code might use HTTP headers information from the request
should be good this way :)
I will have a look, test, and also update the callback to add the request: this class was added recently by me as a way to simplify WS usage, so it is not used a lot and this is an acceptable API break I think, provided we can bump the next version to 3.9.0 considering all the things that will be released.