@pschyska pschyska commented Oct 2, 2025

Proposed changes

As mentioned in #110, this is my work on making Scheduler.schedule() thread-safe.

This would enable schedule() to be called from other threads, e.g. async-compat or other "sidecar-runtime" setups. It also makes sure epoll is interrupted when there are IO completion notifications coming in from outside of the event loop, leading to prompt continuation.

While this doesn't provide a native hyper/client as @bavshin-f5 wanted, it makes the default tokio implementation work via Compat. This would be a viable stopgap solution for us. I've added some examples, including hyper and reqwest. In the future, one could implement a "sidecar-runtime" approach as in async-compat natively that would use a separate epoll loop in a thread, or inject additional fds from the Rust side to nginx's epoll instance (if possible).

Some notes:

  • requires ngx_thread_tid to be present.
  • not compatible with no_std right now: OnceLock (might be replaceable by something from spin) and crossbeam-channel, and probably more. I've added std as a dependency for async to reflect that (this would be a breaking change, but async Rust probably implies std anyways).

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have written my commit messages in the Conventional Commits format.
  • I have read the CONTRIBUTING doc
  • I have added tests (when possible) that prove my fix is effective or that my feature works (don't think it's possible)
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@pschyska pschyska force-pushed the main branch 2 times, most recently from e327e07 to e1c9191 Compare October 6, 2025 13:06
@pschyska pschyska changed the title from "RFC: thread-safe spawn with ngx_notify" to "thread-safe spawn with ngx_notify" Oct 6, 2025
@pschyska pschyska force-pushed the main branch 5 times, most recently from 53f60c3 to 4a95650 Compare October 8, 2025 13:25
@bavshin-f5
Member

  • ngx_notify is "thread-safe" under a very narrow set of conditions. One of those is that nobody outside of the nginx internal code is allowed to call it.
    Check ngx_epoll_module.c:769 and consider what would happen if multiple modules will start invoking ngx_notify() with different handler methods.
  • I don't want to allow mixing internal and external async runtimes or encourage use of threads. Both seem to be fragile and dangerous.
    I don't even believe you need to mix both runtimes: if you intend to use tokio, just run all the asynchronous code in the tokio task.
  • This change would break or make significantly slower any IO implementation that properly integrates with the nginx event loop (such as hyper client in nginx-acme).

@pschyska
Author

pschyska commented Nov 7, 2025

> • ngx_notify is "thread-safe" under a very narrow set of conditions. One of those is that nobody outside of the nginx internal code is allowed to call it.
>   Check ngx_epoll_module.c:769 and consider what would happen if multiple modules will start invoking ngx_notify() with different handler methods.

I see it now.

> • I don't want to allow mixing internal and external async runtimes or encourage use of threads. Both seem to be fragile and dangerous.

If it's guaranteed that all tasks run on the main thread, I don't think it's dangerous. This change only allows scheduling from other threads. It's not uncommon that libraries start their own helper threads, for instance. async-compat starts a transparent tokio runtime in a thread for IO completion handlers, while still using our executor for the tasks.

I can also imagine situations where you'd want to run non-I/O compute in a thread pool so as not to block nginx; in our case, for example, crypto. You'd want to be able to notify the request handler's async task of completion by writing to a channel or a similar mechanism. This, in turn, would call the waker from that thread (AFAIK), which calls schedule for the task from that thread, but the woken task would be scheduled to run on the main thread via ngx_notify.

> I don't even believe you need to mix both runtimes: if you intend to use tokio, just run all the asynchronous code in the tokio task.

We need to work with the request heavily (mutate headers_in and headers_out, read client bodies, produce response bodies) in response to I/O (external requests, database queries, custom crypto/tunneling), which can only be done safely on the main thread. If all our code runs in a completely separate engine, this becomes extremely hard. In addition, we need a way to interrupt nginx's epoll in reaction to I/O events, not all of which are bound to a request (e.g. OpenID shared signals).
async-compat seemed like a good compromise to me: use the tokio "runtime" (I/O setup, ...), but with the ngx-rust scheduler/executor.

> • This change would break or make significantly slower any IO implementation that properly integrates with the nginx event loop (such as hyper client in nginx-acme).

I don't think it would do that. If the waker is invoked from the main thread, schedule in my branch would simply .run() the runnable, and everything stays on the main thread. ngx_notify would not be called (except once during the lifetime of a worker process because it's not known which tid is main). I have to admit I didn't test with nginx-acme yet though.
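
To make the dispatch rule above concrete, here is a minimal sketch of the idea, not the code from the branch. The names Scheduler, Job and Dispatch are hypothetical; std::thread::ThreadId and std::sync::mpsc are used only to keep the example self-contained (the PR itself relies on ngx_thread_tid and crossbeam-channel), and the step that wakes the event loop is left out.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, ThreadId};

// Hypothetical stand-in for a runnable task.
type Job = Box<dyn FnOnce() + Send>;

/// Where a scheduled job ended up.
#[derive(Debug, PartialEq)]
enum Dispatch {
    RanInline,
    Queued,
}

/// Hypothetical scheduler: remembers the event thread and owns the sending
/// half of the queue that the event thread drains.
struct Scheduler {
    main: ThreadId,
    queue: Sender<Job>,
}

impl Scheduler {
    /// Run the job immediately when called on the event thread; otherwise
    /// hand it over to the event thread through the queue.
    fn schedule(&self, job: Job) -> Dispatch {
        if thread::current().id() == self.main {
            job();
            Dispatch::RanInline
        } else {
            self.queue.send(job).expect("event thread dropped the queue");
            // A real implementation would also wake the event loop here.
            Dispatch::Queued
        }
    }
}

fn demo() -> (Dispatch, Dispatch, usize) {
    let (tx, rx) = channel::<Job>();
    let sched = Scheduler { main: thread::current().id(), queue: tx };

    // Called from the "event thread": the job runs inline, nothing is queued.
    let on_main = sched.schedule(Box::new(|| {}));

    // Called from a helper thread: the job is queued for the event thread.
    let off_main = thread::spawn(move || sched.schedule(Box::new(|| {})))
        .join()
        .unwrap();

    // The event thread drains the queue (the scheduler was dropped above,
    // so the iterator ends once the queue is empty).
    let mut drained = 0;
    for job in rx {
        job();
        drained += 1;
    }
    (on_main, off_main, drained)
}

fn main() {
    let (on_main, off_main, drained) = demo();
    println!("{on_main:?} {off_main:?} drained={drained}");
}
```

In this model, a waker invoked on the event thread never touches the queue, which is the property the paragraph above relies on.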

To recap, I'd still like the following:

  • A way to interrupt epoll
  • A way to move tasks to the main thread
  • Safe to call schedule from other threads

Given ngx_epoll_module.c:769, ngx_notify from other threads is indeed inherently unsafe.

However, what if we do this:

  • ngx_post_event a custom event, its handler being notify_handler
  • write(notify_fd, &inc, sizeof(uint64_t)) to interrupt epoll. The event loop would then find our custom event promptly.

Would this work for you?

schedule() can now be called from any thread, but will move tasks to the event loop
thread. pthread_kill(main_thread, SIGIO) is used to ensure prompt response if needed.

This enables receiving I/O notification from "sidecar runtimes" like async-compat, for
instance.

The async example has been rewritten to use async_::spawn, demonstrating usage of
reqwest and hyper clients wrapped in Compat to provide a tokio runtime environment while
using the async_ Scheduler as executor.
@pschyska
Author

pschyska commented Nov 7, 2025

@bavshin-f5 I've rewritten the code to not rely on ngx_notify. Instead, I'm using ngx_post_event, followed by pthread_kill(main_thread, SIGIO) as I had a hard time getting the notify_fd from within ngx-rust. Does that address your concern?
schedule still can be called from other threads, e.g. from a waker, and moves the tasks to the main thread. The SIGIO ensures prompt reaction.

@bavshin-f5
Member

> If it's guaranteed that all tasks run on the main thread, I don't think it's dangerous. This change only allows scheduling from other threads. It's not uncommon that libraries start their own helper threads, for instance. async-compat starts a transparent tokio runtime in a thread for IO completion handlers, while still using our executor for the tasks.

Ah. I got why you assume that this is safe. I don't believe it is, and I expect that some of your code is quietly being scheduled on a tokio executor in another thread. async-compat is not the kind of magic that can override tokio scheduling, it merely allows creating and polling certain tokio types outside of the runtime-owned thread.
I also suspect that tokio is not quite prepared for deallocation of seemingly exclusively owned objects from a thread outside of the runtime.

The only approach I would consider safe is where nothing owned by a request or a cycle pool is allowed to move to another runtime, either accidentally or intentionally. Many things we do are lacking such protection because we assume single-threaded environment.

event.log = ngx_cycle_log().as_ptr();

unsafe {
    ngx_post_event(&mut *event, ptr::addr_of_mut!(ngx_posted_events));
Member

Posting to ngx_posted_events can easily lead to an infinite loop. If the current task is already running from a posted event handler, no IO could happen before the next wakeup.
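
A toy model of this concern, not nginx code: it assumes the posted queue is drained until it is empty before the loop returns to polling for I/O. Event, repost, and the budget parameter are hypothetical and exist only so that the example terminates.

```rust
use std::collections::VecDeque;

/// Toy event: when handled, it may post itself to the same queue again.
struct Event {
    repost: bool,
}

/// Drain the queue "until empty", giving up after `budget` handler calls.
/// Returns true if the queue was emptied, i.e. the loop could go back to
/// waiting for I/O.
fn process_posted(queue: &mut VecDeque<Event>, budget: usize) -> bool {
    for _ in 0..budget {
        match queue.pop_front() {
            None => return true,
            Some(ev) => {
                // The "handler": a self-reposting event refills the queue.
                if ev.repost {
                    queue.push_back(Event { repost: true });
                }
            }
        }
    }
    queue.is_empty()
}

fn main() {
    let mut well_behaved = VecDeque::from([Event { repost: false }]);
    let mut reposting = VecDeque::from([Event { repost: true }]);
    println!("well-behaved drained: {}", process_posted(&mut well_behaved, 1000));
    println!("reposting drained:    {}", process_posted(&mut reposting, 1000));
}
```

Under that assumption, a handler that always posts to the same queue keeps it non-empty, so the drain never finishes on its own.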

Author

If the current task is running on the event thread, there is no need to post the event, because the handler is necessarily still reading from the channel at that point. We can therefore skip it, just like the SIGIO.

Author

I tried removing the ngx_post_event when on the event thread, but the example started deadlocking. I'm not sure why, but it seems we can't skip it (though we should still skip the SIGIO). Can you elaborate on the infinite loop that you suspect can happen now? Doesn't ngx_post_event just add the event to the queue? When we are on the event thread, we know nginx is currently spinning, so the posted event should just get picked up on the next turn.
Why is "no IO could happen before the next wakeup" relevant here?

/// Initialize async by storing MAIN_THREAD
pub fn initialize_async() {
    MAIN_THREAD
        .set(unsafe { pthread_self() })
Member

You call this from the master process, but POSIX does not specify if thread ID remains the same after fork().
It's better to initialize this in spawn, because spawn is the entry point of the async runtime and it is supposed to be called from a worker process.

Raw pthread use is also non-portable, we have that one platform without pthread.h that we pretend to support. ngx_thread_tid presence depends on the nginx build options, and Rust's std::thread::ThreadId is very expensive to obtain.

Author

Hmm, I'm running initialize_async in init_process. The dev guide reads:

The master process creates one or more worker processes and the init_process handler is called in each of them.

(emphasis mine)

I read this as: "called once per worker", and I think I'm seeing that happening right now. Am I mistaken?

Fair point on pthreads, what would you recommend?

I'd have used ngx_thread_tid (potentially requiring the corresponding build options for the "async" feature), and had it working for just the "on event thread" detection, but then I don't have anything to pass to pthread_kill...

Would a normal kill(getpid(), SIGIO) be ok, too? It did seem to work fine when I tested it a few weeks back while working on the initial version, and it would actually enable me to remove the required init from init_process again.

On init during spawn: I considered it, but I wasn't sure I can rely on it happening on the event thread. Couldn't the user have set up an ngx_thread_task and call the first spawn in its handler, shooting themselves in the foot?

@pschyska
Author

pschyska commented Nov 7, 2025

> > If it's guaranteed that all tasks run on the main thread, I don't think it's dangerous. This change only allows scheduling from other threads. It's not uncommon that libraries start their own helper threads, for instance. async-compat starts a transparent tokio runtime in a thread for IO completion handlers, while still using our executor for the tasks.
>
> Ah. I got why you assume that this is safe. I don't believe it is, and I expect that some of your code is quietly being scheduled on a tokio executor in another thread. async-compat is not the kind of magic that can override tokio scheduling, it merely allows creating and polling certain tokio types outside of the runtime-owned thread. I also suspect that tokio is not quite prepared for deallocation of seemingly exclusively owned objects from a thread outside of the runtime.
>
> The only approach I would consider safe is where nothing owned by a request or a cycle pool is allowed to move to another runtime, either accidentally or intentionally. Many things we do are lacking such protection because we assume single-threaded environment.

I don't claim to fully understand it, but they state:

"Otherwise, a new single-threaded runtime will be created on demand. That does not mean the future is polled by the tokio runtime."

The tokio runtime could spawn its own tasks into that runtime, sure, e.g. some kind of helper task. But I don't see how my task could end up there. If my task's Runnable.schedule() arranges for it to be scheduled on the event thread, which is precisely what my PR does, it will run only there.

I'm not an expert, but I think what happens is this:

  • async_::spawn(my_task)
  • event handler starts running it (part1) until await:
    • reqwest.get(...).await
      • reqwest.get(...) is polled -> Pending, waker is set to my_task(part2).schedule()
        • tokio runtime thread things happen, ..., eventually waker is called (from that thread! which is why I want schedule() to work from other threads)
        • my_task(part2).schedule() is our Scheduler.schedule(), will post an event and push the Runnable to the queue
        • my_task(part2) runs on the main thread

This is what I see right now, using the code from the PR. This is also what I'd expect to happen with a "sidecar"-tokio-runtime that I started myself (no async-compat).
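
The flow above can be reproduced in miniature with std only. This is a sketch under assumptions, not the PR's executor: Shared, Remote and Reschedule are hypothetical names, a helper thread stands in for the tokio runtime thread, and a blocking channel receive stands in for the event loop sleeping in epoll.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, ThreadId};

/// State shared between the task and the helper thread.
#[derive(Default)]
struct Shared {
    value: Option<u32>,
    waker: Option<Waker>,
}

/// Future that resolves once the helper thread has stored a value.
struct Remote(Arc<Mutex<Shared>>);

impl Future for Remote {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let mut shared = self.0.lock().unwrap();
        match shared.value {
            Some(v) => Poll::Ready(v),
            None => {
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Waker that only tells the main thread "poll the task again".
struct Reschedule(Mutex<Sender<()>>);

impl Wake for Reschedule {
    fn wake(self: Arc<Self>) {
        // May be called on any thread; the poll itself stays on the main thread.
        let _ = self.0.lock().unwrap().send(());
    }
}

/// Drives the task on the calling thread and records where each poll ran.
fn run() -> (u32, Vec<ThreadId>) {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let producer = Arc::clone(&shared);
    let (tx, rx) = channel();
    let waker = Waker::from(Arc::new(Reschedule(Mutex::new(tx))));
    let mut cx = Context::from_waker(&waker);
    let mut task = Box::pin(Remote(shared));
    let mut polled_on = Vec::new();

    // Helper thread: produces the value and wakes the task from over there.
    let helper = thread::spawn(move || {
        let mut s = producer.lock().unwrap();
        s.value = Some(42);
        if let Some(w) = s.waker.take() {
            w.wake();
        }
    });

    loop {
        polled_on.push(thread::current().id());
        if let Poll::Ready(v) = task.as_mut().poll(&mut cx) {
            helper.join().unwrap();
            return (v, polled_on);
        }
        // Stand-in for the event loop sleeping until it is notified.
        rx.recv().unwrap();
    }
}

fn main() {
    let (value, polled_on) = run();
    println!("result={value}, polls={}", polled_on.len());
}
```

Whether the wake-up arrives from the helper thread or the value is already there on the first poll, every poll in this sketch happens on the thread that called run().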

@pschyska pschyska changed the title from "thread-safe spawn with ngx_notify" to "async: thread-safe schedule()" Nov 7, 2025
@pschyska
Author

pschyska commented Nov 7, 2025

I just pushed an experiment with a sidecar tokio runtime and added tid debug logging here: https://github.com/pschyska/ngx-rust/blob/a5ff1bb0cc3e6d5bb15f46e24348a1d2fa694f18/examples/async.rs#L115
What I see is this:

2025/11/07 23:27:02 [debug] 494044#494044: async: spawning new task
!!! schedule tid=494044
!!! run eager tid=494044
!!! async entry, tid=494044
!!! external task entry, tid=494047
!!! schedule tid=494047
!!! run handler tid=494044
!!! async resume, tid=494044, result=42
!!! schedule tid=494046
!!! run handler tid=494044
!!! after await tid=494044

This supports my theory: my task is never moved to the tokio runtime. The runtime does call schedule from its own threads, though: from the runtime thread (494047) when using tokio::spawn, and presumably from the sleep thread (494046) when awaiting tokio::time::sleep directly. However, the code in my task always runs on the event thread.

I've also pushed a change to main that switches to kill and ngx_thread_tid. That works fine as well.

@pschyska
Author

pschyska commented Nov 8, 2025

> > If it's guaranteed that all tasks run on the main thread, I don't think it's dangerous. This change only allows scheduling from other threads. It's not uncommon that libraries start their own helper threads, for instance. async-compat starts a transparent tokio runtime in a thread for IO completion handlers, while still using our executor for the tasks.
>
> Ah. I got why you assume that this is safe. I don't believe it is, and I expect that some of your code is quietly being scheduled on a tokio executor in another thread. async-compat is not the kind of magic that can override tokio scheduling, it merely allows creating and polling certain tokio types outside of the runtime-owned thread. I also suspect that tokio is not quite prepared for deallocation of seemingly exclusively owned objects from a thread outside of the runtime.
>
> The only approach I would consider safe is where nothing owned by a request or a cycle pool is allowed to move to another runtime, either accidentally or intentionally. Many things we do are lacking such protection because we assume single-threaded environment.

I just had another idea that helped me visualize this:

If !Send futures could move between executors at will, they could end up in an executor that requires Send (and/or Sync).

E.g.: if the "part-2" future of my task, after awaiting a future from a tokio runtime, somehow magically ran in a tokio executor using threads, it would have to be Send. But if I used e.g. async_task::spawn_local, it could be just 'static, and the compiler would reject any attempt to hand it to an executor that requires Send. (Of course, crucial parts of an executor are unsafe, but this would still make such behaviour wildly illegal in Rust.)
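
A small sketch of the type-level argument, using hypothetical stand-ins rather than the real tokio or async-task APIs: spawn_on_pool models an executor that demands Send, and spawn_local here models a single-threaded spawn that only needs 'static (it is not async_task::spawn_local).

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical thread-pool spawn: the future must be Send.
fn spawn_on_pool<F: Future<Output = u32> + Send + 'static>(_fut: F) {}

/// Hypothetical single-threaded spawn: 'static is enough.
fn spawn_local<F: Future<Output = u32> + 'static>(fut: F) -> F {
    fut
}

/// A future that is !Send because an Rc is alive across an await point.
async fn not_send() -> u32 {
    let rc = Rc::new(41);
    std::future::ready(()).await;
    *rc + 1
}

/// Waker that does nothing; enough to poll a future that is ready at once.
struct Noop;

impl Wake for Noop {
    fn wake(self: Arc<Self>) {}
}

/// Poll a future a single time on the current thread.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(Noop));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    fut.as_mut().poll(&mut cx)
}

fn main() {
    // A Send future is accepted by the pool-style spawn.
    spawn_on_pool(async { 1 });

    // Uncommenting the next line makes compilation fail, because the future
    // returned by not_send() holds an Rc and therefore is not Send.
    // spawn_on_pool(not_send());

    // The single-threaded spawn accepts it, and it runs on this thread.
    println!("{:?}", poll_once(spawn_local(not_send())));
}
```

The compiler can only enforce this for safe code; an executor built on unsafe spawning has to uphold the same rule by construction.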

I don't know of any method of making a task move between executors. If I wanted to connect futures of different executors beyond their output for some reason (e.g. to be able to cancel the other task), I would use a remote_handle. But AFAIK this doesn't change the Context (which ties back to schedule() and the task); it establishes a oneshot between the tasks.

We could use spawn_local instead of spawn_unchecked (which would store Rust's thread id and check that it is the same on .run()), but this would be unnecessary overhead here, because it simply can't happen. The example code I wrote, which leads to waking from other threads all the time, still runs fine with spawn_unchecked.

Another angle on this - the spawn_unchecked docs state:

Safety

  • If Fut is not 'static, borrowed non-metadata variables must outlive its Runnable. and: If schedule is not 'static, borrowed variables must outlive all instances of the Runnable's Waker.
    ✅ doesn't apply: we require 'static for the Future and Scheduler is 'static (current and my PR)
  • If Fut is not Send, its Runnable must be used and dropped on the original thread
    ✅ run() is only called on the event thread (current and my PR), which is what "used and dropped" implies, I believe, according to the language used in the introduction.
  • If schedule is not Send and Sync, all instances of the Runnable's Waker must be used and dropped on the original thread.
    currently: ❗schedule is claimed to be Send + Sync, but it is not. It must not be called from another thread (and by extension Wakers, that call Runnable.schedule()). The fact that I'm even able to do it (e.g. accidentally by using async-compat, or manually by polling myself and calling Wakers, etc. ...) indicates an issue. Currently though, the Runnable will be .run() on an arbitrary thread. As there is no way to communicate that requirement in the type system, a runtime check would have been required (e.g.: spawn_local).
    PR: ✅ (IMHO :-) schedule is Send + Sync. The event is only mutated to update the log to the current ngx_cycle_log, and it is guarded by the RwLock. If not for that, the event could actually be 'static (we only use it to communicate the static callback address) and there would be no need for the "unsafe impl"s.

I think I have now fully convinced myself, let me know if this helps to convince you as well 🙂
