Difficulty: 🟣 Research / open-ended — this issue is to produce a scoping doc / RFC (Request for Comments — a written proposal circulated for feedback before implementation), not the port itself.
Scope: Very large as a program of work; this issue's deliverable is a written, reviewed plan with a recommended ordering and the key design decisions called out.
Subsystems (in likely migration order): graph/ · communication/ · conductor/ · api_server/ · worker/
Prerequisites: Rust; a Python↔Rust interop story (e.g., PyO3 built with maturin or equivalent); comfort designing a language-neutral wire format.
Goal
We want as much of the M* runtime as possible running in Rust — for raw
performance and, just as importantly, to get the scheduling and communication
hot paths off the Python GIL. As a hard constraint, model code stays Python.
Model authors will still write their submodules, model code, and forward passes in Python with torch. So every Rust boundary we introduce has to preserve the existing Python model-authoring API (with reasonable changes as needed).
This issue is to scope that migration: take the rough phasing below, validate
it against the code, and turn it into a concrete RFC with recommended ordering, the
wire-format decision, the in-process-vs-separate-process call per component, and
an assessment of what realistically can't move.
Rough phasing to react to (maintainer's initial thinking)
This is a starting point, not a spec — the point of the issue is to pressure-test
and refine it.
- Graph layer first. The graph logic, ready-queues, and edge/ingest
bookkeeping (mstar/graph/ — GraphNode, Sequential,
Parallel, Loop, GraphEdge) is self-contained dataflow with no GPU or
Python-model dependency. It's a strong first Rust candidate and can be
"plugged into" the existing worker behind the current Python API. Validate
against the existing graph tests
(test/modular/test_graph.py).
- ZMQ communication stack (mstar/communication/communicator.py,
the BaseCommunicator interface). Natural Rust boundary; Rust has solid ZMQ
bindings. The tensor communication stack
(mstar/communication/tensors.py) could
follow if it makes sense, but it's harder (see notes below).
- Conductor loop as a Rust process. The conductor's main loop
(Conductor.run() in conductor.py) is much
smaller and simpler than the worker's because it's essentially a poll-messages →
dispatch → sleep loop. Low risk, modest reward: a good vehicle for learning
how to stand up a Rust component in the live system before tackling anything
hot.
- API server. Also relatively self-contained
(mstar/api_server/). It's a Python process with two
threads, so it can become a latency bottleneck under load: when there's a lot
of data flowing in/out, or when requests are individually very fast,
per-request Python overhead can dominate. A Rust HTTP front-end could be
a real win there.
- Worker loop: scope only, don't commit. Worker.run() (worker.py) is the
hard one: async scheduling, the
dedicated GPU thread, speculative scheduling, FlashInfer attention planning,
CUDA-graph capture/replay. It's unclear how much of this can run in Rust at
all (it's tangled with torch/flashinfer/CUDA), but individual components
almost certainly can. This needs very careful carve-out and is its own
sub-investigation.
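To make the "self-contained dataflow" claim in the graph-layer phase concrete, here is a minimal sketch of ready-queue bookkeeping over plain edges. The names ReadyQueue and run_order are hypothetical and are not taken from mstar/graph/. The point is only that this kind of logic needs nothing beyond integer counters and queues, which is what makes it a plausible first port.

```python
from collections import deque


class ReadyQueue:
    """Toy dataflow bookkeeping (hypothetical, not the mstar/graph/ API).

    A node becomes ready once every inbound edge has delivered.
    """

    def __init__(self, edges):
        # edges: iterable of (src, dst) node names.
        self.successors = {}
        self.pending = {}  # node -> number of inbound edges still outstanding
        for src, dst in edges:
            self.successors.setdefault(src, []).append(dst)
            self.successors.setdefault(dst, [])
            self.pending.setdefault(src, 0)
            self.pending[dst] = self.pending.get(dst, 0) + 1
        # Nodes with no inbound edges are ready immediately.
        self.ready = deque(n for n, count in self.pending.items() if count == 0)

    def pop_ready(self):
        """Return the next ready node, or None if nothing is ready."""
        return self.ready.popleft() if self.ready else None

    def complete(self, node):
        """Mark a node finished and release any successors it was blocking."""
        for nxt in self.successors[node]:
            self.pending[nxt] -= 1
            if self.pending[nxt] == 0:
                self.ready.append(nxt)


def run_order(edges):
    """Drain the queue, completing each node as soon as it is popped."""
    queue = ReadyQueue(edges)
    order = []
    while (node := queue.pop_ready()) is not None:
        order.append(node)
        queue.complete(node)
    return order
```

For a diamond graph, run_order([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]) returns ["a", "b", "c", "d"]: d is released only after both b and c complete. A Rust port of the real graph layer would have to reproduce behaviour like this exactly, which is why the existing graph tests are a natural acceptance gate.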
Observations from the current code (fold these into the RFC)
- Pick the wire/message format early; it's a prerequisite for most things. Today,
messages are rich Python objects (e.g. WorkerMessage / ConductorMessage in
mstar/utils/ipc_format) sent over ZMQ. The moment one end is Rust, the
payload needs a language-neutral, schema'd encoding (msgpack / protobuf /
flatbuffers / bincode-with-shared-schema). This decision blocks the communication, conductor, and API server phases above and
should be made first, ideally so Python-to-Python keeps working unchanged
during the transition.
- Decide in-process (PyO3) vs separate-process (IPC) per component. They're
different tools: the conductor and API server are already separate processes
talking over the bus, so rewriting them as standalone Rust processes avoids
PyO3 entirely. The graph layer, by contrast, lives inside the worker, so it
wants in-process PyO3 bindings. Name this fork explicitly for each phase.
- The GIL is the real prize. A big reason the worker has extra Python threads
and interpreter tuning today (the plan-executor thread, the
MSTAR_PY_SWITCH_INTERVAL_SEC setting) is GIL contention between
scheduling/planning and GPU submission. Rust components that release the GIL
while they work are where the wins come from.
- Tensor comm is genuinely harder than ZMQ comm. The tensor transport
(tensors.py) is already partly native:
RDMA/TCP go through Mooncake's TransferEngine (C++), wrapped behind the
TensorTransferEngine ABC, and the orchestration touches torch tensors, CUDA
streams, and raw data_ptrs. Porting the orchestration to Rust means careful
torch/CUDA FFI, so this is reasonable to defer or leave partly/mostly Python.
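As an illustration of the wire-format observation above, the sketch below replaces a rich Python object with a versioned, schema'd envelope that a Rust peer could decode. It makes several assumptions: JSON from the standard library stands in for whichever encoding the RFC picks (msgpack, protobuf, and so on), and ControlMessage, its fields, and SCHEMA_VERSION are hypothetical rather than the real WorkerMessage / ConductorMessage layout.

```python
import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = 1  # hypothetical; bump on incompatible schema changes


@dataclass
class ControlMessage:
    # Hypothetical fields for illustration only.
    kind: str
    request_id: int
    payload: dict


def encode(msg: ControlMessage) -> bytes:
    """Serialize to bytes that carry an explicit schema version."""
    envelope = {"v": SCHEMA_VERSION, **asdict(msg)}
    return json.dumps(envelope, separators=(",", ":")).encode("utf-8")


def decode(frame: bytes) -> ControlMessage:
    """Parse bytes back into a message, rejecting unknown schema versions."""
    fields = json.loads(frame)
    version = fields.pop("v", None)
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {version!r}")
    return ControlMessage(**fields)
```

The explicit version field is what would let Python-to-Python traffic keep working during the transition: an old peer fails loudly on a frame it does not understand instead of misreading it. Whatever encoding is chosen, the same round-trip property (decode(encode(msg)) == msg) should hold on both the Python and Rust sides.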
Open questions for the RFC
- For each component: in-process PyO3 or separate Rust process?
- One wire format for everything, or different formats for control messages vs.
tensor metadata?
- How do we keep main releasable throughout, i.e. land each phase behind the
existing Python interfaces, with a Python fallback, rather than a flag day?
- What's explicitly staying Python forever (model forward, tokenization, media
decode, anything torch/CUDA-bound)?
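One possible shape for the "Python fallback rather than a flag day" question is a loader that prefers a Rust extension when it is installed and otherwise returns the existing Python implementation. This is only a sketch under assumptions: the module name mstar_rs, the GraphBackend attribute, the MSTAR_USE_RUST variable, and the PyGraphBackend stand-in are all hypothetical.

```python
import importlib
import os


class PyGraphBackend:
    """Stand-in for the existing pure-Python implementation (hypothetical)."""

    name = "python"


def load_graph_backend(module_name="mstar_rs", env_var="MSTAR_USE_RUST"):
    """Return the Rust-backed class if available and enabled, else the Python one.

    module_name and env_var are hypothetical names chosen for this sketch.
    """
    if os.environ.get(env_var, "1") != "0":
        try:
            # A PyO3 extension built with maturin is imported like any other module.
            return importlib.import_module(module_name).GraphBackend
        except (ImportError, AttributeError):
            # Extension not built or not installed: fall through to Python.
            pass
    return PyGraphBackend
```

Callers would only ever see whatever load_graph_backend() returns, so a phase can land with the Rust path off by default and be switched on per deployment. The cost is that both implementations have to pass the same test suite for as long as the fallback exists.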
Deliverable / acceptance criteria
- A reviewed RFC that: confirms or revises the phase ordering above; picks the
wire format; makes the in-process-vs-process call per component; lists what
stays Python; and covers build integration (maturin alongside the existing
pyproject.toml).
- Agreement that graph and the ZMQ communication layer are the first two
ports, each shippable behind its current Python interface.
New to M*? Skim How it works and the Contributing guide first.