Scope a migration toward a Rust-native runtime (Python stays the model layer) #130

@NSagan271

Difficulty: 🟣 Research / open-ended — this issue is to produce a scoping doc / RFC (Request for Comments — a written proposal circulated for feedback before implementation), not the port itself.

Scope: Very large as a program of work; this issue's deliverable is a written, reviewed plan with a recommended ordering and the key design decisions called out.

Subsystems (in likely migration order): graph/ · communication/ · conductor/ · api_server/ · worker/

Prerequisites: Rust; a Python↔Rust interop story (e.g., PyO3 built with maturin or equivalent); comfort designing a language-neutral wire format.

Goal

We want as much of the M* runtime as possible running in Rust — for raw
performance and, just as importantly, to get the scheduling and communication
hot paths off the Python GIL. As a hard constraint, model code stays Python.
Model authors will still write their submodules, model code, and forward passes
in Python with torch. So every Rust boundary we introduce has to preserve the
existing Python model-authoring API (with reasonable changes as needed).

This issue is to scope that migration: take the rough phasing below, validate
it against the code, and turn it into a concrete RFC with recommended ordering, the
wire-format decision, the in-process-vs-separate-process call per component, and
an assessment of what realistically can't move.

Rough phasing to react to (maintainer's initial thinking)

This is a starting point, not a spec — the point of the issue is to pressure-test
and refine it.

  1. Graph layer first. The graph logic, ready-queues, and edge/ingest
    bookkeeping (mstar/graph/GraphNode, Sequential,
    Parallel, Loop, GraphEdge) are self-contained dataflow with no GPU or
    Python-model dependency. That makes the graph layer a strong first Rust
    candidate, and it can be "plugged into" the existing worker behind the
    current Python API. Validate against the existing graph tests
    (test/modular/test_graph.py).
  2. ZMQ communication stack (mstar/communication/communicator.py,
    the BaseCommunicator interface). Natural Rust boundary; Rust has solid ZMQ
    bindings. The tensor communication stack
    (mstar/communication/tensors.py) could
    follow if it makes sense, but it's harder (see notes below).
  3. Conductor loop as a Rust process. The conductor's main loop
    (Conductor.run() in conductor.py) is much
    smaller and simpler than the worker's because it's essentially a poll-messages →
    dispatch → sleep loop. Low risk, modest reward: a good vehicle for learning
    how to stand up a Rust component in the live system before tackling anything
    hot.
  4. API server. Also relatively self-contained
    (mstar/api_server/). It's a Python process with two
    threads, so it can become a latency bottleneck under load: when there's
    a lot of data flowing in/out, or when requests are individually very
    fast, per-request Python overhead can dominate. A Rust HTTP front-end could be
    a real win there.
  5. Worker loop — scope only, don't commit. Worker.run()
    (worker.py) is the hard one: async scheduling, the
    dedicated GPU thread, speculative scheduling, FlashInfer attention planning,
    CUDA-graph capture/replay. It's unclear how much of this can run in Rust at
    all (it's tangled with torch/flashinfer/CUDA), but individual components
    almost certainly can. This needs very careful carve-out and is its own
    sub-investigation.
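
To make phase 1 concrete, here is a minimal sketch of the kind of ready-queue bookkeeping the graph layer does: count undelivered inputs per node and enqueue a node once the count hits zero. `ToyNode`, `ToyReadyQueue`, and `run_to_completion` are hypothetical names for illustration only and are not taken from mstar/graph; the real `GraphNode` / `GraphEdge` logic will differ. The point is only that this style of logic touches no torch or CUDA state, which is what makes it portable.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node; not the real GraphNode."""
    name: str
    pending_inputs: int = 0  # inbound edges that have not delivered yet
    successors: list[str] = field(default_factory=list)


class ToyReadyQueue:
    """Tracks which nodes have received every input and may be scheduled."""

    def __init__(self, edges: list[tuple[str, str]]) -> None:
        self.nodes: dict[str, ToyNode] = {}
        for src, dst in edges:
            self.nodes.setdefault(src, ToyNode(src)).successors.append(dst)
            self.nodes.setdefault(dst, ToyNode(dst)).pending_inputs += 1
        # Nodes with no inbound edges are ready immediately.
        self.ready = deque(
            n.name for n in self.nodes.values() if n.pending_inputs == 0
        )

    def pop_ready(self) -> str | None:
        return self.ready.popleft() if self.ready else None

    def mark_done(self, name: str) -> None:
        """Deliver this node's output along each of its outbound edges."""
        for succ in self.nodes[name].successors:
            node = self.nodes[succ]
            node.pending_inputs -= 1
            if node.pending_inputs == 0:
                self.ready.append(succ)


def run_to_completion(edges: list[tuple[str, str]]) -> list[str]:
    """Drain the queue, returning the order in which nodes became runnable."""
    queue = ToyReadyQueue(edges)
    order: list[str] = []
    while (name := queue.pop_ready()) is not None:
        order.append(name)
        queue.mark_done(name)
    return order
```

A Rust port of logic like this could expose the same three operations (construct, `pop_ready`, `mark_done`) through PyO3, so the existing graph tests could run against either implementation.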

Observations from the current code (fold these into the RFC)

  • Pick the wire/message format early; it's a prerequisite for most of the
    phases above. Today, messages are rich Python objects (e.g. WorkerMessage /
    ConductorMessage in mstar/utils/ipc_format) sent over ZMQ. The moment one
    end is Rust, the payload needs a language-neutral, schema'd encoding
    (msgpack / protobuf / flatbuffers / bincode-with-shared-schema). This
    decision blocks phases 2-4 above and should be made first, ideally so
    Python-to-Python keeps working unchanged during the transition.
  • Decide in-process (PyO3) vs separate-process (IPC) per component. They're
    different tools: the conductor and API server are already separate processes
    talking over the bus, so rewriting them as standalone Rust processes avoids
    PyO3 entirely. The graph layer, by contrast, lives inside the worker, so it
    wants in-process PyO3 bindings. Name this fork explicitly for each phase.
  • The GIL is the real prize. A big reason the worker carries extra threading
    machinery today (the plan-executor thread, the MSTAR_PY_SWITCH_INTERVAL_SEC
    tuning) is GIL contention between scheduling/planning and GPU submission. Rust
    components that release the GIL while they work are where the wins come from.
  • Tensor comm is genuinely harder than ZMQ comm. The tensor transport
    (tensors.py) is already partly native:
    RDMA/TCP go through Mooncake's TransferEngine (C++), wrapped behind the
    TensorTransferEngine ABC, and the orchestration touches torch tensors, CUDA
    streams, and raw data_ptrs. Porting the orchestration to Rust means careful
    torch/CUDA FFI, so this is reasonable to defer or leave partly/mostly Python.
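
As an illustration of the wire-format point, the sketch below shows what a versioned, schema'd envelope could look like from the Python side. It is an assumption-laden toy: `ToyWorkerMessage` and its fields are hypothetical (not the real `WorkerMessage`), `SCHEMA_VERSION` is invented for the example, and JSON is used only because it is in the standard library; it stands in for whichever of msgpack / protobuf / flatbuffers the RFC picks.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = 1  # hypothetical; would be bumped on breaking changes


@dataclass(frozen=True)
class ToyWorkerMessage:
    """Hypothetical control message; the fields are illustrative only."""
    kind: str
    request_id: int
    node_names: list[str]


def encode(msg: ToyWorkerMessage) -> bytes:
    """Serialize to a flat, language-neutral envelope (no pickled objects)."""
    envelope = {
        "v": SCHEMA_VERSION,
        "type": "ToyWorkerMessage",
        "body": asdict(msg),
    }
    return json.dumps(envelope, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> ToyWorkerMessage:
    """Validate the envelope, then rebuild the typed message."""
    envelope = json.loads(payload.decode("utf-8"))
    if envelope.get("v") != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {envelope.get('v')!r}")
    if envelope.get("type") != "ToyWorkerMessage":
        raise ValueError(f"unexpected message type: {envelope.get('type')!r}")
    body = envelope["body"]
    return ToyWorkerMessage(
        kind=str(body["kind"]),
        request_id=int(body["request_id"]),
        node_names=[str(n) for n in body["node_names"]],
    )
```

Because the envelope carries an explicit version and type tag, a Rust peer can reject payloads it does not understand instead of mis-parsing them, and Python-to-Python traffic can adopt the same encoding before any Rust component exists.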

Open questions for the RFC

  • For each component: in-process PyO3 or separate Rust process?
  • One wire format for everything, or different formats for control messages vs.
    tensor metadata?
  • How do we keep main releasable throughout — i.e. land each phase behind the
    existing Python interfaces, with a Python fallback, rather than a flag day?
  • What's explicitly staying Python forever (model forward, tokenization, media
    decode, anything torch/CUDA-bound)?
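
One way to approach the "keep main releasable" question is an import-time fallback: try the native extension, and use the pure-Python implementation if it is missing or explicitly disabled. The sketch below assumes a hypothetical extension module `mstar_native` exposing an `EdgeCounter` class, and a hypothetical `MSTAR_FORCE_PYTHON` environment variable; neither exists in the codebase today, and `PyEdgeCounter` is a toy stand-in for whatever Python class is being replaced.

```python
from __future__ import annotations

import importlib
import os


class PyEdgeCounter:
    """Hypothetical pure-Python reference implementation (the fallback)."""

    backend = "python"

    def __init__(self) -> None:
        self._pending = 0

    def add_edge(self) -> None:
        self._pending += 1

    def deliver(self) -> bool:
        """Record one delivery; return True once every edge has delivered."""
        self._pending -= 1
        return self._pending == 0


def load_edge_counter(force_python: bool | None = None) -> type:
    """Return the native class if available, else the Python fallback."""
    if force_python is None:
        # Hypothetical opt-out switch for debugging and A/B comparisons.
        force_python = os.environ.get("MSTAR_FORCE_PYTHON", "0") == "1"
    if not force_python:
        try:
            native = importlib.import_module("mstar_native")  # hypothetical name
            return native.EdgeCounter
        except (ImportError, AttributeError):
            pass  # extension not built or incomplete: fall through
    return PyEdgeCounter
```

Callers would depend only on the interface (`add_edge`, `deliver`), so the same test suite can run against both backends, and a build without the Rust toolchain still works.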

Deliverable / acceptance criteria

  • A reviewed RFC that: confirms or revises the phase ordering above; picks the
    wire format; makes the in-process-vs-process call per component; lists what
    stays Python; and covers build integration (maturin alongside the existing
    pyproject.toml).
  • Agreement that graph and the ZMQ communication layer are the first two
    ports, each shippable behind its current Python interface.

New to M*? Skim How it works and the Contributing guide first.

Labels: open ended (This issue requires careful scoping and design)