
Commit c2df4e1

feat: add RPC support (#1629)
1 parent 9838264 commit c2df4e1

9 files changed

Lines changed: 282 additions & 1 deletion


CMakeLists.txt

Lines changed: 6 additions & 0 deletions
@@ -204,6 +204,12 @@ if(SD_WEBM)
204204
endif()
205205
endif()
206206

207+
if (SD_RPC)
208+
message("-- Use RPC as backend stable-diffusion")
209+
set(GGML_RPC ON)
210+
add_definitions(-DSD_USE_RPC)
211+
endif ()
212+
207213
set(SD_LIB stable-diffusion)
208214

209215
file(GLOB SD_LIB_SOURCES CONFIGURE_DEPENDS

docs/rpc.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
1+
# Building and Using the RPC Server with `stable-diffusion.cpp`
2+
3+
This guide covers how to build a version of [the RPC server from `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) that is compatible with your version of `stable-diffusion.cpp`, so that you can manage multi-backend setups. RPC allows you to offload specific model components to a remote server.
4+
5+
> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.
6+
7+
## 1. Building `stable-diffusion.cpp` with RPC client
8+
9+
First, build the client application from source. Pass `-DSD_RPC=ON` to include the RPC backend in your client.
10+
11+
```bash
12+
mkdir build
13+
cd build
14+
cmake .. \
15+
-DSD_RPC=ON \
16+
# Add other build flags here (e.g., -DSD_VULKAN=ON)
17+
cmake --build . --config Release -j $(nproc)
18+
```
19+
20+
> **Note:** Ensure you add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, please refer to the [build.md](build.md) documentation.
21+
22+
## 2. Ensure `llama.cpp` is at the correct commit
23+
24+
`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (compatible with the `ggml` submodule) to ensure API compatibility. The commit hash for `llama.cpp` is stored in `ggml/scripts/sync-llama.last`.
25+
26+
> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.
27+
28+
1. Read the target commit hash from the submodule tracker:
29+
30+
```bash
31+
# Linux / WSL / MacOS
32+
HASH=$(cat ggml/scripts/sync-llama.last)
33+
34+
# Windows (PowerShell)
35+
$HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
36+
```
37+
38+
2. Clone `llama.cpp` at the target commit.
39+
```bash
40+
git clone https://github.com/ggml-org/llama.cpp.git
41+
cd llama.cpp
42+
git checkout $HASH
43+
```
44+
To save on download time and storage, you can use a shallow clone to download only the target commit:
45+
```bash
46+
mkdir -p llama.cpp
47+
cd llama.cpp
48+
git init
49+
git remote add origin https://github.com/ggml-org/llama.cpp.git
50+
git fetch --depth 1 origin $HASH
51+
git checkout FETCH_HEAD
52+
```
53+
54+
## 3. Build `llama.cpp` (RPC Server)
55+
56+
The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building, otherwise the server will default to using only the CPU.
57+
58+
To find the correct flags for your system, refer to the official documentation for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) repository.
59+
60+
> **Crucial:** You must include the compiler flags required for API compatibility with `stable-diffusion.cpp` (`-DGGML_MAX_NAME=128`). Without this flag, `GGML_MAX_NAME` will default to `64` for the server, and data transfers between the client and server will fail. Of course, `-DGGML_RPC=ON` must also be set.
61+
>
62+
> I recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.
63+
64+
> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (like `llama-server`), making the build significantly faster.
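
To illustrate why a mismatched `GGML_MAX_NAME` breaks transfers, here is a minimal sketch. It assumes, purely for illustration, that tensor metadata travels as a fixed-size record containing a `char name[GGML_MAX_NAME]` field. `WireTensor`, `wire_size`, and `layouts_compatible` are hypothetical names, not the actual `ggml` RPC wire format.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for tensor metadata sent over the wire.
// The real ggml RPC structures differ; only the fixed-size name field matters here.
template <std::size_t MaxName>
struct WireTensor {
    std::uint64_t id;
    std::uint32_t type;
    char          name[MaxName];
};

// Size of one record when a peer is compiled with the given GGML_MAX_NAME.
template <std::size_t MaxName>
constexpr std::size_t wire_size() {
    return sizeof(WireTensor<MaxName>);
}

// A peer can only decode the byte stream if both sides agree on the record size.
constexpr bool layouts_compatible(std::size_t client_size, std::size_t server_size) {
    return client_size == server_size;
}

// A server built with the default (64) disagrees with a client built with 128.
static_assert(wire_size<64>() != wire_size<128>(), "record sizes must differ");
```

Under this assumption, a server built with the default value would read each record at the wrong offsets, which is why both sides need `-DGGML_MAX_NAME=128`.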
65+
66+
### Linux / WSL (Vulkan)
67+
68+
```bash
69+
mkdir build
70+
cd build
71+
cmake .. -DGGML_RPC=ON \
72+
-DGGML_VULKAN=ON \
73+
-DGGML_BUILD_SHARED_LIBS=OFF \
74+
-DLLAMA_CURL=OFF \
75+
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
76+
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
77+
cmake --build . --config Release --target rpc-server -j $(nproc)
78+
```
79+
80+
### macOS (Metal)
81+
82+
```bash
83+
mkdir build
84+
cd build
85+
cmake .. -DGGML_RPC=ON \
86+
-DGGML_METAL=ON \
87+
-DGGML_BUILD_SHARED_LIBS=OFF \
88+
-DLLAMA_CURL=OFF \
89+
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
90+
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
91+
cmake --build . --config Release --target rpc-server
92+
```
93+
94+
### Windows (Visual Studio 2022, Vulkan)
95+
96+
```powershell
97+
mkdir build
98+
cd build
99+
cmake .. -G "Visual Studio 17 2022" -A x64 `
100+
-DGGML_RPC=ON `
101+
-DGGML_VULKAN=ON `
102+
-DGGML_BUILD_SHARED_LIBS=OFF `
103+
-DLLAMA_CURL=OFF `
104+
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
105+
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
106+
cmake --build . --config Release --target rpc-server
107+
```
108+
109+
## 4. Usage
110+
111+
Once both applications are built, you can run the server and the client to manage your GPU allocation.
112+
113+
### Step A: Run the RPC Server
114+
115+
Start the server. It listens for connections on the default address (usually `localhost:50052`). If your server is on a different machine, ensure the server binds to the correct interface and your firewall allows the connection.
116+
117+
**On the Server:**
118+
If running on the same machine, you can use the default address:
119+
120+
```bash
121+
./rpc-server
122+
```
123+
124+
If you want to allow connections from other machines on the network:
125+
126+
```bash
127+
./rpc-server --host 0.0.0.0
128+
```
129+
130+
> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.
131+
132+
> **Drivers & Hardware:** Ensure the Server machine has the necessary drivers installed and functional (e.g., Nvidia Drivers for CUDA, Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to CPU usage.
133+
134+
<!-- ### Step B: Check if the client is able to connect to the server and see the available devices
135+
136+
We're assuming the server is running on your local machine, and listening on the default port `50052`. If it's running on a different machine, you can replace `localhost` with the IP address of the server.
137+
138+
**On the Client:**
139+
140+
```bash
141+
./sd-cli --rpc-servers localhost:50052 --list-devices
142+
```
143+
144+
If the server is running and the client is able to connect, you should see `RPC0 localhost:50052` in the list of devices.
145+
146+
Example output:
147+
(Client built without GPU acceleration, two GPUs available on the server)
148+
149+
```
150+
List of available GGML devices:
151+
Name Description
152+
-------------------
153+
CPU AMD Ryzen 9 5900X 12-Core Processor
154+
RPC0 localhost:50052
155+
RPC1 localhost:50052
156+
``` -->
157+
158+
### Step B: Run with RPC device
159+
160+
If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.
161+
162+
Example: set the main backend to the `RPC0` device so that all the work is done on the server.
163+
164+
```bash
165+
./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052 --backend RPC0
166+
```
167+
168+
---
169+
170+
## 5. Scaling: Multiple RPC Servers
171+
172+
You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.
173+
174+
Example: A main machine (192.168.1.10) with 3 GPUs, one running CUDA and the other two running Vulkan, and a second machine (192.168.1.11) with only one GPU.
175+
176+
**On the first machine (Running two server instances):**
177+
178+
**Terminal 1 (CUDA):**
179+
180+
```bash
181+
# Linux / WSL
182+
export CUDA_VISIBLE_DEVICES=0
183+
cd ./build_cuda/bin/Release
184+
./rpc-server --host 0.0.0.0
185+
186+
# Windows PowerShell
187+
$env:CUDA_VISIBLE_DEVICES="0"
188+
cd .\build_cuda\bin\Release
189+
./rpc-server --host 0.0.0.0
190+
```
191+
192+
**Terminal 2 (Vulkan):**
193+
194+
```bash
195+
cd ./build_vulkan/bin/Release
196+
# ignore the first GPU (used by CUDA server)
197+
./rpc-server --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
198+
```
199+
200+
**On the second machine:**
201+
202+
```bash
203+
cd ./build/bin/Release
204+
./rpc-server --host 0.0.0.0
205+
```
206+
207+
**On the Client:**
208+
Pass multiple server addresses separated by commas.
209+
210+
```bash
211+
./sd-cli --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 [...]
212+
```
213+
214+
The client will map these servers to sequential device IDs (e.g., RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
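
The numbering can be sketched as follows, assuming IDs are handed out sequentially in the order the servers are listed, one ID per device that each server exposes. `assign_rpc_device_names` is a hypothetical helper written for this illustration; it is not part of `stable-diffusion.cpp` or `ggml`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: given (endpoint, device count) pairs in --rpc-servers order,
// return one (device name, endpoint) pair per device, numbered sequentially.
std::vector<std::pair<std::string, std::string>> assign_rpc_device_names(
    const std::vector<std::pair<std::string, std::size_t>>& servers) {
    std::vector<std::pair<std::string, std::string>> devices;
    std::size_t next_id = 0;
    for (const auto& server : servers) {
        for (std::size_t i = 0; i < server.second; ++i) {
            devices.emplace_back("RPC" + std::to_string(next_id), server.first);
            ++next_id;
        }
    }
    return devices;
}
```

For the three servers in the example (one, two, and one device respectively), this yields four names, `RPC0` through `RPC3`, with `RPC1` and `RPC2` both pointing at `192.168.1.10:50053`.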
215+
216+
---
217+
218+
## 6. Performance Considerations
219+
220+
RPC performance depends heavily on network bandwidth, because large weights and activations must be transferred back and forth over the network, especially for large models or high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (>1Gbps recommended). This should not be a concern if you are running the server and client on the same machine, as the data transfer will happen over the loopback interface.
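
As a rough back-of-the-envelope aid, the sketch below estimates a lower bound for the time needed to push a given number of bytes over a link. `estimate_transfer_seconds` is a hypothetical helper; it ignores latency, protocol overhead, and any caching, so real transfers will take longer.

```cpp
#include <cassert>
#include <cstdint>

// Rough lower bound on the time to move `bytes` over a link of `gbps`
// gigabits per second (1 Gbps = 1e9 bits/s). Ignores latency and overhead.
double estimate_transfer_seconds(std::uint64_t bytes, double gbps) {
    assert(gbps > 0.0);
    const double bits = static_cast<double>(bytes) * 8.0;
    return bits / (gbps * 1e9);
}
```

For example, 4×10⁹ bytes of weights over a 1 Gbps link need at least about 32 seconds, and about 3.2 seconds over 10 Gbps.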

examples/common/common.cpp

Lines changed: 5 additions & 0 deletions
@@ -427,6 +427,10 @@ ArgOptions SDContextParams::get_options() {
427427
"--params-backend",
428428
"parameter backend assignment, e.g. disk, cpu, or diffusion=disk,clip=cpu",
429429
&params_backend},
430+
{"",
431+
"--rpc-servers",
432+
"comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
433+
&rpc_servers},
430434
};
431435

432436
options.int_options = {
@@ -836,6 +840,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
836840
sd_ctx_params.stream_layers = stream_layers;
837841
sd_ctx_params.backend = effective_backend.c_str();
838842
sd_ctx_params.params_backend = effective_params_backend.c_str();
843+
sd_ctx_params.rpc_servers = rpc_servers.c_str();
839844
return sd_ctx_params;
840845
}
841846

examples/common/common.h

Lines changed: 1 addition & 0 deletions
@@ -148,6 +148,7 @@ struct SDContextParams {
148148
bool stream_layers = false;
149149
std::string backend;
150150
std::string params_backend;
151+
std::string rpc_servers;
151152
std::string effective_backend;
152153
std::string effective_params_backend;
153154
bool enable_mmap = false;

include/stable-diffusion.h

Lines changed: 1 addition & 0 deletions
@@ -220,6 +220,7 @@ typedef struct {
220220
bool stream_layers; // Enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram)
221221
const char* backend;
222222
const char* params_backend;
223+
const char* rpc_servers;
223224
} sd_ctx_params_t;
224225

225226
typedef struct {

src/core/ggml_extend_backend.cpp

Lines changed: 30 additions & 0 deletions
@@ -204,6 +204,36 @@ void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value
204204
}
205205
}
206206

207+
bool add_rpc_devices(const std::string& servers) {
208+
const std::string in = trim_copy(servers);
209+
if (in.empty()) {
210+
return true;
211+
}
212+
auto rpc_servers = split_copy(in, ',');
213+
if (rpc_servers.empty()) {
214+
LOG_ERROR("invalid RPC servers specification: '%s'", servers.c_str());
215+
return false;
216+
}
217+
ggml_backend_reg_t rpc_reg = ggml_backend_reg_by_name("RPC");
218+
if (!rpc_reg) {
219+
LOG_ERROR("RPC backend not found, cannot add RPC servers");
220+
return false;
221+
}
222+
typedef ggml_backend_reg_t (*ggml_backend_rpc_add_server_t)(const char* endpoint);
223+
ggml_backend_rpc_add_server_t ggml_backend_rpc_add_server_fn = (ggml_backend_rpc_add_server_t)ggml_backend_reg_get_proc_address(rpc_reg, "ggml_backend_rpc_add_server");
224+
if (!ggml_backend_rpc_add_server_fn) {
225+
LOG_ERROR("RPC backend does not have ggml_backend_rpc_add_server function, cannot add RPC servers");
226+
return false;
227+
}
228+
for (const auto& server : rpc_servers) {
229+
LOG_INFO("Adding RPC server: %s", server.c_str());
230+
auto reg = ggml_backend_rpc_add_server_fn(server.c_str());
231+
// no return value to check for success but should print errors from the RPC backend if it fails to add the server
232+
ggml_backend_register(reg);
233+
}
234+
return true;
235+
}
236+
207237
static void ggml_backend_load_all_once() {
208238
// If the registry already has devices and the CPU backend is present,
209239
// assume either static registration or explicit host-side preloading has

src/core/ggml_extend_backend.h

Lines changed: 1 addition & 0 deletions
@@ -73,4 +73,5 @@ ggml_backend_t sd_backend_cpu_init();
7373
bool sd_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
7474
const char* sd_backend_module_name(SDBackendModule module);
7575
void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value);
76+
bool add_rpc_devices(const std::string& servers);
7677
#endif // __SD_CORE_GGML_EXTEND_BACKEND_H__

src/model_loader.cpp

Lines changed: 14 additions & 1 deletion
@@ -1002,6 +1002,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
10021002
std::atomic<size_t> tensor_idx(0);
10031003
std::atomic<bool> failed(false);
10041004
std::vector<std::thread> workers;
1005+
std::mutex rpc_backend_mutex;
10051006

10061007
for (int i = 0; i < n_threads; ++i) {
10071008
workers.emplace_back([&, file_path, is_zip]() {
@@ -1158,7 +1159,19 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
11581159

11591160
if (dst_tensor->buffer != nullptr && !ggml_backend_buffer_is_host(dst_tensor->buffer)) {
11601161
t0 = ggml_time_ms();
1161-
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
1162+
1163+
// RPC backends require serialized access to prevent concurrency issues
1164+
const char* buffer_type_name = ggml_backend_buft_name(ggml_backend_buffer_get_type(dst_tensor->buffer));
1165+
bool is_rpc_buffer = buffer_type_name != nullptr &&
1166+
std::string(buffer_type_name).find("RPC") != std::string::npos;
1167+
1168+
if (is_rpc_buffer) {
1169+
std::lock_guard<std::mutex> lock(rpc_backend_mutex);
1170+
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
1171+
} else {
1172+
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
1173+
}
1174+
11621175
t1 = ggml_time_ms();
11631176
copy_to_backend_time_ms.fetch_add(t1 - t0);
11641177
}
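
The locking pattern in this hunk can be sketched in isolation as follows. `FakeBackend` and `upload_all` are hypothetical stand-ins for the real `ggml` backend and loader loop, and the buffer type name used in the usage note is illustrative only. The sketch also uses `std::unique_lock` with `std::defer_lock`, which is one possible way to avoid duplicating the `ggml_backend_tensor_set` call in both branches; it is not what the commit does.

```cpp
#include <cstring>
#include <mutex>
#include <thread>
#include <vector>

// Mirrors the substring check in the diff: any buffer type name containing "RPC".
bool is_rpc_buffer_name(const char* name) {
    return name != nullptr && std::strstr(name, "RPC") != nullptr;
}

// Hypothetical stand-in for a backend whose writes are not thread-safe.
struct FakeBackend {
    long writes = 0;
    void tensor_set() { ++writes; }  // unsynchronized on purpose
};

// Runs n_threads workers that each perform n_writes uploads. Uploads go through
// one shared mutex only when the destination is an RPC buffer. The non-RPC path
// assumes the backend is safe for concurrent writes, which FakeBackend is not,
// so call it with a single thread for non-RPC names.
long upload_all(FakeBackend& backend, const char* buffer_type_name, int n_threads, int n_writes) {
    std::mutex rpc_mutex;
    const bool is_rpc = is_rpc_buffer_name(buffer_type_name);
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&]() {
            for (int j = 0; j < n_writes; ++j) {
                std::unique_lock<std::mutex> lock(rpc_mutex, std::defer_lock);
                if (is_rpc) {
                    lock.lock();  // serialize RPC uploads only
                }
                backend.tensor_set();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return backend.writes;
}
```

With an RPC-style name such as `"RPC[localhost:50052]"`, four workers doing 1000 writes each always end with exactly 4000 recorded writes, because every increment happens under the lock.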

src/stable-diffusion.cpp

Lines changed: 4 additions & 0 deletions
@@ -318,6 +318,10 @@ class StableDiffusionGGML {
318318
stream_layers = sd_ctx_params->stream_layers;
319319
backend_spec = SAFE_STR(sd_ctx_params->backend);
320320
params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
321+
322+
std::string rpc_servers_spec = SAFE_STR(sd_ctx_params->rpc_servers);
323+
add_rpc_devices(rpc_servers_spec);
324+
321325
if (stream_layers && max_vram == 0.f) {
322326
LOG_WARN("--stream-layers has no effect without --max-vram set; ignoring");
323327
stream_layers = false;
