perf!: use Cow to avoid OsString allocations #487
And another thing: There is #390 which moves away from platform-dependent
## 🤖 New release

* `rustic_core`: 0.10.1 -> 0.11.0 (✓ API compatible changes)

<details><summary><i><b>Changelog</b></i></summary><p>
<blockquote>

## [0.11.0](rustic_core-v0.10.1...rustic_core-v0.11.0) - 2026-04-05

### Added
- Optimize hardlink creation in restore ([#495](#495))
- add exclude-if-xattr option ([#491](#491))

### Fixed
- make `ignore`'s `.git_exclude()` mirror `.git_ignore()` ([#494](#494))

### Other
- update dependencies ([#496](#496))
- preserve hardlinks on restore ([#492](#492))
- use general tree modifier in `repair snapshots` ([#463](#463))
- [**breaking**] Optimize file streaming ([#489](#489))
- [**breaking**] use Cow to avoid OsString allocations ([#487](#487))

</blockquote>
</p></details>

---

This PR was generated with [release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: rustic-release-plz[bot] <182542030+rustic-release-plz[bot]@users.noreply.github.com>
Extends rustic-rs#12. PR rustic-rs#106 added `get_blob_cached` to amortize repeated fetches, but `commands::dump::dump` still walks its blob list serially:

https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/commands/dump.rs#L39-L48

On high-latency backends, throughput is therefore bounded by per-blob round-trip time, regardless of how many concurrent connections the backend itself permits.

This PR routes the loop through `pariter::IteratorExt::parallel_map_scoped_custom`, the same family of ordered parallel-map primitives the archiver and packer already rely on, so fetches overlap N-wide while the writer stays single-threaded. Output ordering is preserved bit-for-bit; an integration test backs that up by chunking a 64 KiB payload into 16 fixed-size blobs and asserting that the parallel and sequential dumps both equal the source bytes exactly.

The new `DumpOptions` carries a single `num_threads: u32` field (Setters, `non_exhaustive`, gated behind the existing `clap` feature), where `0` selects `available_parallelism` and `1` forces the sequential implementation. `Repository::dump_with_opts(node, w, &DumpOptions)` joins the existing `Repository::dump(node, w)`, which keeps its signature and now defaults to `available_parallelism`. Both methods pick up an `S: Sync` bound; the in-tree `IndexedFullStatus` already satisfies it, so all in-tree call sites compile unchanged. Files that decompose into a single blob take a sequential short-circuit and pay none of the worker setup cost.

Out of scope: `dump --archive tar`, `--archive targz`, `mount`, and `webdav` go through `vfs::OpenFile::read_at` rather than `commands::dump::dump`. The serial blob-fetch shape is the same, but the consumer there is pull-based (the tar writer calls `Read::read` repeatedly with small buffers), so the right fix there is read-ahead prefetching rather than the push-style overlap this PR uses. That deserves its own change.

## Results

Same caveat as rustic-rs#487: not a proper benchmark suite.
An `InMemoryBackend` wrapper sleeps a configurable duration on every `read_full`/`read_partial` to simulate object-store round-trip latency. Setup: 4 MiB file, 64 KiB fixed-size chunks (64 blobs), fresh `Repository` per measurement so the in-process blob cache is cold for every run.

```
latency   sequential   parallel    speedup
0ms       2124 MB/s    1497 MB/s    0.70x
1ms         54 MB/s     699 MB/s   12.89x
5ms         12 MB/s     180 MB/s   15.15x
20ms         3 MB/s      48 MB/s   15.65x
50ms         1 MB/s      19 MB/s   15.72x
100ms      0.6 MB/s       9 MB/s   15.81x
```

For any non-trivial backend latency, the speedup converges to the worker count. A thread sweep at 20 ms latency confirms linear scaling: 1, 2, 4, 8, 16, and 32 threads give 1.0x, 2.0x, 4.0x, 7.9x, 15.8x, and 30.8x.

The 0 ms row is the cost. Against a free backend (warm page cache, in-memory backend, repeat dumps that already hit `get_blob_cached`), the parallel path runs at roughly 60-70% of sequential throughput, because thread setup and channel passing dominate when the per-blob fetch is itself near-instant. The single-blob short-circuit covers the smallest files, so the regression is bounded to multi-blob files on near-zero-latency backends, which is the case where dump throughput wasn't the bottleneck to begin with.
This PR eliminates many temporary allocations caused by unnecessary uses of `OsString` (they remain necessary for `TreeIterator` without lending iterators).

Previously, filename unescaping accounted for 43% of all allocations, even in cases where there was nothing to unescape. This is now avoided by checking for the presence of `\` first and using the original name unchanged if none is found. `comp_to_osstr` was only responsible for 0.1% of allocations.

## Results
The following measurements are not from a proper benchmarking setup, but merely repeated low-delta backups with heaptrack (without LTO and with raised codegen-units, on a laptop):
Total allocations reduced from 173M to 98M in my case.
These changes did not lower peak RSS significantly (4.7 GiB to 4.6 GiB; probably noise or slightly reduced fragmentation), which is expected since the eliminated allocations were mostly short-lived temporaries.
Using the global allocator with glibc 2.40, there wasn't much (if any) wall-time improvement (284 s to 274 s, measured without heaptrack), but it might be more significant on platforms with weaker allocators, such as musl.