
perf!: use Cow to avoid OsString allocations #487

Merged
aawsome merged 2 commits into rustic-rs:main from tilpner:node-name-cow
Mar 6, 2026

Conversation

@tilpner (Contributor) commented Mar 5, 2026

This PR eliminates many temporary allocations caused by unnecessary uses of `OsString` (they remain necessary for `TreeIterator` in the absence of lending iterators).

Previously, filename unescaping accounted for 43% of all allocations, even in cases where there was nothing to unescape. This is now avoided by first checking for the presence of `\`, and reusing the original name if none is found.

`comp_to_osstr` was responsible for only 0.1% of allocations.
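
The borrow-when-possible pattern described above can be sketched as follows. This is a minimal illustration over `&str` rather than the `OsStr` the crate actually works with, and `unescape_name` is a hypothetical stand-in that resolves only a literal `\\` escape; the real unescaping covers a fuller escape grammar:

```rust
use std::borrow::Cow;

/// Hypothetical sketch: return the input unchanged (borrowed, zero
/// allocations) unless it actually contains a backslash to unescape.
fn unescape_name(name: &str) -> Cow<'_, str> {
    if !name.contains('\\') {
        // Common case: nothing to unescape, borrow the original name.
        return Cow::Borrowed(name);
    }
    // Rare case: allocate and resolve escapes (only `\\` -> `\` here,
    // for brevity; a real implementation handles the full grammar).
    let mut out = String::with_capacity(name.len());
    let mut chars = name.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            if let Some(next) = chars.next() {
                out.push(next);
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}

fn main() {
    // The common case never allocates:
    assert!(matches!(unescape_name("plain.txt"), Cow::Borrowed(_)));
    // Escaped names are still unescaped into a fresh String:
    assert_eq!(unescape_name(r"a\\b"), r"a\b");
}
```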

Results

The following measurements are not from a proper benchmarking setup, but merely repeated low-delta backups with heaptrack (without LTO and with raised codegen-units, on a laptop):

Total allocations reduced from 173M to 98M in my case.

These changes did not lower peak RSS significantly (4.7 GiB to 4.6 GiB, probably noise or slightly reduced fragmentation), which is expected for mostly temporary allocations.

Using the global allocator with glibc 2.40, there wasn't much (if any) wall-time improvement (284 s to 274 s, measured without heaptrack), but it might be more significant on platforms with weaker allocators like musl.

aawsome changed the title from "perf: use Cow to avoid OsString allocations" to "perf!: use Cow to avoid OsString allocations" on Mar 6, 2026
@aawsome (Member) commented Mar 6, 2026

About the TreeIterator: I think this is one part which needs some rework anyway: it is barely readable and too complex to understand and extend. I wanted to implement features like allowing multiple different sources for a single snapshot (e.g. stdin and a local path) or adding further source types (e.g. remote sources or using a tar as source), and this complex "monster" was almost always in the way and prevented a simple extension. (Ok, plus the parallelization enforces Send+Sync for many things.) So IMO it's OK to keep the OsString here and tackle this when reworking the general tree handling for backups.

@aawsome (Member) commented Mar 6, 2026

And another thing: there is #390, which moves away from platform-dependent OsString/OsStr. But that needs rebasing and some extra work anyway, so I think we can merge this quite small performance optimization first!

Comment thread crates/core/src/backend/node.rs Outdated
@aawsome (Member) left a comment

LGTM! Thanks a lot @tilpner

aawsome enabled auto-merge Mar 6, 2026 13:50
aawsome added this pull request to the merge queue Mar 6, 2026
Merged via the queue into rustic-rs:main with commit 00b93d9 Mar 6, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 5, 2026
## 🤖 New release

* `rustic_core`: 0.10.1 -> 0.11.0 (✓ API compatible changes)

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.11.0](rustic_core-v0.10.1...rustic_core-v0.11.0) - 2026-04-05

### Added

- Optimize hardlink creation in restore
([#495](#495))
- add exclude-if-xattr option
([#491](#491))

### Fixed

- make `ignore`'s `.git_exclude()` mirror `.git_ignore()`
([#494](#494))

### Other

- update dependencies
([#496](#496))
- preserve hardlinks on restore
([#492](#492))
- use general tree modifier in `repair snapshots`
([#463](#463))
- [**breaking**] Optimize file streaming
([#489](#489))
- [**breaking**] use Cow to avoid OsString allocations
([#487](#487))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: rustic-release-plz[bot] <182542030+rustic-release-plz[bot]@users.noreply.github.com>
yonas pushed a commit to yonasBSD/rustic_core that referenced this pull request May 11, 2026
Extends rustic-rs#12. PR rustic-rs#106 added `get_blob_cached` to amortize repeat fetches,
but `commands::dump::dump` still walks its blob list serially:


https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/commands/dump.rs#L39-L48

so on high-latency backends the throughput is bounded by per-blob
round-trip time regardless of how many concurrent connections the
backend itself permits.

This routes the loop through
`pariter::IteratorExt::parallel_map_scoped_custom`, the same family of
ordered parallel-map primitives the archiver and packer already rely on,
so fetches overlap N-wide while the writer stays single-threaded. Output
ordering is preserved bit-for-bit; an integration test backs that up by
chunking a 64 KiB payload into 16 fixed-size blobs and asserting that
the parallel and sequential dumps both equal the source bytes exactly.
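
The overlap-plus-ordered-output shape can be sketched with the standard library alone. This is illustrative only: the PR routes through pariter's ordered parallel map rather than this hand-rolled batching, and `fetch_blob` is a hypothetical stand-in for the per-blob backend round trip:

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a blob fetch dominated by backend latency.
fn fetch_blob(id: u32) -> Vec<u8> {
    thread::sleep(Duration::from_millis(5));
    id.to_le_bytes().to_vec()
}

/// Fetch blobs `num_threads`-wide per batch (num_threads > 0), but join
/// the handles in order so the output stays byte-for-byte identical to
/// a sequential dump.
fn dump_parallel(ids: &[u32], num_threads: usize, out: &mut Vec<u8>) {
    for batch in ids.chunks(num_threads) {
        thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .map(|&id| s.spawn(move || fetch_blob(id)))
                .collect();
            for h in handles {
                out.extend(h.join().expect("worker panicked"));
            }
        });
    }
}

fn main() {
    let ids: Vec<u32> = (0..16).collect();
    let mut parallel = Vec::new();
    dump_parallel(&ids, 4, &mut parallel);
    let sequential: Vec<u8> = ids.iter().flat_map(|&id| fetch_blob(id)).collect();
    // Ordering is preserved despite the overlapped fetches.
    assert_eq!(parallel, sequential);
}
```

An ordered parallel map avoids the per-batch join barrier this sketch has, which is part of why the real change can scale linearly with the worker count.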

The new `DumpOptions` carries a single `num_threads: u32` field
(Setters, `non_exhaustive`, gated behind the existing `clap` feature),
where `0` selects `available_parallelism` and `1` forces the sequential
implementation. `Repository::dump_with_opts(node, w, &DumpOptions)`
joins the existing `Repository::dump(node, w)`, which keeps its
signature and now defaults to `available_parallelism`. Both methods pick
up an `S: Sync` bound; the in-tree `IndexedFullStatus` already satisfies
it, so all in-tree call sites compile unchanged. Files that decompose
into a single blob take a sequential short-circuit and pay none of the
worker setup cost.
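
Based on that description, the options type might look roughly like this. The `num_threads` field and its 0/1 semantics come from the paragraph above; the `Setters` derive and `clap` gating are omitted, and `effective_threads` is a hypothetical helper:

```rust
/// Sketch of the described options struct (not the exact in-tree code).
#[non_exhaustive]
#[derive(Debug, Clone, Copy)]
pub struct DumpOptions {
    /// 0 = use available_parallelism, 1 = force the sequential
    /// implementation, n = use n worker threads.
    pub num_threads: u32,
}

impl Default for DumpOptions {
    fn default() -> Self {
        // Matches the new default of Repository::dump.
        Self { num_threads: 0 }
    }
}

impl DumpOptions {
    /// Resolve the effective worker count.
    fn effective_threads(&self) -> usize {
        match self.num_threads {
            0 => std::thread::available_parallelism()
                .map(std::num::NonZeroUsize::get)
                .unwrap_or(1),
            n => n as usize,
        }
    }
}

fn main() {
    let opts = DumpOptions::default();
    println!("effective workers: {}", opts.effective_threads());
}
```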

Out of scope: `dump --archive tar`, `--archive targz`, `mount`, and
`webdav` go through `vfs::OpenFile::read_at` rather than
`commands::dump::dump`. The serial blob-fetch shape is the same, but the
consumer there is pull-based (the tar writer calls `Read::read`
repeatedly with small buffers), so the right fix is read-ahead
prefetching rather than the push-style overlap this PR uses. That wants
its own change.

## Results

Same caveat as rustic-rs#487: not a proper benchmark suite. An `InMemoryBackend`
wrapper sleeps a configurable duration on every
`read_full`/`read_partial` to simulate object-store round-trip latency.
4 MiB file, 64 KiB fixed-size chunks (64 blobs), fresh `Repository` per
measurement so the in-process blob cache is cold for every run.

```
    latency      sequential        parallel  speedup
        0ms     2124 MB/s        1497 MB/s    0.70x
        1ms       54 MB/s         699 MB/s   12.89x
        5ms       12 MB/s         180 MB/s   15.15x
       20ms        3 MB/s          48 MB/s   15.65x
       50ms        1 MB/s          19 MB/s   15.72x
      100ms      0.6 MB/s           9 MB/s   15.81x
```

For any non-trivial backend latency the speedup converges to the worker
count. A thread sweep at 20 ms latency confirms linear scaling: 1, 2, 4,
8, 16, 32 threads gives 1.0x, 2.0x, 4.0x, 7.9x, 15.8x, 30.8x.

The 0 ms row is the cost. Against a free backend (warm page cache,
in-memory backend, repeat dumps that already hit `get_blob_cached`) the
parallel path runs at roughly 60-70% of sequential because thread setup
and channel passing dominate when the per-blob fetch is itself
near-instant. The single-blob short-circuit covers the smallest files,
so the regression is bounded to multi-blob files on near-zero-latency
backends, which is the case where dump throughput wasn't the bottleneck
to begin with.
