perf: concurrent write throughput ~28% behind memfs reference #16

@hooyao

Summary

An ATTO disk benchmark shows that RamDrive's concurrent write throughput
(Direct I/O + Bypass Write Cache, 4 MB blocks, QD=4) is about 28% behind
WinFsp's reference `memfs-x64.exe` on the same hardware (9.64 GB/s vs.
13.31 GB/s). Single in-flight IRP throughput, read throughput, and cached
throughput are all roughly on par with memfs. The gap is specific to
concurrent writes to the same file.

Measurements

ATTO 4.01, 4 MB I/O size, queue depth 4, single 256 MB test file.

| Configuration | RamDrive W | RamDrive R | memfs W | memfs R |
|---|---|---|---|---|
| No checkboxes (Cc cached) | 5.19 GB/s | 8.42 GB/s | 5.42 GB/s | 8.94 GB/s |
| Bypass Write Cache only | 2.67 GB/s | 5.85 GB/s | 2.47 GB/s | 6.51 GB/s |
| Direct I/O + Bypass Write Cache | 9.64 GB/s | 6.30 GB/s | 13.31 GB/s | 6.53 GB/s |

Read paths are within about 10% across configurations (3.5% in the
Direct I/O + Bypass cell). Single-IRP write (Bypass only, no Direct I/O
→ NT serializes write-through) is also within about 8%, with RamDrive
slightly ahead. The gap shows up only when ATTO can issue 4 concurrent
writes (Direct I/O + Bypass): RamDrive scales 2.67 → 9.64 (3.6×), memfs
scales 2.47 → 13.31 (5.4×).
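
As a worked check, the scaling factors and the headline gap follow directly from the write columns of the table above (no additional measurements are assumed):

$$
\frac{9.64}{2.67} \approx 3.61, \qquad
\frac{13.31}{2.47} \approx 5.39, \qquad
1 - \frac{9.64}{13.31} \approx 0.276
$$

So with four writes in flight RamDrive gains about 3.6× over its single-IRP figure and memfs about 5.4×, which leaves RamDrive roughly 28% behind in that cell.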

Suspected cause

`PagedFileContent.Write` (src/RamDrive.Core/Memory/PagedFileContent.cs)
uses a per-file `ReaderWriterLockSlim` and acquires the write lock
in its phase 3 to publish page-table entries and `memcpy` data into
pages. With four concurrent writers to the same file, all four serialize
on `EnterWriteLock`. memfs writes to a single contiguous `byte[]` and
relies on no per-file lock at all (it serializes only via the WinFsp
dispatcher).

Likely secondary contributors:

  • 64-iteration page loop (4 MB / 64 KB) inside the write-lock critical
    section, with div + mod + bounds check + null check per iteration.
  • Page-table indirection on every chunk: `_pages[pageIndex]` lookup
    vs. memfs's flat `byte[]` offset arithmetic.
  • `Span.CopyTo` per page (64 calls per IRP) vs. one bulk
    `Buffer.MemoryCopy` of 4 MB.
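
To make the suspected pattern concrete, here is a minimal sketch of a paged write that holds a per-file write lock across the whole page loop. It is not the actual `PagedFileContent.Write`: the class name `PagedWriteSketch`, its fields, and the inline `new byte[PageSize]` allocation (standing in for the real page pool) are assumptions for illustration only.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for PagedFileContent; every name here is an assumption.
public sealed class PagedWriteSketch
{
    public const int PageSize = 64 * 1024;   // 64 KB pages, as quoted in the issue
    private readonly byte[][] _pages;        // page table: a null slot is unallocated
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public PagedWriteSketch(long capacityBytes)
    {
        _pages = new byte[(int)((capacityBytes + PageSize - 1) / PageSize)][];
    }

    // Number of page-table slots that currently hold a page.
    public int AllocatedPages
    {
        get
        {
            int count = 0;
            foreach (byte[] page in _pages)
            {
                if (page != null) count++;
            }
            return count;
        }
    }

    // Writes must stay within the capacity passed to the constructor.
    public int Write(long offset, ReadOnlySpan<byte> data)
    {
        _lock.EnterWriteLock();              // concurrent writers to this file queue here
        try
        {
            int written = 0;
            while (written < data.Length)    // 64 iterations for an aligned 4 MB write
            {
                long pos = offset + written;
                int pageIndex = (int)(pos / PageSize);    // div
                int pageOffset = (int)(pos % PageSize);   // mod
                int chunk = Math.Min(PageSize - pageOffset, data.Length - written);

                // Bounds check + null check + page-table indirection per chunk.
                byte[] page = _pages[pageIndex] ?? (_pages[pageIndex] = new byte[PageSize]);

                // One Span.CopyTo per page instead of one bulk copy.
                data.Slice(written, chunk).CopyTo(page.AsSpan(pageOffset, chunk));
                written += chunk;
            }
            return written;
        }
        finally
        {
            _lock.ExitWriteLock();
        }
    }
}
```

In this shape the copy itself runs inside the critical section, so four 4 MB writes to the same file cannot overlap their copies at all.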

Possible fixes (not yet evaluated)

In rough order of expected impact / risk:

  1. Move memcpy out of the write lock. Phase 3 currently holds the
    write lock while doing both the page-table publish and the memcpy.
    Once the page table is published, the memcpy could run lock-free as
    long as no concurrent truncation can free those pages (the existing
    CAS protocol against `SetLength` would need to extend to cover this).
  2. Per-page lock-free publish. Replace the per-file write lock
    with `Interlocked.CompareExchange` on individual `_pages[i]`
    slots. Requires careful TLA+ re-modelling because `SetLength` and
    `Dispose` currently rely on the write lock for atomicity over the
    whole page table.
  3. Inline single-page fast path for writes that fit in one page
    (skip the loop and indirection cost).
  4. Vectorized batch memcpy when consecutive page slots happen to
    be physically adjacent in the page pool (unlikely to be the common
    case, but measurable).
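
As an illustration of option 2 only, the following sketch publishes pages with `Interlocked.CompareExchange` on individual slots and copies without any per-file lock. The class `LockFreePageTableSketch` and its members are hypothetical, pages are allocated inline rather than taken from a pool, and the sketch deliberately ignores `SetLength` and `Dispose`, which is exactly the part that would need re-modelling.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of per-page lock-free publish; not the project's code.
// It does NOT handle truncation or disposal racing with writers.
public sealed class LockFreePageTableSketch
{
    public const int PageSize = 64 * 1024;
    private readonly byte[][] _pages;        // null slot = not yet published

    public LockFreePageTableSketch(int pageCount)
    {
        _pages = new byte[pageCount][];
    }

    // Returns the page in this slot, publishing a fresh one if the slot is empty.
    private byte[] GetOrPublish(int pageIndex)
    {
        byte[] existing = Volatile.Read(ref _pages[pageIndex]);
        if (existing != null) return existing;

        byte[] fresh = new byte[PageSize];
        // Install 'fresh' only if the slot is still null; returns the prior value.
        byte[] prior = Interlocked.CompareExchange(ref _pages[pageIndex], fresh, null);
        // If another writer won the race, use its page. With a real page pool
        // the losing page would have to be returned to the pool here.
        return prior ?? fresh;
    }

    public int Write(long offset, ReadOnlySpan<byte> data)
    {
        int written = 0;
        while (written < data.Length)
        {
            long pos = offset + written;
            int pageIndex = (int)(pos / PageSize);
            int pageOffset = (int)(pos % PageSize);
            int chunk = Math.Min(PageSize - pageOffset, data.Length - written);

            byte[] page = GetOrPublish(pageIndex);
            // Overlapping concurrent writes may interleave at page granularity.
            data.Slice(written, chunk).CopyTo(page.AsSpan(pageOffset, chunk));
            written += chunk;
        }
        return written;
    }

    // Reads one byte; unpublished pages read as zero.
    public byte ReadByte(long offset)
    {
        byte[] page = Volatile.Read(ref _pages[(int)(offset / PageSize)]);
        return page == null ? (byte)0 : page[(int)(offset % PageSize)];
    }
}
```

The compare-and-swap guarantees that each slot is published exactly once, but nothing here stops a concurrent truncation from freeing a page mid-copy, so this is a starting point for the TLA+ work rather than a drop-in replacement.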

Any change to the lock protocol must update `tla/RamDiskSystem.tla`
and re-pass at least the Minimal config before merging.

Out of scope for this issue

  • Read path: RamDrive 6.30 vs memfs 6.53 (within noise) — no fix needed.
  • Cached read/write: numbers are dominated by NT Cache Manager flush
    scheduling, not the FS implementation; not directly actionable.
  • Single-IRP write: NT serializes by design; nothing to optimize.

Acceptance criteria

A patch should bring the Direct I/O + Bypass Write Cache (4 MB, QD=4)
write throughput within 10% of memfs on the same hardware
(target ≥ 12.0 GB/s, currently 9.64 GB/s), without regressing the other
ATTO cells, without breaking integration tests under `RAMDRIVE_DIFF=1`,
and without invalidating the TLA+ model.
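
For reference, the target figure follows from the memfs write number in the measurements, assuming memfs itself still measures 13.31 GB/s when the patch is evaluated:

$$
0.90 \times 13.31 \approx 11.98\ \text{GB/s}, \qquad
\frac{12.0}{9.64} \approx 1.245
$$

That is, reaching the target means roughly a 24% improvement over the current 9.64 GB/s.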

🤖 Generated with Claude Code
