Summary
An ATTO disk benchmark shows RamDrive's concurrent write throughput
(Direct I/O + Bypass Write Cache, 4 MB blocks, QD=4) is about 28%
behind WinFsp's reference `memfs-x64.exe` on the same hardware.
Single in-flight IRP throughput, read throughput, and cached throughput
are all on par with memfs. The gap is specific to concurrent writes to
the same file.
Measurements
ATTO 4.01, 4 MB I/O size, queue depth 4, single 256 MB test file.
| Configuration | RamDrive W | RamDrive R | memfs W | memfs R |
| --- | --- | --- | --- | --- |
| No checkboxes (Cc cached) | 5.19 GB/s | 8.42 GB/s | 5.42 GB/s | 8.94 GB/s |
| Bypass Write Cache only | 2.67 GB/s | 5.85 GB/s | 2.47 GB/s | 6.51 GB/s |
| Direct I/O + Bypass Write Cache | 9.64 GB/s | 6.30 GB/s | 13.31 GB/s | 6.53 GB/s |
Read paths are within about 10% across configurations (3.5% in the
Direct I/O + Bypass case). Single-IRP write (Bypass only, no Direct I/O,
so NT serializes write-through) is also within about 8%, with RamDrive
slightly ahead.
The gap shows up only when ATTO can issue 4 concurrent writes (Direct
I/O + Bypass): RamDrive scales 2.67 → 9.64 (3.6×), memfs scales 2.47 →
13.31 (5.4×).
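The percentages and scaling factors quoted here can be re-derived from the table. The following is a minimal sketch for checking the arithmetic; the class and method names are made up for illustration, and the figures are copied from the measurements above.

```csharp
using System;

// Hypothetical helper for re-deriving the figures quoted in this issue.
static class AttoGap
{
    // Fraction by which `value` trails `reference` (negative if it leads).
    public static double Behind(double value, double reference) =>
        (reference - value) / reference;

    // Throughput ratio when going from one in-flight write to QD=4.
    public static double Scaling(double single, double concurrent) =>
        concurrent / single;

    public static void Main()
    {
        // Write throughput in GB/s, taken from the ATTO table.
        Console.WriteLine($"Concurrent write gap: {Behind(9.64, 13.31):F3}");
        Console.WriteLine($"RamDrive scaling:     {Scaling(2.67, 9.64):F2}x");
        Console.WriteLine($"memfs scaling:        {Scaling(2.47, 13.31):F2}x");
    }
}
```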
Suspected cause
`PagedFileContent.Write` (src/RamDrive.Core/Memory/PagedFileContent.cs)
uses a per-file `ReaderWriterLockSlim` and acquires the write lock
in its phase 3 to publish page-table entries and `memcpy` data into
pages. With four concurrent writers to the same file, all four serialize
on `EnterWriteLock`. memfs writes to a single contiguous `byte[]` and
relies on no per-file lock at all (it serializes only via the WinFsp
dispatcher).
Likely secondary contributors:
- 64-iteration page loop (4 MB / 64 KB) inside the write-lock critical
section, with div + mod + bounds check + null check per iteration.
- Page-table indirection on every chunk: `_pages[pageIndex]` lookup
vs. memfs's flat `byte[]` offset arithmetic.
- `Span.CopyTo` per page (64 calls per IRP) vs. one bulk
`Buffer.MemoryCopy` of 4 MB.
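To make the suspected bottleneck concrete, here is a simplified model of the locking shape described above. It is not the actual `PagedFileContent` source: the class name, the fields, and the `ReadByte` helper are assumptions for illustration, and the real code has additional phases that are omitted here.

```csharp
using System;
using System.Threading;

// Hypothetical model of a per-file write lock around a paged copy loop.
sealed class PagedBufferSketch : IDisposable
{
    public const int PageSize = 64 * 1024;   // 64 KB pages, as stated in the issue
    private readonly byte[][] _pages;
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public PagedBufferSketch(long capacity)
    {
        _pages = new byte[(int)((capacity + PageSize - 1) / PageSize)][];
    }

    // Writes must stay within the capacity given to the constructor.
    public void Write(long offset, ReadOnlySpan<byte> data)
    {
        _lock.EnterWriteLock();               // concurrent writers queue here
        try
        {
            int written = 0;
            while (written < data.Length)     // 64 iterations for a 4 MB write
            {
                long pos = offset + written;
                int pageIndex = (int)(pos / PageSize);    // div per chunk
                int pageOffset = (int)(pos % PageSize);   // mod per chunk
                int count = Math.Min(PageSize - pageOffset, data.Length - written);
                // Null check, then publish a new page into the table if needed.
                byte[] page = _pages[pageIndex] ??= new byte[PageSize];
                // The copy itself also runs inside the critical section.
                data.Slice(written, count).CopyTo(page.AsSpan(pageOffset, count));
                written += count;
            }
        }
        finally
        {
            _lock.ExitWriteLock();
        }
    }

    public byte ReadByte(long offset)
    {
        _lock.EnterReadLock();
        try
        {
            byte[] page = _pages[(int)(offset / PageSize)];
            return page == null ? (byte)0 : page[(int)(offset % PageSize)];
        }
        finally
        {
            _lock.ExitReadLock();
        }
    }

    public void Dispose() => _lock.Dispose();
}
```

In this shape, four writers at QD=4 each hold the lock for the full duration of their 4 MB copy, so the copies cannot overlap even when they target disjoint page ranges.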
Possible fixes (not yet evaluated)
In rough order of expected impact / risk:
- Move memcpy out of the write lock. Phase 3 currently holds the
write lock while doing both the page-table publish and the memcpy.
Once the page table is published, the memcpy could run lock-free as
long as no concurrent truncation can free those pages (the existing CAS
protocol against `SetLength` would need to be extended to cover this).
- Per-page lock-free publish. Replace the per-file write lock
with `Interlocked.CompareExchange` on individual `_pages[i]`
slots. Requires careful TLA+ re-modelling because `SetLength` and
`Dispose` currently rely on the write lock for atomicity over the
whole page table.
- Inline single-page fast path for writes that fit in one page
(skip the loop and indirection cost).
- Vectorized batch memcpy when consecutive page slots happen to
be physically adjacent in the page pool (unlikely to be the common
case, but measurable).
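As a rough illustration of the per-page lock-free publish idea, the sketch below installs pages with `Interlocked.CompareExchange` and copies data without holding any lock. It is an assumption-laden toy, not a proposed patch: the class and member names are hypothetical, and it deliberately ignores `SetLength` and `Dispose`, which is exactly the part that would need TLA+ re-modelling.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: CAS-published page slots, no per-file write lock.
// Truncation and disposal are NOT handled here.
sealed class LockFreePublishSketch
{
    public const int PageSize = 64 * 1024;
    private readonly byte[][] _pages;

    public LockFreePublishSketch(long capacity)
    {
        _pages = new byte[(int)((capacity + PageSize - 1) / PageSize)][];
    }

    // Returns the page in slot `index`, allocating and publishing it if empty.
    private byte[] GetOrPublish(int index)
    {
        byte[] existing = Volatile.Read(ref _pages[index]);
        if (existing != null) return existing;

        byte[] fresh = new byte[PageSize];
        // Exactly one racer installs its page; a loser adopts the winner's page.
        // A real page pool would need to take back the loser's allocation.
        return Interlocked.CompareExchange(ref _pages[index], fresh, null) ?? fresh;
    }

    // Writers to disjoint byte ranges do not block each other.
    public void Write(long offset, ReadOnlySpan<byte> data)
    {
        int written = 0;
        while (written < data.Length)
        {
            long pos = offset + written;
            int pageOffset = (int)(pos % PageSize);
            int count = Math.Min(PageSize - pageOffset, data.Length - written);
            byte[] page = GetOrPublish((int)(pos / PageSize));
            data.Slice(written, count).CopyTo(page.AsSpan(pageOffset, count));
            written += count;
        }
    }

    public byte ReadByte(long offset)
    {
        byte[] page = Volatile.Read(ref _pages[(int)(offset / PageSize)]);
        return page == null ? (byte)0 : page[(int)(offset % PageSize)];
    }
}
```

Overlapping concurrent writes remain unordered in this sketch, and nothing prevents a page from being freed mid-copy, so it only shows the publish step and not a complete protocol.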
Any change to the lock protocol must update `tla/RamDiskSystem.tla`
and re-pass at least the Minimal config before merging.
Out of scope for this issue
- Read path: RamDrive 6.30 vs memfs 6.53 (within noise) — no fix needed.
- Cached read/write: numbers are dominated by NT Cache Manager flush
scheduling, not the FS implementation; not directly actionable.
- Single-IRP write: NT serializes by design; nothing to optimize.
Acceptance criteria
A patch should bring the Direct I/O + Bypass Write Cache (4 MB, QD=4)
write throughput within 10% of memfs on the same hardware
(target ≥ 12.0 GB/s, currently 9.64 GB/s), without regressing the other
ATTO cells, without breaking integration tests under `RAMDRIVE_DIFF=1`,
and without invalidating the TLA+ model.
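The throughput target follows from the 10% tolerance: 0.9 × 13.31 GB/s is roughly 11.98 GB/s, rounded to 12.0. A small check along these lines could be applied to a re-run; the helper name and signature are hypothetical and not part of any existing test harness.

```csharp
using System;

// Hypothetical helper for the throughput part of the acceptance criteria.
static class AcceptanceCheck
{
    // True when `candidate` is no more than `tolerance` (a fraction) below `reference`.
    public static bool WithinTolerance(double candidate, double reference, double tolerance = 0.10) =>
        candidate >= reference * (1.0 - tolerance);

    public static void Main()
    {
        // Direct I/O + Bypass Write Cache write throughput, in GB/s.
        Console.WriteLine(WithinTolerance(9.64, 13.31));   // current measurement
        Console.WriteLine(WithinTolerance(12.0, 13.31));   // stated target
    }
}
```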
🤖 Generated with Claude Code