perf: concurrent write throughput ~28% behind memfs reference

## Summary

ATTO Disk Benchmark shows RamDrive's concurrent write throughput
(Direct I/O + Bypass Write Cache, 4 MB blocks, QD=4) is about 28% behind
WinFsp's reference `memfs-x64.exe` on the same machine (9.64 GB/s vs.
13.31 GB/s). Single in-flight IRP write throughput and cached throughput
are on par with memfs, and read throughput is within about 10%. The gap
is specific to concurrent writes to the same file.

## Measurements

ATTO 4.01, 4 MB I/O size, queue depth 4, single 256 MB test file.

| Configuration | RamDrive W | RamDrive R | memfs W | memfs R |
|---|---|---|---|---|
| No checkboxes (Cc cached) | 5.19 GB/s | 8.42 GB/s | 5.42 GB/s | 8.94 GB/s |
| Bypass Write Cache only | 2.67 GB/s | 5.85 GB/s | 2.47 GB/s | 6.51 GB/s |
| Direct I/O + Bypass Write Cache | **9.64 GB/s** | 6.30 GB/s | **13.31 GB/s** | 6.53 GB/s |

Read paths are within about 10% across configurations (RamDrive trails
by 3.5% with Direct I/O + Bypass, 5.8% cached, and 10% with Bypass
only). Single-IRP write (Bypass only, no Direct I/O → NT serializes
write-through) is within about 8%, with RamDrive ahead. The gap shows up
only when ATTO can issue 4 concurrent writes (Direct I/O + Bypass):
RamDrive scales 2.67 → 9.64 (3.6×), memfs scales 2.47 → 13.31 (5.4×).
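
The percentages and scaling factors quoted here can be re-derived from the table. The helper below is a small sketch for doing that; `AttoMath` and its method names are made up for illustration and are not part of RamDrive.

```csharp
using System;

// Hypothetical helper for re-deriving the figures in this issue from the table.
static class AttoMath
{
    // Fraction by which `measured` trails `reference` (positive means behind).
    public static double GapBehind(double measured, double reference)
        => 1.0 - measured / reference;

    // Speedup when going from one in-flight write to QD=4.
    public static double Scaling(double single, double concurrent)
        => concurrent / single;
}
```

With the Direct I/O + Bypass row, `AttoMath.GapBehind(9.64, 13.31)` is about 0.276, `AttoMath.Scaling(2.67, 9.64)` is about 3.61, and `AttoMath.Scaling(2.47, 13.31)` is about 5.39.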

## Suspected cause

`PagedFileContent.Write` (src/RamDrive.Core/Memory/PagedFileContent.cs)
uses a per-file `ReaderWriterLockSlim` and acquires the **write lock**
in its phase 3 to publish page-table entries and `memcpy` data into
pages. With four concurrent writers to the same file, all four serialize
on `EnterWriteLock`. memfs writes to a single contiguous `byte[]` and
relies on no per-file lock at all (it serializes only via the WinFsp
dispatcher).

Secondary contributors likely:
- 64-iteration page loop (4 MB / 64 KB) inside the write-lock critical
  section, with div + mod + bounds check + null check per iteration.
- Page-table indirection on every chunk: `_pages[pageIndex]` lookup
  vs. memfs's flat `byte[]` offset arithmetic.
- `Span<byte>.CopyTo` per page (64 calls per IRP) vs. one bulk
  `Buffer.MemoryCopy` of 4 MB.
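
For readers without the source open, the pattern described above looks roughly like the following. This is a reconstruction from the description only, not the real `PagedFileContent`: the class name, the constructor, allocating pages inside the lock, and returning the iteration count are all assumptions made to keep the sketch self-contained.

```csharp
using System;
using System.Threading;

// Hypothetical reconstruction of the described write path; not the real class.
sealed class LockedPagedFile
{
    public const int PageSize = 64 * 1024;   // 64 KB pages, as in the issue
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly byte[][] _pages;

    public LockedPagedFile(int pageCount) => _pages = new byte[pageCount][];

    // Returns the number of loop iterations so the per-IRP cost is visible.
    public int Write(long offset, ReadOnlySpan<byte> data)
    {
        int iterations = 0;
        _lock.EnterWriteLock();              // every concurrent writer queues here
        try
        {
            while (!data.IsEmpty)
            {
                int pageIndex = (int)(offset / PageSize);   // div
                int inPage = (int)(offset % PageSize);      // mod
                int chunk = Math.Min(PageSize - inPage, data.Length);
                // Bounds check (array index) + null check + publish.
                byte[] page = _pages[pageIndex] ??= new byte[PageSize];
                data.Slice(0, chunk).CopyTo(page.AsSpan(inPage)); // one copy per page
                data = data.Slice(chunk);
                offset += chunk;
                iterations++;
            }
        }
        finally { _lock.ExitWriteLock(); }
        return iterations;
    }
}
```

A page-aligned 4 MB write goes around the loop 64 times while holding the exclusive lock, so with QD=4 the other three writers cannot copy during that time.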

## Possible fixes (not yet evaluated)

In rough order of expected impact / risk:

1. **Move memcpy out of the write lock.** Phase 3 currently holds the
   write lock while doing both page-table publish *and* memcpy. After
   the page table is published, the memcpy could run lock-free as long
   as no concurrent truncation can free those pages (the existing CAS
   protocol against `SetLength` would need to extend to cover this).
2. **Per-page lock-free publish.** Replace the per-file write lock
   with `Interlocked.CompareExchange` on individual `_pages[i]`
   slots. Requires careful TLA+ re-modelling because `SetLength` and
   `Dispose` currently rely on the write lock for atomicity over the
   whole page table.
3. **Inline single-page fast path** for writes that fit in one page
   (skip the loop and indirection cost).
4. **Vectorized batch memcpy** when consecutive page slots happen to
   be physically adjacent in the page pool (unlikely to be the common
   case, but measurable).
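
As a starting point for option 1, here is one conservative variant: instead of going fully lock-free, the copy runs under the *read* lock and the write lock is taken only to publish missing pages. It assumes that `SetLength` frees pages only while holding the write lock, which has not been checked against the real code. `SplitLockPagedFile`, `AllPublished`, `CopyIn` and `ReadByte` are illustrative names.

```csharp
using System;
using System.Threading;

// Hypothetical sketch; assumes truncation frees pages only under the write lock.
sealed class SplitLockPagedFile
{
    public const int PageSize = 64 * 1024;
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly byte[][] _pages;

    public SplitLockPagedFile(int pageCount) => _pages = new byte[pageCount][];

    public void Write(long offset, ReadOnlySpan<byte> data)
    {
        if (data.IsEmpty) return;
        int first = (int)(offset / PageSize);
        int last = (int)((offset + data.Length - 1) / PageSize);

        while (true)
        {
            // Shared section: concurrent writers copy in parallel.
            _lock.EnterReadLock();
            try
            {
                if (AllPublished(first, last))
                {
                    CopyIn(offset, data);
                    return;
                }
            }
            finally { _lock.ExitReadLock(); }

            // Short exclusive section: publish missing page-table entries only,
            // then loop back and re-check under the read lock.
            _lock.EnterWriteLock();
            try
            {
                for (int i = first; i <= last; i++)
                    _pages[i] ??= new byte[PageSize];
            }
            finally { _lock.ExitWriteLock(); }
        }
    }

    public byte ReadByte(long offset)
    {
        _lock.EnterReadLock();
        try
        {
            byte[] page = _pages[(int)(offset / PageSize)];
            return page == null ? (byte)0 : page[(int)(offset % PageSize)];
        }
        finally { _lock.ExitReadLock(); }
    }

    private bool AllPublished(int first, int last)
    {
        for (int i = first; i <= last; i++)
            if (_pages[i] == null) return false;
        return true;
    }

    private void CopyIn(long offset, ReadOnlySpan<byte> data)
    {
        while (!data.IsEmpty)
        {
            int inPage = (int)(offset % PageSize);
            int chunk = Math.Min(PageSize - inPage, data.Length);
            data.Slice(0, chunk).CopyTo(_pages[(int)(offset / PageSize)].AsSpan(inPage));
            data = data.Slice(chunk);
            offset += chunk;
        }
    }
}
```

In the ATTO pattern, which rewrites an already-allocated 256 MB file, steady-state writes would never enter the exclusive section. Overlapping writers can still interleave bytes, which the sketch does not try to prevent.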

Any change to the lock protocol must update `tla/RamDiskSystem.tla`
and re-pass at least the Minimal config before merging.
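
The publish step of option 2 could look something like the sketch below, which installs a page with `Interlocked.CompareExchange` so that exactly one writer's page wins per slot. It covers only the publish race between writers. How `SetLength` and `Dispose` would coordinate without the write lock is the open question and is not addressed here. `CasPageTable` and `GetOrPublish` are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of a per-slot lock-free publish; truncation is not modelled.
sealed class CasPageTable
{
    public const int PageSize = 64 * 1024;
    private readonly byte[][] _pages;

    public CasPageTable(int pageCount) => _pages = new byte[pageCount][];

    // Returns the page at `index`, publishing a new one if the slot is empty.
    public byte[] GetOrPublish(int index)
    {
        byte[] existing = Volatile.Read(ref _pages[index]);
        if (existing != null) return existing;

        byte[] fresh = new byte[PageSize];
        // Install `fresh` only if the slot is still null. A losing thread gets
        // the winner's page back and drops its own allocation.
        byte[] prior = Interlocked.CompareExchange(ref _pages[index], fresh, null);
        return prior ?? fresh;
    }
}
```

A real version would presumably return the losing page to the page pool rather than leaving it to the GC.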

## Out of scope for this issue

- Read path: RamDrive 6.30 vs memfs 6.53 (within noise) — no fix needed.
- Cached read/write: numbers are dominated by NT Cache Manager flush
  scheduling, not the FS implementation; not directly actionable.
- Single-IRP write: NT serializes by design; nothing to optimize.

## Acceptance criteria

A patch should bring the Direct I/O + Bypass Write Cache (4 MB, QD=4)
write throughput within 10% of memfs on the same hardware
(target ≥ 12.0 GB/s, i.e. 0.9 × 13.31 GB/s rounded; currently
9.64 GB/s), without regressing the other ATTO cells, without breaking
integration tests under `RAMDRIVE_DIFF=1`, and without invalidating the
TLA+ model.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
