Skip to content

audit: robust output-file write (atomic + OSError handling) #55

@Fieldnote-Echo

Description

Robust output-file write for nboot audit --output

Problem

audit_cmd writes the rendered report to --output via a single
output.write_text(rendered) call with no error handling and no
atomic-write semantics. Failure modes that surface as raw tracebacks
in CI:

  • PermissionError — read-only mount, restricted directory
  • OSError(ENOSPC) — disk full
  • OSError(EROFS) — read-only filesystem (containers)
  • Process killed mid-write — leaves a partial / truncated file that
    downstream consumers (upload-sarif action, log aggregators) try
    to parse and fail on

Combined with issue #1 (directory rejection), these are the most
common real-world --output failure modes.

Proposed fix

Wrap the write in:

  1. Atomic write: write to a sibling temp file in the same directory
    (so the rename is on the same filesystem), then os.replace.
  2. Try/except OSError: catch and re-raise as click.ClickException
    with a clean message naming the destination and underlying errno
    reason.

Sketch (note: tmp_path is initialised to None before the
try so the cleanup path is unconditional and never tries to
reference an unbound name if tempfile.NamedTemporaryFile itself
raises):

import os, tempfile
tmp_path: Path | None = None
try:
    out_dir = output.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        mode="w", dir=out_dir, delete=False, prefix=output.name + ".",
        suffix=".tmp",
    ) as tmp:
        tmp_path = Path(tmp.name)
        tmp.write(rendered)
    os.replace(tmp_path, output)
    tmp_path = None  # ownership transferred to `output`
except OSError as e:
    raise click.ClickException(
        f"Failed to write audit report to {output}: {e.strerror or e}"
    ) from e
finally:
    if tmp_path is not None and tmp_path.exists():
        tmp_path.unlink(missing_ok=True)

(os.fsync is intentionally NOT used — durability beyond rename
isn't part of the audit's contract; per-page fsync would only
matter if we promised "report survives a crash before next boot",
which we don't.)

Acceptance criteria

  • Permission denied on --output produces a one-line
    ClickException (not a traceback).
  • Disk-full / read-only filesystem produce the same.
  • --output <missing-dir>/report.sarif either creates the
    parent directory (via mkdir(parents=True, exist_ok=True))
    or, if creation itself fails, emits a clean ClickException
    naming the missing parent.
  • If os.replace() itself raises (rare; cross-device or
    destination-locked scenarios), the temp file is unlinked
    and a clean ClickException is raised — no .tmp debris
    and no partial overwrite.
  • Partial-write scenarios leave the destination file either
    fully-written or untouched (atomic via os.replace).
  • No leftover *.tmp files in the destination directory in
    any failure path (success path or any OSError path).
  • Tests in tests/test_audit.py cover: OSError on
    tempfile.NamedTemporaryFile, OSError on tmp.write,
    OSError on os.replace, and the success path; each asserts
    no *.tmp remains.

Effort

~30-45 minutes including tests.

Source

Bot review on PR #51 (Grippy cli.py:379 MEDIUM, cli.py:422 LOW).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions