Skip to content

perf+fix: diff-based attribution, unified rebase path, daemon guard fixes#817

Merged
svarlamov merged 14 commits intomainfrom
fix/devin-review-feedback
Mar 27, 2026
Merged

perf+fix: diff-based attribution, unified rebase path, daemon guard fixes#817
svarlamov merged 14 commits intomainfrom
fix/devin-review-feedback

Conversation

@svarlamov
Copy link
Copy Markdown
Member

@svarlamov svarlamov commented Mar 27, 2026

Summary

Addresses review feedback from #813 and adds significant performance, correctness, and simplification improvements to the rebase authorship rewriting pipeline.

Performance: diff-based line attribution transfer

  • Replace the expensive char-level byte diffing (AttributionTracker::update_attributions) and VirtualAttributions wrapper with lightweight line-level diffing using the existing imara-diff Myers algorithm
  • Eliminates both VA construction overhead and O(n*m) char-level transform, replacing with O(n+m) line-level positional transfer
  • ~4x improvement for per-commit transform, ~3.4x for full pipeline
  • Fixes the duplicate-line collision bug in content-matching (positional tracking instead of HashMap<line_text, author>)

Performance: fast cached serialization

  • Restore per-file attestation text caching and metadata JSON template pre-splitting
  • Assembly is pure string concatenation instead of full serialize_to_string() per commit
  • Eliminates serialization as a bottleneck in the commit processing loop

Fix: prevent pipe deadlock in exec_git_stdin

  • Writing all stdin data before reading stdout causes deadlock when the child process's stdout pipe buffer fills (~64KB)
  • Fixed by spawning stdin writes in a separate thread, allowing concurrent read/write

Fix: diff-tree --stdin parsing for -z null-terminated output

  • With -z flag, commit SHA headers in git diff-tree --stdin output are null-terminated, not newline-terminated
  • Parser was silently returning 0 changed files, disabling the parallel blob loading optimization

Refactor: unified rebase path

  • Removed the use_line_lookup_fast_path branching that maintained two parallel transform and serialization paths
  • Both paths already used identical diff-based transfer; the only difference was serialization (cached string fragments vs structured AuthorshipLog)
  • Unified to the structured AuthorshipLog approach, eliminating a class of divergence bugs
  • Net -230 lines of code removed

Fix: per-commit prompt metrics

  • Prompt metrics (accepted_lines, overridden_lines) now update per commit instead of being frozen from initial state

Fix: file deletion/reappearance inheritance

  • Preserve attributions and file contents through file deletion so files recreated later in the rebase sequence inherit via positional diff transfer

Fix: daemon event handler guards

  • Add is_valid_oid/is_zero_oid guards to RefUpdated handler (matching Reset handler)
  • Add missing is_ancestor_commit guard to RefUpdated handler, preventing spurious rebase events on backward ref updates
  • Revert first_parent_oid onto_head hint — incorrect for multi-commit rebases
  • Swap reflog fallback order in parse_update_ref_heads (target ref before HEAD)
  • Skip rebase-like reset handling for backward resets

Other

  • Support update-ref and rebase-like reset detection in daemon mode for Graphite compatibility
  • Unignore 5 Graphite tests that now pass via update-ref hook
  • Keep pretty-printed JSON in authorship notes metadata
  • Clean up dead code (upsert_file_attestation, unused timing variables)

Test plan

  • 76 rebase integration tests pass (0 failures)
  • 21 Graphite integration tests pass
  • 24 rebase unit tests pass
  • 5 new unit tests for diff-based transfer (equal, insert, delete, replace, duplicate lines)
  • 2 new regression tests: per-commit prompt metrics, file delete-recreate attribution
  • 2 benchmark tests (ignored): full pipeline comparison, scaling 50-5000 lines
  • Lint & Format fully green
  • Ubuntu tests fully green (all 3 modes + SCM e2e)
  • Git Core Compatibility green
  • Performance Benchmarks green

🤖 Generated with Claude Code

- Add is_valid_oid/is_zero_oid guards to the RefUpdated handler in
  rewrite_events_from_semantic_events, preventing errors when the
  generic analyzer emits RefUpdated events with zero OIDs (e.g. git
  branch creation).

- Swap reflog fallback order in parse_update_ref_heads: consult the
  target ref's own reflog before HEAD's reflog, since update-ref does
  not modify HEAD and the HEAD reflog could contain an unrelated entry
  with the same OID from a concurrent operation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

The update-ref and reset-as-rebase handlers were passing
onto_head=None to build_rebase_commit_mappings, causing it to walk
all commits back to the merge_base between old_head and new_head.
For Graphite restacks with long shared history this meant scanning
hundreds of irrelevant commits.

Derive the onto hint from the first parent of new_head, which
constrains the commit walk to just the rebased commits — matching
the behavior of the regular RebaseComplete handler which gets
onto_head from stable_rebase_heads_from_worktree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Only attempt rebase commit mapping for non-ancestor resets (e.g.
Graphite restacking), not for backward resets like git reset --soft
HEAD~1 where new_head is an ancestor of old_head.  This matches the
wrapper's post_reset_hook which checks is_ancestor before calling
apply_wrapper_plumbing_rewrite_if_possible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

svarlamov and others added 2 commits March 27, 2026 04:02
…ransfer

Replace the expensive char-level byte diffing (via AttributionTracker::update_attributions)
and VirtualAttributions wrapper with a lightweight line-level diff approach using the
existing imara-diff Myers algorithm. This eliminates both the VA construction overhead
and the O(n*m) char-level transform, replacing them with O(n+m) line-level positional
transfer.

Key changes:
- Add diff_based_line_attribution_transfer() using capture_diff_slices for positional mapping
- Extract compute_line_attrs_for_changed_file() to deduplicate fast/slow path logic
- Remove VirtualAttributions wrapper construction from rebase v2 path
- Remove dead code: transform_changed_files_to_final_state, content_has_intersection_with_author_map
- Fix metrics leak bug: subtract prompt_line_metrics before deleted-file early return
- Fix clippy warnings in benchmark tests

Benchmarks show ~4x improvement for per-commit transform and ~3.4x for full pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dated handler

Address Devin review feedback:

1. Remove first_parent_oid onto_head hint from both Reset and RefUpdated handlers.
   For multi-commit rebases (D--B'--C'), first_parent_oid(C') returns B' (an
   intermediate commit), not D (the actual onto target). This causes
   build_rebase_commit_mappings to truncate the new commits list, creating
   mismatched commit mappings and losing authorship data.

2. Add missing is_ancestor_commit guard to RefUpdated handler. The Reset handler
   already had this guard to skip backward ref updates, but RefUpdated did not.
   Without it, backward ref moves (e.g. git update-ref to an older commit) would
   emit spurious rebase_complete events with incorrect commit mappings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

…eritance

Address Devin review feedback: the old transform_changed_files_to_final_state
preserved attributions and file contents when a file was deleted mid-rebase,
so that if the file reappeared in a later commit, positional diff-based
transfer could inherit from the pre-deletion state.

Both fast and slow paths now keep current_attributions and current_file_contents
intact on deletion. The slow path re-adds the subtracted metrics to maintain
balance (the file is excluded from serialized notes via existing_files check).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

…emplate

The fast path metadata template was pre-built once with initial prompt
metrics (accepted_lines, overridden_lines), so all rebased commits shared
identical metrics regardless of attribution changes. Now both fast and slow
paths track prompt_line_metrics per commit and serialize fresh metadata.

Also adds two regression tests:
- test_rebase_prompt_metrics_update_per_commit: verifies accepted_lines
  differs between commits when AI lines increase
- test_rebase_file_delete_recreate_preserves_attribution: verifies
  attributions survive a delete-recreate cycle within a rebase sequence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@svarlamov svarlamov changed the title fix: validate OIDs in RefUpdated handler and fix reflog lookup order perf+fix: diff-based attribution transfer, daemon guard fixes, and per-commit prompt metrics Mar 27, 2026
Remove the use_line_lookup_fast_path branching that maintained two parallel
transform and serialization paths. Both paths already used identical
diff-based attribution transfer logic; the only difference was serialization
strategy (cached string fragments vs structured AuthorshipLog).

The unified path uses the structured AuthorshipLog approach for all cases,
eliminating a class of divergence bugs (frozen metrics, missing state
updates, deletion handling inconsistencies) that required separate fixes
for each path.

Removed:
- use_line_lookup_fast_path flag and all branching on it
- cached_file_attestation_text per-file string cache
- serialize_file_attestation (fast-path string fragment serializer)
- serialize_attestation_from_line_attrs (fast-path line attrs serializer)
- Pre-caching of attestation text from initial state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@svarlamov svarlamov changed the title perf+fix: diff-based attribution transfer, daemon guard fixes, and per-commit prompt metrics perf+fix: diff-based attribution, unified rebase path, daemon guard fixes Mar 27, 2026
svarlamov and others added 6 commits March 27, 2026 15:17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the monolithic `collect_changed_file_contents_for_commits` into
three phases:

1. `run_diff_tree_for_commits` — fast metadata-only phase using
   `git diff-tree --stdin` to discover changed files and blob OIDs
2. `batch_read_blob_contents_parallel` — reads blob contents using up
   to 4 concurrent `git cat-file --batch` processes (chunks of 200
   OIDs each), using the established smol async pattern
3. `assemble_changed_contents` — pure data transformation

Also adds a large-scale benchmark test (`benchmark_large_scale_mixed`)
with 200 files (mixed 1k/5k lines), 100+ commits, and structured
timing output showing git time vs overhead percentage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ently

exec_git_stdin wrote all stdin data before reading stdout. When the
child process produces enough output to fill the OS pipe buffer (~64KB),
it blocks on write. But the parent is still blocked writing to stdin,
causing a deadlock. This manifests with large git cat-file --batch and
git diff-tree --stdin calls (e.g. 65+ files × 50+ commits).

Fix by spawning stdin writes in a separate thread so wait_with_output()
can drain stdout concurrently. Broken pipe errors are tolerated since
the child may exit before consuming all input.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With the -z flag, git diff-tree --stdin separates the commit SHA header
with a null byte (\0), not a newline (\n). The parser was looking for
newlines, finding none, and silently treating every commit as having
zero changed files. This meant the diff-tree optimization path never
actually detected file changes during rebase authorship rewrite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of calling serialize_to_string() per commit (which rebuilds the
entire JSON), pre-cache each file's attestation text at initialization
and only re-serialize changed files via serialize_attestation_from_line_attrs.
Note assembly becomes pure string concatenation of cached fragments +
a pre-split metadata template with commit SHA substitution.

This restores the fast serialization optimization that was removed in
the "unify fast/slow rebase paths" refactoring, while keeping the more
correct diff-based attribution transfer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unused upsert_file_attestation (replaced by fast cached
serialization), remove dead loop_attestation_ms timing variable,
and collapse nested if-let chains per clippy suggestions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@svarlamov svarlamov merged commit 89726e7 into main Mar 27, 2026
22 of 23 checks passed
@svarlamov svarlamov deleted the fix/devin-review-feedback branch March 27, 2026 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant