@jumski jumski commented Sep 18, 2025

This PR fixes a critical issue where messages for pending tasks remain in the queue indefinitely when a run fails, causing performance degradation and resource waste.

Problem

  • Failed runs left queued messages orphaned, causing workers to poll them forever
  • Map steps with N tasks would leave N-1 messages orphaned when one task failed
  • Type constraint violations would retry unnecessarily despite being deterministic failures

Solution

  • Archive the messages of all remaining queued and started tasks when a run fails
  • Handle type violations gracefully (fail immediately, no retries)
  • Prevent any retries once the run has already failed
  • Add an index for efficient message archiving
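
As a rough illustration of the index bullet, a partial index along these lines could support the archiving lookup. This is only a sketch and not the migration shipped in this PR: the index name is hypothetical, and the column choice assumes the lookup filters on run_id, status, and message_id, as the fail_task snippet quoted later in this conversation does.

```sql
-- Hypothetical index name; the columns and predicate are assumptions taken
-- from the WHERE clause of the archiving query in fail_task.
CREATE INDEX IF NOT EXISTS idx_step_tasks_archivable_messages
  ON pgflow.step_tasks (run_id)
  WHERE status IN ('queued', 'started')
    AND message_id IS NOT NULL;
```

A partial index keeps only rows that can still hold a live message, so it stays small even when most tasks in the table are already completed or failed.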

Testing

  • Added comprehensive tests for map task failures and type violations
  • All existing tests pass without regression

changeset-bot bot commented Sep 18, 2025

🦋 Changeset detected

Latest commit: 237c69f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 7 packages
| Name | Type |
| --- | --- |
| @pgflow/core | Patch |
| pgflow | Patch |
| @pgflow/client | Patch |
| @pgflow/edge-worker | Patch |
| @pgflow/example-flows | Patch |
| @pgflow/dsl | Patch |
| @pgflow/website | Patch |

coderabbitai bot commented Sep 18, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

jumski commented Sep 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.

@jumski jumski changed the title from "docs: add instructions for fixing SQL tests and updating functions" to "Fix Orphaned Messages on Run Failure" Sep 18, 2025
@jumski jumski marked this pull request as ready for review September 18, 2025 15:38
Comment on lines +115 to +121
select is(
(select count(*)::integer from pgflow.step_tasks
where run_id = :'test_run_id'::uuid
and step_slug = 'parallel_single'),
0,
'Parallel single task should not exist after type constraint violation (transaction rolled back)'
);

The test comment indicates that parallel_single tasks won't exist due to transaction rollback, but the PR changes the behavior to handle type violations gracefully without rolling back transactions. With the new implementation, these tasks might actually exist in a failed state rather than not existing at all.

Consider updating this test to match the new behavior - either by checking for failed status instead of non-existence, or by updating the comment to reflect the current implementation's expected behavior. This will ensure the test accurately validates the intended behavior of the type violation handling.

Suggested change
- select is(
-   (select count(*)::integer from pgflow.step_tasks
-    where run_id = :'test_run_id'::uuid
-    and step_slug = 'parallel_single'),
-   0,
-   'Parallel single task should not exist after type constraint violation (transaction rolled back)'
- );
+ select is(
+   (select count(*)::integer from pgflow.step_tasks
+    where run_id = :'test_run_id'::uuid
+    and step_slug = 'parallel_single'
+    and status = 'failed'),
+   1,
+   'Parallel single task should exist but be in failed state after type constraint violation (graceful handling)'
+ );

Spotted by Diamond

nx-cloud bot commented Sep 18, 2025

View your CI Pipeline Execution ↗ for commit 237c69f

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| nx affected -t lint typecheck test --parallel -... | ❌ Failed | 6m 3s | View ↗ |

☁️ Nx Cloud last updated this comment at 2025-10-06 15:42:04 UTC

@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch 2 times, most recently from c4b287c to 6602788 Compare September 18, 2025 16:04
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch 2 times, most recently from f303c04 to 79502f5 Compare September 18, 2025 20:52
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from 79502f5 to ae9a6ee Compare September 19, 2025 08:55
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from ae9a6ee to 104f337 Compare September 19, 2025 09:17
Comment on lines 172 to 182
PERFORM pgmq.archive(r.flow_slug, st.message_id)
FROM pgflow.step_tasks st
JOIN pgflow.runs r ON st.run_id = r.run_id
WHERE st.run_id = fail_task.run_id
AND st.status IN ('queued', 'started')
AND st.message_id IS NOT NULL;
END IF;

Performance and correctness issue: The archive operation uses individual pgmq.archive() calls in a loop rather than batch archiving. This is inefficient for large numbers of messages and could cause partial archiving if one call fails. Should collect message IDs and use batch archiving like in complete_task, or use a single query with array_agg() to archive all messages atomically.

Suggested change
- PERFORM pgmq.archive(r.flow_slug, st.message_id)
- FROM pgflow.step_tasks st
- JOIN pgflow.runs r ON st.run_id = r.run_id
- WHERE st.run_id = fail_task.run_id
-   AND st.status IN ('queued', 'started')
-   AND st.message_id IS NOT NULL;
- END IF;
+ WITH messages_to_archive AS (
+   SELECT r.flow_slug, array_agg(st.message_id) AS message_ids
+   FROM pgflow.step_tasks st
+   JOIN pgflow.runs r ON st.run_id = r.run_id
+   WHERE st.run_id = fail_task.run_id
+     AND st.status IN ('queued', 'started')
+     AND st.message_id IS NOT NULL
+   GROUP BY r.flow_slug
+ )
+ SELECT pgmq.archive_batch(flow_slug, message_ids)
+ FROM messages_to_archive
+ WHERE array_length(message_ids, 1) > 0;
+ END IF;

Spotted by Diamond
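
For comparison, the batching idea raised in this comment can also be sketched with the array form of pgmq.archive instead of a separately named batch function. This is a hedged sketch rather than the code merged in this PR: the function name pgflow_sketch_archive_run_messages and its parameter are hypothetical, it assumes the installed pgmq version provides the pgmq.archive(queue_name text, msg_ids bigint[]) overload, and it assumes message_id is a bigint and that the queue is named after flow_slug, as the quoted fail_task code implies.

```sql
-- Hypothetical helper, for illustration only; requires the pgflow schema
-- and the pgmq extension to be installed.
CREATE OR REPLACE FUNCTION pgflow_sketch_archive_run_messages(p_run_id uuid)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
  -- Collect the message ids per queue, then archive each group in one call.
  -- PERFORM is used because PL/pgSQL rejects a bare SELECT whose result is
  -- not captured.
  PERFORM pgmq.archive(batch.flow_slug, batch.message_ids)
  FROM (
    SELECT r.flow_slug, array_agg(st.message_id) AS message_ids
    FROM pgflow.step_tasks st
    JOIN pgflow.runs r ON st.run_id = r.run_id
    WHERE st.run_id = p_run_id
      AND st.status IN ('queued', 'started')
      AND st.message_id IS NOT NULL
    GROUP BY r.flow_slug
  ) AS batch;
END;
$$;
```

Both this variant and the original PERFORM statement execute inside the calling function's transaction, so an error raised while archiving aborts the whole statement and nothing is left half-archived. The practical difference is the number of pgmq.archive invocations, not atomicity.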

@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from 104f337 to 842f929 Compare September 19, 2025 10:21
@jumski jumski force-pushed the 09-18-chore_improve_verify-schemas-synced_and_regenerate-temp-migration branch from acdca8d to fefb86b Compare October 5, 2025 18:59
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from 842f929 to bcbe933 Compare October 5, 2025 18:59
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from bcbe933 to f9e432c Compare October 5, 2025 19:20
@jumski jumski force-pushed the 09-18-chore_improve_verify-schemas-synced_and_regenerate-temp-migration branch from fefb86b to d04fa3f Compare October 5, 2025 19:20
github-actions bot commented Oct 6, 2025

🔍 Preview Deployment: Website

Deployment successful!

🔗 Preview URL: https://pr-220.pgflow.pages.dev

📝 Details:

  • Branch: 09-18-fix-orphaned-messages-on-fail
  • Commit: 4b4d34cbad504bdf484638803f1a601891fd5ed1

github-actions bot commented Oct 6, 2025

🔍 Preview Deployment: Playground

Deployment successful!

🔗 Preview URL: https://pr-220--pgflow-demo.netlify.app

📝 Details:

  • Branch: 09-18-fix-orphaned-messages-on-fail
  • Commit: 4b4d34cbad504bdf484638803f1a601891fd5ed1

Provides guidance on fixing invalid tests, updating SQL functions, and rerunning tests
without creating migrations or using nx, to streamline test maintenance and debugging.
@jumski jumski force-pushed the 09-18-chore_improve_verify-schemas-synced_and_regenerate-temp-migration branch from d04fa3f to a467f00 Compare October 6, 2025 15:25
@jumski jumski force-pushed the 09-18-fix-orphaned-messages-on-fail branch from f9e432c to 237c69f Compare October 6, 2025 15:25
graphite-app bot commented Oct 7, 2025

Merge activity

  • Oct 7, 8:28 AM UTC: jumski added this pull request to the Graphite merge queue.
  • Oct 7, 8:29 AM UTC: CI is running for this pull request on a draft pull request (#232) due to your merge queue CI optimization settings.
  • Oct 7, 8:29 AM UTC: Merged by the Graphite merge queue via draft PR: #232.

graphite-app bot pushed a commit that referenced this pull request Oct 7, 2025
@graphite-app graphite-app bot closed this Oct 7, 2025