Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 0-Task Plans in Single-Shard Router When Updating a Local Table with Reference Table in Subquery #7897

Open
wants to merge 10 commits into
base: release-13.0
Choose a base branch
from

Conversation

m3hm3t
Copy link
Contributor

@m3hm3t m3hm3t commented Feb 12, 2025

This PR fixes an issue #7891 in the Citus planner where an UPDATE on a local table with a subquery referencing a reference table could produce a 0-task plan. Historically, the planner sometimes failed to detect that both the target and referenced tables were effectively “local,” assigning INVALID_SHARD_ID and yielding a no-op plan.

Root Cause

  • In the Citus router logic (PlanRouterQuery), we relied on shardId to determine whether a query should be routed to a single shard.
  • If shardId == INVALID_SHARD_ID, but we also had not marked the query as a “local table modification,” the code path would produce zero tasks.
  • Local + reference tables do not require multi-shard routing. Failing to detect this “purely local” scenario caused Citus to incorrectly route to zero tasks.

Changes

Enhanced Local Table Detection

  • Updated IsLocalTableModification and related checks to consider both local and reference tables as “local” for planning, preventing the 0-task scenario.
  • Expanded ContainsOnlyLocalOrReferenceTables to return true if there are no fully distributed tables in the query.

Added Regress Test

  • Introduced a new regress test (issue_7891.sql) which reproduces the scenario.
  • Verifies we get a valid single- or local-task plan rather than a 0-task plan.

@m3hm3t m3hm3t self-assigned this Feb 12, 2025
Copy link

codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.42%. Comparing base (c55bc8c) to head (ce9915d).
Report is 20 commits behind head on release-13.0.

Additional details and impacted files
@@               Coverage Diff                @@
##           release-13.0    #7897      +/-   ##
================================================
- Coverage         89.48%   89.42%   -0.06%     
================================================
  Files               276      276              
  Lines             60063    59895     -168     
  Branches           7524     7510      -14     
================================================
- Hits              53747    53561     -186     
- Misses             4166     4188      +22     
+ Partials           2150     2146       -4     

@m3hm3t m3hm3t changed the title Simplify ContainsOnlyLocalTables function to check only for distribut… Fix 0-Task Plans in Single-Shard Router When Updating a Local Table with Reference Table in Subquery Feb 12, 2025
@m3hm3t m3hm3t added the bug label Feb 12, 2025
@@ -1583,7 +1583,7 @@ IsLocalTableModification(Oid targetRelationId, Query *query, uint64 shardId,
return true;
}

if (shardId == INVALID_SHARD_ID && ContainsOnlyLocalTables(rteProperties))
if (shardId == INVALID_SHARD_ID && ContainsOnlyLocalOrReferenceTables(rteProperties))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I update the function name and description based on the adjustments here?

@m3hm3t m3hm3t marked this pull request as ready for review February 13, 2025 08:16
WHERE EXISTS (
SELECT (SELECT c15 FROM t2_ref)
FROM t4_pg
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the test also show the query plan ? EXPLAIN (VERBOSE, COSTS OFF) output for this query, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding it, but it behaves differently across different PostgreSQL versions. In this case, I don't think it's worth maintaining separate test outputs. Do you have any suggestions on how to handle this?

https://github.com/citusdata/citus/actions/runs/13288741578

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to handle this is to use a helper function explain_with_pg17_initplan_format() (see the test columnar_chunk_filtering.sql for an example of its usage), but even with that I think this difference in EXPLAIN format requires more than one test output, and I agree its not worth maintaining. Query correctness is the main concern here. So let's drop this suggestion if it means 2 output files.


-- Show final data
SELECT 't6_pg after' AS label, * FROM t6_pg;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include a test that ensures updates on a reference table are handled correctly ? That is, all copies of the reference table are updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

{
return !rteProperties->hasDistributedTable && !rteProperties->hasReferenceTable;
/* If hasDistributedTable is false, then all tables are either local or reference. */
return !rteProperties->hasDistributedTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this enable a reference table to be updated ? e.g. update ref_table where <clause using local and ref tables> ?

Copy link
Contributor Author

@m3hm3t m3hm3t Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE t2_ref
   SET c15 = '2099-01-01 00:00:00'::timestamp
 WHERE EXISTS (
    SELECT 1
    FROM t6_pg
 );
ERROR:  relation t6_pg is not distributed

I executed the query above and encountered an error. Should we keep it? May they be duplicated or redundant?

Copy link
Contributor

@colm-mchugh colm-mchugh Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get the same error with main, so it is expected. I'm in favor of keeping, even if the error condition is tested elsewhere, as it serves to document the current behavior. Can we add a test query where t6_pg is replaced by a reference table ? Something like:

create table t3_ref(pkey int, c15 timestamp);
select create_reference_table('t3_ref');
insert into t3_ref values (99, now());

UPDATE t2_ref
   SET c15 = '2088-08-08 00:00:00'::timestamp
 WHERE EXISTS (
    SELECT 1
    FROM t3_ref
 );

and verify that t2_ref is updated on all nodes ? I think it should be, because the modify planner will not be looking at an invalid shard-id, but just to cover all bases... Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m3hm3t there was a typo in my previous comment - it should have read: verify that t2_ref is updated on all nodes.

I checked this case using your branch and it is okay; the reference table is updated correctly - the check I used is below, if you want to include it in the regress test.

This was the one remaining case I wanted to check, the PR now looks good to me overall.

create table t3_ref(pkey int, c15 text);
select create_reference_table('t3_ref');
insert into t3_ref values (99, 'Initial Data');

UPDATE t2_ref SET c15 = '2088-08-08 00:00:00'::timestamp WHERE EXISTS ( SELECT 1 FROM t3_ref);

SELECT 't2_ref after UPDATE' AS label, * FROM t2_ref;

SELECT citus_remove_node('localhost', :worker_2_port);

SELECT 't2_ref after UPDATE - without worker 2' AS label, * FROM t2_ref;

SELECT 1 FROM citus_add_node('localhost', :worker_2_port);
SELECT citus_remove_node('localhost', :worker_1_port);

SELECT 't2_ref after UPDATE - without worker 1' AS label, * FROM t2_ref;

SELECT 1 FROM citus_add_node('localhost', :worker_1_port);

-- - The outer subquery iterates over every row in t4_pg.
-- - The inner subquery selects c15 from t2_ref.
-- 5. The update should occur if the nested subquery returns any row, effectively updating t6_pg's vkey to 43.
-- 6. The final state of t6_pg is displayed to confirm that the update was applied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue present for DELETE and MERGE ? Can we include tests for those ? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the restrictions in the MERGE section, do you think keeping these tests might make them duplicated?

MERGE INTO t6_pg AS tgt
USING t4_pg AS src
ON (tgt.pkey = 14000)    -- trivial condition to "match" row pkey=14000
WHEN MATCHED THEN
    UPDATE SET c26 = 'merged_' || (SELECT pkey FROM t2_ref WHERE pkey=24000 LIMIT 1)
WHEN NOT MATCHED THEN
    INSERT (vkey, pkey, c26)
    VALUES (99, src.pkey, 'inserted_via_merge');
ERROR:  non-IMMUTABLE functions are not yet supported in MERGE sql with distributed tables
MERGE INTO t6_pg AS tgt
USING t4_pg AS src
  ON (tgt.pkey = src.pkey)
WHEN MATCHED THEN
  UPDATE SET c26 = 'merged_value'
WHEN NOT MATCHED THEN
  INSERT (vkey, pkey, c26)
  VALUES (src.vkey, src.pkey, 'inserted_via_merge');
SELECT 't6_pg after MERGE' AS label, * FROM t6_pg;
       label       | vkey | pkey  |        c26
---------------------------------------------------------------------
 t6_pg after MERGE |   43 | 12000 | initial
 t6_pg after MERGE |   43 | 14000 | to_merge
 t6_pg after MERGE |    5 | 15000 | inserted_via_merge
(3 rows)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per previous comment, I'm in favor of keeping because it documents the current behavior. But I'll leave it to your call. I've a couple of test suggestions though, can we add:
MERGE into a reference table using a postgres table as source
MERGE into a reference table using another reference table as source

Copy link
Contributor

@colm-mchugh colm-mchugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, thanks for accommodating the test requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants