Fix union breaking schema order #1398

ilongin · 2025-10-13T12:05:23Z

Union was breaking the schema order, mixing signals from multiple objects (e.g. File). This PR fixes the issue.
Related Studio issue: https://github.com/iterative/studio/issues/12188

sourcery-ai · 2025-10-13T12:05:29Z

Reviewer's Guide

Refactors the internal column validation logic to preserve original column ordering during dataset union operations and adds a unit test to ensure the schema order remains consistent after union.

Class diagram for updated _validate_columns function

classDiagram
class _validate_columns {
  +left_columns: Iterable[ColumnElement]
  +right_columns: Iterable[ColumnElement]
  +return: list[str]
}
class ColumnElement {
  +name: str
}
_validate_columns --> ColumnElement: uses

Flow diagram for column validation and schema order preservation

flowchart TD
    A["left_columns (Iterable[ColumnElement])"] --> B["Extract left_names (list)"]
    C["right_columns (Iterable[ColumnElement])"] --> D["Extract right_names (list)"]
    B --> E["Sort left_names"]
    D --> F["Sort right_names"]
    E --> G["Compare sorted left_names and right_names"]
    F --> G
    G -- "If equal" --> H["Return left_names"]
    G -- "If not equal" --> I["Compute missing columns"]
    I --> J["Prepare error message"]

File-Level Changes

Change	Details	Files
Add test to verify schema order is preserved during union	Introduce test_union_does_not_break_schema_order in test_datachain.py; define a Meta model and helper functions add_file and add_meta for test setup; build two identical datasets, union them, save, and assert the final schema key order	`tests/unit/lib/test_datachain.py`
Refactor _validate_columns to maintain column order	Change return type from set[str] to list[str] for ordered output; collect left and right column names as lists instead of sets; compare sorted name lists for equality to detect matching schemas; use sets derived from the lists to compute missing columns when schemas differ	`src/datachain/query/dataset.py`
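
The refactor described above can be sketched roughly as follows. This is a simplified illustration, not the actual code in `src/datachain/query/dataset.py`: the `Column` class is a hypothetical stand-in for a SQLAlchemy `ColumnElement`, and the `ValueError` and its message are assumptions.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a SQLAlchemy ColumnElement; only `name` is used."""

    name: str


def validate_columns(
    left_columns: Iterable[Column], right_columns: Iterable[Column]
) -> list[str]:
    """Return the shared column names in left-hand order, or raise on mismatch."""
    left_names = [c.name for c in left_columns]
    right_names = [c.name for c in right_columns]

    # Order-insensitive equality check; the returned list keeps the left order.
    if sorted(left_names) == sorted(right_names):
        return left_names

    # Sets are used only to report what is missing, never for the result.
    missing_right = sorted(set(left_names) - set(right_names))
    missing_left = sorted(set(right_names) - set(left_names))
    raise ValueError(
        f"Cannot union: missing in right {missing_right}, "
        f"missing in left {missing_left}"
    )


left = [Column("file__path"), Column("file__size"), Column("meta__name")]
right = [Column("meta__name"), Column("file__path"), Column("file__size")]
ordered = validate_columns(left, right)
```

Because the result is built from a list rather than a set, `ordered` follows the order of `left` even though `right` lists the same names in a different order.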

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tests/unit/lib/test_datachain.py:4457-4466` </location>
<code_context>
+def test_union_does_not_break_schema_order(test_session):
</code_context>

<issue_to_address>
**suggestion (testing):** Test covers the main schema order issue but does not check for edge cases like mismatched schemas or extra/missing columns.

Add tests for mismatched schemas and extra or missing columns to ensure the union operation handles these edge cases correctly.

Suggested implementation:

```python
from pydantic import BaseModel

from datachain.lib.file import File


def test_union_does_not_break_schema_order(test_session):
    class Meta(BaseModel):
        name: str
        count: int

    def add_file(key) -> File:
        return File(path="")

    def add_meta(file) -> Meta:
        return Meta(name="meta", count=10)

def test_union_with_mismatched_schemas(test_session):
    class MetaA(BaseModel):
        name: str
        count: int

    class MetaB(BaseModel):
        name: str
        value: float

    meta_a = MetaA(name="metaA", count=5)
    meta_b = MetaB(name="metaB", value=3.14)

    # Simulate union operation
    try:
        result = [meta_a, meta_b]  # Replace with actual union logic if available
        # Check that mismatched schemas are handled (e.g., raise error or skip)
        assert not (hasattr(result[0], "value") and hasattr(result[1], "count"))
    except Exception as e:
        assert "schema" in str(e).lower()

def test_union_with_extra_columns(test_session):
    class MetaBase(BaseModel):
        name: str

    class MetaExtra(BaseModel):
        name: str
        extra: int

    meta_base = MetaBase(name="base")
    meta_extra = MetaExtra(name="extra", extra=42)

    # Simulate union operation
    result = [meta_base, meta_extra]  # Replace with actual union logic if available
    # Check that extra columns do not break the union
    assert hasattr(result[1], "extra")
    assert not hasattr(result[0], "extra")

def test_union_with_missing_columns(test_session):
    class MetaFull(BaseModel):
        name: str
        count: int

    class MetaMissing(BaseModel):
        name: str

    meta_full = MetaFull(name="full", count=10)
    meta_missing = MetaMissing(name="missing")

    # Simulate union operation
    result = [meta_full, meta_missing]  # Replace with actual union logic if available
    # Check that missing columns are handled gracefully
    assert hasattr(result[0], "count")
    assert not hasattr(result[1], "count")

```

If your codebase has a specific union operation or function, replace the list concatenation `[meta_a, meta_b]` etc. with the actual union logic to ensure the tests are meaningful. You may also want to check for specific exceptions or error messages if your union implementation raises them for schema mismatches.
</issue_to_address>

sourcery-ai · 2025-10-13T12:06:09Z

tests/unit/lib/test_datachain.py

+def test_union_does_not_break_schema_order(test_session):
+    class Meta(BaseModel):
+        name: str
+        count: int
+
+    def add_file(key) -> File:
+        return File(path="")
+
+    def add_meta(file) -> Meta:
+        return Meta(name="meta", count=10)


cloudflare-workers-and-pages · 2025-10-13T12:06:21Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`a8eaa6a`
Status:	✅ Deploy successful!
Preview URL:	https://fd7c22b9.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-12188-union-schema-c.datachain-documentation.pages.dev

codecov · 2025-10-13T12:14:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.78%. Comparing base (0b0419d) to head (a8eaa6a).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1398   +/-   ##
=======================================
  Coverage   87.78%   87.78%           
=======================================
  Files         159      159           
  Lines       15001    15003    +2     
  Branches     2163     2163           
=======================================
+ Hits        13168    13170    +2     
  Misses       1334     1334           
  Partials      499      499

Flag	Coverage Δ
datachain	`87.73% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
src/datachain/query/dataset.py	`92.99% <100.00%> (+0.01%)`	⬆️

shcheklein

Why does it fix the issue, could you explain it please?

Specifically, I don't quite understand why the columns we have in a select / subquery would affect the signal schema that we have attached to the chain / query. Is it the schema that defines the order, signals, etc.? Or is it done in some other way?

dmpetrov

One comment is inline. Let's be more strict!

dmpetrov · 2025-10-13T19:01:18Z

src/datachain/query/dataset.py

+    right_names = [c.name for c in right_columns]

-    if left_names == right_names:
+    if sorted(left_names) == sorted(right_names):


It's supposed to be much more strict: you cannot union if there is any mismatch.

If we try to be smart here, it opens a can of worms where someone will always be unhappy with the results.

More details:

(1) Number of columns. Must have.

(2) Types of columns (with some exceptions: it's OK to make a column nullable or convert int to float).

(3) Names of columns.

In SQL, only (1) and (2) are required, while (3) is ignored.

We might have issues with (2), since SQLAlchemy is not good at types. So for us it's better to use (3) in addition to (1) and ignore types (2). It should be == without any sorting.
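
A minimal sketch of the stricter rule proposed here, assuming columns are represented by their names only. The function name `validate_columns_strict` and the error messages are hypothetical, and this is not what the PR implements.

```python
from collections.abc import Sequence


def validate_columns_strict(
    left_names: Sequence[str], right_names: Sequence[str]
) -> list[str]:
    """Hypothetical strict check: same count and same names in the same positions."""
    # (1) The number of columns must match.
    if len(left_names) != len(right_names):
        raise ValueError(
            f"Cannot union: {len(left_names)} columns on the left, "
            f"{len(right_names)} on the right"
        )

    # (3) Names must match position by position, i.e. plain == with no sorting.
    for position, (left, right) in enumerate(zip(left_names, right_names)):
        if left != right:
            raise ValueError(
                f"Cannot union: column {position} is {left!r} on the left "
                f"but {right!r} on the right"
            )

    # (2) Column types are deliberately not compared in this sketch.
    return list(left_names)


same_order = validate_columns_strict(
    ["file__path", "meta__name"], ["file__path", "meta__name"]
)
```

Under this rule, two inputs with the same names in a different order are rejected, whereas the sorted comparison in this PR accepts them.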

So you are saying that we should not sort, so that we don't mix columns that are named the same but actually have different types? If it's only a matter of removing the sorting then I will add it here, but if it's more complex I would open another PR / issue for this, since it's not actually related to this PR (before my change we were comparing sets of column names, which is pretty much the same).

A lot of tests are failing after removing the sorting ... let's do this in a separate issue, as it seems non-trivial and, as mentioned above, it's not really related to this PR anyway.

ilongin · 2025-10-13T22:17:05Z

Why does it fix the issue, could you explain it please?

Specifically, I don't quite understand why the columns we have in a select / subquery would affect the signal schema that we have attached to the chain / query. Is it the schema that defines the order, signals, etc.? Or is it done in some other way?

Schema is derived directly from selected columns from built SQLAlchemy query. Note that this is the flattened schema (e.g. it has keys like file__path, file__size, etc.). We also have feature_schema, which has higher-level objects defined, like file: File, but that one is not important for this issue.

At some point in the SQLUnion logic we were validating and constructing the columns for the union, and in the process of validating we were using a set, which broke the original order.
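
To illustrate what "flattened" means here, the following sketch turns a nested description of signals into keys such as `file__path`. The `flatten_schema` helper and the nested input are hypothetical and only mirror the naming convention mentioned above; they are not DataChain's implementation.

```python
def flatten_schema(schema: dict, prefix: str = "") -> dict[str, str]:
    """Flatten nested signal definitions into `parent__child` keys, keeping order."""
    flat: dict[str, str] = {}
    for key, value in schema.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as `file` or `meta`.
            flat.update(flatten_schema(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested schema with two objects, each carrying two signals.
nested = {
    "file": {"path": "str", "size": "int"},
    "meta": {"name": "str", "count": "int"},
}
flat = flatten_schema(nested)
names = list(flat)  # dicts keep insertion order, so signals stay grouped by object
```

A list or dict keeps the signals of each object next to each other. Passing the same names through a `set` gives no ordering guarantee, which is how signals from different objects could end up interleaved.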

shcheklein · 2025-10-13T22:19:09Z

Schema is derived directly from selected columns from built SQLAlchemy query.

could you point me to it please?

ilongin · 2025-10-13T22:26:46Z

Schema is derived directly from selected columns from built SQLAlchemy query.

could you point me to it please?

Columns created out of query and sent to create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/query/dataset.py#L1908-L1928
Columns used to create schema in create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/catalog/catalog.py#L834-L836

shcheklein · 2025-10-13T22:35:50Z

that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

ilongin · 2025-10-14T10:45:38Z

that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

I've added a separate PR, #1404, where I'm experimenting with using the actual signal schema to calculate this, as it doesn't seem to be super simple (I need more time to fix that and make sure it's not breaking anything).

fixing union breaking schema order

a8eaa6a

ilongin requested review from amritghimire, dreadatour and shcheklein October 13, 2025 12:05

sourcery-ai bot reviewed Oct 13, 2025

shcheklein reviewed Oct 13, 2025

dmpetrov reviewed Oct 13, 2025

ilongin requested review from dmpetrov and shcheklein October 13, 2025 22:29

4 participants
