@ilongin
Contributor

@ilongin ilongin commented Oct 13, 2025

Union was breaking the schema order by mixing signals from multiple objects (e.g. File). This PR fixes the issue.
Related Studio issue: https://github.com/iterative/studio/issues/12188

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 13, 2025

Reviewer's Guide

Refactors the internal column validation logic to preserve original column ordering during dataset union operations and adds a unit test to ensure the schema order remains consistent after union.

Class diagram for updated _validate_columns function

classDiagram
class _validate_columns {
  +left_columns: Iterable[ColumnElement]
  +right_columns: Iterable[ColumnElement]
  +return: list[str]
}
class ColumnElement {
  +name: str
}
_validate_columns --> ColumnElement: uses

Flow diagram for column validation and schema order preservation

flowchart TD
    A["left_columns (Iterable[ColumnElement])"] --> B["Extract left_names (list)"]
    C["right_columns (Iterable[ColumnElement])"] --> D["Extract right_names (list)"]
    B --> E["Sort left_names"]
    D --> F["Sort right_names"]
    E --> G["Compare sorted left_names and right_names"]
    F --> G
    G -- "If equal" --> H["Return left_names"]
    G -- "If not equal" --> I["Compute missing columns"]
    I --> J["Prepare error message"]

File-Level Changes

Change: Add test to verify schema order is preserved during union
Details:
  • Introduce test_union_does_not_break_schema_order in test_datachain.py
  • Define a Meta model and helper functions add_file and add_meta for test setup
  • Build two identical datasets, union them, save, and assert the final schema key order
Files: tests/unit/lib/test_datachain.py

Change: Refactor _validate_columns to maintain column order
Details:
  • Change return type from set[str] to list[str] for ordered output
  • Collect left and right column names as lists instead of sets
  • Compare sorted name lists for equality to detect matching schemas
  • Use sets derived from the lists to compute missing columns when schemas differ
Files: src/datachain/query/dataset.py
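
The refactor summarized above can be sketched roughly as follows. This is a minimal illustration of the described flow, not the code from the PR: the `Column` class is a hypothetical stand-in for SQLAlchemy's `ColumnElement`, and the function name, error type, and message wording are assumptions.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a ColumnElement; only `name` is used here."""

    name: str


def validate_columns(left: Iterable[Column], right: Iterable[Column]) -> list[str]:
    """Return the left-hand names in their original order if both sides match."""
    left_names = [c.name for c in left]
    right_names = [c.name for c in right]

    # Order-insensitive equality check, as in the flow diagram.
    if sorted(left_names) == sorted(right_names):
        return left_names  # a list, so the left-hand order is preserved

    # Sets are used only to work out what is missing for the error message.
    missing_left = sorted(set(right_names) - set(left_names))
    missing_right = sorted(set(left_names) - set(right_names))
    raise ValueError(
        f"Cannot union: missing in left {missing_left}, "
        f"missing in right {missing_right}"
    )


left = [Column("file__path"), Column("file__size"), Column("meta__name")]
right = [Column("meta__name"), Column("file__path"), Column("file__size")]
ordered = validate_columns(left, right)
print(ordered)  # ['file__path', 'file__size', 'meta__name']
```

Returning a list rather than a set is what keeps the left-hand column order intact for whatever consumes the result.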

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tests/unit/lib/test_datachain.py:4457-4466` </location>
<code_context>
+def test_union_does_not_break_schema_order(test_session):
</code_context>

<issue_to_address>
**suggestion (testing):** Test covers the main schema order issue but does not check for edge cases like mismatched schemas or extra/missing columns.

Add tests for mismatched schemas and extra or missing columns to ensure the union operation handles these edge cases correctly.

Suggested implementation:

```python
def test_union_does_not_break_schema_order(test_session):
    class Meta(BaseModel):
        name: str
        count: int

    def add_file(key) -> File:
        return File(path="")

    def add_meta(file) -> Meta:
        return Meta(name="meta", count=10)

def test_union_with_mismatched_schemas(test_session):
    class MetaA(BaseModel):
        name: str
        count: int

    class MetaB(BaseModel):
        name: str
        value: float

    meta_a = MetaA(name="metaA", count=5)
    meta_b = MetaB(name="metaB", value=3.14)

    # Simulate union operation
    try:
        result = [meta_a, meta_b]  # Replace with actual union logic if available
        # Check that mismatched schemas are handled (e.g., raise error or skip)
        assert not (hasattr(result[0], "value") and hasattr(result[1], "count"))
    except Exception as e:
        assert "schema" in str(e).lower()

def test_union_with_extra_columns(test_session):
    class MetaBase(BaseModel):
        name: str

    class MetaExtra(BaseModel):
        name: str
        extra: int

    meta_base = MetaBase(name="base")
    meta_extra = MetaExtra(name="extra", extra=42)

    # Simulate union operation
    result = [meta_base, meta_extra]  # Replace with actual union logic if available
    # Check that extra columns do not break the union
    assert hasattr(result[1], "extra")
    assert not hasattr(result[0], "extra")

def test_union_with_missing_columns(test_session):
    class MetaFull(BaseModel):
        name: str
        count: int

    class MetaMissing(BaseModel):
        name: str

    meta_full = MetaFull(name="full", count=10)
    meta_missing = MetaMissing(name="missing")

    # Simulate union operation
    result = [meta_full, meta_missing]  # Replace with actual union logic if available
    # Check that missing columns are handled gracefully
    assert hasattr(result[0], "count")
    assert not hasattr(result[1], "count")

```

If your codebase has a specific union operation or function, replace the list concatenation `[meta_a, meta_b]` etc. with the actual union logic to ensure the tests are meaningful. You may also want to check for specific exceptions or error messages if your union implementation raises them for schema mismatches.
</issue_to_address>


Comment on lines +4457 to +4466
def test_union_does_not_break_schema_order(test_session):
    class Meta(BaseModel):
        name: str
        count: int

    def add_file(key) -> File:
        return File(path="")

    def add_meta(file) -> Meta:
        return Meta(name="meta", count=10)

@cloudflare-workers-and-pages

Deploying datachain-documentation with Cloudflare Pages

Latest commit: a8eaa6a
Status: ✅  Deploy successful!
Preview URL: https://fd7c22b9.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-12188-union-schema-c.datachain-documentation.pages.dev

@codecov

codecov bot commented Oct 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.78%. Comparing base (0b0419d) to head (a8eaa6a).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1398   +/-   ##
=======================================
  Coverage   87.78%   87.78%           
=======================================
  Files         159      159           
  Lines       15001    15003    +2     
  Branches     2163     2163           
=======================================
+ Hits        13168    13170    +2     
  Misses       1334     1334           
  Partials      499      499           
Flag         Coverage Δ
datachain    87.73% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines          Coverage Δ
src/datachain/query/dataset.py    92.99% <100.00%> (+0.01%) ⬆️

Member

@shcheklein shcheklein left a comment

Why does it fix the issue, could you explain it please?

Specifically, I don't quite understand why the columns we have in a select / subquery would affect the signal schema that is attached to the chain / query. Is it the schema that defines the order, signals, etc.? Or is it done in some other way?

Member

@dmpetrov dmpetrov left a comment

One comment is inline. Let's be more strict!

  right_names = [c.name for c in right_columns]

- if left_names == right_names:
+ if sorted(left_names) == sorted(right_names):
Member

@dmpetrov dmpetrov Oct 13, 2025

It is supposed to be much more strict: you cannot union if there is any mismatch.

If we try to be smart here, it opens a can of worms where someone will always be unhappy with the results.

More details:

  1. Number of columns. Must have.
  2. Types of columns (with some exceptions: it's OK to make a column nullable or convert int to float).
  3. Names of columns.

In SQL, only (1) and (2) are required, while (3) is ignored.

We might have issues with (2), since SQLAlchemy is not good at types. So for us it's better to use (3) in addition to (1) and ignore types (2). It should be == without any sorting.
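
A rough sketch of the stricter rule proposed here, using checks (1) and (3) and skipping types (2). The function name, error type, and messages are hypothetical and not taken from the datachain codebase.

```python
from collections.abc import Sequence


def validate_columns_strict(
    left_names: Sequence[str], right_names: Sequence[str]
) -> list[str]:
    """Allow a union only if both sides have the same names in the same order."""
    # (1) The number of columns must match.
    if len(left_names) != len(right_names):
        raise ValueError(
            f"Cannot union: {len(left_names)} columns on the left, "
            f"{len(right_names)} on the right"
        )

    # (3) Names must match position by position: plain ==, no sorting.
    for position, (left, right) in enumerate(zip(left_names, right_names)):
        if left != right:
            raise ValueError(
                f"Cannot union: column {position} is {left!r} on the left "
                f"but {right!r} on the right"
            )

    return list(left_names)


same_order = validate_columns_strict(
    ["file__path", "file__size"], ["file__path", "file__size"]
)
print(same_order)
```

Under this rule, two sides with the same names in a different order are rejected instead of being silently realigned.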

Contributor Author

So you are saying that we should not sort, so that we don't mix up columns that have the same name but actually have different types? If it's only a matter of removing the sorting, I will do it here, but if it's more complex I would open another PR / issue for it, since it's not actually related to this PR (before my change we were comparing sets of column names, which is pretty much the same).

Contributor Author

A lot of tests fail after removing the sorting... Let's do this in a separate issue, as it seems non-trivial and, as mentioned above, it's not really related to this PR anyway.

@ilongin
Contributor Author

ilongin commented Oct 13, 2025

> Why does it fix the issue, could you explain it please?
>
> Specifically, I don't quite understand why the columns we have in a select / subquery would affect the signal schema that is attached to the chain / query. Is it the schema that defines the order, signals, etc.? Or is it done in some other way?

The schema is derived directly from the columns selected in the built SQLAlchemy query. Note that this is the flattened schema (e.g. it has keys like file__path, file__size, etc.). We also have feature_schema, which defines higher-level objects such as file: File, but that one is not relevant to this issue.

At some point in the SQLUnion logic we validate and construct the columns for the union, and during validation we were using a set, which broke the original order.
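
To illustrate why the order matters for a flattened schema, here is a small self-contained sketch. The `flatten_schema` and `signals_are_contiguous` helpers, the nested dict layout, and the `meta` signal name are all hypothetical; only the `file__path` / `file__size` key style comes from the discussion above.

```python
def flatten_schema(schema: dict, prefix: str = "") -> list[str]:
    """Flatten nested signals into ordered column names joined by '__'."""
    names: list[str] = []
    for key, value in schema.items():
        full_name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            names.extend(flatten_schema(value, full_name))
        else:
            names.append(full_name)
    return names


def signals_are_contiguous(names: list[str]) -> bool:
    """True if the columns of each top-level signal sit next to each other."""
    seen: list[str] = []
    for name in names:
        signal = name.split("__", 1)[0]
        if seen and seen[-1] == signal:
            continue  # still inside the same signal
        if signal in seen:
            return False  # the signal reappears after another one started
        seen.append(signal)
    return True


nested = {
    "file": {"path": "str", "size": "int"},
    "meta": {"name": "str", "count": "int"},
}
flat = flatten_schema(nested)
print(flat)  # ['file__path', 'file__size', 'meta__name', 'meta__count']

# A list keeps each signal's columns together; a set gives no such guarantee,
# so list(set(flat)) may interleave columns from different signals.
mixed = ["file__path", "meta__name", "file__size", "meta__count"]
print(signals_are_contiguous(flat), signals_are_contiguous(mixed))  # True False
```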

@shcheklein
Member

> The schema is derived directly from the columns selected in the built SQLAlchemy query.

could you point me to it please?

@ilongin
Contributor Author

ilongin commented Oct 13, 2025

> > The schema is derived directly from the columns selected in the built SQLAlchemy query.
>
> could you point me to it please?

  1. Columns created out of query and sent to create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/query/dataset.py#L1908-L1928
  2. Columns used to create schema in create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/catalog/catalog.py#L834-L836

@shcheklein
Member

that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

@ilongin
Contributor Author

ilongin commented Oct 14, 2025

> that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

I've opened a separate PR, #1404, where I'm experimenting with using the actual signal schema to calculate this. It doesn't seem to be simple: I need more time to fix it and make sure it doesn't break anything.
