-
Notifications
You must be signed in to change notification settings - Fork 132
Fix union breaking schema order #1398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideRefactors the internal column validation logic to preserve original column ordering during dataset union operations and adds a unit test to ensure the schema order remains consistent after union. Class diagram for updated _validate_columns functionclassDiagram
class _validate_columns {
+left_columns: Iterable[ColumnElement]
+right_columns: Iterable[ColumnElement]
+return: list[str]
}
class ColumnElement {
+name: str
}
_validate_columns --> ColumnElement: uses
Flow diagram for column validation and schema order preservationflowchart TD
A["left_columns (Iterable[ColumnElement])"] --> B["Extract left_names (list)"]
C["right_columns (Iterable[ColumnElement])"] --> D["Extract right_names (list)"]
B --> E["Sort left_names"]
D --> F["Sort right_names"]
E --> G["Compare sorted left_names and right_names"]
F --> G
G -- "If equal" --> H["Return left_names"]
G -- "If not equal" --> I["Compute missing columns"]
I --> J["Prepare error message"]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tests/unit/lib/test_datachain.py:4457-4466` </location>
<code_context>
+def test_union_does_not_break_schema_order(test_session):
</code_context>
<issue_to_address>
**suggestion (testing):** Test covers the main schema order issue but does not check for edge cases like mismatched schemas or extra/missing columns.
Add tests for mismatched schemas and extra or missing columns to ensure the union operation handles these edge cases correctly.
Suggested implementation:
```python
def test_union_does_not_break_schema_order(test_session):
class Meta(BaseModel):
name: str
count: int
def add_file(key) -> File:
return File(path="")
def add_meta(file) -> Meta:
return Meta(name="meta", count=10)
def test_union_with_mismatched_schemas(test_session):
class MetaA(BaseModel):
name: str
count: int
class MetaB(BaseModel):
name: str
value: float
meta_a = MetaA(name="metaA", count=5)
meta_b = MetaB(name="metaB", value=3.14)
# Simulate union operation
try:
result = [meta_a, meta_b] # Replace with actual union logic if available
# Check that mismatched schemas are handled (e.g., raise error or skip)
assert not (hasattr(result[0], "value") and hasattr(result[1], "count"))
except Exception as e:
assert "schema" in str(e).lower()
def test_union_with_extra_columns(test_session):
class MetaBase(BaseModel):
name: str
class MetaExtra(BaseModel):
name: str
extra: int
meta_base = MetaBase(name="base")
meta_extra = MetaExtra(name="extra", extra=42)
# Simulate union operation
result = [meta_base, meta_extra] # Replace with actual union logic if available
# Check that extra columns do not break the union
assert hasattr(result[1], "extra")
assert not hasattr(result[0], "extra")
def test_union_with_missing_columns(test_session):
class MetaFull(BaseModel):
name: str
count: int
class MetaMissing(BaseModel):
name: str
meta_full = MetaFull(name="full", count=10)
meta_missing = MetaMissing(name="missing")
# Simulate union operation
result = [meta_full, meta_missing] # Replace with actual union logic if available
# Check that missing columns are handled gracefully
assert hasattr(result[0], "count")
assert not hasattr(result[1], "count")
```
If your codebase has a specific union operation or function, replace the list concatenation `[meta_a, meta_b]` etc. with the actual union logic to ensure the tests are meaningful. You may also want to check for specific exceptions or error messages if your union implementation raises them for schema mismatches.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def test_union_does_not_break_schema_order(test_session): | ||
| class Meta(BaseModel): | ||
| name: str | ||
| count: int | ||
|
|
||
| def add_file(key) -> File: | ||
| return File(path="") | ||
|
|
||
| def add_meta(file) -> Meta: | ||
| return Meta(name="meta", count=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Test covers the main schema order issue but does not check for edge cases like mismatched schemas or extra/missing columns.
Add tests for mismatched schemas and extra or missing columns to ensure the union operation handles these edge cases correctly.
Suggested implementation:
def test_union_does_not_break_schema_order(test_session):
class Meta(BaseModel):
name: str
count: int
def add_file(key) -> File:
return File(path="")
def add_meta(file) -> Meta:
return Meta(name="meta", count=10)
def test_union_with_mismatched_schemas(test_session):
class MetaA(BaseModel):
name: str
count: int
class MetaB(BaseModel):
name: str
value: float
meta_a = MetaA(name="metaA", count=5)
meta_b = MetaB(name="metaB", value=3.14)
# Simulate union operation
try:
result = [meta_a, meta_b] # Replace with actual union logic if available
# Check that mismatched schemas are handled (e.g., raise error or skip)
assert not (hasattr(result[0], "value") and hasattr(result[1], "count"))
except Exception as e:
assert "schema" in str(e).lower()
def test_union_with_extra_columns(test_session):
class MetaBase(BaseModel):
name: str
class MetaExtra(BaseModel):
name: str
extra: int
meta_base = MetaBase(name="base")
meta_extra = MetaExtra(name="extra", extra=42)
# Simulate union operation
result = [meta_base, meta_extra] # Replace with actual union logic if available
# Check that extra columns do not break the union
assert hasattr(result[1], "extra")
assert not hasattr(result[0], "extra")
def test_union_with_missing_columns(test_session):
class MetaFull(BaseModel):
name: str
count: int
class MetaMissing(BaseModel):
name: str
meta_full = MetaFull(name="full", count=10)
meta_missing = MetaMissing(name="missing")
# Simulate union operation
result = [meta_full, meta_missing] # Replace with actual union logic if available
# Check that missing columns are handled gracefully
assert hasattr(result[0], "count")
assert not hasattr(result[1], "count")If your codebase has a specific union operation or function, replace the list concatenation [meta_a, meta_b] etc. with the actual union logic to ensure the tests are meaningful. You may also want to check for specific exceptions or error messages if your union implementation raises them for schema mismatches.
Deploying datachain-documentation with
|
| Latest commit: |
a8eaa6a
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://fd7c22b9.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-12188-union-schema-c.datachain-documentation.pages.dev |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1398 +/- ##
=======================================
Coverage 87.78% 87.78%
=======================================
Files 159 159
Lines 15001 15003 +2
Branches 2163 2163
=======================================
+ Hits 13168 13170 +2
Misses 1334 1334
Partials 499 499
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it fix the issue, could you explain it please?
Specifically, I don't quite understand why would columns we have in select / subquery affect the signal schema that we have attached to the chain /query. Is it schema that defines the order / signals, etc? Or is it done in some other different way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment is inline. Let's be more strict!
| right_names = [c.name for c in right_columns] | ||
|
|
||
| if left_names == right_names: | ||
| if sorted(left_names) == sorted(right_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It suppose to be much more strict - you cannot union if any mismatch.
If we try to be smart here, it opens a can of worms when someone will be always not happy with the results.
More details:
- Number of columns. Must have.
- Types of columns (some exception: it's ok to make it nullable or convert int to float).
- Names of columns.
In SQL, they require only (1) and (2) while ignoring (3).
We might have issues with (2) - sqlalchemy is not good at types. So, for us it's better to use (3) in addition to (1) and ignore types (2). It should be == without any sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are saying that we should not sort in order to not mix columns that are named the same but actually have different types? If it's only a matter of removing sorting then I will add it here, but if it's more complex I would add another PR / issue for this since it's not actually related to this PR (before my change we were comparing sets of column names which is pretty much the same)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of tests failing after removing sorting ... let's do this in a separate issue as it's not trivial it seems ,and as mention above, it's not really related to this PR anyway.
Schema is derived directly from selected columns from built SQLAlchemy query. Note that this is flatten schema (e.g it has keys like At some point in |
could you point me to it please? |
|
|
that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ... |
I've added separate PR where I'm experimenting with using actual signals schema for calculating this, as it's not super simple it seems (need more time fixing that and making sure it's not breaking) #1404 |
Union was breaking schema order mixing signals from multiple objects (e.g
File). This PR fixes this issue.Related Studio issue: https://github.com/iterative/studio/issues/12188