@ilongin ilongin commented Oct 14, 2025

WIP

Summary by Sourcery

Use the signals schema to compute both the hierarchical and the flat dataset schema, and pass them through the save and dataset-creation APIs, eliminating the manual column-based schema logic.

New Features:

  • Add clone_with_sys_signals method to SignalSchema for merging system and user signals

Enhancements:

  • Generate a flat_schema from signals_schema (including system signals) for dataset storage
  • Propagate flat_schema through dataset save and create_dataset calls to replace ad-hoc column-based schema building
  • Remove inline column-based schema computation in catalog.create_dataset and pass flat_schema to metastore.create_dataset
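To make the clone_with_sys_signals idea concrete, here is a minimal sketch of merging system and user signals into a new schema object. ToySignalSchema and the SYS_SIGNALS mapping are hypothetical stand-ins written for illustration; they are not datachain's SignalSchema or its actual system column definitions, and the real method may behave differently.

```python
from dataclasses import dataclass, field

# Assumed system signals for illustration only; the real ones are defined
# by the library and may differ in name and type.
SYS_SIGNALS = {"sys__id": int, "sys__rand": int}


@dataclass(frozen=True)
class ToySignalSchema:
    """Illustrative stand-in, not datachain's SignalSchema."""

    values: dict = field(default_factory=dict)

    def clone_with_sys_signals(self) -> "ToySignalSchema":
        # Drop system entries that are already present, then place the
        # system signals first so they are never duplicated.
        user = {k: v for k, v in self.values.items() if k not in SYS_SIGNALS}
        return ToySignalSchema({**SYS_SIGNALS, **user})


schema = ToySignalSchema({"name": str, "score": float})
merged = schema.clone_with_sys_signals()
```

Returning a new object instead of mutating in place keeps the user-facing schema free of system signals while the merged copy is used for storage.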

@ilongin ilongin marked this pull request as draft October 14, 2025 10:44
sourcery-ai bot commented Oct 14, 2025

Reviewer's Guide

This PR refactors how the dataset schema is computed by deriving a flat schema from the unified signals schema (including system signals) rather than reconstructing it from column definitions. It adds a clone_with_sys_signals helper to merge system columns, generates flat_schema in DataChain.save, and propagates flat_schema through query/dataset.save and catalog.create_dataset, replacing the previous inline column-based schema logic.
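The difference between the hierarchical and the flat schema can be sketched as follows. This is a simplified illustration under stated assumptions: the nested dict layout, the TYPE_NAMES mapping, and the flatten helper are all hypothetical, and only the double-underscore naming (as in source__file__source) is taken from this page.

```python
from typing import Any

# Hypothetical hierarchical schema: nested dicts are objects, leaves are types.
hierarchical = {
    "sys": {"id": int, "rand": int},
    "file": {"source": str, "path": str, "size": int},
    "score": float,
}

# Assumed mapping from Python types to storage type names.
TYPE_NAMES = {int: "Int64", float: "Float", str: "String"}


def flatten(schema: dict[str, Any], prefix: str = "") -> dict[str, type]:
    """Walk the tree and join nested names with a double underscore."""
    flat: dict[str, type] = {}
    for name, value in schema.items():
        full_name = f"{prefix}__{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, full_name))
        else:
            flat[full_name] = value
    return flat


# Keep only leaves with a known storage type and map each name to a
# type dictionary, one entry per stored column.
flat_schema = {
    name: {"type": TYPE_NAMES[py_type]}
    for name, py_type in flatten(hierarchical).items()
    if py_type in TYPE_NAMES
}
```

Because the flat schema is derived from the same tree as the hierarchical one, the two cannot drift apart the way a separately rebuilt column list could.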

File-Level Changes

Change: Generate flat_schema from signals_schema in DataChain.save
Details:
  • Add dict comprehension to build flat_schema from clone_with_sys_signals().db_signals
  • Filter by SQLType and map signal names to type dictionaries
  • Pass flat_schema into the storage/save call alongside feature_schema
Files: src/datachain/lib/dc/datachain.py

Change: Introduce clone_with_sys_signals method on SignalSchema
Details:
  • Implement merging of system columns from DataTable.sys_columns
  • Return a new SignalSchema combining system and non-system signals
Files: src/datachain/lib/signal_schema.py

Change: Propagate flat_schema parameter in query Dataset.save
Details:
  • Add flat_schema argument to save signature
  • Forward flat_schema to the downstream catalog.create_dataset call
Files: src/datachain/query/dataset.py

Change: Replace inline column-based schema creation with flat_schema in Catalog.create_dataset
Details:
  • Add flat_schema parameter to create_dataset signature
  • Use flat_schema for metastore.create_dataset instead of locally built schema
  • Comment out legacy schema-building code based on columns
Files: src/datachain/catalog/catalog.py
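The propagation described in the changes above might look roughly like the sketch below, where the schema is computed once by the caller and forwarded unchanged to the metastore. ToyMetastore, ToyCatalog, and the save function are hypothetical, and their signatures are assumptions made for illustration; they do not reproduce datachain's actual classes or parameters.

```python
class ToyMetastore:
    """Hypothetical metastore that only records what it receives."""

    def __init__(self) -> None:
        self.datasets: dict[str, dict] = {}

    def create_dataset(self, name: str, feature_schema: dict, schema: dict) -> None:
        self.datasets[name] = {"feature_schema": feature_schema, "schema": schema}


class ToyCatalog:
    """Hypothetical catalog; forwards schemas instead of rebuilding them."""

    def __init__(self, metastore: ToyMetastore) -> None:
        self.metastore = metastore

    def create_dataset(self, name: str, feature_schema: dict, flat_schema: dict) -> None:
        # No schema is rebuilt from column definitions at this level;
        # the caller's flat_schema goes to the metastore as is.
        self.metastore.create_dataset(
            name, feature_schema=feature_schema, schema=flat_schema
        )


def save(catalog: ToyCatalog, name: str, feature_schema: dict, flat_schema: dict) -> None:
    # Stand-in for the query-level save step that passes both schemas down.
    catalog.create_dataset(name, feature_schema=feature_schema, flat_schema=flat_schema)


metastore = ToyMetastore()
flat = {"sys__id": {"type": "Int64"}, "file__path": {"type": "String"}}
save(ToyCatalog(metastore), "demo", feature_schema={"file": "File"}, flat_schema=flat)
```

With this shape, each layer only adds a parameter and forwards it, so the schema has a single source of truth at the top of the call chain.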

Possibly linked issues

  • remove docstring from DataModel.__pydantic__init_subclass__ #123: The PR replaces the schema calculation based on columns with a new flat_schema derived from SignalSchema.
  • #Compare() doesn't see file source: The PR updates schema calculation to correctly include system signals, resolving the 'source__file__source' SignalResolvingError.

@ilongin ilongin linked an issue Oct 14, 2025 that may be closed by this pull request
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!


@cloudflare-workers-and-pages

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 42f2b0b
Status: ✅  Deploy successful!
Preview URL: https://6dd3bbe1.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1403-use-signal-sche.datachain-documentation.pages.dev

@shcheklein

@ilongin ping

Development

Successfully merging this pull request may close these issues.

Use SignalSchema to calculate final dataset version schema

3 participants