Skip to content

control plane & inferred schema improvements omnibus #2059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Apr 9, 2025

Description:

This PR includes a number of structured commits which collectively:

  • Make JSON Schema reductions aware of x-collection-generation-id, including reset upon a ratchet upwards.
  • Fix inferred schema log sites to include x-collection-generation-id consistently, and to discard cached inferred doc::Shapes of collections which have been reset or backfilled.
  • Switch collections using schema inference to flow://relaxed-write-schema (now supported in the UI).
  • Deeply refactor how validations work:
    • Previously, validation only considered a proposed "draft" state and would either succeed or error.
    • Now, validation jointly walks live models, live built specifications, and the proposed draft change.
    • In many cases validations will continue to error, however:
    • In some cases, validations will apply automatic fixes to draft models to enforce a constraint instead of error-ing.
    • This now gives us a single code path to apply nuanced and contextual handling of most constraint violations in the control plane.

Model fixes are included in publication_specs details and visible in the History view (though formatting could be better...).

Validations now fully supports recently-introduced spec fields such as inactive_bindings and inactive_transforms, and also honors collection reset semantics by performing an effective drop-and-replace.

This PR completes the control-plane and runtime implementation of changes from the collection evolution proposal.

One notable departure is that resource path pointers are de-emphasized and may soon be deprecated entirely. Instead, validations inline a /_meta/path property into each binding resource config which holds the last-Validated resource_path returned by the connector. This is sufficient for the new joins done by validations, but places fewer constraints or assumptions on how connectors compute binding resource paths.

A wide variety of scenarios were manually tested on a local stack, including:

  • Behaviors around resets of type-A and type-B collections bound to both captures and materializations, including bindings which are active and also disabled at the time of reset.
  • Behaviors around deletions of collections which had been actively bound to a capture & materialization.
  • Updates of inferred schemas, including around collection reset.

Workflow steps:

  • Many conditions which previous resulted in a validation error will now succeed, with additional changes made by the control plane and recorded in publication_specs.
  • Inferred schemas respect collection deletion / re-creation and reset.
  • Capture and materialization bindings will properly backfill upon collection reset, even if disabled during the reset and later enabled.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

Generation ID ratchets (only) upwards during reduction. If the RHS has a
greater generation ID, then it replaces the LHS. Or, if the RHS has an
older one, it's understood to be stale and is dropped. If they're equal,
then reduce as per usual.

If only one side has a valid (string) generation ID, then the values are
reduced but the generation ID is explicitly attached to the reduced
output, which handles historical schemas without a generation ID, as
well as failing "safe" if there's a future whoopsie with regard to
x-collection-generation-ID population.
We preserve inferred shapes across capture task terms,
which is desireable to reduce the volumne of inferred schema update
logs, but in doing so we must properly handle cases where a collection
is reset (it's generation ID changes), or is backfilled.

Use the partition template name AND the state key to key the cache of
inferred shapes used across restarts.

Also fix derivation inferred schema logging, which had previously
omitted `x-collection-generation-id`.
This field has never been produced in actual logs.
It's harmless that it's been in the rollup (always coalesced
to the zero ID), but this commit cleans it up.
Validations have historically only looked at a proposed "draft" model,
and concerned themselves with either accepting the draft and building specs,
or producing errors.

This change deeply reworks the fundamental "job" of validation.

Going forward, it's job is to jointly examine live models, live specs,
and proposed draft models to validate all required constraints, and to
enforce those constraints by either a) producing an error, or b)
performing an automatic fix of the model, recording the fact that it
did so.

Much of the lift here is in the coordination required to jointly step
through live and drafted models and specs, which requires joining over
extracted resource paths.

Connector resource paths are somewhat complicated. After exhausting a
number of other unwork-able options, the solution implemented here
stashes a Validated response resource path into the resource config
model of its applicable bindings. Then, the next time we're validating a
proposed draft change of the now-live binding, we're able to extract
those stashed paths in order to properly join over them.

Having done this, new "fixes" are introduced for handling:
 - Unchanged bindings of collections which have been deleted.
 - Unchanged materialization exclusions of fields which have been
   dropped.
 - Unchanged projections of collections where the schema location has
   been removed.
 - Updating the inline inferred schema of a collection's read schema,
   respecting the generation ID of the collection and inferred schema.
 - Support for automatically back-filling the capture or materialization
   binding of a "reset" collection (having a changed generation ID).

New validations which restrict changes to keys or logical partitions are
also introduced, which may change only if the collection is also being
reset.

Finally, validations begin tracking inactive historical bindings /
transforms of tasks, which are threaded into connector RPCs and are also
used for low-level determination of whether binding backfill is required.
From the resource config schema. This approach doesn't work,
existing many existing connectors have more complicated behaviors which
cannot be expressed through resource path pointers.
@jgraettinger jgraettinger changed the title Johnny/validation tweaks 2 control plane & schema improvements omnibus Apr 9, 2025
@jgraettinger jgraettinger changed the title control plane & schema improvements omnibus control plane & inferred schema improvements omnibus Apr 9, 2025
@jgraettinger jgraettinger requested a review from psFried April 9, 2025 23:59
@jgraettinger jgraettinger marked this pull request as ready for review April 9, 2025 23:59
@jgraettinger
Copy link
Member Author

jgraettinger commented Apr 10, 2025

This PR includes both runtime and control-plane changes. I tested them together, but they can be released separately and in any order.

One other call-out is that the release of the inferred schema reducer update will churn a lot of inferred schemas in the platform, by adding x-collection-generation-id for the first time.

@jgraettinger jgraettinger added change:significant This is a significant change change:planned This is a planned change labels Apr 10, 2025
@jgraettinger
Copy link
Member Author

Issue #1969
Issue #1919

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change change:significant This is a significant change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant