control plane & inferred schema improvements omnibus #2059
Open
jgraettinger wants to merge 10 commits into master from johnny/validation-tweaks-2
Generation ID ratchets (only) upwards during reduction. If the RHS has a greater generation ID, then it replaces the LHS. If the RHS has an older one, it's understood to be stale and is dropped. If they're equal, then reduce as per usual. If only one side has a valid (string) generation ID, then the values are reduced but the generation ID is explicitly attached to the reduced output. This handles historical schemas without a generation ID, and also fails "safe" if there's a future whoopsie with regard to `x-collection-generation-id` population.
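The ratchet rules above can be sketched roughly as follows. This is an illustration only, not the reducer from this PR: it assumes generation IDs are strings that order correctly under plain string comparison, and `_merge` is a hypothetical stand-in for the ordinary reduction (here it just unions observed types).

```python
from typing import Any, Optional

GEN_KEY = "x-collection-generation-id"


def _generation(schema: dict[str, Any]) -> Optional[str]:
    """Return the generation ID only if it is a string; otherwise treat it as absent."""
    value = schema.get(GEN_KEY)
    return value if isinstance(value, str) else None


def _merge(lhs: dict[str, Any], rhs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the usual reduction: union the observed types."""
    return {"type": sorted(set(lhs.get("type", [])) | set(rhs.get("type", [])))}


def reduce_inferred(lhs: dict[str, Any], rhs: dict[str, Any]) -> dict[str, Any]:
    l_gen, r_gen = _generation(lhs), _generation(rhs)

    if l_gen is not None and r_gen is not None:
        if r_gen > l_gen:
            return dict(rhs)  # RHS is newer: it replaces the LHS.
        if r_gen < l_gen:
            return dict(lhs)  # RHS is stale: it is dropped.

    # Equal IDs, or at most one side has a valid ID: reduce as usual,
    # then explicitly attach whichever generation ID is known.
    out = _merge(lhs, rhs)
    gen = l_gen if l_gen is not None else r_gen
    if gen is not None:
        out[GEN_KEY] = gen
    return out
```

Attaching the known ID in the one-sided case means a later reduction against an older, ID-bearing schema still ratchets correctly.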
We preserve inferred shapes across capture task terms, which is desirable to reduce the volume of inferred schema update logs, but in doing so we must properly handle cases where a collection is reset (its generation ID changes), or is backfilled. Use the partition template name AND the state key to key the cache of inferred shapes used across restarts. Also fix derivation inferred schema logging, which had previously omitted `x-collection-generation-id`.
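A minimal sketch of the composite cache key described above. It assumes that at least one part of the key changes whenever a collection is reset or backfilled, so a stale shape simply misses the cache. `InferredShapeCache`, its methods, and the example key values are hypothetical names for illustration, not the runtime's actual types.

```python
from typing import Any, Optional

CacheKey = tuple[str, str]  # (partition template name, state key)


class InferredShapeCache:
    """Holds inferred shapes across task restarts, keyed by both identifiers."""

    def __init__(self) -> None:
        self._shapes: dict[CacheKey, dict[str, Any]] = {}

    def put(self, partition_template: str, state_key: str, shape: dict[str, Any]) -> None:
        self._shapes[(partition_template, state_key)] = shape

    def get(self, partition_template: str, state_key: str) -> Optional[dict[str, Any]]:
        # A lookup misses if EITHER component differs from the cached key.
        return self._shapes.get((partition_template, state_key))

    def retain(self, live_keys: set[CacheKey]) -> None:
        """Discard cached shapes whose key is no longer in use."""
        self._shapes = {k: v for k, v in self._shapes.items() if k in live_keys}
```

Keying on both values, rather than on the collection name alone, is what lets a restart reuse a shape only when neither a reset nor a backfill has happened in between.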
This field has never been produced in actual logs. It's harmless that it's been in the rollup (always coalesced to the zero ID), but this commit cleans it up.
Rather than flow://write-schema
Validations have historically only looked at a proposed "draft" model, and concerned themselves with either accepting the draft and building specs, or producing errors. This change deeply reworks the fundamental "job" of validation. Going forward, its job is to jointly examine live models, live specs, and proposed draft models to validate all required constraints, and to enforce those constraints by either a) producing an error, or b) performing an automatic fix of the model, recording the fact that it did so.

Much of the lift here is in the coordination required to jointly step through live and drafted models and specs, which requires joining over extracted resource paths. Connector resource paths are somewhat complicated. After exhausting a number of other unworkable options, the solution implemented here stashes a Validated response resource path into the resource config model of its applicable bindings. Then, the next time we're validating a proposed draft change of the now-live binding, we're able to extract those stashed paths in order to properly join over them.

Having done this, new "fixes" are introduced for handling:
- Unchanged bindings of collections which have been deleted.
- Unchanged materialization exclusions of fields which have been dropped.
- Unchanged projections of collections where the schema location has been removed.
- Updating the inline inferred schema of a collection's read schema, respecting the generation ID of the collection and inferred schema.
- Support for automatically back-filling the capture or materialization binding of a "reset" collection (having a changed generation ID).

New validations are also introduced which restrict changes to keys or logical partitions: these may change only if the collection is also being reset.
Finally, validations begin tracking inactive historical bindings / transforms of tasks, which are threaded into connector RPCs and are also used for low-level determination of whether binding backfill is required.
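To illustrate the stash-then-join idea, here is a rough sketch that stores a resource path under `/_meta/path` of a binding's resource config (the location named in the PR description) and later joins live and draft bindings over it. The function names and the binding dictionaries are hypothetical, and the sketch assumes every binding being joined already carries a stashed path.

```python
from typing import Any, Optional

Binding = dict[str, Any]
Path = tuple[str, ...]


def stash_path(resource_config: dict[str, Any], resource_path: list[str]) -> dict[str, Any]:
    """Return a copy of the resource config with the path stored at /_meta/path."""
    out = dict(resource_config)
    meta = dict(out.get("_meta", {}))
    meta["path"] = list(resource_path)
    out["_meta"] = meta
    return out


def stashed_path(resource_config: dict[str, Any]) -> Optional[Path]:
    """Extract a previously stashed path, or None if there isn't one."""
    path = resource_config.get("_meta", {}).get("path")
    return tuple(path) if isinstance(path, list) else None


def join_bindings(
    live: list[Binding], draft: list[Binding]
) -> list[tuple[Path, Optional[Binding], Optional[Binding]]]:
    """Full outer join of live and draft bindings, ordered by resource path.

    Bindings lacking a stashed path are left out of this sketch; a real
    implementation would need to obtain their paths some other way.
    """
    def index(bindings: list[Binding]) -> dict[Path, Binding]:
        keyed = ((stashed_path(b["resource"]), b) for b in bindings)
        return {path: b for path, b in keyed if path is not None}

    live_by, draft_by = index(live), index(draft)
    paths = sorted(set(live_by) | set(draft_by))
    return [(p, live_by.get(p), draft_by.get(p)) for p in paths]
```

Each joined row then tells validation whether a binding exists only in the live spec, only in the draft, or in both, which is the information the fixes listed above depend on.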
From the resource config schema. This approach doesn't work: many existing connectors have more complicated behaviors which cannot be expressed through resource path pointers.
This PR includes both runtime and control-plane changes. I tested them together, but they can be released separately and in any order. One other call-out is that the release of the inferred schema reducer update will churn a lot of inferred schemas in the platform, by adding |
Description:
This PR includes a number of structured commits which collectively:
- `x-collection-generation-id`, including reset upon a ratchet upwards.
- `x-collection-generation-id` consistently, and to discard cached inferred `doc::Shape`s of collections which have been reset or backfilled.
- `flow://relaxed-write-schema` (now supported in the UI).
- Model fixes are included in `publication_specs` details and visible in the History view (though formatting could be better...).
- Validations now fully support recently-introduced spec fields such as `inactive_bindings` and `inactive_transforms`, and also honor collection `reset` semantics by performing an effective drop-and-replace.

This PR completes the control-plane and runtime implementation of changes from the collection evolution proposal.
One notable departure is that resource path pointers are de-emphasized and may soon be deprecated entirely. Instead, validations inline a `/_meta/path` property into each binding resource config which holds the last-Validated `resource_path` returned by the connector. This is sufficient for the new joins done by validations, but places fewer constraints or assumptions on how connectors compute binding resource paths.

A wide variety of scenarios were manually tested on a local stack, including:
Workflow steps:
- `publication_specs`.
- `reset`.

Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)