Skip to content

Conversation

@teskje
Copy link
Contributor

@teskje teskje commented Nov 13, 2025

This PR adds a turmoil-based chaos test for builtin schema migrations. It spawns a bunch of processes at different versions, defines some random builtin schema changes, and then performs migrations through all versions. For added chaos, processes are randomly restarted all the time.

The first commit adjusts the builtin schema migration code by pulling the Transaction handling out of the core migration logic. Turns out that it isn't easy to construct a Transaction in tests, so pulling it to the edge makes the test code simpler.

Motivation

  • This PR adds a test.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje teskje force-pushed the builtin-item-migration-turmoil branch 2 times, most recently from 81bd870 to 39a7377 Compare November 15, 2025 20:23
@teskje teskje force-pushed the builtin-item-migration-turmoil branch from 39a7377 to c98c108 Compare November 16, 2025 13:35
This commit reorganizes the builtin schema migration code by pulling all
`Transaction` handling into the top-level `run` function. `Migration`
becomes unaware of the existing of the `Transaction` and instead returns
a `MigrationRunResult` that informs the caller about the changes that
need to be applied.

The reason for doing this is that it makes it easier to test the core
migration code. In tests it's difficult to construct a `Transaction`
object, and with this change we don't have to do that anymore, as long
as we're invoking `Migration::run` directly.
@teskje teskje force-pushed the builtin-item-migration-turmoil branch 2 times, most recently from 77a17d0 to 76b5841 Compare November 20, 2025 10:03
This commit adds a turmoil-based chaos test for builtin schema
migrations. It spawns a bunch of processes at different versions,
defines some random builtin schema changes, and then performs migrations
through all versions. For added chaos, process are randomly restarted
all the time.
@teskje teskje force-pushed the builtin-item-migration-turmoil branch from 76b5841 to 953c265 Compare November 20, 2025 10:10
@teskje teskje marked this pull request as ready for review November 20, 2025 10:10
@teskje teskje requested a review from a team as a code owner November 20, 2025 10:10
@teskje teskje requested a review from SangJunBak November 20, 2025 10:10
Comment on lines +90 to +92
for generation in 1..=NUM_VERSIONS {
let version = Version::new(0, generation, 0);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth adding jumps to these versions instead of having them linear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this would do anything! The migration logic doesn't look at the versions specifically, it only compares them, and for that it doesn't matter by how much they differ. Also note that the rng might decide to assign 0 migrations and 0 processes to a version, which effectively means skipping that version.


/// Fuzz test builtin schema migration using turmoil.
#[test] // allow(test-attribute)
#[ignore = "runs forever"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does CI know to move on when this runs forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't! That's why the test is ignored, so it doesn't run in CI. It's mostly useful for running locally, for fuzzing during development.

#[cfg_attr(miri, ignore)] // too slow
fn test_builtin_schema_migration() {
const NUM_VERSIONS: u64 = 10;
const NUM_BUILTINS: u64 = 5;
Copy link
Contributor

@SangJunBak SangJunBak Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be interesting to have a separate variable for the number of initial builtins to be 10x larger than the number of builtins we want to migrate. More representative of our actual system at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose a small number of builtins to make it more likely that there are multiple migrations on the same builtin, which seems more interesting than having a single migration on many. So the values are chosen based on what I believe will find the most bugs, not what's closest to the situation in production.

The idea with these constants is that it should be easy to change them during development, to tune how much certain aspects should be stressed. If you think it would be worthwhile to have another configuration running in CI as well, lmk. We can make some of these arguments and then define different #[test]s that invoke the function with different arguments.

Copy link
Contributor

@SangJunBak SangJunBak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some suggestions but otw lgtm!

@teskje
Copy link
Contributor Author

teskje commented Nov 21, 2025

TFTR!

@teskje teskje merged commit 94d696a into MaterializeInc:main Nov 21, 2025
132 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants