diff --git a/AGENTS.md b/AGENTS.md index 88789d69..06bd58fc 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -132,7 +132,7 @@ m.Alter = "ENGINE=InnoDB" require.NoError(t, m.Run()) ``` -Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithStrict()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`. +Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`. **General test patterns:** - Integration tests connect to real MySQL — there are no mocked database tests for core logic diff --git a/docs/migrate.md b/docs/migrate.md index 3c46b0fd..70cbbed0 100644 --- a/docs/migrate.md +++ b/docs/migrate.md @@ -31,7 +31,6 @@ spirit migrate --host mydb:3306 --username root --password secret \ - [skip-drop-after-cutover](#skip-drop-after-cutover) - [skip-force-kill](#skip-force-kill) - [statement](#statement) -- [strict](#strict) - [table](#table) - [target-chunk-time](#target-chunk-time) - [threads](#threads) @@ -70,7 +69,7 @@ This protects against resuming from very stale checkpoints where replaying the a When Spirit reads a checkpoint, it relies on the columns of the checkpoint table matching the columns the current binary expects: -- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. This is true in both strict and non-strict modes — [`--strict`](#strict) does not currently hard-fail on a checkpoint-table schema mismatch, so progress from the previous binary version is silently discarded. +- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. Progress from the previous binary version is silently discarded. - **If the checkpoint table schema is unchanged but the *meaning* of stored values has changed** between versions (for example, a watermark format change, a routing-policy change, or a new applier behavior), Spirit cannot detect the mismatch. The resume will silently succeed and the new binary will reinterpret the old checkpoint, which can produce incorrect results. Operationally, this means: @@ -200,9 +199,8 @@ The main practical differences vs. the default path: A resume from checkpoint fails fast if `@@GLOBAL.gtid_purged` is no longer a subset of the checkpointed GTID set (i.e. the source has dropped binlogs Spirit would need to re-apply). In that case Spirit surfaces -`change.Source: cannot resume from position`, which under default (non-strict) -mode causes a restart from scratch, and under [`--strict`](#strict) causes an -exit with `status.ErrBinlogNotFound`. +`change.Source: cannot resume from position`, logs the reason, and restarts the +migration from scratch. ```bash spirit migrate --gtid \ @@ -345,26 +343,6 @@ There are some restrictions to `--statement`: - When sending multiple statements, the `INSTANT` and `INPLACE` optimizations will be skipped. This means that metadata-only changes that would execute instantly if submitted alone will require a full table copy. - When sending multiple statements, all statements must operate on tables in the same underlying database (aka schema). -### strict - -- Type: Boolean -- Default value: `false` - -> **⚠️ Not recommended.** In most cases, the default behavior (idempotent restart) is safer and more convenient. `--strict` was added for a specific internal use case and is generally the wrong choice for new deployments. If a previous migration was interrupted, the default behavior will safely clean up and restart, which is almost always what you want. - -By default, Spirit will automatically clean up old checkpoints before starting the schema change. This allows schema changes to always proceed forward, at the cost of potentially lost progress from a previous incomplete run. - -When set to `true`, if Spirit encounters a checkpoint belonging to a previous migration, it will validate that the statement being executed matches the statement from the previous run (whether provided via `--alter` or `--statement`). If the validation fails (e.g., the statement was changed between runs, or the binlog position is no longer available), Spirit will exit with an error rather than silently restarting from scratch. - -The scenarios where `--strict` causes Spirit to fail rather than restart are: -- The migration statement changed between runs (checkpoint has a different statement) -- The binlog file referenced by the checkpoint has been purged from the server -- The checkpoint is too old to safely resume (replaying binlogs would be slower than restarting) - -In all of these cases, the default (non-strict) behavior is to log a warning and start fresh, which is usually the correct action. - -Note: a checkpoint-table schema mismatch (typically caused by resuming with a different Spirit binary version — see [Resuming across Spirit binary versions](#resuming-across-spirit-binary-versions)) is **not** one of the strict-mode hard-fail cases. In both strict and non-strict mode Spirit logs a warning and starts a fresh migration. - ### table - Type: String diff --git a/pkg/migration/helpers_test.go b/pkg/migration/helpers_test.go index 72b42c54..ae38dad7 100644 --- a/pkg/migration/helpers_test.go +++ b/pkg/migration/helpers_test.go @@ -130,13 +130,6 @@ func WithDeferCutOver() RunnerOption { } } -// WithStrict enables strict mode (mismatched ALTER detection on resume). -func WithStrict() RunnerOption { - return func(m *Migration) { - m.Strict = true - } -} - // WithDBName overrides the database name (for tests using CreateUniqueTestDatabase). func WithDBName(name string) RunnerOption { return func(m *Migration) { diff --git a/pkg/migration/migration.go b/pkg/migration/migration.go index 757f0254..891a5242 100644 --- a/pkg/migration/migration.go +++ b/pkg/migration/migration.go @@ -42,7 +42,6 @@ type Migration struct { SkipDropAfterCutover bool `name:"skip-drop-after-cutover" help:"Keep old table after completing cutover" optional:"" default:"false"` DeferCutOver bool `name:"defer-cutover" help:"Defer cutover (and checksum) until sentinel table is dropped" optional:"" default:"false"` SkipForceKill bool `name:"skip-force-kill" help:"Disable killing long-running transactions in order to acquire metadata lock (MDL) at checksum and cutover time" optional:"" default:"false"` - Strict bool `name:"strict" help:"Exit on --alter mismatch when incomplete migration is detected. Not recommended for most users; the default idempotent restart behavior is safer." optional:"" default:"false"` Statement string `name:"statement" help:"The SQL statement to run (replaces --table and --alter)" optional:"" default:""` Lint bool `name:"lint" help:"Run lint checks before running migration" optional:""` LintOnly bool `name:"lint-only" help:"Run lint checks and exit without performing migration" optional:""` diff --git a/pkg/migration/resume-from-checkpoint.md b/pkg/migration/resume-from-checkpoint.md index 18d6f3c4..1e7a75cb 100644 --- a/pkg/migration/resume-from-checkpoint.md +++ b/pkg/migration/resume-from-checkpoint.md @@ -28,12 +28,12 @@ When a new Runner starts (`Runner.Run()` → `setup()`), it always attempts `res 1. **Check `__new` exists** — if the shadow table is gone, there's nothing to resume. 2. **Read checkpoint table** — fetch the saved watermarks, position, statement, and `original_table_name`. -3. **Validate DDL statement matches** — the checkpoint must be for the same alter. In `--strict` mode, a mismatch is a hard error. In non-strict mode, Spirit discards the checkpoint and starts fresh. +3. **Validate DDL statement matches** — the checkpoint must be for the same alter. On a mismatch, Spirit discards the checkpoint and starts fresh. 4. **Validate `original_table_name` matches** (single-table mode) — guards against the rare collision where two long table names truncate to the same checkpoint table name. A mismatch causes Spirit to discard the checkpoint and start fresh. 5. **Set up copier, checker, and change source** — create the change source and add subscriptions for each table. 6. **Resume streaming from the saved position** — `replClient.StartFromPosition(ctx, position)` primes the source's internal position and begins streaming. The source validates the position is still resumable; if it isn't (e.g. the MySQL binlog file has been purged), `change.ErrPositionNotFound` is returned and surfaces as `status.ErrBinlogNotFound`, causing Spirit to fall back to `newMigration()`. -If any step fails (and strict mode is not enabled), Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable. +If any step fails, Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable. ## Background: how MySQL binary logs work @@ -47,34 +47,13 @@ MySQL automatically deletes old binlog files based on `binlog_expire_logs_second One reason resume can fail is **binlog expiry**. If the checkpoint references a binlog file that has been purged, Spirit cannot resume because changes in the gap would be lost. -The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Strict mode then surfaces the error to the caller; non-strict mode logs and falls back to `newMigration()`. - -What happens next depends on whether strict mode is enabled: - -- **Without `--strict`:** Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost silently. -- **With `--strict`:** Spirit returns `status.ErrBinlogNotFound` to the caller. This lets automation detect the problem and alert an operator rather than silently discarding hours of copy work. +The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost. The tradeoff of falling back to `newMigration()` is that all copy progress is lost. For a large table this could mean hours of wasted work. To avoid this: - **Keep binlog retention longer than your longest expected migration pause.** If you expect to pause migrations for up to a week, make sure `binlog_expire_logs_seconds` is set to at least 7 days. The MySQL 8.0 default is 30 days (`2592000`), which is usually sufficient. -- **Consider `--strict` mode only if you have automation that handles the errors it produces.** In strict mode, Spirit surfaces both DDL mismatches (`status.ErrMismatchedAlter`) and binlog expiry (`status.ErrBinlogNotFound`) as errors instead of restarting. This is generally not recommended for most users — the default behavior of discarding stale checkpoints and restarting is safer and simpler. See [strict](../docs/migrate.md#strict) for more details. - **Be aware of your binlog retention window.** If Spirit is paused longer than the retention period, the checkpoint's binlog file will be purged and resume will fail. Some managed MySQL services disable retention by default. -## Strict mode - -> **Note:** `--strict` is not recommended for most users. The default idempotent restart behavior (discard stale checkpoint, restart from scratch) is safer and requires no special error handling. Only use `--strict` if you have automation that can programmatically handle the specific errors it produces. - -By default, Spirit treats checkpoint resume as best-effort. If the checkpoint is invalid for any reason — mismatched DDL statement, expired binlog, corrupt checkpoint data — Spirit discards it and starts a new migration. This is the recommended behavior. - -With `Strict: true`, Spirit returns a hard error for two specific resume failures: - -- **`status.ErrMismatchedAlter`** — the checkpoint's DDL statement doesn't match the current `--alter`. This prevents the scenario where an operator changes the alter between runs and unknowingly loses all progress. -- **`status.ErrBinlogNotFound`** — the checkpoint's binlog file has been purged from the server. This prevents silently restarting a multi-hour copy from scratch. - -Both errors work with `errors.Is()`, letting callers handle each case differently. See [strict](../docs/migrate.md#strict) for more details. - -Other resume failures (missing shadow table, corrupt checkpoint data) still fall through to `newMigration()` in both modes, since these typically indicate there was nothing valid to resume. - ## Cross-version compatibility Checkpoint tables are version-specific. Spirit deliberately uses `SELECT *` when reading the checkpoint, so any change to the checkpoint schema (e.g. columns added, removed, or reordered in a newer release) will cause the read to fail. This is by design — Spirit does not attempt to migrate or backfill checkpoint data across versions. @@ -83,8 +62,7 @@ Practical implications: - **Upgrading Spirit mid-migration** (older binary → newer binary). The newer binary's `Scan` expects a different number of columns than the on-disk checkpoint provides, so the read fails. - **Rolling back Spirit mid-migration** (newer binary → older binary). Same failure mode in reverse. -- **Effect in both `--strict` and non-strict mode:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. This error is not one of the typed `status.Err…` values that strict mode promotes (`ErrMismatchedAlter`, `ErrBinlogNotFound`, `ErrCheckpointTooOld`), so Spirit logs the error and falls through to `newMigration()` regardless of strict mode. The copy restarts from scratch and all checkpoint progress is lost silently. -- **Operator implication:** do **not** rely on `--strict` alerting to catch a Spirit upgrade or rollback. Strict mode does not currently distinguish a cross-version checkpoint schema mismatch from a healthy fresh start. +- **Effect:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. Spirit logs the error and falls through to `newMigration()`. The copy restarts from scratch and all checkpoint progress is lost. Operational guidance: diff --git a/pkg/migration/resume_test.go b/pkg/migration/resume_test.go index 55905c11..a8a090f3 100644 --- a/pkg/migration/resume_test.go +++ b/pkg/migration/resume_test.go @@ -561,63 +561,6 @@ FROM compositevarcharpk a WHERE version='1'`) require.NoError(t, m2.Close()) } -func TestResumeFromCheckpointStrict(t *testing.T) { - t.Parallel() - tt := testutils.NewTestTable(t, "resumestricttest", `CREATE TABLE resumestricttest ( - id int(11) NOT NULL AUTO_INCREMENT, - pad varbinary(1024) NOT NULL, - PRIMARY KEY (id) - )`) - tt.SeedRows(t, "INSERT INTO resumestricttest (pad) SELECT RANDOM_BYTES(1024)", 100000) - - alterSQL := "ADD INDEX(pad);" - - // Kick off a migration with --strict enabled and let it run until the first checkpoint is available - m := NewTestRunner(t, "resumestricttest", alterSQL, - WithThreads(1), - WithTargetChunkTime(100*time.Millisecond), - WithStrict(), - WithTestThrottler()) - - ctx, cancel := context.WithCancel(t.Context()) - done := make(chan struct{}) - go func() { - defer close(done) - err := m.Run(ctx) - require.Error(t, err) // it gets interrupted as soon as there is a checkpoint saved. - }() - - waitForCheckpoint(t, m) - - // Cancel + wait for Run to fully return before Close. See - // TestChangeIntToBigIntPKResumeFromChkPt for the rationale. - cancel() - <-done - require.NoError(t, m.Close()) - - // Insert some more dummy data - testutils.RunSQL(t, "INSERT INTO resumestricttest (pad) SELECT RANDOM_BYTES(1024) FROM resumestricttest LIMIT 1000") - - // Start a _different_ migration on the same table. We don't expect this to work when --strict is enabled - // since the --alter doesn't match what is recorded in the checkpoint table - runner2 := NewTestRunner(t, "resumestricttest", "ENGINE=INNODB", - WithThreads(1), - WithTargetChunkTime(100*time.Millisecond), - WithStrict()) - - err := runner2.Run(t.Context()) - require.Error(t, err) - require.ErrorIs(t, err, status.ErrMismatchedAlter) - require.NoError(t, runner2.Close()) - - // We should be able to force the migration to run even though there's a mismatched --alter - // by disabling --strict - runner3 := NewTestRunner(t, "resumestricttest", "ENGINE=INNODB", WithThreads(4)) - require.NoError(t, runner3.Run(t.Context())) - require.False(t, runner3.usedResumeFromCheckpoint) - require.NoError(t, runner3.Close()) -} - // TestResumeFromCheckpointPhantom tests that there is not a phantom row issue // when resuming from checkpoint. i.e. consider the following scenario: // 1) A new row is inserted at the end of the table, and the copier copies it.. but the low watermark never advances past this point @@ -919,55 +862,13 @@ func TestResumeFromCheckpointCleanupOnFailure(t *testing.T) { // This simulates binlog expiry between stop and start. testutils.RunSQL(t, `UPDATE _cleanup_test_chkpnt SET binlog_position = 'nonexistent-bin.999999:999999999'`) - // Without strict mode: falls back to newMigration and completes successfully. + // Resume falls back to newMigration and completes successfully. m2 := NewTestRunner(t, "cleanup_test", "ENGINE=InnoDB", WithThreads(2)) require.NoError(t, m2.Run(t.Context())) require.False(t, m2.usedResumeFromCheckpoint) // Should NOT have resumed because binlog was invalid require.NoError(t, m2.Close()) } -func TestResumeFromCheckpointStrictBinlogExpired(t *testing.T) { - t.Parallel() - tt := testutils.NewTestTable(t, "strictbinlogtest", `CREATE TABLE strictbinlogtest ( - id int(11) NOT NULL AUTO_INCREMENT, - name varchar(255) NOT NULL, - pad varbinary(1024) NOT NULL, - PRIMARY KEY (id) - )`) - tt.SeedRows(t, "INSERT INTO strictbinlogtest (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000) - - // First run: create a checkpoint - m := NewTestRunner(t, "strictbinlogtest", "ENGINE=InnoDB", - WithThreads(1), - WithTargetChunkTime(100*time.Millisecond), - WithTestThrottler()) - - ctx, cancel := context.WithCancel(t.Context()) - done := make(chan struct{}) - go func() { - defer close(done) - _ = m.Run(ctx) - }() - - waitForCheckpoint(t, m) - cancel() - <-done - require.NoError(t, m.Close()) - - // Corrupt binlog position to simulate expiry - testutils.RunSQL(t, `UPDATE _strictbinlogtest_chkpnt SET binlog_position = 'nonexistent-bin.999999:999999999'`) - - // With strict mode: should error with ErrBinlogNotFound instead of silently restarting - m2 := NewTestRunner(t, "strictbinlogtest", "ENGINE=InnoDB", - WithThreads(2), - WithStrict()) - - err := m2.Run(t.Context()) - require.Error(t, err) - require.ErrorIs(t, err, status.ErrBinlogNotFound) - require.NoError(t, m2.Close()) -} - // TestResumeFromCheckpointTooOld tests that when a checkpoint's created_at timestamp // exceeds CheckpointMaxAge, the migration falls back to a fresh start instead of // resuming from the stale checkpoint. This prevents the slow replay of many days @@ -1001,56 +902,13 @@ func TestResumeFromCheckpointTooOld(t *testing.T) { // Backdate the checkpoint's created_at to simulate an old checkpoint (8 days ago). testutils.RunSQL(t, `UPDATE _chkpttooold_chkpnt SET created_at = DATE_SUB(NOW(), INTERVAL 8 DAY)`) - // Without strict mode: falls back to newMigration and completes successfully. + // Resume falls back to newMigration and completes successfully. m2 := NewTestRunner(t, "chkpttooold", "ENGINE=InnoDB", WithThreads(2)) require.NoError(t, m2.Run(t.Context())) require.False(t, m2.usedResumeFromCheckpoint) // Should NOT have resumed because checkpoint was too old require.NoError(t, m2.Close()) } -// TestResumeFromCheckpointStrictTooOld tests that when strict mode is enabled -// and a checkpoint exceeds CheckpointMaxAge, the migration fails with -// ErrCheckpointTooOld rather than silently starting fresh. -func TestResumeFromCheckpointStrictTooOld(t *testing.T) { - t.Parallel() - tt := testutils.NewTestTable(t, "strictoldtest", `CREATE TABLE strictoldtest ( - id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, - name VARCHAR(255) NOT NULL, - pad VARCHAR(1000) NOT NULL default 'x')`) - tt.SeedRows(t, "INSERT INTO strictoldtest (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000) - - // First run: create a checkpoint - m := NewTestRunner(t, "strictoldtest", "ENGINE=InnoDB", - WithThreads(1), - WithTargetChunkTime(100*time.Millisecond), - WithTestThrottler()) - - ctx, cancel := context.WithCancel(t.Context()) - done := make(chan struct{}) - go func() { - defer close(done) - _ = m.Run(ctx) - }() - - waitForCheckpoint(t, m) - cancel() - <-done - require.NoError(t, m.Close()) - - // Backdate the checkpoint's created_at to simulate an old checkpoint (8 days ago). - testutils.RunSQL(t, `UPDATE _strictoldtest_chkpnt SET created_at = DATE_SUB(NOW(), INTERVAL 8 DAY)`) - - // With strict mode: should error with ErrCheckpointTooOld instead of silently restarting. - m2 := NewTestRunner(t, "strictoldtest", "ENGINE=InnoDB", - WithThreads(2), - WithStrict()) - - err := m2.Run(t.Context()) - require.Error(t, err) - require.ErrorIs(t, err, status.ErrCheckpointTooOld) - require.NoError(t, m2.Close()) -} - // TestResumeFromCheckpointNotTooOld tests that a recent checkpoint (within // CheckpointMaxAge) is still used for resume as expected. func TestResumeFromCheckpointNotTooOld(t *testing.T) { @@ -1127,41 +985,3 @@ func TestResumeRejectsCheckpointFromDifferentTable(t *testing.T) { "resume should be skipped when checkpoint records a different original table name") require.NoError(t, m2.Close()) } - -// TestResumeFromCheckpointStrictCollision asserts that under --strict, a -// truncation-collision (different original_table_name) surfaces as -// ErrCheckpointCollision rather than silently starting a fresh migration. -func TestResumeFromCheckpointStrictCollision(t *testing.T) { - t.Parallel() - tt := testutils.NewTestTable(t, "strictcollision", `CREATE TABLE strictcollision ( - id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, - name VARCHAR(255) NOT NULL, - pad VARCHAR(1000) NOT NULL default 'x')`) - tt.SeedRows(t, "INSERT INTO strictcollision (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000) - - m := NewTestRunner(t, "strictcollision", "ENGINE=InnoDB", - WithThreads(1), - WithTargetChunkTime(100*time.Millisecond), - WithTestThrottler()) - - ctx, cancel := context.WithCancel(t.Context()) - done := make(chan struct{}) - go func() { - defer close(done) - _ = m.Run(ctx) - }() - waitForCheckpoint(t, m) - cancel() - <-done - require.NoError(t, m.Close()) - - testutils.RunSQL(t, `UPDATE _strictcollision_chkpnt SET original_table_name = 'someothertable'`) - - m2 := NewTestRunner(t, "strictcollision", "ENGINE=InnoDB", - WithThreads(2), - WithStrict()) - err := m2.Run(t.Context()) - require.Error(t, err) - require.ErrorIs(t, err, status.ErrCheckpointCollision) - require.NoError(t, m2.Close()) -} diff --git a/pkg/migration/runner.go b/pkg/migration/runner.go index 80c22b68..fd22611b 100644 --- a/pkg/migration/runner.go +++ b/pkg/migration/runner.go @@ -813,26 +813,14 @@ func (r *Runner) setup(ctx context.Context) error { // We always attempt to resume from a checkpoint. if err = r.resumeFromCheckpoint(ctx); err != nil { - // Strict mode prevents silent loss of checkpoint progress. - // A mismatched alter means the user changed the DDL between runs. - // An expired binlog means the checkpoint can't be used because - // changes would be lost in the replication gap. - // A checkpoint that is too old means replaying binlogs would be - // slower than starting fresh. - // A truncation collision means the checkpoint belongs to a - // different long table that shares our truncated prefix. - // In all cases, strict mode surfaces the error rather than - // silently restarting from scratch. - if r.migration.Strict && (errors.Is(err, status.ErrMismatchedAlter) || errors.Is(err, status.ErrBinlogNotFound) || errors.Is(err, status.ErrCheckpointTooOld) || errors.Is(err, status.ErrCheckpointCollision)) { - return err - } - + // Resume is best-effort: a mismatched alter, expired binlog, too-old + // checkpoint, or truncation collision all mean the checkpoint can't be + // used. Spirit logs the reason and falls back to a fresh migration so + // it always makes forward progress. r.logger.Info("could not resume from checkpoint", "reason", err, ) // explain why it failed. - // Since we are not strict, we are allowed to - // start a new migration. if err := r.newMigration(ctx); err != nil { return err } @@ -1148,9 +1136,9 @@ func (r *Runner) resumeFromCheckpoint(ctx context.Context) error { // Open the change source at the checkpointed position. StartFromPosition // validates the position is still resumable (e.g. binlog file purged on // MySQL) and starts streaming. If the source can no longer reach the - // position, surface it as status.ErrBinlogNotFound so strict-mode and - // resume tests pick it up; otherwise propagate the error so the caller - // can abandon resume-from-checkpoint and start fresh. + // position, surface it as status.ErrBinlogNotFound so resume tests pick it + // up; otherwise propagate the error so the caller can abandon + // resume-from-checkpoint and start fresh. if err := r.replClient.StartFromPosition(ctx, binlogPosition); err != nil { r.logger.Warn("resuming from checkpoint failed because resuming from the previous source position failed", "position", binlogPosition,