diff --git a/AGENTS.md b/AGENTS.md
index 88789d69..06bd58fc 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -132,7 +132,7 @@ m.Alter = "ENGINE=InnoDB"
require.NoError(t, m.Run())
```
-Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithStrict()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`.
+Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`.
**General test patterns:**
- Integration tests connect to real MySQL — there are no mocked database tests for core logic
diff --git a/docs/migrate.md b/docs/migrate.md
index 3c46b0fd..70cbbed0 100644
--- a/docs/migrate.md
+++ b/docs/migrate.md
@@ -31,7 +31,6 @@ spirit migrate --host mydb:3306 --username root --password secret \
- [skip-drop-after-cutover](#skip-drop-after-cutover)
- [skip-force-kill](#skip-force-kill)
- [statement](#statement)
-- [strict](#strict)
- [table](#table)
- [target-chunk-time](#target-chunk-time)
- [threads](#threads)
@@ -70,7 +69,7 @@ This protects against resuming from very stale checkpoints where replaying the a
When Spirit reads a checkpoint, it relies on the columns of the checkpoint table matching the columns the current binary expects:
-- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. This is true in both strict and non-strict modes — [`--strict`](#strict) does not currently hard-fail on a checkpoint-table schema mismatch, so progress from the previous binary version is silently discarded.
+- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. Progress from the previous binary version is silently discarded.
- **If the checkpoint table schema is unchanged but the *meaning* of stored values has changed** between versions (for example, a watermark format change, a routing-policy change, or a new applier behavior), Spirit cannot detect the mismatch. The resume will silently succeed and the new binary will reinterpret the old checkpoint, which can produce incorrect results.
Operationally, this means:
@@ -200,9 +199,8 @@ The main practical differences vs. the default path:
A resume from checkpoint fails fast if `@@GLOBAL.gtid_purged` is no longer a
subset of the checkpointed GTID set (i.e. the source has dropped binlogs Spirit
would need to re-apply). In that case Spirit surfaces
-`change.Source: cannot resume from position`, which under default (non-strict)
-mode causes a restart from scratch, and under [`--strict`](#strict) causes an
-exit with `status.ErrBinlogNotFound`.
+`change.Source: cannot resume from position`, logs the reason, and restarts the
+migration from scratch.
```bash
spirit migrate --gtid \
@@ -345,26 +343,6 @@ There are some restrictions to `--statement`:
- When sending multiple statements, the `INSTANT` and `INPLACE` optimizations will be skipped. This means that metadata-only changes that would execute instantly if submitted alone will require a full table copy.
- When sending multiple statements, all statements must operate on tables in the same underlying database (aka schema).
-### strict
-
-- Type: Boolean
-- Default value: `false`
-
-> **⚠️ Not recommended.** In most cases, the default behavior (idempotent restart) is safer and more convenient. `--strict` was added for a specific internal use case and is generally the wrong choice for new deployments. If a previous migration was interrupted, the default behavior will safely clean up and restart, which is almost always what you want.
-
-By default, Spirit will automatically clean up old checkpoints before starting the schema change. This allows schema changes to always proceed forward, at the cost of potentially lost progress from a previous incomplete run.
-
-When set to `true`, if Spirit encounters a checkpoint belonging to a previous migration, it will validate that the statement being executed matches the statement from the previous run (whether provided via `--alter` or `--statement`). If the validation fails (e.g., the statement was changed between runs, or the binlog position is no longer available), Spirit will exit with an error rather than silently restarting from scratch.
-
-The scenarios where `--strict` causes Spirit to fail rather than restart are:
-- The migration statement changed between runs (checkpoint has a different statement)
-- The binlog file referenced by the checkpoint has been purged from the server
-- The checkpoint is too old to safely resume (replaying binlogs would be slower than restarting)
-
-In all of these cases, the default (non-strict) behavior is to log a warning and start fresh, which is usually the correct action.
-
-Note: a checkpoint-table schema mismatch (typically caused by resuming with a different Spirit binary version — see [Resuming across Spirit binary versions](#resuming-across-spirit-binary-versions)) is **not** one of the strict-mode hard-fail cases. In both strict and non-strict mode Spirit logs a warning and starts a fresh migration.
-
### table
- Type: String
diff --git a/pkg/migration/helpers_test.go b/pkg/migration/helpers_test.go
index 72b42c54..ae38dad7 100644
--- a/pkg/migration/helpers_test.go
+++ b/pkg/migration/helpers_test.go
@@ -130,13 +130,6 @@ func WithDeferCutOver() RunnerOption {
}
}
-// WithStrict enables strict mode (mismatched ALTER detection on resume).
-func WithStrict() RunnerOption {
- return func(m *Migration) {
- m.Strict = true
- }
-}
-
// WithDBName overrides the database name (for tests using CreateUniqueTestDatabase).
func WithDBName(name string) RunnerOption {
return func(m *Migration) {
diff --git a/pkg/migration/migration.go b/pkg/migration/migration.go
index 757f0254..891a5242 100644
--- a/pkg/migration/migration.go
+++ b/pkg/migration/migration.go
@@ -42,7 +42,6 @@ type Migration struct {
SkipDropAfterCutover bool `name:"skip-drop-after-cutover" help:"Keep old table after completing cutover" optional:"" default:"false"`
DeferCutOver bool `name:"defer-cutover" help:"Defer cutover (and checksum) until sentinel table is dropped" optional:"" default:"false"`
SkipForceKill bool `name:"skip-force-kill" help:"Disable killing long-running transactions in order to acquire metadata lock (MDL) at checksum and cutover time" optional:"" default:"false"`
- Strict bool `name:"strict" help:"Exit on --alter mismatch when incomplete migration is detected. Not recommended for most users; the default idempotent restart behavior is safer." optional:"" default:"false"`
Statement string `name:"statement" help:"The SQL statement to run (replaces --table and --alter)" optional:"" default:""`
Lint bool `name:"lint" help:"Run lint checks before running migration" optional:""`
LintOnly bool `name:"lint-only" help:"Run lint checks and exit without performing migration" optional:""`
diff --git a/pkg/migration/resume-from-checkpoint.md b/pkg/migration/resume-from-checkpoint.md
index 18d6f3c4..1e7a75cb 100644
--- a/pkg/migration/resume-from-checkpoint.md
+++ b/pkg/migration/resume-from-checkpoint.md
@@ -28,12 +28,12 @@ When a new Runner starts (`Runner.Run()` → `setup()`), it always attempts `res
1. **Check `_
_new` exists** — if the shadow table is gone, there's nothing to resume.
2. **Read checkpoint table** — fetch the saved watermarks, position, statement, and `original_table_name`.
-3. **Validate DDL statement matches** — the checkpoint must be for the same alter. In `--strict` mode, a mismatch is a hard error. In non-strict mode, Spirit discards the checkpoint and starts fresh.
+3. **Validate DDL statement matches** — the checkpoint must be for the same alter. On a mismatch, Spirit discards the checkpoint and starts fresh.
4. **Validate `original_table_name` matches** (single-table mode) — guards against the rare collision where two long table names truncate to the same checkpoint table name. A mismatch causes Spirit to discard the checkpoint and start fresh.
5. **Set up copier, checker, and change source** — create the change source and add subscriptions for each table.
6. **Resume streaming from the saved position** — `replClient.StartFromPosition(ctx, position)` primes the source's internal position and begins streaming. The source validates the position is still resumable; if it isn't (e.g. the MySQL binlog file has been purged), `change.ErrPositionNotFound` is returned and surfaces as `status.ErrBinlogNotFound`, causing Spirit to fall back to `newMigration()`.
-If any step fails (and strict mode is not enabled), Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable.
+If any step fails, Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable.
## Background: how MySQL binary logs work
@@ -47,34 +47,13 @@ MySQL automatically deletes old binlog files based on `binlog_expire_logs_second
One reason resume can fail is **binlog expiry**. If the checkpoint references a binlog file that has been purged, Spirit cannot resume because changes in the gap would be lost.
-The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Strict mode then surfaces the error to the caller; non-strict mode logs and falls back to `newMigration()`.
-
-What happens next depends on whether strict mode is enabled:
-
-- **Without `--strict`:** Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost silently.
-- **With `--strict`:** Spirit returns `status.ErrBinlogNotFound` to the caller. This lets automation detect the problem and alert an operator rather than silently discarding hours of copy work.
+The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost.
The tradeoff of falling back to `newMigration()` is that all copy progress is lost. For a large table this could mean hours of wasted work. To avoid this:
- **Keep binlog retention longer than your longest expected migration pause.** If you expect to pause migrations for up to a week, make sure `binlog_expire_logs_seconds` is set to at least 7 days. The MySQL 8.0 default is 30 days (`2592000`), which is usually sufficient.
-- **Consider `--strict` mode only if you have automation that handles the errors it produces.** In strict mode, Spirit surfaces both DDL mismatches (`status.ErrMismatchedAlter`) and binlog expiry (`status.ErrBinlogNotFound`) as errors instead of restarting. This is generally not recommended for most users — the default behavior of discarding stale checkpoints and restarting is safer and simpler. See [strict](../docs/migrate.md#strict) for more details.
- **Be aware of your binlog retention window.** If Spirit is paused longer than the retention period, the checkpoint's binlog file will be purged and resume will fail. Some managed MySQL services disable retention by default.
-## Strict mode
-
-> **Note:** `--strict` is not recommended for most users. The default idempotent restart behavior (discard stale checkpoint, restart from scratch) is safer and requires no special error handling. Only use `--strict` if you have automation that can programmatically handle the specific errors it produces.
-
-By default, Spirit treats checkpoint resume as best-effort. If the checkpoint is invalid for any reason — mismatched DDL statement, expired binlog, corrupt checkpoint data — Spirit discards it and starts a new migration. This is the recommended behavior.
-
-With `Strict: true`, Spirit returns a hard error for two specific resume failures:
-
-- **`status.ErrMismatchedAlter`** — the checkpoint's DDL statement doesn't match the current `--alter`. This prevents the scenario where an operator changes the alter between runs and unknowingly loses all progress.
-- **`status.ErrBinlogNotFound`** — the checkpoint's binlog file has been purged from the server. This prevents silently restarting a multi-hour copy from scratch.
-
-Both errors work with `errors.Is()`, letting callers handle each case differently. See [strict](../docs/migrate.md#strict) for more details.
-
-Other resume failures (missing shadow table, corrupt checkpoint data) still fall through to `newMigration()` in both modes, since these typically indicate there was nothing valid to resume.
-
## Cross-version compatibility
Checkpoint tables are version-specific. Spirit deliberately uses `SELECT *` when reading the checkpoint, so any change to the checkpoint schema (e.g. columns added, removed, or reordered in a newer release) will cause the read to fail. This is by design — Spirit does not attempt to migrate or backfill checkpoint data across versions.
@@ -83,8 +62,7 @@ Practical implications:
- **Upgrading Spirit mid-migration** (older binary → newer binary). The newer binary's `Scan` expects a different number of columns than the on-disk checkpoint provides, so the read fails.
- **Rolling back Spirit mid-migration** (newer binary → older binary). Same failure mode in reverse.
-- **Effect in both `--strict` and non-strict mode:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. This error is not one of the typed `status.Err…` values that strict mode promotes (`ErrMismatchedAlter`, `ErrBinlogNotFound`, `ErrCheckpointTooOld`), so Spirit logs the error and falls through to `newMigration()` regardless of strict mode. The copy restarts from scratch and all checkpoint progress is lost silently.
-- **Operator implication:** do **not** rely on `--strict` alerting to catch a Spirit upgrade or rollback. Strict mode does not currently distinguish a cross-version checkpoint schema mismatch from a healthy fresh start.
+- **Effect:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. Spirit logs the error and falls through to `newMigration()`. The copy restarts from scratch and all checkpoint progress is lost.
Operational guidance:
diff --git a/pkg/migration/resume_test.go b/pkg/migration/resume_test.go
index 55905c11..a8a090f3 100644
--- a/pkg/migration/resume_test.go
+++ b/pkg/migration/resume_test.go
@@ -561,63 +561,6 @@ FROM compositevarcharpk a WHERE version='1'`)
require.NoError(t, m2.Close())
}
-func TestResumeFromCheckpointStrict(t *testing.T) {
- t.Parallel()
- tt := testutils.NewTestTable(t, "resumestricttest", `CREATE TABLE resumestricttest (
- id int(11) NOT NULL AUTO_INCREMENT,
- pad varbinary(1024) NOT NULL,
- PRIMARY KEY (id)
- )`)
- tt.SeedRows(t, "INSERT INTO resumestricttest (pad) SELECT RANDOM_BYTES(1024)", 100000)
-
- alterSQL := "ADD INDEX(pad);"
-
- // Kick off a migration with --strict enabled and let it run until the first checkpoint is available
- m := NewTestRunner(t, "resumestricttest", alterSQL,
- WithThreads(1),
- WithTargetChunkTime(100*time.Millisecond),
- WithStrict(),
- WithTestThrottler())
-
- ctx, cancel := context.WithCancel(t.Context())
- done := make(chan struct{})
- go func() {
- defer close(done)
- err := m.Run(ctx)
- require.Error(t, err) // it gets interrupted as soon as there is a checkpoint saved.
- }()
-
- waitForCheckpoint(t, m)
-
- // Cancel + wait for Run to fully return before Close. See
- // TestChangeIntToBigIntPKResumeFromChkPt for the rationale.
- cancel()
- <-done
- require.NoError(t, m.Close())
-
- // Insert some more dummy data
- testutils.RunSQL(t, "INSERT INTO resumestricttest (pad) SELECT RANDOM_BYTES(1024) FROM resumestricttest LIMIT 1000")
-
- // Start a _different_ migration on the same table. We don't expect this to work when --strict is enabled
- // since the --alter doesn't match what is recorded in the checkpoint table
- runner2 := NewTestRunner(t, "resumestricttest", "ENGINE=INNODB",
- WithThreads(1),
- WithTargetChunkTime(100*time.Millisecond),
- WithStrict())
-
- err := runner2.Run(t.Context())
- require.Error(t, err)
- require.ErrorIs(t, err, status.ErrMismatchedAlter)
- require.NoError(t, runner2.Close())
-
- // We should be able to force the migration to run even though there's a mismatched --alter
- // by disabling --strict
- runner3 := NewTestRunner(t, "resumestricttest", "ENGINE=INNODB", WithThreads(4))
- require.NoError(t, runner3.Run(t.Context()))
- require.False(t, runner3.usedResumeFromCheckpoint)
- require.NoError(t, runner3.Close())
-}
-
// TestResumeFromCheckpointPhantom tests that there is not a phantom row issue
// when resuming from checkpoint. i.e. consider the following scenario:
// 1) A new row is inserted at the end of the table, and the copier copies it.. but the low watermark never advances past this point
@@ -919,55 +862,13 @@ func TestResumeFromCheckpointCleanupOnFailure(t *testing.T) {
// This simulates binlog expiry between stop and start.
testutils.RunSQL(t, `UPDATE _cleanup_test_chkpnt SET binlog_position = 'nonexistent-bin.999999:999999999'`)
- // Without strict mode: falls back to newMigration and completes successfully.
+ // Resume falls back to newMigration and completes successfully.
m2 := NewTestRunner(t, "cleanup_test", "ENGINE=InnoDB", WithThreads(2))
require.NoError(t, m2.Run(t.Context()))
require.False(t, m2.usedResumeFromCheckpoint) // Should NOT have resumed because binlog was invalid
require.NoError(t, m2.Close())
}
-func TestResumeFromCheckpointStrictBinlogExpired(t *testing.T) {
- t.Parallel()
- tt := testutils.NewTestTable(t, "strictbinlogtest", `CREATE TABLE strictbinlogtest (
- id int(11) NOT NULL AUTO_INCREMENT,
- name varchar(255) NOT NULL,
- pad varbinary(1024) NOT NULL,
- PRIMARY KEY (id)
- )`)
- tt.SeedRows(t, "INSERT INTO strictbinlogtest (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000)
-
- // First run: create a checkpoint
- m := NewTestRunner(t, "strictbinlogtest", "ENGINE=InnoDB",
- WithThreads(1),
- WithTargetChunkTime(100*time.Millisecond),
- WithTestThrottler())
-
- ctx, cancel := context.WithCancel(t.Context())
- done := make(chan struct{})
- go func() {
- defer close(done)
- _ = m.Run(ctx)
- }()
-
- waitForCheckpoint(t, m)
- cancel()
- <-done
- require.NoError(t, m.Close())
-
- // Corrupt binlog position to simulate expiry
- testutils.RunSQL(t, `UPDATE _strictbinlogtest_chkpnt SET binlog_position = 'nonexistent-bin.999999:999999999'`)
-
- // With strict mode: should error with ErrBinlogNotFound instead of silently restarting
- m2 := NewTestRunner(t, "strictbinlogtest", "ENGINE=InnoDB",
- WithThreads(2),
- WithStrict())
-
- err := m2.Run(t.Context())
- require.Error(t, err)
- require.ErrorIs(t, err, status.ErrBinlogNotFound)
- require.NoError(t, m2.Close())
-}
-
// TestResumeFromCheckpointTooOld tests that when a checkpoint's created_at timestamp
// exceeds CheckpointMaxAge, the migration falls back to a fresh start instead of
// resuming from the stale checkpoint. This prevents the slow replay of many days
@@ -1001,56 +902,13 @@ func TestResumeFromCheckpointTooOld(t *testing.T) {
// Backdate the checkpoint's created_at to simulate an old checkpoint (8 days ago).
testutils.RunSQL(t, `UPDATE _chkpttooold_chkpnt SET created_at = DATE_SUB(NOW(), INTERVAL 8 DAY)`)
- // Without strict mode: falls back to newMigration and completes successfully.
+ // Resume falls back to newMigration and completes successfully.
m2 := NewTestRunner(t, "chkpttooold", "ENGINE=InnoDB", WithThreads(2))
require.NoError(t, m2.Run(t.Context()))
require.False(t, m2.usedResumeFromCheckpoint) // Should NOT have resumed because checkpoint was too old
require.NoError(t, m2.Close())
}
-// TestResumeFromCheckpointStrictTooOld tests that when strict mode is enabled
-// and a checkpoint exceeds CheckpointMaxAge, the migration fails with
-// ErrCheckpointTooOld rather than silently starting fresh.
-func TestResumeFromCheckpointStrictTooOld(t *testing.T) {
- t.Parallel()
- tt := testutils.NewTestTable(t, "strictoldtest", `CREATE TABLE strictoldtest (
- id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
- name VARCHAR(255) NOT NULL,
- pad VARCHAR(1000) NOT NULL default 'x')`)
- tt.SeedRows(t, "INSERT INTO strictoldtest (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000)
-
- // First run: create a checkpoint
- m := NewTestRunner(t, "strictoldtest", "ENGINE=InnoDB",
- WithThreads(1),
- WithTargetChunkTime(100*time.Millisecond),
- WithTestThrottler())
-
- ctx, cancel := context.WithCancel(t.Context())
- done := make(chan struct{})
- go func() {
- defer close(done)
- _ = m.Run(ctx)
- }()
-
- waitForCheckpoint(t, m)
- cancel()
- <-done
- require.NoError(t, m.Close())
-
- // Backdate the checkpoint's created_at to simulate an old checkpoint (8 days ago).
- testutils.RunSQL(t, `UPDATE _strictoldtest_chkpnt SET created_at = DATE_SUB(NOW(), INTERVAL 8 DAY)`)
-
- // With strict mode: should error with ErrCheckpointTooOld instead of silently restarting.
- m2 := NewTestRunner(t, "strictoldtest", "ENGINE=InnoDB",
- WithThreads(2),
- WithStrict())
-
- err := m2.Run(t.Context())
- require.Error(t, err)
- require.ErrorIs(t, err, status.ErrCheckpointTooOld)
- require.NoError(t, m2.Close())
-}
-
// TestResumeFromCheckpointNotTooOld tests that a recent checkpoint (within
// CheckpointMaxAge) is still used for resume as expected.
func TestResumeFromCheckpointNotTooOld(t *testing.T) {
@@ -1127,41 +985,3 @@ func TestResumeRejectsCheckpointFromDifferentTable(t *testing.T) {
"resume should be skipped when checkpoint records a different original table name")
require.NoError(t, m2.Close())
}
-
-// TestResumeFromCheckpointStrictCollision asserts that under --strict, a
-// truncation-collision (different original_table_name) surfaces as
-// ErrCheckpointCollision rather than silently starting a fresh migration.
-func TestResumeFromCheckpointStrictCollision(t *testing.T) {
- t.Parallel()
- tt := testutils.NewTestTable(t, "strictcollision", `CREATE TABLE strictcollision (
- id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
- name VARCHAR(255) NOT NULL,
- pad VARCHAR(1000) NOT NULL default 'x')`)
- tt.SeedRows(t, "INSERT INTO strictcollision (name, pad) SELECT 'a', REPEAT('x', 1000)", 1000)
-
- m := NewTestRunner(t, "strictcollision", "ENGINE=InnoDB",
- WithThreads(1),
- WithTargetChunkTime(100*time.Millisecond),
- WithTestThrottler())
-
- ctx, cancel := context.WithCancel(t.Context())
- done := make(chan struct{})
- go func() {
- defer close(done)
- _ = m.Run(ctx)
- }()
- waitForCheckpoint(t, m)
- cancel()
- <-done
- require.NoError(t, m.Close())
-
- testutils.RunSQL(t, `UPDATE _strictcollision_chkpnt SET original_table_name = 'someothertable'`)
-
- m2 := NewTestRunner(t, "strictcollision", "ENGINE=InnoDB",
- WithThreads(2),
- WithStrict())
- err := m2.Run(t.Context())
- require.Error(t, err)
- require.ErrorIs(t, err, status.ErrCheckpointCollision)
- require.NoError(t, m2.Close())
-}
diff --git a/pkg/migration/runner.go b/pkg/migration/runner.go
index 80c22b68..fd22611b 100644
--- a/pkg/migration/runner.go
+++ b/pkg/migration/runner.go
@@ -813,26 +813,14 @@ func (r *Runner) setup(ctx context.Context) error {
// We always attempt to resume from a checkpoint.
if err = r.resumeFromCheckpoint(ctx); err != nil {
- // Strict mode prevents silent loss of checkpoint progress.
- // A mismatched alter means the user changed the DDL between runs.
- // An expired binlog means the checkpoint can't be used because
- // changes would be lost in the replication gap.
- // A checkpoint that is too old means replaying binlogs would be
- // slower than starting fresh.
- // A truncation collision means the checkpoint belongs to a
- // different long table that shares our truncated prefix.
- // In all cases, strict mode surfaces the error rather than
- // silently restarting from scratch.
- if r.migration.Strict && (errors.Is(err, status.ErrMismatchedAlter) || errors.Is(err, status.ErrBinlogNotFound) || errors.Is(err, status.ErrCheckpointTooOld) || errors.Is(err, status.ErrCheckpointCollision)) {
- return err
- }
-
+ // Resume is best-effort: a mismatched alter, expired binlog, too-old
+ // checkpoint, or truncation collision all mean the checkpoint can't be
+ // used. Spirit logs the reason and falls back to a fresh migration so
+ // it always makes forward progress.
r.logger.Info("could not resume from checkpoint",
"reason", err,
) // explain why it failed.
- // Since we are not strict, we are allowed to
- // start a new migration.
if err := r.newMigration(ctx); err != nil {
return err
}
@@ -1148,9 +1136,9 @@ func (r *Runner) resumeFromCheckpoint(ctx context.Context) error {
// Open the change source at the checkpointed position. StartFromPosition
// validates the position is still resumable (e.g. binlog file purged on
// MySQL) and starts streaming. If the source can no longer reach the
- // position, surface it as status.ErrBinlogNotFound so strict-mode and
- // resume tests pick it up; otherwise propagate the error so the caller
- // can abandon resume-from-checkpoint and start fresh.
+ // position, surface it as status.ErrBinlogNotFound so resume tests pick it
+ // up; otherwise propagate the error so the caller can abandon
+ // resume-from-checkpoint and start fresh.
if err := r.replClient.StartFromPosition(ctx, binlogPosition); err != nil {
r.logger.Warn("resuming from checkpoint failed because resuming from the previous source position failed",
"position", binlogPosition,