Merged
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -132,7 +132,7 @@ m.Alter = "ENGINE=InnoDB"
require.NoError(t, m.Run())
```

Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithStrict()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`.
Available options: `WithThreads(n)`, `WithTargetChunkTime(d)`, `WithBuffered(b)`, `WithTable(name)`, `WithAlter(stmt)`, `WithStatement(sql)`, `WithTestThrottler()`, `WithDeferCutOver()`, `WithSkipDropAfterCutover()`, `WithDBName(name)`, `WithRespectSentinel()`, `WithLint()`, `WithLintOnly()`, `WithHost(host)`, `WithReplicaDSN(dsn)`, `WithReplicaMaxLag(d)`, `WithConfFile(t, content)`.

**General test patterns:**
- Integration tests connect to real MySQL — there are no mocked database tests for core logic
28 changes: 3 additions & 25 deletions docs/migrate.md
@@ -31,7 +31,6 @@ spirit migrate --host mydb:3306 --username root --password secret \
- [skip-drop-after-cutover](#skip-drop-after-cutover)
- [skip-force-kill](#skip-force-kill)
- [statement](#statement)
- [strict](#strict)
- [table](#table)
- [target-chunk-time](#target-chunk-time)
- [threads](#threads)
@@ -70,7 +69,7 @@ This protects against resuming from very stale checkpoints where replaying the a

When Spirit reads a checkpoint, it relies on the columns of the checkpoint table matching the columns the current binary expects:

- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. This is true in both strict and non-strict modes — [`--strict`](#strict) does not currently hard-fail on a checkpoint-table schema mismatch, so progress from the previous binary version is silently discarded.
- **If the checkpoint table schema differs** between versions (columns added, removed, or reordered), the resume read will fail and Spirit logs a warning and starts a fresh migration. Progress from the previous binary version is silently discarded.
- **If the checkpoint table schema is unchanged but the *meaning* of stored values has changed** between versions (for example, a watermark format change, a routing-policy change, or a new applier behavior), Spirit cannot detect the mismatch. The resume will silently succeed and the new binary will reinterpret the old checkpoint, which can produce incorrect results.

Operationally, this means:
@@ -200,9 +199,8 @@ The main practical differences vs. the default path:
A resume from checkpoint fails fast if `@@GLOBAL.gtid_purged` is no longer a
subset of the checkpointed GTID set (i.e. the source has dropped binlogs Spirit
would need to re-apply). In that case Spirit surfaces
`change.Source: cannot resume from position`, which under default (non-strict)
mode causes a restart from scratch, and under [`--strict`](#strict) causes an
exit with `status.ErrBinlogNotFound`.
`change.Source: cannot resume from position`, logs the reason, and restarts the
migration from scratch.

```bash
spirit migrate --gtid \
@@ -345,26 +343,6 @@ There are some restrictions to `--statement`:
- When sending multiple statements, the `INSTANT` and `INPLACE` optimizations will be skipped. This means that metadata-only changes that would execute instantly if submitted alone will require a full table copy.
- When sending multiple statements, all statements must operate on tables in the same underlying database (aka schema).

### strict

- Type: Boolean
- Default value: `false`

> **⚠️ Not recommended.** In most cases, the default behavior (idempotent restart) is safer and more convenient. `--strict` was added for a specific internal use case and is generally the wrong choice for new deployments. If a previous migration was interrupted, the default behavior will safely clean up and restart, which is almost always what you want.

By default, Spirit will automatically clean up old checkpoints before starting the schema change. This allows schema changes to always proceed forward, at the cost of potentially lost progress from a previous incomplete run.

When set to `true`, if Spirit encounters a checkpoint belonging to a previous migration, it will validate that the statement being executed matches the statement from the previous run (whether provided via `--alter` or `--statement`). If the validation fails (e.g., the statement was changed between runs, or the binlog position is no longer available), Spirit will exit with an error rather than silently restarting from scratch.

The scenarios where `--strict` causes Spirit to fail rather than restart are:
- The migration statement changed between runs (checkpoint has a different statement)
- The binlog file referenced by the checkpoint has been purged from the server
- The checkpoint is too old to safely resume (replaying binlogs would be slower than restarting)

In all of these cases, the default (non-strict) behavior is to log a warning and start fresh, which is usually the correct action.

Note: a checkpoint-table schema mismatch (typically caused by resuming with a different Spirit binary version — see [Resuming across Spirit binary versions](#resuming-across-spirit-binary-versions)) is **not** one of the strict-mode hard-fail cases. In both strict and non-strict mode Spirit logs a warning and starts a fresh migration.

### table

- Type: String
7 changes: 0 additions & 7 deletions pkg/migration/helpers_test.go
@@ -130,13 +130,6 @@ func WithDeferCutOver() RunnerOption {
	}
}

// WithStrict enables strict mode (mismatched ALTER detection on resume).
func WithStrict() RunnerOption {
	return func(m *Migration) {
		m.Strict = true
	}
}

// WithDBName overrides the database name (for tests using CreateUniqueTestDatabase).
func WithDBName(name string) RunnerOption {
	return func(m *Migration) {
1 change: 0 additions & 1 deletion pkg/migration/migration.go
@@ -42,7 +42,6 @@ type Migration struct {
SkipDropAfterCutover bool `name:"skip-drop-after-cutover" help:"Keep old table after completing cutover" optional:"" default:"false"`
DeferCutOver bool `name:"defer-cutover" help:"Defer cutover (and checksum) until sentinel table is dropped" optional:"" default:"false"`
SkipForceKill bool `name:"skip-force-kill" help:"Disable killing long-running transactions in order to acquire metadata lock (MDL) at checksum and cutover time" optional:"" default:"false"`
Strict bool `name:"strict" help:"Exit on --alter mismatch when incomplete migration is detected. Not recommended for most users; the default idempotent restart behavior is safer." optional:"" default:"false"`
Statement string `name:"statement" help:"The SQL statement to run (replaces --table and --alter)" optional:"" default:""`
Lint bool `name:"lint" help:"Run lint checks before running migration" optional:""`
LintOnly bool `name:"lint-only" help:"Run lint checks and exit without performing migration" optional:""`
30 changes: 4 additions & 26 deletions pkg/migration/resume-from-checkpoint.md
@@ -28,12 +28,12 @@ When a new Runner starts (`Runner.Run()` → `setup()`), it always attempts `res

1. **Check `_<table>_new` exists** — if the shadow table is gone, there's nothing to resume.
2. **Read checkpoint table** — fetch the saved watermarks, position, statement, and `original_table_name`.
3. **Validate DDL statement matches** — the checkpoint must be for the same alter. In `--strict` mode, a mismatch is a hard error. In non-strict mode, Spirit discards the checkpoint and starts fresh.
3. **Validate DDL statement matches** — the checkpoint must be for the same alter. On a mismatch, Spirit discards the checkpoint and starts fresh.
4. **Validate `original_table_name` matches** (single-table mode) — guards against the rare collision where two long table names truncate to the same checkpoint table name. A mismatch causes Spirit to discard the checkpoint and start fresh.
5. **Set up copier, checker, and change source** — create the change source and add subscriptions for each table.
6. **Resume streaming from the saved position** — `replClient.StartFromPosition(ctx, position)` primes the source's internal position and begins streaming. The source validates the position is still resumable; if it isn't (e.g. the MySQL binlog file has been purged), `change.ErrPositionNotFound` is returned and surfaces as `status.ErrBinlogNotFound`, causing Spirit to fall back to `newMigration()`.

If any step fails (and strict mode is not enabled), Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable.
If any step fails, Spirit logs the reason and falls back to `newMigration()`, which starts the migration from scratch. This means resume is best-effort — Spirit will always make forward progress even if the checkpoint is unusable.

## Background: how MySQL binary logs work

@@ -47,34 +47,13 @@ MySQL automatically deletes old binlog files based on `binlog_expire_logs_second

One reason resume can fail is **binlog expiry**. If the checkpoint references a binlog file that has been purged, Spirit cannot resume because changes in the gap would be lost.

The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Strict mode then surfaces the error to the caller; non-strict mode logs and falls back to `newMigration()`.

What happens next depends on whether strict mode is enabled:

- **Without `--strict`:** Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost silently.
- **With `--strict`:** Spirit returns `status.ErrBinlogNotFound` to the caller. This lets automation detect the problem and alert an operator rather than silently discarding hours of copy work.
The change source is responsible for detecting this when resume begins. The binlog implementation validates the position inside `StartFromPosition` (against `SHOW BINARY LOGS`) and returns `change.ErrPositionNotFound` if the file is gone; the migration runner translates that to `status.ErrBinlogNotFound`. Spirit logs the reason and falls back to `newMigration()`, restarting the copy from scratch. All checkpoint progress is lost.

The tradeoff of falling back to `newMigration()` is that all copy progress is lost. For a large table this could mean hours of wasted work. To avoid this:

- **Keep binlog retention longer than your longest expected migration pause.** If you expect to pause migrations for up to a week, make sure `binlog_expire_logs_seconds` is set to at least 7 days. The MySQL 8.0 default is 30 days (`2592000`), which is usually sufficient.
- **Consider `--strict` mode only if you have automation that handles the errors it produces.** In strict mode, Spirit surfaces both DDL mismatches (`status.ErrMismatchedAlter`) and binlog expiry (`status.ErrBinlogNotFound`) as errors instead of restarting. This is generally not recommended for most users — the default behavior of discarding stale checkpoints and restarting is safer and simpler. See [strict](../docs/migrate.md#strict) for more details.
- **Be aware of your binlog retention window.** If Spirit is paused longer than the retention period, the checkpoint's binlog file will be purged and resume will fail. Some managed MySQL services disable retention by default.

## Strict mode

> **Note:** `--strict` is not recommended for most users. The default idempotent restart behavior (discard stale checkpoint, restart from scratch) is safer and requires no special error handling. Only use `--strict` if you have automation that can programmatically handle the specific errors it produces.

By default, Spirit treats checkpoint resume as best-effort. If the checkpoint is invalid for any reason — mismatched DDL statement, expired binlog, corrupt checkpoint data — Spirit discards it and starts a new migration. This is the recommended behavior.

With `Strict: true`, Spirit returns a hard error for two specific resume failures:

- **`status.ErrMismatchedAlter`** — the checkpoint's DDL statement doesn't match the current `--alter`. This prevents the scenario where an operator changes the alter between runs and unknowingly loses all progress.
- **`status.ErrBinlogNotFound`** — the checkpoint's binlog file has been purged from the server. This prevents silently restarting a multi-hour copy from scratch.

Both errors work with `errors.Is()`, letting callers handle each case differently. See [strict](../docs/migrate.md#strict) for more details.

Other resume failures (missing shadow table, corrupt checkpoint data) still fall through to `newMigration()` in both modes, since these typically indicate there was nothing valid to resume.

## Cross-version compatibility

Checkpoint tables are version-specific. Spirit deliberately uses `SELECT *` when reading the checkpoint, so any change to the checkpoint schema (e.g. columns added, removed, or reordered in a newer release) will cause the read to fail. This is by design — Spirit does not attempt to migrate or backfill checkpoint data across versions.
@@ -83,8 +62,7 @@ Practical implications:

- **Upgrading Spirit mid-migration** (older binary → newer binary). The newer binary's `Scan` expects a different number of columns than the on-disk checkpoint provides, so the read fails.
- **Rolling back Spirit mid-migration** (newer binary → older binary). Same failure mode in reverse.
- **Effect in both `--strict` and non-strict mode:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. This error is not one of the typed `status.Err…` values that strict mode promotes (`ErrMismatchedAlter`, `ErrBinlogNotFound`, `ErrCheckpointTooOld`), so Spirit logs the error and falls through to `newMigration()` regardless of strict mode. The copy restarts from scratch and all checkpoint progress is lost silently.
- **Operator implication:** do **not** rely on `--strict` alerting to catch a Spirit upgrade or rollback. Strict mode does not currently distinguish a cross-version checkpoint schema mismatch from a healthy fresh start.
- **Effect:** the read returns a generic `"could not read from table"` error wrapping the underlying `database/sql` scan error. Spirit logs the error and falls through to `newMigration()`. The copy restarts from scratch and all checkpoint progress is lost.

Operational guidance:
