[🐞] Inconsistent checkpoint filename padding when resuming with different num_epochs #315

@LenardJSchnakenbeck

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in tutorial?

  • I have searched tutorial

Current Behavior

Checkpoint filenames are zero-padded according to the number of digits in num_epochs (e.g., ..._1.pt vs ..._01.pt).
When a checkpoint is saved and training is later resumed with a num_epochs that has a different number of digits, basicts_runner._get_ckpt_path() generates a different filename pattern.
This causes an error in checkpoint.backup_last_ckpt() when it tries to rename the old checkpoint after the first epoch of the new run.
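
To make the mismatch concrete, here is a minimal sketch of the naming behavior described above. The helper `ckpt_name` and the `model` prefix are hypothetical stand-ins; I am assuming the padding width equals the digit count of num_epochs, which matches what I observe but is not copied from the BasicTS source.

```python
def ckpt_name(prefix: str, epoch: int, num_epochs: int) -> str:
    """Hypothetical stand-in for the naming scheme described in this issue."""
    # Assumption: the padding width follows the digit count of num_epochs.
    width = len(str(num_epochs))
    return f"{prefix}_{str(epoch).zfill(width)}.pt"


# Same epoch, different num_epochs: the two runs disagree on the filename.
first_run = ckpt_name("model", 5, num_epochs=5)
resumed_run = ckpt_name("model", 5, num_epochs=50)
print(first_run, resumed_run)
```

The first run writes a file with no padding, while the resumed run looks for a name padded to two digits, so the two never refer to the same file.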

Expected Behavior

Checkpoint paths should remain compatible across runs, regardless of the number of digits in num_epochs.

Proposed solutions:

  • Use fixed-width padding (e.g. {epoch:04d}), or
  • Use pattern matching (as in checkpoint.get_last_ckpt_path()), or
  • Avoid padding entirely
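
A rough sketch of the first two options, in case it helps the discussion. `CKPT_WIDTH`, `fixed_ckpt_name`, and `find_ckpt` are hypothetical names of mine, not existing BasicTS functions, and the `<prefix>_<epoch>.pt` layout is an assumption based on the filenames above.

```python
import re
from pathlib import Path
from typing import Optional

CKPT_WIDTH = 4  # assumed fixed width, independent of num_epochs


def fixed_ckpt_name(prefix: str, epoch: int) -> str:
    """Option 1: always pad the epoch to the same width."""
    return f"{prefix}_{epoch:0{CKPT_WIDTH}d}.pt"


def find_ckpt(ckpt_dir: Path, prefix: str, epoch: int) -> Optional[Path]:
    """Option 2: locate the checkpoint for `epoch` whatever its padding."""
    # Accept any number of leading zeros before the epoch number.
    pattern = re.compile(rf"^{re.escape(prefix)}_0*{epoch}\.pt$")
    for path in sorted(ckpt_dir.iterdir()):
        if pattern.match(path.name):
            return path
    return None
```

Option 1 keeps new runs consistent with each other but would not find checkpoints already written with the old padding; option 2 stays compatible with existing files at the cost of a directory scan.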

I’d be happy to open a PR for this if one approach sounds reasonable.

Environment

- OS:
- DEVICE:
- NVIDIA Driver:
- CUDA:
- NVIDIA GPU Memory:
- PyTorch:

BasicTS logs

No response

Steps To Reproduce

1. Train with num_epochs=5.
2. Resume with num_epochs=50.
3. Wait until the first epoch of the new run has finished.
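
The steps above can also be simulated without BasicTS. This is only an approximation under stated assumptions: `ckpt_path` is a hypothetical helper that mimics the width-dependent naming, the `.bak` suffix is made up for illustration, and I am assuming the failure surfaces as a `FileNotFoundError` from the rename.

```python
import os
import tempfile


def ckpt_path(ckpt_dir: str, epoch: int, num_epochs: int) -> str:
    # Assumed naming rule: pad the epoch to the digit count of num_epochs.
    width = len(str(num_epochs))
    return os.path.join(ckpt_dir, f"model_{str(epoch).zfill(width)}.pt")


def simulate_resume():
    """Return (saved name, expected name, whether the rename succeeded)."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        # Run 1: num_epochs=5, so the last checkpoint is written unpadded.
        saved = ckpt_path(ckpt_dir, 5, num_epochs=5)
        open(saved, "wb").close()

        # Run 2: num_epochs=50, so the path for epoch 5 is recomputed
        # with two-digit padding and no longer points at the saved file.
        expected = ckpt_path(ckpt_dir, 5, num_epochs=50)
        try:
            os.rename(expected, expected + ".bak")
            renamed = True
        except FileNotFoundError:
            renamed = False
        return os.path.basename(saved), os.path.basename(expected), renamed


print(simulate_resume())
```

The saved file exists under the unpadded name, but the resumed run tries to rename the padded one, which is the point where the backup step breaks.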

Anything else?

No response


Labels: bug, needs-triaged
