Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

br/stream: save more information to pause (pause v2) #59660

Open
YuJuncen opened this issue Feb 20, 2025 · 0 comments · May be fixed by #59663
Open

br/stream: save more information to pause (pause v2) #59660

YuJuncen opened this issue Feb 20, 2025 · 0 comments · May be fixed by #59663
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@YuJuncen
Copy link
Contributor

Enhancement

Background

Now, when a task encounters a fatal error, the following keys are written:

$prefix/last_error/$task_name → kvproto::brpb::StreamBackupError
$prefix/paused/$task_name     → <nothing>

So far, we have suffered some pains due to this like:

  • When someone pauses the task, it is impossible to track the reason for pausing.
    • This happens when a task is paused by a new feature, like auto pausing tasks with huge lags.
  • When a user mistakenly pauses a task, we cannot figure out when and how he paused it.
  • Putting pause and error isn't atomic. This may cause a fatal error to be treated as retryable and the user may recover an unrecoveriable task.

Design

Here I propose a new format of pausing:

$prefix/paused/$task_name → PauseReason
Where `PauseReason` looks like:
{
    "severity": "ERROR" | "REGULAR_OPERATION",
    "operation_hostname": string,
    "operation_pid": int,
    "operation_time": datetime,
    "payload_type": "application/x-protobuf;messageType=brpb.StreamBackupError" | "text/plain;charset=UTF-8",
    "payload": bytes,
}

Notes:

  • Though payload_type uses the same format as MIME types, for safety and simplicity, only the message brpb.StreamBackupError and UTF-8 strings are accepted by log status. The BR CLI will refuse to parse the payload if the type isn't one of them.
  • I think PauseReason can be JSON instead of Protocol Buffer Message, because in almost all cases, the information it provides is for humans. JSON is more flexible and more friendly to humans.

Compatibility / Feasibility

For now, the pause key was used by both TiKV and the Coordinator. But neither of them reads the value of this key. So modifications to values are safe.

The error key was written by TiKV and only reads by the BR CLI (log status).

A possible update route:

  1. Update TiDB codebase, allowing BR CLI to read information from Pause V2.
    • We just read error messages from both V1 and V2.
    • In this step, the base library for Pause V2 can be added.2. Allowing the coordinator and BR CLI to pause a task by V2.
  2. Maybe allow the user to attach a custom message to this pause.
  3. Allowing the TiKV to upload errors by V2.
@YuJuncen YuJuncen added the type/enhancement The issue or PR belongs to an enhancement. label Feb 20, 2025
@YuJuncen YuJuncen changed the title br/stream: save more information to pause br/stream: save more information to pause (pause v2) Feb 20, 2025
@YuJuncen YuJuncen linked a pull request Feb 20, 2025 that will close this issue
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant