
Conversation

@moio
Contributor

@moio moio commented Mar 12, 2025

Description

Relying on etcd snapshots for the upstream ("local") cluster is known to cause issues.

The Backup Restore Operator was created to avoid them.

We should make this clear to customers.

Other context: https://www.suse.com/support/kb/doc/?id=000021770

Comments

I opened a PR instead of an issue in the hope that it will help fix the problem sooner - and also because it was the easiest way for me to communicate clearly what the problem is. I hope this helps.

@moio
Contributor Author

moio commented Mar 12, 2025

cc @mallardduck please take a look if this makes sense from your perspective
cc @jakefhyde please take a look if this makes sense from your perspective
cc @kourosh7 for initially bringing this topic up

@mallardduck
Member

mallardduck commented Mar 12, 2025

From my understanding, the limitations of etcd as an option for the local cluster are more focused on RKE1 than on other runtimes. For instance, I know some PSEs (@inichols might be able to give some input) who have customers that only use etcd for local backups and do not use BRO at all.

So I think we should make sure we capture that nuance and communicate it properly. From my understanding, etcd is a safe option for full local (Rancher) cluster backups when using k3s/RKE2, for instance.

In the BRO docs, we specifically call out that BRO/Rancher Backups is not an etcd replacement. So we should be sure not to create a conflicting message between these docs and our BRO README.

What the Backup Restore Operator is not:

  • A downstream cluster snapshot tool,
  • A replacement for Etcd cluster backups,
  • Configured to back up user-created resources on the Rancher cluster.

@jakefhyde
Contributor

> From my understanding, the limitations of etcd as an option for the local cluster are more focused on RKE1 than on other runtimes. For instance, I know some PSEs (@inichols might be able to give some input) who have customers that only use etcd for local backups and do not use BRO at all.
>
> So I think we should make sure we capture that nuance and communicate it properly. From my understanding, etcd is a safe option for full local (Rancher) cluster backups when using k3s/RKE2, for instance.
>
> In the BRO docs, we specifically call out that BRO/Rancher Backups is not an etcd replacement. So we should be sure not to create a conflicting message between these docs and our BRO README.
>
> What the Backup Restore Operator is not:
>
>   • A downstream cluster snapshot tool,
>   • A replacement for Etcd cluster backups,
>   • Configured to back up user-created resources on the Rancher cluster.

@mallardduck Want to clear up some things:

  • ETCD Snapshot Pros
    • Easy, often automatic
    • Useful if nothing was changed since the last snapshot
  • ETCD Snapshot Cons
    • If anything was changed in the local cluster, those changes are lost (e.g. CAPI resources)
      • If all of the machines were rotated for a cluster after the snapshot, those machines are "orphaned" and the cluster will also need to be restored with manual intervention
      • If clusters were created or deleted after the snapshot was taken, they will be orphaned or recreated respectively

Though simpler, the cons are very bad here. Since the state of downstream clusters is stored within the local cluster, I wouldn't recommend that users rely only on snapshots if they are performing massive infrastructure changes. I think this PR is correct to say that etcd restores should be treated as a last resort, because BRO has the advantage of being able to restore a local cluster with a partial state. We have users that rotate the nodes in their clusters as part of maintenance, so having a selective restoration tool that lets you roll back to a certain state without losing everything created since then makes resolving issues much simpler. I don't think any of that contradicts what is in https://github.com/rancher/backup-restore-operator?tab=readme-ov-file#use-cases, since it fills a use case for which etcd snapshots are not suitable in the first place.
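
For illustration only (not part of the proposed docs change): the selective backup that BRO provides boils down to applying a Backup custom resource against the local cluster. A minimal sketch, assuming the rancher-backup (BRO) chart is already installed; the resource name is made up, and the default ResourceSet name may vary by chart version:

```bash
# Minimal sketch of an on-demand Rancher Backup of the local cluster.
# Assumes the rancher-backup (BRO) chart is installed in the local cluster.
# "pre-maintenance" is an illustrative name; rancher-resource-set is the
# ResourceSet shipped with the chart (the name may differ by chart version).
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: pre-maintenance
spec:
  resourceSetName: rancher-resource-set
EOF
```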

@mallardduck
Member

mallardduck commented Mar 12, 2025

I think my concern lies in how vague the wording of this part sounds:

can sometimes cause issues

What issues, when? Why should it even be a last resort if it can sometimes cause issues? Will knowing that it can sometimes cause issues - but being uncertain about what those issues are and when to expect them - be reassuring if I am a customer? Those are some of the things I'm weighing, and why something doesn't sit quite right with me currently.

(Obviously @jakefhyde answered some of this with his clarification. 😆 )

Beyond that, I have concerns around the fact that a "Rancher Backup In-place Restore" and a "Rancher Backup Migration" are drastically different actions that customers already confuse quite often. So while not directly related to this change, the choice customers need to make isn't just binary, but rather three-pronged.

"Can I do a Rancher Backup restore even, or do I have to do a Rancher Backup migration?" (first 2 options) And finally, when should I even consider etcd if I've been warned it's a last resort?

I think that if we can clarify the above areas, it will help make things clearer overall. So if this change could incorporate some of the clarifications you provided, that may be more valuable for customers (and community users, since these are the community docs).

@jakefhyde
Contributor

> What issues, when?

@mallardduck I think that's a reasonable criticism. We would do well to inform the user about when one is recommended over the other, and what the potential tradeoffs are. IMO an etcd restore would even be usable for Rancher version rollbacks, as long as the upgrade didn't cause any changes to downstream clusters, but we never really guarantee that (it would be hard to anyway, given that the cluster-agent is immediately rolled out on upgrade).

@mallardduck
Member

@jakefhyde - between your feedback and mine, I wonder if a good way to communicate all this would be a new page dedicated to comparing and contrasting etcd (specifically in the context of the local cluster only) vs. Rancher Backup. Then the snippet we add to call out this aspect on this page could read more like:

For a reliable Rancher backup, we recommend using the Rancher Backup functionality described in this guide. Read our "(insert title)" guide to understand our best practices on when to use Rancher Backups or etcd backups.

Then we can include some of these details to better explain the risk we called out in the paragraph above this line. @moio, what do you think of these additions to add more clarity around the two tools - and of providing a guide to compare/contrast them so users can pick the best strategy (hopefully a mix of both) for their use cases?

@moio
Contributor Author

moio commented Mar 13, 2025

I think adding a page with a more detailed analysis, and linking to it from the warning, is the best idea.

My problem is that I do not feel competent enough to be authoritative about the details. The best I can do is copy-paste what you mentioned above.

Can either of you draft such a page? You have push permissions to this branch - just push a new Markdown file and let's iterate on the content.

Then I can take care of the fine editing, the rancher-docs specifics, and addressing the Docs Team's feedback until the PR is merged.

@jakefhyde
Contributor

@moio I can draft something up tomorrow

@moio
Contributor Author

moio commented Mar 14, 2025

Thank you.

BTW I consider this an "important, not urgent" topic, so it's perfectly fine to wait a week or two, as long as it isn't forgotten 😇

@jakefhyde
Contributor

jakefhyde commented Mar 28, 2025

@moio This is what I have come up with, feel free to tweak (also @mallardduck feel free to critique as well).

In the event of a failure in the local cluster, it may be necessary to restore Rancher to a previous version. There are currently two supported methods for restoring Rancher: etcd restores, and restores performed by Rancher's Backup Restore Operator. etcd restores are suitable if no user changes that need to be persisted have been made since the snapshot was taken. For example, if a user takes a snapshot and then makes a change to their local cluster which causes an outage, an etcd restore would be the expected restoration mechanism. However, if a user takes a snapshot of the local cluster and then upgrades the kubernetesVersion of a downstream cluster, an etcd restore would reset the kubernetesVersion, and the nodes in that cluster would be running a version of Kubernetes that is higher than their spec. Rancher will attempt to reconcile the version, which could have disastrous results. Likewise, if a user were to take a snapshot of the local cluster, rotate all of the nodes in a Rancher-provisioned downstream cluster, and then restore from the snapshot, the information on how those nodes were provisioned would be lost; the nodes would be orphaned and, in some cases, the cluster could become inoperable. As a general rule, when restoring Rancher, if any meaningful changes have been made to the local cluster (installing charts/apps in the local/downstream clusters, performing day-2 operations on downstream clusters, creating or deleting clusters or nodes within clusters, rotating Rancher certificates in the local cluster), it is recommended to use the Backup Restore Operator to prevent data loss.
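
To make the two paths concrete, here is a rough sketch of what each looks like on a k3s-based local cluster (RKE2 is analogous). The snapshot name, file paths, and backup file name are all illustrative, and exact flags and locations vary by distribution and configuration:

```bash
# Path 1: datastore (etcd) snapshot and restore on the local cluster (k3s shown).
k3s etcd-snapshot save --name pre-change   # take an on-demand snapshot
# ...later, after stopping the k3s service on the server node:
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-change-<node>-<timestamp>

# Path 2: Backup Restore Operator restore, applied to the local cluster.
# Assumes a BRO backup file already exists in the configured storage location;
# the file name below is made up.
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-pre-change
spec:
  backupFilename: pre-change-2025-03-28T10-00-00Z.tar.gz
EOF
```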

@moio moio force-pushed the snapshot_warning branch from 1c6b094 to fe7a3d6 on April 1, 2025 08:34
@moio moio force-pushed the snapshot_warning branch from fe7a3d6 to 1f74b96 on April 1, 2025 08:35
@moio
Contributor Author

moio commented Apr 1, 2025

Thanks @jakefhyde - I reworded your piece for consistency with the jargon in the k3s and RKE2 manuals and other parts of the document, then copy-edited it again with some help from Gemini for clarity and conciseness, as I am not a native English speaker.

Can you please confirm this is now OK, so that we can pass it to the Docs team for their part of the review?

@kourosh7
Contributor

kourosh7 commented Apr 1, 2025

@moio, just curious: why did you change it to "Datastore Snapshot" and "Datastore Restore"? Everyone seems to use and be familiar with "etcd snapshot" and "etcd snapshot restore", and I think we should stick to that terminology so as not to introduce any confusion.

@moio
Contributor Author

moio commented Apr 1, 2025

Terminology comes from official docs:

https://docs.k3s.io/datastore/backup-restore
https://docs.rke2.io/datastore/backup_restore

I did not want to introduce user confusion (although I understand the internal names are clearer for us who work with them every day...)

Philosophically, it might not even be an etcd database - it could be SQLite or something else entirely via kine.

@moio moio added the port/community-product label (Triggers a GitHub action to file a community sync issue for rancher-product-docs) on Apr 4, 2025
@mallardduck
Member

Hey @moio / @jakefhyde - I just realized that there are settings in BRO that we use for in-place restores that might make some of this info less accurate. My good pal @inichols is looking into this and will follow up here once he has confirmation.

@inichols
Contributor

> @moio This is what I have come up with, feel free to tweak (also @mallardduck feel free to critique as well).
>
> In the event of a failure in the local cluster, it may be necessary to restore Rancher to a previous version. There are currently two supported methods for restoring Rancher: etcd restores, and restores performed by Rancher's Backup Restore Operator. etcd restores are suitable if no user changes that need to be persisted have been made since the snapshot was taken. For example, if a user takes a snapshot and then makes a change to their local cluster which causes an outage, an etcd restore would be the expected restoration mechanism. However, if a user takes a snapshot of the local cluster and then upgrades the kubernetesVersion of a downstream cluster, an etcd restore would reset the kubernetesVersion, and the nodes in that cluster would be running a version of Kubernetes that is higher than their spec. Rancher will attempt to reconcile the version, which could have disastrous results. Likewise, if a user were to take a snapshot of the local cluster, rotate all of the nodes in a Rancher-provisioned downstream cluster, and then restore from the snapshot, the information on how those nodes were provisioned would be lost; the nodes would be orphaned and, in some cases, the cluster could become inoperable. As a general rule, when restoring Rancher, if any meaningful changes have been made to the local cluster (installing charts/apps in the local/downstream clusters, performing day-2 operations on downstream clusters, creating or deleting clusters or nodes within clusters, rotating Rancher certificates in the local cluster), it is recommended to use the Backup Restore Operator to prevent data loss.

Hey, I want to provide some information on BRO. It acts similarly to an etcd restore with respect to downstream cluster state. If you have a cluster on 1.31.7 and a BRO backup from when it was on 1.30.11, a restore will still cause a difference between the running cluster and the Rancher Manager state. This would cause the downstream cluster to go back to 1.30.11. BRO does not account for downstream clusters when restoring in place, as Prune (the default for in-place restores) removes everything that matches the ResourceSet used for the backup but is not present in the backup itself.
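
To make the Prune behaviour concrete, here is a sketch of an in-place Restore resource. The field names follow the BRO Restore CRD; the backup file name is illustrative:

```bash
# Sketch of an in-place restore against the local cluster.
# With prune left at its default (true), resources that match the backup's
# ResourceSet but are absent from the backup file are deleted; setting it to
# false keeps them. The backup file name below is made up.
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-local
spec:
  backupFilename: pre-change-2025-03-28T10-00-00Z.tar.gz
  prune: false   # default is true
EOF
```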

I think this would be a really cool feature, though, and it would really set BRO apart from etcd backups/restores if we could make it downstream-cluster aware. However, I think this would need to be a choice during the restore process. Sometimes you want to set a cluster completely back to how it was at a certain point in time, but other times it might be okay to keep certain clusters around that were not in the backup.

@moio
Contributor Author

moio commented May 5, 2025

> Hey, I want to provide some information on BRO. It acts similarly to an etcd restore with respect to downstream cluster state. If you have a cluster on 1.31.7 and a BRO backup from when it was on 1.30.11, a restore will still cause a difference between the running cluster and the Rancher Manager state. This would cause the downstream cluster to go back to 1.30.11. BRO does not account for downstream clusters when restoring in place, as Prune (the default for in-place restores) removes everything that matches the ResourceSet used for the backup but is not present in the backup itself.

I hope I captured that correctly in https://github.com/rancher/rancher-docs/pull/1705/files#r2073153167 - please let me know.

> I think this would be a really cool feature, though, and it would really set BRO apart from etcd backups/restores if we could make it downstream-cluster aware. However, I think this would need to be a choice during the restore process. Sometimes you want to set a cluster completely back to how it was at a certain point in time, but other times it might be okay to keep certain clusters around that were not in the backup.

I agree, please bring this up to management/PM for further prioritization.

Signed-off-by: Silvio Moioli <[email protected]>
@moio moio requested a review from pmkovar as a code owner June 12, 2025 13:35
@moio
Contributor Author

moio commented Jun 12, 2025

@btat this is now aligned with the feedback from all participants. From my point of view, it's ready to go, and it should also be ported to the product docs with no specific changes.
