backup and restore: add warning not to use snapshots #1705
base: main
Conversation
cc @mallardduck please take a look if this makes sense from your perspective
From my understanding, the limitations of etcd as an option for the local cluster are more focused on RKE1 compared to other runtimes. For instance, I know some PSEs (@inichols might be able to give some input) have customers that only use etcd for local backups and do not use BRO at all, so I think we should make sure we capture that nuance and communicate it properly. From my understanding, etcd is a safe option for full local (Rancher) cluster backups when using k3s/RKE2, for instance. In the BRO docs, we specifically call out that BRO/Rancher Backups is not an etcd replacement, so we should be sure not to create a conflicting message between these docs and our BRO readme.
@mallardduck Want to clear up some things:
Though simpler, the cons are very bad here. Since the state of downstream clusters is stored within the local cluster, I wouldn't recommend that users rely only on snapshots if they are performing massive infrastructure changes. I think this PR is correct to say that etcd restores should be treated as a last resort, because BRO has the advantage of being able to restore a local cluster with a partial state. We have users that rotate the nodes in their clusters as part of maintenance, so having a selective restoration tool that lets you roll back to a certain state without losing everything created since makes resolving issues much simpler. I don't think any of that contradicts what is in https://github.com/rancher/backup-restore-operator?tab=readme-ov-file#use-cases, since it fills a use case for which etcd snapshots are not suitable in the first place.
I think my concern lies in how vague the wording of this part sounds:
What issues, and when? Why should it even be a last resort if it can sometimes cause issues? Will knowing that it can sometimes cause issues, but being uncertain about what those issues are and when to expect them, be reassuring if I am a customer? Those are some of the things I'm considering, and why something doesn't sit quite right currently. (Obviously @jakefhyde answered some of this with his clarification. 😆)
Beyond that, I have concerns around the fact that a "Rancher Backup in-place restore" and a "Rancher Backup migration" are drastically different actions that customers already confuse quite often. So while not directly related to this change, the choice customers need to make isn't just binary, but rather three-pronged: "Can I do a Rancher Backup restore, or do I have to do a Rancher Backup migration?" (the first two options), and finally, "When should I even consider an etcd snapshot restore?" (the third).
I think that if we can clarify the above areas, it will help make things clearer overall. So if this change could incorporate some of the clarifications you provided, that may be more valuable for customers (and community users, since this is community docs).
@mallardduck I think that's a reasonable criticism. I think we would do well to inform the user about when one is recommended over the other, and what the potential tradeoffs are. IMO an etcd restore would even be usable for Rancher version rollbacks, as long as the upgrade didn't cause any changes to downstream clusters, but we never really guarantee that (it's hard to, anyway, given that the cluster-agent will immediately be rolled out on upgrade).
@jakefhyde - I wonder, between your feedback and mine, if a good way to communicate all this would be a new page specifically comparing and contrasting etcd (in the context of the local cluster only) vs Rancher Backup? Then the snippet we add to call out this aspect on this page could read more like:
Then we can include some of those details that better explain the risk we called out in the paragraph above this line. @moio What do you think of these additions to add more clarity around the two tools, providing a guide to compare/contrast so users can pick the best strategy (hopefully a mix of both) for their use cases?
I think adding a page with a more detailed analysis, and linking it from the warning, is the best idea. My problem is that I do not feel competent enough to be authoritative about the details; the best I can do is copy-paste what you mentioned above. Can either of you draft such a page? You have push permissions to this branch, so just push a new markdown file and let's iterate on the content. Then I can take care of the fine editing, rancher-docs specifics, and addressing the Doc Team feedback until the PR is merged.
@moio I can draft something up tomorrow |
Thank you. BTW I consider this an "important not urgent" topic, so it's perfectly fine waiting a week or two as well, as long as it isn't forgotten 😇 |
@moio This is what I have come up with, feel free to tweak (also @mallardduck feel free to critique as well).
Signed-off-by: Silvio Moioli <[email protected]>
Thanks @jakefhyde - I reworded your piece for consistency with the jargon in the k3s and RKE2 manuals and other parts of the document, then copy edited again with some Gemini help for clarity and conciseness, as I am not a native English speaker. Can you please confirm this is now OK, so that we can pass it to the Docs team for their part of the review?
@moio just curious why you changed it to
The terminology comes from the official docs: https://docs.k3s.io/datastore/backup-restore. I did not want to introduce user confusion (although I understand the internal names are clearer for us who work with them every day...). Philosophically, it might not even be an etcd database underneath: it could be SQLite or something else entirely via kine.
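For context, the snapshot workflow those k3s docs describe looks roughly like the following sketch (the snapshot name and path are hypothetical, and exact flags depend on the installed k3s version):

```shell
# Take an on-demand etcd snapshot on a server node.
k3s etcd-snapshot save --name pre-upgrade

# List the snapshots available on this node.
k3s etcd-snapshot list

# To restore: stop k3s on all server nodes, then reset the cluster
# from a chosen snapshot on one node (the path below is illustrative).
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-upgrade-<node>-<timestamp>
```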
Hey @moio / @jakefhyde - I just realized that there are settings in BRO that we use for in-place restores that might make some of this info less accurate. My good pal @inichols is looking into this for us and will follow up for us here once he has confirmation. |
Hey, I want to provide some information on BRO. It acts similarly to an etcd restore for the downstream clusters' state. If you have a cluster on 1.31.7 and a BRO backup from 1.30.11, a restore will still cause a difference between the running cluster and the Rancher Manager state, and the downstream cluster would go back to 1.30.11. BRO does not take downstream clusters into account when restoring in place, as Prune (the default for in-place restores) removes everything that is not present in the backup and matches the resourceSet used for the backup.
I think making BRO downstream-cluster aware would be a really cool feature, and it would really set BRO apart from etcd backups/restores. However, I think this would need to be a choice during the restore process: sometimes you want to completely set a cluster back to how it was at a certain state, but other times it might be okay to keep certain clusters around that were not in the backup.
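To make the Prune behavior described above concrete, an in-place BRO restore might look roughly like this sketch (the backup filename is hypothetical; the fields follow the backup-restore-operator README):

```shell
# Apply a Restore custom resource. With prune enabled (the default for
# in-place restores), resources matching the backup's resourceSet that
# are not present in the backup are deleted during the restore.
cat <<'EOF' | kubectl apply -f -
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-in-place
spec:
  backupFilename: rancher-backup-2025-01-01.tar.gz  # hypothetical filename
  prune: true
EOF
```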
...o-guides/new-user-guides/backup-restore-and-disaster-recovery/back-up-restore-usage-guide.md
I hope I captured that correctly in https://github.com/rancher/rancher-docs/pull/1705/files#r2073153167 - please let me know.
I agree, please bring this up to management/PM for further prioritization.
@btat this is now aligned with feedback from all participants. From my point of view, it's ready to go, and it should also be ported to the product docs with no specific changes.
Description
Relying on etcd snapshots for the upstream ("local") cluster is known to cause issues.
The Backup and Restore Operator was created to avoid those.
We should clarify that to customers.
Other context: https://www.suse.com/support/kb/doc/?id=000021770
Comments
I opened a PR instead of an issue in the hope that it will help fix the problem sooner - and also because it was the easiest way for me to communicate clearly what the problem is. I hope this helps!