Create routine for backup of DB #96
base: trunk
Conversation
So far I have tested this on a VM with postgres installed, and created the DB with a tiny bit of fake data (see comment below). I will check that the crons run, then I think we should run the ansible on the cluster and see that it starts taking backups. Then we could potentially use task #67 to check that it actually works on larger amounts of data / the real cluster?
**Create database**

**Made some fake data**

This then looks like:

```
lard=# select * from data;
lard=# select * from timeseries;
```

**Create a backup to the S3 the same way the cron should**

Currently a bit unclear if we need the globals, and for this testing we probably don't. However, they could be useful if the postgres installation was more thoroughly wiped?

**Delete the database**

Then should be able to test deleting the database and recreating it.

**Recreate the database**

**Get the data back**

If doing this for real, find the most recent file!

Go in and have a look around...
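For reference, the test cycle above could look roughly like this on the command line. The bucket name, filenames, and the use of the AWS CLI as the S3 client are illustrative assumptions, not necessarily what the playbook actually does:

```sh
# Dump the lard db as plain SQL and stream it straight to S3 (no temp file on disk).
pg_dump -U postgres lard | aws s3 cp - "s3://lard-backups/lard-$(date +%F).sql"

# Optionally also dump the globals (roles etc.) for a full-wipe scenario.
pg_dumpall -U postgres --globals-only | aws s3 cp - "s3://lard-backups/globals-$(date +%F).sql"

# Simulate the disaster: drop and recreate the database.
dropdb -U postgres lard
createdb -U postgres lard

# Stream the dump back in. If doing this for real, pick the newest object!
DUMP=lard-2025-01-01.sql  # hypothetical filename
aws s3 cp "s3://lard-backups/$DUMP" - | psql -U postgres lard
```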
Nice work!
I think there are a few more things we should address before closing the issue (I don't mind if you want to address them in a follow-on PR instead of this one):
- **Playbook for restoring backups** - Ideally we should be able to just do `ansible-playbook ... restore_backup.yml`. I know the filename changes, but it should be straightforward to figure out which is the latest programmatically (see the sketch after this list).
- **`lard_restricted` backups** - This seems to just backup/restore the `lard` db, but it should cover `lard_restricted` too.
- **Streaming restores** - I think you just need to add a `-` like you do in the put command (also shown in the sketch below).
- **S3 space management** - I think this is the thorniest issue. If we're doing daily multi-terabyte backups, we're quickly going to run out of space on our S3 cluster; we need to find a way to not do that.
- **Testing on the real and fully populated database** - I imagine @Lun4m is done remigrating by now, so you should be a go to try this. Could also be interesting to see how having ingestion turned on affects it.
Some suggestions for managing space on s3:

- **Compression** - This isn't a solution on its own, but it might be helpful anyway.
- **Block deduplication** - I don't know if s3 has any capability for this, but if it does that would be great.
- **Deleting old backups** - The most obvious solution, but then we need to be really good about automatically integrity-checking our backups so we're sure we always have one that works (we should probably do that anyway though).
- **Incremental backups** - Doing full backups only on larger intervals, and only having incremental backups in between, could let us have a longer backup history and/or more fine-grained backups.
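On the "deleting old backups" option: if the S3 cluster supports bucket lifecycle configuration, expiry can be delegated to S3 itself rather than another cron job. A sketch with the AWS CLI; the bucket name and 30-day retention are made-up examples, and whether our S3 implementation supports lifecycle rules would need checking:

```sh
aws s3api put-bucket-lifecycle-configuration \
  --bucket lard-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-backups",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }]
  }'
```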
ansible/roles/backups/tasks/main.yml
```yaml
ansible.builtin.cron:
  name: "backup lard"
  minute: "0"
  hour: "3,15" # backup daily at 3am and 3pm
```
3pm is maybe not a good time. It's in peak usage hours, and a pg_dump will probably affect performance
I wanted to have one that was more in the night, and maybe one at the end of the work day (but such that they are 12 hours apart). The problem is that it's UTC as far as I can tell based on the VM clock, so currently 2 hours off (so actually 5 and 17), but then that changes with winter time. Peak ingestion occurs just after the hour, and usage probably peaks around 10-15 minutes after the hour... So maybe better to do it at like 45 past?
Hmm is there a particular reason we want a 2x daily schedule? I'd have thought 1x daily was good enough since we have a comfortably larger buffer than that on the kafka queue / Obsinn. Only doing it at night would make it easier to find a good schedule
Ok, then just sometime during the night. At least for a full backup (that we then thin out)... We could consider incremental at other times (that would only need to exist since the last full backup)?
Yeah, I imagine an incremental backup is much lighter on the system, so we can probably get away with doing those whenever
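A hypothetical crontab for the schedule this thread converges on: one nightly full backup at 45 past the hour to dodge the post-hour ingestion and usage spikes, with optional daytime incrementals. Times are UTC and the script names are placeholders:

```
45 2 * * * /usr/local/bin/lard_full_backup.sh
# If incrementals are added later, they could run during the day, e.g.:
45 8,14,20 * * * /usr/local/bin/lard_incremental_backup.sh
```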
Agree on all points @intarga. I will test if I can do a streaming restore (although I'm worried that if it's large and gets a network glitch while it's happening, it could cause issues). I will also look into incremental backups and some sort of thinning cron job for full backups. Additionally I think I need the name of the VM in the backup... since we back up both a and b (and one could be corrupt or something)!
We are currently using postgres 16, right? We might have to go to 17 to get incremental backups... it seems to be a new feature.
That's a good point. I guess we have to decide what we want to prioritise. Streaming has the advantage of being faster, which if we're trying to recover from a fault means less downtime. Non-streaming also has the issue that we need to make sure we have enough disk space on the VM for the backup
I forgot this will run on both 😱 Seems kinda wasteful? The primary is the authoritative version, so if we're only going to back up one, it should be that one. I guess it's a bit tricky for the cron job to know whether it's on the primary or not though, and that adds an extra point of failure...
Nope, we are already on 17 😁
Good point about wasteful, maybe I can try to wrap the cron in a check with repmgr for whether it's the primary?
That was my thought, I did something similar to idempotentise the configure.yml playbook. I'm nervous about the extra point of failure it adds though. Perhaps we should add a metric for latest backup time, so we can have an alert if it's ever more than 26 hours ago?
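A sketch of the primary-only guard: rather than calling repmgr, the cron command could ask postgres itself, since `pg_is_in_recovery()` returns false only on the primary (the backup script name is a placeholder):

```sh
# 'f' on the primary, 't' on a standby; only the primary runs the backup.
if [ "$(psql -U postgres -tAc 'SELECT pg_is_in_recovery()')" = "f" ]; then
    /usr/local/bin/lard_backup.sh
fi
```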
If we switch the cron job ON/OFF on switchover/failover, it should be okay, no? Anyway, incremental backups use `pg_basebackup`.
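For reference, the PG17 incremental flow could look roughly like this; paths are illustrative, and it requires `summarize_wal = on` in postgresql.conf:

```sh
# Full base backup first:
pg_basebackup -D /backups/full --checkpoint=fast

# Later, an incremental backup referencing the previous backup's manifest:
pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

# To restore, merge the chain back into a complete data directory:
pg_combinebackup /backups/full /backups/incr1 -o /restore/data
```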
On the new Kubernetes cluster, the postgresql-operator has support for doing database backups.
Added a metric that is pushed to the prometheus gateway for knowing if the backup has been called (and below I have thought about what I need to add to the rules file for monitoring). However, I still need a different job that will check the integrity of the backup somehow... this will probably also need a metric.
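A sketch of the metric push, assuming the pushgateway's standard text-format endpoint; the gateway URL and metric name are made up for illustration:

```sh
# Push a "last successful backup" timestamp after each run.
echo "lard_backup_last_success_seconds $(date +%s)" | \
  curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/lard_backup
```

An alerting rule along the lines of `time() - lard_backup_last_success_seconds > 26 * 3600` would then cover the 26-hour threshold discussed above.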
closes #89