
Create routine for backup of DB #96


Open · wants to merge 8 commits into trunk

Conversation


@lcoram lcoram commented Apr 7, 2025

closes #89

@lcoram lcoram self-assigned this Apr 7, 2025

lcoram commented Apr 7, 2025

So far I have tested this on a VM with Postgres installed, and created the DB with a tiny bit of fake data (see comment below). I will check that the crons run; then we would want to run the Ansible on the cluster and see that it starts taking backups. After that we could potentially use task #67 to check that it actually works on larger amounts of data / the real cluster?

@lcoram lcoram requested a review from intarga April 7, 2025 13:53
@lcoram lcoram marked this pull request as ready for review April 7, 2025 13:54

lcoram commented Apr 7, 2025

Create database

sudo -u postgres psql
create database lard;

Enter the lard database:

sudo -u postgres psql -d lard

Apply all the .sql files from the db/ folder.
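As a sketch, assuming the files can be applied in alphabetical order, that could be something like:

for f in db/*.sql; do
  sudo -u postgres psql -d lard -f "$f"
done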

Made some fake data

insert into timeseries(id) values (123),(456);
insert into data(timeseries, obstime, obsvalue) values (123, '2025-01-01 10:20:30', 0), (456, '2025-01-01 10:20:30', 1);

This then looks like:

lard=# select * from data;
 timeseries |        obstime         | obsvalue | qc_usable
------------+------------------------+----------+-----------
        123 | 2025-01-01 10:20:30+00 |        0 | t
        456 | 2025-01-01 10:20:30+00 |        1 | t
(2 rows)

lard=# select * from timeseries;
 id  | fromtime | totime | loc | permit | deactivated
-----+----------+--------+-----+--------+-------------
 123 |          |        |     |        |
 456 |          |        |     |        |
(2 rows)

Create a backup to S3 the same way the cron job would

sudo -u postgres pg_dump lard | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S)
sudo -u postgres pg_dumpall --globals-only | s3cmd put - s3://lard/backups/globals__$(date +%Y%m%d%H%M%S)

It's currently a bit unclear whether we need the globals, and for this testing we probably don't. However, they could be useful if the Postgres installation were more thoroughly wiped?

Delete the database

Then we should be able to test deleting the database and recreating it:

sudo -u postgres psql
drop database lard;

Check it's really not there...

sudo -u postgres psql -d lard

Recreate the database

sudo -u postgres psql
create database lard;

Try to get the backup back from S3 and pipe it into the database.

Get the data back

If doing this for real, find the most recent file!
s3cmd get s3://lard/backups/lard__20250407131431 lard__latest
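A sketch for picking the newest object automatically (relies on the sortable timestamp suffix in the names):

s3cmd ls s3://lard/backups/ | awk '{print $4}' | grep '/lard__' | sort | tail -n 1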

sudo -u postgres psql -U postgres -d lard < lard__latest

Go in and have a look around...
sudo -u postgres psql -d lard

@lcoram lcoram requested review from jo-asplin-met-no and Lun4m April 7, 2025 14:02
@intarga intarga left a comment


Nice work!

I think there are a few more things we should address before closing the issue (I don't mind if you want to address them in a follow-on PR instead of this one):

  • Playbook for restoring backups - Ideally we should be able to just do ansible-playbook ... restore_backup.yml. I know the filename changes, but it should be straightforward to figure out which is the latest programmatically.
  • lard_restricted backups - This seems to just back up/restore the lard db, but it should cover lard_restricted too.
  • Streaming restores - I think you just need to add a - like you do in the put command (see the sketch after this list).
  • S3 space management - I think this is the thorniest issue. If we're doing daily multi-terabyte backups, we're quickly going to run out of space on our S3 cluster; we need to find a way not to do that.
  • Testing on the real and fully populated database - I imagine @Lun4m is done remigrating by now, so you should be good to go to try this. Could also be interesting to see how having ingestion turned on affects it.
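For the streaming restore, a minimal sketch (reusing the example object name from the comment above):

s3cmd get s3://lard/backups/lard__20250407131431 - | sudo -u postgres psql -d lard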


intarga commented Apr 7, 2025

Some suggestions for managing space on s3:

Compression

This isn't a solution on its own, but it might be helpful anyway.
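As a sketch, compression is just one more stage in the existing pipe (zstd is an arbitrary choice here; gzip would also work):

sudo -u postgres pg_dump lard | zstd | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S).zst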

Block deduplication

I don't know if s3 has any capability for this, but if it does that would be great.

Deleting old backups

The most obvious solution, but then we need to be really good about automatically integrity-checking our backups so we're sure we always have one that works (we should probably do that anyway though).
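A sketch of such a thinning job (assumes GNU head and that the timestamped names sort chronologically; keeps the 7 newest):

s3cmd ls s3://lard/backups/ | awk '{print $4}' | grep '/lard__' | sort | head -n -7 | while read -r obj; do
  s3cmd del "$obj"
done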

Incremental backups

Doing full backups only at larger intervals, with incremental backups in between, could let us have a longer backup history and/or more fine-grained backups.
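With PostgreSQL 17 that could look roughly like this (pg_basebackup operates on the whole cluster, not a single database, and the server needs summarize_wal = on):

# full backup; writes a backup_manifest we can reference later
pg_basebackup -U postgres -D /backups/full
# later: back up only what has changed since the full backup
pg_basebackup -U postgres -D /backups/incr1 --incremental=/backups/full/backup_manifest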

ansible.builtin.cron:
  name: "backup lard"
  minute: "0"
  hour: "3,15" # backup daily at 3am and 3pm
Member

3pm is maybe not a good time. It's in peak usage hours, and a pg_dump will probably affect performance.

@lcoram lcoram Apr 7, 2025

I wanted to have one more in the night, and maybe one at the end of the work day (but such that they are 12 hours apart). The problem is that it's UTC as far as I can tell from the VM clock, so currently 2 hours off (so actually 5 and 17), but that changes with winter time. Peak ingestion occurs just after the hour, and usage probably peaks around 10-15 past the hour... So maybe better to do it at like 45 past or something?

Member

Hmm, is there a particular reason we want a 2x daily schedule? I'd have thought 1x daily was good enough, since we have a comfortably larger buffer than that on the Kafka queue / Obsinn. Only doing it at night would make it easier to find a good schedule.

@lcoram lcoram Apr 7, 2025

Ok, then just sometime during the night. At least for a full backup (that we then thin out)... We could consider incremental at other times (that would only need to exist since the last full backup)?
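For example (the exact hour is a placeholder; times are UTC on the VM):

ansible.builtin.cron:
  name: "backup lard"
  minute: "45"
  hour: "1" # nightly full backup, off-peak and away from the top of the hour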

Member

Yeah, I imagine an incremental backup is much lighter on the system, so we can probably get away with doing those whenever.


lcoram commented Apr 7, 2025

Agree on all points @intarga. I will test if I can do a streaming restore (although I'm worried that if it's large and there's a network glitch while it's happening, it could cause issues). I will also look into incremental backups and some sort of thinning cron job for full backups. Additionally, I think I need the name of the VM in the backup name, since we back up both a and b (and one could be corrupt or something)!
Will try to think about what makes sense to get into this PR and what maybe goes in a follow-up.


lcoram commented Apr 7, 2025

We are currently using Postgres 16, right? We might have to go to 17 to get incremental backups... It seems to be a new feature.
https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-INCREMENTAL-BACKUP


intarga commented Apr 7, 2025

(although worried if its large and gets a network glitch while it's happening it could cause issues)

That's a good point. I guess we have to decide what we want to prioritise. Streaming has the advantage of being faster, which, if we're trying to recover from a fault, means less downtime. Non-streaming also has the issue that we need to make sure we have enough disk space on the VM for the backup.

Additionally I think I need the name of the VM in the backup... Since we backup both a and b

I forgot this will run on both 😱 Seems kinda wasteful? The primary is the authoritative version, so if we're only going to back up one, it should be that one. I guess it's a bit tricky for the cron job to know whether it's on the primary or not, though, and that adds an extra point of failure...


intarga commented Apr 7, 2025

We are currently using postgres 16 right? We might have to go to 17 to get incremental backups... Seems it is a new feature.

Nope, we are already on 17 😁


lcoram commented Apr 7, 2025

Good point about it being wasteful; maybe I can try to wrap the cron in a check against repmgr for whether it's the primary?


intarga commented Apr 7, 2025

Good point about it being wasteful; maybe I can try to wrap the cron in a check against repmgr for whether it's the primary?

That was my thought; I did something similar to idempotentise the configure.yml playbook. I'm nervous about the extra point of failure it adds, though. Perhaps we should add a metric for the latest backup time, so we can have an alert if it's ever more than 26 hours ago?
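A sketch of such a guard around the backup command (the query against repmgr's metadata is an assumption about how the nodes are registered):

# only take the backup if this node is currently the repmgr primary
if sudo -u postgres psql -tAc "select type from repmgr.nodes where node_name = '$(hostname)'" | grep -qx primary; then
  sudo -u postgres pg_dump lard | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S)
fi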

@lcoram lcoram added this to the Beta Release milestone Apr 7, 2025

Lun4m commented Apr 8, 2025

If we switch the cron job ON/OFF on switchover/failover, it should be okay, no? Anyway, incremental backups use pg_basebackup (dumps the whole cluster? version-upgrade incompatible?), which is different from pg_dump.


cskarby commented Apr 8, 2025

On the new Kubernetes cluster, the postgresql-operator has support for doing database backups.


lcoram commented Apr 10, 2025

Added a metric that is pushed to the Prometheus push gateway, to know whether the backup has been run (and below I have sketched what I need to add to the rules file for monitoring). However, I still need a different job that will somehow check the integrity of the backup... that will probably also need a metric.

- alert: lard_last_lard_backup_stale
  expr: time() - last_lard_backup{job="lard_backup"} > 604800
  for: 30m
  labels:
    severity: critical
  annotations:
    description: last backup is older than 1 week
- alert: lard_last_lard_restricted_backup_stale
  expr: time() - last_lard_restricted_backup{job="lard_backup"} > 604800
  for: 30m
  labels:
    severity: critical
  annotations:
    description: last restricted backup is older than 1 week
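The push itself can be as simple as something like this (hypothetical gateway address; the gauge name matches the rules above):

echo "last_lard_backup $(date +%s)" | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/lard_backup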

@intarga intarga removed this from the Beta Release milestone Apr 11, 2025
Successfully merging this pull request may close these issues: Basic backups.