
Create routine for backup of DB #96


Open · wants to merge 8 commits into trunk

Conversation


@lcoram lcoram commented Apr 7, 2025

closes #89

@lcoram lcoram self-assigned this Apr 7, 2025

lcoram commented Apr 7, 2025

So far I have tested this on a VM with Postgres installed, and created the DB with a tiny bit of fake data (see comment below). I will check that the crons run; then we would want to run the Ansible on the cluster and see that it starts taking backups. After that we could potentially use task #67 to check that it actually works on larger amounts of data / the real cluster?

@lcoram lcoram requested a review from intarga April 7, 2025 13:53
@lcoram lcoram marked this pull request as ready for review April 7, 2025 13:54

lcoram commented Apr 7, 2025

Create database

sudo -u postgres psql
create database lard;

Enter the lard database:

sudo -u postgres psql -d lard

Apply all the .sql files from the db/ folder.
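As a sketch, assuming the files can be applied in alphabetical order, that could be something like:

for f in db/*.sql; do
  sudo -u postgres psql -d lard -f "$f"
done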

Made some fake data

insert into timeseries(id) values (123),(456);
insert into data(timeseries, obstime, obsvalue) values (123, '2025-01-01 10:20:30', 0), (456, '2025-01-01 10:20:30', 1);

This then looks like:

lard=# select * from data;
 timeseries |        obstime         | obsvalue | qc_usable
------------+------------------------+----------+-----------
        123 | 2025-01-01 10:20:30+00 |        0 | t
        456 | 2025-01-01 10:20:30+00 |        1 | t
(2 rows)

lard=# select * from timeseries;
 id  | fromtime | totime | loc | permit | deactivated
-----+----------+--------+-----+--------+-------------
 123 |          |        |     |        |
 456 |          |        |     |        |
(2 rows)

Create a backup to S3 the same way the cron job would

sudo -u postgres pg_dump lard | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S)
sudo -u postgres pg_dumpall --globals-only | s3cmd put - s3://lard/backups/globals__$(date +%Y%m%d%H%M%S)

It's currently a bit unclear whether we need the globals, and for this testing we probably don't. However, they could be useful if the Postgres installation were more thoroughly wiped?

Delete the database

Then we should be able to test deleting the database and recreating it:

sudo -u postgres psql
drop database lard;

Check it's really not there...

sudo -u postgres psql -d lard

Recreate the database

sudo -u postgres psql
create database lard;

Try to get the backup back from S3 and pipe it into the database.

Get the data back

If doing this for real, find the most recent file!
s3cmd get s3://lard/backups/lard__20250407131431 lard__latest
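A sketch for picking the newest object automatically (relies on the sortable timestamp suffix in the names):

s3cmd ls s3://lard/backups/ | awk '{print $4}' | grep '/lard__' | sort | tail -n 1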

sudo -u postgres psql -U postgres -d lard < lard__latest

Go in and have a look around...
sudo -u postgres psql -d lard

@lcoram lcoram requested review from jo-asplin-met-no and Lun4m April 7, 2025 14:02
@intarga intarga left a comment


Nice work!

I think there are a few more things we should address before closing the issue (I don't mind if you want to address them in a follow-on PR instead of this one):

  • Playbook for restoring backups - Ideally we should be able to just do ansible-playbook ... restore_backup.yml. I know the filename changes, but it should be straightforward to figure out which is the latest programmatically.
  • lard_restricted backups - This seems to just back up/restore the lard db, but it should cover lard_restricted too.
  • Streaming restores - I think you just need to add a - like you do in the put command (see the sketch after this list).
  • S3 space management - I think this is the thorniest issue. If we're doing daily multi-terabyte backups, we're quickly going to run out of space on our S3 cluster; we need to find a way not to do that.
  • Testing on the real and fully populated database - I imagine @Lun4m is done remigrating by now, so you should be good to go to try this. Could also be interesting to see how having ingestion turned on affects it.
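For the streaming restore, a minimal sketch (reusing the example object name from the comment above):

s3cmd get s3://lard/backups/lard__20250407131431 - | sudo -u postgres psql -d lard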


intarga commented Apr 7, 2025

Some suggestions for managing space on s3:

Compression

This isn't a solution on its own, but it might be helpful anyway.
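As a sketch, compression is just one more stage in the existing pipe (zstd is an arbitrary choice here; gzip would also work):

sudo -u postgres pg_dump lard | zstd | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S).zst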

Block deduplication

I don't know if s3 has any capability for this, but if it does that would be great.

Deleting old backups

The most obvious solution, but then we need to be really good about automatically integrity-checking our backups so we're sure we always have one that works (we should probably do that anyway though).
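A sketch of such a thinning job (assumes GNU head and that the timestamped names sort chronologically; keeps the 7 newest):

s3cmd ls s3://lard/backups/ | awk '{print $4}' | grep '/lard__' | sort | head -n -7 | while read -r obj; do
  s3cmd del "$obj"
done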

Incremental backups

Doing full backups only at larger intervals, with incremental backups in between, could let us have a longer backup history and/or more fine-grained backups.
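With PostgreSQL 17 that could look roughly like this (pg_basebackup operates on the whole cluster, not a single database, and the server needs summarize_wal = on):

# full backup; writes a backup_manifest we can reference later
pg_basebackup -U postgres -D /backups/full
# later: back up only what has changed since the full backup
pg_basebackup -U postgres -D /backups/incr1 --incremental=/backups/full/backup_manifest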

ansible.builtin.cron:
  name: "backup lard"
  minute: "0"
  hour: "3,15" # backup daily at 3am and 3pm
Member

3pm is maybe not a good time. It's in peak usage hours, and a pg_dump will probably affect performance.

@lcoram lcoram Apr 7, 2025

I wanted to have one more in the night, and maybe one at the end of the work day (but such that they are 12 hours apart). The problem is that it's UTC as far as I can tell from the VM clock, so currently 2 hours off (so actually 5 and 17), but that changes with winter time. Peak ingestion occurs just after the hour, and usage probably peaks around 10-15 past the hour... So maybe better to do it at like 45 past or something?

Member

Hmm, is there a particular reason we want a 2x daily schedule? I'd have thought 1x daily was good enough, since we have a comfortably larger buffer than that on the Kafka queue / Obsinn. Only doing it at night would make it easier to find a good schedule.

@lcoram lcoram Apr 7, 2025

Ok, then just sometime during the night. At least for a full backup (that we then thin out)... We could consider incremental at other times (that would only need to exist since the last full backup)?
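For example (the exact hour is a placeholder; times are UTC on the VM):

ansible.builtin.cron:
  name: "backup lard"
  minute: "45"
  hour: "1" # nightly full backup, off-peak and away from the top of the hour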

Member

Yeah, I imagine an incremental backup is much lighter on the system, so we can probably get away with doing those whenever.


lcoram commented Apr 7, 2025

Agree on all points @intarga. I will test if I can do a streaming restore (although I'm worried that if it's large and there's a network glitch while it's happening, it could cause issues). I will also look into incremental backups and some sort of thinning cron job for full backups. Additionally, I think I need the name of the VM in the backup name, since we back up both a and b (and one could be corrupt or something)!
Will try to think about what makes sense to get into this PR and what maybe goes in a follow-up.


lcoram commented Apr 7, 2025

We are currently using Postgres 16, right? We might have to go to 17 to get incremental backups... It seems to be a new feature.
https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-INCREMENTAL-BACKUP


intarga commented Apr 7, 2025

(although worried if its large and gets a network glitch while it's happening it could cause issues)

That's a good point. I guess we have to decide what we want to prioritise. Streaming has the advantage of being faster, which, if we're trying to recover from a fault, means less downtime. Non-streaming also has the issue that we need to make sure we have enough disk space on the VM for the backup.

Additionally I think I need the name of the VM in the backup... Since we backup both a and b

I forgot this will run on both 😱 Seems kinda wasteful? The primary is the authoritative version, so if we're only going to back up one, it should be that one. I guess it's a bit tricky for the cron job to know whether it's on the primary or not, though, and that adds an extra point of failure...


intarga commented Apr 7, 2025

We are currently using postgres 16 right? We might have to go to 17 to get incremental backups... Seems it is a new feature.

Nope, we are already on 17 😁


lcoram commented Apr 7, 2025

Good point about it being wasteful; maybe I can try to wrap the cron in a check against repmgr for whether it's the primary?


intarga commented Apr 7, 2025

Good point about it being wasteful; maybe I can try to wrap the cron in a check against repmgr for whether it's the primary?

That was my thought; I did something similar to idempotentise the configure.yml playbook. I'm nervous about the extra point of failure it adds, though. Perhaps we should add a metric for the latest backup time, so we can have an alert if it's ever more than 26 hours ago?
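A sketch of such a guard around the backup command (the query against repmgr's metadata is an assumption about how the nodes are registered):

# only take the backup if this node is currently the repmgr primary
if sudo -u postgres psql -tAc "select type from repmgr.nodes where node_name = '$(hostname)'" | grep -qx primary; then
  sudo -u postgres pg_dump lard | s3cmd put - s3://lard/backups/lard__$(date +%Y%m%d%H%M%S)
fi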

@lcoram lcoram added this to the Beta Release milestone Apr 7, 2025

Lun4m commented Apr 8, 2025

If we switch the cron job ON/OFF on switchover/failover, it should be okay, no? Anyway, incremental backups use pg_basebackup (dumps the whole cluster? version-upgrade incompatible?), which is different from pg_dump.


cskarby commented Apr 8, 2025

On the new Kubernetes cluster, the postgresql-operator has support for doing database backups.


lcoram commented Apr 10, 2025

Added a metric that is pushed to the Prometheus push gateway, to know whether the backup has been run (and below I have sketched what I need to add to the rules file for monitoring). However, I still need a different job that will somehow check the integrity of the backup... that will probably also need a metric.

- alert: lard_last_lard_backup_stale
  expr: time() - last_lard_backup{job="lard_backup"} > 604800
  for: 30m
  labels:
    severity: critical
  annotations:
    description: last backup is older than 1 week
- alert: lard_last_lard_restricted_backup_stale
  expr: time() - last_lard_restricted_backup{job="lard_backup"} > 604800
  for: 30m
  labels:
    severity: critical
  annotations:
    description: last restricted backup is older than 1 week
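The push itself can be as simple as something like this (hypothetical gateway address; the gauge name matches the rules above):

echo "last_lard_backup $(date +%s)" | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/lard_backup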

@intarga intarga removed this from the Beta Release milestone Apr 11, 2025
Successfully merging this pull request may close these issues: Basic backups.