
Database v1 #134

Merged
merged 24 commits into master from database-v1 on Jan 15, 2024

Conversation

JamesWrigley
Member

These are all the changes that make up what I'm obnoxiously going to call v1 of the DAMNIT format 😛

TL;DR:

  • Create image thumbnails during processing (73d75a6 and 9d37867)
  • Allow saving Figures (02e8540). This isn't really related to v1 but it was too complicated to untangle from the other commits I cherry-picked...
  • New long-narrow database schema, and netcdf inside the HDF5 files for xr.Dataset and xr.DataArray objects (fae05a2, 77e33c1, 66f067b, 94c9ea5); see the sketch after this list
  • An elementary read-only API, which I hope will be moved out to a separate package one day (77e33c1). I attempted to preserve backwards compatibility here, but I don't think that's feasible any more, so a lot of that code is probably unused and could be deleted. I also wrote it quite feverishly one late evening, so the design is probably rubbish.
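
For illustration, a minimal sketch (not code from the PR) of the "netcdf in the HDF5 files" idea: netCDF4 is itself a constrained flavour of HDF5, so with xarray's h5netcdf engine a Dataset can be written into a group of an HDF5 file. The file name and group here are made up.

import numpy as np
import xarray as xr

ds = xr.Dataset({"image": (("y", "x"), np.zeros((4, 4)))})
# netCDF4 is a constrained flavour of HDF5, so this writes valid HDF5:
ds.to_netcdf("results.h5", group="my_variable", mode="a", engine="h5netcdf")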

As usual all the commits are atomic so I would recommend reviewing each individually. Here's the commit message from fae05a2 for some of the gnarly details:

The goals of v1 are to:
- Get the backend to generate thumbnails and store them as summaries, instead of
  relying on the GUI to do it
- Add support for storing variable metadata (like the type and whether it's
  changing)
- Replace the HDF5 data format for DataArrays with netcdf
- Add support for Datasets with netcdf

This commit only updates the database schema; supporting netcdf comes in the
next commit.

Major changes:
- Replace the old schema with a long-narrow schema. Now we store additional
  metadata about variables like the stored type.
- Add ctxrunner.DataType to represent the various types we support
- Add DamnitDB.ReducedData to represent summary data and its metadata
- Add helper functions to DamnitDB for things like setting a variable and
  getting a list of variables
- Forbid None values being passed to add_to_db()
- Add a tests/helpers submodule to hold utility functions for tests
- Use `amore-proto reprocess --mock` to generate mock data in tests

Things that should be done before this is merged:

  • Get the GUI to check whether the database format is old, and if so refuse to open it and prompt the user to migrate it (probably tell them to ask us for now). A sketch of such a check follows below.
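
For instance, a sketch only, using SQLite's user_version pragma as an assumed place to store the format version (DAMNIT's actual mechanism may differ):

import sqlite3

DATA_FORMAT_VERSION = 1  # assumed constant

def check_db_version(path):
    con = sqlite3.connect(path)
    version = con.execute("PRAGMA user_version").fetchone()[0]
    if version < DATA_FORMAT_VERSION:
        raise RuntimeError(
            f"Database is format v{version}, expected v{DATA_FORMAT_VERSION}; "
            "please migrate it (ask the developers for help for now)."
        )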

Things that should be done at some point but aren't strictly necessary before merging:

  • Combine `amore-proto migrate images` and `amore-proto migrate v0-to-v1`
  • Add tests for `amore-proto migrate v0-to-v1`
  • Reconsider the code for the migrations. I tried to make it as standalone as possible so it doesn't depend on anything in DamnitDB, but that does make the code messy. It might be better to soft-deprecate old DamnitDB methods by renaming them and calling them from the migrations.
  • Add a summary column to store the summary used for each variable. That was a major oversight on my part...

I won't be able to work on this for the next two weeks, so if someone else wants to take over feel free.

@takluyver
Copy link
Member

Thumbnail storage: I believe we only use the 'summary' data in the DB for drawing the thumbnail in the table? I.e. double clicking retrieves the un-summarised image from the HDF5 file to plot. This being the case, I think it would make sense to scale down the image before storing it, for faster loading. This would mean we'd have to regenerate thumbnails if we wanted to display them larger, but that seems like an acceptable compromise, as we can just display the existing thumbnail until that's done.

Perhaps this is aesthetic, but I'd also like to get away from storing pickled data, and thumbnails seem like an easy place to start. I believe that no valid pickle can start with the magic numbers of a PNG file, for instance, so we could store PNGs in the database with no ambiguity, and QPixmap can parse PNGs directly. Or Numpy's .npy file format would also work.
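
For illustration, a minimal sketch of that disambiguation (not code from the PR): every PNG begins with a fixed 8-byte signature that no pickle stream starts with, so a reader can branch on the first bytes of the blob.

import pickle

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def load_summary_blob(blob: bytes):
    """Return raw PNG bytes, or the unpickled object for legacy summaries."""
    if blob.startswith(PNG_MAGIC):
        return blob  # e.g. hand straight to QPixmap.loadFromData()
    return pickle.loads(blob)  # legacy pickled summary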

I'll continue looking through this tomorrow. 🙂

@JamesWrigley
Member Author

> Thumbnail storage: I believe we only use the 'summary' data in the DB for drawing the thumbnail in the table? I.e. double clicking retrieves the un-summarised image from the HDF5 file to plot. This being the case, I think it would make sense to scale down the image before storing it, for faster loading. This would mean we'd have to regenerate thumbnails if we wanted to display them larger, but that seems like an acceptable compromise, as we can just display the existing thumbnail until that's done.

Yep, we already do that:
https://github.com/European-XFEL/DAMNIT/blob/master/damnit/ctxsupport/ctxrunner.py#L329-L337

@takluyver
Member

I hadn't seen that, but even so: we scale it down to 150 pixels (max dimension) to store, but then we scale it down again to 35 pixels to display:

DAMNIT/damnit/gui/table.py

Lines 323 to 324 in 4319e98

return QtGui.QPixmap(image).scaled(QtCore.QSize(THUMBNAIL_SIZE, THUMBNAIL_SIZE),
                                   Qt.KeepAspectRatio)

CREATE TABLE IF NOT EXISTS run_info(proposal, run, start_time, added_at);
CREATE UNIQUE INDEX IF NOT EXISTS proposal_run ON run_info (proposal, run);

CREATE TABLE IF NOT EXISTS run_variables(proposal, run, name, version, value, timestamp, stored_type, max_diff, provenance);
@takluyver takluyver Nov 17, 2023
Member

The stored_type column seems somewhat confusing to me, because we're storing it with the reduced data, but it describes the un-reduced data (where those are different, e.g. arrays).

Not sure what to do about this, but I want to remember to think about it.
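
As a point of reference, a hypothetical sketch (not part of the PR) of reading the long-narrow run_variables table above back into the familiar wide, run-per-row form; the database path is assumed.

import sqlite3
import pandas as pd

con = sqlite3.connect("runs.sqlite")  # hypothetical path
narrow = pd.read_sql_query(
    "SELECT proposal, run, name, value FROM run_variables", con)
# One row per (proposal, run), one column per variable name:
wide = narrow.pivot_table(index=["proposal", "run"], columns="name",
                          values="value", aggfunc="first")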

except Exception:
    pass
except Exception as e:
    log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}")

Suggested change
- log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}")
+ log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}'", exc_info=True)

This should give us a traceback in the logging.
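
A self-contained illustration of the difference (names made up): with exc_info=True the logging module appends the full traceback after the message, whereas interpolating {e} keeps only the exception text.

import logging

logging.basicConfig()
log = logging.getLogger(__name__)

try:
    1 / 0
except Exception:
    # Prints the warning followed by the ZeroDivisionError traceback:
    log.warning("Couldn't retrieve data for 'x', 'y'", exc_info=True)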

@JamesWrigley
Member Author

Things (for me 😇) to do before merging:

  • Test the migrations
  • Forbid `amore-proto` from opening unmigrated databases
  • Add a `summary` column to store which summary method was used
  • Extend the variables table to store context file variables too, which should partially solve "Support a 'database refresh' from the context file" (#74)

@takluyver, @CammilleCC, does that cover everything? Is there anything else we need from the database?

@takluyver
Member

> Extend the variables table to store context file variables too

I'd say we should do that change separately after we've landed this PR. This is already a pretty massive set of changes, and I think adding computed variables into the database is going to be another substantial piece of work. Maybe this is just my personal preference, but I don't find it easy working in this model of giant feature branches, so I'd really like to get the changes we've already done into master.

> Add a summary column to store which summary method was used

Does this belong in the run_variables table, or in variables? More concretely, do you have an idea how we'll use that information?

@JamesWrigley
Member Author

> I'd say we should do that change separately after we've landed this PR. This is already a pretty massive set of changes, and I think adding computed variables into the database is going to be another substantial piece of work. Maybe this is just my personal preference, but I don't find it easy working in this model of giant feature branches, so I'd really like to get the changes we've already done into master.

Ok, I'll leave it for a later PR. But I think it should be part of v1, so I'll hold off on writing the intermediate migrations until then (boy, that's going to be fun 😨).

> Does this belong in the run_variables table, or in variables? More concretely, do you have an idea how we'll use that information?

This should go in run_variables since it may change, and we should display it in the GUI (maybe as a tooltip) so that users know how the summary was calculated.
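
In schema terms that would be something like the following sketch; the column name summary_method is an assumption, the discussion only says "a summary column".

import sqlite3

con = sqlite3.connect("runs.sqlite")  # hypothetical path
con.execute("ALTER TABLE run_variables ADD COLUMN summary_method")
con.commit()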

@takluyver
Member

> But I think it should be part of v1, so I'll hold off on writing the intermediate migrations until then (boy, that's going to be fun 😨).

If more databases are created before we merge that extra change, which seems quite possible to me, I'd say we should call the version after that v2 (or v1.1, if we can make it compatible enough). I think the migrations will be easier to deal with if we've got an ID for each version 'in the wild', rather than having to special-case the databases which are in between v0 and v1.

I dislike that we've already got some databases labelled v1 that don't match what we've ended up calling v1, but 🤷 adding a version number for the first time is inevitably a bit messy. But now that we've got a version number, let's use it! 🙂

@CammilleCC
Member

Thank you very much for the improvements, both! Looking great so far; I'll test further in the coming weeks.

A wishlist item from the frontend: could you please add another view for the latest variable timestamps? Ideally we'd like to check when a variable was last updated, and while one could query run_variables for the latest reprocess timestamp, it might be worthwhile to add another view instead.
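
A hypothetical sketch of such a view (the view name and MAX(timestamp) grouping are assumptions, not part of the PR):

import sqlite3

con = sqlite3.connect("runs.sqlite")  # hypothetical path
con.execute("""
    CREATE VIEW IF NOT EXISTS latest_variable_timestamps AS
    SELECT name, MAX(timestamp) AS latest_timestamp
    FROM run_variables
    GROUP BY name
""")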

@JamesWrigley
Member Author

RIP my unrealistic deadlines 🪦

New checklist for me:

  • Test the migrations
  • Forbid `amore-proto` from opening unmigrated databases
  • Add a `summary` column to store which summary method was used
  • A view for the variable timestamps

JamesWrigley and others added 15 commits January 15, 2024 13:24
Previously only the scaled down 2D arrays were saved as summaries and the RGBA
thumbnails were created on-the-fly by the GUI. But that's really slow. With this
change, opening a database with a single image variable and 400 runs went down
to ~6.5s from ~12s.
Now that the GUI expects full RGBA thumbnails as image summaries, we have to
provide a way for existing databases to upgrade. I considered supporting both
kinds of summaries in the GUI, but decided that in the long run it'd be simpler
if we just upgraded databases instead of going for full backwards-compatibility
in the GUI.
It will be treated automatically as an RGBA image by the backend and GUI.
Also made `DamnitDB.ensure_run()` allow a default timestamp of the current
time.
Also ended up writing a simple API.
@JamesWrigley JamesWrigley merged commit 76af052 into master Jan 15, 2024
3 checks passed
@JamesWrigley JamesWrigley deleted the database-v1 branch January 15, 2024 12:31