
Database v1 #134

Merged
merged 24 commits into master from database-v1 on Jan 15, 2024

Conversation

JamesWrigley
Member

These are all the changes that make up what I'm obnoxiously going to call v1 of the DAMNIT format 😛

TL;DR:

  • Create image thumbnails during processing (73d75a6 and 9d37867)
  • Allow saving Figures (02e8540). This isn't really related to v1 but it was too complicated to untangle from the other commits I cherry-picked...
  • New long-narrow database schema, and netcdf inside the HDF5 files for xr.Dataset and xr.DataArray objects (fae05a2, 77e33c1, 66f067b, 94c9ea5); see the sketch after this list
  • An elementary read-only API, which I hope will be moved out to a separate package one day (77e33c1). I attempted to preserve backwards compatibility here, but I don't think that's feasible any more, so a lot of that code is probably unused and could be deleted. I also wrote it quite feverishly one late evening, so the design is probably rubbish.
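
For illustration, a minimal sketch (not code from the PR) of the "netcdf in the HDF5 files" idea: netCDF4 is itself a constrained flavour of HDF5, so with xarray's h5netcdf engine a Dataset can be written into a group of an HDF5 file. The file name and group here are made up.

import numpy as np
import xarray as xr

ds = xr.Dataset({"image": (("y", "x"), np.zeros((4, 4)))})
# netCDF4 is a constrained flavour of HDF5, so this writes valid HDF5:
ds.to_netcdf("results.h5", group="my_variable", mode="a", engine="h5netcdf")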

As usual all the commits are atomic so I would recommend reviewing each individually. Here's the commit message from fae05a2 for some of the gnarly details:

The goals of v1 are to:
- Get the backend to generate thumbnails and store them as summaries, instead of
  relying on the GUI to do it
- Add support for storing variable metadata (like the type and whether it's
  changing)
- Replace the HDF5 data format for DataArrays with netcdf
- Add support for Datasets with netcdf

This commit only updates the database schema; supporting netcdf comes in the
next commit.

Major changes:
- Replace the old schema with a long-narrow schema. Now we store additional
  metadata about variables like the stored type.
- Add ctxrunner.DataType to represent the various types we support
- Add DamnitDB.ReducedData to represent summary data and its metadata
- Add helper functions to DamnitDB for things like setting a variable and
  getting a list of variables
- Forbid None values being passed to add_to_db()
- Add a tests/helpers submodule to hold utility functions for tests
- Use `amore-proto reprocess --mock` to generate mock data in tests

Things that should be done before this is merged:

  • Get the GUI to check whether the database format is old, and if so refuse to open it and prompt the user to migrate it (probably tell them to ask us for now). A sketch of such a check follows below.
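
For instance, a sketch only, using SQLite's user_version pragma as an assumed place to store the format version (DAMNIT's actual mechanism may differ):

import sqlite3

DATA_FORMAT_VERSION = 1  # assumed constant

def check_db_version(path):
    con = sqlite3.connect(path)
    version = con.execute("PRAGMA user_version").fetchone()[0]
    if version < DATA_FORMAT_VERSION:
        raise RuntimeError(
            f"Database is format v{version}, expected v{DATA_FORMAT_VERSION}; "
            "please migrate it (ask the developers for help for now)."
        )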

Things that should be done at some point but aren't strictly necessary before merging:

  • Combine `amore-proto migrate images` and `amore-proto migrate v0-to-v1`
  • Add tests for `amore-proto migrate v0-to-v1`
  • Reconsider the code for the migrations. I tried to make it as standalone as possible so it doesn't depend on anything in DamnitDB, but that does make the code messy. It might be better to soft-deprecate old DamnitDB methods by renaming them and calling them from the migrations.
  • Add a summary column to store the summary used for each variable. That was a major oversight on my part...

I won't be able to work on this for the next two weeks, so if someone else wants to take over feel free.

@takluyver
Copy link
Member

Thumbnail storage: I believe we only use the 'summary' data in the DB for drawing the thumbnail in the table? I.e. double clicking retrieves the un-summarised image from the HDF5 file to plot. This being the case, I think it would make sense to scale down the image before storing it, for faster loading. This would mean we'd have to regenerate thumbnails if we wanted to display them larger, but that seems like an acceptable compromise, as we can just display the existing thumbnail until that's done.

Perhaps this is aesthetic, but I'd also like to get away from storing pickled data, and thumbnails seem like an easy place to start. I believe that no valid pickle can start with the magic numbers of a PNG file, for instance, so we could store PNGs in the database with no ambiguity, and QPixmap can parse PNGs directly. Or Numpy's .npy file format would also work.
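
For illustration, a minimal sketch of that disambiguation (not code from the PR): every PNG begins with a fixed 8-byte signature that no pickle stream starts with, so a reader can branch on the first bytes of the blob.

import pickle

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def load_summary_blob(blob: bytes):
    """Return raw PNG bytes, or the unpickled object for legacy summaries."""
    if blob.startswith(PNG_MAGIC):
        return blob  # e.g. hand straight to QPixmap.loadFromData()
    return pickle.loads(blob)  # legacy pickled summary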

I'll continue looking through this tomorrow. 🙂

@JamesWrigley
Member Author

> Thumbnail storage: I believe we only use the 'summary' data in the DB for drawing the thumbnail in the table? I.e. double clicking retrieves the un-summarised image from the HDF5 file to plot. This being the case, I think it would make sense to scale down the image before storing it, for faster loading. This would mean we'd have to regenerate thumbnails if we wanted to display them larger, but that seems like an acceptable compromise, as we can just display the existing thumbnail until that's done.

Yep, we already do that:
https://github.com/European-XFEL/DAMNIT/blob/master/damnit/ctxsupport/ctxrunner.py#L329-L337

@takluyver
Member

I hadn't seen that, but even so: we scale it down to 150 pixels (max dimension) to store, but then we scale it down again to 35 pixels to display:

DAMNIT/damnit/gui/table.py

Lines 323 to 324 in 4319e98

return QtGui.QPixmap(image).scaled(QtCore.QSize(THUMBNAIL_SIZE, THUMBNAIL_SIZE),
                                   Qt.KeepAspectRatio)

CREATE TABLE IF NOT EXISTS run_info(proposal, run, start_time, added_at);
CREATE UNIQUE INDEX IF NOT EXISTS proposal_run ON run_info (proposal, run);

CREATE TABLE IF NOT EXISTS run_variables(proposal, run, name, version, value, timestamp, stored_type, max_diff, provenance);
@takluyver takluyver Nov 17, 2023
Member

The stored_type column seems somewhat confusing to me, because we're storing it with the reduced data, but it describes the un-reduced data (where those are different, e.g. arrays).

Not sure what to do about this, but I want to remember to think about it.
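
As a point of reference, a hypothetical sketch (not part of the PR) of reading the long-narrow run_variables table above back into the familiar wide, run-per-row form; the database path is assumed.

import sqlite3
import pandas as pd

con = sqlite3.connect("runs.sqlite")  # hypothetical path
narrow = pd.read_sql_query(
    "SELECT proposal, run, name, value FROM run_variables", con)
# One row per (proposal, run), one column per variable name:
wide = narrow.pivot_table(index=["proposal", "run"], columns="name",
                          values="value", aggfunc="first")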

except Exception:
    pass
except Exception as e:
    log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}")

Suggested change
- log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}")
+ log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}'", exc_info=True)

This should give us a traceback in the logging.
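
A self-contained illustration of the difference (names made up): with exc_info=True the logging module appends the full traceback after the message, whereas interpolating {e} keeps only the exception text.

import logging

logging.basicConfig()
log = logging.getLogger(__name__)

try:
    1 / 0
except Exception:
    # Prints the warning followed by the ZeroDivisionError traceback:
    log.warning("Couldn't retrieve data for 'x', 'y'", exc_info=True)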

@JamesWrigley
Member Author

Things (for me 😇) to do before merging:

  • Test the migrations
  • Forbid `amore-proto` from opening unmigrated databases
  • Add a `summary` column to store which summary method was used
  • Extend the variables table to store context file variables too, which should partially solve "Support a 'database refresh' from the context file" (#74)

@takluyver, @CammilleCC, does that cover everything? Is there anything else we need from the database?

@takluyver
Member

> Extend the variables table to store context file variables too

I'd say we should do that change separately after we've landed this PR. This is already a pretty massive set of changes, and I think adding computed variables into the database is going to be another substantial piece of work. Maybe this is just my personal preference, but I don't find it easy working in this model of giant feature branches, so I'd really like to get the changes we've already done into master.

> Add a summary column to store which summary method was used

Does this belong in the run_variables table, or in variables? More concretely, do you have an idea how we'll use that information?

@JamesWrigley
Member Author

> I'd say we should do that change separately after we've landed this PR. This is already a pretty massive set of changes, and I think adding computed variables into the database is going to be another substantial piece of work. Maybe this is just my personal preference, but I don't find it easy working in this model of giant feature branches, so I'd really like to get the changes we've already done into master.

Ok, I'll leave it for a later PR. But I think it should be part of v1, so I'll hold off on writing the intermediate migrations until then (boy, that's going to be fun 😨).

> Does this belong in the run_variables table, or in variables? More concretely, do you have an idea how we'll use that information?

This should go in run_variables since it may change, and we should display it in the GUI (maybe as a tooltip) so that users know how the summary was calculated.
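
In schema terms that would be something like the following sketch; the column name summary_method is an assumption, the discussion only says "a summary column".

import sqlite3

con = sqlite3.connect("runs.sqlite")  # hypothetical path
con.execute("ALTER TABLE run_variables ADD COLUMN summary_method")
con.commit()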

@takluyver
Member

> But I think it should be part of v1, so I'll hold off on writing the intermediate migrations until then (boy, that's going to be fun 😨).

If more databases are created before we merge that extra change, which seems quite possible to me, I'd say we should call the version after that v2 (or v1.1, if we can make it compatible enough). I think the migrations will be easier to deal with if we've got an ID for each version 'in the wild', rather than having to special-case the databases which are in between v0 and v1.

I dislike that we've already got some databases labelled v1 that don't match what we've ended up calling v1, but 🤷 adding a version number for the first time is inevitably a bit messy. But now that we've got a version number, let's use it! 🙂

@CammilleCC
Member

Thank you very much for the improvements, both! Looking great so far; I'll test further in the coming weeks.

A wishlist item from the frontend: could you please add another view for the latest variable timestamps? Ideally we'd like to check when a variable was last updated, and while one could query run_variables for the latest reprocess timestamp, it might be worthwhile to add another view instead.
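
A hypothetical sketch of such a view (the view name and MAX(timestamp) grouping are assumptions, not part of the PR):

import sqlite3

con = sqlite3.connect("runs.sqlite")  # hypothetical path
con.execute("""
    CREATE VIEW IF NOT EXISTS latest_variable_timestamps AS
    SELECT name, MAX(timestamp) AS latest_timestamp
    FROM run_variables
    GROUP BY name
""")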

@JamesWrigley
Member Author

RIP my unrealistic deadlines 🪦

New checklist for me:

  • Test the migrations
  • Forbid `amore-proto` from opening unmigrated databases
  • Add a `summary` column to store which summary method was used
  • A view for the variable timestamps

JamesWrigley and others added 15 commits January 15, 2024 13:24
Previously only the scaled down 2D arrays were saved as summaries and the RGBA
thumbnails were created on-the-fly by the GUI. But that's really slow. With this
change, opening a database with a single image variable and 400 runs went down
to ~6.5s from ~12s.
Now that the GUI expects full RGBA thumbnails as image summaries, we have to
provide a way for existing databases to upgrade. I considered supporting both
kinds of summaries in the GUI, but decided that in the long run it'd be simpler
if we just upgraded databases instead of going for full backwards-compatibility
in the GUI.
It will be treated automatically as an RGBA image by the backend and GUI.
Also made `DamnitDB.ensure_run()` allow a default timestamp of the current
time.
Also ended up writing a simple API.
@JamesWrigley JamesWrigley merged commit 76af052 into master Jan 15, 2024
3 checks passed
@JamesWrigley JamesWrigley deleted the database-v1 branch January 15, 2024 12:31