Database v1 #134
Conversation
Thumbnail storage: I believe we only use the 'summary' data in the DB for drawing the thumbnail in the table? I.e. double-clicking retrieves the un-summarised image from the HDF5 file to plot. That being the case, I think it would make sense to scale down the image before storing it, for faster loading. This would mean we'd have to regenerate thumbnails if we wanted to display them larger, but that seems like an acceptable compromise, as we can just display the existing thumbnail until that's done.

Perhaps this is aesthetic, but I'd also like to get away from storing pickled data, and thumbnails seem like an easy place to start. I believe that no valid pickle can start with the magic numbers of a PNG file, for instance, so we could store PNGs in the database with no ambiguity, and QPixmap can parse PNGs directly. Or NumPy's…

I'll continue looking through this tomorrow. 🙂
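For illustration, here's a minimal sketch of that PNG idea, assuming a PyQt GUI and blobs stored in a single column; `load_thumbnail` and the pickle fallback are hypothetical, not existing DAMNIT code:

```python
import pickle

from PyQt5.QtGui import QPixmap

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte PNG signature

def load_thumbnail(blob: bytes):
    """Return a QPixmap for PNG blobs, falling back to unpickling legacy data."""
    if blob.startswith(PNG_MAGIC):
        pixmap = QPixmap()
        pixmap.loadFromData(blob, "PNG")  # QPixmap can parse PNG bytes directly
        return pixmap
    # Legacy path: assume the blob is pickled data from an older database
    return pickle.loads(blob)
```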
Yep, we already do that.
I hadn't seen that, but even so: we scale it down to 150 pixels (max dimension) to store, but then we scale it down again to 35 pixels to display (lines 323 to 324 in 4319e98).
damnit/backend/db.py (outdated):

```sql
CREATE TABLE IF NOT EXISTS run_info(proposal, run, start_time, added_at);
CREATE UNIQUE INDEX IF NOT EXISTS proposal_run ON run_info (proposal, run);

CREATE TABLE IF NOT EXISTS run_variables(proposal, run, name, version, value, timestamp, stored_type, max_diff, provenance);
```
The `stored_type` column seems somewhat confusing to me, because we're storing it with the reduced data, but it describes the un-reduced data (where those are different, e.g. arrays). Not sure what to do about this, but I want to remember to think about it.
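To make the long-narrow layout concrete, here's a rough sketch of how one reduced value might be written into `run_variables`; the function name, the `version`/`provenance` values, and the column semantics are my assumptions, not the actual `DamnitDB` API:

```python
import sqlite3
import time

def set_variable(conn: sqlite3.Connection, proposal: int, run: int, name: str,
                 value, stored_type: str, max_diff: float | None = None):
    # One row per (proposal, run, variable) holding the reduced value plus
    # metadata about the full data it summarises (e.g. stored_type).
    conn.execute(
        "INSERT INTO run_variables "
        "(proposal, run, name, version, value, timestamp, stored_type, max_diff, provenance) "
        "VALUES (?, ?, ?, 1, ?, ?, ?, ?, 'context_file')",
        (proposal, run, name, value, time.time(), stored_type, max_diff),
    )
    conn.commit()
```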
```diff
-        except Exception:
-            pass
+        except Exception as e:
+            log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}")
```
log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}': {e}") | |
log.warning(f"Couldn't retrieve data for '{xlabel}', '{ylabel}'", exc_info=True) |
This should give us a traceback in the logs.
Things (for me 😇) to do before merging:
@takluyver, @CammilleCC, does that cover everything? Is there anything else we need from the database?
I'd say we should do that change separately after we've landed this PR. This is already a pretty massive set of changes, and I think adding computed variables into the database is going to be another substantial piece of work. Maybe this is just my personal preference, but I don't find it easy working in this model of giant feature branches, so I'd really like to get the changes we've already done into master.
Does this belong in the …
Ok, I'll leave it for a later PR. But I think it should be part of v1, so I'll hold off on writing the intermediate migrations until then (boy, that's going to be fun 😨).
This should go in …
If more databases are created before we merge that extra change, which seems quite possible to me, I'd say we should call the version after that v2 (or v1.1, if we can make it compatible enough). I think the migrations will be easier to deal with if we've got an ID for each version 'in the wild', rather than having to special-case the databases which are in between v0 and v1.

I dislike that we've already got some databases labelled v1 that don't match what we've ended up calling v1, but 🤷 adding a version number for the first time is inevitably a bit messy. Now that we've got a version number, let's use it! 🙂
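For example, with an ID for each schema 'in the wild', migrations could be dispatched step by step. This is only a sketch assuming the version lives in SQLite's `user_version` pragma; the migration functions here are stubs, not the real DAMNIT migration code:

```python
import sqlite3

def migrate_v0_to_v1(conn: sqlite3.Connection):
    ...  # hypothetical: rewrite the old wide schema into run_variables

def migrate_v1_to_v2(conn: sqlite3.Connection):
    ...  # hypothetical: whatever the post-v1 change ends up being

def migrate(conn: sqlite3.Connection):
    # Apply one migration per known version until we reach the latest schema.
    steps = {0: migrate_v0_to_v1, 1: migrate_v1_to_v2}
    while (version := conn.execute("PRAGMA user_version").fetchone()[0]) in steps:
        steps[version](conn)
        conn.execute(f"PRAGMA user_version = {version + 1}")
        conn.commit()
```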
Thank you very much for the improvements, both! Looking great so far; I'll test it further over the following weeks. A wishlist item from the frontend: could you please add another view for the latest variable timestamps? We'd ideally like to check when a variable was last updated, and while one could query the …
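If this ends up as an SQL view, one way it could look, assuming the `run_variables` schema above (the view name is made up):

```python
import sqlite3

conn = sqlite3.connect("runs.sqlite")  # hypothetical database path
conn.executescript("""
    CREATE VIEW IF NOT EXISTS latest_variable_timestamps AS
    SELECT proposal, name, MAX(timestamp) AS last_updated
    FROM run_variables
    GROUP BY proposal, name;
""")
```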
RIP my unrealistic deadlines 🪦 New checklist for me:
Previously, only the scaled-down 2D arrays were saved as summaries and the RGBA thumbnails were created on the fly by the GUI. But that's really slow. With this change, opening a database with a single image variable and 400 runs went down from ~12s to ~6.5s.
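Roughly, the backend-side thumbnail generation could look like this; the size, colormap, and downsampling method are assumptions for illustration, not the code in this PR:

```python
import numpy as np
from matplotlib import colormaps

def make_thumbnail(image: np.ndarray, max_dim: int = 35) -> np.ndarray:
    """Downsample a 2D array and colour-map it into an RGBA uint8 thumbnail."""
    # Crude stride-based downsampling; proper resampling would look nicer.
    step = max(1, int(np.ceil(max(image.shape) / max_dim)))
    small = image[::step, ::step].astype(float)
    # Normalise to [0, 1], then apply a colormap to get RGBA values.
    span = np.nanmax(small) - np.nanmin(small)
    small = (small - np.nanmin(small)) / (span if span else 1)
    return (colormaps["viridis"](small) * 255).astype(np.uint8)
```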
Now that the GUI expects full RGBA thumbnails as image summaries, we have to provide a way for existing databases to upgrade. I considered supporting both kinds of summaries in the GUI, but decided that in the long run it'd be simpler if we just upgraded databases instead of going for full backwards-compatibility in the GUI.
It will be treated automatically as an RGBA image by the backend and GUI.
Also made `DamnitDB.ensure_run()` use the current time as the default timestamp.
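A sketch of that behaviour (the signature and SQL are assumptions, not the real `DamnitDB.ensure_run()`):

```python
import sqlite3
import time

def ensure_run(conn: sqlite3.Connection, proposal: int, run: int,
               added_at: float | None = None):
    # Default the timestamp to "now" when the caller doesn't pass one.
    if added_at is None:
        added_at = time.time()
    conn.execute(
        "INSERT OR IGNORE INTO run_info (proposal, run, added_at) VALUES (?, ?, ?)",
        (proposal, run, added_at),
    )
    conn.commit()
```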
The goals of v1 are to:

- Get the backend to generate thumbnails and store them as summaries, instead of relying on the GUI to do it
- Add support for storing variable metadata (like the type and whether it's changing)
- Replace the HDF5 data format for DataArray's with netCDF
- Add support for Dataset's with netCDF

This commit only updates the database schema; supporting netCDF comes in the next commit.

Major changes:

- Replace the old schema with a long-narrow schema. Now we store additional metadata about variables, like the stored type.
- Add ctxrunner.DataType to represent the various types we support
- Add DamnitDB.ReducedData to represent summary data and its metadata
- Add helper functions to DamnitDB for things like setting a variable and getting a list of variables
- Forbid None values being passed to add_to_db()
- Add a tests/helpers submodule to hold utility functions for tests
- Use `amore-proto reprocess --mock` to generate mock data in tests
Also ended up writing a simple API.
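To give a feel for what that metadata looks like, here's a loose sketch of the kind of structures described above; the members are illustrative guesses, not the exact `ctxrunner.DataType` / `DamnitDB.ReducedData` definitions:

```python
from dataclasses import dataclass
from enum import Enum

class DataType(Enum):
    NUMBER = "number"
    STRING = "string"
    IMAGE = "image"
    DATASET = "dataset"

@dataclass
class ReducedData:
    value: object                   # the summary stored in run_variables.value
    stored_type: DataType           # type of the full (un-reduced) data on disk
    max_diff: float | None = None   # how much the value varies, if numeric
```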
These are all the changes that make up what I'm obnoxiously going to call v1 of the DAMNIT format 😛
TL;DR:
- `xr.Dataset`'s and `xr.DataArray`'s (fae05a2, 77e33c1, 66f067b, 94c9ea5)

As usual, all the commits are atomic, so I would recommend reviewing each individually. Here's the commit message from fae05a2 for some of the gnarly details:
Things that should be done before this is merged:
Things that should be done at some point but aren't strictly necessary before merging:
- `amore-proto migrate images` and `amore-proto migrate v0-to-v1`
- `amore-proto migrate v0-to-v1` … `DamnitDB`, but that does make the code messy. It might be better to soft-deprecate old `DamnitDB` methods by renaming them and calling them from the migrations.
- … a `summary` column to store the summary used for each variable. That was a major oversight on my part...

I won't be able to work on this for the next two weeks, so if someone else wants to take over, feel free.