
Reprocess from gui two #66

Closed
wants to merge 2 commits into from

Conversation

@matheuscteo (Member) commented Jul 12, 2023

A copy of my old branch with a few alterations. Enables reprocessing from the GUI by right-clicking a given row (or a selection of rows) in the run table. A color scheme indicates the reprocessing status: a green row background means the run is being reprocessed, and blue means the run is queued for reprocessing.

The runs to be reprocessed are added to a queue and reprocessed in a separate thread, one by one, in order. If reprocessing fails for one or more runs, a message will appear in the status bar advising the user to read the log files.
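
For reference, a minimal sketch of the queue-plus-worker-thread pattern described above, using plain queue.Queue and threading.Thread. run_extraction and the status callback are stand-ins for the actual calls in this branch, not code from it:

import queue
import threading

def run_extraction(run_no):
    """Stand-in for the real reprocessing call (e.g. spawning the extraction
    backend as a subprocess); assumed to raise on failure."""
    print(f"reprocessing run {run_no}")

reprocess_queue = queue.Queue()

def worker(on_status):
    """Process queued runs one by one, in order, reporting status to the GUI."""
    while True:
        run_no = reprocess_queue.get()    # queued runs are shown in blue
        on_status(run_no, "running")      # the row turns green
        try:
            run_extraction(run_no)
            on_status(run_no, "done")
        except Exception:
            on_status(run_no, "failed")   # status bar: "check the log files"
        finally:
            reprocess_queue.task_done()

# Daemon thread: note that closing the GUI does not stop a reprocessing job
# that is already underway.
threading.Thread(target=worker, args=(print,), daemon=True).start()

for run_no in (100, 101, 102):            # the right-clicked selection of rows
    reprocess_queue.put(run_no)
reprocess_queue.join()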

Support for SLURM jobs is still not available, so the color scheme might be misleading if SLURM jobs are required for some variables. However, for the HED use case SLURM jobs are not so relevant (an alternative would be to merge this into the HED branch, but I'm not sure which one that would be).

Upon closing the GUI, if a run is in the middle of being reprocessed, the reprocessing will continue. I'm not sure how to overcome this, nor whether there's a use case in which the user would actually close the GUI during the experiment. I'm open to suggestions in this regard.

@JamesWrigley (Member) left a comment


I'm not sure this is quite the right design for reprocessing. A lot of the monitoring/Extractor() logic in the Reprocessor class is already implemented in the listener, and it would be good if we could reuse that code. I also would prefer if the GUI didn't launch jobs independently because you run the risk of file corruption if two GUIs reprocess the same runs and write to the same files simultaneously.

Slurm support is also really important; instruments like MID, SCS, and SPB do a lot of heavy processing in slurm jobs and those are usually the ones that take a long time and really do need to be monitored.

So concretely, I would suggest:

  • If it's really important for HED, then merge it to the HED branch.
  • Extend the monitoring code in the listener to also track slurm jobs (nerd-sniping @takluyver because I think he's done this before for calibration).
  • Make the GUI get the reprocessing status from the listener.

For that last point, my first idea was to add a REST API to the listener with the expectation that the web frontend could use it later, but the downside is that if the listener is restarted then you lose the status of all the jobs. So now I think the best way might be to write that information to the database; that way the GUI has easy access to it, and the jobs (particularly slurm jobs) can be tracked across listener restarts.
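
As an illustration of that last point, a rough sketch of what the listener could write and the GUI could read, assuming a simple SQLite table; the table name, columns, and file path here are made up, not an actual schema:

import sqlite3

db = sqlite3.connect("runs.sqlite")          # assumed path, for illustration only
db.execute("""
    CREATE TABLE IF NOT EXISTS reprocessing_jobs (
        proposal   INTEGER,
        run        INTEGER,
        slurm_id   INTEGER,      -- NULL for jobs run directly by the listener
        status     TEXT,         -- e.g. QUEUED / RUNNING / COMPLETED / FAILED
        updated_at REAL,
        PRIMARY KEY (proposal, run)
    )
""")

# Listener side: record status changes as jobs are submitted and polled.
db.execute(
    "INSERT OR REPLACE INTO reprocessing_jobs VALUES (?, ?, ?, ?, strftime('%s','now'))",
    (2712, 100, 123456, "RUNNING"),
)
db.commit()

# GUI side: read the status back to colour the run-table rows.
for row in db.execute("SELECT run, status FROM reprocessing_jobs WHERE proposal = ?", (2712,)):
    print(row)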

@JamesWrigley added the enhancement and GUI labels on Jul 12, 2023
@takluyver (Member) commented

Extend the monitoring code in the listener to also track slurm jobs (nerd-sniping @takluyver because I think he's done this before for calibration).

Consider me sniped 🤓 .

We do do this for calibration, but it's kind of a pain, and we're still trying to figure out how best to do it. The main challenge is requeuing. In theory, jobs from any state can be requeued, but one big thing we care about is pre-emption: jobs being cancelled to make space for higher priority jobs, after which the cancelled job is automatically requeued. So something is running, then it's not running, but it's not yet done.

There are two approaches:

  1. Periodically use squeue (and maybe also sacct) to check on the status of your jobs and see what's changed since last run. This is what we do for offline calibration, with a hardcoded list of states which we take to mean a job is probably finished:
STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL',  'CANCELLED', 'COMPLETED',  'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}

By default squeue only shows jobs in some of these states, so we're now using --states=all to hopefully see all the recent ones, but we're still seeing how this shakes out. Doing this too frequently is discouraged, so we call squeue once per minute. Jobs are removed from squeue a couple of minutes after Slurm thinks they're finished; sacct retains some information for longer. (A minimal sketch of this polling approach follows the list.)

  2. Send updates from inside the job when it's finishing. This gets you the updated status more quickly. Cancellation shows up as SIGTERM, then you have 30 seconds to do any clean-up and send out a message before the job is killed. But AFAIK the cancelled job doesn't know whether it's going to be automatically requeued or not, so you still need to do something with squeue to figure out whether it failed or whether you just need to wait longer.
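
A minimal sketch of the polling approach in point 1. The squeue options are real, but the bookkeeping around them is illustrative, not the offline-calibration code:

import subprocess
import time

STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}

def poll_jobs(job_ids):
    """Return {job_id: state} for the given jobs, including recently finished ones."""
    out = subprocess.run(
        ["squeue", "--states=all", "--noheader",
         "--format=%i %T", "--jobs=" + ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    # Jobs that have already dropped out of squeue entirely would need an
    # sacct lookup instead.
    return dict(line.split() for line in out.splitlines())

tracked = {"123456", "123457"}               # hypothetical job IDs
while tracked:
    for job_id, state in poll_jobs(tracked).items():
        if state in STATES_FINISHED:
            print(f"job {job_id} finished with state {state}")
            tracked.discard(job_id)
    time.sleep(60)                           # don't hammer squeue more than ~once a minute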

@JamesWrigley (Member) commented

Ufff, I just realized I missed a critical point from my earlier comment: how to trigger reprocessing from the GUI in the first place 🙃

This is something that I think could be a REST endpoint in the listener, then the PyQt GUI can make an HTTP request to trigger a job as well as the web frontend in the future. @CammilleCC, would this work with your design?
For a little bit of security we can protect the endpoint by choosing a random port and token to be sent with each request, like we (kinda) already do with supervisord. It won't be truly secure unless we also use TLS certificates, but I don't think that's necessary while we're running everything inside the DESY network.
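
A rough sketch of that kind of protection with the standard library, purely illustrative; the real listener endpoint (if we go this way) would look different:

import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

# Generated at startup and shared with the GUI out-of-band
# (e.g. via the config, similar to how supervisord is handled).
TOKEN = secrets.token_urlsafe(32)

class ReprocessHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("Authorization") != f"Bearer {TOKEN}":
            self.send_error(403)
            return
        # ... queue the requested run for reprocessing here ...
        self.send_response(202)
        self.end_headers()

server = HTTPServer(("0.0.0.0", 0), ReprocessHandler)   # port 0: let the OS pick a random port
print(f"listening on port {server.server_port} with token {TOKEN}")
server.serve_forever()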

@JamesWrigley (Member) commented

We do do this for calibration, but it's kind of a pain, and we're still trying to figure out how best to do it. The main challenge is requeuing. In theory, jobs from any state can be requeued, but one big thing we care about is pre-emption: jobs being cancelled to make space for higher priority jobs, after which the cancelled job is automatically requeued. So something is running, then it's not running, but it's not yet done.

Ah man, that is way more complicated than I thought 🥲 Have you ever tried using the REST API? https://slurm.schedmd.com/rest_api.html#slurmdbV0039GetJob
Though unfortunately the docs say that the server is completely stateless and recommend setting up a caching proxy, so if Maxwell hasn't done that then we'd probably have the same rate-limiting issue. Anyway, I'd say let's stick with polling for now. Updating every minute isn't so bad since most slurm jobs take many minutes, and if people get confused we can implement updating from inside the job later.

@JamesWrigley (Member) commented

This is something that I think could be a REST endpoint in the listener, then the PyQt GUI can make an HTTP request to trigger a job as well as the web frontend in the future. @CammilleCC, would this work with your design?

Discussed with Cammille, and it turns out that since the webserver is currently separate from the listener we can stick to Kafka. That should simplify things a lot: the listener can listen on a Kafka topic for reprocessing requests (which should include the database ID) as well as on the migration/calibration topics, and the GUI can send a Kafka message when a run is requested to be reprocessed.
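
For illustration, roughly how that could look with kafka-python; the broker address, topic name, and message fields below are made up, not the ones the listener actually uses:

import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "broker.example.desy.de:9091"   # hypothetical broker address
TOPIC = "damnit-reprocess"               # hypothetical topic name

# GUI side: request reprocessing of a run, including the database ID.
producer = KafkaProducer(
    bootstrap_servers=[BROKER],
    value_serializer=lambda msg: json.dumps(msg).encode(),
)
producer.send(TOPIC, {"proposal": 2712, "run": 100, "db_id": 42})
producer.flush()

# Listener side: consume reprocessing requests alongside the migration/calibration topics.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=[BROKER],
    value_deserializer=lambda raw: json.loads(raw.decode()),
)
for message in consumer:
    request = message.value
    print(f"reprocessing run {request['run']} of proposal {request['proposal']}")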

@takluyver (Member) commented

Ah man, that is way more complicated than I thought 🥲 Have you ever tried using the REST API?

As far as I've found, the Slurm REST API contains the same info that you can get from commands like squeue, just sent over HTTP (and presumably requiring authentication). If that's true, it doesn't really make any difference to the challenges I described.

A lot of the time, monitoring that updates once a minute or so is 'good enough', so we can live with polling. After all, what the jobs do usually takes at least a few minutes. I think the way we're doing this for offline calibration is working OK for now - at least, I don't remember any more issues since we started using --states=all.

@takluyver (Member) commented

I'm going to close this PR now, as it was superseded by #285.

@takluyver closed this on Jul 31, 2024