
Reprocess from gui two #66

Closed
wants to merge 2 commits into from

Conversation

@matheuscteo (Member) commented Jul 12, 2023

A copy of my old branch with a few alterations. Enables reprocessing from the GUI by right-clicking a given row (or a selection of rows) in the run table. A color scheme indicates the reprocessing status: a green row background means the run is being reprocessed, and blue means the run is queued for reprocessing.

The runs to be reprocessed are added to a queue and reprocessed in a separate thread, one by one, in order. If reprocessing fails for one or more runs, a message will appear in the status bar advising the user to read the log files.
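
For reference, a minimal sketch of the queue-plus-worker-thread pattern described above, using plain queue.Queue and threading.Thread. run_extraction and the status callback are stand-ins for the actual calls in this branch, not code from it:

import queue
import threading

def run_extraction(run_no):
    """Stand-in for the real reprocessing call (e.g. spawning the extraction
    backend as a subprocess); assumed to raise on failure."""
    print(f"reprocessing run {run_no}")

reprocess_queue = queue.Queue()

def worker(on_status):
    """Process queued runs one by one, in order, reporting status to the GUI."""
    while True:
        run_no = reprocess_queue.get()    # queued runs are shown in blue
        on_status(run_no, "running")      # the row turns green
        try:
            run_extraction(run_no)
            on_status(run_no, "done")
        except Exception:
            on_status(run_no, "failed")   # status bar: "check the log files"
        finally:
            reprocess_queue.task_done()

# Daemon thread: note that closing the GUI does not stop a reprocessing job
# that is already underway.
threading.Thread(target=worker, args=(print,), daemon=True).start()

for run_no in (100, 101, 102):            # the right-clicked selection of rows
    reprocess_queue.put(run_no)
reprocess_queue.join()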

Support for SLURM jobs is still not available, so the color scheme might be misleading if SLURM jobs are required for some variables. However, for the HED use case SLURM jobs are not so relevant (an alternative would be to merge this into the HED branch, but I'm not sure which one that would be).

Upon closing the GUI, if a run is in the middle of being reprocessed, the reprocessing will continue. I'm not sure how to overcome this, nor whether there's a use case in which the user would actually close the GUI during the experiment. I'm open to suggestions in this regard.

@JamesWrigley (Member) left a comment


I'm not sure this is quite the right design for reprocessing. A lot of the monitoring/Extractor() logic in the Reprocessor class is already implemented in the listener, and it would be good if we could reuse that code. I also would prefer if the GUI didn't launch jobs independently because you run the risk of file corruption if two GUIs reprocess the same runs and write to the same files simultaneously.

Slurm support is also really important; instruments like MID, SCS, and SPB do a lot of heavy processing in slurm jobs and those are usually the ones that take a long time and really do need to be monitored.

So concretely, I would suggest:

  • If it's really important for HED, then merge it to the HED branch.
  • Extend the monitoring code in the listener to also track slurm jobs (nerd-sniping @takluyver because I think he's done this before for calibration).
  • Make the GUI get the reprocessing status from the listener.

For that last point, my first idea was to add a REST API to the listener with the expectation that the web frontend could use it later, but the downside is that if the listener is restarted then you lose the status of all the jobs. So now I think the best way might be to write that information to the database; that way the GUI has easy access to it, and the jobs (particularly slurm jobs) can be tracked across listener restarts.
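
As an illustration of that last point, a rough sketch of what the listener could write and the GUI could read, assuming a simple SQLite table; the table name, columns, and file path here are made up, not an actual schema:

import sqlite3

db = sqlite3.connect("runs.sqlite")          # assumed path, for illustration only
db.execute("""
    CREATE TABLE IF NOT EXISTS reprocessing_jobs (
        proposal   INTEGER,
        run        INTEGER,
        slurm_id   INTEGER,      -- NULL for jobs run directly by the listener
        status     TEXT,         -- e.g. QUEUED / RUNNING / COMPLETED / FAILED
        updated_at REAL,
        PRIMARY KEY (proposal, run)
    )
""")

# Listener side: record status changes as jobs are submitted and polled.
db.execute(
    "INSERT OR REPLACE INTO reprocessing_jobs VALUES (?, ?, ?, ?, strftime('%s','now'))",
    (2712, 100, 123456, "RUNNING"),
)
db.commit()

# GUI side: read the status back to colour the run-table rows.
for row in db.execute("SELECT run, status FROM reprocessing_jobs WHERE proposal = ?", (2712,)):
    print(row)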

@JamesWrigley added the enhancement and GUI labels on Jul 12, 2023
@takluyver (Member) commented

Extend the monitoring code in the listener to also track slurm jobs (nerd-sniping @takluyver because I think he's done this before for calibration).

Consider me sniped 🤓 .

We do do this for calibration, but it's kind of a pain, and we're still trying to figure out how best to do it. The main challenge is requeuing. In theory, jobs from any state can be requeued, but one big thing we care about is pre-emption: jobs being cancelled to make space for higher priority jobs, after which the cancelled job is automatically requeued. So something is running, then it's not running, but it's not yet done.

There are two approaches:

  1. Periodically use squeue (and maybe also sacct) to check on the status of your jobs and see what's changed since last run. This is what we do for offline calibration, with a hardcoded list of states which we take to mean a job is probably finished:
STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL',  'CANCELLED', 'COMPLETED',  'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}

By default squeue only shows jobs in some of these states, so we're now using --states=all to hopefully see all the recent ones, but we're still seeing how this shakes out. Doing this too frequently is discouraged, so we call squeue once per minute. Jobs are removed from squeue a couple of minutes after Slurm thinks they're finished; sacct retains some information for longer. (A minimal sketch of this polling approach follows the list.)

  2. Send updates from inside the job when it's finishing. This gets you the updated status more quickly. Cancellation shows up as SIGTERM, then you have 30 seconds to do any clean-up and send out a message before the job is killed. But AFAIK the cancelled job doesn't know whether it's going to be automatically requeued or not, so you still need to do something with squeue to figure out whether it failed or whether you just need to wait longer.
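
A minimal sketch of the polling approach in point 1. The squeue options are real, but the bookkeeping around them is illustrative, not the offline-calibration code:

import subprocess
import time

STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}

def poll_jobs(job_ids):
    """Return {job_id: state} for the given jobs, including recently finished ones."""
    out = subprocess.run(
        ["squeue", "--states=all", "--noheader",
         "--format=%i %T", "--jobs=" + ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    # Jobs that have already dropped out of squeue entirely would need an
    # sacct lookup instead.
    return dict(line.split() for line in out.splitlines())

tracked = {"123456", "123457"}               # hypothetical job IDs
while tracked:
    for job_id, state in poll_jobs(tracked).items():
        if state in STATES_FINISHED:
            print(f"job {job_id} finished with state {state}")
            tracked.discard(job_id)
    time.sleep(60)                           # don't hammer squeue more than ~once a minute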

@JamesWrigley (Member) commented

Ufff, I just realized I missed a critical point from my earlier comment: how to trigger reprocessing from the GUI in the first place 🙃

This is something that I think could be a REST endpoint in the listener, then the PyQt GUI can make an HTTP request to trigger a job as well as the web frontend in the future. @CammilleCC, would this work with your design?
For a little bit of security we can protect the endpoint by choosing a random port and token to be sent with each request, like we (kinda) already do with supervisord. It won't be truly secure unless we also use TLS certificates, but I don't think that's necessary while we're running everything inside the DESY network.
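
A rough sketch of that kind of protection with the standard library, purely illustrative; the real listener endpoint (if we go this way) would look different:

import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

# Generated at startup and shared with the GUI out-of-band
# (e.g. via the config, similar to how supervisord is handled).
TOKEN = secrets.token_urlsafe(32)

class ReprocessHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("Authorization") != f"Bearer {TOKEN}":
            self.send_error(403)
            return
        # ... queue the requested run for reprocessing here ...
        self.send_response(202)
        self.end_headers()

server = HTTPServer(("0.0.0.0", 0), ReprocessHandler)   # port 0: let the OS pick a random port
print(f"listening on port {server.server_port} with token {TOKEN}")
server.serve_forever()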

@JamesWrigley (Member) commented

We do do this for calibration, but it's kind of a pain, and we're still trying to figure out how best to do it. The main challenge is requeuing. In theory, jobs from any state can be requeued, but one big thing we care about is pre-emption: jobs being cancelled to make space for higher priority jobs, after which the cancelled job is automatically requeued. So something is running, then it's not running, but it's not yet done.

Ah man, that is way more complicated than I thought 🥲 Have you ever tried using the REST API? https://slurm.schedmd.com/rest_api.html#slurmdbV0039GetJob
Though unfortunately the docs say that the server is completely stateless and recommend setting up a caching proxy, so if Maxwell hasn't done that then we'd probably have the same rate-limiting issue. Anyway, I'd say let's stick with polling for now. Updating every minute isn't so bad since most slurm jobs take many minutes, and if people get confused we can implement updating from inside the job later.

@JamesWrigley (Member) commented

This is something that I think could be a REST endpoint in the listener, then the PyQt GUI can make an HTTP request to trigger a job as well as the web frontend in the future. @CammilleCC, would this work with your design?

Discussed with Cammille, and it turns out that since the webserver is currently separate from the listener we can stick to Kafka. That should simplify things a lot: the listener can listen on a Kafka topic for reprocessing requests (which should include the database ID) as well as on the migration/calibration topics, and the GUI can send a Kafka message when a run is requested to be reprocessed.
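
For illustration, roughly how that could look with kafka-python; the broker address, topic name, and message fields below are made up, not the ones the listener actually uses:

import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "broker.example.desy.de:9091"   # hypothetical broker address
TOPIC = "damnit-reprocess"               # hypothetical topic name

# GUI side: request reprocessing of a run, including the database ID.
producer = KafkaProducer(
    bootstrap_servers=[BROKER],
    value_serializer=lambda msg: json.dumps(msg).encode(),
)
producer.send(TOPIC, {"proposal": 2712, "run": 100, "db_id": 42})
producer.flush()

# Listener side: consume reprocessing requests alongside the migration/calibration topics.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=[BROKER],
    value_deserializer=lambda raw: json.loads(raw.decode()),
)
for message in consumer:
    request = message.value
    print(f"reprocessing run {request['run']} of proposal {request['proposal']}")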

@takluyver (Member) commented

Ah man, that is way more complicated than I thought 🥲 Have you ever tried using the REST API?

As far as I've found, the Slurm REST API contains the same info that you can get from commands like squeue, just sent over HTTP (and presumably requiring authentication). If that's true, it doesn't really make any difference to the challenges I described.

A lot of the time, monitoring that updates once a minute or so is 'good enough', so we can live with polling. After all, what the jobs do usually takes at least a few minutes. I think the way we're doing this for offline calibration is working OK for now - at least, I don't remember any more issues since we started using --states=all.

@takluyver (Member) commented

I'm going to close this PR now, as it was superseded by #285.

@takluyver closed this on Jul 31, 2024