Reprocess from gui two #66
Conversation
A copy of my old branch with a few alterations. Enables reprocessing from the GUI by right-clicking a given row (or selection of rows) on the run table. A color scheme indicates the reprocessing status. Support for SLURM jobs is still not available, so the color scheme might be misleading if SLURM jobs are required for some variables.
I'm not sure this is quite the right design for reprocessing. A lot of the monitoring/`Extractor()` logic in the `Reprocessor` class is already implemented in the listener, and it would be good if we could reuse that code. I also would prefer if the GUI didn't launch jobs independently, because you run the risk of file corruption if two GUIs reprocess the same runs and write to the same files simultaneously.
Slurm support is also really important; instruments like MID, SCS, and SPB do a lot of heavy processing in slurm jobs and those are usually the ones that take a long time and really do need to be monitored.
So concretely, I would suggest:
- If it's really important for HED, then merge it to the HED branch.
- Extend the monitoring code in the listener to also track slurm jobs (nerd-sniping @takluyver because I think he's done this before for calibration).
- Make the GUI get the reprocessing status from the listener.
For that last point, my first idea was to add a REST API to the listener with the expectation that the web frontend can use it later, but the downside is that if the listener is restarted then you lose the status of all the jobs. So now I think the best way might be to write that information to the database, that way the GUI has easy access to it and the jobs (particularly slurm jobs) can be tracked across listener restarts.
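For illustration, a minimal sketch of what writing that information to the database could look like, assuming SQLite and a hypothetical `reprocessing_jobs` table (all names here are placeholders, not an actual schema):

```python
import sqlite3
import time

# Hypothetical table for tracking reprocessing jobs across listener restarts.
# Every name below is a placeholder, not the project's actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reprocessing_jobs (
    proposal   INTEGER NOT NULL,
    run        INTEGER NOT NULL,
    slurm_job  INTEGER,             -- NULL for work done directly by the listener
    status     TEXT NOT NULL,       -- e.g. QUEUED / RUNNING / FINISHED / FAILED
    updated_at REAL NOT NULL,
    PRIMARY KEY (proposal, run)
);
"""

def set_status(db: sqlite3.Connection, proposal: int, run: int,
               status: str, slurm_job=None):
    # The listener would call this whenever a job changes state.
    db.execute(
        "INSERT OR REPLACE INTO reprocessing_jobs VALUES (?, ?, ?, ?, ?)",
        (proposal, run, slurm_job, status, time.time()),
    )
    db.commit()

def get_statuses(db: sqlite3.Connection, proposal: int) -> dict:
    # The GUI would poll this to colour the rows of the run table.
    rows = db.execute(
        "SELECT run, status FROM reprocessing_jobs WHERE proposal = ?",
        (proposal,),
    )
    return dict(rows.fetchall())
```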
Consider me sniped 🤓. We do do this for calibration, but it's kind of a pain, and we're still trying to figure out how best to do it. The main challenge is requeuing. In theory, jobs from any state can be requeued, but one big thing we care about is pre-emption: jobs being cancelled to make space for higher priority jobs, after which the cancelled job is automatically requeued. So something is running, then it's not running, but it's not yet done. There are two approaches:
```python
STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}
```
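Just to illustrate, here is a rough sketch of how a listener might poll Slurm and classify a job against that set. It assumes `sacct` (which still reports jobs after they leave the queue) and is not the actual calibration code:

```python
import subprocess

STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
}

def job_state(job_id: str) -> str:
    # -X: only the job allocation (not its steps), -n: no header, -P: parseable output
    out = subprocess.run(
        ['sacct', '-j', job_id, '-X', '-n', '-P', '-o', 'State'],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # States like "CANCELLED by 12345" carry a suffix; keep just the first word.
    return out.split()[0] if out else 'UNKNOWN'

def is_finished(job_id: str) -> bool:
    # A pre-empted and requeued job reports PENDING/REQUEUED again,
    # so it correctly does not count as finished here.
    return job_state(job_id) in STATES_FINISHED
```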
Ufff, I just realized I missed a critical point from my earlier comment: how to trigger reprocessing from the GUI in the first place 🙃 This is something that I think could be a REST endpoint in the listener; the PyQt GUI could then make an HTTP request to trigger a job, and the web frontend could do the same in the future. @CammilleCC, would this work with your design?
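As a rough illustration of that idea (the endpoint, the payload, and the `queue_reprocessing` helper are all made up):

```python
# Listener side: a hypothetical Flask endpoint to enqueue reprocessing.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/reprocess', methods=['POST'])
def reprocess():
    payload = request.get_json()   # e.g. {"proposal": 1234, "runs": [5, 6]}
    for run in payload['runs']:
        queue_reprocessing(payload['proposal'], run)  # placeholder for the listener's logic
    return jsonify(status='queued'), 202

# GUI (or web frontend) side: trigger it with a plain HTTP request.
import requests
requests.post('http://listener-host:8000/reprocess',
              json={'proposal': 1234, 'runs': [5, 6]}, timeout=5)
```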
Ah man, that is way more complicated than I thought 🥲 Have you ever tried using the REST API? https://slurm.schedmd.com/rest_api.html#slurmdbV0039GetJob
Discussed with Cammille, and it turns out that since the webserver is currently separate from the listener we can stick to Kafka. That should simplify things a lot: the listener can listen on a Kafka topic for reprocessing requests (which should include the database ID) as well as the migration/calibration topics, and the GUI can send a Kafka message when a run is requested to be reprocessed.
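A minimal sketch of what that could look like with kafka-python; the topic name, broker address, message fields, and `start_reprocessing` helper are assumptions, not the actual ones:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = 'damnit-reprocess'       # placeholder topic name
BROKERS = ['broker-host:9092']   # placeholder broker address

# GUI side: ask the listener to reprocess a run.
producer = KafkaProducer(bootstrap_servers=BROKERS,
                         value_serializer=lambda m: json.dumps(m).encode())
producer.send(TOPIC, {'proposal': 1234, 'run': 5, 'db_id': 42})
producer.flush()

# Listener side: consume reprocessing requests alongside the migration/calibration topics.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                         value_deserializer=lambda m: json.loads(m.decode()))
for msg in consumer:
    req = msg.value
    start_reprocessing(req['proposal'], req['run'], req['db_id'])  # placeholder
```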
As far as I've found, the Slurm REST API contains the same info that you can get from the command-line tools.

A lot of the time, monitoring that updates once a minute or so is 'good enough', so we can live with polling. After all, what the jobs do usually takes at least a few minutes. I think the way we're doing this for offline calibration is working OK for now - at least, I don't remember any more issues since we started using it.
I'm going to close this PR now, as it was superseded by #285.
A copy of my old branch with a few alterations. Enables reprocessing from the GUI by right-clicking a given row (or selection of rows) on the run table. A color scheme indicates the reprocessing status (a green row background means the run is being reprocessed, and blue means that the row/run is in the queue to be reprocessed).
The runs to be reprocessed are added to a queue and reprocessed in a separate thread, one by one, in order. If one or more reprocessings fail, a message will appear in the status bar advising the user to read the log files.
Support for SLURM jobs is still not available, so the color scheme might be misleading if SLURM jobs are required for some variables. Yet, for the HED use-case SLURM jobs are not so relevant (an alternative would be to merge this into the HED branch, but I'm not sure which one that would be).
Upon closing the GUI, if a run is in the middle of being reprocessed, the reprocessing will continue. I'm not sure how to overcome this, nor whether there's a use case in which the user would actually close the GUI during the experiment. I'm open to suggestions in this regard.
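For reference, the queue-and-worker pattern described above could be sketched roughly like this (a simplified illustration, not the actual PR code; `reprocess_func` and `on_error` are placeholders for the extraction call and the status-bar message):

```python
import queue
import threading
import traceback

class ReprocessingQueue:
    """Run reprocessing jobs one by one in a background thread (simplified sketch)."""

    def __init__(self, reprocess_func, on_error):
        self._queue = queue.Queue()
        self._reprocess = reprocess_func   # e.g. runs the extraction for a single run
        self._on_error = on_error          # e.g. shows a message in the status bar
        self._worker = threading.Thread(target=self._work)
        self._worker.start()

    def add(self, proposal, run):
        # Called from the GUI when rows are right-clicked; the run is now 'queued' (blue).
        self._queue.put((proposal, run))

    def stop(self):
        # Optional clean shutdown: the run currently being processed still finishes first.
        self._queue.put(None)

    def _work(self):
        while True:
            item = self._queue.get()
            if item is None:
                break
            proposal, run = item
            try:
                self._reprocess(proposal, run)   # the run is 'being reprocessed' (green) here
            except Exception:
                self._on_error(proposal, run, traceback.format_exc())
            finally:
                self._queue.task_done()
```

Because the worker is a regular (non-daemon) thread, a run that is already being processed keeps going even if the main window is closed, which matches the behaviour described above; a `stop()` sentinel would be one way to shut it down deliberately.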