Proposal for implementing hold, release, and info methods #521

tat-ohmura · 2025-05-28T07:15:39Z

This is part of our effort to integrate PSI/J into Open OnDemand (https://openondemand.org/), a web portal for HPC systems.
Currently, Open OnDemand maintains an adapter (backend) for each scheduler, leading to increased maintenance costs.
We are planning to utilize PSI/J to abstract different schedulers and create a single adapter for all schedulers supported by PSI/J.
Open OnDemand requires APIs for hold, release, and info operations, in addition to job submission and deletion already supported by PSI/J. We have implemented these methods as follows:

- hold: Suspend a pending job
- release: Resume a suspended job
- info: Query job information

Once this PR is approved, we will update the documentation accordingly. We would appreciate any feedback.
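
To make the hold and release operations listed above concrete, here is a minimal sketch of how they could map onto scheduler commands. It is not the code from this PR: the `_COMMANDS` table and the `build_command` helper are hypothetical names used only for illustration. The commands themselves (`scontrol hold`/`scontrol release` for Slurm, `qhold`/`qrls` for PBS) are the standard tools for those schedulers.

```python
from typing import List

# Hypothetical lookup table, not taken from the PR.
# Keys are (scheduler, operation); values are the command prefix.
_COMMANDS = {
    ("slurm", "hold"): ["scontrol", "hold"],
    ("slurm", "release"): ["scontrol", "release"],
    ("pbs", "hold"): ["qhold"],
    ("pbs", "release"): ["qrls"],
}


def build_command(scheduler: str, operation: str, native_id: str) -> List[str]:
    """Return the argv list that holds or releases the given job."""
    try:
        prefix = _COMMANDS[(scheduler, operation)]
    except KeyError:
        raise ValueError(
            f"{operation!r} is not mapped for scheduler {scheduler!r}"
        ) from None
    # The native job ID is always the last argument in this sketch.
    return prefix + [native_id]
```

The resulting list could be passed to `subprocess.run`; the sketch only builds the command so that it can be inspected without a scheduler installed.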

hategan · 2025-05-28T17:33:56Z

Hi and thank you for the PR. I am adding @andre-merzky to the discussion.

This is indeed something that makes good sense and I believe it to be within the scope of PSI/J. Hold/release were not part of the initial design because we mostly focused on what we perceived automation to be, namely workflows. As far as I can tell without doing a full review, the code looks clean and nicely follows the existing codebase in style and organization.

I think it is likely that info/JobInfo would need a bit of discussion due to some overlapping functionality with the existing code. Specifically, the walltime, various state transition times, the node list, etc. are already available in other places, although not as nicely aggregated. We would also want to ensure that a potential info() can be somewhat uniformly implemented over all schedulers.

So I'll add a few research tasks here that we should probably work out. Some likely have obvious answers, but it's probably a good idea to have them listed out anyway.

- Can we reasonably implement hold/release over the entire range of batch schedulers?
- For executors where hold/release don't apply (e.g., local), do we have a graceful way of saying "Not implemented/does not apply"?
- Should hold/release be added to the specification (https://github.com/ExaWorks/job-api-spec)?
- Can we aggregate JobInfo-like information from existing PSI/J sources and is the available information sufficient to satisfy OOD's requirements?
  - If anything is missing, can it be reasonably added?
- Would JobInfo be a better option than some of the more ad-hoc mechanisms in PSI/J for reporting job status?

Let's try to answer these and go from there.
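
As one possible starting point for the "Not implemented/does not apply" question above, the sketch below has the base class raise a dedicated exception by default and offers a `supports()` check so callers can ask before trying. Every name here (`OperationNotSupported`, `SketchExecutor`, `SketchLocalExecutor`, `SketchBatchExecutor`, `supports`) is hypothetical and is not part of the PSI/J API or of this PR.

```python
class OperationNotSupported(Exception):
    """Raised when an executor has no meaningful way to do an operation."""


class SketchExecutor:
    """Hypothetical base class; stands in for a real executor base."""

    name = "base"

    def hold(self, native_id: str) -> None:
        raise OperationNotSupported(f"hold does not apply to '{self.name}'")

    def release(self, native_id: str) -> None:
        raise OperationNotSupported(f"release does not apply to '{self.name}'")

    def supports(self, operation: str) -> bool:
        # An operation counts as supported only if a subclass overrides it.
        mine = getattr(type(self), operation, None)
        base = getattr(SketchExecutor, operation, None)
        return mine is not None and mine is not base


class SketchLocalExecutor(SketchExecutor):
    """No queue, so hold/release keep the default 'does not apply'."""

    name = "local"


class SketchBatchExecutor(SketchExecutor):
    """Pretend batch executor that only records which jobs are held."""

    name = "batch"

    def __init__(self) -> None:
        self.held = set()

    def hold(self, native_id: str) -> None:
        self.held.add(native_id)

    def release(self, native_id: str) -> None:
        self.held.discard(native_id)
```

A portal such as OOD could call `supports("hold")` to decide whether to show a hold button, and still catch `OperationNotSupported` as a fallback.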

tat-ohmura · 2025-05-30T05:54:34Z

Thank you for your prompt response. I'm happy to be able to discuss this with you.

I believe we need to consider the use cases for how hold and release can be utilized within workflow automation. At the very least, hold and release operations are necessary for OOD, but I would also like to investigate whether other workflow tools have similar use cases for these functions.

I would also like to discuss info/JobInfo. Monitoring not only job statuses but also real-time job information, such as CPU usage, can be useful for verifying job health. Since some of this information overlaps, I think we need to organize it properly. Regarding info, in cases where monitoring needs to be done separately from the job submission process, the current system does not seem to provide all the necessary data. Since job schedulers retain submission-time information, I was considering them as a way to query and update job details. At least for OOD, this was essential.

I will look into the research tasks raised.
I appreciate your support and look forward to continued collaboration.

hategan · 2025-05-30T22:14:21Z

[...]

> I believe we need to consider the use cases for how hold and release can be utilized within workflow automation. At the very least, hold and release operations are necessary for OOD, but I would also like to investigate whether other workflow tools have similar use cases for these functions.

@andre-merzky can weigh in on this, but I think that hold/release are a reasonable part of interacting with a scheduler and should be included. We'll need to do some work on our end beyond this PR, but that can be done separately.

> I would also like to discuss info/JobInfo. Monitoring not only job statuses but also real-time job information, such as CPU usage, can be useful for verifying job health. Since some of this information overlaps, I think we need to organize it properly. Regarding info, in cases where monitoring needs to be done separately from the job submission process, the current system does not seem to provide all the necessary data. Since job schedulers retain submission-time information, I was considering them as a way to query and update job details. At least for OOD, this was essential.

These are indeed useful. However, real-time CPU usage would be beyond the scope of PSI/J. There are two reasons for this:

- There is no simple way to get this information on arbitrary machines.
- Adding a mechanism to deal with live information coming from running jobs involves complexity that would make PSI/J difficult to maintain. This is quite important given that getting financial support for infrastructure projects like PSI/J is difficult. It is also important because PSI/J depends on contributions from users with access to machines that we do not have access to (NQSV is a perfect example), and we want to ensure that we do not make these contributions too complex.

Perhaps we could start by listing exactly what information is needed by OOD, and then we can see if there is a way to implement a solution that can be layered on top of PSI/J rather than within.
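
To illustrate what "layered on top of PSI/J rather than within" might look like, here is a rough sketch of a collector that aggregates state-transition times into a JobInfo-like record. It is an assumption-heavy illustration: `JobInfo`, `JobInfoCollector`, `on_status`, and `wall_time` are hypothetical names, states are plain strings modeled on PSI/J's job state names, and nothing here calls PSI/J itself. It also only covers information observed in the submitting process, so it does not address monitoring from a separate process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional

# Assumed final states, written as plain strings for this sketch.
FINAL_STATES = ("COMPLETED", "FAILED", "CANCELED")


@dataclass
class JobInfo:
    """Hypothetical aggregate of what has been observed about one job."""

    native_id: str
    state: Optional[str] = None
    transitions: Dict[str, datetime] = field(default_factory=dict)

    def wall_time(self) -> Optional[timedelta]:
        """Time from ACTIVE to the first final state, if both were seen."""
        start = self.transitions.get("ACTIVE")
        ends = [self.transitions[s] for s in FINAL_STATES if s in self.transitions]
        if start is None or not ends:
            return None
        return min(ends) - start


class JobInfoCollector:
    """Records status updates; would be fed from a status callback."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobInfo] = {}

    def on_status(self, native_id: str, state: str, when: datetime) -> None:
        info = self._jobs.setdefault(native_id, JobInfo(native_id))
        info.state = state
        # Keep the first time each state was seen.
        info.transitions.setdefault(state, when)

    def info(self, native_id: str) -> Optional[JobInfo]:
        return self._jobs.get(native_id)
```

Fields that the scheduler alone knows (for example, values recorded at submission time by another process) would still need a query against the scheduler, which is the part that remains open in this discussion.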

> I will look into the research tasks raised. I appreciate your support and look forward to continued collaboration.

And thank you for your input and contributions.

tat-ohmura added 2 commits April 25, 2025 15:13

Add hold,release,info function for open ondemand support

361ec47

fix slurm.py

f0ebbf3
