Proposal for implementing hold, release, and info methods #521

Open: wants to merge 2 commits into base: main

tat-ohmura
Contributor

This is part of our effort to integrate PSI/J into Open OnDemand (https://openondemand.org/), a web portal for HPC systems.
Currently, Open OnDemand maintains a separate adapter (backend) for each scheduler, which increases maintenance costs.
We plan to use PSI/J to abstract over the different schedulers, so that a single adapter covers every scheduler PSI/J supports.
Open OnDemand requires APIs for hold, release, and info operations, in addition to the job submission and deletion that PSI/J already supports. We have implemented these methods as follows:

  • hold: Suspend a pending job
  • release: Resume a suspended job
  • info: Query job information

Once this PR is approved, we will update the documentation accordingly. We would appreciate any feedback.
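
To make the intended semantics of the three operations concrete, here is a minimal in-memory sketch. It is not the code in this PR and does not use the PSI/J API: `ToyScheduler`, `JobInfo`, and `State` are hypothetical names chosen for illustration, and the rule that only a queued job can be held is an assumption taken from the "suspend a pending job" description above.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    QUEUED = "queued"
    HELD = "held"
    ACTIVE = "active"


@dataclass
class JobInfo:
    """Hypothetical record returned by info(); fields are illustrative."""
    native_id: str
    state: State
    queue: Optional[str] = None


class ToyScheduler:
    """In-memory stand-in for a batch scheduler (hypothetical, not PSI/J code)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobInfo] = {}

    def submit(self, native_id: str, queue: str = "default") -> None:
        self._jobs[native_id] = JobInfo(native_id, State.QUEUED, queue)

    def hold(self, native_id: str) -> None:
        # Assumption: only a pending (queued) job can be held.
        job = self._jobs[native_id]
        if job.state is not State.QUEUED:
            raise ValueError(f"cannot hold a job in state {job.state.name}")
        job.state = State.HELD

    def release(self, native_id: str) -> None:
        # Releasing returns a held job to the queue.
        job = self._jobs[native_id]
        if job.state is not State.HELD:
            raise ValueError(f"cannot release a job in state {job.state.name}")
        job.state = State.QUEUED

    def info(self, native_id: str) -> JobInfo:
        # Return a snapshot so callers cannot mutate scheduler state.
        return replace(self._jobs[native_id])
```

A caller would submit a job, call `hold` while it is still queued, and later call `release` to make it eligible to run again; `info` can be queried at any point using only the native job ID.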

@hategan
Collaborator

hategan commented May 28, 2025

Hi and thank you for the PR. I am adding @andre-merzky to the discussion.

This is indeed something that makes good sense and I believe it to be within the scope of PSI/J. Hold/release were not part of the initial design because we mostly focused on what we perceived automation to be, namely workflows. As far as I can tell without doing a full review, the code looks clean and follows the existing codebase nicely in style and organization.

I think it is likely that info/JobInfo would need a bit of discussion due to some overlapping functionality with the existing code. Specifically, the walltime, various state transition times, the node list, etc. are already available in other places, although not as nicely aggregated. We would also want to ensure that a potential info() can be somewhat uniformly implemented over all schedulers.

So I'll add a few research tasks here that we should probably work out. Some likely have obvious answers, but it's probably a good idea to have them listed out anyway.

  • Can we reasonably implement hold/release over the entire range of batch schedulers?
  • For executors where hold/release don't apply (e.g., local), do we have a graceful way of saying "Not implemented/does not apply"?
  • Should hold/release be added to the specification (https://github.com/ExaWorks/job-api-spec)?
  • Can we aggregate JobInfo-like information from existing PSI/J sources and is the available information sufficient to satisfy OOD's requirements?
    • If anything is missing, can it be reasonably added?
  • Would JobInfo be a better option than some of the more ad-hoc mechanisms in PSI/J for reporting job status?

Let's try to answer these and go from there.
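
On the second question in the list above, one possible shape for a graceful "does not apply" answer is a dedicated exception plus a capability check. This is only a sketch under assumptions: `OperationNotSupported`, `BaseExecutor`, `supports`, and the two subclasses are hypothetical names and are not part of PSI/J or of this PR.

```python
from typing import FrozenSet, Set


class OperationNotSupported(Exception):
    """Raised when an executor has no meaningful implementation of an operation."""

    def __init__(self, executor: str, operation: str) -> None:
        super().__init__(f"executor '{executor}' does not support '{operation}'")
        self.executor = executor
        self.operation = operation


class BaseExecutor:
    """Hypothetical base class: optional operations fail in a uniform way."""

    name = "base"
    # Optional operations this executor implements.
    capabilities: FrozenSet[str] = frozenset()

    def supports(self, operation: str) -> bool:
        return operation in self.capabilities

    def hold(self, native_id: str) -> None:
        raise OperationNotSupported(self.name, "hold")

    def release(self, native_id: str) -> None:
        raise OperationNotSupported(self.name, "release")


class LocalExecutor(BaseExecutor):
    """Runs jobs immediately, so there is no queue to hold them in."""

    name = "local"


class BatchExecutor(BaseExecutor):
    """Stand-in for a scheduler-backed executor that can hold queued jobs."""

    name = "batch"
    capabilities = frozenset({"hold", "release"})

    def __init__(self) -> None:
        self.held: Set[str] = set()

    def hold(self, native_id: str) -> None:
        self.held.add(native_id)

    def release(self, native_id: str) -> None:
        self.held.discard(native_id)
```

With this shape, a client such as a portal could call `supports("hold")` to decide whether to show a hold button at all, and code that calls `hold` blindly would still get one well-defined exception type rather than a scheduler-specific failure.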

@tat-ohmura
Contributor Author

Thank you for your prompt response. I'm happy to be able to discuss this with you.

I believe we need to consider the use cases for how hold and release can be utilized within workflow automation. At the very least, hold and release operations are necessary for OOD, but I would also like to investigate whether other workflow tools have similar use cases for these functions.

I would also like to discuss info/JobInfo. Monitoring not only job statuses but also real-time job information, such as CPU usage, can be useful for verifying job health. Since some of this information overlaps, I think we need to organize it properly. Regarding info, in cases where monitoring needs to be done separately from the job submission process, the current system does not seem to provide all the necessary data. Since job schedulers retain submission-time information, I was considering them as a way to query and update job details. At least for OOD, this was essential.

I will look into the research tasks raised.
I appreciate your support and look forward to continued collaboration.

@hategan
Collaborator

hategan commented May 30, 2025

[...]

> I believe we need to consider the use cases for how hold and release can be utilized within workflow automation. At the very least, hold and release operations are necessary for OOD, but I would also like to investigate whether other workflow tools have similar use cases for these functions.

@andre-merzky can weigh in on this, but I think that hold/release are a reasonable part of interacting with a scheduler and should be included. We'll need to do some work on our end beyond this PR, but that can be done separately.

> I would also like to discuss info/JobInfo. Monitoring not only job statuses but also real-time job information, such as CPU usage, can be useful for verifying job health. Since some of this information overlaps, I think we need to organize it properly. Regarding info, in cases where monitoring needs to be done separately from the job submission process, the current system does not seem to provide all the necessary data. Since job schedulers retain submission-time information, I was considering them as a way to query and update job details. At least for OOD, this was essential.

These are indeed useful. However, real-time CPU usage would be beyond the scope of PSI/J. There are two reasons for this:

  1. There is no simple way to get this information on arbitrary machines
  2. Adding a mechanism to deal with live information coming from running jobs involves complexity that would make PSI/J difficult to maintain. This is quite important given that getting financial support for infrastructure projects like PSI/J is difficult. It is also important because PSI/J depends on contributions from users with access to machines that we do not have access to (NQSV is a perfect example), and we want to ensure that we do not make these contributions too complex.

Perhaps we could start by listing exactly what information is needed by OOD, and then we can see if there is a way to implement a solution that can be layered on top of PSI/J rather than within.
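
As a rough illustration of what "layered on top" could mean, the sketch below aggregates status-change events into a JobInfo-like summary without touching any executor. Everything here is an assumption for discussion: `JobSummary`, `JobInfoAggregator`, and `on_status` are hypothetical names, states are plain strings, and the events would have to be fed in by whatever status notification mechanism the client already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class JobSummary:
    """Hypothetical aggregated view of one job."""
    native_id: str
    state: str = "NEW"
    # First time each state was observed.
    transitions: Dict[str, datetime] = field(default_factory=dict)

    def wall_time(self) -> Optional[timedelta]:
        """Elapsed time from ACTIVE to a final state, if both were observed."""
        start = self.transitions.get("ACTIVE")
        end = self.transitions.get("COMPLETED") or self.transitions.get("FAILED")
        if start is None or end is None:
            return None
        return end - start


class JobInfoAggregator:
    """Collects status events outside the executor and answers info queries."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobSummary] = {}

    def on_status(self, native_id: str, state: str, time: datetime) -> None:
        summary = self._jobs.setdefault(native_id, JobSummary(native_id))
        summary.state = state
        # Keep the earliest timestamp if a state is reported more than once.
        summary.transitions.setdefault(state, time)

    def info(self, native_id: str) -> JobSummary:
        return self._jobs[native_id]
```

This only covers information that can be derived from state transitions. Fields that exist only in the scheduler, such as anything needed when monitoring runs separately from submission, would still need a query path, which is the part worth listing explicitly.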

> I will look into the research tasks raised. I appreciate your support and look forward to continued collaboration.

And thank you for your input and contributions.
