
Conversation


@trz42 trz42 commented Jun 29, 2025

This PR is based on the proof-of-concept #213. It aims at keeping only the necessary code from the proof-of-concept, moving the main loop into a separate script, and thus leaving the existing ingestion code unchanged.

Summary of the ideas/changes/additions:

  • Adds CI for ensuring code quality (flake8, existing code is not validated) and running pytest.
  • We aim at using type hints for all function arguments and return values.
  • Improved logging combining levels (as before), scopes (to limit logging to parts of the code) and a decorator for logging function entry & exit. We aim at using the decorator for all functions to provide detailed debugging means.
  • Model the client to fetch files and ETags from a remote storage service.
  • Model an S3 bucket (e.g., hosted on AWS or Minio).
  • Model a file and its signature, including functions to download them, use ETags to only download them if they have changed on the remote storage, and verify the signature. A file can be the payload (tarball), a metadata (or task) file, or any other file of interest.
  • Model a task description (essentially the read in metadata or task file and some associated convenience functions such as obtaining the architecture from the name of the metadata/task file).
  • Model a task payload (could be a list of directories/files to be removed from a CVMFS repo, a tarball containing software installations, or anything that should be applied to a CVMFS repo).
  • Model a task (combines the task description and the task payload, provides most of the logic to process a task, ensures that a task for a single payload is bundled in a single staging PR, updates its information in the staging repo, ...).
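The entry/exit logging decorator mentioned above could look roughly like this (a minimal sketch; the PR's actual implementation uses log_message and LoggingScope, which are not reproduced here):

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")


def log_function_entry_exit(logger=None):
    """Decorator that logs entry to and exit from the wrapped function."""
    logger = logger or logging.getLogger(__name__)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("entering %s", func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                logger.debug("leaving %s", func.__name__)
        return wrapper
    return decorator


@log_function_entry_exit()
def add(a: int, b: int) -> int:
    return a + b
```

Using functools.wraps keeps the wrapped function's name and docstring intact, which matters when many functions carry the decorator.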

States, repository directory structure ... in a picture

[image: ingest_bundles_infographics]

High-level overview of state handler functions

_handle_add_undetermined

  1. Determine sequence number (corresponds to open or yet-to-be-opened pull request)
  2. Create files and directories with a single commit in default branch (see picture above)

_handle_add_new_task

  1. Init payload object (EESSITaskPayload) by downloading payload
  2. Update TaskState file

_handle_add_payload_staged

  1. Determines feature branch name
  2. Creates feature branch if it doesn't exist (TaskState is still PAYLOAD_STAGED in default and feature branch after it was created)
  3. Searches for a PR for the feature branch
  4. If none found: update states (default branch: PULL_REQUEST, feature branch: APPROVED) and create pull request
  5. If found and closed: open issue (TO BE IMPLEMENTED)
  6. If found and open: update states (default branch: PULL_REQUEST, feature branch: APPROVED) and update pull request
    Creating/updating a pull request will create and update a TaskSummary.html file and create/update the description of the pull request.

_handle_add_pull_request

  1. Determines state of PR
  2. If PR was closed, it changes state to REJECTED
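The way the main loop drives these handlers can be pictured as a small state machine (an illustrative sketch; EESSITaskState mirrors names from the PR, while process_task and FakeTask are invented here to make the example self-contained):

```python
from enum import Enum, auto


class EESSITaskState(Enum):
    UNDETERMINED = auto()
    NEW_TASK = auto()
    PAYLOAD_STAGED = auto()
    PULL_REQUEST = auto()
    REJECTED = auto()


def process_task(task):
    """Call the handler for the current state until it stops making progress."""
    state = task.determine_state()
    while True:
        next_state = task.handle(state)
        if next_state == state:
            # handler returned the state unchanged: stop and wait for the next run
            return state
        state = next_state


class FakeTask:
    """Stub that walks the 'happy path' and then waits on PR review."""
    _next = {
        EESSITaskState.UNDETERMINED: EESSITaskState.NEW_TASK,
        EESSITaskState.NEW_TASK: EESSITaskState.PAYLOAD_STAGED,
        EESSITaskState.PAYLOAD_STAGED: EESSITaskState.PULL_REQUEST,
        EESSITaskState.PULL_REQUEST: EESSITaskState.PULL_REQUEST,
    }

    def determine_state(self):
        return EESSITaskState.UNDETERMINED

    def handle(self, state):
        return self._next[state]
```

The key convention is that a handler returning its input state signals "no progress possible right now", which ends handling of that task until the next run.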

@trz42 trz42 added enhancement New feature or request help wanted Extra attention is needed labels Jun 29, 2025
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO", "commit created: '%s'", commit)
except Exception as err:
log_message(LoggingScope.TASK_OPS, "ERROR", "Error creating commit: '%s'", err)
# TODO: rollback previous changes (task description file, task state file)

Is that actually important? I mean, it doesn't hurt, but at least with the current code, nothing will break: you return EESSITaskState.UNDETERMINED, which means the task hasn't progressed after calling handle(), which means the while loop will exit and move on to the next task.

# TODO: rollback previous changes (task description file, task state file)
return EESSITaskState.UNDETERMINED

# TODO: verify that the sequence number is still valid (PR corresponding to the sequence number

You mean: in theory the PR could have been closed/merged while this code was running? What would actually happen then? I think:

  • Commit above may have been done to a closed or merged feature branch (not a big deal).
  • Commit does NOT make it into the main branch because of this
  • Next iteration that the code is run, it will consider this a task that hasn't been handled yet and thus call _handle_add_undetermined again

So, doesn't the next time the ingest_bundles.py is run just solve this scenario? I don't think extra handling is really necessary. Unless you want to guarantee that everything gets handled in one iteration. In that case you could consider checking if the commit made it to main, and if not, call _handle_add_undetermined again - before returning. In this case, you may want to give it some internal counter on the maximum recursiveness of these calls, so as to avoid infinite loops, but it'd be rather unlikely that this happens several times in a row anyway.

Anyway, I think it's a "nice to have", I'd also be fine with just putting a clear comment in the code describing the current behaviour (i.e. the next call to ingest_bundles.py will fix it - if I understand correctly), and the potential optimization that can be done here. But I wouldn't prioritize it now.
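The bounded-retry idea could be sketched like this (hypothetical helper names, not code from the PR; the callables stand in for the real handler and a check that the commit landed on the default branch):

```python
def retry_until_on_main(run_handler, commit_reached_main, max_retries=3):
    """Re-run the handler if the commit did not land on the default branch,
    with a counter bounding the number of attempts to avoid infinite loops."""
    for _ in range(max_retries):
        run_handler()
        if commit_reached_main():
            return True
    # give up for now; the next ingest_bundles.py run will pick the task up
    return False
```

Since the race is unlikely to occur twice in a row, a small bound like 3 should be more than enough.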

# NOTE, for EESSITaskState.PULL_REQUEST, EESSITaskState.APPROVED must be the first element or
# _next_state() will not work correctly
self.valid_transitions = {
EESSITaskState.UNDETERMINED: [

I'm a bit confused. Here, you say essentially every state is a valid next state for EESSITaskState.UNDETERMINED. But then when I look at the handler for the undetermined task state, the only potential return values are UNDETERMINED (not part of this list!) or NEW_TASK?
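A cheap way to catch such mismatches early would be a consistency check between the transition table and the states each handler can actually return (a sketch with invented names; handler_returns would have to be maintained by hand or derived from tests):

```python
def find_invalid_returns(valid_transitions, handler_returns):
    """Return (state, returned_state) pairs where a handler can return a
    state that is neither its input state nor a listed valid transition."""
    problems = []
    for state, returns in handler_returns.items():
        for ret in returns:
            if ret != state and ret not in valid_transitions.get(state, []):
                problems.append((state, ret))
    return problems
```

Running this as a pytest case would flag cases like the one described here, where a handler's return value is missing from valid_transitions.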

log_message(LoggingScope.TASK_OPS_DETAILS, "INFO", "next_state: '%s'", next_state)

# initialize payload object
self._init_payload_object()

Maybe clarify a bit what this does (and maybe update the docstring for _init_payload_object() with a similar description. As far as I understand, it sets the paths to the remote and local file and signature locations, and then also downloads the payload. That's relevant to know, because it makes it clear that this is a potentially time-consuming step if the payload is big.

In fact, maybe this should be in the description of this handler: in the "new task" state, the handler will download the payload, and progress to the 'payload_staged' state.

# initialize payload object
self._init_payload_object()

# update TaskState file content

Comment should probably reflect that this updates the TaskState files both in the default and feature branches. That's also relevant for the TODO listed below, since if something gets updated in the default branch, it means we can't simply rely on the next iteration to fix things for us - since the state has permanently changed.

Comment on lines +955 to +956
# TODO: could have been merged already --> check if PR corresponding to the feature branch exists
# ASSUME: it has not existed before --> create it

I'm fine with keeping this a TODO, but I would like it if you can add a comment to the code what the implication of your assumption below is. I'd guess the following

Suggested change
# TODO: could have been merged already --> check if PR corresponding to the feature branch exists
# ASSUME: it has not existed before --> create it
# ASSUMPTION: for now, we assume the feature branch never existed, and just create it
# This assumption is false if the feature branch did exist and was merged in between when
# `_handle_undetermined` checked whether a PR existed and this point in the code.
# For now, it means we should be careful not to merge feature branches while tarballs may still be coming in.
# TODO: could have been merged already --> check if PR corresponding to the feature branch exists

if pull_request.state == "closed":
log_message(LoggingScope.TASK_OPS, "INFO",
"PR '%s' is closed, creating issue", pull_request)
# TODO: create issue

Suggested change
# TODO: create issue
# For now, we return EESSITaskState.PAYLOAD_STAGED, to indicate that the state hasn't progressed.
# Since the task is returned in the same state as in which it entered the handler, the main `while` loop
# in ingest_bundles.py is quit for this task and no further handling is done (until the next iteration).
# TODO: make sure an issue is created, so that we can act on it, e.g. by cleaning up the feature branch

else:
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"found existing PR for branch '%s': '%s'", feature_branch_name, pull_request)
# TODO: check if PR is open or closed

That's not a TODO, right? That's what you are doing?

log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"no PR found for branch '%s'", feature_branch_name)

# TODO: add failure handling (capture result and act on it)
@casparvl casparvl Aug 11, 2025


Suggested change
# TODO: add failure handling (capture result and act on it)
# The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file
# In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
# into the default branch, the default branch reflects the current state (namely: approval) of the PR.
# TODO: add failure handling (capture result and act on it)

else:
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"PR '%s' is open, updating task states", pull_request)
# TODO: add failure handling (capture result and act on it)
@casparvl casparvl Aug 11, 2025


Suggested change
# TODO: add failure handling (capture result and act on it)
# The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file
# In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
# into the default branch, the default branch reflects the current state (namely: approval) of the PR.
# TODO: add failure handling (capture result and act on it)

"""
Update task states in default and feature branches

States have to be updated in a specific order and in particular the default branch has to be

Suggested change
States have to be updated in a specific order and in particular the default branch has to be
The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file.
In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
into the default branch, the default branch reflects the current state (namely: approval) of the PR.
States have to be updated in a specific order and in particular the default branch has to be

Comment on lines +701 to +704
if len(prs):
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
return prs[0] if prs else None

Shouldn't we be more careful here? I imagine there should only be a single PR for a given branch (can you make more? I don't know), but it's good to check

Suggested change
if len(prs):
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
return prs[0] if prs else None
if len(prs) == 1:
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
"1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
elif len(prs) > 1:
raise RuntimeError("More than one PR was found for branch name %s" % branch_name)
return prs[0] if prs else None

PullRequest object if found, None otherwise
"""
try:
prs = [pr for pr in list(self.git_repo.get_pulls(state="all"))

Should this somehow be limited to PRs that also target a certain branch (the default branch)?
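With PyGithub, the lookup could be narrowed to PRs targeting the default branch via the base parameter of get_pulls (a sketch; the repository object is stubbed here so the example is self-contained, and find_pr_for_branch is an illustrative name, not the PR's exact code):

```python
from types import SimpleNamespace


def find_pr_for_branch(repo, branch_name, base_branch):
    """Find the PR whose head is branch_name and whose base is base_branch.
    With PyGithub this would call repo.get_pulls(state="all", base=base_branch)."""
    prs = [pr for pr in repo.get_pulls(state="all", base=base_branch)
           if pr.head.ref == branch_name]
    if len(prs) > 1:
        raise RuntimeError("more than one PR was found for branch name %s" % branch_name)
    return prs[0] if prs else None


def _make_pr(head_ref):
    # minimal stand-in for a PyGithub PullRequest
    return SimpleNamespace(head=SimpleNamespace(ref=head_ref))


class FakeRepo:
    """Minimal stand-in for a PyGithub Repository."""
    def get_pulls(self, state, base):
        return [_make_pr("seq-1"), _make_pr("seq-2")]
```

Filtering on base avoids accidentally matching PRs that happen to reuse a branch name but target a different branch.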

# PR is closed & merged --> deployment is approved
# PR is closed & not merged --> deployment is rejected
feature_branch_name = self._determine_feature_branch_name()
# TODO: check if feature branch exists, for now ASSUME it does

Suggested change
# TODO: check if feature branch exists, for now ASSUME it does

I think this comment can be removed. _find_pr_for_branch returns none if the PR doesn't exist, which is handled gracefully by the rest of _handle_add_pull_request. So I don't think there is anything 'TODO' here?

# PR is merged, hence it should have been in the APPROVED state
# ==> for now, just return EESSITaskState.PULL_REQUEST
#
# there is the possibility that the PR was updated just before the
@casparvl casparvl Aug 11, 2025


I'm not 100% sure if I get this right, but I think you mean: the task was in the EESSITaskState.PULL_REQUEST when the while loop was running. Then, it entered this handler. And while the handler was processing it, it was merged. That means the merge happened after this handler was called, but before self._find_pr_for_branch(feature_branch_name) was called. That window is very tight, there's only one function in between. Or, is the reality that the window is much bigger and that the merge could have happened in any state between UNDETERMINED and now?

Sure, we could open an issue. But that's not going to allow anyone to act fast enough to make sure that this task is not ingested (or to at least re-check it before it is), because that will happen on the next call to ingest_bundles.py.

Should we revert the merge here and immediately re-open the PR and put a comment in it? This way, the reviewer has another chance to look at it. Either that, or we make some mechanism to revert only the changes by this task in the PR and then open a new PR (with new sequence number) for this task (as if it entirely came in after the merge).

Anyway, I think for now, returning EESSITaskState.PULL_REQUEST, and leaving it at that, is a reasonable option.

# NOPE, we have to do some handling here, at least for the tasks where their
# state file did
# --> check if we could have ended up here? If so, create an issue.
# Do we need a state ISSUE_OPENED to avoid processing the task again?

I had to think here: why would the task be processed again? The next time ingest_bundles.py runs, the task.determine_state() will return the last state we wrote to the state file in the default branch, which will be PULL_REQUEST if the _handle_add_payload_staged handler made it to the _update_task_states call that sets the next_state). Then, indeed, this will be evaluated every time.

I'm not sure if that's a problem. Technically, even the ingested and rejected tasks are handled every time that ingest_bundles.py is run, no? As long as that handling is 'light enough', it's ok I guess. We should just clean up the bucket every so often to make sure this doesn't explode, I guess?

# issue is closed
# WE could also defer all handling of this to the handler for the
# REJECTED state
# FOR NOW, we assume that the task was rejected on purpose

I think that's a decent assumption. If a PR contained 5 tasks, and the reviewer rejects them, and a 6th task comes in, it's very unlikely the reviewer would have approved it if the 6th task had been there already. Unless maybe they rejected because there should have been 6, and the fact that there weren't was the sole reason for rejection.

In any case: this is much more easily picked up by a reviewer. They can reopen the PR, change the task states, and, as far as I can tell, this code will then happily process it again. If they then merge, they can change the decision after all.

In summary: I agree with updating the state file.

@log_function_entry_exit()
def _perform_task_add(self) -> bool:
"""Perform the ADD task action"""
# TODO: verify checksum here or before?

I think a checksum should be checked as early as possible, i.e. when it tries to download the payload. If the checksum isn't ok, the payload download failed, and one cannot reasonably move on to the next state. Reasonable thing to do would be to try and redownload x-times, then just forgot about it and try again the next time ingest_bundles.py is run.
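The download-then-verify-then-retry idea could be sketched like this (illustrative helper names, not the PR's code; download stands in for the remote fetch):

```python
import hashlib


def download_and_verify(download, expected_sha256, max_attempts=3):
    """Download the payload and verify its checksum immediately,
    re-downloading a bounded number of times on mismatch."""
    for _ in range(max_attempts):
        data = download()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    # give up for now; the next ingest_bundles.py run will retry the task
    return None
```

Checking the checksum at download time means a corrupted payload never leaves the NEW_TASK stage, instead of failing later during ingestion.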

@casparvl casparvl left a comment


I think this PR adds some really nice functionality, and tests by Bob show that it works.

I do have some minor comments, mostly regarding comments in the code in order to clarify what's happening. I think this will be important for the maintainability - even our future selves will probably struggle to figure out what happens where otherwise. I've tried to capture all changes in suggestions, so that they are easy to apply.

Some general comments:

I would expand the docstrings for each of the handlers. They are now very compact. Explain in words what this handler will do. I could probably write down what is happening overall. My view of this (but you may know better) is:

_handle_state_undetermined:

  • Called when there is no task state file yet in the default branch for this task
  • Figures out the sequence number to which this task should be added. This mechanism allows for bundling multiple tarballs in a single PR (until that one is closed, then a new sequence will be started).
  • Commit initial TaskState and TaskDescription files to the main branch
  • If successful, return the next state (EESSITaskState.NEW_TASK)

_handle_add_new_task:

  • Use payload information from the TaskDescription to initialize an EESSIDataAndSignatureObject, a class representing the data file and its signature both in remote and local storage.
  • Use the EESSIDataAndSignatureObject to initialize a EESSITaskPayload. This downloads the actual tarball and verifies the signature
  • Update the state file in the default branch to reflect the progression to the next stage (payload_staged)
  • If successful, return the next state (EESSITaskState.PAYLOAD_STAGED)

_handle_add_payload_staged:

  • Create a feature branch (if it doesn't exist) for this task.
  • Updates the task state in the default branch (as to reflect creation of a PR in this stage)
  • Merges the default branch into the feature branch, to avoid conflicts
  • Updates the task state in the feature branch (as to reflect the state that this task will have after it has been merged - typically 'approved').
  • Commits those changes
  • Creates a PR (if one doesn't exist already) or updates an existing PR to add information on the current task (e.g. tarball name, metadata) to the PR description
  • If successful, return the next state (EESSITaskState.PULL_REQUEST)

_handle_add_pull_requests

  • Gets the feature branch name for the current task
  • Tries to get the PR for the current feature branch
  • If not successful, return EESSITaskState.PULL_REQUEST, so that all further handling is stopped (for now).
  • If successful: check the PR status
  • If closed & merged: stay in the current state (return EESSITaskState.PULL_REQUEST). This should not typically happen, unless someone merges the PR while this task was being added to it. This may mean a tarball is marked as approved unintentionally. TODO: implement code to deal with this.
  • If closed & unmerged: update the state file in the default branch to reflect that this task is rejected (and return EESSITaskState.REJECTED).
  • If PR is open, return EESSITaskState.PULL_REQUEST. This means further handling of this task is stopped - it is not needed until someone decides to merge or reject the PR.

_handle_add_approved

  • Runs the ingestion_script configured in the config (with sudo, if "ingest_as_root" is set in the config)
  • If successful, updates the task state file in the default branch to the next state (EESSITaskState.INGESTED)
  • If successful, returns the EESSITaskState.INGESTED

_handle_add_ingested

  • Essentially a no-op that just logs that this task was already ingested
  • Returns EESSITaskState.DONE, which will stop further handling by ingest_bundles.py

_handle_add_rejected

  • Essentially a no-op that just logs that this task was already rejected
  • Returns EESSITaskState.DONE, which will stop further handling by ingest_bundles.py

I think this will help us a lot to know what happens where, and thus also where to look if something does not go according to plan. E.g. Suppose we end up with a feature branch but no PR, having a clear docstring description of which handler was responsible for creating the PR from the feature branch immediately tells us where to look. It also means it'll be much easier to insert extra states, or merge states, etc, if we ever feel that is useful.

Finally, I think some user-facing documentation (for EESSI maintainers), like we have for the bot https://www.eessi.io/docs/bot/ , is also useful. It should definitely include a description of the files to expect in the default branch, the feature branch, and what they contain / what the meaning of their contents is. Possibly also with instructions on how to recover from a buggy state. The prime example for now being: a PR is merged after a tarball (i.e. new task) has been added, but before it reached the handle_add_pull_request stage. We might want users to know how to intervene in this case (e.g. revert the merge, reopen the PR, set the state for each task to XYZ on the feature branch and ABC on the main branch). Similarly, the code allows for scenarios where e.g. the state is PULL_REQUEST, but no pull request was found. To recover from this, I guess we should just overwrite the state with a commit to go back one State (though I'd have to double check if that doesn't have unintended side effects).

I consider the docstrings to be something to fix before merging this PR (as well as my suggestions in the various review comments of course). The user-facing documentation can imho be tackled afterwards: the feature in this PR is valuable enough that we want it now, even if that means you, @bedroge and I may need to step in and help out if buggy states arise (because others don't know what they mean / what to do about them).


Neves-P commented Aug 18, 2025

Tested this in production via EESSI/software-layer-scripts#59, which opened staging PR https://github.com/EESSI/staging_bundles/pull/7.

Note: I ran the ingestion/staging script myself on the S0 to be able to time it, which yielded the following:

Staging

real	123m15.124s
user	63m51.337s
sys	7m33.614s

Ingestion

real	110m8.823s
user	358m0.438s
sys	13m13.256s

Everything was successful, the only small change that had to happen for this to work was a small change to the tarball naming done in EESSI/software-layer-scripts#63. Other than that, it was just a matter of setting things up by manually removing old tarballs generated in EESSI/software-layer-scripts#59 after the naming change and also to ensure we started with a clean slate wrt timing.

One small feedback I have from reviewing the PR, the handling of long GitHub comments works well by not printing the tarball contents at all once a certain character limit is reached. It would be nicer if just the name of the tarballs without the contents could be added. Otherwise it's harder to tell that everything has been staged properly. But this is definitely a minor comment that shouldn't keep this PR from being merged.

self.task_object.download(mode=DownloadMode.CHECK_REMOTE)

# verify signature and set initial state
self.signature_verified = self.task_object.verify_signature()

@trz42 this will always verify the signatures regardless of the state (i.e. even for already ingested tarballs), and that's currently causing delays and a lot of I/O on the Stratum 0. See #238.

I think we should only do this for certain states, e.g. only for new tarballs and tarballs that are about to be ingested (and maybe also for ones for which it's going to open a PR)?
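Gating the verification on state could look like this (an illustrative sketch; the state names mirror the PR, but which states belong in the set is exactly the suggestion being made here, not existing code):

```python
# states in which the payload is still acted on, so verification is worthwhile
VERIFY_STATES = {"NEW_TASK", "PAYLOAD_STAGED", "APPROVED"}


def maybe_verify_signature(state, verify):
    """Run the (expensive) signature check only for states where the payload
    will still be used; skip it for states like INGESTED or REJECTED."""
    if state in VERIFY_STATES:
        return verify()
    return None  # skipped: signature was already checked in an earlier state
```

This keeps the guarantee that every payload is verified before it can progress, while avoiding repeated I/O on the Stratum 0 for tasks that are already done.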


boegel commented Aug 26, 2025

Since we're actively using this, let's merge this as is, and divert fixes/improvements to additional follow-up PRs that are easier to review.

I think @bedroge is best suited for this...


bedroge commented Aug 27, 2025

I've created issues for most/all of the things mentioned in this PR and on Slack: https://github.com/EESSI/filesystem-layer/issues?q=state%3Aopen%20label%3A%22Automated%20ingestion%22.

Let's merge this and address the issues in follow-up PRs.

@bedroge bedroge marked this pull request as ready for review August 27, 2025 08:49
@bedroge bedroge dismissed casparvl’s stale review August 27, 2025 08:50

Issues have been created, and these will be addressed in follow-up PRs.

@bedroge bedroge merged commit c2e4c95 into EESSI:main Aug 27, 2025
21 of 22 checks passed