new method for ingesting tarballs via a single staging PR #232
Conversation
```python
    log_message(LoggingScope.TASK_OPS_DETAILS, "INFO", "commit created: '%s'", commit)
except Exception as err:
    log_message(LoggingScope.TASK_OPS, "ERROR", "Error creating commit: '%s'", err)
    # TODO: rollback previous changes (task description file, task state file)
```
Is that actually important? I mean, it doesn't hurt, but at least with the current code, nothing will break: you return `EESSITaskState.UNDETERMINED`, which means the task hasn't progressed after calling `handle()`, which means the `while` loop will exit and move on to the next task.
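A minimal sketch of the loop behaviour described above (all class and function names are assumptions for illustration, not the actual `ingest_bundles.py` code): a handler that returns the same state the task entered with signals "no progress", so the loop stops handling that task.

```python
from enum import Enum, auto

class EESSITaskState(Enum):
    UNDETERMINED = auto()
    NEW_TASK = auto()
    DONE = auto()

class Task:
    """Toy task whose handler hits the failed-commit path."""
    def __init__(self):
        self.state = EESSITaskState.UNDETERMINED

    def handle(self):
        # the failed-commit path returns UNDETERMINED, i.e. no progress
        return EESSITaskState.UNDETERMINED

def process(task):
    # keep handling while the task makes progress
    while task.state != EESSITaskState.DONE:
        next_state = task.handle()
        if next_state == task.state:
            break  # no progress: move on to the next task
        task.state = next_state

t = Task()
process(t)
# the loop exits without the task ever leaving UNDETERMINED
```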
```python
    # TODO: rollback previous changes (task description file, task state file)
    return EESSITaskState.UNDETERMINED
```
```python
# TODO: verify that the sequence number is still valid (PR corresponding to the sequence number
```
You mean: in theory the PR could have been closed/merged while this code was running? What would actually happen then? I think:
- The commit above may have been made to a closed or merged feature branch (not a big deal).
- The commit does NOT make it into the main branch because of this.
- The next time the code runs, it will consider this a task that hasn't been handled yet and thus call `_handle_add_undetermined` again.

So, doesn't the next run of `ingest_bundles.py` just solve this scenario? I don't think extra handling is really necessary. Unless you want to guarantee that everything gets handled in one iteration. In that case you could consider checking if the commit made it to main, and if not, call `_handle_add_undetermined` again - before returning. In this case, you may want to give it an internal counter capping the recursion depth of these calls, so as to avoid infinite loops, but it'd be rather unlikely for this to happen several times in a row anyway.

Anyway, I think it's a "nice to have"; I'd also be fine with just putting a clear comment in the code describing the current behaviour (i.e. the next call to `ingest_bundles.py` will fix it - if I understand correctly), and the potential optimization that can be done here. But I wouldn't prioritize it now.
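The bounded re-check suggested above could look roughly like this (a hypothetical sketch; `handle_add_undetermined`, `commit_reached_main` and `MAX_RETRIES` are made-up names, not the actual code):

```python
MAX_RETRIES = 3

def handle_add_undetermined(commit_reached_main, attempt=0):
    """Return True once the commit is confirmed on main; give up after MAX_RETRIES
    so a closed/merged PR cannot cause an infinite recursion."""
    if commit_reached_main():
        return True
    if attempt >= MAX_RETRIES:
        # give up; the next ingest_bundles.py run will pick the task up again
        return False
    return handle_add_undetermined(commit_reached_main, attempt + 1)

# usage: the check succeeds on the third attempt
results = iter([False, False, True])
assert handle_add_undetermined(lambda: next(results)) is True
```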
```python
# NOTE, for EESSITaskState.PULL_REQUEST, EESSITaskState.APPROVED must be the first element or
# _next_state() will not work correctly
self.valid_transitions = {
    EESSITaskState.UNDETERMINED: [
```
I'm a bit confused. Here, you essentially say every state is a valid next state for `EESSITaskState.UNDETERMINED`. But when I look at the handler for the undetermined task state, the only potential return values are UNDETERMINED (not part of this list!) or NEW_TASK?
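One way to surface this kind of mismatch automatically would be to validate each handler's return value against the declared transition table. A hedged sketch (the names and the "staying put is always allowed" rule are my assumptions, not the actual code):

```python
from enum import Enum, auto

class EESSITaskState(Enum):
    UNDETERMINED = auto()
    NEW_TASK = auto()
    PAYLOAD_STAGED = auto()

# only NEW_TASK is declared reachable from UNDETERMINED in this sketch
valid_transitions = {
    EESSITaskState.UNDETERMINED: [EESSITaskState.NEW_TASK],
}

def checked_handle(state, handler):
    """Run a handler and reject return values the transition table doesn't allow."""
    next_state = handler()
    # returning the same state (no progress) is always allowed
    if next_state != state and next_state not in valid_transitions.get(state, []):
        raise ValueError(f"invalid transition {state} -> {next_state}")
    return next_state

assert checked_handle(EESSITaskState.UNDETERMINED,
                      lambda: EESSITaskState.NEW_TASK) == EESSITaskState.NEW_TASK
```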
```python
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO", "next_state: '%s'", next_state)

# initialize payload object
self._init_payload_object()
```
Maybe clarify a bit what this does (and maybe update the docstring for `_init_payload_object()` with a similar description). As far as I understand, it sets the paths to the remote and local file and signature locations, and then also downloads the payload. That's relevant to know, because it makes clear that this is a potentially time-consuming step if the payload is big.

In fact, maybe this should be in the description of this handler: in the "new task" state, the handler will download the payload and progress to the 'payload_staged' state.
```python
# initialize payload object
self._init_payload_object()

# update TaskState file content
```
The comment should probably reflect that this updates the TaskState files in both the default and feature branches. That's also relevant for the TODO listed below, since if something gets updated in the default branch, we can't simply rely on the next iteration to fix things for us - the state has permanently changed.
```python
# TODO: could have been merged already --> check if PR corresponding to the feature branch exists
# ASSUME: it has not existed before --> create it
```
I'm fine with keeping this as a TODO, but I would like it if you could add a comment to the code explaining what the implication of your assumption below is. I'd guess the following:
```diff
- # TODO: could have been merged already --> check if PR corresponding to the feature branch exists
- # ASSUME: it has not existed before --> create it
+ # ASSUMPTION: for now, we assume the feature branch never existed, and just create it.
+ # This assumption is false if the feature branch did exist and was merged in between the moment
+ # `_handle_undetermined` checked whether a PR existed and this point in the code.
+ # For now, it means we should be careful not to merge feature branches while tarballs may still be coming in.
+ # TODO: could have been merged already --> check if PR corresponding to the feature branch exists
```
```python
if pull_request.state == "closed":
    log_message(LoggingScope.TASK_OPS, "INFO",
                "PR '%s' is closed, creating issue", pull_request)
    # TODO: create issue
```
```diff
- # TODO: create issue
+ # For now, we return EESSITaskState.PAYLOAD_STAGED, to indicate that the state hasn't progressed.
+ # Since the task is returned in the same state as in which it entered the handler, the main `while` loop
+ # in ingest_bundles.py is quit for this task and no further handling is done (until the next iteration).
+ # TODO: make sure an issue is created, so that we can act on it, e.g. by cleaning up the feature branch
```
```python
else:
    log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
                "found existing PR for branch '%s': '%s'", feature_branch_name, pull_request)
    # TODO: check if PR is open or closed
```
That's not a TODO, right? That's what you are doing?
```python
log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
            "no PR found for branch '%s'", feature_branch_name)

# TODO: add failure handling (capture result and act on it)
```
```diff
- # TODO: add failure handling (capture result and act on it)
+ # The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file.
+ # In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
+ # into the default branch, the default branch reflects the current state (namely: approval) of the PR.
+ # TODO: add failure handling (capture result and act on it)
```
```python
else:
    log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
                "PR '%s' is open, updating task states", pull_request)
    # TODO: add failure handling (capture result and act on it)
```
```diff
- # TODO: add failure handling (capture result and act on it)
+ # The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file.
+ # In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
+ # into the default branch, the default branch reflects the current state (namely: approval) of the PR.
+ # TODO: add failure handling (capture result and act on it)
```
""" | ||
Update task states in default and feature branches | ||
|
||
States have to be updated in a specific order and in particular the default branch has to be |
```diff
- States have to be updated in a specific order and in particular the default branch has to be
+ The default branch should reflect the 'current' state. I.e. there, we write next_state into the state file.
+ In the feature branch, we write approved_state into the state file so that if-and-when the PR is merged
+ into the default branch, the default branch reflects the current state (namely: approval) of the PR.
+ States have to be updated in a specific order and in particular the default branch has to be
```
```python
if len(prs):
    log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
                "1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
return prs[0] if prs else None
```
Shouldn't we be more careful here? I imagine there should only be a single PR for a given branch (can you make more? I don't know), but it's good to check
```diff
- if len(prs):
-     log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
-                 "1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
- return prs[0] if prs else None
+ if len(prs) == 1:
+     log_message(LoggingScope.TASK_OPS_DETAILS, "INFO",
+                 "1st PR found: %d, '%s'", prs[0].number, prs[0].head.ref)
+ elif len(prs) > 1:
+     raise RuntimeError("More than one PR was found for branch name %s" % branch_name)
+ return prs[0] if prs else None
```
```python
    PullRequest object if found, None otherwise
"""
try:
    prs = [pr for pr in list(self.git_repo.get_pulls(state="all"))
```
Should this somehow be limited to PRs that also target a certain branch (the default branch)?
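PyGithub's `get_pulls()` accepts a `base` filter, so the limitation suggested above could look roughly like the sketch below. The `FakeRepo` stub is purely illustrative (it stands in for a PyGithub `Repository` so the example runs without network access); only the filtering idea is the point.

```python
from types import SimpleNamespace

def find_pr_for_branch(git_repo, branch_name):
    """Find the PR for a branch, considering only PRs targeting the default branch."""
    prs = [pr for pr in git_repo.get_pulls(state="all", base=git_repo.default_branch)
           if pr.head.ref == branch_name]
    if len(prs) > 1:
        raise RuntimeError("more than one PR found for branch %s" % branch_name)
    return prs[0] if prs else None

# stub standing in for a PyGithub Repository, for illustration only
class FakeRepo:
    default_branch = "main"
    def get_pulls(self, state="all", base=None):
        all_prs = [SimpleNamespace(number=1, base=SimpleNamespace(ref="main"),
                                   head=SimpleNamespace(ref="feature-1")),
                   SimpleNamespace(number=2, base=SimpleNamespace(ref="other"),
                                   head=SimpleNamespace(ref="feature-1"))]
        return [pr for pr in all_prs if base is None or pr.base.ref == base]

# only the PR targeting the default branch is considered
pr = find_pr_for_branch(FakeRepo(), "feature-1")
assert pr.number == 1
```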
```python
# PR is closed & merged --> deployment is approved
# PR is closed & not merged --> deployment is rejected
feature_branch_name = self._determine_feature_branch_name()
# TODO: check if feature branch exists, for now ASSUME it does
```
```python
# TODO: check if feature branch exists, for now ASSUME it does
```
I think this comment can be removed. `_find_pr_for_branch` returns `None` if the PR doesn't exist, which is handled gracefully by the rest of `_handle_add_pull_request`. So I don't think there is anything 'TODO' here?
```python
# PR is merged, hence it should have been in the APPROVED state
# ==> for now, just return EESSITaskState.PULL_REQUEST
#
# there is the possibility that the PR was updated just before the
```
I'm not 100% sure if I get this right, but I think you mean: the task was in `EESSITaskState.PULL_REQUEST` when the `while` loop was running. Then it entered this handler, and while the handler was processing it, the PR was merged. That means the merge happened after this handler was called, but before `self._find_pr_for_branch(feature_branch_name)` was called. That window is very tight; there's only one function in between. Or is the reality that the window is much bigger, and the merge could have happened in any state between UNDETERMINED and now?

Sure, we could open an issue. But that's not going to allow anyone to act fast enough to make sure that this task is not ingested (or to at least re-check it before it is), because that will happen on the next call to `ingest_bundles.py`.

Should we revert the merge here and immediately re-open the PR and put a comment in it? That way, the reviewer has another chance to look at it. Either that, or we make some mechanism to revert only the changes by this task in the PR and then open a new PR (with a new sequence number) for this task (as if it entirely came in after the merge).

Anyway, I think for now, returning `EESSITaskState.PULL_REQUEST` and leaving it at that is a reasonable option.
```python
# NOPE, we have to do some handling here, at least for the tasks where their
# state file did
# --> check if we could have ended up here? If so, create an issue.
# Do we need a state ISSUE_OPENED to avoid processing the task again?
```
I had to think here: why would the task be processed again? The next time `ingest_bundles.py` runs, `task.determine_state()` will return the last state we wrote to the state file in the default branch, which will be `PULL_REQUEST` if the `_handle_add_payload_staged` handler made it to the `_update_task_states` call that sets the `next_state`. Then, indeed, this will be evaluated every time.

I'm not sure if that's a problem. Technically, even the ingested and rejected tasks are handled every time that `ingest_bundles.py` is run, no? As long as that handling is 'light enough', it's OK I guess. We should just clean up the bucket every so often to make sure this doesn't explode.
```python
# issue is closed
# WE could also defer all handling of this to the handler for the
# REJECTED state
# FOR NOW, we assume that the task was rejected on purpose
```
I think that's a decent assumption. If a PR contained 5 tasks and the reviewer rejected them, and a 6th task comes in, it's very unlikely the reviewer would have approved the PR if the 6th task had been there already. Unless maybe they rejected it because there should have been 6, and the fact that there wasn't was the sole reason for rejection.

In any case: this is much more easily picked up by a reviewer. They can reopen the PR, change the task states, and, as far as I can tell, this code will then happily process it again. If they then merge, they can change the decision after all.

In summary: I agree with updating the state file.
```python
@log_function_entry_exit()
def _perform_task_add(self) -> bool:
    """Perform the ADD task action"""
    # TODO: verify checksum here or before?
```
I think a checksum should be checked as early as possible, i.e. when it tries to download the payload. If the checksum isn't OK, the payload download failed, and one cannot reasonably move on to the next state. A reasonable thing to do would be to try to redownload x times, then just forget about it and try again the next time `ingest_bundles.py` is run.
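The "verify early, retry a few times, then give up until the next run" behaviour could be sketched like this (function and parameter names are hypothetical, not the actual code; SHA-256 is assumed as the checksum algorithm):

```python
import hashlib

def download_with_checksum(fetch, expected_sha256, max_attempts=3):
    """fetch() returns the payload bytes; retry until the checksum matches,
    and give up after max_attempts (the next ingest_bundles.py run retries)."""
    for attempt in range(max_attempts):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    return None  # checksum never matched: treat the download as failed

good = b"payload"
digest = hashlib.sha256(good).hexdigest()
# first attempt delivers corrupt data, the retry succeeds
attempts = iter([b"corrupt", good])
assert download_with_checksum(lambda: next(attempts), digest) == good
assert download_with_checksum(lambda: b"corrupt", digest) is None
```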
I think this PR adds some really nice functionality, and tests by Bob show that it works.
I do have some minor comments, mostly regarding comments in the code in order to clarify what's happening. I think this will be important for the maintainability - even our future selves will probably struggle to figure out what happens where otherwise. I've tried to capture all changes in suggestions, so that they are easy to apply.
Some general comments:
I would expand the docstrings for each of the handlers. They are now very compact. Explain in words what the handler will do. I could probably write down what is happening overall. My view of this (but you may know better) is:

`_handle_state_undetermined`:

- Called when there is no task state file yet in the default branch for this task
- Figures out the sequence number to which this task should be added. This mechanism allows for bundling multiple tarballs in a single PR (until that one is closed; then a new sequence will be started).
- Commits initial `TaskState` and `TaskDescription` files to the main branch
- If successful, returns the next state (`EESSITaskState.NEW_TASK`)

`_handle_add_new_task`:

- Uses payload information from the TaskDescription to initialize an `EESSIDataAndSignatureObject`, a class representing the data file and its signature both in remote and local storage.
- Uses the `EESSIDataAndSignatureObject` to initialize an EESSITaskPayload. This downloads the actual tarball and verifies the signature.
- Updates the state file in the default branch to reflect the progression to the next stage (payload_staged)
- If successful, returns the next state (EESSITaskState.PAYLOAD_STAGED)

`_handle_add_payload_staged`:

- Creates a feature branch (if it doesn't exist) for this task.
- Updates the task state in the default branch (to reflect creation of a PR in this stage)
- Merges the default branch into the feature branch, to avoid conflicts
- Updates the task state in the feature branch (to reflect the state that this task will have after it has been merged - typically 'approved').
- Commits those changes
- Creates a PR (if one doesn't exist already) or updates an existing PR to add information on the current task (e.g. tarball name, metadata) to the PR description
- If successful, returns the next state (EESSITaskState.PULL_REQUEST)

`_handle_add_pull_request`:

- Gets the feature branch name for the current task
- Tries to get the PR for the current feature branch
- If not successful, returns EESSITaskState.PULL_REQUEST, so that all further handling is stopped (for now).
- If successful, checks the PR status:
  - If closed & merged: stay in the current state (return EESSITaskState.PULL_REQUEST). This should not typically happen, unless someone merged the PR while this task was being added to it. This may mean a tarball is marked as approved unintentionally. TODO: implement code to deal with this.
  - If closed & unmerged: update the state file in the default branch to reflect that this task is rejected (and return EESSITaskState.REJECTED).
  - If the PR is open: return EESSITaskState.PULL_REQUEST. This means further handling of this task is stopped - it is not needed until someone decides to merge or reject the PR.

`_handle_add_approved`:

- Runs the `ingestion_script` configured in the config (with `sudo`, if "ingest_as_root" is set in the config)
- If successful, updates the task state file in the default branch to the next state (EESSITaskState.INGESTED)
- If successful, returns EESSITaskState.INGESTED

`_handle_add_ingested`:

- Essentially a no-op that just logs that this task was already ingested
- Returns EESSITaskState.DONE, which will stop further handling by ingest_bundles.py

`_handle_add_rejected`:

- Essentially a no-op that just logs that this task was already rejected
- Returns EESSITaskState.DONE, which will stop further handling by ingest_bundles.py
I think this will help us a lot to know what happens where, and thus also where to look if something does not go according to plan. E.g. Suppose we end up with a feature branch but no PR, having a clear docstring description of which handler was responsible for creating the PR from the feature branch immediately tells us where to look. It also means it'll be much easier to insert extra states, or merge states, etc, if we ever feel that is useful.
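For quick reference, the happy-path progression walked through above can be condensed into one list. The state names come from this review; the pairing of handlers to transitions is my reading of the code, not verified against it:

```python
# happy path of a task through the state machine, as described in this review
HAPPY_PATH = [
    "UNDETERMINED",    # _handle_state_undetermined: commit initial TaskState/TaskDescription
    "NEW_TASK",        # _handle_add_new_task: download payload, verify signature
    "PAYLOAD_STAGED",  # _handle_add_payload_staged: feature branch, state updates, PR
    "PULL_REQUEST",    # _handle_add_pull_request: wait for the PR to be merged or closed
    "APPROVED",        # _handle_add_approved: run the configured ingestion_script
    "INGESTED",        # _handle_add_ingested: log and stop further handling
    "DONE",
]
assert len(HAPPY_PATH) == 7
```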
Finally, I think some user-facing documentation (for EESSI maintainers), like we have for the bot (https://www.eessi.io/docs/bot/), is also useful. It should definitely include a description of the files to expect in the default branch and the feature branch, and what they contain / what the meaning of their contents is. Possibly also with instructions on how to recover from a buggy state. The prime example for now being: a PR is merged after a tarball (i.e. new task) has been added, but before it reached the `_handle_add_pull_request` stage. We might want users to know how to intervene in this case (e.g. revert the merge, reopen the PR, set the state for each task to XYZ on the feature branch and ABC on the main branch). Similarly, the code allows for scenarios where e.g. the state is `PULL_REQUEST`, but no pull request was found. To recover from this, I guess we should just overwrite the state with a commit to go back one state (though I'd have to double check that this doesn't have unintended side effects).

I consider the docstrings to be something to fix before merging this PR (as well as my suggestions in the various review comments, of course). The user-facing documentation can IMHO be tackled afterwards: the feature in this PR is valuable enough that we want it now, even if that means you, @bedroge and I may need to step in and help out if buggy states arise (because others don't know what they mean / what to do about them).
Tested this in production via EESSI/software-layer-scripts#59, which opened staging PR https://github.com/EESSI/staging_bundles/pull/7. Note: I ran the ingestion/staging script myself on the S0 to be able to time it, which yielded the following:

Staging

Ingestion

Everything was successful; the only change needed for this to work was a small adjustment to the tarball naming, done in EESSI/software-layer-scripts#63. Other than that, it was just a matter of setting things up by manually removing the old tarballs generated in EESSI/software-layer-scripts#59 after the naming change, and also to ensure we started with a clean slate wrt timing. One small piece of feedback I have from reviewing the PR: the handling of long GitHub comments works well by not printing the tarball contents at all once a certain character limit is reached. It would be nicer if just the names of the tarballs, without the contents, could be added. Otherwise it's harder to tell that everything has been staged properly. But this is definitely a minor comment that shouldn't keep this PR from being merged.
```python
self.task_object.download(mode=DownloadMode.CHECK_REMOTE)

# verify signature and set initial state
self.signature_verified = self.task_object.verify_signature()
```
@trz42 this will always verify the signatures regardless of the state (i.e. even for already ingested tarballs), and that's currently causing delays and a lot of I/O on the Stratum 0. See #238.

I think we should only do this for certain states, e.g. only for new tarballs and tarballs that are about to be ingested (and maybe also for ones for which it's going to open a PR)?

Since we're actively using this, let's merge this as is, and divert fixes/improvements to additional follow-up PRs that are easier to review. I think @bedroge is best suited for this...
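The state-dependent verification proposed above could be sketched as follows (state names follow this review; the set of states that warrant verification and the function name are assumptions, not the actual fix):

```python
from enum import Enum, auto

class EESSITaskState(Enum):
    NEW_TASK = auto()
    PAYLOAD_STAGED = auto()
    APPROVED = auto()
    INGESTED = auto()
    REJECTED = auto()

# only verify for new tarballs, ones about to get a PR, and ones about to be ingested
STATES_REQUIRING_VERIFICATION = {
    EESSITaskState.NEW_TASK,
    EESSITaskState.PAYLOAD_STAGED,
    EESSITaskState.APPROVED,
}

def maybe_verify(state, verify_signature):
    """Skip the expensive signature check for already ingested/rejected tarballs."""
    if state in STATES_REQUIRING_VERIFICATION:
        return verify_signature()
    return None  # nothing to verify in this state

assert maybe_verify(EESSITaskState.INGESTED, lambda: True) is None
assert maybe_verify(EESSITaskState.NEW_TASK, lambda: True) is True
```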
I've created issues for most/all of the things mentioned in this PR and on Slack: https://github.com/EESSI/filesystem-layer/issues?q=state%3Aopen%20label%3A%22Automated%20ingestion%22. Let's merge this and address the issues in follow-up PRs.
Issues have been created, and these will be addressed in follow-up PRs.
This PR is based on the proof-of-concept #213. It aims to keep only the necessary code from the proof-of-concept, move the main loop into a separate script, and thus leave the existing ingestion code unchanged.
Summary of the ideas/changes/additions:
States, repository directory structure ... in a picture
High-level overview of state handler functions
- `_handle_add_undetermined`
- `_handle_add_new_task` (creates the `TaskState` file)
- `_handle_add_payload_staged`
  - sets state `PAYLOAD_STAGED` in default and feature branch (after it was created)
  - sets states (default branch: `PULL_REQUEST`, feature branch: `APPROVED`) and creates a pull request
  - or sets states (default branch: `PULL_REQUEST`, feature branch: `APPROVED`) and updates the pull request
  - Creating/updating a pull request will create and update a `TaskSummary.html` file and create/update the description of the pull request.
- `_handle_add_pull_request` (may end in `REJECTED`)