- 
                Notifications
    You must be signed in to change notification settings 
- Fork 738
Add staging events to the TraceObserverV2 #6443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| ✅ Deploy Preview for nextflow-docs-staging canceled.
 | 
788f51a    to
    a280d0e      
    Compare
  
    Signed-off-by: Robrecht Cannoodt <[email protected]>
a280d0e    to
    b6e17d2      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems going into the right direction. My only connect is about the notification events, here
Likely those events should be emitter *async* in this context for two reasons:
- Make sure the target path is accessible when the event is received
- Avoid that core runtime performance could be impacted by compute/data intensive that could be implemented by third-parties plugins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still prefer to enable this through the task events rather than adding a new event. I just haven't had time to prepare a new PR. I will try to have something ready this week or next week. I would at least like to try that before we add a whole new event
| Trying my idea again #6460 | 
| 
 Thanks for the review, and I totally agree with you. However, this PR mimics the event handling system that is in place for other events (e.g.  Edit: I created #6471 
 Alright, thank you for taking the time to take a look at this! Despite this new PR, would you agree that it might still make sense to merge this PR as well, since  | 
| I'm still considering the merits of having an event for file staging. I don't really see it as a mirror of file publishing. File staging is just an implementation detail, whereas file publishing is "essential" to understand the outputs of a run and their provenance. | 
| Thanks for taking this PR into consideration! From my perspective, the file staging is also essential to understand the inputs of a run and their provenance 😅 This PR is the only thing that is currently blocking me from implementing decent data provenance for Nextflow workflows with LaminDB using nf-lamin. For now, all of the Nextflow runs result in Lamin Runs being created with only outputs and no inputs, which goes a bit against the whole LaminDB philosophy. Based on the proposed implementation, would you expect any significant slowdowns or breaking changes with existing functionality? | 
| 
 Of course, but file staging is tied to the task lifecycle in a way that file publishing is not. Output files can be published separately from the originating task. But remote input files are only staged in the context of a task getting ready to run, so there isn't as much of a need for a separate "file staged" event. 
 I suppose there is a small cost to adding new workflow events, but it should be pretty minimal. I'm more concerned about adding an event for something that is more of an implementation detail For example, you wouldn't really be able to use the "file staged" event on its own. You would always need to use it in concert with the task events, because not all input files need to be staged. That suggests that it shouldn't be a standalone event at all | 
This PR implements file staging event notifications for
TraceObserverV2, allowing plugins to track when remote files are staged to the local work directory. This resolves #5905 by providing observers access to both the original remote file paths and their staged local paths.Background
Issue #5905 identified a need for provenance tracking plugins (specifically nf-lamin) to access the original remote paths of input files, not just their staged local paths. Two previous attempts were made to solve this:
onFileStagetoTraceObserver- rejected in favor of Include remote path in FileHolder #5911FileHolderto include remote paths - merged but then reverted due to concerns about cache invalidation and risk to core functionalityI met up with @bentsherman at the beginning of September. He said he wanted to have another shot at resolving this issue, and that I could reach out in a few weeks time if no progress was made.
Solution
This implementation is a simpler solution in comparison to the two previous PRs. I added this to the TraceObserverV2:
onFileStaged(FileStagingEvent event)- Called after a file has been successfully staged from a remote locationThe
FileStagingEventprovides:source: The original remote path (e.g.,s3://bucket/file.txt,https://example.com/data.csv)target: The staged local path in the work directory (e.g.,work/.../file.txt)Usage Example
For plugins that need to track file provenance (like nf-lamin):
Related Issues