Move warm-up from session to runner #4262
Signed-off-by: Elena Khaustova <[email protected]>
@@ -112,6 +112,10 @@ def run(
    self._logger.info(
        "Asynchronous mode is enabled for loading and saving data"
    )
Can we add a comment about this? I also notice this line is duplicated:
registered_ds = [ds for ds in pipeline.datasets() if ds in catalog]
Maybe we can refactor this into one loop so we only loop over pipeline.datasets() once?
- Kept one loop and added a comment about warm-up.
- The line is duplicated on purpose. There's a comment explaining the difference, but I also renamed the variable to avoid confusion.
registered_ds_no_runtime_patterns is a bit of a mouthful; could we come up with a shorter name, e.g. warm_concrete_ds (not necessarily the best option, just an example)?
To be honest, I don't mind the two loops: they show the purpose better without much of a slowdown, but the current refactor is also OK.
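For readers following this thread, here is a minimal sketch of the single-loop warm-up idea. It is not Kedro's implementation: ToyCatalog, warm_up, the "*_csv" glob pattern and the dataset names are all hypothetical stand-ins, and glob matching via fnmatch is an assumption made only to keep the example self-contained. The sketch shows one pass over the pipeline's dataset names that both records which names the catalog can satisfy and materialises them before any worker thread touches the catalog.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatch


class ToyCatalog:
    """Hypothetical stand-in for a data catalog (not Kedro's DataCatalog)."""

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly registered datasets
        self._patterns = list(patterns)  # glob-style patterns, e.g. "*_csv"

    def __contains__(self, name):
        # A name is "in" the catalog if it is registered or matches a pattern.
        return name in self._datasets or any(
            fnmatch(name, pattern) for pattern in self._patterns
        )

    def add(self, name, dataset):
        # Registering the same name twice is an error, mirroring the kind of
        # failure described in the PR.
        if name in self._datasets:
            raise ValueError(f"Dataset '{name}' has already been registered")
        self._datasets[name] = dataset

    def get(self, name):
        # Pattern-matched datasets are registered lazily on first access.
        if name not in self._datasets:
            if name not in self:
                raise KeyError(name)
            self.add(name, f"<dataset for {name}>")
        return self._datasets[name]


def warm_up(catalog, dataset_names):
    """Single loop: collect satisfiable names and materialise them up front."""
    warmed_up = []
    for name in dataset_names:
        if name in catalog:
            warmed_up.append(name)
            catalog.get(name)  # resolve the pattern now, in the calling thread
    return warmed_up


pipeline_datasets = ["companies_csv", "reviews_csv", "model_input", "scratch"]
catalog = ToyCatalog({"model_input": "<memory dataset>"}, ["*_csv"])

warmed = warm_up(catalog, pipeline_datasets)

# After warm-up, concurrent access only reads already-registered entries.
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(catalog.get, warmed * 3))
```

In this toy version, "scratch" is skipped because it is neither registered nor matched by a pattern. The two-loop variant mentioned above would simply separate the membership check from the materialisation step.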
Thanks for the detailed description on the PR! This all makes sense to me 👍
Can you add a note to the release notes for future reference?
Description
Solves #4250
Relates to #3935 - step 5 in the proposed solution
Development notes
Moved warm-up from the session to the AbstractRunner, before we call _run(), and made this logic common for all runners. This simplifies the logic, and all the patterns are now resolved before the pipeline run.

We added a unit test with dataset patterns to check that ThreadRunner does not fail with a Dataset 'name' has already been registered error. In the test, we check that the dataset was registered at warm-up and that we successfully proceed to loading it, though we do not do the actual loading.

We tried to write a full test with actual data loading (208a24b), but we hit a problem where the first thread tried to load the data before it was created: https://github.com/kedro-org/kedro/actions/runs/11562442829/job/32183629244. This happened mostly in CI, only for the latest Python versions, and was hard to reproduce locally. We tried adding checks that the file exists before calling run(), but that didn't help, so we changed the test to the current one, which excludes data creation.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file