Move warm-up from session to runner #4262

Merged
ElenaKhaustova merged 22 commits into main from fix/4250-move-warm-up-to-runner on Nov 5, 2024

Conversation

Contributor

@ElenaKhaustova ElenaKhaustova commented Oct 28, 2024

Description

Solves #4250

Relates to #3935 - step 5 in the proposed solution

Development notes

Moved warm-up from the session to AbstractRunner, where it now runs before _run() is called. The logic is therefore shared by all runners, and all dataset patterns are resolved before the pipeline run starts.
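For readers unfamiliar with the idea, here is a minimal sketch of warm-up in a base runner. It is not Kedro's actual code: ToyCatalog, ToyAbstractRunner, ToyThreadRunner, and the glob-style patterns are hypothetical stand-ins, used only to illustrate why resolving patterns before _run() keeps worker threads from racing to register the same dataset.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatch


class ToyCatalog:
    """Hypothetical stand-in for a catalog that supports dataset patterns."""

    def __init__(self, patterns):
        self._patterns = dict(patterns)  # glob pattern -> dataset factory
        self._datasets = {}              # concrete, already registered datasets

    def __contains__(self, name):
        return name in self._datasets or any(fnmatch(name, p) for p in self._patterns)

    def get(self, name):
        # Resolve a pattern into a concrete dataset on first access.
        # Callers are expected to check `name in catalog` first.
        if name not in self._datasets:
            factory = next(f for p, f in self._patterns.items() if fnmatch(name, p))
            self._register(name, factory(name))
        return self._datasets[name]

    def _register(self, name, dataset):
        if name in self._datasets:
            raise ValueError(f"Dataset '{name}' has already been registered")
        self._datasets[name] = dataset


class ToyAbstractRunner(ABC):
    def run(self, dataset_names, catalog):
        # Warm-up: resolve patterns once, in the calling thread, before _run().
        for name in dataset_names:
            if name in catalog:
                catalog.get(name)
        return self._run(dataset_names, catalog)

    @abstractmethod
    def _run(self, dataset_names, catalog):
        ...


class ToyThreadRunner(ToyAbstractRunner):
    def _run(self, dataset_names, catalog):
        # Worker threads only read datasets that warm-up already registered.
        known = [n for n in dataset_names if n in catalog]
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(catalog.get, known))
```

Because the warm-up lives in the base class, any runner that subclasses it gets the same behaviour without repeating the loop.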

We added a unit test with dataset patterns to check that ThreadRunner no longer fails with a Dataset 'name' has already been registered error. The test checks that the dataset was registered during warm-up and that the run reaches the loading step, though no data is actually loaded.

We tried to write a full test with actual data loading (208a24b), but the first thread tried to load the data before the file had been created: https://github.com/kedro-org/kedro/actions/runs/11562442829/job/32183629244. This happened mostly in CI, only on the latest Python versions, and was hard to reproduce locally. We tried checking that the file exists before calling run(), but it didn't help, so we changed the test to the current one, which does not create any data.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review October 29, 2024 12:28
@@ -112,6 +112,10 @@ def run(
self._logger.info(
"Asynchronous mode is enabled for loading and saving data"
)

Contributor

Can we add a comment about this? I also notice this line is duplicated:
registered_ds = [ds for ds in pipeline.datasets() if ds in catalog]

Maybe we can refactor this into one loop so we only loop pipeline.datasets() once?

Contributor Author

  1. Kept one loop and added a comment about warm-up
  2. The line is duplicated on purpose; there's a comment explaining the difference, but I also renamed the variable to avoid confusion

Member

registered_ds_no_runtime_patterns is a bit of a mouthful, could we come up with a shorter name? E.g. warm_concrete_ds (not necessarily the best option, but just an example)?

Tbh, I don't mind the two loops - they show the purpose better without too much of a slowdown, but the current refactor is also ok.

Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
@noklam noklam self-requested a review October 29, 2024 13:53
Member

@merelcht merelcht left a comment

Thanks for the detailed description on the PR! This all makes sense to me 👍
Can you add a comment in the release notes for future reference?

Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova enabled auto-merge (squash) November 5, 2024 22:36
@ElenaKhaustova ElenaKhaustova merged commit 84b71b1 into main Nov 5, 2024
34 checks passed
@ElenaKhaustova ElenaKhaustova deleted the fix/4250-move-warm-up-to-runner branch November 5, 2024 22:51