Move warm-up from session to runner #4262

ElenaKhaustova · 2024-10-28T17:29:13Z

Description

Relates to #3935 - step 5 in the proposed solution

Development notes

Moved warm-up from the session to AbstractRunner, where it now runs before _run() is called. The logic is therefore shared by all runners, and all dataset patterns are resolved before the pipeline run starts.
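To illustrate the idea, here is a minimal, self-contained sketch of a warm-up step that runs in the base runner before the threaded part starts. It is not Kedro's actual code: `ToyCatalog`, `ToyRunner`, `get_dataset`, the suffix-based "pattern" and the `registrations` counter are all hypothetical stand-ins, chosen only to show why resolving every pattern up front means worker threads never need to register a dataset themselves.

```python
from concurrent.futures import ThreadPoolExecutor


class DatasetAlreadyRegistered(Exception):
    """Stand-in for the 'has already been registered' error (hypothetical)."""


class ToyCatalog:
    """Hypothetical catalog that materializes datasets from a pattern on demand."""

    def __init__(self, pattern_suffix):
        self._pattern_suffix = pattern_suffix  # toy "pattern": a name suffix
        self._datasets = {}
        self.registrations = 0  # how many times a dataset was registered

    def get_dataset(self, name):
        # Resolve the pattern and register the dataset on first access.
        if name not in self._datasets:
            if not name.endswith(self._pattern_suffix):
                raise KeyError(name)
            self._register(name, f"<dataset for {name}>")
        return self._datasets[name]

    def _register(self, name, dataset):
        if name in self._datasets:
            raise DatasetAlreadyRegistered(
                f"Dataset '{name}' has already been registered"
            )
        self._datasets[name] = dataset
        self.registrations += 1


class ToyRunner:
    """Hypothetical base runner: warm-up happens in run(), before _run()."""

    def __init__(self, max_workers=4):
        self._max_workers = max_workers

    def run(self, dataset_names, catalog):
        # Warm-up: materialize every dataset in the calling thread.
        for name in dataset_names:
            catalog.get_dataset(name)
        return self._run(dataset_names, catalog)

    def _run(self, dataset_names, catalog):
        # Worker threads only read datasets that are already registered.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(catalog.get_dataset, dataset_names))


catalog = ToyCatalog("_csv")
names = ["a_csv", "b_csv", "a_csv", "b_csv"]
result = ToyRunner().run(names, catalog)
```

In this sketch each distinct name is registered exactly once, during warm-up, so the threads in `_run()` never reach `_register()`.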

We added a unit test with dataset patterns to check that ThreadRunner no longer fails with a Dataset 'name' has already been registered error. The test checks that the dataset was registered during warm-up and that the run reaches the point of loading it, although no actual loading is performed.

We tried to write a full test with actual data loading (208a24b), but ran into a problem where the first thread tried to load the data before the file had been created: https://github.com/kedro-org/kedro/actions/runs/11562442829/job/32183629244. This happened mostly in CI, only on the latest Python versions, and was hard to reproduce locally. We tried checking that the file exists before calling run(), but it didn't help. So we changed the test to its current form, which excludes data creation.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>

noklam · 2024-10-29T12:54:09Z

kedro/runner/runner.py

@@ -112,6 +112,10 @@ def run(
            self._logger.info(
                "Asynchronous mode is enabled for loading and saving data"
            )
+


Can we add a comment about this? I also notice this line is duplicated:
registered_ds = [ds for ds in pipeline.datasets() if ds in catalog]

Maybe we can refactor this into one loop so we only loop pipeline.datasets() once?

Kept one loop and added a comment about warm-up

The line is duplicated on purpose; there's a comment explaining the difference, but I also renamed the variable to avoid confusion.
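A rough sketch of the two checks discussed in this thread, under stated assumptions: `ToyCatalog`, `materialize`, `with_extra_patterns` and `warm_up` are hypothetical names, and glob matching via `fnmatch` stands in for real dataset patterns. The single loop both records which pipeline datasets the catalog already knows and materializes them; the comprehension is then repeated against a catalog that also includes runtime patterns, which is why the two results can differ.

```python
import fnmatch


class ToyCatalog:
    """Hypothetical catalog; glob patterns stand in for dataset patterns."""

    def __init__(self, patterns):
        self._patterns = list(patterns)
        self._datasets = {}

    def __contains__(self, name):
        # A name is "in" the catalog if it is registered or matches a pattern.
        return name in self._datasets or any(
            fnmatch.fnmatch(name, pattern) for pattern in self._patterns
        )

    def materialize(self, name):
        # Register the dataset if it is not registered yet.
        return self._datasets.setdefault(name, f"<dataset for {name}>")

    def with_extra_patterns(self, extra_patterns):
        # Copy of the catalog that also knows the runtime patterns.
        copy = ToyCatalog(self._patterns + list(extra_patterns))
        copy._datasets = dict(self._datasets)
        return copy


def warm_up(pipeline_datasets, catalog):
    """One pass: collect known datasets and materialize each of them."""
    warmed_up = []
    for name in pipeline_datasets:
        if name in catalog:
            warmed_up.append(name)
            catalog.materialize(name)
    return warmed_up


pipeline_datasets = ["cars.csv", "memory_output"]
catalog = ToyCatalog(patterns=["*.csv"])

# First check: runtime patterns are not included yet.
warmed_up = warm_up(pipeline_datasets, catalog)

# Second check: same comprehension, but runtime patterns are included.
runtime_catalog = catalog.with_extra_patterns(["*"])
registered = [name for name in pipeline_datasets if name in runtime_catalog]
```

Here `warmed_up` contains only `cars.csv`, while `registered` contains both names, because the catch-all runtime pattern matches `memory_output` as well.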


merelcht

Thanks for the detailed description on the PR! This all makes sense to me 👍
Can you add a comment in the release notes for future reference?

idanov · 2024-11-05T17:23:27Z

kedro/runner/runner.py

@@ -112,6 +112,10 @@ def run(
            self._logger.info(
                "Asynchronous mode is enabled for loading and saving data"
            )
+


registered_ds_no_runtime_patterns is a bit of a mouthful, could we come up with a shorter name? E.g. warm_concrete_ds (not necessarily the best option, but just an example)?

Tbh, I don't mind the two loops - they show the purpose better without too much of a slowdown, but the current refactor is also ok.

ElenaKhaustova added 13 commits October 28, 2024 16:04

Move warm-up to runner

7c794f4

Signed-off-by: Elena Khaustova <[email protected]>

Implemented test for running thread runner with patterns

208a24b

Signed-off-by: Elena Khaustova <[email protected]>

Added test for new catalog

7c6729a

Signed-off-by: Elena Khaustova <[email protected]>

Add line separator to file

9d5b37d

Signed-off-by: Elena Khaustova <[email protected]>

Replaced writing csv manually to writing with pandas

c3229c0

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into fix/4250-move-warm-up-to-runner

6c509d9

Fixed fixture

bd878c9

Signed-off-by: Elena Khaustova <[email protected]>

Removed new catalog from test

68010aa

Signed-off-by: Elena Khaustova <[email protected]>

Made catalog type a parameter

29d373f

Signed-off-by: Elena Khaustova <[email protected]>

Removed old catalog from test

e90cfd7

Signed-off-by: Elena Khaustova <[email protected]>

Removed new catalog from test

3f1dbe0

Signed-off-by: Elena Khaustova <[email protected]>

Removed data creation/loading

892cda4

Signed-off-by: Elena Khaustova <[email protected]>

Fixed test docstring

e7f2632

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova marked this pull request as ready for review October 29, 2024 12:28

ElenaKhaustova requested a review from merelcht as a code owner October 29, 2024 12:28

ElenaKhaustova requested review from noklam, lrcouto, idanov and DimedS October 29, 2024 12:28

noklam reviewed Oct 29, 2024

ElenaKhaustova added 2 commits October 29, 2024 13:45

Removed extra loop

429ca13

Signed-off-by: Elena Khaustova <[email protected]>

Renamed variable for clarifty

3ffd538

Signed-off-by: Elena Khaustova <[email protected]>

noklam self-requested a review October 29, 2024 13:53

noklam approved these changes Oct 29, 2024

ElenaKhaustova added 3 commits October 29, 2024 13:56

Merge branch 'main' into fix/4250-move-warm-up-to-runner

681d3f1

Moved warm-up to the top

01f9b62

Signed-off-by: Elena Khaustova <[email protected]>

Moved warm-up to the top

069dff4

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova mentioned this pull request Nov 1, 2024

[DataCatalog]: Lazy dataset loading #4270

Merged

7 tasks

Merge branch 'main' into fix/4250-move-warm-up-to-runner

9d0f579

merelcht approved these changes Nov 1, 2024

Updated release notes

5f6ef85

Signed-off-by: Elena Khaustova <[email protected]>

idanov approved these changes Nov 5, 2024

ElenaKhaustova added 2 commits November 5, 2024 22:30

Merge branch 'main' into fix/4250-move-warm-up-to-runner

b0f9b0f

Remaned variable

e481c72

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova enabled auto-merge (squash) November 5, 2024 22:36

ElenaKhaustova merged commit 84b71b1 into main Nov 5, 2024
34 checks passed

ElenaKhaustova deleted the fix/4250-move-warm-up-to-runner branch November 5, 2024 22:51

ElenaKhaustova mentioned this pull request Nov 7, 2024

ThreadRunner Dataset DatasetAlreadyExistsError: Dataset has already been registered #4250

Closed

merelcht mentioned this pull request Nov 7, 2024

[BUG] DataCatalog problem with ThreadRunner on kedro >=0.19.7 #4191

Open
