
[DataCatalog2.0]: KedroDataCatalog #4151

Open · wants to merge 171 commits into base: main
Conversation

@ElenaKhaustova (Contributor) commented Sep 9, 2024

Description

In this PR we add a new catalog, KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses:

Please see the suggested order of work in #3995 (comment) and comment below: #4151 (comment)

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit-tests for KedroDataCatalog, they'll be added after the initial feedback.

Development notes

  • We kept some of the old DataCatalog API to avoid multiple if/else branches depending on the catalog type used in context, runner and session
  • Removed _FrozenDatasets; datasets are now accessed as properties
  • Added a get-dataset-by-name feature: a dedicated method and access by key
  • Added a feature to iterate over the datasets
  • add_feed_dict() was simplified and renamed to add_raw_data()
  • Dataset initialisation was moved out of the from_config() method
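The features listed in these notes can be sketched on a minimal stand-in class. This is illustrative only: the method names (get_dataset, add_raw_data, access by key) come from the notes above, but the class and its bodies are hypothetical, not Kedro's implementation.

```python
# Hypothetical stand-in for the API described above -- not Kedro's code.
class MiniKedroDataCatalog:
    def __init__(self, datasets=None):
        self._datasets = dict(datasets or {})

    def get_dataset(self, name):
        """Dedicated get-dataset-by-name method."""
        return self._datasets[name]

    def __getitem__(self, name):
        """Access a dataset by key: catalog["my_dataset"]."""
        return self._datasets[name]

    def __iter__(self):
        """Iterate over dataset names."""
        yield from self._datasets

    def add_raw_data(self, data):
        """Register raw objects directly (replaces add_feed_dict)."""
        self._datasets.update(data)


catalog = MiniKedroDataCatalog({"companies": "csv dataset"})
catalog.add_raw_data({"params": {"lr": 0.01}})
print(sorted(catalog))       # ['companies', 'params']
print(catalog["companies"])  # csv dataset
```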

To test KedroDataCatalog, modify your settings.py and run commands as usual:

```python
# settings.py
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

```shell
kedro run
kedro catalog list/rank/resolve/create
```

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>
@noklam (Contributor) left a comment:

> Based on the research findings, we decided to get rid of the _FrozenDatasets API. It was confusing for users, and they almost never used it, preferring _get_dataset() instead. For now, I made __getitem__ return a deep copy for consistency with the datasets property. However, object assignment is not supported anyway, so users cannot override a specific dataset.

I guess it will be hard to prevent users from assigning something that they are not going to access later. We had similar issues previously, where a user tried to assign something to the catalog but found it didn't change anything, because a Kedro run always creates its own catalog.

For the deep copy, would it work for datasets that are not picklable, or datasets that come with a connection?

I left a comment in the other PR; I still find CatalogProtocol a bit weird, but I am not sure what convention the Python community has adopted. Something like CatalogLike?

@ElenaKhaustova (Contributor, Author) replied:
> I guess it will be hard to prevent users from assigning something that they are not going to access later. We had similar issues previously, where a user tried to assign something to the catalog but found it didn't change anything, because a Kedro run always creates its own catalog.
>
> For the deep copy, would it work for datasets that are not picklable, or datasets that come with a connection?

These two points totally make sense to me. However, _FrozenDatasets didn't solve either of those problems, as people were using _get_dataset(), which returns a reference to the dataset. Based on your second point, I went back to returning a reference as well, so we do not require picklable datasets, though as far as I understand, ParallelRunner does.
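The deep-copy concern above can be demonstrated with a toy dataset holding an unpicklable handle; a threading.Lock stands in for a live database connection, since copy.deepcopy falls back on pickle-style reduction and fails on both. The class name is hypothetical, not a Kedro dataset.

```python
import copy
import threading

class ToyConnectionDataset:
    """Hypothetical dataset holding a live connection-like object."""
    def __init__(self):
        # A lock stands in for a DB connection; neither can be pickled.
        self._connection = threading.Lock()

ds = ToyConnectionDataset()
try:
    copy.deepcopy(ds)  # deepcopy uses pickle-style reduction for unknown objects
    print("copied")
except TypeError as err:
    print(f"deepcopy failed: {err}")
```

Returning a reference instead of a deep copy sidesteps this, at the cost of allowing callers to mutate the shared dataset object.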

@ElenaKhaustova (Contributor, Author) replied:
> I left a comment in the other PR; I still find CatalogProtocol a bit weird, but I am not sure what convention the Python community has adopted. Something like CatalogLike?

As for the naming convention, I see different examples: https://typing.readthedocs.io/en/latest/spec/protocol.html
But looking at the existing ones, they are named like SupportsDataCatalog: https://docs.python.org/3/library/typing.html#protocols

So maybe the latter is the better naming.
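For illustration, a structural protocol under the Supports* naming might look like the sketch below. The method set (load/save) is hypothetical and only hints at what a catalog protocol could require; any class with matching methods satisfies it without inheriting from it.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SupportsDataCatalog(Protocol):
    """Structural type: anything with load/save matches, no inheritance needed."""
    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...

class InMemoryCatalog:
    """Satisfies the protocol purely by shape."""
    def __init__(self) -> None:
        self._data: dict = {}
    def load(self, name: str) -> Any:
        return self._data[name]
    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

# runtime_checkable lets isinstance() verify the method names exist
print(isinstance(InMemoryCatalog(), SupportsDataCatalog))  # True
```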

@ElenaKhaustova (Contributor, Author) commented:

Update for the reviewers

Since there is a proposal for a new interface (#4175), we removed some methods (__getitem__, __setitem__, __iter__) from the current implementation, since they will most probably change during the implementation of the proposal.

If the proposal doesn't go through, they'll be added in a separate PR, so they do not block the current one.

@merelcht (Member) left a comment:
I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.

from kedro.utils import _format_rich, _has_rich_handler


class KedroDataCatalog:
A Member commented:
I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost to us. Other implementers do not need to extend it, though.

@@ -134,6 +134,7 @@
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
A Member commented:
I'd put it directly into the data_catalog module.

@ElenaKhaustova (Contributor, Author) replied:

Do you mean moving KedroDataCatalog class into data_catalog instead of creating a new file?

DatasetNotFoundError: When a dataset with the given name
is not in the collection and do not match patterns.
"""
ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
A Member commented:
Shouldn't we resolve only if ds_name is not in self._datasets? Or is it just to make the code a bit simpler?

@ElenaKhaustova (Contributor, Author) replied:
Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.


dataset = self._datasets.get(ds_name, None)

if dataset is None:
A Member commented:
Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:

  • resolve the dataset pattern
  • if not part of the materialised datasets, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
  • otherwise continue with non-error scenario

I think we can make the flow a bit less zig-zaggy.

@ElenaKhaustova (Contributor, Author) replied:
Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if, as the logic needs to go inside the if-fail/else scenario.

Now the logic is like this:

  • if not part of the materialised datasets, resolve the dataset pattern
  • if resolved, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
  • otherwise, continue with a non-error scenario
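The reworked flow can be sketched with a stand-in class. The names (_datasets, resolve_dataset_pattern, DatasetNotFoundError) follow the snippets quoted in this thread, but the bodies are illustrative, not the PR's actual code.

```python
class DatasetNotFoundError(KeyError):
    """Raised when a dataset is neither materialised nor pattern-resolvable."""

class MiniCatalog:
    def __init__(self, datasets, pattern_configs):
        self._datasets = dict(datasets)          # materialised datasets
        self._pattern_configs = pattern_configs  # name -> resolved config

    def resolve_dataset_pattern(self, ds_name):
        return self._pattern_configs.get(ds_name, {})

    def get_dataset(self, ds_name):
        if ds_name not in self._datasets:              # only resolve on a miss
            ds_config = self.resolve_dataset_pattern(ds_name)
            if ds_config:                              # if resolved, add from config
                self._datasets[ds_name] = f"dataset<{ds_config['type']}>"
        dataset = self._datasets.get(ds_name)
        if dataset is None:                            # error scenario: fail here
            raise DatasetNotFoundError(ds_name)
        return dataset                                 # non-error scenario

catalog = MiniCatalog({"companies": "dataset<csv>"},
                      {"reviews": {"type": "parquet"}})
print(catalog.get_dataset("reviews"))  # dataset<parquet>
```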

@merelcht (Member) left a comment:
Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟

This reverts commit 5208321.

6 participants