
[DataCatalog2.0]: KedroDataCatalog #4151

Open · wants to merge 171 commits into base: main
Conversation

@ElenaKhaustova (Contributor) commented Sep 9, 2024

Description

In this PR we add a new catalog, KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses:

Please see the suggested order of work in #3995 (comment) and comment below: #4151 (comment)

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit-tests for KedroDataCatalog, they'll be added after the initial feedback.

Development notes

  • We kept some of the old DataCatalog API to avoid multiple if/else branches depending on the catalog type used in context, runner and session
  • Removed _FrozenDatasets; datasets are now accessed as properties
  • Added a get-dataset-by-name feature: a dedicated method and access by key
  • Added a feature to iterate over the datasets
  • add_feed_dict() was simplified and renamed to add_raw_data()
  • Dataset initialisation was moved out of the from_config() method
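The features listed in these notes can be sketched on a minimal stand-in class. This is illustrative only: the method names (get_dataset, add_raw_data, access by key) come from the notes above, but the class and its bodies are hypothetical, not Kedro's implementation.

```python
# Hypothetical stand-in for the API described above -- not Kedro's code.
class MiniKedroDataCatalog:
    def __init__(self, datasets=None):
        self._datasets = dict(datasets or {})

    def get_dataset(self, name):
        """Dedicated get-dataset-by-name method."""
        return self._datasets[name]

    def __getitem__(self, name):
        """Access a dataset by key: catalog["my_dataset"]."""
        return self._datasets[name]

    def __iter__(self):
        """Iterate over dataset names."""
        yield from self._datasets

    def add_raw_data(self, data):
        """Register raw objects directly (replaces add_feed_dict)."""
        self._datasets.update(data)


catalog = MiniKedroDataCatalog({"companies": "csv dataset"})
catalog.add_raw_data({"params": {"lr": 0.01}})
print(sorted(catalog))       # ['companies', 'params']
print(catalog["companies"])  # csv dataset
```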

To test KedroDataCatalog, modify your settings.py and run commands as usual:

```python
# settings.py
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

```shell
kedro run
kedro catalog list/rank/resolve/create
```

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>
@noklam (Contributor) left a comment:

> Based on the research findings, we decided to get rid of the _FrozenDatasets API. It was confusing for users, and they almost never used it, preferring _get_dataset() instead. For now, I made __getitem__ return a deep copy for consistency with the datasets property. However, object assignment is not supported anyway, so users cannot override a specific dataset.

I guess it will be hard to prevent users from assigning something that they are not going to access later. We had similar issues previously, where a user tried to assign something to the catalog but found it didn't change anything, because a Kedro run always creates its own catalog.

For the deep copy, would it work for datasets that are not picklable, or datasets that come with a connection?

I left a comment in the other PR; I still find CatalogProtocol a bit weird, but I am not sure what convention the Python community has adopted. Something like CatalogLike?

@ElenaKhaustova (Contributor, Author) replied:
> I guess it will be hard to prevent users from assigning something that they are not going to access later. We had similar issues previously, where a user tried to assign something to the catalog but found it didn't change anything, because a Kedro run always creates its own catalog.
>
> For the deep copy, would it work for datasets that are not picklable, or datasets that come with a connection?

These two points totally make sense to me. However, _FrozenDatasets didn't solve either of those problems, as people were using _get_dataset(), which returns a reference to the dataset. Based on your second point, I went back to returning a reference as well, so we do not require picklable datasets, though as far as I understand, ParallelRunner does.
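The deep-copy concern above can be demonstrated with a toy dataset holding an unpicklable handle; a threading.Lock stands in for a live database connection, since copy.deepcopy falls back on pickle-style reduction and fails on both. The class name is hypothetical, not a Kedro dataset.

```python
import copy
import threading

class ToyConnectionDataset:
    """Hypothetical dataset holding a live connection-like object."""
    def __init__(self):
        # A lock stands in for a DB connection; neither can be pickled.
        self._connection = threading.Lock()

ds = ToyConnectionDataset()
try:
    copy.deepcopy(ds)  # deepcopy uses pickle-style reduction for unknown objects
    print("copied")
except TypeError as err:
    print(f"deepcopy failed: {err}")
```

Returning a reference instead of a deep copy sidesteps this, at the cost of allowing callers to mutate the shared dataset object.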

@ElenaKhaustova (Contributor, Author) replied:
> I left a comment in the other PR; I still find CatalogProtocol a bit weird, but I am not sure what convention the Python community has adopted. Something like CatalogLike?

As for the naming convention, I see different examples: https://typing.readthedocs.io/en/latest/spec/protocol.html
But looking at the existing ones, they are named like SupportsDataCatalog: https://docs.python.org/3/library/typing.html#protocols

So maybe the latter is the better naming.
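For illustration, a structural protocol under the Supports* naming might look like the sketch below. The method set (load/save) is hypothetical and only hints at what a catalog protocol could require; any class with matching methods satisfies it without inheriting from it.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SupportsDataCatalog(Protocol):
    """Structural type: anything with load/save matches, no inheritance needed."""
    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...

class InMemoryCatalog:
    """Satisfies the protocol purely by shape."""
    def __init__(self) -> None:
        self._data: dict = {}
    def load(self, name: str) -> Any:
        return self._data[name]
    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

# runtime_checkable lets isinstance() verify the method names exist
print(isinstance(InMemoryCatalog(), SupportsDataCatalog))  # True
```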

@ElenaKhaustova (Contributor, Author) commented:

Update for the reviewers

Since there is a proposal for a new interface (#4175), we removed some methods (__getitem__, __setitem__, __iter__) from the current implementation, since they will most probably change during the implementation of the proposal.

If the proposal doesn't go through, they'll be added in a separate PR, so they do not block the current one.

@merelcht (Member) left a comment:
I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.

from kedro.utils import _format_rich, _has_rich_handler


class KedroDataCatalog:
A Member commented:
I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost to us. Other implementers do not need to extend it, though.

@@ -134,6 +134,7 @@
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
A Member commented:
I'd put it directly into the data_catalog module.

@ElenaKhaustova (Contributor, Author) replied:

Do you mean moving KedroDataCatalog class into data_catalog instead of creating a new file?

DatasetNotFoundError: When a dataset with the given name
is not in the collection and do not match patterns.
"""
ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
A Member commented:
Shouldn't we resolve only if ds_name is not in self._datasets? Or is it just to make the code a bit simpler?

@ElenaKhaustova (Contributor, Author) replied:
Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.


dataset = self._datasets.get(ds_name, None)

if dataset is None:
A Member commented:
Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:

  • resolve the dataset pattern
  • if not part of the materialised datasets, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
  • otherwise continue with non-error scenario

I think we can make the flow a bit less zig-zaggy.

@ElenaKhaustova (Contributor, Author) replied:
Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if, as the logic needs to go inside the if-fail/else scenario.

Now the logic is like this:

  • if not part of the materialised datasets, resolve the dataset pattern
  • if resolved, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
  • otherwise, continue with a non-error scenario
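The reworked flow can be sketched with a stand-in class. The names (_datasets, resolve_dataset_pattern, DatasetNotFoundError) follow the snippets quoted in this thread, but the bodies are illustrative, not the PR's actual code.

```python
class DatasetNotFoundError(KeyError):
    """Raised when a dataset is neither materialised nor pattern-resolvable."""

class MiniCatalog:
    def __init__(self, datasets, pattern_configs):
        self._datasets = dict(datasets)          # materialised datasets
        self._pattern_configs = pattern_configs  # name -> resolved config

    def resolve_dataset_pattern(self, ds_name):
        return self._pattern_configs.get(ds_name, {})

    def get_dataset(self, ds_name):
        if ds_name not in self._datasets:              # only resolve on a miss
            ds_config = self.resolve_dataset_pattern(ds_name)
            if ds_config:                              # if resolved, add from config
                self._datasets[ds_name] = f"dataset<{ds_config['type']}>"
        dataset = self._datasets.get(ds_name)
        if dataset is None:                            # error scenario: fail here
            raise DatasetNotFoundError(ds_name)
        return dataset                                 # non-error scenario

catalog = MiniCatalog({"companies": "dataset<csv>"},
                      {"reviews": {"type": "parquet"}})
print(catalog.get_dataset("reviews"))  # dataset<parquet>
```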

@merelcht (Member) left a comment:
Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟

This reverts commit 5208321.

6 participants