[DataCatalog2.0]: KedroDataCatalog
#4151
base: main
Conversation
Signed-off-by: Elena Khaustova <[email protected]>
Based on the research findings, we decided to get rid of the FrozenDataset API. It was confusing for the users, and they almost never used it, preferring _get_dataset() instead. For now, I made getitem to return a deep copy for consistency with the datasets property. However, object assignment is not supported anyway, so users cannot override the specific dataset.
I guess it will be hard to prevent a user from assigning something that they are not going to access later. We had similar issues previously, where users tried to assign something to the catalog but found that it didn't change anything, because a Kedro run always creates its own `catalog`.
For the deep copy, would it work for datasets that are not "picklable", or datasets that come with a connection?
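A minimal sketch of the access pattern under discussion (hypothetical class and names, not the actual Kedro implementation): `__getitem__` hands back a deep copy, so mutating the returned object never changes the catalog's own dataset, which is also why deep-copying non-picklable objects such as open connections is a concern.

```python
import copy


class SketchCatalog:
    """Toy catalog: __getitem__ returns a deep copy, so in-place edits are lost."""

    def __init__(self, datasets):
        self._datasets = datasets  # name -> dataset object

    def __getitem__(self, name):
        # The deep copy keeps the internal dataset immutable from the outside,
        # but will fail for objects that cannot be deep-copied (e.g. open
        # connections), which is the concern raised above.
        return copy.deepcopy(self._datasets[name])


catalog = SketchCatalog({"cars": {"filepath": "cars.csv"}})
ds = catalog["cars"]
ds["filepath"] = "other.csv"        # modifies only the copy
print(catalog["cars"]["filepath"])  # still "cars.csv"
```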
I left a comment in the other PR. I still find CatalogProtocol a bit weird, but I am not sure what convention the Python community adopted — maybe something like CatalogLike?
These two points totally make sense to me.
As for the naming convention, I see different examples: https://typing.readthedocs.io/en/latest/spec/protocol.html. So maybe the last one is the better naming choice.
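For illustration, here is the structural-typing style the linked spec describes, with hypothetical method names: whatever the protocol is called (CatalogProtocol, CatalogLike, ...), any class with matching methods satisfies it without subclassing.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CatalogLike(Protocol):
    """Structural type: anything with load/save matches, no inheritance needed."""

    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...


class InMemoryCatalog:  # note: no subclassing of CatalogLike
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data


# runtime_checkable only checks that the method names exist on the instance
print(isinstance(InMemoryCatalog(), CatalogLike))  # True
```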
Update for the reviewers: since there is a proposal for a new interface (#4175), we removed some methods. If the proposal doesn't go through, they'll be added in a separate PR, so we do not block the current one.
I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.
from kedro.utils import _format_rich, _has_rich_handler

class KedroDataCatalog:
I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost to us. Other implementers do not need to extend it, though.
@@ -134,6 +134,7 @@
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
I'd put it directly into the data_catalog module.
Do you mean moving the KedroDataCatalog class into data_catalog instead of creating a new file?
kedro/io/kedro_data_catalog.py
Outdated
DatasetNotFoundError: When a dataset with the given name
    is not in the collection and does not match any patterns.
"""
ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
Shouldn't we resolve only if ds_name is not in self._datasets? Or is it just to make the code a bit simpler?
Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.
dataset = self._datasets.get(ds_name, None)

if dataset is None:
Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:
- resolve the dataset pattern
- if not part of the materialised datasets, add from config
- get the dataset
- if the dataset does not exist (basically if it can neither be resolved nor is already present), go with the error scenario
- otherwise continue with the non-error scenario
I think we can make the flow a bit less zigzaggy.
Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if, as the logic needs to go inside the if fail [] else [] scenario.
Now the logic is like this:
- if not part of the materialised datasets, resolve the dataset pattern
- if resolved, add from config
- get the dataset
- if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
- otherwise, continue with a non-error scenario
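The steps above can be sketched as a standalone toy implementation (class and method names are approximations for illustration, not the real Kedro code):

```python
class DatasetNotFoundError(KeyError):
    """Raised when a name neither exists nor matches any pattern."""


class PrefixResolver:
    """Toy stand-in for the config resolver: matches names starting with 'raw_'."""

    def resolve_dataset_pattern(self, ds_name):
        return {"type": "memory"} if ds_name.startswith("raw_") else None


class SketchCatalog:
    def __init__(self, datasets, config_resolver):
        self._datasets = datasets
        self._config_resolver = config_resolver

    def get_dataset(self, ds_name):
        # 1. Only consult the pattern resolver for unmaterialised names.
        if ds_name not in self._datasets:
            ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
            # 2. If a pattern matched, materialise the dataset from config.
            if ds_config:
                self._datasets[ds_name] = ds_config
        # 3. Fail if the name neither exists nor matched a pattern.
        dataset = self._datasets.get(ds_name, None)
        if dataset is None:
            raise DatasetNotFoundError(ds_name)
        # 4. Otherwise, continue with the non-error scenario.
        return dataset


catalog = SketchCatalog({"cars": {"type": "csv"}}, PrefixResolver())
print(catalog.get_dataset("cars"))      # pre-existing dataset
print(catalog.get_dataset("raw_cars"))  # materialised from the pattern
```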
Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟
This reverts commit 5208321.
Force-pushed from 9c4701e to 17199ad
Description

In this PR we add a new catalog, KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses:
- _FrozenDatasets public API #3926
- DataCatalog #3931

Please see the suggested order of work in #3995 (comment) and the comment below: #4151 (comment)

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit tests for KedroDataCatalog; they'll be added after the initial feedback.

Development notes
- DataCatalog API to avoid multiple if/else branches depending on the catalog type used in context, runner and session
- _FrozenDatasets and access datasets as properties
- add_feed_dict() simplified and renamed to add_raw_data()
- from_config() method
methodTo test
KedroDataCatalog
modify yoursettings.py
and run commands as usual:kedro run
kedro catalog list/rank/resolve/create
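Assuming the usual Kedro settings hook (DATA_CATALOG_CLASS) and that KedroDataCatalog is importable from kedro.io, the settings.py change would look roughly like this (a sketch — check the PR for the final import path and setting name):

```python
# settings.py -- import path and setting name assumed, not confirmed by this PR
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```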
Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- RELEASE.md file