Remove the copying hack and add proper params querying capabilities in the DataCatalog #3732
Signed-off-by: Ivan Danov <[email protected]>
Quick thought: this will break %load_node immediately. Having a syntax that is identical in "pipelines.py" and DataCatalog is convenient for this case. Also, what's the size of the parameters here? Can you share a link to the testing project that you used?
I tested with a YAML file of ~2MB (100k lines); the nested logic takes less than a second.
@@ -12,6 +12,7 @@
import re
from typing import Any, Dict

from omegaconf import OmegaConf
I don't know how I feel about coupling the DataCatalog class with OmegaConf here, let alone special-casing parameters by using the : separator.
Shouldn't we redesign the DataCatalog API instead so that parameters are first-class citizens, and not fake datasets?
+1 to that 👆🏻
Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.
Originally posted by @merelcht in #3893 (comment)
It's been 3 months, so I had to re-read the PR to give a sensible answer to that.
Maybe "coupling" is not the right word, but I still feel weird about using OmegaConf inside DataCatalog in this way. Yes, we do have a dependency on OmegaConf already in other parts of the framework, but the way DataCatalog.load introspects the result of dataset.load() and applies extra logic if it's a dictionary feels wrong. What if the user has a custom dataset that returns a dict that is not meant to be configuration?
In any case, I'm also okay with separating the discussion about this solution from the idea of making parameters first-class citizens rather than fake datasets (before or after tackling #2240).
It's unclear whether this solution is appropriate and to what extent this PR is ready to be polished by some other team member. Could we maybe transform this into an issue and properly prioritize it? I think we need a broad performance analysis of Kedro to understand where the main bottlenecks are for big projects, see #3790, kedro-org/kedro-viz#1726
Looks like there was not enough consensus to merge this PR. Left a comment about the underlying user problem at #3893 (comment)
Since there's no consensus on this POC and it's a breaking change, I suggest closing this for now. We have several tickets open regarding performance issues and the catalog redesign: Performance tickets, Catalog redesign
Description
Many users have been complaining about the slowness of Kedro with big projects, and that can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it causes a significant slowdown.
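To illustrate the expansion described above, here is a minimal sketch of how a nested parameters dictionary can turn into one entry per nested key. The helper name `expand_params` and the sample parameters are hypothetical and only approximate the idea; this is not Kedro's actual implementation.

```python
from typing import Any, Dict


def expand_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: register every nested key as its own entry."""
    feed: Dict[str, Any] = {"parameters": params}

    def _add(name: str, value: Any) -> None:
        # One "params:<dotted.path>" entry per node, including inner dicts.
        feed[f"params:{name}"] = value
        if isinstance(value, dict):
            for key, child in value.items():
                _add(f"{name}.{key}", child)

    for key, value in params.items():
        _add(key, value)
    return feed


# Made-up sample parameters, for illustration only.
params = {"model": {"lr": 0.01, "layers": {"hidden": 64, "out": 1}}, "seed": 42}
feed = expand_params(params)
for name in sorted(feed):
    print(name)
```

Even this tiny example yields seven entries from two top-level keys; the count grows with every nested key, which is why a parameters file of a couple of MB can produce a very large number of entries.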
This Draft PR is a POC in dropping the hack we have built into Kedro to support the convenience syntax of params:x.y.z and replacing it with a proper query instead, powered by OmegaConf.select. This way, all datasets that load into a Python dictionary can provide the same functionality out of the box.
A side effect is the removal of all those params:xxxx datasets from the output of catalog.list(), which is something people have been annoyed by anyway. Nevertheless, it is still a breaking change, so we need to decide whether it will need a new breaking Kedro version or whether it can go in as a bugfix/performance fix.
This profiling revealed another potential source of slowness, the loading of the config files, which is something we should investigate further in the future.
Development notes
Tested with an autogenerated 2.2MB parameters file. As you can see, nearly two-thirds of the time is shaved off.
Before (~120s and 1.17GB memory):
After (~45s and peaked at ~1.10GB memory usage):
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.