Remove the copying hack and add proper params querying capabilities in the DataCatalog #3732

idanov · 2024-03-22T15:18:38Z

Description

Many users have complained that Kedro is slow on big projects, which can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it shows up as a significant slowdown.
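To illustrate the expansion described above, here is a minimal sketch of how a nested parameters dictionary can turn into one catalog entry per node. It is loosely modelled on the feed-dict logic profiled below, not copied from it; the function name `expand_params` and the sample parameters are made up for illustration.

```python
from typing import Any, Dict


def expand_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten nested parameters into one feed-dict entry per node (illustrative only)."""
    feed_dict: Dict[str, Any] = {"parameters": params}

    def _add(name: str, value: Any) -> None:
        # Every node, leaf or not, gets its own "params:<dotted.path>" entry.
        feed_dict[f"params:{name}"] = value
        if isinstance(value, dict):
            for key, child in value.items():
                _add(f"{name}.{key}", child)

    for name, value in params.items():
        _add(name, value)
    return feed_dict


# Hypothetical sample parameters: four nested nodes plus the top-level dict.
params = {"model": {"train": {"lr": 0.01, "epochs": 10}, "seed": 42}}
feed = expand_params(params)
for entry in sorted(feed):
    print(entry)
```

Even this tiny file produces six entries (`parameters` plus five `params:...` keys), so the number of entries grows with every nested key in the parameters file.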

This Draft PR is a POC that drops the hack we have built into Kedro to support the convenience syntax of params:x.y.z and replaces it with a proper query, powered by OmegaConf.select. This way all datasets which load into a Python dictionary can provide the same functionality out of the box.
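The query idea can be sketched without any expansion step: keep one dictionary and resolve the dotted path only when a node asks for it. The block below uses a small hand-written `select` helper as a stdlib-only stand-in for OmegaConf.select, so it is not the PR's implementation; `load_param` and the sample parameters are hypothetical names.

```python
from typing import Any, Dict


def select(config: Dict[str, Any], query: str, default: Any = None) -> Any:
    """Walk a nested dict along a dot-separated path (stand-in for OmegaConf.select)."""
    node: Any = config
    for key in query.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def load_param(parameters: Dict[str, Any], dataset_name: str) -> Any:
    """Resolve 'parameters' or 'params:x.y.z' against a single stored dictionary."""
    if dataset_name == "parameters":
        return parameters
    prefix, _, query = dataset_name.partition(":")
    if prefix != "params" or not query:
        raise KeyError(dataset_name)
    return select(parameters, query)


# Hypothetical sample parameters.
params = {"model": {"train": {"lr": 0.01, "epochs": 10}, "seed": 42}}
print(load_param(params, "params:model.train.lr"))
print(load_param(params, "params:model.train"))
```

Only one object is stored, and the cost of a lookup depends on the depth of the path rather than on the size of the parameters file.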

A side effect is the removal of all those params:xxxx datasets from the output of catalog.list(), which is something people have been annoyed by anyway. Nevertheless, it is still a breaking change, so we need to decide whether it needs a new breaking Kedro version or can go in as a bugfix/performance fix.

This profiling revealed another potential source of slowness and that's the loading of the config files, which is something we should investigate further in the future.

Development notes

Tested with an autogenerated 2.2MB parameters file. As you can see, nearly 2/3 of the time is shaved off.

Before (~120s and 1.17GB memory):

╰─❯ pyinstrument --show '*/kedro/*' -m kedro catalog list
............................
  _     ._   __/__   _ _  _  _ _/_   Recorded: 15:01:57  Samples:  80971
 /_//_/// /_\ / //_// / //_'/ //     Duration: 120.480   CPU time: 123.195
/   _/                      v4.6.2

Program: pyinstrument --show */kedro/* -m kedro catalog list

120.476 <module>  kedro/__main__.py:1
└─ 119.979 main  kedro/framework/cli/cli.py:192
   ├─ 118.307 KedroCLI.__call__  click/core.py:1155
   │  └─ 118.307 KedroCLI.main  kedro/framework/cli/cli.py:107
   │     └─ 118.306 KedroCLI.main  click/core.py:1010
   │           [6 frames hidden]  click
   │              118.306 new_func  click/decorators.py:44
   │              └─ 118.302 list_datasets  kedro/framework/cli/catalog.py:37
   │                 └─ 117.349 KedroContext.catalog  kedro/framework/context/context.py:177
   │                    └─ 117.338 KedroContext._get_catalog  kedro/framework/context/context.py:209
   │                       ├─ 74.947 DataCatalog.add_feed_dict  kedro/io/data_catalog.py:638
   │                       │  ├─ 72.751 DataCatalog.add  kedro/io/data_catalog.py:565
   │                       │  │  ├─ 69.485 _FrozenDatasets.__init__  kedro/io/data_catalog.py:101
   │                       │  │  │  ├─ 67.073 dict.update  <built-in>
   │                       │  │  │  └─ 2.186 [self]  kedro/io/data_catalog.py
   │                       │  │  └─ 3.266 [self]  kedro/io/data_catalog.py
   │                       │  └─ 2.053 MemoryDataset.__init__  kedro/io/memory_dataset.py:38
   │                       │     └─ 2.038 MemoryDataset._save  kedro/io/memory_dataset.py:70
   │                       │        └─ 1.984 _copy_with_mode  kedro/io/memory_dataset.py:115
   │                       │           └─ 1.953 deepcopy  copy.py:128
   │                       │                 [3 frames hidden]  copy
   │                       └─ 42.202 KedroContext._get_feed_dict  kedro/framework/context/context.py:251
   │                          └─ 42.167 KedroContext.params  kedro/framework/context/context.py:189
   │                             └─ 42.167 OmegaConfigLoader.__getitem__  kedro/config/omegaconf_config.py:154
   │                                └─ 42.167 OmegaConfigLoader.load_and_merge_dir_config  kedro/config/omegaconf_config.py:255
   │                                   ├─ 27.912 load  omegaconf/omegaconf.py:184
   │                                   │     [87 frames hidden]  omegaconf, yaml
   │                                   ├─ 11.244 merge  omegaconf/omegaconf.py:250
   │                                   │     [30 frames hidden]  omegaconf, copy
   │                                   └─ 2.993 to_container  omegaconf/omegaconf.py:542
   │                                         [12 frames hidden]  omegaconf
   └─ 1.663 KedroCLI.__init__  kedro/framework/cli/cli.py:96
      └─ 1.450 KedroCLI.global_groups  kedro/framework/cli/cli.py:142
         └─ 1.450 load_entry_points  kedro/framework/cli/utils.py:385
            └─ 1.438 _safe_load_entry_point  kedro/framework/cli/utils.py:369
               └─ 1.438 EntryPoint.load  importlib_metadata/__init__.py:178
                     [3 frames hidden]  importlib_metadata, importlib, kedro_viz

To view this report with different options, run:
    pyinstrument --load-prev 2024-03-22T15-01-57 [options]

After (~45s and peaked at ~1.10GB memory usage):

╰─❯ pyinstrument --show '*/kedro/*' -m kedro catalog list
..........................................
  _     ._   __/__   _ _  _  _ _/_   Recorded: 15:07:06  Samples:  37290
 /_//_/// /_\ / //_// / //_'/ //     Duration: 45.816    CPU time: 49.156
/   _/                      v4.6.2

Program: pyinstrument --show */kedro/* -m kedro catalog list

45.815 <module>  kedro/__main__.py:1
├─ 45.335 main  kedro/framework/cli/cli.py:225
│  ├─ 43.610 KedroCLI.__call__  click/core.py:1155
│  │  └─ 43.610 KedroCLI.main  kedro/framework/cli/cli.py:110
│  │     └─ 43.608 KedroCLI.main  click/core.py:1010
│  │           [6 frames hidden]  click
│  │              43.608 new_func  click/decorators.py:44
│  │              └─ 43.608 list_datasets  kedro/framework/cli/catalog.py:37
│  │                 ├─ 42.749 KedroContext.catalog  kedro/framework/context/context.py:177
│  │                 │  └─ 42.742 KedroContext._get_catalog  kedro/framework/context/context.py:209
│  │                 │     ├─ 42.079 KedroContext.params  kedro/framework/context/context.py:189
│  │                 │     │  └─ 42.079 OmegaConfigLoader.__getitem__  kedro/config/omegaconf_config.py:154
│  │                 │     │     └─ 42.079 OmegaConfigLoader.load_and_merge_dir_config  kedro/config/omegaconf_config.py:255
│  │                 │     │        ├─ 27.833 load  omegaconf/omegaconf.py:184
│  │                 │     │        │     [129 frames hidden]  omegaconf, contextlib, yaml
│  │                 │     │        ├─ 11.229 merge  omegaconf/omegaconf.py:250
│  │                 │     │        │     [50 frames hidden]  omegaconf, copy, <built-in>
│  │                 │     │        └─ 3.002 to_container  omegaconf/omegaconf.py:542
│  │                 │     │              [12 frames hidden]  omegaconf
│  │                 │     └─ 0.596 DataCatalog.add_feed_dict  kedro/io/data_catalog.py:652
│  │                 │        └─ 0.596 MemoryDataset.__init__  kedro/io/memory_dataset.py:38
│  │                 │           └─ 0.596 MemoryDataset._save  kedro/io/memory_dataset.py:70
│  │                 │              └─ 0.596 _copy_with_mode  kedro/io/memory_dataset.py:115
│  │                 │                 └─ 0.587 deepcopy  copy.py:128
│  │                 │                       [9 frames hidden]  copy
│  │                 └─ 0.810 _ProjectPipelines.inner  kedro/framework/project/__init__.py:141
│  │                    └─ 0.810 _ProjectPipelines._load_data  kedro/framework/project/__init__.py:176
│  │                       └─ 0.810 register_pipelines  kedro_spaceflights/pipeline_registry.py:8
│  │                          └─ 0.810 find_pipelines  kedro/framework/project/__init__.py:322
│  │                             └─ 0.790 import_module  importlib/__init__.py:109
│  │                                └─ 0.788 <module>  kedro_spaceflights/pipelines/data_science/__init__.py:1
│  │                                   └─ 0.788 <module>  kedro_spaceflights/pipelines/data_science/pipeline.py:1
│  │                                      └─ 0.787 <module>  kedro_spaceflights/pipelines/data_science/nodes.py:1
│  │                                         └─ 0.680 <module>  sklearn/__init__.py:1
│  │                                               [3 frames hidden]  sklearn
│  └─ 1.714 KedroCLI.__init__  kedro/framework/cli/cli.py:99
│     └─ 1.517 KedroCLI.global_groups  kedro/framework/cli/cli.py:175
│        └─ 1.517 load_entry_points  kedro/framework/cli/utils.py:387
│           └─ 1.503 _safe_load_entry_point  kedro/framework/cli/utils.py:371
│              └─ 1.503 EntryPoint.load  importlib_metadata/__init__.py:178
│                    [4 frames hidden]  importlib_metadata, importlib, kedro_viz
│                       1.021 <module>  kedro_viz/server.py:1
│                       └─ 0.790 <module>  kedro_viz/integrations/kedro/data_loader.py:1
│                          └─ 0.787 __getattr__  lazy_loader/__init__.py:72
│                                [4 frames hidden]  lazy_loader, importlib, kedro_dataset...
└─ 0.480 <module>  kedro/framework/cli/__init__.py:1
   └─ 0.478 <module>  kedro/framework/cli/cli.py:1

To view this report with different options, run:
    pyinstrument --load-prev 2024-03-22T15-07-06 [options]
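As a quick check on the two reports above, the following sketch recomputes the headline figures from the timings pyinstrument printed. The numbers are copied from the profiles; the dictionary keys are just labels chosen here.

```python
# Timings in seconds, copied from the two pyinstrument reports above.
before = {"total": 120.480, "add_feed_dict": 74.947, "params_loading": 42.167}
after = {"total": 45.816, "add_feed_dict": 0.596, "params_loading": 42.079}

reduction = 1 - after["total"] / before["total"]
speedup = before["total"] / after["total"]
feed_dict_share_before = before["add_feed_dict"] / before["total"]
loading_share_after = after["params_loading"] / after["total"]

print(f"wall time reduction:             {reduction:.1%}")
print(f"speedup:                         {speedup:.2f}x")
print(f"add_feed_dict share (before):    {feed_dict_share_before:.1%}")
print(f"config loading share (after):    {loading_share_after:.1%}")
```

The run is roughly 62% shorter, which matches the "nearly 2/3" claim. Config loading takes about 42s in both runs, so after the change it accounts for roughly 92% of the remaining time, which is the other source of slowness mentioned in the description.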

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Ivan Danov <[email protected]>

noklam · 2024-03-22T16:05:59Z

Quick thought, this will break %load_node immediately. Having a syntax that is identical in "pipelines.py" and DataCatalog is convenient for this case.

Also, what's the size of the parameters here? Can you share a link for the testing project that you used.

noklam · 2024-03-22T18:14:45Z

I tested with a YAML file ~2MB (100k lines), the nested logic takes less than a second.

astrojuanlu · 2024-03-25T15:54:23Z

kedro/io/data_catalog.py

@@ -12,6 +12,7 @@
 import re
 from typing import Any, Dict

+from omegaconf import OmegaConf


I don't know how I feel about coupling the DataCatalog class with OmegaConf here, let alone special-casing parameters by using the : separator.

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

+1 to that 👆🏻

Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.

Originally posted by @merelcht in #3893 (comment)

It's been 3 months so I had to re-read the PR again to give a sensible answer to that.

Maybe "coupling" is not the right word but I still feel weird about using OmegaConf inside DataCatalog in this way. Yes, we do have a dependency on OmegaConf already in other parts of the framework, but the way DataCatalog.load introspects the result of dataset.load() and applies extra logic if it's a dictionary feels wrong. What if the user has a custom dataset that returns a dict that is not meant to be configuration?
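To make that concern concrete, here is a toy sketch of a catalog whose load method treats anything after a colon as a dotted query and applies it whenever the loaded object is a dictionary. `ToyCatalog`, its loaders and the dataset names are all hypothetical and are not taken from the PR's diff.

```python
from typing import Any, Callable, Dict


class ToyCatalog:
    """Hypothetical catalog: 'name:query' loads 'name', then walks into the result."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders

    def load(self, name: str) -> Any:
        dataset_name, _, query = name.partition(":")
        data = self._loaders[dataset_name]()
        # The extra logic under discussion: any dict result is treated as queryable.
        if query and isinstance(data, dict):
            for key in query.split("."):
                data = data[key]
        return data


catalog = ToyCatalog(
    {
        "params": lambda: {"model": {"lr": 0.01}},
        # A custom dataset returning a dict that is not configuration:
        # its keys are arbitrary strings and may contain dots.
        "word_counts": lambda: {"e.g.": 3, "etc": 7},
    }
)

print(catalog.load("params:model.lr"))

try:
    catalog.load("word_counts:e.g.")
except KeyError as exc:
    print(f"query failed on a non-config dict: {exc!r}")
```

The parameters lookup works, but the same rule applied to `word_counts` splits the key `e.g.` on its dots and fails, because nothing tells the catalog which dictionaries are meant to be queried this way.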

In any case, I'm also okay with separating the discussion about this solution from the idea of making parameters first-class citizens rather than fake datasets (before or after tackling #2240).

astrojuanlu · 2024-05-06T07:17:40Z

It's unclear whether this solution is appropriate and to what extent this PR is ready to be polished by some other team member. Could we maybe transform this into an issue and properly prioritize it?

I think we need a broad performance analysis of Kedro to understand where the main bottlenecks are for big projects, see #3790, kedro-org/kedro-viz#1726

astrojuanlu · 2024-07-02T14:44:00Z

Looks like there was not enough consensus to merge this PR. Left a comment about the underlying user problem at #3893 (comment)

merelcht · 2024-07-09T09:11:33Z

Since there's no consensus on this POC and it's a breaking change, I suggest closing this for now. We have several tickets open regarding performance issues and the DataCatalog redesign, so there are plenty of opportunities to address this properly in sprint work.

Performance tickets

Investigate performance of config loading for big projects #3893
Spike: design example kedro projects that can be used to assess performance issues #3957
[Stress Testing] - Create example projects to assess Kedro performance for complex pipelines #3866

Catalog redesign

Design DataCatalog2.0 #3995

Remove the copying hack and add proper params querying capabilities

122548f

Signed-off-by: Ivan Danov <[email protected]>

idanov requested review from astrojuanlu, marrrcin and noklam March 22, 2024 15:18

idanov self-assigned this Mar 22, 2024

astrojuanlu reviewed Mar 25, 2024

This was referenced Jul 2, 2024

Key completion for dataset access #3973

Merged

Investigate performance of config loading for big projects #3893

Open

This was referenced Jul 8, 2024

Close all (or as many as possible) open PRs! #3996

Closed

Spike: design example kedro projects that can be used to assess performance issues #3957

Closed

merelcht closed this Jul 9, 2024
