
Commit 512962e

gtauzin, pre-commit-ci[bot], antonymilne, and petar-qb authored
[Feat] Enable datasets_from_catalog to return factory-based datasets (#1001)
Signed-off-by: Guillaume Tauzin <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Antony Milne <[email protected]>
Co-authored-by: Petar Pejovic <[email protected]>
Co-authored-by: Antony Milne <[email protected]>
1 parent 93cca0e commit 512962e

File tree

13 files changed: +224 −46 lines
@@ -0,0 +1,39 @@
+<!--
+A new scriv changelog fragment.
+
+Uncomment the section that is right (remove the HTML comment wrapper).
+-->
+
+<!--
+### Highlights ✨
+
+- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+<!--
+### Removed
+
+- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+### Added
+
+- Kedro integration function `datasets_from_catalog` can handle dataset factories for `kedro>=0.19.9`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))
+
+### Changed
+
+- Bump optional dependency lower bound to `kedro>=0.19.0`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))
+
+<!--
+### Deprecated
+
+- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+
+<!--
+### Security
+
+- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->

vizro-core/docs/pages/explanation/authors.md

+1-1
@@ -10,7 +10,7 @@
 
 <!-- vale off -->
 
-[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
+[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
 
 with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,
 
vizro-core/docs/pages/explanation/faq.md

+1-1
@@ -95,7 +95,7 @@ Any attempt at a high-level explanation must rely on an oversimplification that
 
 All are great entry points to the world of data apps. If you prefer a top-down scripting style, then Streamlit is a powerful approach. If you prefer full control and customization over callbacks and layouts, then Dash is a powerful approach. If you prefer a configuration approach with in-built best practices, and the potential for customization and scalability through Dash, then Vizro is a powerful approach.
 
-For a more detailed comparison, it may help to visit the introductory articles of [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose, and could be the best tool of choice.
+For a more detailed comparison, it may help to read introductory articles about [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://blog.streamlit.io/streamlit-101-python-data-app/) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose.
 
 ## How does Vizro compare with Python packages and business intelligence (BI) tools?
 

vizro-core/docs/pages/user-guides/kedro-data-catalog.md

+54-8
@@ -12,7 +12,7 @@ pip install vizro[kedro]
 
 ## Use datasets from the Kedro Data Catalog
 
-`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
+`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
 
 ```python
 from vizro.integrations import kedro as kedro_integration
@@ -39,20 +39,21 @@ The full code for these different cases is given below.
         from vizro.integrations import kedro as kedro_integration
         from vizro.managers import data_manager
 
+        project_path = "/path/to/kedro/project"
+        catalog = kedro_integration.catalog_from_project(project_path)
 
-        catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
         ```
 
     === "app.ipynb (Kedro Jupyter session)"
         ```python
         from vizro.managers import data_manager
 
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
         ```
 
     === "app.py (Data Catalog configuration file)"
@@ -66,6 +67,51 @@ The full code for these different cases is given below.
 
         catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
+        ```
+
+### Use dataset factories
+
+To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:
+
+```python
+kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"])  # (1)!
+```
+
+1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.
+
+The `pipelines` variable may have been created in the following ways:
+
+1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate `pipelines` given the path to a Kedro project.
+1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.
+
+The full code for these different cases is given below.
+
+!!! example "Import a Kedro Data Catalog with dataset factories into the Vizro data manager"
+    === "app.py (Kedro project path)"
+        ```python
+        from vizro.integrations import kedro as kedro_integration
+        from vizro.managers import data_manager
+
+
+        project_path = "/path/to/kedro/project"
+        catalog = kedro_integration.catalog_from_project(project_path)
+        pipelines = kedro_integration.pipelines_from_project(project_path)
+
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
+            catalog, pipeline=pipelines["__default__"]
+        ).items():
+            data_manager[dataset_name] = dataset_loader
+        ```
+
+    === "app.ipynb (Kedro Jupyter session)"
+        ```python
+        from vizro.managers import data_manager
+
+
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
+            catalog, pipeline=pipelines["__default__"]
+        ).items():
+            data_manager[dataset_name] = dataset_loader
         ```
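The registration loop in the documentation above stores loader callables in the Vizro data manager rather than loaded data, so loading is deferred until the dashboard needs it. A minimal stdlib-only sketch of that pattern, where the hypothetical `datasets_from_catalog_stub` and a plain dict stand in for the real Kedro integration and `data_manager`:

```python
def datasets_from_catalog_stub():
    # The real datasets_from_catalog returns {dataset_name: dataset.load};
    # here the loaders are plain callables returning dummy "frames".
    return {
        "pandas_excel": lambda: [{"col": 1}],
        "pandas_parquet": lambda: [{"col": 2}],
    }


data_manager = {}  # stands in for vizro.managers.data_manager

for dataset_name, dataset_loader in datasets_from_catalog_stub().items():
    data_manager[dataset_name] = dataset_loader  # register the loader, not the data

# Loading happens only when a registered loader is called:
assert data_manager["pandas_excel"]() == [{"col": 1}]
```

This mirrors why the docs renamed the loop variable from `dataset` to `dataset_loader`: what is registered is a callable, not a DataFrame.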

vizro-core/hatch.toml

+3-9
@@ -3,13 +3,6 @@
 [[envs.all.matrix]]
 python = ["3.9", "3.10", "3.11", "3.12", "3.13"]
 
-[envs.all.overrides]
-# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
-# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
-matrix.python.features = [
-  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
-]
-
 [envs.changelog]
 dependencies = ["scriv"]
 detached = true
@@ -37,6 +30,7 @@ dependencies = [
   "pyhamcrest",
   "gunicorn"
 ]
+features = ["kedro"]
 installer = "uv"
 
 [envs.default.env-vars]
@@ -133,9 +127,9 @@ extra-dependencies = [
   "dash==2.18.0",
   "plotly==5.24.0",
   "pandas==2.0.0",
-  "numpy==1.23.0" # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
+  "numpy==1.23.0", # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
+  "kedro==0.19.0" # Includes kedro-datasets as a dependency.
 ]
-features = ["kedro"]
 python = "3.9"
 
 [publish.index]

vizro-core/pyproject.toml

+3-2
@@ -25,7 +25,8 @@ dependencies = [
   "flask_caching>=2",
   "wrapt>=1",
   "black",
-  "autoflake"
+  "autoflake",
+  "packaging"
 ]
 description = "Vizro is a package to facilitate visual analytics."
 dynamic = ["version"]
@@ -36,7 +37,7 @@ requires-python = ">=3.9"
 
 [project.optional-dependencies]
 kedro = [
-  "kedro>=0.17.3",
+  "kedro>=0.19.0",
   "kedro-datasets" # no longer a dependency of kedro for kedro>=0.19.2
 ]
 
vizro-core/src/vizro/__init__.py

+2-1
@@ -5,6 +5,7 @@
 
 import plotly.io as pio
 from dash.development.base_component import ComponentRegistry
+from packaging.version import parse
 
 from ._constants import VIZRO_ASSETS_PATH
 from ._vizro import Vizro, _make_resource_spec
@@ -23,7 +24,7 @@
 # This would only be the case where you need to test something with serve_locally=False and have changed
 # assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
 # and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
-_git_branch = __version__ if "dev" not in __version__ else "main"
+_git_branch = __version__ if not parse(__version__).is_devrelease else "main"
 BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"
 # Enables the use of our own Bootstrap theme in a pure Dash app with `external_stylesheets=vizro.bootstrap`.
 bootstrap = f"{BASE_EXTERNAL_URL}static/css/vizro-bootstrap.min.css"

vizro-core/src/vizro/_vizro.py

+2-1
@@ -11,6 +11,7 @@
 import plotly.io as pio
 from dash.development.base_component import ComponentRegistry
 from flask_caching import SimpleCache
+from packaging.version import parse
 
 import vizro
 from vizro._constants import VIZRO_ASSETS_PATH
@@ -209,7 +210,7 @@ def _make_resource_spec(path: Path) -> _ResourceSpec:
 # This would only be the case where you need to test something with serve_locally=False and have changed
 # assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
 # and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
-_git_branch = vizro.__version__ if "dev" not in vizro.__version__ else "main"
+_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"
 BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"
 
 # Get path relative to the vizro package root, where this file resides.
@@ -1,3 +1,3 @@
-from ._data_manager import catalog_from_project, datasets_from_catalog
+from ._data_manager import catalog_from_project, datasets_from_catalog, pipelines_from_project
 
-__all__ = ["catalog_from_project", "datasets_from_catalog"]
+__all__ = ["catalog_from_project", "datasets_from_catalog", "pipelines_from_project"]
@@ -1,27 +1,73 @@
+from __future__ import annotations
+
+from importlib.metadata import version
 from pathlib import Path
-from typing import Any, Optional, Union
+from typing import TYPE_CHECKING, Any, Optional, Union
 
 from kedro.framework.session import KedroSession
 from kedro.framework.startup import bootstrap_project
-from kedro.io import DataCatalog
+from kedro.pipeline import Pipeline
+from packaging.version import parse
 
 from vizro.managers._data_manager import pd_DataFrameCallable
 
+if TYPE_CHECKING:
+    from kedro.io import CatalogProtocol
+
 
 def catalog_from_project(
     project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
-) -> DataCatalog:
+) -> CatalogProtocol:
     bootstrap_project(project_path)
     with KedroSession.create(
         project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
     ) as session:
         return session.load_context().catalog
 
 
-def datasets_from_catalog(catalog: DataCatalog) -> dict[str, pd_DataFrameCallable]:
+def pipelines_from_project(project_path: Union[str, Path]) -> Pipeline:
+    bootstrap_project(project_path)
+    from kedro.framework.project import pipelines
+
+    return pipelines
+
+
+def _legacy_datasets_from_catalog(catalog: CatalogProtocol) -> dict[str, pd_DataFrameCallable]:
+    # The old version of datasets_from_catalog from before https://github.com/mckinsey/vizro/pull/1001.
+    # This does not support dataset factories.
+    # We keep this version to maintain backwards compatibility with 0.19.0 <= kedro < 0.19.9.
+    # Note the pipeline argument does not exist.
     datasets = {}
     for name in catalog.list():
         dataset = catalog._get_dataset(name, suggest=False)
         if "pandas" in dataset.__module__:
             datasets[name] = dataset.load
     return datasets
+
+
+def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None) -> dict[str, pd_DataFrameCallable]:
+    if parse(version("kedro")) < parse("0.19.9"):
+        return _legacy_datasets_from_catalog(catalog)
+
+    # This doesn't include things added to the catalog at run time but that is ok for our purposes.
+    config_resolver = catalog.config_resolver
+    kedro_datasets = config_resolver.config.copy()
+
+    if pipeline:
+        # Go through all dataset names that weren't in catalog and try to resolve them. Those that cannot be
+        # resolved give an empty dictionary and are ignored.
+        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
+            if dataset_config := config_resolver.resolve_pattern(dataset_name):
+                kedro_datasets[dataset_name] = dataset_config
+
+    vizro_data_sources = {}
+
+    for dataset_name, dataset_config in kedro_datasets.items():
+        # "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
+        if "pandas" in dataset_config["type"]:
+            # TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
+            # but need to check if works with caching.
+            dataset = catalog._get_dataset(dataset_name, suggest=False)
+            vizro_data_sources[dataset_name] = dataset.load
+
+    return vizro_data_sources
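The factory handling in `datasets_from_catalog` above relies on Kedro's `config_resolver` to match pipeline dataset names such as `companies#csv` against patterns such as `{pandas_factory}#csv`, substituting the captured placeholder into the pattern's config. A rough, stdlib-only sketch of that matching idea (an illustration only, not Kedro's actual implementation; `resolve_pattern_stub` is hypothetical):

```python
import re


def compile_pattern(pattern):
    # Convert a Kedro-style factory pattern such as "{pandas_factory}#csv"
    # into a regex; each {placeholder} matches any non-empty text.
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        f"(?P<{p[1:-1]}>.+)" if p.startswith("{") else re.escape(p) for p in parts
    )
    return re.compile(f"^{regex}$")


def resolve_pattern_stub(dataset_name, pattern_configs):
    # Mimics the role of config_resolver.resolve_pattern: return the resolved
    # config for the first matching pattern, or {} when nothing matches.
    for pattern, config in pattern_configs.items():
        if match := compile_pattern(pattern).match(dataset_name):
            resolved = {}
            for key, value in config.items():
                for name, captured in match.groupdict().items():
                    value = value.replace("{" + name + "}", captured)
                resolved[key] = value
            return resolved
    return {}


patterns = {
    "{pandas_factory}#csv": {"type": "pandas.CSVDataset", "filepath": "{pandas_factory}.csv"}
}
assert resolve_pattern_stub("companies#csv", patterns) == {
    "type": "pandas.CSVDataset",
    "filepath": "companies.csv",
}
assert resolve_pattern_stub("not_a_match", patterns) == {}
```

An unresolvable name yields an empty dict, which is falsy; that is exactly why the real code can use `if dataset_config := config_resolver.resolve_pattern(dataset_name)` to skip such names.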

vizro-core/tests/unit/test_vizro.py

+2-1
@@ -2,11 +2,12 @@
 
 import dash
 import pytest
+from packaging.version import parse
 
 import vizro
 from vizro._constants import VIZRO_ASSETS_PATH
 
-_git_branch = vizro.__version__ if "dev" not in vizro.__version__ else "main"
+_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"
 
 
 def test_vizro_bootstrap():
@@ -1,7 +1,15 @@
-companies:
-  type: pandas.JSONDataset
-  filepath: companies.json
+"{pandas_factory}#csv":
+  type: pandas.CSVDataset
+  filepath: "{pandas_factory}.csv"
 
-reviews:
+pandas_excel:
+  type: pandas.ExcelDataset
+  filepath: pandas_excel.xlsx
+
+pandas_parquet:
+  type: pandas.ParquetDataset
+  filepath: pandas_parquet.parquet
+
+not_dataframe:
   type: pickle.PickleDataset
-  filepath: reviews.pkl
+  filepath: pickle.pkl
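The test fixture above deliberately mixes pandas-backed datasets with a pickle dataset, so the tests can check that only pandas datasets become Vizro data sources. The filter in `datasets_from_catalog` can be sketched over the fixture's explicit entries (the dict below mirrors the YAML by hand):

```python
# Explicit dataset configs from the fixture, minus the factory pattern.
configs = {
    "pandas_excel": {"type": "pandas.ExcelDataset"},
    "pandas_parquet": {"type": "pandas.ParquetDataset"},
    "not_dataframe": {"type": "pickle.PickleDataset"},
}

# datasets_from_catalog keeps only configs whose type mentions pandas.
pandas_only = {name for name, cfg in configs.items() if "pandas" in cfg["type"]}
assert pandas_only == {"pandas_excel", "pandas_parquet"}
```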
