
Commit 512962e

gtauzin, pre-commit-ci[bot], antonymilne, and petar-qb authored
[Feat] Enable datasets_from_catalog to return factory-based datasets (#1001)
Signed-off-by: Guillaume Tauzin <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Antony Milne <[email protected]>
Co-authored-by: Petar Pejovic <[email protected]>
Co-authored-by: Antony Milne <[email protected]>
1 parent 93cca0e commit 512962e

File tree

13 files changed: +224 −46 lines
@@ -0,0 +1,39 @@
+<!--
+A new scriv changelog fragment.
+
+Uncomment the section that is right (remove the HTML comment wrapper).
+-->
+
+<!--
+### Highlights ✨
+
+- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+<!--
+### Removed
+
+- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+### Added
+
+- Kedro integration function `datasets_from_catalog` can handle dataset factories for `kedro>=0.19.9`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))
+
+### Changed
+
+- Bump optional dependency lower bound to `kedro>=0.19.0`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))
+
+<!--
+### Deprecated
+
+- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->
+
+<!--
+### Security
+
+- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
+
+-->

vizro-core/docs/pages/explanation/authors.md

+1-1
@@ -10,7 +10,7 @@
 
 <!-- vale off -->
 
-[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
+[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
 
 with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,
 
vizro-core/docs/pages/explanation/faq.md

+1-1
@@ -95,7 +95,7 @@ Any attempt at a high-level explanation must rely on an oversimplification that
 
 All are great entry points to the world of data apps. If you prefer a top-down scripting style, then Streamlit is a powerful approach. If you prefer full control and customization over callbacks and layouts, then Dash is a powerful approach. If you prefer a configuration approach with in-built best practices, and the potential for customization and scalability through Dash, then Vizro is a powerful approach.
 
-For a more detailed comparison, it may help to visit the introductory articles of [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose, and could be the best tool of choice.
+For a more detailed comparison, it may help to read introductory articles about [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://blog.streamlit.io/streamlit-101-python-data-app/) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose.
 
 ## How does Vizro compare with Python packages and business intelligence (BI) tools?
 

vizro-core/docs/pages/user-guides/kedro-data-catalog.md

+54-8
@@ -12,7 +12,7 @@ pip install vizro[kedro]
 
 ## Use datasets from the Kedro Data Catalog
 
-`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
+`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
 
 ```python
 from vizro.integrations import kedro as kedro_integration
@@ -39,20 +39,21 @@ The full code for these different cases is given below.
         from vizro.integrations import kedro as kedro_integration
         from vizro.managers import data_manager
 
+        project_path = "/path/to/kedro/project"
+        catalog = kedro_integration.catalog_from_project(project_path)
 
-        catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
         ```
 
     === "app.ipynb (Kedro Jupyter session)"
         ```python
         from vizro.managers import data_manager
 
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
         ```
 
     === "app.py (Data Catalog configuration file)"
@@ -66,6 +67,51 @@ The full code for these different cases is given below.
 
         catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))
 
-        for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-            data_manager[dataset_name] = dataset
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+            data_manager[dataset_name] = dataset_loader
+        ```
+
+### Use dataset factories
+
+To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:
+
+```python
+kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"])  # (1)!
+```
+
+1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.
+
+The `pipelines` variable may have been created in the following ways:
+
+1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate `pipelines` given the path to a Kedro project.
+1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.
+
+The full code for these different cases is given below.
+
+!!! example "Import a Kedro Data Catalog with dataset factories into the Vizro data manager"
+    === "app.py (Kedro project path)"
+        ```python
+        from vizro.integrations import kedro as kedro_integration
+        from vizro.managers import data_manager
+
+
+        project_path = "/path/to/kedro/project"
+        catalog = kedro_integration.catalog_from_project(project_path)
+        pipelines = kedro_integration.pipelines_from_project(project_path)
+
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
+            catalog, pipeline=pipelines["__default__"]
+        ).items():
+            data_manager[dataset_name] = dataset_loader
+        ```
+
+    === "app.ipynb (Kedro Jupyter session)"
+        ```python
+        from vizro.managers import data_manager
+
+
+        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
+            catalog, pipeline=pipelines["__default__"]
+        ).items():
+            data_manager[dataset_name] = dataset_loader
         ```
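The registration loop in the documentation above stores loader callables in the Vizro data manager rather than loaded data, so loading is deferred until the dashboard needs it. A minimal stdlib-only sketch of that pattern, where the hypothetical `datasets_from_catalog_stub` and a plain dict stand in for the real Kedro integration and `data_manager`:

```python
def datasets_from_catalog_stub():
    # The real datasets_from_catalog returns {dataset_name: dataset.load};
    # here the loaders are plain callables returning dummy "frames".
    return {
        "pandas_excel": lambda: [{"col": 1}],
        "pandas_parquet": lambda: [{"col": 2}],
    }


data_manager = {}  # stands in for vizro.managers.data_manager

for dataset_name, dataset_loader in datasets_from_catalog_stub().items():
    data_manager[dataset_name] = dataset_loader  # register the loader, not the data

# Loading happens only when a registered loader is called:
assert data_manager["pandas_excel"]() == [{"col": 1}]
```

This mirrors why the docs renamed the loop variable from `dataset` to `dataset_loader`: what is registered is a callable, not a DataFrame.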

vizro-core/hatch.toml

+3-9
@@ -3,13 +3,6 @@
 [[envs.all.matrix]]
 python = ["3.9", "3.10", "3.11", "3.12", "3.13"]
 
-[envs.all.overrides]
-# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
-# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
-matrix.python.features = [
-  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
-]
-
 [envs.changelog]
 dependencies = ["scriv"]
 detached = true
@@ -37,6 +30,7 @@ dependencies = [
   "pyhamcrest",
   "gunicorn"
 ]
+features = ["kedro"]
 installer = "uv"
 
 [envs.default.env-vars]
@@ -133,9 +127,9 @@ extra-dependencies = [
   "dash==2.18.0",
   "plotly==5.24.0",
   "pandas==2.0.0",
-  "numpy==1.23.0" # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
+  "numpy==1.23.0", # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
+  "kedro==0.19.0" # Includes kedro-datasets as a dependency.
 ]
-features = ["kedro"]
 python = "3.9"
 
 [publish.index]

vizro-core/pyproject.toml

+3-2
@@ -25,7 +25,8 @@ dependencies = [
   "flask_caching>=2",
   "wrapt>=1",
   "black",
-  "autoflake"
+  "autoflake",
+  "packaging"
 ]
 description = "Vizro is a package to facilitate visual analytics."
 dynamic = ["version"]
@@ -36,7 +37,7 @@ requires-python = ">=3.9"
 
 [project.optional-dependencies]
 kedro = [
-  "kedro>=0.17.3",
+  "kedro>=0.19.0",
   "kedro-datasets" # no longer a dependency of kedro for kedro>=0.19.2
 ]
 
vizro-core/src/vizro/__init__.py

+2-1
@@ -5,6 +5,7 @@
 
 import plotly.io as pio
 from dash.development.base_component import ComponentRegistry
+from packaging.version import parse
 
 from ._constants import VIZRO_ASSETS_PATH
 from ._vizro import Vizro, _make_resource_spec
@@ -23,7 +24,7 @@
 # This would only be the case where you need to test something with serve_locally=False and have changed
 # assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
 # and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
-_git_branch = __version__ if "dev" not in __version__ else "main"
+_git_branch = __version__ if not parse(__version__).is_devrelease else "main"
 BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"
 # Enables the use of our own Bootstrap theme in a pure Dash app with `external_stylesheets=vizro.bootstrap`.
 bootstrap = f"{BASE_EXTERNAL_URL}static/css/vizro-bootstrap.min.css"

vizro-core/src/vizro/_vizro.py

+2-1
@@ -11,6 +11,7 @@
 import plotly.io as pio
 from dash.development.base_component import ComponentRegistry
 from flask_caching import SimpleCache
+from packaging.version import parse
 
 import vizro
 from vizro._constants import VIZRO_ASSETS_PATH
@@ -209,7 +210,7 @@ def _make_resource_spec(path: Path) -> _ResourceSpec:
 # This would only be the case where you need to test something with serve_locally=False and have changed
 # assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
 # and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
-_git_branch = vizro.__version__ if "dev" not in vizro.__version__ else "main"
+_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"
 BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"
 
 # Get path relative to the vizro package root, where this file resides.
@@ -1,3 +1,3 @@
-from ._data_manager import catalog_from_project, datasets_from_catalog
+from ._data_manager import catalog_from_project, datasets_from_catalog, pipelines_from_project
 
-__all__ = ["catalog_from_project", "datasets_from_catalog"]
+__all__ = ["catalog_from_project", "datasets_from_catalog", "pipelines_from_project"]
@@ -1,27 +1,73 @@
+from __future__ import annotations
+
+from importlib.metadata import version
 from pathlib import Path
-from typing import Any, Optional, Union
+from typing import TYPE_CHECKING, Any, Optional, Union
 
 from kedro.framework.session import KedroSession
 from kedro.framework.startup import bootstrap_project
-from kedro.io import DataCatalog
+from kedro.pipeline import Pipeline
+from packaging.version import parse
 
 from vizro.managers._data_manager import pd_DataFrameCallable
 
+if TYPE_CHECKING:
+    from kedro.io import CatalogProtocol
+
 
 def catalog_from_project(
     project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
-) -> DataCatalog:
+) -> CatalogProtocol:
     bootstrap_project(project_path)
     with KedroSession.create(
         project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
     ) as session:
         return session.load_context().catalog
 
 
-def datasets_from_catalog(catalog: DataCatalog) -> dict[str, pd_DataFrameCallable]:
+def pipelines_from_project(project_path: Union[str, Path]) -> Pipeline:
+    bootstrap_project(project_path)
+    from kedro.framework.project import pipelines
+
+    return pipelines
+
+
+def _legacy_datasets_from_catalog(catalog: CatalogProtocol) -> dict[str, pd_DataFrameCallable]:
+    # The old version of datasets_from_catalog from before https://github.com/mckinsey/vizro/pull/1001.
+    # This does not support dataset factories.
+    # We keep this version to maintain backwards compatibility with 0.19.0 <= kedro < 0.19.9.
+    # Note the pipeline argument does not exist.
     datasets = {}
     for name in catalog.list():
         dataset = catalog._get_dataset(name, suggest=False)
         if "pandas" in dataset.__module__:
             datasets[name] = dataset.load
     return datasets
+
+
+def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None) -> dict[str, pd_DataFrameCallable]:
+    if parse(version("kedro")) < parse("0.19.9"):
+        return _legacy_datasets_from_catalog(catalog)
+
+    # This doesn't include things added to the catalog at run time but that is ok for our purposes.
+    config_resolver = catalog.config_resolver
+    kedro_datasets = config_resolver.config.copy()
+
+    if pipeline:
+        # Go through all dataset names that weren't in catalog and try to resolve them. Those that cannot be
+        # resolved give an empty dictionary and are ignored.
+        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
+            if dataset_config := config_resolver.resolve_pattern(dataset_name):
+                kedro_datasets[dataset_name] = dataset_config
+
+    vizro_data_sources = {}
+
+    for dataset_name, dataset_config in kedro_datasets.items():
+        # "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
+        if "pandas" in dataset_config["type"]:
+            # TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
+            # but need to check if works with caching.
+            dataset = catalog._get_dataset(dataset_name, suggest=False)
+            vizro_data_sources[dataset_name] = dataset.load
+
+    return vizro_data_sources
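The factory handling in `datasets_from_catalog` above relies on Kedro's `config_resolver` to match pipeline dataset names such as `companies#csv` against patterns such as `{pandas_factory}#csv`, substituting the captured placeholder into the pattern's config. A rough, stdlib-only sketch of that matching idea (an illustration only, not Kedro's actual implementation; `resolve_pattern_stub` is hypothetical):

```python
import re


def compile_pattern(pattern):
    # Convert a Kedro-style factory pattern such as "{pandas_factory}#csv"
    # into a regex; each {placeholder} matches any non-empty text.
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        f"(?P<{p[1:-1]}>.+)" if p.startswith("{") else re.escape(p) for p in parts
    )
    return re.compile(f"^{regex}$")


def resolve_pattern_stub(dataset_name, pattern_configs):
    # Mimics the role of config_resolver.resolve_pattern: return the resolved
    # config for the first matching pattern, or {} when nothing matches.
    for pattern, config in pattern_configs.items():
        if match := compile_pattern(pattern).match(dataset_name):
            resolved = {}
            for key, value in config.items():
                for name, captured in match.groupdict().items():
                    value = value.replace("{" + name + "}", captured)
                resolved[key] = value
            return resolved
    return {}


patterns = {
    "{pandas_factory}#csv": {"type": "pandas.CSVDataset", "filepath": "{pandas_factory}.csv"}
}
assert resolve_pattern_stub("companies#csv", patterns) == {
    "type": "pandas.CSVDataset",
    "filepath": "companies.csv",
}
assert resolve_pattern_stub("not_a_match", patterns) == {}
```

An unresolvable name yields an empty dict, which is falsy; that is exactly why the real code can use `if dataset_config := config_resolver.resolve_pattern(dataset_name)` to skip such names.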

vizro-core/tests/unit/test_vizro.py

+2-1
@@ -2,11 +2,12 @@
 
 import dash
 import pytest
+from packaging.version import parse
 
 import vizro
 from vizro._constants import VIZRO_ASSETS_PATH
 
-_git_branch = vizro.__version__ if "dev" not in vizro.__version__ else "main"
+_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"
 
 
 def test_vizro_bootstrap():
@@ -1,7 +1,15 @@
-companies:
-  type: pandas.JSONDataset
-  filepath: companies.json
+"{pandas_factory}#csv":
+  type: pandas.CSVDataset
+  filepath: "{pandas_factory}.csv"
 
-reviews:
+pandas_excel:
+  type: pandas.ExcelDataset
+  filepath: pandas_excel.xlsx
+
+pandas_parquet:
+  type: pandas.ParquetDataset
+  filepath: pandas_parquet.parquet
+
+not_dataframe:
   type: pickle.PickleDataset
-  filepath: reviews.pkl
+  filepath: pickle.pkl
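The test fixture above deliberately mixes pandas-backed datasets with a pickle dataset, so the tests can check that only pandas datasets become Vizro data sources. The filter in `datasets_from_catalog` can be sketched over the fixture's explicit entries (the dict below mirrors the YAML by hand):

```python
# Explicit dataset configs from the fixture, minus the factory pattern.
configs = {
    "pandas_excel": {"type": "pandas.ExcelDataset"},
    "pandas_parquet": {"type": "pandas.ParquetDataset"},
    "not_dataframe": {"type": "pickle.PickleDataset"},
}

# datasets_from_catalog keeps only configs whose type mentions pandas.
pandas_only = {name for name, cfg in configs.items() if "pandas" in cfg["type"]}
assert pandas_only == {"pandas_excel", "pandas_parquet"}
```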
