Commit

Merge branch 'noklam/stress-testing-runners-4127' of github.com:kedro-org/kedro into noklam/stress-testing-runners-4127

Signed-off-by: Nok Lam Chan <[email protected]>
noklam committed Oct 28, 2024
2 parents a219069 + 4907858 commit 7b00c0d
Showing 10 changed files with 395 additions and 38 deletions.
79 changes: 79 additions & 0 deletions .github/workflows/pipeline-performance-test.yml
@@ -0,0 +1,79 @@
name: Trigger and Run Pipeline Performance Test

on:
  pull_request:
    types: [labeled]

jobs:
  performance-test:
    runs-on: ubuntu-latest

    steps:
      - name: Check if 'performance' label was added
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Performance label detected. Running performance test."

      - name: Clone test repo
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          git clone https://x-access-token:${{ secrets.GH_TAGGING_TOKEN }}@github.com/kedro-org/pipeline-performance-test.git

      - name: Set up Python 3.11
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          pip install kedro
          pip install uv
          cd pipeline-performance-test/performance-test
          pip install -r requirements.txt

      - name: Run performance test and capture time for latest Kedro release
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          cd pipeline-performance-test/performance-test
          total_time_release=0.0
          for i in {1..10}; do
            { time kedro run; } 2> release_time_output.txt
            real_time_release=$(grep real release_time_output.txt | awk '{print $2}' | sed 's/[^0-9.]//g')
            total_time_release=$(echo "$total_time_release + $real_time_release" | bc)
          done
          average_time_release=$(echo "scale=3; $total_time_release / 10" | bc)
          echo "average_time_release=${average_time_release}" >> $GITHUB_ENV

      - name: Pull specific branch from Kedro and install it
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          git clone --branch ${{ github.event.pull_request.head.ref }} https://github.com/kedro-org/kedro.git
          cd kedro
          make install

      - name: Run performance test and capture time for specific Kedro branch
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          cd pipeline-performance-test/performance-test
          total_time_branch=0.0
          for i in {1..10}; do
            { time kedro run --params=hook_delay=0,dataset_load_delay=0,file_save_delay=0; } 2> branch_time_output.txt
            real_time_branch=$(grep real branch_time_output.txt | awk '{print $2}' | sed 's/[^0-9.]//g')
            total_time_branch=$(echo "$total_time_branch + $real_time_branch" | bc)
          done
          average_time_branch=$(echo "scale=3; $total_time_branch / 10" | bc)
          echo "average_time_branch=${average_time_branch}" >> $GITHUB_ENV

      - name: Extract and format real time from release version
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Average elapsed time for Kedro release version test was ${average_time_release} seconds"

      - name: Extract and format real time from specific branch
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Average elapsed time for specific branch test was ${average_time_branch} seconds"

      - name: Clean up time output files
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          rm pipeline-performance-test/performance-test/release_time_output.txt pipeline-performance-test/performance-test/branch_time_output.txt
1 change: 1 addition & 0 deletions RELEASE.md
@@ -11,6 +11,7 @@
## Breaking changes to the API
## Documentation changes
* Updated CLI autocompletion docs with new Click syntax.
* Standardised `.parquet` suffix in docs and tests.

## Community contributions
* [Hyewon Choi](https://github.com/hyew0nChoi)
26 changes: 25 additions & 1 deletion docs/source/data/index.md
@@ -1,5 +1,5 @@

# The Kedro Data Catalog
# Data Catalog

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.
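For illustration, a minimal `catalog.yml` entry might look like the following sketch; the dataset name and file path here are only examples:

```yaml
# Hypothetical example entry: the key "companies" is the name used as a node
# input/output, and it maps to a CSV file loaded with pandas.CSVDataset.
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
```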

@@ -46,3 +46,27 @@ This section on handling data with Kedro concludes with an advanced use case, ill
how_to_create_a_custom_dataset
```

## `KedroDataCatalog` (experimental feature)

As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.

At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
* Simplified dataset access: `_FrozenDatasets` has been replaced with a public `get` method to retrieve datasets.
* Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets.
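As a brief sketch of what these improvements look like in practice (an illustrative example only, assuming a `catalog` instance of `KedroDataCatalog` with a dataset registered under `"reviews"`):

```python
reviews_ds = catalog.get("reviews")  # public get() replaces _FrozenDatasets access
reviews_ds = catalog["reviews"]      # dict-like retrieval
for ds_name in catalog:              # dict-like iteration over dataset names
    ...
```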

For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page.

```{toctree}
:maxdepth: 1
kedro_data_catalog
```

The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.

```{note}
`KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`.
```

We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://kedro-org.slack.com).
102 changes: 102 additions & 0 deletions docs/source/data/kedro_data_catalog.md
@@ -0,0 +1,102 @@
# Kedro Data Catalog
`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` [documentation](./data_catalog.md) before exploring the additional functionality of `KedroDataCatalog`.

This page highlights the new features and provides usage examples:
* [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run)
* [How to access datasets in the catalog](#how-to-access-datasets-in-the-catalog)
* [How to add datasets to the catalog](#how-to-add-datasets-to-the-catalog)
* [How to iterate through datasets in the catalog](#how-to-iterate-through-datasets-in-the-catalog)
* [How to get the number of datasets in the catalog](#how-to-get-the-number-of-datasets-in-the-catalog)
* [How to print the full catalog and individual datasets](#how-to-print-the-full-catalog-and-individual-datasets)
* [How to access dataset patterns](#how-to-access-dataset-patterns)

## How to make `KedroDataCatalog` the default catalog for Kedro `run`

To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows:

```python
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

Once this change is made, you can run your Kedro project as usual.

For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md).

## How to access datasets in the catalog

You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method:

```python
reviews_ds = catalog["reviews"]
reviews_ds = catalog.get("reviews", default=default_ds)
```

## How to add datasets to the catalog

The new API allows you to add datasets as well as raw data directly to the catalog:

```python
from kedro_datasets.pandas import CSVDataset

bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data
```

When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.
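As an illustrative sketch (assuming the `catalog` and the `"cars"` entry from the example above), retrieving the entry shows the wrapped object:

```python
cars_ds = catalog["cars"]      # the raw list was wrapped in a MemoryDataset
print(type(cars_ds).__name__)  # expected: MemoryDataset
print(cars_ds.load())          # expected: ['Ferrari', 'Audi']
```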

## How to iterate through datasets in the catalog

`KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). Iteration defaults to dataset names, similar to standard Python dictionaries:

```python
for ds_name in catalog:  # __iter__ defaults to keys
    pass

for ds_name in catalog.keys():  # Iterate over dataset names
    pass

for ds in catalog.values():  # Iterate over datasets
    pass

for ds_name, ds in catalog.items():  # Iterate over (name, dataset) tuples
    pass

## How to get the number of datasets in the catalog

You can get the number of datasets in the catalog using the `len()` function:

```python
ds_count = len(catalog)
```

## How to print the full catalog and individual datasets

To print the catalog or an individual dataset programmatically, use the `print()` function; in an interactive environment such as IPython or JupyterLab, simply enter the variable:

```bash
In [1]: catalog
Out[1]: {'shuttles': kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}), 'preprocessed_companies': kedro_datasets.pandas.parquet_dataset.ParquetDataset(filepath=PurePosixPath('/data/02_intermediate/preprocessed_companies.pq'), protocol='file', load_args={}, save_args={}), 'params:model_options.test_size': kedro.io.memory_dataset.MemoryDataset(data='<float>'), 'params:model_options.features': kedro.io.memory_dataset.MemoryDataset(data='<list>')}

In [2]: catalog["shuttles"]
Out[2]: kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'})
```

## How to access dataset patterns

The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog:

```python
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns
```

```{note}
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`).
```

For a full list of supported methods, refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py).
26 changes: 13 additions & 13 deletions docs/source/data/kedro_dataset_factories.md
@@ -164,21 +164,21 @@ entries share `type`, `file_format` and `save_args`:
```yaml
processing.factory_data:
type: spark.SparkDataset
filepath: data/processing/factory_data.pq
filepath: data/processing/factory_data.parquet
file_format: parquet
save_args:
mode: overwrite
processing.process_data:
type: spark.SparkDataset
filepath: data/processing/process_data.pq
filepath: data/processing/process_data.parquet
file_format: parquet
save_args:
mode: overwrite
modelling.metrics:
type: spark.SparkDataset
filepath: data/modelling/factory_data.pq
filepath: data/modelling/factory_data.parquet
file_format: parquet
save_args:
mode: overwrite
@@ -189,7 +189,7 @@ This could be generalised to the following pattern:
```yaml
"{layer}.{dataset_name}":
type: spark.SparkDataset
filepath: data/{layer}/{dataset_name}.pq
filepath: data/{layer}/{dataset_name}.parquet
file_format: parquet
save_args:
mode: overwrite
Expand All @@ -202,7 +202,7 @@ You can have multiple dataset factories in your catalog. For example:
```yaml
"{namespace}.{dataset_name}@spark":
type: spark.SparkDataset
filepath: data/{namespace}/{dataset_name}.pq
filepath: data/{namespace}/{dataset_name}.parquet
file_format: parquet
"{dataset_name}@csv":
@@ -255,19 +255,19 @@ Consider a catalog file with the following patterns:
"preprocessed_{dataset_name}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_{dataset_name}.pq
filepath: data/02_intermediate/preprocessed_{dataset_name}.parquet
"processed_{dataset_name}":
type: pandas.ParquetDataset
filepath: data/03_primary/processed_{dataset_name}.pq
filepath: data/03_primary/processed_{dataset_name}.parquet
"{dataset_name}_csv":
type: pandas.CSVDataset
filepath: data/03_primary/{dataset_name}.csv
"{namespace}.{dataset_name}_pq":
type: pandas.ParquetDataset
filepath: data/03_primary/{dataset_name}_{namespace}.pq
filepath: data/03_primary/{dataset_name}_{namespace}.parquet
"{default_dataset}":
type: pickle.PickleDataset
@@ -315,11 +315,11 @@ shuttles:
"preprocessed_{name}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_{name}.pq
filepath: data/02_intermediate/preprocessed_{name}.parquet
"{default}":
type: pandas.ParquetDataset
filepath: data/03_primary/{default}.pq
filepath: data/03_primary/{default}.parquet
```
</details>

@@ -365,13 +365,13 @@ companies:
filepath: data/01_raw/companies.csv
type: pandas.CSVDataset
model_input_table:
filepath: data/03_primary/model_input_table.pq
filepath: data/03_primary/model_input_table.parquet
type: pandas.ParquetDataset
preprocessed_companies:
filepath: data/02_intermediate/preprocessed_companies.pq
filepath: data/02_intermediate/preprocessed_companies.parquet
type: pandas.ParquetDataset
preprocessed_shuttles:
filepath: data/02_intermediate/preprocessed_shuttles.pq
filepath: data/02_intermediate/preprocessed_shuttles.parquet
type: pandas.ParquetDataset
reviews:
filepath: data/01_raw/reviews.csv
2 changes: 1 addition & 1 deletion docs/source/integrations/mlflow.md
@@ -195,7 +195,7 @@ For that, you can make use of {ref}`runtime parameters <runtime-params>`:
# Add the intermediate datasets to run only the inference
X_test:
type: pandas.ParquetDataset
filepath: data/05_model_input/X_test.pq
filepath: data/05_model_input/X_test.parquet
y_test:
type: pandas.CSVDataset # https://github.com/pandas-dev/pandas/issues/54638
6 changes: 3 additions & 3 deletions docs/source/tutorial/create_a_pipeline.md
@@ -200,11 +200,11 @@ Each of the nodes outputs a new dataset (`preprocessed_companies` and `preproces
```yaml
preprocessed_companies:
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_companies.pq
filepath: data/02_intermediate/preprocessed_companies.parquet

preprocessed_shuttles:
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_shuttles.pq
filepath: data/02_intermediate/preprocessed_shuttles.parquet
```
</details>
@@ -290,7 +290,7 @@ The following entry in `conf/base/catalog.yml` saves the model input table datas
```yaml
model_input_table:
type: pandas.ParquetDataset
filepath: data/03_primary/model_input_table.pq
filepath: data/03_primary/model_input_table.parquet
```

## Test the example again