Commit

Merge branch 'noklam/stress-testing-runners-4127' of github.com:kedro-org/kedro into noklam/stress-testing-runners-4127

Signed-off-by: Nok Lam Chan <[email protected]>
noklam committed Oct 28, 2024
2 parents a219069 + 4907858 commit 7b00c0d
Showing 10 changed files with 395 additions and 38 deletions.
79 changes: 79 additions & 0 deletions .github/workflows/pipeline-performance-test.yml
@@ -0,0 +1,79 @@
name: Trigger and Run Pipeline Performance Test

on:
  pull_request:
    types: [labeled]

jobs:
  performance-test:
    runs-on: ubuntu-latest

    steps:
      - name: Check if 'performance' label was added
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Performance label detected. Running performance test."

      - name: Clone test repo
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          git clone https://x-access-token:${{ secrets.GH_TAGGING_TOKEN }}@github.com/kedro-org/pipeline-performance-test.git

      - name: Set up Python 3.11
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          pip install kedro
          pip install uv
          cd pipeline-performance-test/performance-test
          pip install -r requirements.txt

      - name: Run performance test and capture time for latest Kedro release
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          cd pipeline-performance-test/performance-test
          total_time_release=0.0
          for i in {1..10}; do
            { time kedro run; } 2> release_time_output.txt
            real_time_release=$(grep real release_time_output.txt | awk '{print $2}' | sed 's/[^0-9.]//g')
            total_time_release=$(echo "$total_time_release + $real_time_release" | bc)
          done
          average_time_release=$(echo "scale=3; $total_time_release / 10" | bc)
          echo "average_time_release=${average_time_release}" >> $GITHUB_ENV

      - name: Pull specific branch from Kedro and install it
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          git clone --branch ${{ github.event.pull_request.head.ref }} https://github.com/kedro-org/kedro.git
          cd kedro
          make install

      - name: Run performance test and capture time for specific Kedro branch
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          cd pipeline-performance-test/performance-test
          total_time_branch=0.0
          for i in {1..10}; do
            { time kedro run --params=hook_delay=0,dataset_load_delay=0,file_save_delay=0; } 2> branch_time_output.txt
            real_time_branch=$(grep real branch_time_output.txt | awk '{print $2}' | sed 's/[^0-9.]//g')
            total_time_branch=$(echo "$total_time_branch + $real_time_branch" | bc)
          done
          average_time_branch=$(echo "scale=3; $total_time_branch / 10" | bc)
          echo "average_time_branch=${average_time_branch}" >> $GITHUB_ENV

      - name: Extract and format real time from release version
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Average elapsed time for Kedro release version test was ${average_time_release} seconds"

      - name: Extract and format real time from specific branch
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: echo "Average elapsed time for specific branch test was ${average_time_branch} seconds"

      - name: Clean up time output files
        if: github.event.action == 'labeled' && contains(github.event.label.name, 'performance')
        run: |
          rm pipeline-performance-test/performance-test/release_time_output.txt pipeline-performance-test/performance-test/branch_time_output.txt
1 change: 1 addition & 0 deletions RELEASE.md
@@ -11,6 +11,7 @@
## Breaking changes to the API
## Documentation changes
* Updated CLI autocompletion docs with new Click syntax.
* Standardised `.parquet` suffix in docs and tests.

## Community contributions
* [Hyewon Choi](https://github.com/hyew0nChoi)
26 changes: 25 additions & 1 deletion docs/source/data/index.md
@@ -1,5 +1,5 @@

# The Kedro Data Catalog
# Data Catalog

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.
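For illustration, a minimal `catalog.yml` entry might look like the following sketch; the dataset name and file path here are only examples:

```yaml
# Hypothetical example entry: the key "companies" is the name used as a node
# input/output, and it maps to a CSV file loaded with pandas.CSVDataset.
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
```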

@@ -46,3 +46,27 @@ This section on handling data with Kedro concludes with an advanced use case, ill
how_to_create_a_custom_dataset
```

## `KedroDataCatalog` (experimental feature)

As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.

At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
* Simplified dataset access: `_FrozenDatasets` has been replaced with a public `get` method to retrieve datasets.
* Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets.
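As a brief sketch of what these improvements look like in practice (an illustrative example only, assuming a `catalog` instance of `KedroDataCatalog` with a dataset registered under `"reviews"`):

```python
reviews_ds = catalog.get("reviews")  # public get() replaces _FrozenDatasets access
reviews_ds = catalog["reviews"]      # dict-like retrieval
for ds_name in catalog:              # dict-like iteration over dataset names
    ...
```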

For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page.

```{toctree}
:maxdepth: 1
kedro_data_catalog
```

The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.

```{note}
`KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`.
```

We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://kedro-org.slack.com).
102 changes: 102 additions & 0 deletions docs/source/data/kedro_data_catalog.md
@@ -0,0 +1,102 @@
# Kedro Data Catalog
`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` [documentation](./data_catalog.md) before exploring the additional functionality of `KedroDataCatalog`.

This page highlights the new features and provides usage examples:
* [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run)
* [How to access datasets in the catalog](#how-to-access-datasets-in-the-catalog)
* [How to add datasets to the catalog](#how-to-add-datasets-to-the-catalog)
* [How to iterate through datasets in the catalog](#how-to-iterate-through-datasets-in-the-catalog)
* [How to get the number of datasets in the catalog](#how-to-get-the-number-of-datasets-in-the-catalog)
* [How to print the full catalog and individual datasets](#how-to-print-the-full-catalog-and-individual-datasets)
* [How to access dataset patterns](#how-to-access-dataset-patterns)

## How to make `KedroDataCatalog` the default catalog for Kedro `run`

To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows:

```python
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

Once this change is made, you can run your Kedro project as usual.

For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md).

## How to access datasets in the catalog

You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method:

```python
reviews_ds = catalog["reviews"]
reviews_ds = catalog.get("reviews", default=default_ds)
```

## How to add datasets to the catalog

The new API allows you to add datasets as well as raw data directly to the catalog:

```python
from kedro_datasets.pandas import CSVDataset

bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data
```

When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.
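As an illustrative sketch (assuming the `catalog` and the `"cars"` entry from the example above), retrieving the entry shows the wrapped object:

```python
cars_ds = catalog["cars"]      # the raw list was wrapped in a MemoryDataset
print(type(cars_ds).__name__)  # expected: MemoryDataset
print(cars_ds.load())          # expected: ['Ferrari', 'Audi']
```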

## How to iterate through datasets in the catalog

`KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). Iteration defaults to dataset names, similar to standard Python dictionaries:

```python
for ds_name in catalog:  # __iter__ defaults to keys
    pass

for ds_name in catalog.keys():  # Iterate over dataset names
    pass

for ds in catalog.values():  # Iterate over datasets
    pass

for ds_name, ds in catalog.items():  # Iterate over (name, dataset) tuples
    pass

## How to get the number of datasets in the catalog

You can get the number of datasets in the catalog using the `len()` function:

```python
ds_count = len(catalog)
```

## How to print the full catalog and individual datasets

To print the catalog or an individual dataset programmatically, use the `print()` function; in an interactive environment such as IPython or JupyterLab, simply enter the variable:

```bash
In [1]: catalog
Out[1]: {'shuttles': kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}), 'preprocessed_companies': kedro_datasets.pandas.parquet_dataset.ParquetDataset(filepath=PurePosixPath('/data/02_intermediate/preprocessed_companies.pq'), protocol='file', load_args={}, save_args={}), 'params:model_options.test_size': kedro.io.memory_dataset.MemoryDataset(data='<float>'), 'params:model_options.features': kedro.io.memory_dataset.MemoryDataset(data='<list>')}

In [2]: catalog["shuttles"]
Out[2]: kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'})
```

## How to access dataset patterns

The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog:

```python
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns
```

```{note}
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`).
```

For a full list of supported methods, refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py).
26 changes: 13 additions & 13 deletions docs/source/data/kedro_dataset_factories.md
@@ -164,21 +164,21 @@ entries share `type`, `file_format` and `save_args`:
```yaml
processing.factory_data:
type: spark.SparkDataset
filepath: data/processing/factory_data.pq
filepath: data/processing/factory_data.parquet
file_format: parquet
save_args:
mode: overwrite
processing.process_data:
type: spark.SparkDataset
filepath: data/processing/process_data.pq
filepath: data/processing/process_data.parquet
file_format: parquet
save_args:
mode: overwrite
modelling.metrics:
type: spark.SparkDataset
filepath: data/modelling/factory_data.pq
filepath: data/modelling/factory_data.parquet
file_format: parquet
save_args:
mode: overwrite
@@ -189,7 +189,7 @@ This could be generalised to the following pattern:
```yaml
"{layer}.{dataset_name}":
type: spark.SparkDataset
filepath: data/{layer}/{dataset_name}.pq
filepath: data/{layer}/{dataset_name}.parquet
file_format: parquet
save_args:
mode: overwrite
Expand All @@ -202,7 +202,7 @@ You can have multiple dataset factories in your catalog. For example:
```yaml
"{namespace}.{dataset_name}@spark":
type: spark.SparkDataset
filepath: data/{namespace}/{dataset_name}.pq
filepath: data/{namespace}/{dataset_name}.parquet
file_format: parquet
"{dataset_name}@csv":
@@ -255,19 +255,19 @@ Consider a catalog file with the following patterns:
"preprocessed_{dataset_name}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_{dataset_name}.pq
filepath: data/02_intermediate/preprocessed_{dataset_name}.parquet
"processed_{dataset_name}":
type: pandas.ParquetDataset
filepath: data/03_primary/processed_{dataset_name}.pq
filepath: data/03_primary/processed_{dataset_name}.parquet
"{dataset_name}_csv":
type: pandas.CSVDataset
filepath: data/03_primary/{dataset_name}.csv
"{namespace}.{dataset_name}_pq":
type: pandas.ParquetDataset
filepath: data/03_primary/{dataset_name}_{namespace}.pq
filepath: data/03_primary/{dataset_name}_{namespace}.parquet
"{default_dataset}":
type: pickle.PickleDataset
@@ -315,11 +315,11 @@ shuttles:
"preprocessed_{name}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_{name}.pq
filepath: data/02_intermediate/preprocessed_{name}.parquet
"{default}":
type: pandas.ParquetDataset
filepath: data/03_primary/{default}.pq
filepath: data/03_primary/{default}.parquet
```
</details>

@@ -365,13 +365,13 @@ companies:
filepath: data/01_raw/companies.csv
type: pandas.CSVDataset
model_input_table:
filepath: data/03_primary/model_input_table.pq
filepath: data/03_primary/model_input_table.parquet
type: pandas.ParquetDataset
preprocessed_companies:
filepath: data/02_intermediate/preprocessed_companies.pq
filepath: data/02_intermediate/preprocessed_companies.parquet
type: pandas.ParquetDataset
preprocessed_shuttles:
filepath: data/02_intermediate/preprocessed_shuttles.pq
filepath: data/02_intermediate/preprocessed_shuttles.parquet
type: pandas.ParquetDataset
reviews:
filepath: data/01_raw/reviews.csv
2 changes: 1 addition & 1 deletion docs/source/integrations/mlflow.md
@@ -195,7 +195,7 @@ For that, you can make use of {ref}`runtime parameters <runtime-params>`:
# Add the intermediate datasets to run only the inference
X_test:
type: pandas.ParquetDataset
filepath: data/05_model_input/X_test.pq
filepath: data/05_model_input/X_test.parquet
y_test:
type: pandas.CSVDataset # https://github.com/pandas-dev/pandas/issues/54638
6 changes: 3 additions & 3 deletions docs/source/tutorial/create_a_pipeline.md
@@ -200,11 +200,11 @@ Each of the nodes outputs a new dataset (`preprocessed_companies` and `preproces
```yaml
preprocessed_companies:
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_companies.pq
filepath: data/02_intermediate/preprocessed_companies.parquet

preprocessed_shuttles:
type: pandas.ParquetDataset
filepath: data/02_intermediate/preprocessed_shuttles.pq
filepath: data/02_intermediate/preprocessed_shuttles.parquet
```
</details>
@@ -290,7 +290,7 @@ The following entry in `conf/base/catalog.yml` saves the model input table datas
```yaml
model_input_table:
type: pandas.ParquetDataset
filepath: data/03_primary/model_input_table.pq
filepath: data/03_primary/model_input_table.parquet
```

## Test the example again