Unpack dictionary parameters #3905

bpmeek · 2024-05-31T22:03:53Z

Description

In response to this ticket

Development notes

Can now unpack dictionaries in the inputs argument of a node, example below.

_unpack_params added to kedro/kedro/pipeline/node.py, this function iterates over node inputs and updates node.inputs if needed.

note: It is assumed that the dictionary entries and kwarg have the same name

Changes:

kedro/kedro/pipeline/modular_pipeline.py was updated to not throw errors when mapping dataset names.
_is_single_parameter returns true if name starts with **params:
_normalize_param_name does not append params: if the name already begins with **params:
_validate_datasets_exist removes datasets that begin with ** from non_existent, I have confirmed that kedro/runner/runner.py will catch these missing datasets.

Undesirable behavior

~~Currently modular pipelines will still namespace the unpacked parameters, I'll need assistance with either updating the rules list or adjusting the rename function in a way that makes sense.~~

EDIT: This is no longer applicable after commit

Examples:

nodes.py

def split_data(data: pd.DataFrame, features: List[str], test_size, random_state) -> Tuple:
    X = data[features]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test

pipeline.py

node(
    func=split_data,
    inputs=["model_input_table", "**params:model_options"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
    name="split_data_node",
),

model_options:
  test_size: 0.2
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

astrojuanlu · 2024-06-20T16:32:11Z

Thanks for this PR @bpmeek! We'll review it shortly 🙏🏼

datajoely · 2024-06-25T10:16:47Z

kedro/pipeline/node.py

+        if _input.startswith("**"):
+            use_new = True
+            dict_root = _input.split(":")[-1]
+            _func_arguments = [arg for arg in inspect.signature(func).parameters]


So there is a soft assumption that the dictionary keys match the function arguments. Do we want to throw a warning or error if:

(a) No arguments match
(b) Partial arguments match

So that is an assumption, I suppose we could also implement tuple unpacking if a user wanted to use index instead of key-value pairs.

The current behavior of no match/partial match is consistent I think with what users would expect. If you pass the dictionary model_options and the function expects a key features and that key is missing you would get the following error before the run.

ValueError: Pipeline input(s) {'params:model_options.features'} not found in the DataCatalog

true! interested to see if others have an opinion though

I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.

I'm a little confused about how we would implement any validation at this point in the process, this function is called when the Node object is being initialized and I don't believe the catalog is available to verify against is it?

there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function.

Can you help me understand this? if your function was something like def foo(float_1, integer_2, string_3): you could pass **params:dictionary and it would work. If your function was def foo(**kwargs) then this implementation wouldn't work but I wasn't under the impression that was what was being asked for, because you could just change it to def foo(kwargs) and then access the values normally from there.

Sure. I don't mean that this is the expected usage example and we aim to make it work. But it doesn't mean that someone cannot make that (or something else that breaks the current unpacking method) by mistake if syntax allows. There might be some other cases as well. To simplify the validation, we can follow the approach proposed above - check no/partial argument match and add some possible incorrect test cases to ensure the validation catches them. At the same time extend the docs with the expected usage example.

I think this is the main outstanding discussion right?

ElenaKhaustova

Thank you for the PR!

Left some comments/suggestions regarding the implementation. My main push is to clarify which cases we cover by adding basic validation, negative tests and small usage example here

ElenaKhaustova · 2024-06-25T15:41:49Z

kedro/pipeline/node.py

+        if _input.startswith("**"):
+            use_new = True
+            dict_root = _input.split(":")[-1]
+            _func_arguments = [arg for arg in inspect.signature(func).parameters]


Should it be outside of the loop since func remains the same?

I think this makes sense and went ahead and made the change, but I'm not sure which would be more efficient, creating a list of every node's function or making the list multiple times if the input is a dictionary.

kedro/pipeline/node.py

ElenaKhaustova · 2024-06-25T15:59:06Z

kedro/pipeline/node.py

+            use_new = True
+            dict_root = _input.split(":")[-1]
+            _func_arguments = [arg for arg in inspect.signature(func).parameters]
+            for param in _func_arguments[idx:]:


Instead of slicing the list, we can just use idx to start looping from it.

I'm not sure I understand what you're recommending.

ElenaKhaustova · 2024-06-25T16:29:20Z

kedro/pipeline/node.py

+        if _input.startswith("**"):
+            use_new = True
+            dict_root = _input.split(":")[-1]
+            _func_arguments = [arg for arg in inspect.signature(func).parameters]


I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.

noklam · 2024-07-16T14:51:14Z

@bpmeek Do you want to continue this PR? I am eager to see this being merged finally.

bpmeek · 2024-07-16T15:46:18Z

@noklam, I can continue it, I think I'm just stuck right now on the best way to implement checking for partial matches, since the current process doesn't have access to the catalog when it's populating the required datasets and the verification doesn't know what dictionaries have been unpacked.

noklam · 2024-07-16T15:57:52Z

Just quick thought this can be done at the node level. There are some node function argument parsing logic done in the %load_node feature already.

The key here seems to be when should this validation step happen. I will put some more thoughts on this tmr.

kedro/pipeline/node.py

Changed name of function, Added checks for dictionary unpacking in function Added negative test for dictionary unpacking in function Signed-off-by: bpmeek <[email protected]>

bpmeek · 2024-07-30T15:17:39Z

Are the additional checks/tests here sufficient or is there still more to be done?

astrojuanlu · 2024-08-19T09:57:09Z

Thanks for your patience @bpmeek! Let us look into this again.

noklam · 2024-08-19T15:25:59Z

I have a long list of PRs so I will only have time to look at this next week earliest.

bpmeek requested a review from merelcht as a code owner May 31, 2024 22:03

bpmeek mentioned this pull request May 31, 2024

Kedro PR will break parameters in kedro-viz kedro-org/kedro-viz#1933

Open

1 task

ankatiyar added the Community Issue/PR opened by the open-source community label Jun 10, 2024

bpmeek requested review from yetudada and astrojuanlu as code owners June 20, 2024 15:50

bpmeek force-pushed the dict-unpacking branch from 7e307cb to 3b0217c Compare June 20, 2024 15:51

datajoely reviewed Jun 25, 2024

View reviewed changes

ElenaKhaustova reviewed Jun 25, 2024

View reviewed changes

merelcht mentioned this pull request Jul 9, 2024

Close all (or as many as possible) open PRs! #3996

Closed

9 tasks

noklam reviewed Jul 18, 2024

View reviewed changes

kedro/pipeline/node.py Outdated Show resolved Hide resolved

Reset head,

29fd1b3

Changed name of function, Added checks for dictionary unpacking in function Added negative test for dictionary unpacking in function Signed-off-by: bpmeek <[email protected]>

bpmeek force-pushed the dict-unpacking branch from 7d94573 to 29fd1b3 Compare July 18, 2024 21:15

astrojuanlu requested review from ElenaKhaustova, datajoely and noklam August 19, 2024 09:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unpack dictionary parameters #3905

Unpack dictionary parameters #3905

bpmeek commented May 31, 2024 •

edited

Loading

astrojuanlu commented Jun 20, 2024

datajoely Jun 25, 2024

bpmeek Jun 25, 2024

datajoely Jun 25, 2024

ElenaKhaustova Jun 25, 2024 •

edited

Loading

bpmeek Jun 25, 2024

ElenaKhaustova Jun 26, 2024

astrojuanlu Jul 12, 2024

ElenaKhaustova left a comment •

edited

Loading

ElenaKhaustova Jun 25, 2024

bpmeek Jul 18, 2024

ElenaKhaustova Jun 25, 2024

bpmeek Jul 18, 2024

ElenaKhaustova Jun 25, 2024 •

edited

Loading

noklam commented Jul 16, 2024

bpmeek commented Jul 16, 2024

noklam commented Jul 16, 2024

bpmeek commented Jul 30, 2024

astrojuanlu commented Aug 19, 2024

noklam commented Aug 19, 2024

Unpack dictionary parameters #3905

Are you sure you want to change the base?

Unpack dictionary parameters #3905

Conversation

bpmeek commented May 31, 2024 • edited Loading

Description

Development notes

Changes:

Undesirable behavior

Examples:

Developer Certificate of Origin

Checklist

astrojuanlu commented Jun 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ElenaKhaustova Jun 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ElenaKhaustova left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ElenaKhaustova Jun 25, 2024 • edited Loading

Choose a reason for hiding this comment

noklam commented Jul 16, 2024

bpmeek commented Jul 16, 2024

noklam commented Jul 16, 2024

bpmeek commented Jul 30, 2024

astrojuanlu commented Aug 19, 2024

noklam commented Aug 19, 2024

bpmeek commented May 31, 2024 •

edited

Loading

ElenaKhaustova Jun 25, 2024 •

edited

Loading

ElenaKhaustova left a comment •

edited

Loading

ElenaKhaustova Jun 25, 2024 •

edited

Loading