Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpack dictionary parameters #3905

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

bpmeek
Copy link

@bpmeek bpmeek commented May 31, 2024

Description

In response to this ticket

Development notes

Can now unpack dictionaries in the inputs argument of a node, example below.

_unpack_params added to kedro/kedro/pipeline/node.py, this function iterates over node inputs and updates node.inputs if needed.

note: It is assumed that the dictionary entries and kwarg have the same name

Changes:

kedro/kedro/pipeline/modular_pipeline.py was updated to not throw errors when mapping dataset names.
_is_single_parameter returns true if name starts with **params:
_normalize_param_name does not append params: if the name already begins with **params:
_validate_datasets_exist removes datasets that begin with ** from non_existent, I have confirmed that kedro/runner/runner.py will catch these missing datasets.

Undesirable behavior

Currently modular pipelines will still namespace the unpacked parameters, I'll need assistance with either updating the rules list or adjusting the rename function in a way that makes sense.

EDIT: This is no longer applicable after commit

Examples:

nodes.py

def split_data(data: pd.DataFrame, features: List[str], test_size, random_state) -> Tuple:
    X = data[features]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test

pipeline.py

node(
    func=split_data,
    inputs=["model_input_table", "**params:model_options"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
    name="split_data_node",
),
model_options:
  test_size: 0.2
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@bpmeek bpmeek requested a review from merelcht as a code owner May 31, 2024 22:03
@ankatiyar ankatiyar added the Community Issue/PR opened by the open-source community label Jun 10, 2024
@astrojuanlu
Copy link
Member

Thanks for this PR @bpmeek! We'll review it shortly 🙏🏼

if _input.startswith("**"):
use_new = True
dict_root = _input.split(":")[-1]
_func_arguments = [arg for arg in inspect.signature(func).parameters]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is a soft assumption that the dictionary keys match the function arguments. Do we want to throw a warning or error if:

(a) No arguments match
(b) Partial arguments match

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that is an assumption, I suppose we could also implement tuple unpacking if a user wanted to use index instead of key-value pairs.

The current behavior of no match/partial match is consistent I think with what users would expect. If you pass the dictionary model_options and the function expects a key features and that key is missing you would get the following error before the run.

ValueError: Pipeline input(s) {'params:model_options.features'} not found in the DataCatalog

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true! interested to see if others have an opinion though

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about how we would implement any validation at this point in the process, this function is called when the Node object is being initialized and I don't believe the catalog is available to verify against is it?

there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function.

Can you help me understand this? if your function was something like def foo(float_1, integer_2, string_3): you could pass **params:dictionary and it would work. If your function was def foo(**kwargs) then this implementation wouldn't work but I wasn't under the impression that was what was being asked for, because you could just change it to def foo(kwargs) and then access the values normally from there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I don't mean that this is the expected usage example and we aim to make it work. But it doesn't mean that someone cannot make that (or something else that breaks the current unpacking method) by mistake if syntax allows. There might be some other cases as well. To simplify the validation, we can follow the approach proposed above - check no/partial argument match and add some possible incorrect test cases to ensure the validation catches them. At the same time extend the docs with the expected usage example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the main outstanding discussion right?

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

Left some comments/suggestions regarding the implementation. My main push is to clarify which cases we cover by adding basic validation, negative tests and small usage example here

if _input.startswith("**"):
use_new = True
dict_root = _input.split(":")[-1]
_func_arguments = [arg for arg in inspect.signature(func).parameters]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be outside of the loop since func remains the same?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense and went ahead and made the change, but I'm not sure which would be more efficient, creating a list of every node's function or making the list multiple times if the input is a dictionary.

kedro/pipeline/node.py Show resolved Hide resolved
kedro/pipeline/node.py Outdated Show resolved Hide resolved
use_new = True
dict_root = _input.split(":")[-1]
_func_arguments = [arg for arg in inspect.signature(func).parameters]
for param in _func_arguments[idx:]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of slicing the list, we can just use idx to start looping from it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you're recommending.

if _input.startswith("**"):
use_new = True
dict_root = _input.split(":")[-1]
_func_arguments = [arg for arg in inspect.signature(func).parameters]
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.

@noklam
Copy link
Contributor

noklam commented Jul 16, 2024

@bpmeek Do you want to continue this PR? I am eager to see this being merged finally.

@bpmeek
Copy link
Author

bpmeek commented Jul 16, 2024

@noklam, I can continue it, I think I'm just stuck right now on the best way to implement checking for partial matches, since the current process doesn't have access to the catalog when it's populating the required datasets and the verification doesn't know what dictionaries have been unpacked.

@noklam
Copy link
Contributor

noklam commented Jul 16, 2024

Just quick thought this can be done at the node level. There are some node function argument parsing logic done in the %load_node feature already.

The key here seems to be when should this validation step happen. I will put some more thoughts on this tmr.

kedro/pipeline/node.py Outdated Show resolved Hide resolved
Changed name of function,
Added checks for dictionary unpacking in function
Added negative test for dictionary unpacking in function

Signed-off-by: bpmeek <[email protected]>
@bpmeek
Copy link
Author

bpmeek commented Jul 30, 2024

Are the additional checks/tests here sufficient or is there still more to be done?

@astrojuanlu
Copy link
Member

Thanks for your patience @bpmeek! Let us look into this again.

@noklam
Copy link
Contributor

noklam commented Aug 19, 2024

I have a long list of PRs so I will only have time to look at this next week earliest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

6 participants