Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add user defined YAML tools #19434

Draft
wants to merge 67 commits into
base: dev
Choose a base branch
from

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Jan 21, 2025

This work enhances the existing YAML tool format to bring it close to feature parity with XML tools, and strips inherently unsafe elements, which should eventually allow a subset of trusted users to bring their own tools.

I'll follow up with a more extensive description, but here's a screenshot of the embedded tool editor.

Screenshot 2025-01-21 at 15 59 57

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

container: str


class AdminToolSource(ToolSourceBase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally fine just dropping the Admin version here if it is easier. The point was to establish some abstractions that could be used by the feature you're implementing and CWL - you've gone a lot farther with the user tools so I'm happy to drop any old admin-only, insecure functionality if it would help. Either way though is fine.

# and is scoped to to individual user and never adds to global toolbox
dynamic_tools_manager: DynamicToolManager = depends(DynamicToolManager)

@router.get("/api/unprivileged_tools")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I would prefer a name like 'user_tools' but if it is a big change - feel free to ignore. Also sorry I'm reviewing by commit so if any of these changes are undone in future commits feel free to ignore.

Also like the last comment - feel free to just dump the dynamic tools API endpoint and replace it with this. I think I like that name better also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely open to changing how we name things, we did already have 3 or 4 different versions when we hacked on this in Berlin. I'll keep that for a last pass when it's nearing completion though.

@jmchilton
Copy link
Member

This is amazing. I don't hate any of it. The CWL fields I think we would struggle to fill out with our runtime is the only part that caused significant stress but I didn't see an attempt to fill those out. I would have semantic questions like is name going to be the name or collection identifier, etc.. but I don't think those are details you've tackled yet unless I missed the commit.

I would have started with locking tools down at the XML layer and have dozens of test cases around making sure tool action expressions cannot be evaluated, etc.. for unprivileged tools but I understand that part is pretty unsexy and I think there is some chance that having a fully defined model means those things might be completely unreachable and so that might have been unnecessary work. I think we need to at least audit all the features before the final merge.

I created a list of things I'd like to see to broken out into smaller PRs to clean up the core as I was reviewing the commits. None of this is essential - if it works, it works - but any of that extra effort would be appreciated and would ease follow up reviews I think and help isolate potential problems.

  • Other John will want the migration pulled into its own commit, I’d like it if you used abstractions from migration util instead of alembic directly.
  • Fix YAML and tool output default handling in f7c0014 (Rebased with d1e7bb8)
  • Mapping type fixes in model/init.py in c2b72e0
  • Component refactor in 9f8ed6d
  • Typing for trans in _workflow_to_dict_run in b3a98d7
  • get to get_one in tools/init in b3a98d7
  • Fix 1d0977c I think?
  • 782caae (container parsing from dicts)
  • de27de2 (collection type in parameter models)
  • f756bcb (better discriminators in parameter models)
  • 95b2eb4 (fix conditionals in YAML tools)
  • ec05106 (Fix job parameter summary for inputs without label)
  • Modeling existing requirements in parameters/interface in 6a98fe8 (would be hard to unwind with new requirement though)

Copy link
Member

@nsoranzo nsoranzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions from my current work on the cwl-1.0 branch.

)

from galaxy import (
exceptions,
model,
)
from galaxy.exceptions import DuplicatedIdentifierException
from galaxy.model import DynamicTool
from galaxy.managers.context import ProvidesUserContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from galaxy.managers.context import ProvidesUserContext

lib/galaxy/managers/tools.py Outdated Show resolved Hide resolved
lib/galaxy/workflow/modules.py Show resolved Hide resolved
@@ -1381,6 +1400,62 @@ def test_dynamic_tool_no_id(self):
output_content = self.dataset_populator.get_history_dataset_content(history_id)
assert output_content == "Hello World 2\n"

# This works except I don't want to add it to the schema right now,
# since I think the shell_command is what we'lll go with (at least initially)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# since I think the shell_command is what we'lll go with (at least initially)
# since I think the shell_command is what we'll go with (at least initially)

@jmchilton
Copy link
Member

I think we need to at least audit all the features before the final merge.

Maybe the way to address this is worrying about documentation. A tool translation guide maybe where each feature from XML is listed (under contents in https://docs.galaxyproject.org/en/master/dev/schema.html) and how to port it to YAML and if there are any security considerations. I did a lot of work in syncing XSD and YAML model docs in #18787 - I think we will want something like that for the broader tools right? We will need to keep model docs and XSD docs synchronized but also have separate customizations for each. It is kind of a hard problem but worth thinking about and maybe capturing security concerns at this point.

@mvdbeek
Copy link
Member Author

mvdbeek commented Jan 23, 2025

I would have semantic questions like is name going to be the name or collection identifier, etc.. but I don't think those are details you've tackled yet unless I missed the commit.

We already had to do that for the conditional step work, the implementation is in galaxy.workflows.modules.to_cwl. The keys for the inputs object is the input name, the inner elements are the element identifiers (since there's no name anyway for nested collections). For leaf elements (i.e class: File), basename is defined as value.dataset.created_from_basename or element_identifier or value.name ... which is debatable.

I think we need to at least audit all the features before the final merge.

Definitely, I didn't want to go down that route before getting a bit of feedback that we're moving in the right direction, and convincing myself that authoring yaml tools with the schema and shell_command autocompletion is a nice experience. Yes, the input model helps in avoiding some things, but right now you still have cheetah available e.g in the label section of the outputs for instance, that probably needs to go as well.

I agree that there is a lot of work left, and some work that can be pulled out separately, I'll do that as I keep making progress here.

how to port it to YAML and if there are any security considerations

Since we parse both XML and YAML into ToolSource, and we can build the representation out of the ToolSource there is also an opportunity to convert existing XML tools (minus the cheeath, configfiles etc things). That's a nice way of exploring how it'll feel to build complex tools in the YAML variant.

think we will want something like that for the broader tools right? We will need to keep model docs and XSD docs synchronized

Indeed, that's probably the next thing to do to make the json schema really helpful. I have for instance sprinkled a few annotation into the model, and I just copied the relevant part from the XSD. We should keep that in sync somehow.
Screenshot 2025-01-23 at 17 25 54

hidden: Mapped[Optional[bool]] = mapped_column(default=False)
active: Mapped[Optional[bool]] = mapped_column(default=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 columns are also present in "DynamicTool", is the duplication because the plan is to make DynamicTools sharable by associating multiple user ids to the same tool id?

The primary benefit is that the command section does not have access the
app or any dangling reference to the database connection or any other
secrets. There are two flavors here, one uses base_command and
arguments, and allows building up an (escaped) argv list, the other is a
shortcut for writing shell scripts and feels maybe a bit more like
writing a very simple cheetah section.

base_command:

```yml
name: base_command tool
class: GalaxyTool
version: 1.0.0
base_command: cat
arguments:
- $(inputs.input.path)
- '>'
- output.fastq
inputs:
- type: data
  name: input
outputs:
  output:
    type: data
    from_work_dir: output.fastq
    name: output
```

shell_command style:

```yml
name: shell_command tool
class: GalaxyTool
version: 1.0.0
shell_command: cat '$(inputs.input.path)' > output.fastq
inputs:
- type: data
  name: input
outputs:
  output:
    type: data
    from_work_dir: output.fastq
    name: output
```
Will probably need this later for efficiency and ignoring `$()` outside
of shell_command.
Simply re-use models by index and set values.
It's currently a high-water mark situation, and there will be a warning
once 200 models (i.e. 200 embedded fragments) are created, but that
seems pretty unlikely.
@mvdbeek mvdbeek force-pushed the user_defined_tools branch from ec05106 to a0c1e15 Compare January 31, 2025 17:48
Co-authored-by: Nicola Soranzo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants