Move file virtualization in toil-wdl-runner to task boundaries #5028

stxue1 · 2024-07-19T23:30:08Z

This should solve #5004

This also reverts #4994 and introduces another fix for #4988; there should be behavioral parity with miniwdl

This should also resolve #5031

This is a bit of an overhaul of how we handle files in toil-wdl-runner. Instead of virtualizing immediately upon seeing a File type (coerced or not), the WDL value is kept as a path until the last moment: files are virtualized only just before tasks are sent off. The WDL-side functionality should stay the same, and File to String coercion now works because we no longer replace every File path with our virtualized representation immediately.
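A rough sketch of the idea, not the code in this PR: `FileValue`, `coerce_to_string`, and `virtualize_at_task_boundary` are hypothetical stand-ins for the MiniWDL value type and the Toil file store, and the `toy-store:` identifiers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileValue:
    """Hypothetical stand-in for a WDL File value; holds the path the workflow sees."""
    value: str


def coerce_to_string(f: FileValue) -> str:
    # At workflow level the value is still the original path,
    # so File -> String coercion returns what the user wrote.
    return f.value


def virtualize_at_task_boundary(inputs: Dict[str, object],
                                store: Dict[str, str]) -> Dict[str, object]:
    """Replace File paths with toy store IDs only when inputs leave for a task."""
    sent: Dict[str, object] = {}
    for name, v in inputs.items():
        if isinstance(v, FileValue):
            virtual = f"toy-store:{len(store)}"  # made-up ID scheme
            store[virtual] = v.value
            sent[name] = FileValue(virtual)
        else:
            sent[name] = v
    return sent
```

The workflow-level bindings are left untouched; only the copy handed to the task carries virtualized names.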

The monkeypatching is removed, since we no longer virtualize at the coercion step; instead we virtualize before/after function calls and manually at task boundaries.

I don't think this will increase the amount of IO. One edge case is if the user changes the filename right before it is passed into a function; Toil will virtualize the new filename for that function call and keep both the original and the new virtualized file around in the jobstore until the job is completed. It may be worth removing the new virtualized file from the jobstore after use, but I'm not sure where to hook in the behavior.

~~The downside is that all jobs that are marked as local must always be local in the future. I explicitly set the flag in the class constructors where I could.~~ I think the virtualization behavior should ensure that local jobs on workers still work.

The function evaluate_bindings_from_decls is now used for all decl parsing/evaluation to reduce duplicative code.

Changelog Entry

To be copied to the draft changelog by merger:

File virtualization in toil-wdl-runner now only happens at task boundaries
- File to String coercion should be supported

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passes tests.
Make sure the PR has been reviewed since its last modification. If not, review it.
Merge with the GitHub "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

adamnovak

I'm concerned that by leaving relative URLs from the input JSON as relative URLs in WDL Files and then trying to figure out where they were meant to be relative to later, we've introduced a whole universe of cross-talk that we don't want to deal with. Maybe we can poll for where they ought to come from and turn them into absolute URLs before handing them to the workflow?
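As a sketch of what resolving up front could look like (the helper name `candidate_absolute_uris` and the base URLs in the usage note are hypothetical, and this is not the function the PR ends up with), relative names can be expanded against a list of search bases before the workflow ever sees them:

```python
from typing import Iterator, List
from urllib.parse import urljoin, urlparse


def candidate_absolute_uris(name: str, search_bases: List[str]) -> Iterator[str]:
    """Yield absolute URIs a possibly-relative input name could refer to."""
    if urlparse(name).scheme:
        # Already absolute (has a scheme), so there is nothing to resolve.
        yield name
        return
    for base in search_bases:
        # urljoin resolves against the base's directory, so make sure
        # the base is treated as a directory.
        if not base.endswith("/"):
            base += "/"
        yield urljoin(base, name)
```

For example, `list(candidate_absolute_uris("data/a.txt", ["https://example.org/inputs"]))` gives one candidate under that base, and the first candidate that actually exists would be the one handed to the workflow.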

I like that we're ripping out the coercion monkey-patch and also nonexistent:, which simplifies things. But are we sure that we retain support for using Toil's supported URI schemes when MiniWDL reads a file? File the_file = "gs://google-bucket-name/filename.txt" ought to work if Toil has Google Storage support installed, even if MiniWDL doesn't know about reading Google Storage URLs. I think the coercion hook was doing that for us and I can't really tell if something is taking its place. If we're losing that feature and we think it is worth it, we should know we're losing it.

I'd also like to see a new comment at the top describing how the system works now, and what invariants or abstractions it imposes that new code is going to need to follow to keep it working. File holds a URL or leader-local path at the workflow level. When we actually read it with read_lines() at the workflow level, does MiniWDL fetch it? If we read_lines() it twice, do we have a cache? If we load it into the file store when it enters a task, and it is passed out of the task unmodified, do we see the original URL in the task output (because everything at workflow level is always an original URL) or the imported file store one?

Is talking about "virtualized" (visible to WDL code) and "devirtualized" (visible to Python open()) file names still the right way to understand what the code is doing? Or should we be using different terms or concepts to think about the packing and unpacking that happens when taking Bindings in and out of tasks?

I also like the idea of wdl_options to hold the global settings for the run, kind of like the CWL RuntimeContext. But I think it needs a type so we can have a better handle on what's allowed to be in there.

adamnovak · 2024-07-30T16:44:18Z

src/toil/wdl/wdltoil.py

+    if isinstance(value, WDL.Value.File):
+        pass
+    elif isinstance(value, WDL.Value.Array) and isinstance(expected_type, WDL.Type.Array):
+        for elem, orig_elem in zip(value.value, original_value.value):
+            map_over_files_in_value_check_null_type(elem, orig_elem, expected_type.item_type)
+    elif isinstance(value, WDL.Value.Map) and isinstance(expected_type, WDL.Type.Map):
+        for pair, orig_pair in zip(value.value, original_value.value):
+            # The key of the map cannot be optional or else it is not serializable, so we only need to check the value
+            map_over_files_in_value_check_null_type(pair[1], orig_pair[1], expected_type.item_type[1])
+    elif isinstance(value, WDL.Value.Pair) and isinstance(expected_type, WDL.Type.Pair):
+        map_over_files_in_value_check_null_type(value.value[0], original_value.value[0], expected_type.left_type)
+        map_over_files_in_value_check_null_type(value.value[1], original_value.value[1], expected_type.right_type)
+    elif isinstance(value, WDL.Value.Struct) and isinstance(expected_type, WDL.Type.StructInstance):
+        for (k, v), (_, orig_v) in zip(value.value.items(), original_value.value.items()):
+            # The parameters method for WDL.Type.StructInstance returns the values rather than the dictionary
+            # While dictionaries are ordered, this should be more robust; the else branch should never be hit
+            if expected_type.members is not None:
+                map_over_files_in_value_check_null_type(v, orig_v, expected_type.members[k])
+    elif isinstance(value, WDL.Value.Null):
+        if not expected_type.optional:
+            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), original_value.value)


Having one version of this traversal logic is what the map function was supposed to be for. But I guess here we're traversing two mirrored structures in parallel so we can't use the other code.
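One way to keep a single traversal while still walking two mirrored structures is a generator that yields matching leaves in pairs. This is only a sketch on plain Python lists and dicts, with `None` standing in for a WDL null; `walk_pairs` and `missing_required` are hypothetical names, and the optional-type check from the diff is left out.

```python
from typing import Any, Iterator, List, Tuple


def walk_pairs(value: Any, original: Any) -> Iterator[Tuple[Any, Any]]:
    """Yield (leaf, original_leaf) pairs from two structures of the same shape."""
    if isinstance(value, list) and isinstance(original, list):
        for v, o in zip(value, original):
            yield from walk_pairs(v, o)
    elif isinstance(value, dict) and isinstance(original, dict):
        # Look up by key rather than relying on matching iteration order.
        for key in value:
            yield from walk_pairs(value[key], original[key])
    else:
        yield value, original


def missing_required(value: Any, original: Any) -> List[Any]:
    """Collect the original leaves whose processed counterpart became None."""
    return [o for v, o in walk_pairs(value, original) if v is None]
```

Any check that needs both the processed value and the original can then be written as a loop over `walk_pairs` instead of repeating the recursion.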

adamnovak · 2024-08-01T20:01:54Z

@stxue1 Another thing I thought of is, if we need to make local=True always run a job on the leader and never run the job elsewhere, as part of this, then we need to remove the command line option that lets you run local jobs on workers. Otherwise using it will break WDL workflows.

We also might need to revise the batch system documentation, and maybe the inheritance hierarchy, since you would need all batch systems, even those coming from plugins, to always have the local job logic.

adamnovak · 2024-08-13T17:11:57Z

We think this will also fix #4992.

adamnovak · 2024-08-14T15:05:06Z

@stxue1 If this fixes #5031, it also needs to enable the test from the conformance tests; it looks like this doesn't touch the tests file.

adamnovak

There are still a couple things I would change, but I think this is fine.

adamnovak · 2024-09-24T16:54:09Z

src/toil/wdl/wdltoil.py

+# WDL options to pass into the WDL jobs and standard libraries
+#   task_path: Dotted WDL name of the part of the workflow this library is working for.
+#   execution_dir: Directory to use as the working directory for workflow code.
+#   container: The type of container to use when executing a WDL task. Carries through the value of the commandline --container option
+WDL_Context = TypedDict('WDL_Context', {"execution_dir": NotRequired[str], "container": NotRequired[str],
+                                        "task_path": str, "namespace": str})


It would be nice to have docs for namespace.

adamnovak · 2024-09-24T17:25:58Z

src/toil/wdl/wdltoil.py

+#   task_path: Dotted WDL name of the part of the workflow this library is working for.
+#   execution_dir: Directory to use as the working directory for workflow code.
+#   container: The type of container to use when executing a WDL task. Carries through the value of the commandline --container option
+WDL_Context = TypedDict('WDL_Context', {"execution_dir": NotRequired[str], "container": NotRequired[str],


I wouldn't use an _ in the name here; it's a different convention than all the other types.
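A sketch of how both review points could look together, assuming the class-style TypedDict syntax so each key can carry its own comment. It uses a required base plus `total=False` inheritance instead of `NotRequired` so it runs on the standard library alone; `_RequiredWDLContext` is a hypothetical helper name, and the description given for `namespace` is an assumption for illustration, not taken from the PR.

```python
from typing import TypedDict


class _RequiredWDLContext(TypedDict):
    # Dotted WDL name of the part of the workflow this library is working for.
    task_path: str
    # ASSUMPTION for illustration: the WDL namespace that names are resolved in.
    namespace: str


class WDLContext(_RequiredWDLContext, total=False):
    # Directory to use as the working directory for workflow code.
    execution_dir: str
    # Container type for WDL tasks; carries the --container option value.
    container: str


# Usage: only the required keys have to be present.
ctx: WDLContext = {"task_path": "wf.task", "namespace": "wf"}
```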

adamnovak · 2024-09-26T16:29:51Z

@stxue1 It looks like conformance test 27 ("Legacy test for type_pair_with_files") is failing because it is trying to devirtualize a file to the same path twice. I think it can't deal with how that test sends the same path twice in the input JSON, so there are two File values from the same directory with the same basename (because it's the same file twice).
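To illustrate the collision, here is a toy planner that assigns each distinct source one local destination, so a repeated source reuses its path and only genuinely different files with the same basename get separate directories. `plan_local_paths` is a hypothetical helper, not how Toil lays out files and not necessarily how this PR resolves the failure.

```python
import posixpath
from typing import Dict, List, Set


def plan_local_paths(sources: List[str], dest_dir: str) -> Dict[str, str]:
    """Map each distinct source to one destination path under dest_dir."""
    assigned: Dict[str, str] = {}
    used: Set[str] = set()
    for src in sources:
        if src in assigned:
            # Same file seen again: reuse the path instead of writing it twice.
            continue
        base = posixpath.basename(src)
        candidate = posixpath.join(dest_dir, base)
        n = 0
        while candidate in used:
            # A different file with the same basename gets its own subdirectory.
            n += 1
            candidate = posixpath.join(dest_dir, str(n), base)
        used.add(candidate)
        assigned[src] = candidate
    return assigned
```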

stxue1 · 2024-09-28T02:36:55Z

I changed the function convert_remote_files, so that function needs a rereview

adamnovak

I think the changes are probably fine, but could be slightly better.

adamnovak · 2024-09-30T22:07:41Z

src/toil/wdl/wdltoil.py

+    :param import_remote_files: If set, import files from remote locations. Else leave them as URI references.
+    """
+    path_to_id: Dict[str, uuid.UUID] = {}
+    @memoize


I think this memoization gets scoped to the call to the enclosing function, but I also think that's what we want here.

stxue1 · 2024-10-01

I think that's what we want; it is the same way we did memoization in the previous implementation. I believe file imports happening within the WDL have their own caching system separate from this anyway.
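The scoping behavior can be seen in a small sketch. It uses `functools.lru_cache` as a stand-in for the `@memoize` decorator in the diff, and `convert_all` and `import_once` are hypothetical names; the `log` list exists only to show when the inner function really runs.

```python
from functools import lru_cache
from typing import List


def convert_all(names: List[str], log: List[str]) -> List[str]:
    """Import each name at most once per call to convert_all."""

    @lru_cache(maxsize=None)
    def import_once(name: str) -> str:
        # Runs only on a cache miss; the cache belongs to this one
        # invocation of convert_all because the function is redefined each call.
        log.append(name)
        return f"imported:{name}"

    return [import_once(n) for n in names]
```

Within one call, a repeated name hits the cache; a second call to `convert_all` starts with an empty cache and imports again.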

adamnovak · 2024-09-30T22:11:14Z

src/toil/wdl/wdltoil.py

+        candidate_uri, toil_uri = import_filename(file.value)
+        if candidate_uri is None and toil_uri is None:
+            # If we get here we tried all the candidates
+            raise RuntimeError(f"Could not find {file.value} at any of: {list(potential_absolute_uris(file.value, search_paths if search_paths is not None else []))}")


It's odd to re-do the calculation of the possible locations here. If it changes it needs to change in both places. Having a function to compute it helps with that, but then we have a long expression going into the function and it's hard to tell if it is in sync just by looking.

Maybe the list of tried locations really wants to be an output from import_filename? Or maybe import_filename could raise this, since it is only called in the one place?

stxue1 · 2024-10-01

It might be better to have it inside import_filename, though the function then wouldn't ever have a base return case, and I'm not sure if that's better or worse behaviorally. My thinking was that even though the calculation is done twice, the second time only happens if a runtime error is hit, which is unlikely.
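A sketch of the "raise from inside the import function" option, where the list of tried locations is recorded as the loop runs instead of being recomputed for the error message. `import_first_available` and `try_import` are hypothetical names, not the signature of `import_filename` in the PR.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def import_first_available(
    candidates: Iterable[str],
    try_import: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (candidate_uri, imported_id) for the first candidate that imports."""
    tried: List[str] = []
    for uri in candidates:
        tried.append(uri)
        imported = try_import(uri)
        if imported is not None:
            return uri, imported
    # Every candidate was tried; the message uses exactly what was attempted.
    raise RuntimeError(f"Could not import from any of: {tried}")
```

With this shape the caller never sees a `(None, None)` result, at the cost of the function having no non-raising fallback.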

Defer file virtualization to task boundaries and consolidate decl eva…

89c276a

…luation. Also gets rid of monkeypatching in favor of a manual function call

stxue1 commented Jul 19, 2024

stxue1 marked this pull request as draft July 23, 2024 03:17

stxue1 added 3 commits July 23, 2024 18:43

Implement support for importing relative URL paths; import files at s…

c85f173

…tartup and carry through mappings

Fix possible invalid lookup and don't import raw URLs

230efa0

Merge branch 'master' of github.com:DataBiosphere/toil into issues/50…

2955bc3

…04-wdl-virtualize-only-at-task-boundaries

stxue1 mentioned this pull request Jul 24, 2024

Symlink pasthrough on WDL is broken again #5031

Closed

stxue1 added 2 commits July 24, 2024 12:50

Get rid of sentinel value implementation + drop files before virtuali…

066f780

…zation in TaskWrapper for MiniWDL parity

Drop missing files during decl eval for outputs + Add a check for inv…

d5365a2

…alid coerced-to-null files and raise if exception found

stxue1 marked this pull request as ready for review July 25, 2024 01:14

stxue1 linked an issue Jul 25, 2024 that may be closed by this pull request

Support filename coercion back to string #5004

Closed

stxue1 commented Jul 25, 2024

stxue1 added 2 commits July 24, 2024 19:28

Deal with mypy

bbb098d

Don't drop unnecesssarily

1b6bd02

adamnovak requested changes Jul 30, 2024

adamnovak linked an issue Aug 13, 2024 that may be closed by this pull request

toil-wdl-runner does not handle string to file coercion properly for optional file types #4992

Closed

stxue1 mentioned this pull request Aug 15, 2024

run cactus on the grid_engine cluster #5022

Closed

stxue1 and others added 10 commits August 19, 2024 19:45

Switch to setattr implementation

f65d59c

Fix overwriting files

be0203e

Fix prototype implementation

f60d475

Merge branch 'master' of github.com:DataBiosphere/toil into issues/50…

552ac74

…04-wdl-virtualize-only-at-task-boundaries

Merge master into issues/5004-wdl-virtualize-only-at-task-boundaries

8163c2c

Fix documentation

4bbe63d

Merge branch 'issues/5004-wdl-virtualize-only-at-task-boundaries' of …

05dc240

…github.com:DataBiosphere/toil into issues/5004-wdl-virtualize-only-at-task-boundaries

Resolve nested virtualize files issue by converting back to original …

a6b18c7

…and permanent devirtualization of URLs

Fix virtualization of URLs that aren't toil URIs

94da9af

Mypy

644fd51

stxue1 and others added 9 commits September 17, 2024 12:42

Apply suggestions from code review

a7292be

Co-authored-by: Adam Novak <[email protected]>

Update src/toil/wdl/wdltoil.py

e6718cd

Co-authored-by: Adam Novak <[email protected]>

Rename, add comments, remove unused code/comments

2ad130f

Merge branch 'issues/5004-wdl-virtualize-only-at-task-boundaries' of …

cb8b230

…github.com:DataBiosphere/toil into issues/5004-wdl-virtualize-only-at-task-boundaries

Add comments and adjust wdl context usage

b0db027

add namespace

8c24d8f

integrate namespace into wdl_context

666aef5

properly name wdl value bases

0b6fb82

Remove irrelevant comment

4d5ecae

adamnovak approved these changes Sep 24, 2024

adamnovak and others added 5 commits September 24, 2024 13:33

Adjust docstring formatting to remove RST warnings

b43d256

Adjust comment grammar

7a36b59

Merge master into issues/5004-wdl-virtualize-only-at-task-boundaries

96cde22

Remove disallowed backticks

8c42888

Merge remote-tracking branch 'upstream/issues/5004-wdl-virtualize-onl…

fec8663

…y-at-task-boundaries' into issues/5004-wdl-virtualize-only-at-task-boundaries

github-actions bot and others added 8 commits September 26, 2024 16:30

Merge master into issues/5004-wdl-virtualize-only-at-task-boundaries

966d855

Merge master into issues/5004-wdl-virtualize-only-at-task-boundaries

dcf2f27

Extract out the import function to add back memoization and resolve i…

a318208

…ssue of multiple imports on the same file

Merge remote-tracking branch 'origin/issues/5004-wdl-virtualize-only-…

a545493

…at-task-boundaries' into issues/5004-wdl-virtualize-only-at-task-boundaries

Merge branch 'master' of github.com:DataBiosphere/toil into issues/50…

3ece42b

…04-wdl-virtualize-only-at-task-boundaries

change wdl_context to wdlcontext

f9db076

also change typeddict

c16290f

Add some documentation for the WDLContext object

cb0f986

stxue1 requested a review from adamnovak September 27, 2024 20:21

Merge master into issues/5004-wdl-virtualize-only-at-task-boundaries

c7c69eb

adamnovak approved these changes Sep 30, 2024

stxue1 merged commit 0034c92 into master Oct 1, 2024
1 check passed

stxue1 deleted the issues/5004-wdl-virtualize-only-at-task-boundaries branch October 1, 2024 23:34
