
Support more methods of WDL task disk specification #5001

Open: wants to merge 54 commits into base: master

Conversation

@stxue1 (Contributor) commented Jun 29, 2024:

Closes #4995

Changelog Entry

To be copied to the draft changelog by merger:

  • Support WDL disk specification as per 1.1 spec

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak (Member) left a comment:

I think we might need to go back and get clarification from the WDL folks before we can implement this properly.

The spec talks about "persistent volumes", but doesn't really explain what those are or which tasks would expect to be able to read which other tasks' writes. The implementation here doesn't actually provide any kind of persistence that I can see, unless running somewhere where the worker nodes have persistent filesystems.

It's not really clear to me whether we're meant to be mounting arbitrary host paths into the containers, or doing something more like Docker volumes. There's a requirement for the host-side path to exist but no other evidence that the task would be able to expect to actually access anything at that host-side path, and the execution engine is somehow responsible for making the volumes be the right size, which is impossible if it is just meant to mount across whatever's already there.

Can we dig up any workflows that genuinely use the mountpoint feature as more than just a test case, to see how they expect it to behave? Can we find or elicit any more explanation from the spec authors as to what the mount point feature is meant to accomplish?

It also might not really make sense to implement this on our own in Toil without some support from MiniWDL. We rely on MiniWDL for ordering up an appropriate container given the runtime spec, and unless we need to hook it into Toil's job requirements logic or make the batch system do special things with batch-system-specific persistent storage, it would be best if we could just get this feature for free when it shows up in MiniWDL.

Comment on lines 1805 to 1807
if not os.path.exists(part_mount_point):
    # this isn't a valid mount point
    raise NotImplementedError(f"Cannot use mount point {part_mount_point} as it does not exist")
Member:

This is what the spec says we're supposed to require (the mount point needs to exist on the "host system"), but I've read the disks section of the spec twice and that requirement doesn't really make any sense, because we're mounting storage into the container. The spec doesn't say we actually do anything with this path on the host.

Comment on lines 2241 to 2285
    singularity_original_prepare_mounts = task_container.prepare_mounts

    def patch_prepare_mounts_singularity() -> List[Tuple[str, str, bool]]:
        """
        Mount the mount points specified from the disk requirements.

        The singularity and docker patch are separate as they have different function signatures
        """
        # todo: support AWS EBS/Kubernetes persistent volumes
        # this logic likely only works for local clusters as we don't deal with the size of each mount point
        mounts: List[Tuple[str, str, bool]] = singularity_original_prepare_mounts()
        # todo: support AWS EBS/Kubernetes persistent volumes
        # this logic likely only works for local clusters as we don't deal with the size of each mount point
        for mount_point, _ in self._mount_spec.items():
            abs_mount_point = os.path.abspath(mount_point)
            mounts.append((abs_mount_point, abs_mount_point, True))
        return mounts
    task_container.prepare_mounts = patch_prepare_mounts_singularity  # type: ignore[method-assign]
elif isinstance(task_container, SwarmContainer):
    docker_original_prepare_mounts = task_container.prepare_mounts

    try:
        # miniwdl depends on docker so this should be available but check just in case
        import docker
        # docker stubs are still WIP: https://github.com/docker/docker-py/issues/2796
        from docker.types import Mount  # type: ignore[import-untyped]

        def patch_prepare_mounts_docker(logger: logging.Logger) -> List[Mount]:
            """
            Same as the singularity patch but for docker
            """
            mounts: List[Mount] = docker_original_prepare_mounts(logger)
            for mount_point, _ in self._mount_spec.items():
                abs_mount_point = os.path.abspath(mount_point)
                mounts.append(
                    Mount(
                        abs_mount_point.rstrip("/").replace("{{", '{{"{{"}}'),
                        abs_mount_point.rstrip("/").replace("{{", '{{"{{"}}'),
                        type="bind",
                    )
                )
            return mounts
        task_container.prepare_mounts = patch_prepare_mounts_docker  # type: ignore[method-assign]
    except ImportError:
        logger.warning("Docker package not installed. Unable to add mount points.")
Member:

I'm not sure it makes sense to add the ability to make these mounts in Toil as a monkey-patch. There's nothing here that wouldn't make just as much sense in MiniWDL (or more, since MiniWDL actually has a shared filesystem as an assumption), so instead of doing monkey-patches we should PR this machinery to MiniWDL.

Or if we're starting to add multiple monkey-patches to the TaskContainers, maybe we really want to extend them instead?

@stxue1 (Contributor, Author) commented Aug 23, 2024:

It probably is a good idea to PR this machinery to MiniWDL. The main reason I had to monkeypatch this in the first place is that MiniWDL doesn't actually support the disks runtime attribute. Rather than PR-ing the code one-to-one, I think a PR that adds a way to extend the list of mount points for both Docker and Singularity is best.
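For illustration, here is a rough sketch of what such an "extend instead of monkey-patch" hook could look like on the Singularity side, assuming the prepare_mounts() signature shown in the patch above. The import path is an assumption, and ToilSingularityContainer and extra_mount_points are made-up names, not part of MiniWDL or this PR.

# Sketch only: assumes the import path below exists and that prepare_mounts()
# returns List[Tuple[str, str, bool]] as in the Singularity patch quoted above.
import os
from typing import List, Tuple

from WDL.runtime.backend.singularity import SingularityContainer  # assumed import path


class ToilSingularityContainer(SingularityContainer):
    """SingularityContainer that also binds the mount points from a WDL disks spec."""

    # Mount point targets requested by the task's disks attribute; the caller
    # fills this in before the container runs. (Hypothetical attribute.)
    extra_mount_points: List[str] = []

    def prepare_mounts(self) -> List[Tuple[str, str, bool]]:
        # Start from whatever MiniWDL would normally mount, then add our bind mounts.
        mounts = super().prepare_mounts()
        for mount_point in self.extra_mount_points:
            abs_mount_point = os.path.abspath(mount_point)
            mounts.append((abs_mount_point, abs_mount_point, True))
        return mounts

With a hook like this, the Toil side would only need to populate extra_mount_points rather than swapping out prepare_mounts at runtime.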

# this logic likely only works for local clusters as we don't deal with the size of each mount point
for mount_point, _ in self._mount_spec.items():
    abs_mount_point = os.path.abspath(mount_point)
    mounts.append((abs_mount_point, abs_mount_point, True))
Member:

How sure are you that the spec actually means you're supposed to mount this particular host path through to the container?

It does say that the host path is required to exist. But it also says a lot about mounting "persistent volumes" at these mount points, and making them the required size. It doesn't seem to say anything about mounting those host paths specifically into the container.

What if you ask for a 100 GiB /tmp mount, and /tmp exists on the host, but /tmp on the host is only 10 GiB and Toil is actually doing all its work on a much larger /scratch/tmp? Shouldn't you get a 100 GiB /tmp mount in the container that actually refers to some location in /scratch/tmp on the host?

If the mount point feature was really meant to mount particular host paths across, wouldn't it take both a host-side path and a container-side path like the actual underlying mount functionality uses?

@adamnovak (Member):

@stxue1 Don't we now have the clarification we need so that we can finish this? I think the WDL spec is being revised to make it clearer that the mounts are meant to request so much storage available at such-and-such a path in the container, and that they are not actually meant to mount specific paths into the container. But we still need the changes in this PR adding array-of-mounts support.

@adamnovak (Member) left a comment:

I think the total space requirement is not actually getting through to the job's disk requirement field. Also, we're only checking for enough space for each mount point individually, when we know they're all going to be fulfilled from the same underlying filesystem and we can just check for the total amount of free space.

src/toil/wdl/wdltoil.py: outdated review comment (resolved)
Comment on lines 1951 to 1958
if part.replace(".", "", 1).isdigit():
    # round down floats
    part_size = int(float(part))
    continue
if i == 0:
    # mount point is always the first
    specified_mount_point = part
    continue
Member:

The mount point isn't always the first; if the first item is all numeric it gets interpreted as a size and not a mount point.

I think we get away with this because it's impossible to have an all-numeric mount point that is valid, since the mount point is specified to be an "absolute" path and thus necessarily contains /.

But if someone asks for 001 15 GiB hoping for a directory named 001 somewhere, we are not going to parse that like they meant it. I think we will interpret that as a root volume size of 15 GiB.
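As an illustration, here is a standalone re-creation of the token handling quoted above (not the actual wdltoil parser), run on that example:

# Illustrative re-creation of the first-numeric-token rule; parse_disk_entry is
# a hypothetical helper, not code from this PR.
from typing import Optional, Tuple


def parse_disk_entry(entry: str) -> Tuple[Optional[str], Optional[int], Optional[str]]:
    """Return (mount_point, size, suffix) for one entry of a disks specification."""
    mount_point: Optional[str] = None
    size: Optional[int] = None
    suffix: Optional[str] = None
    for i, part in enumerate(entry.split()):
        if part.replace(".", "", 1).isdigit():
            size = int(float(part))  # round down floats
            continue
        if i == 0:
            mount_point = part  # "mount point is always the first"
            continue
        if size is not None:
            suffix = part  # suffix comes after the size
    return mount_point, size, suffix


print(parse_disk_entry("/mnt/data 15 GiB"))  # ('/mnt/data', 15, 'GiB')
print(parse_disk_entry("001 15 GiB"))        # (None, 15, 'GiB'): read as a 15 GiB root volume, not a dir named 001

Requiring the mount point to be an absolute path, as the spec already implies, would remove the ambiguity.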

if part_size is not None:
    # suffix will always be after the size, if it exists
    part_suffix = part
    continue
Member:

We don't have anything here to prohibit extraneous pieces. We probably should reject anything that follows neither the spec nor Cromwell's convention, because in that case we know we can't do whatever is being asked for.

# can't imagine that ever being standardized; just leave it
# alone so that the workflow doesn't rely on this weird and
# likely-to-change Cromwell detail.
logger.warning('Not rounding LOCAL disk to the nearest 375 GB; workflow execution will differ from Cromwell!')
Member:

We should also probably switch the default unit to GB here, since that is what the Cromwell syntax expects.

Contributor (Author):

I think it's better to keep the default unit as GiB, since that is the WDL spec default: https://github.com/openwdl/wdl/blob/e43e042104b728df1f1ad6e6145945d2b32331a6/SPEC.md?plain=1#L5082
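For context, the gap between the two candidate defaults for a bare size of 15 is plain arithmetic (this snippet is illustrative, not code from the PR):

# Difference between the WDL 1.1 spec default (GiB) and the Cromwell-style default (GB).
GIB = 2 ** 30   # binary gibibyte
GB = 10 ** 9    # decimal gigabyte

print(15 * GIB)  # 16106127360 bytes
print(15 * GB)   # 15000000000 bytes, about 7% less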

Comment on lines 1983 to 1985
if mount_spec.get(specified_mount_point) is not None:
    # raise an error as all mount points must be unique
    raise ValueError(f"Could not parse disks = {disks_spec} because the mount point {specified_mount_point} is specified multiple times")
Member:

We never actually add local-disk in here, so you are going to be allowed to specify it multiple times. Maybe that's fine? It makes sense to sum them.

Contributor (Author):

I added a check to catch local-disk being specified multiple times.

part_suffix = "GB"

per_part_size = convert_units(part_size, part_suffix)
total_bytes += per_part_size
Member:

The total space needed (including any local-disk or mount-point-less size) doesn't ever get copied to runtime_disk and so isn't actually used for scheduling.


per_part_size = convert_units(part_size, part_suffix)
total_bytes += per_part_size
if specified_mount_point is not None:
Member:

We're not prohibiting multiple mount-point-less specifications, or using e.g. 25 GiB and local-disk 300 SSD together. And we also don't store either of those in the mount_spec dict, which means they don't feed into the df check later.

Contributor (Author):

There should be a check now; I also store it in the mount_spec dict now.

Comment on lines 2251 to 2275
def ensure_mount_point(self, file_store: AbstractFileStore, mount_spec: Dict[str, int]) -> Dict[str, str]:
    """
    Ensure the mount point sources are available.

    Will check if the mount point source has the requested amount of space available.

    Note: We are depending on Toil's job scheduling backend to error when the sum of multiple mount points disk requests is greater than the total available
    For example, if a task has two mount points request 100 GB each but there is only 100 GB available, the df check may pass
    but Toil should fail to schedule the jobs internally

    :param mount_spec: Mount specification from the disks attribute in the WDL task. Is a dict where key is the mount point target and value is the size
    :param file_store: File store to create a tmp directory for the mount point source
    :return: Dict mapping mount point target to mount point source
    """
    logger.debug("Detected mount specifications, creating mount points.")
    mount_src_mapping = {}
    # Create one tmpdir to encapsulate all mount point sources, each mount point will be associated with a subdirectory
    tmpdir = file_store.getLocalTempDir()

    # The POSIX standard doesn't specify how to escape spaces in mount points and file system names
    # The only defect of this regex is if the target mount point is the same format as the df output
    # It is likely reliable enough to trust the user has not created a mount with a df output-like name
    regex_df = re.compile(r".+ \d+ +\d+ +(\d+) +\d+% +.+")
    try:
        for mount_target, mount_size in mount_spec.items():
Member:

Rather than looping over each requested mount point (except the local-disk/no-mount-point ones that set the size of the working directory, which never made it into mount_spec) and checking for space for each individually, we should sum up all the space needed and check for at least that much total space. We know all the mount points and also the task working directory are going to come from the same underlying filesystem where the Toil work directory is. So we only need to ask about the free space on that filesystem once.

Then we would no longer have the two-100GB-mount-points problem; we would know that we need 200GB total and only have say 150.

We can probably just read the job's disk requirement (I think self.requirements.disk) and make sure df shows that much space. Then we don't need two copies of the logic to sum up the total of all the mount points.
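A rough sketch of that single check, using shutil.disk_usage from the standard library instead of parsing df output. check_total_mount_space is a hypothetical helper; the job requirement lookup mentioned above (self.requirements.disk) is the reviewer's guess at the attribute name and is assumed to be passed in as bytes.

# Hypothetical helper illustrating the single total-space check described above.
import shutil


def check_total_mount_space(work_dir: str, required_bytes: int) -> None:
    """Fail fast if the filesystem backing work_dir cannot hold the whole request."""
    free_bytes = shutil.disk_usage(work_dir).free
    if free_bytes < required_bytes:
        raise RuntimeError(
            f"Task needs {required_bytes} bytes under {work_dir}, "
            f"but only {free_bytes} bytes are free on that filesystem"
        )


# Example: two 100 GiB mount points served from the same filesystem become one
# 200 GiB check, which catches the two-100GB-mount-points-on-150GB case.
# check_total_mount_space("/path/to/toil/workdir", 2 * 100 * 2 ** 30)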


Successfully merging this pull request may close these issues.

Allow array of disk specifications for WDL