
Add flags for non-interactive use #15

Open
egrace479 opened this issue Jun 24, 2024 · 17 comments
Labels: design (UX or presentation needs attention), enhancement (New feature or request)

Comments

@egrace479
Member

egrace479 commented Jun 24, 2024

It would be good to be able to schedule batch runs without requiring user input. This could be implemented as an exit in place of the regular interactive prompts (non-unique filenames, not enough filenames for the existing URLs, or overwriting the targeted image download folder). In that case, it would likely be a good idea to check all three conditions before exiting so the user can fix everything at once instead of finding more issues on resubmission.

@johnbradley, as this was your request, I'd very much appreciate your opinion on the implementation.

@egrace479 added the design and enhancement labels on Jun 24, 2024
@johnbradley

I think the original idea was to be able to use this tool in a background job on a cluster via a command such as sbatch. When run like this there is no user input; the job must run to completion without any prompts, so anywhere we ask for user input may not behave as expected. I'm aware of two places in the current code that read input:

if urls_no_name > 0:
    ignore = input(f"'{filename_col}' is missing values for {urls_no_name} URLs. Proceed with download ignoring these URLs? [y/n]: ")
    if ignore.lower() != "y":
        sys.exit("Exited without executing.")

if os.path.exists(img_dir):
    overwrite = input(f"'{img_dir}' already exists (may impact downsizing too). Overwrite? [y/n]: ")
    if overwrite.lower() != "y":
        sys.exit("Exited without executing.")

I'm wondering if the code above would hang at the input() calls waiting for a user to type input.

@johnbradley

These two scenarios seem like error handling:

  1. There is a problem with the input CSV and we are letting the user ignore bad rows.
  2. The image directory already exists and we are letting the user overwrite it.

One approach is to have these two scenarios error out instead of prompting and add flags to allow a user to manually override.
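A rough sketch of what that could look like, reusing the variable names from the snippet above (the --ignore-missing-urls and --overwrite flag names are just placeholders, not existing options):

import os
import sys

def check_preconditions(urls_no_name, filename_col, img_dir, ignore_missing, overwrite):
    """Collect every failed precondition, then exit once so the user can fix them all together."""
    problems = []
    if urls_no_name > 0 and not ignore_missing:
        problems.append(
            f"'{filename_col}' is missing values for {urls_no_name} URLs "
            "(pass --ignore-missing-urls to skip them)."
        )
    if os.path.exists(img_dir) and not overwrite:
        problems.append(f"'{img_dir}' already exists (pass --overwrite to replace it).")
    if problems:
        sys.exit("Cannot run non-interactively:\n" + "\n".join(problems))

Erroring out with a pointer to the relevant flag also matches the earlier suggestion of reporting all problems at once rather than one per run.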

@johnbradley

johnbradley commented Aug 5, 2024

Another thought I had related to running cautious-robot on a cluster is running multiple instances of cautious-robot in parallel, for example using an sbatch array job. Each job would download part of a CSV.

A workaround that might be acceptable is to have users split their CSV into multiple files and run an instance of cautious-robot on each.

Using cautious-robot this way may require some changes to how we create output directories.
For example, I have seen code similar to the following fail on the second line when running in parallel:

if os.path.exists(image_dir_path) != True:
    os.makedirs(image_dir_path, exist_ok=False)

Re-running it then works. Changing this to exist_ok=True would fix this edge case.
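For reference, a minimal version of that fix (the example path is illustrative; in cautious-robot the value comes from the CLI arguments):

import os

image_dir_path = "images"  # example path

# exist_ok=True makes the call safe to repeat, so two parallel jobs racing to
# create the same directory no longer crash, and the os.path.exists() check
# becomes unnecessary.
os.makedirs(image_dir_path, exist_ok=True)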

@johnbradley

Another consideration is making the tool friendly to workflow languages.
Some workflow languages (e.g. CWL) have clean-room logic such that the directory containing the input file is read-only and output files must be named based on input arguments.
With the current logic, writing the two log files below would fail:

# Set location for logs
metadata_path = csv_path.split(".")[0]
log_filepath = metadata_path + "_log.jsonl"
error_log_filepath = metadata_path + "_error_log.jsonl"

Being able to specify the output directory or file paths for all files created by the tool would resolve this problem.
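A sketch of one possible shape for that, assuming a hypothetical optional log_dir argument and keeping the current CSV-adjacent naming as the default:

import os

def get_log_paths(csv_path, log_dir=None):
    """Derive log file paths from the CSV name, optionally redirected to log_dir."""
    base = os.path.splitext(os.path.basename(csv_path))[0]
    # Default: keep the current behaviour of writing next to the CSV.
    target_dir = log_dir if log_dir is not None else os.path.dirname(csv_path)
    log_filepath = os.path.join(target_dir, base + "_log.jsonl")
    error_log_filepath = os.path.join(target_dir, base + "_error_log.jsonl")
    return log_filepath, error_log_filepath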

@thompsonmj
Contributor

For interactive vs. non-interactive behavior, sys.stdin.isatty() returns True when the script is run from a terminal and False when the same script is run in a Slurm job.
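A minimal example of that check:

import sys

def is_interactive() -> bool:
    """True when stdin is attached to a terminal; False under a Slurm/sbatch job."""
    return sys.stdin.isatty()

if __name__ == "__main__":
    # Prints True at an interactive terminal and False under sbatch
    # (or when stdin is redirected).
    print(is_interactive())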

@johnbradley

I think there are some edge cases for sys.stdin.isatty() to be aware of. One that comes to mind: if you pipe input into a script, isatty() will return False.

@egrace479
Member Author

Another thought I had related to running cautious-robot on a cluster is running multiple instances of cautious-robot in parallel, for example using an sbatch array job. Each job would download part of a CSV.

A workaround that might be acceptable is to have users split their CSV into multiple files and run an instance of cautious-robot on each.

Using cautious-robot this way may require some changes to how we create output directories. For example, I have seen code similar to the following fail on the second line when running in parallel:

if os.path.exists(image_dir_path) != True:
    os.makedirs(image_dir_path, exist_ok=False)

Re-running it then works. Changing this to exist_ok=True would fix this edge case.

Adding an ending index would also help with batching, since a user wouldn't need to create separate CSVs to implement that, but this would be a good start.
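A rough sketch of how start/end indices could select each job's share of the rows, assuming the CSV is already loaded into a pandas DataFrame (the parameter names and the SLURM_ARRAY_TASK_ID wiring are just illustrative):

from typing import Optional

import pandas as pd

def select_rows(data: pd.DataFrame, start: int = 0, end: Optional[int] = None) -> pd.DataFrame:
    """Return the slice of rows this job is responsible for downloading (end is exclusive)."""
    return data.iloc[start:end]

# Example: an sbatch array job where each task handles 1,000 rows.
# task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
# chunk = select_rows(data, start=task_id * 1000, end=(task_id + 1) * 1000)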

@egrace479
Member Author

Another consideration is making the tool friendly to workflow languages. Some workflow languages (e.g. CWL) have clean-room logic such that the directory containing the input file is read-only and output files must be named based on input arguments. With the current logic, writing the two log files below would fail:

# Set location for logs
metadata_path = csv_path.split(".")[0]
log_filepath = metadata_path + "_log.jsonl"
error_log_filepath = metadata_path + "_error_log.jsonl"

Being able to specify the output directory or file paths for all files created by the tool would resolve this problem.

We do have the option to specify the output directory for images. Part of the reasoning for doing it this way was to match the logs to the CSV used for the download and to avoid placing the logs in the image folder, as that could be rather inconvenient. Would it be reasonable to create an optional log_output flag and keep the current log settings as the default?

@johnbradley

We do have the option to specify the output directory for images. Part of the reasoning for doing it this way was to match the logs to the CSV used for the download and to avoid placing the logs in the image folder, as that could be rather inconvenient. Would it be reasonable to create an optional log_output flag and keep the current log settings as the default?

I agree that putting logs into the image folder would be painful. Having an optional flag to control where output logs are created sounds good to me.

@johnbradley

There could be a race condition in the logging code when running multiple downloads at once if all processes use the same log filename.

@egrace479
Member Author

There could be a race condition in the logging code when running multiple downloads at once if all processes use the same log filename.

If it's being run from multiple CSVs, then the default logs would be named differently. However, if we set up an end-index option for multiprocessing, or if a user passed the same logging location/name for their multiple CSVs, that would become an issue. We could add the submission date-time to the log filename to improve robustness?
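A sketch of what that could look like, building on the log-path code quoted earlier:

import os
from datetime import datetime

def timestamped_log_paths(csv_path):
    """Derive log paths that include the submission time to avoid collisions."""
    base = os.path.splitext(csv_path)[0]
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    log_filepath = f"{base}_{stamp}_log.jsonl"
    error_log_filepath = f"{base}_{stamp}_error_log.jsonl"
    return log_filepath, error_log_filepath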

@thompsonmj
Contributor

FWIW, checking sys.stdin.isatty() or sys.stderr.isatty() returns True when running a script with piped input, and returns False when running with sbatch. Though I'm not sure how to get it to return False when the same commands are run, e.g., as a Snakemake workflow rule. All to say that auto-detection of interactivity in an arbitrary environment may be a Pandora's box, so:

One approach is to have these two scenarios error out instead of prompting and add flags to allow a user to manually override.

sounds like a good solution to the core issue here.

For the missing URLs issue:
I'm in favor of adding a flag like --ignore-missing-url-entries, which, if set, would create another log file of the entries that were skipped (written to the specified log location). If the flag isn't set, raise a custom exception

class MissingURLs(Exception):
    ...

with a message explaining how to use the flag or to fix the input.
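A rough sketch of how the flag could gate the two behaviours (the helper and its arguments are placeholders, not existing cautious-robot code):

import json

class MissingURLs(Exception):
    """Raised when the filename column is missing values for some URLs."""

def handle_missing_urls(missing_rows, filename_col, ignore_missing_url_entries, skipped_log_path):
    """Either log the bad rows and continue, or fail with a pointer to the flag."""
    if not missing_rows:
        return
    if ignore_missing_url_entries:
        # Record the skipped entries so the user can audit them afterwards.
        with open(skipped_log_path, "a") as f:
            for row in missing_rows:
                f.write(json.dumps(row) + "\n")
    else:
        raise MissingURLs(
            f"'{filename_col}' is missing values for {len(missing_rows)} URLs. "
            "Fix the input CSV or pass --ignore-missing-url-entries to skip them."
        )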

For the existing output directory issue:
The user might run a download again a) to retry failed downloads, b) because they added new entries to the input CSV, or c) to create differently resized images. In each case, it would be wasteful to overwrite already-downloaded full-size images. For a new resize parameter, the full-size folder should be referenced to create a new resized folder. To accommodate this, I think <img_dir>_downsized should change to <img_dir>_<downsample>. That is, append the resolution to the directory name rather than simply "downsized". (This functionality should also be revisited in a separate issue, since img.resize((downsample, downsample)) could increase resolution.)

So when the output location is found to already exist, cautious-robot should prune the entries from data where data[filename][i] is found in the output directory and try to download what remains. (should it also check to make sure that there is not something in the output dir that isn't in the CSV?) The resize code should be separated out of the download_images function so that it can be executed independently on whatever is in the full-size image directory.
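A rough sketch of that pruning step, assuming data is a pandas DataFrame and filename is the name of the filename column (as in the discussion above):

import os

import pandas as pd

def prune_downloaded(data: pd.DataFrame, filename: str, img_dir: str) -> pd.DataFrame:
    """Drop rows whose image already exists in img_dir, leaving only what still needs downloading."""
    if not os.path.isdir(img_dir):
        return data
    existing = set(os.listdir(img_dir))
    remaining = data[~data[filename].isin(existing)]
    # Optionally warn about files in img_dir that are not listed in the CSV.
    unexpected = existing - set(data[filename])
    if unexpected:
        print(f"Warning: {len(unexpected)} files in '{img_dir}' are not in the CSV.")
    return remaining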

With this functionality, I'm not sure when or why a user would want to overwrite any already existing output data, so an overwrite flag may not be needed unless I'm missing something?

@egrace479
Member Author

For the missing URLs issue: I'm in favor of adding a flag like --ignore-missing-url-entries, which, if set, would create another log file of the entries that were skipped (written to the specified log location). If the flag isn't set, raise a custom exception

class MissingURLs(Exception):
    ...

with a message explaining how to use the flag or to fix the input.

Agreed.

For the existing output directory issue: The user might run a download again a) to retry failed downloads, b) because they added new entries to the input CSV, or c) to create differently resized images. In each case, it would be wasteful to overwrite already-downloaded full-size images. For a new resize parameter, the full-size folder should be referenced to create a new resized folder. To accommodate this, I think <img_dir>_downsized should change to <img_dir>_<downsample>. That is, append the resolution to the directory name rather than simply "downsized". (This functionality should also be revisited in a separate issue, since img.resize((downsample, downsample)) could increase resolution.)

Yes, let's make that adjustment to the resized folder name. Our rewrite of the resize code definitely makes separating that out simpler, and it's a good idea.

So when the output location is found to already exist, cautious-robot should prune the entries from data where data[filename][i] is found in the output directory and try to download what remains. (should it also check to make sure that there is not something in the output dir that isn't in the CSV?) The resize code should be separated out of the download_images function so that it can be executed independently on whatever is in the full-size image directory.

With this functionality, I'm not sure when or why a user would want to overwrite any already existing output data, so an overwrite flag may not be needed unless I'm missing something?

If we remove the overwrite (instead checking whether each image has already been downloaded, as noted here), then it becomes more a question of whether a user wants to place images in a folder that already exists (and thus has contents that would be included in the checksum at the end). Not sure if that would warrant a flag.

@egrace479
Member Author

Suggested adjustment for avoiding overwrites: everything following if downsample (line 168) could be moved into a separate resize function, which could be given an exposed wrapper to allow for a direct call. Does that seem reasonable?

@egrace479
Member Author

@johnbradley and @thompsonmj, any thoughts on this?

Suggested adjustment for avoiding overwrites: everything following if downsample (line 168) could be moved into a separate resize function, which could be given an exposed wrapper to allow for a direct call. Does that seem reasonable?

@thompsonmj
Contributor

Yep, I think that's what I was going for. So with this change, download_images would check if downsample, then just have a call to this new resize method, passing in downsample_path (which would be a concatenation of the image directory name and the resize dimension) and the image filename. Then the new method can parse the downsample_path input for the input image directory name and the resize dimension, and use the input image directory name with the image filename to locate the input image to resize.
It could use the existing data logic as-is, since it skips overwriting already resized images.
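A sketch of the shape being described here, using Pillow (the function and parameter names are placeholders rather than the current cautious-robot API):

import os

from PIL import Image

def resize_image(downsample_path: str, image_filename: str) -> None:
    """Resize one full-size image into the <img_dir>_<downsample> directory."""
    # Parse the source directory and target resolution from the path,
    # e.g. "images_256" -> ("images", 256).
    img_dir, downsample = downsample_path.rsplit("_", 1)
    downsample = int(downsample)

    target_path = os.path.join(downsample_path, image_filename)
    if os.path.exists(target_path):
        return  # Skip images that were already resized.

    os.makedirs(downsample_path, exist_ok=True)
    with Image.open(os.path.join(img_dir, image_filename)) as img:
        img.resize((downsample, downsample)).save(target_path)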

@johnbradley

Breaking that part out sounds good to me.
