
Add flags for non-interactive use #15

Open
egrace479 opened this issue Jun 24, 2024 · 17 comments
Labels: design (UX or presentation needs attention), enhancement (New feature or request)

Comments

@egrace479
Member

egrace479 commented Jun 24, 2024

It would be good to be able to schedule batch runs without requiring user input. This could be implemented as an exit in place of the regular interactive prompts (non-unique filenames, not enough filenames for the existing URLs, or overwriting the targeted image download folder). In that case, it would likely be a good idea to check all three conditions before exiting so the user can fix everything at once instead of finding more issues on resubmission.

@johnbradley, as this was your request, I'd very much appreciate your opinion on the implementation.

@egrace479 added the design and enhancement labels on Jun 24, 2024
@johnbradley

I think the original idea was to be able to use this tool in a background job on a cluster via a command such as sbatch. When run like this there is no user input; the job must run to completion without any prompts, so anywhere we ask for user input may not behave as expected. I'm aware of two places in the current code that read input:

if urls_no_name > 0:
    ignore = input(f"'{filename_col}' is missing values for {urls_no_name} URLs. Proceed with download ignoring these URLs? [y/n]: ")
    if ignore.lower() != "y":
        sys.exit("Exited without executing.")

if os.path.exists(img_dir):
    overwrite = input(f"'{img_dir}' already exists (may impact downsizing too). Overwrite? [y/n]: ")
    if overwrite.lower() != "y":
        sys.exit("Exited without executing.")

I'm wondering if the code above would hang at the input() calls waiting for a user to type input.

@johnbradley

These two scenarios seem like error handling:

  1. There is a problem with the input CSV and we are letting the user ignore bad rows.
  2. The image directory already exists and we are letting the user overwrite it.

One approach is to have these two scenarios error out instead of prompting and add flags to allow a user to manually override.
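A rough sketch of what that could look like, reusing the variable names from the snippet above (the --ignore-missing-urls and --overwrite flag names are just placeholders, not existing options):

import os
import sys

def check_preconditions(urls_no_name, filename_col, img_dir, ignore_missing, overwrite):
    """Collect every failed precondition, then exit once so the user can fix them all together."""
    problems = []
    if urls_no_name > 0 and not ignore_missing:
        problems.append(
            f"'{filename_col}' is missing values for {urls_no_name} URLs "
            "(pass --ignore-missing-urls to skip them)."
        )
    if os.path.exists(img_dir) and not overwrite:
        problems.append(f"'{img_dir}' already exists (pass --overwrite to replace it).")
    if problems:
        sys.exit("Cannot run non-interactively:\n" + "\n".join(problems))

Erroring out with a pointer to the relevant flag also matches the earlier suggestion of reporting all problems at once rather than one per run.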

@johnbradley

johnbradley commented Aug 5, 2024

Another thought I had related to running cautious-robot on a cluster is running multiple instances of cautious-robot in parallel, for example using an sbatch array job. Each job would download part of a CSV.

A workaround that might be acceptable is to have users split their CSV into multiple files and run an instance of cautious-robot on each.

Using cautious-robot this way may require some changes to how we create output directories.
For example, I have seen code similar to the following fail on the second line when running in parallel:

if os.path.exists(image_dir_path) != True:
    os.makedirs(image_dir_path, exist_ok=False)

Re-running it then works. Changing this to exist_ok=True would fix this edge case.
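For reference, a minimal version of that fix (the example path is illustrative; in cautious-robot the value comes from the CLI arguments):

import os

image_dir_path = "images"  # example path

# exist_ok=True makes the call safe to repeat, so two parallel jobs racing to
# create the same directory no longer crash, and the os.path.exists() check
# becomes unnecessary.
os.makedirs(image_dir_path, exist_ok=True)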

@johnbradley

Another consideration is making the tool friendly to workflow languages.
Some workflow languages (e.g. CWL) have clean-room logic such that the directory containing the input file is read-only and output files must be named based on input arguments.
With the current logic, writing the two log files below would fail:

# Set location for logs
metadata_path = csv_path.split(".")[0]
log_filepath = metadata_path + "_log.jsonl"
error_log_filepath = metadata_path + "_error_log.jsonl"

Being able to specify the output directory or file paths for all files created by the tool would resolve this problem.
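A sketch of one possible shape for that, assuming a hypothetical optional log_dir argument and keeping the current CSV-adjacent naming as the default:

import os

def get_log_paths(csv_path, log_dir=None):
    """Derive log file paths from the CSV name, optionally redirected to log_dir."""
    base = os.path.splitext(os.path.basename(csv_path))[0]
    # Default: keep the current behaviour of writing next to the CSV.
    target_dir = log_dir if log_dir is not None else os.path.dirname(csv_path)
    log_filepath = os.path.join(target_dir, base + "_log.jsonl")
    error_log_filepath = os.path.join(target_dir, base + "_error_log.jsonl")
    return log_filepath, error_log_filepath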

@thompsonmj
Contributor

For interactive vs. non-interactive behavior, sys.stdin.isatty() returns True when the script is run from a terminal and False when the same script is run in a Slurm job.
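A minimal example of that check:

import sys

def is_interactive() -> bool:
    """True when stdin is attached to a terminal; False under a Slurm/sbatch job."""
    return sys.stdin.isatty()

if __name__ == "__main__":
    # Prints True at an interactive terminal and False under sbatch
    # (or when stdin is redirected).
    print(is_interactive())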

@johnbradley

I think there are some edge cases for sys.stdin.isatty() to be aware of. One that comes to mind: if you pipe input into a script, isatty() will return False.

@egrace479
Member Author

Another thought I had related to running cautious-robot on a cluster is running multiple instances of cautious-robot in parallel, for example using an sbatch array job. Each job would download part of a CSV.

A workaround that might be acceptable is to have users split their CSV into multiple files and run an instance of cautious-robot on each.

Using cautious-robot this way may require some changes to how we create output directories. For example, I have seen code similar to the following fail on the second line when running in parallel:

if os.path.exists(image_dir_path) != True:
    os.makedirs(image_dir_path, exist_ok=False)

Re-running it then works. Changing this to exist_ok=True would fix this edge case.

Adding an ending index would also help with batching, since a user wouldn't need to create separate CSVs to implement that, but this would be a good start.
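A rough sketch of how start/end indices could select each job's share of the rows, assuming the CSV is already loaded into a pandas DataFrame (the parameter names and the SLURM_ARRAY_TASK_ID wiring are just illustrative):

from typing import Optional

import pandas as pd

def select_rows(data: pd.DataFrame, start: int = 0, end: Optional[int] = None) -> pd.DataFrame:
    """Return the slice of rows this job is responsible for downloading (end is exclusive)."""
    return data.iloc[start:end]

# Example: an sbatch array job where each task handles 1,000 rows.
# task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
# chunk = select_rows(data, start=task_id * 1000, end=(task_id + 1) * 1000)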

@egrace479
Member Author

Another consideration is making the tool friendly to workflow languages. Some workflow languages (e.g. CWL) have clean-room logic such that the directory containing the input file is read-only and output files must be named based on input arguments. With the current logic, writing the two log files below would fail:

# Set location for logs
metadata_path = csv_path.split(".")[0]
log_filepath = metadata_path + "_log.jsonl"
error_log_filepath = metadata_path + "_error_log.jsonl"

Being able to specify the output directory or file paths for all files created by the tool would resolve this problem.

We do have the option to specify the output directory for images. Part of the reasoning for doing it this way was to match the logs to the CSV used for the download and to avoid placing the logs in the image folder, as that could be rather inconvenient. Would it be reasonable to create an optional log_output flag and keep the current log settings as the default?

@johnbradley

We do have the option to specify the output directory for images. Part of the reasoning for doing it this way was to match the logs to the CSV used for the download and to avoid placing the logs in the image folder, as that could be rather inconvenient. Would it be reasonable to create an optional log_output flag and keep the current log settings as the default?

I agree that putting logs into the image folder would be painful. Having an optional flag to control where output logs are created sounds good to me.

@johnbradley

There could be a race condition in the logging code when running multiple downloads at once if all processes use the same log filename.

@egrace479
Member Author

There could be a race condition in the logging code when running multiple downloads at once if all processes use the same log filename.

If it's being run from multiple CSVs, then the default logs would be named differently. However, if we set up an end-index option for multiprocessing, or if a user passed the same logging location/name for their multiple CSVs, that would become an issue. We could add the submission date-time to the log filename to improve robustness?
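A sketch of what that could look like, building on the log-path code quoted earlier:

import os
from datetime import datetime

def timestamped_log_paths(csv_path):
    """Derive log paths that include the submission time to avoid collisions."""
    base = os.path.splitext(csv_path)[0]
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    log_filepath = f"{base}_{stamp}_log.jsonl"
    error_log_filepath = f"{base}_{stamp}_error_log.jsonl"
    return log_filepath, error_log_filepath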

@thompsonmj
Contributor

FWIW, checking sys.stdin.isatty() or sys.stderr.isatty() returns True when running a script with piped input, and returns False when running with sbatch. Though I'm not sure how to get it to return False when the same commands are run, e.g., as a Snakemake workflow rule. All to say that auto-detection of interactivity in an arbitrary environment may be a Pandora's box, so:

One approach is to have these two scenarios error out instead of prompting and add flags to allow a user to manually override.

sounds like a good solution to the core issue here.

For the missing URLs issue:
I'm in favor of adding a flag like --ignore-missing-url-entries, which, if set, would create another log file of the entries that were skipped (written to the specified log location). If the flag isn't set, raise a custom exception

class MissingURLs(Exception):
    ...

with a message explaining how to use the flag or to fix the input.
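A rough sketch of how the flag could gate the two behaviours (the helper and its arguments are placeholders, not existing cautious-robot code):

import json

class MissingURLs(Exception):
    """Raised when the filename column is missing values for some URLs."""

def handle_missing_urls(missing_rows, filename_col, ignore_missing_url_entries, skipped_log_path):
    """Either log the bad rows and continue, or fail with a pointer to the flag."""
    if not missing_rows:
        return
    if ignore_missing_url_entries:
        # Record the skipped entries so the user can audit them afterwards.
        with open(skipped_log_path, "a") as f:
            for row in missing_rows:
                f.write(json.dumps(row) + "\n")
    else:
        raise MissingURLs(
            f"'{filename_col}' is missing values for {len(missing_rows)} URLs. "
            "Fix the input CSV or pass --ignore-missing-url-entries to skip them."
        )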

For the existing output directory issue:
The user might run a download again a) to retry failed downloads, b) because they added new entries to the input CSV, or c) to create differently resized images. In each case, it would be wasteful to overwrite already-downloaded full-size images. For a new resize parameter, the full-size folder should be referenced to create a new resized folder. To accommodate this, I think <img_dir>_downsized should change to <img_dir>_<downsample>. That is, append the resolution to the directory name rather than simply "downsized". (This functionality should also be revisited in a separate issue, since img.resize((downsample, downsample)) could increase resolution.)

So when the output location is found to already exist, cautious-robot should prune the entries from data where data[filename][i] is found in the output directory and try to download what remains. (should it also check to make sure that there is not something in the output dir that isn't in the CSV?) The resize code should be separated out of the download_images function so that it can be executed independently on whatever is in the full-size image directory.
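A rough sketch of that pruning step, assuming data is a pandas DataFrame and filename is the name of the filename column (as in the discussion above):

import os

import pandas as pd

def prune_downloaded(data: pd.DataFrame, filename: str, img_dir: str) -> pd.DataFrame:
    """Drop rows whose image already exists in img_dir, leaving only what still needs downloading."""
    if not os.path.isdir(img_dir):
        return data
    existing = set(os.listdir(img_dir))
    remaining = data[~data[filename].isin(existing)]
    # Optionally warn about files in img_dir that are not listed in the CSV.
    unexpected = existing - set(data[filename])
    if unexpected:
        print(f"Warning: {len(unexpected)} files in '{img_dir}' are not in the CSV.")
    return remaining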

With this functionality, I'm not sure when or why a user would want to overwrite any already existing output data, so an overwrite flag may not be needed unless I'm missing something?

@egrace479
Member Author

For the missing URLs issue: I'm in favor of adding a flag like --ignore-missing-url-entries, which, if set, would create another log file of the entries that were skipped (written to the specified log location). If the flag isn't set, raise a custom exception

class MissingURLs(Exception):
    ...

with a message explaining how to use the flag or to fix the input.

Agreed.

For the existing output directory issue: The user might run a download again a) to retry failed downloads, b) because they added new entries to the input CSV, or c) to create differently resized images. In each case, it would be wasteful to overwrite already-downloaded full-size images. For a new resize parameter, the full-size folder should be referenced to create a new resized folder. To accommodate this, I think <img_dir>_downsized should change to <img_dir>_<downsample>. That is, append the resolution to the directory name rather than simply "downsized". (This functionality should also be revisited in a separate issue, since img.resize((downsample, downsample)) could increase resolution.)

Yes, let's make that adjustment to the resized folder name. Our rewrite of the resize code definitely makes separating that out simpler, and it's a good idea.

So when the output location is found to already exist, cautious-robot should prune the entries from data where data[filename][i] is found in the output directory and try to download what remains. (should it also check to make sure that there is not something in the output dir that isn't in the CSV?) The resize code should be separated out of the download_images function so that it can be executed independently on whatever is in the full-size image directory.

With this functionality, I'm not sure when or why a user would want to overwrite any already existing output data, so an overwrite flag may not be needed unless I'm missing something?

If we remove the overwrite (instead checking whether each image has already been downloaded, as noted here), then it becomes more a question of whether a user wants to place images in a folder that already exists (and thus has contents that would be included in the checksum at the end). Not sure if that would warrant a flag.

@egrace479
Member Author

Suggested adjustment for avoiding overwrites: everything following if downsample (line 168) could be moved into a separate resize function, which could be given an exposed wrapper to allow for a direct call. Does that seem reasonable?

@egrace479
Member Author

@johnbradley and @thompsonmj, any thoughts on this?

Suggested adjustment for avoiding overwrites: everything following if downsample (line 168) could be moved into a separate resize function, which could be given an exposed wrapper to allow for a direct call. Does that seem reasonable?

@thompsonmj
Contributor

Yep, I think that's what I was going for. So with this change, download_images would check if downsample, then just have a call to this new resize method, passing in downsample_path (which would be a concatenation of the image directory name and the resize dimension) and the image filename. Then the new method can parse the downsample_path input for the input image directory name and the resize dimension, and use the input image directory name with the image filename to locate the input image to resize.
It could use the existing data logic as-is, since it skips overwriting already resized images.
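A sketch of the shape being described here, using Pillow (the function and parameter names are placeholders rather than the current cautious-robot API):

import os

from PIL import Image

def resize_image(downsample_path: str, image_filename: str) -> None:
    """Resize one full-size image into the <img_dir>_<downsample> directory."""
    # Parse the source directory and target resolution from the path,
    # e.g. "images_256" -> ("images", 256).
    img_dir, downsample = downsample_path.rsplit("_", 1)
    downsample = int(downsample)

    target_path = os.path.join(downsample_path, image_filename)
    if os.path.exists(target_path):
        return  # Skip images that were already resized.

    os.makedirs(downsample_path, exist_ok=True)
    with Image.open(os.path.join(img_dir, image_filename)) as img:
        img.resize((downsample, downsample)).save(target_path)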

@johnbradley

Breaking that part out sounds good to me.
