Conversation

Contributor
@mwiebe mwiebe commented Jul 1, 2025

What was the problem/requirement? (What/Why)

For incremental output download for a queue, we want to track all the jobs that have downloads started and compare that set with all the jobs that might start new downloads.

What was the solution? (How)

Update the incremental output command to record the full state of the job as returned by SearchJobs, and show a diff with previous state. Running this command to view the output as a queue evolves over time will help refine how this should work.

  • Move the eventual consistency max time to a constant.
  • Add the full job state from the SearchJobs responses to the
    IncrementalDownloadState object. This is to help explore what options
    we have for tracking/optimizing job state changes.
  • Use the _list_jobs_by_filter_expression function to collect
    all the candidate jobs for downloading tasks.
  • In the incremental download function, perform a diff between the
    in-progress and new candidate job lists, printing info about changes.
  • Print a summary of the job updates at the end.
  • Remove debug printing from the pid file lock implementation and the
    incremental download state load/save.
  • Adjust the CLI help output to use the docstring instead of the text
    from the decorator. Shorten some of the option names where the
    context makes them clear. Add a default checkpoint directory for the
    operation.
  • Modify how the opt-in environment variable works. Now the command
    is always available and shown in the help output, but running it only
    succeeds if the environment variable is set.
  • Update the CLI tests to use the mocked search_jobs API, and
    implement a mocked get_job API that works the same way. The golden
    path test now calls the CLI twice, modifying the state of the mocked
    job between calls, and validates that the output of the CLI command
    reflects that job's state appropriately.
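The diff between the in-progress and new candidate job lists described in the bullets above can be sketched as a pure function keyed by job ID. This is a sketch only: the function name is hypothetical, and the job dicts are assumed to be entries from SearchJobs responses.

```python
def diff_candidate_jobs(previous: dict, current: dict) -> dict:
    """Compare two {job_id: job_state} mappings and classify the changes.

    Hypothetical sketch of the diff described in the PR summary; the
    values are assumed to be SearchJobs response entries.
    """
    prev_ids = set(previous)
    curr_ids = set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(
            job_id
            for job_id in prev_ids & curr_ids
            if previous[job_id] != current[job_id]
        ),
    }
```

Keying by job ID keeps the diff independent of pagination order in the SearchJobs results; the "changed" bucket is what drives the per-job update messages and the end-of-run summary.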

What is the impact of this change?

Progress implementing the incremental download feature.

$ deadline queue -h
Usage: deadline queue [OPTIONS] COMMAND [ARGS]...

  Commands for queues.

Options:
  -h, --help  Show this message and exit.

Commands:
  export-credentials           Export queue credentials in a format...
  get                          Get the details of a queue.
  incremental-output-download  BETA - Downloads job attachments output...
  list                         Lists the available queues.
  paramdefs                    Lists a Queue's Parameters Definitions.

$ deadline queue incremental-output-download -h
Usage: deadline queue incremental-output-download [OPTIONS]

  BETA - Downloads job attachments output incrementally for all jobs in a
  queue. When run for the first time or with the --force-bootstrap option, it
  starts downloading from --bootstrap-lookback-minutes in the past. When run
  each subsequent time, it loads the previous checkpoint and continues where
  it left off.

  To try this command, set the ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD environment
  variable to 1 to acknowledge its incomplete beta status.

  [NOTE] This command is still WIP and partially implemented right now.

Options:
  --farm-id TEXT                  The AWS Deadline Cloud Farm to use.
  --queue-id TEXT                 The AWS Deadline Cloud Queue to use.
  --json                          Output is printed as JSON for scripting.
  --bootstrap-lookback-minutes FLOAT
                                  Downloads outputs for job-session-actions
                                  that have been completed since these many
                                  minutes at bootstrap. Default value is 0
                                  minutes.
  --checkpoint-dir TEXT           Proceed downloading from the previous
                                  progress file stored in this directory, if
                                  it exists. If the file does not exist, the
                                  download will initialize using the bootstrap
                                  lookback in minutes.
  --force-bootstrap               Forces command to start from the bootstrap
                                  lookback period and overwrite any previous
                                  checkpoint. Default value is False.
  --conflict-resolution [skip|overwrite|create_copy]
                                  How to handle downloads if an output file
                                  already exists: CREATE_COPY: Download the
                                  file with a new name, appending '(1)' to the
                                  end SKIP: Do not download the file OVERWRITE
                                  (default): Download and replace the existing
                                  file. Default behaviour is to OVERWRITE.
  -h, --help                      Show this message and exit.
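The opt-in behavior described in the summary (command always visible in help, execution gated on the environment variable) could be sketched like this. Only the ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD variable name comes from the help text above; the function name, exception type, and message wording are illustrative.

```python
import os


def check_beta_opt_in() -> None:
    """Raise unless the user has opted in to the beta command.

    Sketch of the gating described in the PR summary; the real CLI may
    raise a click-specific error type rather than RuntimeError.
    """
    if os.environ.get("ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD") != "1":
        raise RuntimeError(
            "Set the ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD environment variable "
            "to 1 to acknowledge this command's incomplete beta status."
        )
```

Checking at invocation time rather than at registration time is what keeps the command listed in `deadline queue -h` while still requiring the opt-in to run it.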

How was this change tested?

Wrote a separate script that continually submits random jobs, and viewed the output over time. Refined the output to include the summary at the end, cleaned up the debug prints, etc.

Was this change documented?

No.

Does this PR introduce new dependencies?

No.

Is this a breaking change?

No.

Does this change impact security?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mwiebe mwiebe force-pushed the incremental-job-list branch 6 times, most recently from cac5bd8 to 744f78e Compare July 1, 2025 18:31
"continues downloading from the last saved progress thereafter until bootstrap is forced.\n"
"[NOTE] This command is still WIP and partially implemented right now",
)
@cli_queue.command(name="incremental-output-download")
Contributor:
why did we remove the help text?

Contributor Author:
Click uses the docstring for help if it's not supplied here. The other commands do it that way, so this change brings it in line with them.

f"Continuing from: {current_download_state.downloads_completed_timestamp.astimezone().isoformat()}"
)

logger.echo()
Contributor:
typo?

Contributor Author:
No, this provides vertical space to separate different parts of the output.

# )
# WORKAROUND: Get all jobs with a SUCCEEDED or SUSPENDED task run status, and filter by endedAt client-side.
# We want to download everything that is succeeded or suspended, but not
# FAILED, CANCELED, or NOT_COMPATIBLE.
Contributor:
can a failed job not have partial outputs for the steps/tasks that were successfully finished?

Contributor Author:
Yes, that's something to think about as we refine the command. What should the default behavior be for failed jobs, and should we add an option to customize that?

Contributor:

I would think the default behaviour should match the job download command's default. If that downloads partial outputs, automatic downloads should do the same.
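The workaround in the quoted comment, fetching SUCCEEDED/SUSPENDED jobs and filtering by endedAt client-side, could look roughly like this. A sketch only: it assumes the job dicts carry ISO-8601 endedAt strings as in SearchJobs responses, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def filter_jobs_ended_after(jobs: list[dict], cutoff: datetime) -> list[dict]:
    """Client-side endedAt filter for the SearchJobs workaround.

    Sketch: assumes each job dict may carry an ISO-8601 'endedAt' string.
    Jobs without endedAt are kept, since they may still be producing output.
    """
    kept = []
    for job in jobs:
        ended_at = job.get("endedAt")
        if ended_at is None or datetime.fromisoformat(ended_at) >= cutoff:
            kept.append(job)
    return kept
```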

# This as an upper bound to allow for eventual consistency into the materialized view that
# the deadline:SearchJobs API is based on. It's taken from numbers seen in heavy load testing,
# increased by a generous amount.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120
Contributor:
thanks for making this configurable!
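One way the EVENTUAL_CONSISTENCY_MAX_SECONDS constant above might be applied is to pad the checkpoint timestamp backwards, so jobs that land late in the SearchJobs materialized view are not missed. The function name is hypothetical; the constant and its rationale come from the quoted snippet.

```python
from datetime import datetime, timedelta, timezone

# Upper bound for eventual consistency of the materialized view behind
# deadline:SearchJobs, as in the quoted snippet above.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120


def search_window_start(checkpoint: datetime) -> datetime:
    """Rewind the checkpoint by the consistency window so late-arriving
    SearchJobs results are still picked up on the next run."""
    return checkpoint - timedelta(seconds=EVENTUAL_CONSISTENCY_MAX_SECONDS)
```

Rewinding the window means some jobs are seen twice across runs, which is presumably why the state diff is keyed by job ID rather than by raw result lists.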

@mwiebe mwiebe force-pushed the incremental-job-list branch 5 times, most recently from b412c25 to 54a7e18 Compare July 2, 2025 00:47
@mwiebe mwiebe marked this pull request as ready for review July 2, 2025 00:52
@mwiebe mwiebe requested review from a team as code owners July 2, 2025 00:52
farm_id,
queue_id,
filter_expression={
"filters": [
Contributor:

I thought we could do 3 groups, e.g.:

group1 = SearchGroupFilter( READY, ASSIGNED, STARTING, OR )
group2 = SearchGroupFilter( SCHEDULED, RUNNING, OR )
aggregate = SearchGroupFilter( group1, group2, OR )
Contributor:

I see why. If the input were a list of filters, we might be able to optimize.

Contributor Author:
The outer level needs the AND operator in _list_jobs_by_filter_expression, so combining would need a third nesting level.
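The third nesting level mentioned above could look like this as a SearchJobs filterExpressions payload. This is a sketch of the shape only: the field names (groupFilter, stringFilter) follow my reading of the deadline:SearchJobs API and should be verified against its reference, and the status values come from the reviewer's example.

```python
def status_group(statuses: list[str]) -> dict:
    """Build an OR group matching any of the given task run status values.

    Sketch of the SearchJobs groupFilter shape; verify the field and
    filter names against the deadline:SearchJobs API reference.
    """
    return {
        "groupFilter": {
            "operator": "OR",
            "filters": [
                {"stringFilter": {"name": "TASK_RUN_STATUS", "operator": "EQUAL", "value": s}}
                for s in statuses
            ],
        }
    }


# The outer level is AND (as _list_jobs_by_filter_expression requires),
# so the OR-of-groups needs its own nesting level.
filter_expression = {
    "operator": "AND",
    "filters": [
        {
            "groupFilter": {
                "operator": "OR",
                "filters": [
                    status_group(["READY", "ASSIGNED", "STARTING"]),
                    status_group(["SCHEDULED", "RUNNING"]),
                ],
            }
        },
        # ...other AND-ed conditions would go here
    ],
}
```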

Contributor:
I see - ok definitely need to work on the grouped filter instead.

# 1. job["taskRunStatusCounts"]["SUCCEEDED"] stayed the same. Except for when a task is requeued, this count will always increase
# when new output is available to download. If a task is requeued, this value could drop and then return to the same value
# when new output is generated.
# 2. job["updatedAt"] stayed the same. If a task is requeued, this timestamp will be updated, so this catches anything missed
Contributor:

This isn't a true assumption, right? If a task is re-queued, the job's updated_at doesn't change; if a job is re-queued, the job's updated_at changes.

Contributor Author:
This is what I observed in testing. It's worth making double-sure, yeah.

Contributor Author (Jul 3, 2025):
I've double-checked, and found that you are correct. Looks like I hadn't properly re-checked that after I removed aws-deadline/deadline-cloud-samples#81 from my test farm (since its updates were setting updatedAt). We'll have to think through the requeue case again more carefully.
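The two checks in the quoted comment can be sketched as a predicate. Note the thread's conclusion that a task requeue does not bump the job's updatedAt, so this heuristic alone can miss requeued work; the function name is hypothetical.

```python
def job_may_have_new_output(old_job: dict, new_job: dict) -> bool:
    """Heuristic from the quoted comment: new output is likely when the
    SUCCEEDED task count or updatedAt changed.

    Per the review thread, a task requeue may change neither field, so
    this check is known to be incomplete for the requeue case.
    """
    old_succeeded = old_job.get("taskRunStatusCounts", {}).get("SUCCEEDED", 0)
    new_succeeded = new_job.get("taskRunStatusCounts", {}).get("SUCCEEDED", 0)
    if new_succeeded != old_succeeded:
        return True
    return new_job.get("updatedAt") != old_job.get("updatedAt")
```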

for job_id in new_job_ids:
dc_job = download_candidate_jobs[job_id]

# Call deadline:GetJob to retrieve attachments manifest information
Contributor:

Do we need this, or is it just for this iteration? Given that we can get the manifest info directly from S3 using the session actions at the end, why do we need this?

Contributor Author:

The job itself is the source of truth, so it's better to use its data directly. I believe the metadata attached to the S3 object won't be enough for everything we want either; we can look closer when we're doing that part.
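The per-job lookup in the quoted snippet could be sketched with an injected client so it can be exercised with a stub. The boto3 call is deadline's get_job (farmId/queueId/jobId); the wrapper function name and the exact shape of what is kept are illustrative.

```python
def fetch_attachment_details(client, farm_id: str, queue_id: str, job_ids: list[str]) -> dict:
    """Call deadline:GetJob for each new candidate job to retrieve its
    attachments information.

    Sketch: 'client' can be boto3.client('deadline') or any stub exposing
    the same get_job(farmId=..., queueId=..., jobId=...) signature.
    """
    details = {}
    for job_id in job_ids:
        job = client.get_job(farmId=farm_id, queueId=queue_id, jobId=job_id)
        details[job_id] = job.get("attachments")
    return details
```

Injecting the client mirrors how the PR's tests mock get_job alongside search_jobs rather than calling the real service.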

@mwiebe mwiebe force-pushed the incremental-job-list branch 2 times, most recently from b1fca65 to fe14c20 Compare July 3, 2025 17:40
@mwiebe mwiebe enabled auto-merge (rebase) July 3, 2025 17:51
@mwiebe mwiebe force-pushed the incremental-job-list branch from fe14c20 to 5a136d3 Compare July 3, 2025 20:04
sonarqubecloud bot commented Jul 3, 2025

Quality Gate failed

Failed conditions
4 Security Hotspots

See analysis details on SonarQube Cloud

@mwiebe mwiebe merged commit 9a9d59e into aws-deadline:mainline Jul 3, 2025
24 of 25 checks passed
@mwiebe mwiebe deleted the incremental-job-list branch July 3, 2025 21:45
3 participants