feat(experimental): Track jobs list for incremental download over time #721
Conversation
Force-pushed from cac5bd8 to 744f78e (compare)
| "continues downloading from the last saved progress thereafter until bootstrap is forced.\n" | ||
| "[NOTE] This command is still WIP and partially implemented right now", | ||
| ) | ||
| @cli_queue.command(name="incremental-output-download") |
why did we remove the help text?
Click uses the docstring for help if it's not supplied here. The other commands do it that way, so this change brings it in line with them.
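For reference, a minimal sketch of that pattern, assuming Click; the names below are illustrative, not the actual command definition:

```python
import click


@click.group(name="queue")
def cli_queue():
    """Commands for working with queues."""


@cli_queue.command(name="incremental-output-download")
def incremental_output_download():
    """Download new job output from a queue, continuing from the last saved checkpoint.

    Because no help= argument is passed to the command decorator, Click uses
    this docstring as the command's --help text.
    """
    click.echo("downloading...")
```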
| f"Continuing from: {current_download_state.downloads_completed_timestamp.astimezone().isoformat()}" | ||
| ) | ||
|
|
||
| logger.echo() |
typo?
No, this provides vertical space to separate different parts of the output.
# )
# WORKAROUND: Get all jobs with a SUCCEEDED or SUSPENDED task run status, and filter by endedAt client-side.
# We want to download everything that is succeeded or suspended, but not
# FAILED, CANCELED, or NOT_COMPATIBLE.
can a failed job not have partial outputs for the steps/tasks that were successfully finished?
Yes, that's something to think about as we refine the command. What should the default behavior be for failed jobs, and should we add an option to customize that?
I would think the default behavior should match the default behavior of the job download command. If it downloads partial outputs, automatic downloads should do the same.
# This is an upper bound to allow for eventual consistency into the materialized view that
# the deadline:SearchJobs API is based on. It's taken from numbers seen in heavy load testing,
# increased by a generous amount.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120
thanks for making this configurable!
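As an illustration of how such a bound is usually applied (the helper below is a sketch with assumed names, not the PR's code): the end of each query window is pulled back by the constant so rows that haven't yet propagated into the search index get picked up on a later pass.

```python
from datetime import datetime, timedelta, timezone

# Upper bound on the SearchJobs materialized-view lag, per the comment above.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120


def next_query_window(last_checkpoint: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the endedAt range to scan on this pass.

    Holding the end of the window back by EVENTUAL_CONSISTENCY_MAX_SECONDS means
    a job whose record is still propagating into the search index isn't skipped;
    it simply falls into the next window.
    """
    now = datetime.now(timezone.utc)
    return last_checkpoint, now - timedelta(seconds=EVENTUAL_CONSISTENCY_MAX_SECONDS)
```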
Force-pushed from b412c25 to 54a7e18 (compare)
farm_id,
queue_id,
filter_expression={
    "filters": [
I thought we could do 3 groups, e.g.:
group1 = SearchGroupFilter( READY, ASSIGNED, STARTING, OR )
group2 = SearchGroupFilter( SCHEDULED, RUNNING, OR )
aggregate = SearchGroupFilter( group1, group2, OR )
I see why.
provided_filter = {
The outer level needs the AND operator in _list_jobs_by_filter_expression, so combining would need a third nesting level.
I see - ok definitely need to work on the grouped filter instead.
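For what it's worth, a rough sketch of that extra nesting level as plain request data. The filters/groupFilter/operator shape follows my reading of the deadline:SearchJobs request structure; the leaf field name and operator values are assumptions to verify against the API docs:

```python
def status_filter(status: str) -> dict:
    # Hypothetical leaf filter matching a single task run status; the field
    # name "TASK_RUN_STATUS" and operator "EQUAL" should be double-checked.
    return {"stringFilter": {"name": "TASK_RUN_STATUS", "operator": "EQUAL", "value": status}}


group1 = {
    "groupFilter": {
        "filters": [status_filter(s) for s in ("READY", "ASSIGNED", "STARTING")],
        "operator": "OR",
    }
}
group2 = {
    "groupFilter": {
        "filters": [status_filter(s) for s in ("SCHEDULED", "RUNNING")],
        "operator": "OR",
    }
}

# _list_jobs_by_filter_expression applies AND at the outer level, so the OR of
# the two groups has to live one level down as its own grouped filter.
aggregate = {"groupFilter": {"filters": [group1, group2], "operator": "OR"}}

filter_expression = {"filters": [aggregate], "operator": "AND"}
```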
# 1. job["taskRunStatusCounts"]["SUCCEEDED"] stayed the same. Except for when a task is requeued, this count will always increase
#    when new output is available to download. If a task is requeued, this value could drop and then return to the same value
#    when new output is generated.
# 2. job["updatedAt"] stayed the same. If a task is requeued, this timestamp will be updated, so this catches anything missed
This isn't a true assumption, right? If a task is re-queued, the job updated_at doesn't change. If a job is re-queued, the job updated_at changes.
This is what I observed in testing. It's worth making double-sure, yeah.
I've double-checked, and found that you are correct. Looks like I hadn't properly re-checked that after I removed aws-deadline/deadline-cloud-samples#81 from my test farm (since its updates were setting updatedAt). We'll have to think through the requeue case again more carefully.
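For reference, a minimal sketch of the change check being discussed; the function name is made up, and the field access mirrors the snippet above. The requeue case raised in this thread is exactly what it can miss:

```python
def job_may_have_new_output(prev_job: dict, new_job: dict) -> bool:
    """Treat a job as unchanged only if both signals above are unchanged.

    Caveat from this thread: a requeued task can drop the SUCCEEDED count and
    later return it to the same value without bumping the job's updatedAt, so
    this heuristic alone can miss new output in that case.
    """
    prev_succeeded = prev_job["taskRunStatusCounts"].get("SUCCEEDED", 0)
    new_succeeded = new_job["taskRunStatusCounts"].get("SUCCEEDED", 0)
    if new_succeeded != prev_succeeded:
        return True
    return new_job["updatedAt"] != prev_job["updatedAt"]
```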
for job_id in new_job_ids:
    dc_job = download_candidate_jobs[job_id]

    # Call deadline:GetJob to retrieve attachments manifest information
Do we need this, or is this just for this iteration? Given that we can get the manifest info directly from S3 using the session actions at the end, why do we need this?
The job itself is the source of truth, so it's better to use the data from it directly. I believe the metadata attached to the S3 object won't be enough for everything we want either; we can look closer when we're doing that part.
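For context, a sketch of the GetJob call being discussed, assuming boto3; the IDs are placeholders and the exact response field names should be confirmed against the API docs:

```python
import boto3

client = boto3.client("deadline")

# Fetch the job record so attachment/manifest settings come from the job
# itself rather than from S3 object metadata. IDs below are placeholders.
job = client.get_job(farmId="farm-...", queueId="queue-...", jobId="job-...")

# The attachments block (when present) carries the manifest properties that
# the output download needs.
for manifest in job.get("attachments", {}).get("manifests", []):
    print(manifest.get("rootPath"), manifest.get("outputRelativeDirectories"))
```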
Force-pushed from b1fca65 to fe14c20 (compare)
* Move the eventual consistency max time to a constant.
* Add the full job state from the SearchJobs responses to the IncrementalDownloadState object. This is to help explore what options we have for tracking/optimizing job state changes.
* Use the _list_jobs_by_filter_expression function to collect all the candidate jobs for downloading tasks.
* In the incremental download function, perform a diff between the in-progress and new candidate job lists, printing info about changes.
* Print a summary of the job updates at the end.
* Remove debug printing from the pid file lock implementation and the incremental download state load/save.
* Adjust the CLI help output to use the docstring instead of the decorator. Shorten some of the option names where the context makes it clear. Add a default checkpoint directory for the operation.
* Modify how the opt-in environment variable works: the command is always available and shown in the help output, but running it only succeeds if the environment variable is set.
* Update the CLI tests to use the mocked search_jobs API, and implement a mocked get_job API that works the same way. The golden path test now calls the CLI twice, modifying the state of the mocked job between calls, and validates that the output of the CLI command reflects that job's state appropriately.

Signed-off-by: Mark <[email protected]>
Force-pushed from fe14c20 to 5a136d3 (compare)
What was the problem/requirement? (What/Why)
For incremental output download from a queue, we want to track all the jobs that have downloads started and compare them with all the jobs that might start new downloads.
What was the solution? (How)
Update the incremental output command to record the full state of the job as returned by SearchJobs, and show a diff with previous state. Running this command to view the output as a queue evolves over time will help refine how this should work.
* Move the eventual consistency max time to a constant.
* Add the full job state from the SearchJobs responses to the IncrementalDownloadState object. This is to help explore what options we have for tracking/optimizing job state changes.
* Use the _list_jobs_by_filter_expression function to collect all the candidate jobs for downloading tasks.
* In the incremental download function, perform a diff between the in-progress and new candidate job lists, printing info about changes.
* Print a summary of the job updates at the end.
* Remove debug printing from the pid file lock implementation and the incremental download state load/save.
* Adjust the CLI help output to use the docstring instead of the decorator. Shorten some of the option names where the context makes it clear. Add a default checkpoint directory for the operation.
* Modify how the opt-in environment variable works: the command is always available and shown in the help output, but running it only succeeds if the environment variable is set (see the sketch after this list).
* Update the CLI tests to use the mocked search_jobs API, and implement a mocked get_job API that works the same way. The golden path test now calls the CLI twice, modifying the state of the mocked job between calls, and validates that the output of the CLI command reflects that job's state appropriately.
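A minimal sketch of that opt-in gate, assuming Click and a placeholder environment-variable name (the real variable is defined in the PR, not here):

```python
import os

import click

# Placeholder name for illustration only.
OPT_IN_ENV_VAR = "DEADLINE_ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD"


@click.command(name="incremental-output-download")
def incremental_output_download():
    """Download new job output from a queue, continuing from the last saved checkpoint."""
    # The command is always registered, so it appears in --help output, but it
    # refuses to run unless the user has explicitly opted in.
    if not os.environ.get(OPT_IN_ENV_VAR):
        raise click.ClickException(
            f"This experimental command requires the {OPT_IN_ENV_VAR} environment variable to be set."
        )
    click.echo("starting incremental output download...")
```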
What is the impact of this change?
Progress implementing the incremental download feature.
How was this change tested?
Wrote a separate script that continually submits random jobs, and viewed the output over time. Refined the output to include the summary at the end, cleaned up the debug prints, etc.
Was this change documented?
No.
Does this PR introduce new dependencies?
No.
Is this a breaking change?
No.
Does this change impact security?
No.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.