feat(experimental): Track jobs list for incremental download over time #721
Conversation
Force-pushed from cac5bd8 to 744f78e (compare)
| "continues downloading from the last saved progress thereafter until bootstrap is forced.\n" | ||
| "[NOTE] This command is still WIP and partially implemented right now", | ||
| ) | ||
| @cli_queue.command(name="incremental-output-download") |
why did we remove the help text?
Click uses the docstring for help if it's not supplied here. The other commands do it that way, so this change brings it in line with them.
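For reference, a minimal sketch of that pattern, assuming Click; the names below are illustrative, not the actual command definition:

```python
import click


@click.group(name="queue")
def cli_queue():
    """Commands for working with queues."""


@cli_queue.command(name="incremental-output-download")
def incremental_output_download():
    """Download new job output from a queue, continuing from the last saved checkpoint.

    Because no help= argument is passed to the command decorator, Click uses
    this docstring as the command's --help text.
    """
    click.echo("downloading...")
```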
| f"Continuing from: {current_download_state.downloads_completed_timestamp.astimezone().isoformat()}" | ||
| ) | ||
|
|
||
| logger.echo() |
typo?
No, this provides vertical space to separate different parts of the output.
# )
# WORKAROUND: Get all jobs with a SUCCEEDED or SUSPENDED task run status, and filter by endedAt client-side.
# We want to download everything that is succeeded or suspended, but not
# FAILED, CANCELED, or NOT_COMPATIBLE.
can a failed job not have partial outputs for the steps/tasks that were successfully finished?
Yes, that's something to think about as we refine the command. What should the default behavior be for failed jobs, and should we add an option to customize that?
I would think the default behavior should match the default behavior of the job download command. If it downloads partial outputs, automatic downloads should do the same.
# This is an upper bound to allow for eventual consistency into the materialized view that
# the deadline:SearchJobs API is based on. It's taken from numbers seen in heavy load testing,
# increased by a generous amount.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120
thanks for making this configurable!
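As an illustration of how such a bound is usually applied (the helper below is a sketch with assumed names, not the PR's code): the end of each query window is pulled back by the constant so rows that haven't yet propagated into the search index get picked up on a later pass.

```python
from datetime import datetime, timedelta, timezone

# Upper bound on the SearchJobs materialized-view lag, per the comment above.
EVENTUAL_CONSISTENCY_MAX_SECONDS = 120


def next_query_window(last_checkpoint: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the endedAt range to scan on this pass.

    Holding the end of the window back by EVENTUAL_CONSISTENCY_MAX_SECONDS means
    a job whose record is still propagating into the search index isn't skipped;
    it simply falls into the next window.
    """
    now = datetime.now(timezone.utc)
    return last_checkpoint, now - timedelta(seconds=EVENTUAL_CONSISTENCY_MAX_SECONDS)
```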
Force-pushed from b412c25 to 54a7e18 (compare)
farm_id,
queue_id,
filter_expression={
    "filters": [
I thought we could do 3 groups, e.g.:
group1 = SearchGroupFilter( READY, ASSIGNED, STARTING, OR )
group2 = SearchGroupFilter( SCHEDULED, RUNNING, OR )
aggregate = SearchGroupFilter( group1, group2, OR )
I see why.
provided_filter = {
The outer level needs the AND operator in _list_jobs_by_filter_expression, so combining would need a third nesting level.
I see - ok definitely need to work on the grouped filter instead.
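For what it's worth, a rough sketch of that extra nesting level as plain request data. The filters/groupFilter/operator shape follows my reading of the deadline:SearchJobs request structure; the leaf field name and operator values are assumptions to verify against the API docs:

```python
def status_filter(status: str) -> dict:
    # Hypothetical leaf filter matching a single task run status; the field
    # name "TASK_RUN_STATUS" and operator "EQUAL" should be double-checked.
    return {"stringFilter": {"name": "TASK_RUN_STATUS", "operator": "EQUAL", "value": status}}


group1 = {
    "groupFilter": {
        "filters": [status_filter(s) for s in ("READY", "ASSIGNED", "STARTING")],
        "operator": "OR",
    }
}
group2 = {
    "groupFilter": {
        "filters": [status_filter(s) for s in ("SCHEDULED", "RUNNING")],
        "operator": "OR",
    }
}

# _list_jobs_by_filter_expression applies AND at the outer level, so the OR of
# the two groups has to live one level down as its own grouped filter.
aggregate = {"groupFilter": {"filters": [group1, group2], "operator": "OR"}}

filter_expression = {"filters": [aggregate], "operator": "AND"}
```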
# 1. job["taskRunStatusCounts"]["SUCCEEDED"] stayed the same. Except for when a task is requeued, this count will always increase
#    when new output is available to download. If a task is requeued, this value could drop and then return to the same value
#    when new output is generated.
# 2. job["updatedAt"] stayed the same. If a task is requeued, this timestamp will be updated, so this catches anything missed
This isn't a true assumption, right? If a task is re-queued, the job updated_at doesn't change. If a job is re-queued, the job updated_at changes.
This is what I observed in testing. It's worth making double-sure, yeah.
I've double-checked, and found that you are correct. Looks like I hadn't properly re-checked that after I removed aws-deadline/deadline-cloud-samples#81 from my test farm (since its updates were setting updatedAt). We'll have to think through the requeue case again more carefully.
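For reference, a minimal sketch of the change check being discussed; the function name is made up, and the field access mirrors the snippet above. The requeue case raised in this thread is exactly what it can miss:

```python
def job_may_have_new_output(prev_job: dict, new_job: dict) -> bool:
    """Treat a job as unchanged only if both signals above are unchanged.

    Caveat from this thread: a requeued task can drop the SUCCEEDED count and
    later return it to the same value without bumping the job's updatedAt, so
    this heuristic alone can miss new output in that case.
    """
    prev_succeeded = prev_job["taskRunStatusCounts"].get("SUCCEEDED", 0)
    new_succeeded = new_job["taskRunStatusCounts"].get("SUCCEEDED", 0)
    if new_succeeded != prev_succeeded:
        return True
    return new_job["updatedAt"] != prev_job["updatedAt"]
```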
for job_id in new_job_ids:
    dc_job = download_candidate_jobs[job_id]

    # Call deadline:GetJob to retrieve attachments manifest information
Do we need this, or is this just for this iteration? Given that we can get the manifest info directly from S3 using the session actions at the end, why do we need this?
The job itself is the source of truth, so it's better to use the data from it directly. I believe the metadata attached to the S3 object won't be enough for everything we want either; we can look closer when we're doing that part.
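For context, a sketch of the GetJob call being discussed, assuming boto3; the IDs are placeholders and the exact response field names should be confirmed against the API docs:

```python
import boto3

client = boto3.client("deadline")

# Fetch the job record so attachment/manifest settings come from the job
# itself rather than from S3 object metadata. IDs below are placeholders.
job = client.get_job(farmId="farm-...", queueId="queue-...", jobId="job-...")

# The attachments block (when present) carries the manifest properties that
# the output download needs.
for manifest in job.get("attachments", {}).get("manifests", []):
    print(manifest.get("rootPath"), manifest.get("outputRelativeDirectories"))
```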
Force-pushed from b1fca65 to fe14c20 (compare)
* Move the eventual consistency max time to a constant.
* Add the full job state from the SearchJobs responses to the IncrementalDownloadState object. This is to help explore what options we have for tracking/optimizing job state changes.
* Use the _list_jobs_by_filter_expression function to collect all the candidate jobs for downloading tasks.
* In the incremental download function, perform a diff between the in-progress and new candidate job lists, printing info about changes.
* Print a summary of the job updates at the end.
* Remove debug printing from the pid file lock implementation and the incremental download state load/save.
* Adjust the CLI help output to use the docstring instead of the decorator. Shorten some of the option names where the context makes it clear. Add a default checkpoint directory for the operation.
* Modify how the opt-in environment variable works: the command is always available and shown in the help output, but running it only succeeds if the environment variable is set.
* Update the CLI tests to use the mocked search_jobs API, and implement a mocked get_job API that works the same way. The golden path test now calls the CLI twice, modifying the state of the mocked job between calls, and validates that the output of the CLI command reflects that job's state appropriately.

Signed-off-by: Mark <[email protected]>
Force-pushed from fe14c20 to 5a136d3 (compare)
What was the problem/requirement? (What/Why)
For incremental output download from a queue, we want to track all the jobs that have downloads started and compare them with all the jobs that might start new downloads.
What was the solution? (How)
Update the incremental output command to record the full state of the job as returned by SearchJobs, and show a diff with previous state. Running this command to view the output as a queue evolves over time will help refine how this should work.
* Move the eventual consistency max time to a constant.
* Add the full job state from the SearchJobs responses to the IncrementalDownloadState object. This is to help explore what options we have for tracking/optimizing job state changes.
* Use the _list_jobs_by_filter_expression function to collect all the candidate jobs for downloading tasks.
* In the incremental download function, perform a diff between the in-progress and new candidate job lists, printing info about changes.
* Print a summary of the job updates at the end.
* Remove debug printing from the pid file lock implementation and the incremental download state load/save.
* Adjust the CLI help output to use the docstring instead of the decorator. Shorten some of the option names where the context makes it clear. Add a default checkpoint directory for the operation.
* Modify how the opt-in environment variable works: the command is always available and shown in the help output, but running it only succeeds if the environment variable is set (see the sketch after this list).
* Update the CLI tests to use the mocked search_jobs API, and implement a mocked get_job API that works the same way. The golden path test now calls the CLI twice, modifying the state of the mocked job between calls, and validates that the output of the CLI command reflects that job's state appropriately.
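A minimal sketch of that opt-in gate, assuming Click and a placeholder environment-variable name (the real variable is defined in the PR, not here):

```python
import os

import click

# Placeholder name for illustration only.
OPT_IN_ENV_VAR = "DEADLINE_ENABLE_INCREMENTAL_OUTPUT_DOWNLOAD"


@click.command(name="incremental-output-download")
def incremental_output_download():
    """Download new job output from a queue, continuing from the last saved checkpoint."""
    # The command is always registered, so it appears in --help output, but it
    # refuses to run unless the user has explicitly opted in.
    if not os.environ.get(OPT_IN_ENV_VAR):
        raise click.ClickException(
            f"This experimental command requires the {OPT_IN_ENV_VAR} environment variable to be set."
        )
    click.echo("starting incremental output download...")
```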
What is the impact of this change?
Progress implementing the incremental download feature.
How was this change tested?
Wrote a separate script that continually submits random jobs, and viewed the output over time. Refined the output to include the summary at the end, cleaned up the debug prints, etc.
Was this change documented?
No.
Does this PR introduce new dependencies?
No.
Is this a breaking change?
No.
Does this change impact security?
No.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.