Capture SP task dumps in support bundles #8177
Conversation
Force-pushed from 7380a0e to a4b0acf
Update the support bundle collector to capture task dumps from the SPs.
Force-pushed from a4b0acf to dbb6669
Tested on Dublin:
@@ -505,6 +505,7 @@ impl BackgroundTasksInitializer {
                task_impl: Box::new(
                    support_bundle_collector::SupportBundleCollector::new(
                        datastore.clone(),
                        resolver.clone(),
Maybe this isn't the best approach, given the conversation in the control plane huddle today. On the other hand, we're very unlikely to be collecting bundles fast enough to have a real impact.
I'm fine to punt on this until we sort out the progenitor integration with qorb more holistically
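For context, a minimal sketch of the shape of this change, with placeholder types standing in for the real Nexus ones; the struct layout and parameter names here are assumptions, not the actual definition:

```rust
use std::sync::Arc;

// Placeholder stand-ins for the real Nexus types; only the shape of the
// change is sketched, not the actual struct definition.
struct DataStore;
struct Resolver;

// The collector holds an internal DNS resolver so it can locate MGS when
// capturing SP task dumps.
struct SupportBundleCollector {
    datastore: Arc<DataStore>,
    resolver: Resolver,
}

impl SupportBundleCollector {
    fn new(datastore: Arc<DataStore>, resolver: Resolver) -> Self {
        Self { datastore, resolver }
    }
}
```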
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| {
    format!("failed to create SP task dump directory {sp_dumps_dir}")
})?;
let sp_dumps_fut =
We should strongly consider modifying SupportBundleCollectionReport in nexus/types/src/internal_api/background.rs to identify whether or not collection of the SP dumps was successful.
Good call, added.
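A minimal sketch of what such a report field could look like; the real struct in nexus/types/src/internal_api/background.rs has other fields, and the field name below is an assumption rather than the merged change:

```rust
/// Sketch only: the real report carries additional fields describing the
/// sled data collection, elided here.
#[derive(Debug, Default)]
pub struct SupportBundleCollectionReport {
    // ...existing fields elided...

    /// Whether listing the SPs and saving their task dumps succeeded.
    /// (Field name is an assumption for illustration.)
    pub listed_sps: bool,
}
```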
let sp_dumps_dir = dir.path().join("sp_task_dumps");
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| {
    format!("failed to create SP task dump directory {sp_dumps_dir}")
})?;
This looks replicated from above? We should define this variable / create_dir_all in one spot, perhaps?
Whoops, forgot to remove this one when I moved the start above the sled tasks.
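A minimal sketch of one way to keep that setup in a single spot; ensure_sp_dumps_dir is a hypothetical helper name, not the function in the PR:

```rust
use anyhow::Context;
use camino::{Utf8Path, Utf8PathBuf};

/// Hypothetical helper: create the SP task dump directory exactly once
/// and return its path, so the setup isn't repeated at two call sites.
async fn ensure_sp_dumps_dir(dir: &Utf8Path) -> anyhow::Result<Utf8PathBuf> {
    let sp_dumps_dir = dir.join("sp_task_dumps");
    tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| {
        format!("failed to create SP task dump directory {sp_dumps_dir}")
    })?;
    Ok(sp_dumps_dir)
}
```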
    .context("failed to get list of SPs from MGS")?
    .into_inner();

let mut futures = futures::stream::iter(all_sps.into_iter())
I recommend checking out parallel-task-set, which we added relatively recently, to perform this task-saving operation in parallel. It should be quite similar to using buffer_unordered, but it will actually spawn a tokio task for each SP, and we can still set a limit on the maximum amount of parallelism.
Great, that's very nice. I went with the default parallelism since the requests are very light.
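For readers following along, a minimal sketch of the bounded-concurrency pattern under discussion, using buffer_unordered as a stand-in since the parallel-task-set API isn't shown in this thread; the simplified SpIdentifier type and the save_sp_dumps helper are assumptions:

```rust
use futures::StreamExt;

// Simplified stand-in; the real code uses the generated MGS client's SP
// identifier type (which also carries the SP type).
#[derive(Clone, Copy)]
struct SpIdentifier {
    slot: u32,
}

async fn save_sp_dumps(sp: SpIdentifier) -> anyhow::Result<()> {
    // The real helper fetches each task dump from MGS and writes it under
    // sp_task_dumps/; elided here.
    println!("saving task dumps for SP in slot {}", sp.slot);
    Ok(())
}

async fn collect_all_sp_dumps(all_sps: Vec<SpIdentifier>) {
    // Issue at most 8 requests at a time. parallel-task-set gives a similar
    // bound, but spawns a real tokio task per SP instead of polling all the
    // futures on one task.
    let results: Vec<anyhow::Result<()>> = futures::stream::iter(all_sps)
        .map(save_sp_dumps)
        .buffer_unordered(8)
        .collect()
        .await;

    for result in results {
        if let Err(e) = result {
            eprintln!("failed to save SP task dumps: {e:#}");
        }
    }
}
```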
    sp_dumps_dir: &Utf8Path,
) -> anyhow::Result<()> {
    let dump_count = mgs_client
        .sp_task_dump_count(sp.type_, sp.slot)
When / how are these task dumps deleted? Just trying to understand if there's a TOCTTOU issue between "get # of tasks" vs "iterate over them"
AFAIK they're never removed during standard operation. So we may miss a dump created after fetching the count, but would never have one removed out from under us. Humility can clear them out with --initialize-dump-agent, and I think upgrades will as well.
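To illustrate why that ordering is safe, a small sketch of the count-then-fetch pattern; the client type and method signatures here are stand-ins, not the real generated MGS client API:

```rust
// Illustrative stand-in for the MGS client used by the collector.
struct MgsClient;

impl MgsClient {
    async fn sp_task_dump_count(&self) -> anyhow::Result<u32> {
        Ok(2)
    }
    async fn sp_task_dump_get(&self, index: u32) -> anyhow::Result<Vec<u8>> {
        Ok(vec![index as u8])
    }
}

async fn fetch_task_dumps(client: &MgsClient) -> anyhow::Result<Vec<Vec<u8>>> {
    // Dumps aren't deleted during normal operation, so indices 0..count stay
    // valid for the whole loop; a dump created after the count is fetched is
    // simply missed until the next collection.
    let count = client.sp_task_dump_count().await?;
    let mut dumps = Vec::with_capacity(count as usize);
    for index in 0..count {
        dumps.push(client.sp_task_dump_get(index).await?);
    }
    Ok(dumps)
}
```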
@@ -593,6 +611,12 @@ impl BundleCollection {
    }
}

if let Err(e) = sp_dumps_fut.await {
Most minor of nits; it still seems weird that this "sandwiches" the sled-querying code. Why not use sp_dumps_fut immediately, next to where it is declared?
My initial thinking was to allow both the sled and SP tasks to run concurrently, but the SP calls should be very quick so I don't think we're really getting any benefit. Moved the await up top as you've suggested.
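A minimal sketch of the resulting ordering, with placeholder helper names rather than the actual collection code:

```rust
// Sketch of the reordering: await the SP dump collection right where it is
// created, then walk the sleds. Helper names are placeholders.
async fn collect_bundle() -> anyhow::Result<()> {
    // The MGS calls are quick, so overlapping them with the sled work buys
    // little; a failure here is logged rather than failing the whole bundle.
    if let Err(e) = collect_sp_task_dumps().await {
        eprintln!("failed to capture SP task dumps: {e:#}");
    }

    // Then collect per-sled data as before.
    collect_sled_data().await
}

async fn collect_sp_task_dumps() -> anyhow::Result<()> {
    Ok(())
}

async fn collect_sled_data() -> anyhow::Result<()> {
    Ok(())
}
```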