@davramov
Contributor

This PR adds the reconstruct_multinode() method for tomography reconstruction in orchestration/flows/bl832/nersc.py.

  • Adds the ability to request num_nodes when calling reconstruction, and to use the correct QOS for that node count.
  • Uses Shifter instead of Podman to handle containers.
  • By partitioning sinograms across multiple CPU nodes, I achieved near-linear speedup. With 8 nodes, the reconstruction itself ran close to 7x faster (not including overhead).
  • From here, I found the next main bottleneck was Podman-hpc, which pulls the microct image on every job (~90 seconds). By switching to Shifter with a pre-cached image, container startup dropped to ~2-3 seconds. The remaining overhead (~1 minute) is due to SFAPI and queuing on Perlmutter (not much we can improve here).
  • The sweet spot seems to be 4 CPU nodes in the realtime queue using Shifter, bringing down the total time from ~10 minutes to ~2 minutes. This balances the quick pickup by the realtime queue and the linear performance boost. Scaling beyond this requires the regular, demand, or premium queues, which have longer wait times that offset the reconstruction speedup (maybe we can ask Bjoern nicely for more nodes in the realtime queue).
  • For fun, I ran one test using 128 nodes, and while recon was fast (~30 seconds), the wait in the queue was close to 30 minutes.
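To illustrate the partitioning idea from the list above, here is a minimal sketch of splitting a sinogram index range into one contiguous chunk per node. The function name `partition_sinograms` and the example sinogram count are hypothetical and are not taken from the code in this PR; the actual split in reconstruct_multinode() may differ.

```python
def partition_sinograms(num_sinograms: int, num_nodes: int) -> list[tuple[int, int]]:
    """Split [0, num_sinograms) into num_nodes contiguous half-open ranges.

    Hypothetical helper for illustration only. Chunk sizes differ by at
    most one, so no node gets noticeably more work than another.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_sinograms < 0:
        raise ValueError("num_sinograms must be non-negative")

    base, extra = divmod(num_sinograms, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes each take one additional sinogram.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Assumed example: 2160 sinograms over the 4 nodes suggested above.
    for node, (lo, hi) in enumerate(partition_sinograms(2160, 4)):
        print(f"node {node}: sinograms [{lo}, {hi})")
```

Each node would then reconstruct only its own range, which is why the reconstruction step can scale close to linearly while the fixed overheads (container startup, SFAPI, queue wait) stay the same.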
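As a rough check on the numbers above, the sketch below works through the arithmetic: parallel efficiency for the 8-node run, and an approximate end-to-end time for the 128-node test. Both helper names are hypothetical, the simple additive model (queue wait + container startup + reconstruction) is an assumption, and 2.5 s is just the midpoint of the ~2-3 second Shifter startup quoted above.

```python
def parallel_efficiency(speedup: float, num_nodes: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return speedup / num_nodes


def total_wall_time_s(queue_wait_s: float, container_start_s: float, recon_s: float) -> float:
    """Assumed additive model: the job waits, starts its container, then reconstructs."""
    return queue_wait_s + container_start_s + recon_s


if __name__ == "__main__":
    # ~7x speedup on 8 nodes (reconstruction only, from the description above).
    print(f"8-node efficiency: {parallel_efficiency(7, 8):.1%}")  # 87.5%

    # 128-node test: ~30 min in the queue, ~30 s of reconstruction.
    t = total_wall_time_s(queue_wait_s=30 * 60, container_start_s=2.5, recon_s=30)
    print(f"128-node total: {t / 60:.1f} min")  # 30.5 min
```

Under this model the 128-node run takes roughly 30 minutes end to end despite the fast reconstruction, which is consistent with preferring 4 nodes in the realtime queue.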

Additionally, this PR improves the cancel_sfapi_job.py script and includes the reconstruction/multiresolution scripts used on Perlmutter.

@davramov mentioned this pull request on Jan 17, 2026