Generating MFC Images and Testing Them on OSPool #935


Draft
Malmahrouqi3 wants to merge 85 commits into master

Conversation

@Malmahrouqi3 (Collaborator) commented Jul 11, 2025

User description

Description

Concerning #654: this PR generates four images: CPU, CPU_Benchmark, GPU, and GPU_Benchmark. All MFC builds occur on a GitHub runner, while testing and storing the latest images take place on OSPool. The images are retrievable from the CI itself; each is a pre-built MFC with pre-installed packages, accessible with simple commands.
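
For orientation, here is a minimal sketch of what such a build-and-transfer workflow could look like. The job layout, the Apptainer install step, and the secret names SSH_KEY and OSPOOL_HOST are illustrative assumptions, not the contents of the actual container-image.yml:

```yaml
# Hypothetical sketch of the build/transfer flow; secret names and the
# destination path are assumptions, not the actual container-image.yml.
name: Container Images
on: [push, workflow_dispatch]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        variant: [cpu, cpu_bench, gpu, gpu_bench]
    steps:
      - uses: actions/checkout@v4
      - name: Install Apptainer
        run: |
          sudo add-apt-repository -y ppa:apptainer/ppa
          sudo apt-get update && sudo apt-get install -y apptainer
      - name: Build image
        run: apptainer build mfc_${{ matrix.variant }}.sif Singularity.${{ matrix.variant }}
      - name: Transfer to OSPool
        run: |
          echo "${{ secrets.SSH_KEY }}" > key && chmod 600 key
          scp -i key -o StrictHostKeyChecking=no \
            mfc_${{ matrix.variant }}.sif "${{ secrets.OSPOOL_HOST }}:~/images/"
```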

Debugging info:

  • To build an image locally: apptainer build mfc_cpu.sif Singularity.cpu
  • To start a shell instance: apptainer shell --fakeroot --writable-tmpfs mfc_cpu.sif
  • To execute specific commands directly: apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
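
For context, a CPU recipe along these lines would do the job. This is only a rough sketch; the base image, package list, and build flags are assumptions, not the exact Singularity.cpu in this PR:

```
# Rough sketch of a CPU recipe; base image and package list are assumptions.
Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get update && apt-get install -y \
        build-essential cmake gfortran git \
        libopenmpi-dev openmpi-bin \
        python3 python3-pip python3-venv
    git clone https://github.com/MFlowCode/MFC.git /opt/MFC
    cd /opt/MFC && ./mfc.sh build -j "$(nproc)"

%runscript
    cd /opt/MFC && exec ./mfc.sh "$@"
```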

To-dos:

  • Choose proper packages and a base container for each recipe.
  • Write an HTCondor test script to request specific allocations per image (see the sketch after this list).
  • Sanity-check by using the images on various resources/clusters.
  • Trigger builds on maintainer request if needed; otherwise, only the most recent images will be hosted.
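
A per-image HTCondor submit file might look roughly like the following; the resource requests, file names, and wrapper script are placeholders, not the configured batch files:

```
# Hypothetical submit file for testing the GPU image on OSPool.
# test_gpu.sh is assumed to run:
#   apptainer exec mfc_gpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
executable              = test_gpu.sh
transfer_input_files    = mfc_gpu.sif
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 4
request_gpus            = 1
request_memory          = 16GB
request_disk            = 20GB
log                     = test_gpu.log
output                  = test_gpu.out
error                   = test_gpu.err
queue
```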

Note to Self: the current secrets are hosted in the fork; before merging, new dedicated ones should be added to the base repo. To do so, request an access point under the "GATech_Bryngelson" project, then upload a public SSH key to https://registry.cilogon.org/. Later on, update the secrets, which include the private SSH key and user@host.

Refs:
NVIDIA Container


PR Type

Other


Description

  • Remove existing CI workflows and testing infrastructure

  • Add Singularity container image building workflow

  • Create four container definitions for CPU/GPU variants

  • Implement automated image building and testing on OSPool


Changes diagram

```mermaid
flowchart LR
  A["Old CI Workflows"] -- "removed" --> B["Deleted Files"]
  C["New Container Workflow"] -- "builds" --> D["Singularity Images"]
  D -- "stores on" --> E["OSPool"]
  F["Container Definitions"] -- "defines" --> G["CPU/GPU Variants"]
```

Changes walkthrough 📝

Relevant files

Miscellaneous (17 files)

| File | Change | Diff |
| --- | --- | --- |
| build.sh | Remove Frontier build script | +0/-9 |
| submit.sh | Remove Frontier job submission script | +0/-56 |
| test.sh | Remove Frontier test script | +0/-10 |
| bench.sh | Remove Phoenix benchmark script | +0/-20 |
| submit-bench.sh | Remove Phoenix benchmark submission script | +0/-64 |
| submit.sh | Remove Phoenix job submission script | +0/-64 |
| test.sh | Remove Phoenix test script | +0/-21 |
| bench.yml | Remove benchmark workflow | +0/-68 |
| cleanliness.yml | Remove code cleanliness workflow | +0/-127 |
| coverage.yml | Remove coverage check workflow | +0/-48 |
| docs.yml | Remove documentation workflow | +0/-76 |
| formatting.yml | Remove formatting check workflow | +0/-19 |
| line-count.yml | Remove line count workflow | +0/-54 |
| lint-source.yml | Remove source linting workflow | +0/-51 |
| lint-toolchain.yml | Remove toolchain linting workflow | +0/-17 |
| spelling.yml | Remove spell check workflow | +0/-17 |
| test.yml | Remove main test suite workflow | +0/-131 |

Enhancement (5 files)

| File | Change | Diff |
| --- | --- | --- |
| container-image.yml | Add Singularity image building workflow | +63/-0 |
| Singularity.cpu | Add CPU container definition | +24/-0 |
| Singularity.cpu_bench | Add CPU benchmark container definition | +27/-0 |
| Singularity.gpu | Add GPU container definition | +34/-0 |
| Singularity.gpu_bench | Add GPU benchmark container definition | +32/-0 |

@Malmahrouqi3 (Collaborator, Author) commented Jul 21, 2025

Status Update: I overslept on this. The concept itself works as intended. The persisting hurdle has been running out of disk space for the GPU images, since the NVIDIA HPC base container is roughly 5-8 GB. Clearing the cache on either side (GH runner or OSPool) doesn't help. I tried different base containers and recipe instructions, but no luck.

GPU Base Container: nvcr.io/nvidia/nvhpc:23.11-devel-cuda12.3-ubuntu22.04 is pinned to this older release because one of the CUDA tools was deprecated after CUDA 12.3, while ubuntu22.04 is the latest OS release with a recent Python version. nvhpc 23.11 is capable of compiling MFC with no compromises.

New Approach

Build: move the build process to self-hosted Phoenix.
Test: transfer images with scp; the Condor batch job files are already configured. CPU images would run on different CPU models and GPU images on different GPUs (a rough sketch of this flow follows below).
Migrate: transfer images to larger storage on OSDF, allowing them to be publicly accessed.
Enhance: add a flag for local builds, e.g. ./mfc.sh build --<Build Instructions> --image <Singularity File> <Image Name>, i.e. the recipes become simpler since they just copy the entire MFC dir with its custom build instructions and then enclose it in a container image.
Employ: use the NVIDIA HPC-Benchmarks base container to acquire benchmarking results; I would experiment with it for a while.
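
As a rough sketch of the Test step, the transfer-and-submit loop could be as simple as this (the access-point host and submit-file names are placeholders):

```bash
# Hypothetical transfer-and-test flow; host and submit-file names are placeholders.
for variant in cpu cpu_bench gpu gpu_bench; do
    scp "mfc_${variant}.sif" user@ap.ospool.example.org:~/images/
    ssh user@ap.ospool.example.org "condor_submit test_${variant}.sub"
done
```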

@sbryngelson (Member) commented Jul 21, 2025

These folks build all of nvhpc + openmpi + cuda into a Docker image using a standard GH runner, so far as I can tell: https://github.com/link89/github-action-demo/blob/cp2k-with-deepmd/cp2k/2025.1-cuda124-openmpi-avx512-psmp/build.sh. Can we just try this simpler approach first? It seems that some of the issues here come from the attempt to get it done in one shot. Perhaps try something easy first, then add complexity.

For example, building a simple gnu+mpi docker container that has MFC in it. We could even use this as an example for new users so they can get up and running without worrying about dependencies on their system.

@Malmahrouqi3 (Collaborator, Author) replied:

> These folks build all of nvhpc + openmpi + cuda into a Docker image using a standard GH runner, so far as I can tell: https://github.com/link89/github-action-demo/blob/cp2k-with-deepmd/cp2k/2025.1-cuda124-openmpi-avx512-psmp/build.sh. Can we just try this simpler approach first? It seems that some of the issues here come from the attempt to get it done in one shot. Perhaps try something easy first, then add complexity.
>
> For example, building a simple gnu+mpi docker container that has MFC in it. We could even use this as an example for new users so they can get up and running without worrying about dependencies on their system.

Ohh, interesting. I will try this approach and see if it can compile and run MFC on a GH runner. Converting between Docker and Singularity is not something to worry about anyway.
A gnu+mpi docker container should honestly be easy to make. I will add a container image recipe for that (a hypothetical starting point is sketched below).
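
As a starting point, such a gnu+mpi recipe could be as small as the following Dockerfile; the base image, package set, and build flags are assumptions:

```dockerfile
# Hypothetical gnu+mpi MFC container; base image and package set are assumptions.
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
        build-essential cmake gfortran git \
        libopenmpi-dev openmpi-bin \
        python3 python3-pip python3-venv \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/MFlowCode/MFC.git /opt/MFC \
    && cd /opt/MFC && ./mfc.sh build -j "$(nproc)"

WORKDIR /opt/MFC
CMD ["./mfc.sh", "test", "-a"]
```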
