This repository is a data analytics and reporting platform developed as part of the Summer 2025 Data Science for the Common Good (DS4CG) program in partnership with the Unity Research Computing Platform. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.
High-performance GPUs are a critical resource on shared clusters, but they are often underutilized due to inefficient job scheduling, over-allocation, or lack of user awareness. Many jobs request more GPU memory or compute than they actually use, leading to wasted resources and longer queue times for others. This project aims to address these issues by providing analytics and reporting tools that help users and administrators understand GPU usage patterns, identify inefficiencies, and make data-driven decisions to improve overall cluster utilization.
This project includes:
- Python scripts and modules for data preprocessing, analysis, and report generation
- Jupyter notebooks for interactive analysis and visualization
- Automated report generation scripts (see the `feature/reports` branch for the latest versions)
- Documentation built with MkDocs and Quarto
The following notebooks provide comprehensive analyses of two subsets of the data:

- `notebooks/analysis/No VRAM Use Analysis.ipynb`: Analysis of GPU jobs that end up using no VRAM.
- `notebooks/analysis/Requested and Used VRAM.ipynb`: Analysis of GPU jobs that request a specific amount of VRAM.
The following notebooks demonstrate key analyses and visualizations:

- `notebooks/module_demos/Basic Visualization.ipynb`: Basic plots and metrics
- `notebooks/module_demos/Efficiency Analysis.ipynb`: Calculation of efficiency metrics and user comparisons
- `notebooks/module_demos/Resource Hoarding.ipynb`: Analysis of CPU core and RAM overallocation
The `notebooks` directory contains all Jupyter notebooks.
This repository uses MkDocs for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.
To build and serve the documentation locally:
```bash
pip install -r dev-requirements.txt
mkdocs serve
```
To build the static site:
```bash
mkdocs build
```
To deploy the documentation (e.g., to GitHub Pages):
```bash
mkdocs gh-deploy
```
See the MkDocs documentation for more details and advanced usage.
For any new features, modules, or major changes, please add a corresponding .md file under the docs/ directory. This helps keep the project documentation up to date and useful for all users and contributors.
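If the site navigation is maintained explicitly, the new page also needs an entry in `mkdocs.yml`. A hypothetical fragment is shown below; mirror the repository's actual `nav` layout rather than this sketch:

```yaml
# Hypothetical mkdocs.yml fragment -- follow the existing nav structure.
nav:
  - Home: index.md
  - My New Feature: my-new-feature.md  # points at docs/my-new-feature.md
```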
The primary dataset for this project is a DuckDB database that contains information about jobs on Unity. It is located at `unity.rc.umass.edu:/modules/admin-resources/reporting/slurm_data.db` and is updated daily.
The schema is provided below, followed by an example query against it. In addition to the columns in the DuckDB file, this repository contains tools to add a number of useful derived columns for visualization and analysis.
| Column | Type | Description |
|---|---|---|
| UUID | VARCHAR | Unique identifier |
| JobID | INTEGER | Slurm job ID |
| ArrayID | INTEGER | Position in job array |
| ArrayJobID | INTEGER | Slurm job ID within array |
| JobName | VARCHAR | Name of job |
| IsArray | BOOLEAN | Indicator if job is part of an array |
| Interactive | VARCHAR | Indicator if job was interactive |
| Preempted | BOOLEAN | Was job preempted |
| Account | VARCHAR | Slurm account (PI group) |
| User | VARCHAR | Unity user |
| Constraints | VARCHAR[] | Job constraints |
| QOS | VARCHAR | Job QOS |
| Status | VARCHAR | Job status on termination |
| ExitCode | VARCHAR | Job exit code |
| SubmitTime | TIMESTAMP_NS | Job submission time |
| StartTime | TIMESTAMP_NS | Job start time |
| EndTime | TIMESTAMP_NS | Job end time |
| Elapsed | INTEGER | Job runtime (seconds) |
| TimeLimit | INTEGER | Job time limit (seconds) |
| Partition | VARCHAR | Job partition |
| Nodes | VARCHAR | Job nodes as compact string |
| NodeList | VARCHAR[] | List of job nodes |
| CPUs | SMALLINT | Number of CPU cores |
| Memory | INTEGER | Job allocated memory (bytes) |
| GPUs | SMALLINT | Number of GPUs requested |
| GPUType | DICT | Dictionary with keys as type of GPU (str) and the values as number of GPUs corresponding to that type (int) |
| GPUMemUsage | FLOAT | GPU memory usage (bytes) |
| GPUComputeUsage | FLOAT | GPU compute usage (pct) |
| CPUMemUsage | FLOAT | CPU memory usage (bytes) |
| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
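As a quick orientation to the schema, here is a minimal sketch that opens the database read-only and counts GPU jobs that used no VRAM (the subset examined in `No VRAM Use Analysis.ipynb`). The table name `jobs` is an assumption; check the actual name with `SHOW TABLES` first.

```python
import duckdb

# Path on Unity (see above); adjust if you have copied the file elsewhere.
DB_PATH = "/modules/admin-resources/reporting/slurm_data.db"

# Open read-only so the daily update is never blocked.
con = duckdb.connect(DB_PATH, read_only=True)

# NOTE: the table name "jobs" is an assumption -- run
# con.sql("SHOW TABLES").show() to find the real one.
summary = con.sql(
    """
    SELECT
        count(*) AS gpu_jobs,
        count(*) FILTER (WHERE GPUMemUsage = 0) AS no_vram_jobs
    FROM jobs
    WHERE GPUs > 0
    """
).df()
print(summary)
```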
To set up your development environment, use the provided dev-requirements.txt for all development dependencies (including linting, testing, and documentation tools).
This project requires Python 3.11. Make sure you have Python 3.11 installed before creating the virtual environment.
On Windows (PowerShell):

```powershell
python -m venv duckdb
.\duckdb\Scripts\Activate.ps1
pip install -r requirements.txt
pip install -r dev-requirements.txt
```
On macOS/Linux:

```bash
python -m venv duckdb
source duckdb/bin/activate
pip install -r requirements.txt
pip install -r dev-requirements.txt
```
If you need to reset your environment, you can delete the `duckdb` directory and recreate it as above.
To provide the path of the git configuration file of this project to git, run:
```bash
git config --local include.path ../.gitconfig
```
To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:
```bash
git config --local core.autocrlf input
```
You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment visible in Jupyter, run
```bash
python -m ipykernel install --user --name "Duck DB"
```
from within the environment. This will add "Duck DB" as a kernel option in the dropdown.
By default, Jupyter notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the file name of the notebook to `scripts/strip_notebook_exclude.txt`, as shown below.
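For example, to keep the outputs of `Basic Visualization.ipynb` (one of the notebooks listed above), append its file name to the exclusion list. This assumes the filter matches on bare file names; check the strip script for the exact matching rule if in doubt:

```bash
echo "Basic Visualization.ipynb" >> scripts/strip_notebook_exclude.txt
```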
This repository uses Ruff for linting and formatting. All code must pass Ruff checks before being committed. To run Ruff:
```bash
ruff check .
ruff format .
```
If using VS Code, you can install the Ruff extension for an integrated experience.
Ruff automatically checks your code when you open or edit Python or Jupyter Notebook files. You can use the extension to:
- View diagnostics: Linting issues are highlighted and can be viewed in the Problems panel (Ctrl+Shift+M).
- Fix issues: Use Quick Fix options for individual issues.
- Run the `Ruff: Fix all auto-fixable problems` command to address multiple violations. By default, unsafe fixes require explicit configuration in your Ruff configuration file.
- Format code: Use the `Format Document` command or configure formatting on save.
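Ruff's behavior (including which fixes count as unsafe) is driven by the project configuration, conventionally in `pyproject.toml`. The following is a hypothetical fragment of that kind, not the repository's actual settings:

```toml
# Hypothetical Ruff configuration -- the repository's pyproject.toml is authoritative.
[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle errors, pyflakes, import sorting
```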
This repository uses Mypy for static type checking in Python. All code should pass Mypy checks. To run Mypy from a terminal:
```bash
mypy --config-file pyproject.toml .
```
If using VS Code, you can install the Mypy Type Checker extension for an integrated experience. Adjust the extension settings along the following lines:
"mypy-type-checker.args": [
"--config-file=pyproject.toml"
]
"mypy-type-checker.cwd": "${workspaceFolder}/ds4cg-job-analytics"
"mypy-type-checker.preferDaemon": true
"mypy-type-checker.reportingScope": "workspace"
To manage Mypy with this extension, you can use the following commands:
- `Mypy: Recheck Workspace`: Re-run Mypy on the workspace.
- `Mypy: Restart Daemon and Recheck Workspace`: Restart the Mypy daemon and recheck.
All Python code should use Google-style docstrings. Example template:
```python
def example_function(arg1: int, arg2: str) -> None:
    """
    Brief description of what the function does.

    Args:
        arg1 (int): Description of arg1.
        arg2 (str): Description of arg2.

    Returns:
        None
    """
    # ...function code...
```
To run tests, use the provided test scripts or pytest (if available):
```bash
pytest
```
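Tests follow standard pytest conventions. Below is a minimal, hypothetical example of the style; the function under test is invented for illustration, based only on the `Elapsed` and `TimeLimit` columns being in seconds:

```python
# tests/test_utilization.py -- hypothetical example, not an actual repository test.
import pytest


def time_limit_utilization(elapsed: int, time_limit: int) -> float:
    """Fraction of the requested time limit a job actually used (both in seconds)."""
    if time_limit <= 0:
        raise ValueError("time_limit must be positive")
    return elapsed / time_limit


def test_half_of_limit_used() -> None:
    assert time_limit_utilization(1800, 3600) == pytest.approx(0.5)


def test_nonpositive_limit_rejected() -> None:
    with pytest.raises(ValueError):
        time_limit_utilization(10, 0)
```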
The Unity documentation (https://docs.unity.rc.umass.edu/) has plenty of useful information about Unity and Slurm which would be helpful in understanding the data. For specific issues with the code in this repo or the DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.