HPC sharding for stability analysis #44

@jc-macdonald

Summary

Add SLURM-friendly parallelisation to analysis/stability_analysis.py so the simulation grid can be distributed across array jobs.

New CLI flags

Flag              Purpose
--reps INT        Override REPS_FULL (default: 10). Set to 200 for publication.
--task-id INT     This worker's 0-based index. Typically $SLURM_ARRAY_TASK_ID.
--num-tasks INT   Total number of workers.
--shard-dir PATH  Write per-task JSON shards here (e.g. results/shards/). Each task writes stab_{task_id}.json and cov_{task_id}.json.
--merge           Merge mode: concatenate all shards from --shard-dir into final stability_results.json + coverage_results.json in --output-dir, then generate all figures.
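
A minimal argparse sketch of the flags above may help pin down how they interact. The function names build_parser and validate_args are hypothetical, and the validation rules (both sharding flags required together, --shard-dir required for sharded runs and for --merge) are assumptions rather than anything the existing script is known to enforce.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags proposed in this issue."""
    parser = argparse.ArgumentParser(prog="stability_analysis.py")
    parser.add_argument("--reps", type=int, default=None,
                        help="Override REPS_FULL.")
    parser.add_argument("--task-id", type=int, default=None,
                        help="0-based index of this worker.")
    parser.add_argument("--num-tasks", type=int, default=None,
                        help="Total number of workers.")
    parser.add_argument("--shard-dir", default=None,
                        help="Directory for per-task JSON shards.")
    parser.add_argument("--merge", action="store_true",
                        help="Merge shards instead of running simulations.")
    return parser


def validate_args(parser, args):
    """Reject inconsistent flag combinations (these rules are assumptions)."""
    if (args.task_id is None) != (args.num_tasks is None):
        parser.error("--task-id and --num-tasks must be given together")
    if args.num_tasks is not None:
        if args.num_tasks < 1:
            parser.error("--num-tasks must be at least 1")
        if not 0 <= args.task_id < args.num_tasks:
            parser.error("--task-id must satisfy 0 <= task_id < num_tasks")
    if args.reps is not None and args.reps < 1:
        parser.error("--reps must be at least 1")
    if (args.merge or args.num_tasks is not None) and args.shard_dir is None:
        parser.error("--shard-dir is required for sharded runs and for --merge")
    return args


# Example: the arguments one array task would receive.
parser = build_parser()
args = validate_args(parser, parser.parse_args(
    ["--reps", "200", "--task-id", "3", "--num-tasks", "560",
     "--shard-dir", "results/shards/"]))
```

Leaving --task-id and --num-tasks at None when absent keeps the unsharded path distinguishable from a one-task sharded run, which matters for the backward-compatibility requirement.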

Sharding design

The unit of work is a (setting, rep) pair where setting = (n, p, true_rank).

  1. Enumerate all valid settings: [(n, p, k) for n, p, k in product(ns, ps, ks) if k < min(n, p)] — currently 140.
  2. Cross with reps: 140 × 200 = 28,000 work units. Each unit runs all 4 missingness patterns and all metrics.
  3. Partition the 28,000 units across --num-tasks workers using units[task_id::num_tasks] (interleaved, not block, for load balancing — larger n×p settings are slower).
  4. Each task writes its shard of _Trial and _CoverageTrial records as JSON.
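
The enumeration and partitioning steps above can be sketched as follows. The helper names and the small demo grid are hypothetical (the real ns, ps, ks lists are not given in this issue); only the k < min(n, p) filter and the units[task_id::num_tasks] slice come from the design.

```python
from itertools import product


def enumerate_settings(ns, ps, ks):
    """All (n, p, true_rank) settings with rank below min(n, p)."""
    return [(n, p, k) for n, p, k in product(ns, ps, ks) if k < min(n, p)]


def enumerate_units(num_settings, reps):
    """Work units as (setting_index, rep) pairs in setting-major order."""
    return [(s, r) for s in range(num_settings) for r in range(reps)]


def shard(units, task_id, num_tasks):
    """Interleaved partition: every num_tasks-th unit, starting at task_id."""
    return units[task_id::num_tasks]


# Demo grid with made-up values, far smaller than the real one.
settings = enumerate_settings(ns=[20, 50], ps=[10, 30], ks=[2, 15])
units = enumerate_units(len(settings), reps=5)
shards = [shard(units, t, 4) for t in range(4)]

# Every unit lands in exactly one shard, and shard sizes differ by at most 1.
assert sorted(u for sh in shards for u in sh) == units
assert max(map(len, shards)) - min(map(len, shards)) <= 1
```

With the figures quoted above, 28,000 units over 560 tasks gives exactly 50 units per task. Because units are listed in setting-major order, the strided slice spreads each task's units across the whole range of settings instead of handing one task a contiguous block of the slowest ones.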

Merge step

--merge reads all stab_*.json and cov_*.json from --shard-dir, concatenates, deduplicates (by n,p,k,miss,metric,rep key), validates expected record count, writes final JSON to --output-dir, and generates all 9 figures.
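
A rough sketch of the merge logic, assuming each shard is a JSON list of flat objects carrying the six key fields. The function name merge_shards, the field names in KEY_FIELDS, and the demo values ("mcar", "rmse") are illustrative assumptions; the actual _Trial and _CoverageTrial layouts are not shown in this issue.

```python
import json
import tempfile
from pathlib import Path

# Field names assumed from the dedup key listed above.
KEY_FIELDS = ("n", "p", "k", "miss", "metric", "rep")


def merge_shards(shard_dir, prefix, expected_count=None):
    """Concatenate <prefix>_*.json shards, dropping records with duplicate keys."""
    merged = {}
    for path in sorted(Path(shard_dir).glob(f"{prefix}_*.json")):
        with path.open() as fh:
            for record in json.load(fh):
                key = tuple(record[field] for field in KEY_FIELDS)
                merged.setdefault(key, record)  # keep the first copy seen
    # Sort by key so the output order does not depend on the shard layout.
    records = [merged[key] for key in sorted(merged)]
    if expected_count is not None and len(records) != expected_count:
        raise ValueError(
            f"expected {expected_count} {prefix} records, found {len(records)}")
    return records


def _demo_record(rep, value):
    """Made-up record for the demonstration below."""
    return {"n": 50, "p": 20, "k": 2, "miss": "mcar", "metric": "rmse",
            "rep": rep, "value": value}


# Two small shards that share one record (rep 2), as after a re-run task.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "stab_0.json").write_text(
        json.dumps([_demo_record(0, 0.1), _demo_record(2, 0.3)]))
    Path(tmp, "stab_1.json").write_text(
        json.dumps([_demo_record(1, 0.2), _demo_record(2, 0.3)]))
    records = merge_shards(tmp, "stab", expected_count=3)
```

Sorting the merged records by key is one way to make the final JSON independent of how many tasks produced it, which simplifies comparison against a single-process run.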

Example SLURM usage

#!/bin/bash
#SBATCH --job-name=vbpca-stability
#SBATCH --array=0-559
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=02:00:00

source activate vbpca-env
python analysis/stability_analysis.py \
    --reps 200 \
    --task-id "$SLURM_ARRAY_TASK_ID" \
    --num-tasks 560 \
    --shard-dir results/shards/ \
    --fmt png

# After all tasks complete (separate job with dependency):
# python analysis/stability_analysis.py --merge --shard-dir results/shards/ --fmt png

Backward compatibility

  • Without --task-id/--num-tasks, behaves exactly as before (single-process sequential)
  • --reps alone (without sharding) just overrides the rep count for local runs
  • Existing --smoke, --plot-only, --coverage-only flags remain unchanged

Seed handling

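Each (setting_index, rep) pair should derive a deterministic sub-seed: seed = base_seed + setting_index * max_reps + rep. Because the seed depends only on the work unit and not on --task-id or --num-tasks, results are identical regardless of sharding configuration. For this to hold, max_reps must be at least the largest --reps value (otherwise seeds from neighbouring settings collide), and it should be a fixed constant rather than the current --reps value, so that a given (setting, rep) pair keeps its seed when the rep count changes.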
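
The seed formula can be sketched as below. MAX_REPS = 1000 and the helper names are hypothetical, and the standard-library random.Random stands in for whatever generator the analysis script actually uses.

```python
import random

# Hypothetical fixed bound; must be >= the largest --reps ever used.
MAX_REPS = 1000


def derive_seed(base_seed, setting_index, rep, max_reps=MAX_REPS):
    """Sub-seed for one work unit: base_seed + setting_index * max_reps + rep."""
    if not 0 <= rep < max_reps:
        raise ValueError(f"rep must be in [0, {max_reps}), got {rep}")
    return base_seed + setting_index * max_reps + rep


def unit_rng(base_seed, setting_index, rep):
    """Independent generator for one (setting_index, rep) work unit."""
    return random.Random(derive_seed(base_seed, setting_index, rep))


# The same unit yields the same stream no matter which task runs it.
first = unit_rng(base_seed=0, setting_index=3, rep=7).random()
second = unit_rng(base_seed=0, setting_index=3, rep=7).random()
assert first == second

# All 140 x 200 units get distinct seeds when max_reps >= reps.
seeds = {derive_seed(0, s, r) for s in range(140) for r in range(200)}
assert len(seeds) == 140 * 200
```

The range check on rep guards against the collision case: with rep >= max_reps, a unit would reuse a seed belonging to the next setting.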

Acceptance criteria

  • --reps 25 overrides REPS_FULL
  • --task-id 0 --num-tasks 4 runs ~1/4 of the grid
  • Shards are valid JSON, loadable as lists of _Trial / _CoverageTrial
  • --merge produces the same records as a single-process run (up to record order and floating-point round-off)
  • Interleaved partitioning gives roughly equal wall time per task
  • Works with existing --smoke flag for quick validation

Metadata

Labels: HPC (HPC parallelisation and SLURM support), enhancement (New feature or request), paper (JOSS paper, figures, and analysis)
