Summary
Add SLURM-friendly parallelisation to analysis/stability_analysis.py so the simulation grid can be distributed across array jobs.
New CLI flags
| Flag | Purpose |
|---|---|
| --reps INT | Override REPS_FULL (default: 10). Set to 200 for publication. |
| --task-id INT | This worker's 0-based index. Typically $SLURM_ARRAY_TASK_ID. |
| --num-tasks INT | Total number of workers. |
| --shard-dir PATH | Write per-task JSON shards here (e.g. results/shards/). Each task writes stab_{task_id}.json and cov_{task_id}.json. |
| --merge | Merge mode: concatenate all shards from --shard-dir into final stability_results.json + coverage_results.json in --output-dir, then generate all figures. |
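The flags in the table could be wired up with argparse roughly as follows. This is a minimal sketch, not the script's actual code: the function names `build_parser` and `check_shard_args`, the `None` defaults, and the validation rules (for example, requiring --shard-dir whenever sharding is on) are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the new flags from the table."""
    parser = argparse.ArgumentParser(description="Stability analysis (sharding flags only)")
    parser.add_argument("--reps", type=int, default=None, help="Override REPS_FULL")
    parser.add_argument("--task-id", type=int, default=None, help="0-based index of this worker")
    parser.add_argument("--num-tasks", type=int, default=None, help="Total number of workers")
    parser.add_argument("--shard-dir", type=str, default=None, help="Directory for per-task JSON shards")
    parser.add_argument("--merge", action="store_true", help="Merge shards instead of simulating")
    return parser


def check_shard_args(args: argparse.Namespace) -> None:
    """Reject inconsistent flag combinations before any work starts (assumed rules)."""
    if (args.task_id is None) != (args.num_tasks is None):
        raise ValueError("--task-id and --num-tasks must be given together")
    if args.task_id is not None:
        if args.num_tasks < 1 or not 0 <= args.task_id < args.num_tasks:
            raise ValueError("need 0 <= task-id < num-tasks")
        if args.shard_dir is None:
            raise ValueError("sharded runs need --shard-dir")
    if args.merge and args.shard_dir is None:
        raise ValueError("--merge needs --shard-dir")
```

Failing fast on a bad --task-id/--num-tasks pair matters more on a cluster than locally, since a misconfigured array job would otherwise burn its whole allocation before the error surfaces at merge time.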
Sharding design
The unit of work is a (setting, rep) pair where setting = (n, p, true_rank).
- Enumerate all valid settings: [(n, p, k) for n, p, k in product(ns, ps, ks) if k < min(n, p)] (currently 140).
- Cross with reps: 140 × 200 = 28,000 work units. Each unit runs all 4 missingness patterns and all metrics.
- Partition the 28,000 units across --num-tasks workers using units[task_id::num_tasks] (interleaved rather than block, for load balancing, since larger n×p settings are slower).
- Each task writes its shard of _Trial and _CoverageTrial records as JSON.
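The enumeration and interleaved partition described above can be sketched on a toy grid. The helper names `enumerate_units` and `shard` and the grid values are hypothetical, chosen only to keep the example small; the real ns, ps, ks live in the script.

```python
from itertools import product


def enumerate_units(ns, ps, ks, reps):
    """Return the valid settings and every (setting_index, rep) unit in a fixed order."""
    settings = [(n, p, k) for n, p, k in product(ns, ps, ks) if k < min(n, p)]
    units = [(s_idx, rep) for s_idx in range(len(settings)) for rep in range(reps)]
    return settings, units


def shard(units, task_id, num_tasks):
    """Interleaved partition: task t takes units t, t + num_tasks, t + 2 * num_tasks, ..."""
    return units[task_id::num_tasks]


# Toy grid (hypothetical values, much smaller than the real one).
settings, units = enumerate_units(ns=[20, 50], ps=[5, 10], ks=[1, 5], reps=3)
shards = [shard(units, t, 4) for t in range(4)]
```

Because the slices `units[t::num_tasks]` for t in 0..num_tasks-1 are disjoint and together cover the whole list, every unit runs exactly once and shard sizes differ by at most one.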
Merge step
--merge reads all stab_*.json and cov_*.json from --shard-dir, concatenates, deduplicates (by n,p,k,miss,metric,rep key), validates expected record count, writes final JSON to --output-dir, and generates all 9 figures.
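A minimal sketch of the concatenate, deduplicate, and validate steps is shown below. It assumes each shard file holds a JSON list of flat records carrying the key fields n, p, k, miss, metric and rep; the function name `merge_shards` and its `expected_count` parameter are hypothetical, and figure generation is left out.

```python
import json
from pathlib import Path


def merge_shards(shard_dir, prefix, expected_count=None):
    """Concatenate <prefix>_*.json shards, dropping duplicate records by key."""
    key_fields = ("n", "p", "k", "miss", "metric", "rep")
    merged = {}
    for path in sorted(Path(shard_dir).glob(f"{prefix}_*.json")):
        with path.open() as fh:
            for record in json.load(fh):
                key = tuple(record[f] for f in key_fields)
                merged.setdefault(key, record)  # keep the first copy seen
    # Sort by key so the output does not depend on shard file order.
    records = [merged[key] for key in sorted(merged)]
    if expected_count is not None and len(records) != expected_count:
        raise ValueError(f"expected {expected_count} records, found {len(records)}")
    return records
```

Sorting by key before writing is one way to make the merged file independent of how the work was sharded, which makes a comparison against a single-process run simpler.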
Example SLURM usage
#!/bin/bash
#SBATCH --job-name=vbpca-stability
#SBATCH --array=0-559
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=02:00:00

# Activate the analysis environment.
source activate vbpca-env

# Run this task's shard; --num-tasks must match the array size (0-559 = 560 tasks).
python analysis/stability_analysis.py \
    --reps 200 \
    --task-id "$SLURM_ARRAY_TASK_ID" \
    --num-tasks 560 \
    --shard-dir results/shards/ \
    --fmt png

# After all tasks complete (separate job with dependency):
# python analysis/stability_analysis.py --merge --shard-dir results/shards/ --fmt png
Backward compatibility
- Without --task-id/--num-tasks, behaves exactly as before (single-process sequential).
- --reps alone (without sharding) just overrides the rep count for local runs.
- Existing --smoke, --plot-only, --coverage-only flags remain unchanged.
Seed handling
Each (setting_index, rep) pair should derive a deterministic sub-seed: seed = base_seed + setting_index * max_reps + rep. This ensures identical results regardless of sharding configuration.
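The sketch below illustrates why this formula makes results independent of sharding. It rests on two assumptions not stated in the issue: max_reps is a fixed constant at least as large as any --reps value (so sub-seeds never collide and do not shift when --reps changes), and Python's stdlib `random` stands in for whatever generator the script actually uses. The names `sub_seed`, `simulate_unit`, and `run_all` are hypothetical.

```python
import random

MAX_REPS = 200  # assumed fixed constant, independent of --reps


def sub_seed(base_seed, setting_index, rep, max_reps=MAX_REPS):
    """Seed for one work unit; unique as long as 0 <= rep < max_reps."""
    if not 0 <= rep < max_reps:
        raise ValueError("rep must lie in [0, max_reps)")
    return base_seed + setting_index * max_reps + rep


def simulate_unit(base_seed, setting_index, rep):
    """Stand-in for one simulation: a single draw from a unit-specific generator."""
    rng = random.Random(sub_seed(base_seed, setting_index, rep))
    return rng.random()


def run_all(units, num_tasks, base_seed=0):
    """Run every shard in turn and return results keyed by unit."""
    results = {}
    for task_id in range(num_tasks):
        for s_idx, rep in units[task_id::num_tasks]:
            results[(s_idx, rep)] = simulate_unit(base_seed, s_idx, rep)
    return results


units = [(s, r) for s in range(3) for r in range(5)]
same = run_all(units, num_tasks=1) == run_all(units, num_tasks=4)
```

Each unit builds its own generator from its own seed, so the value it produces does not depend on which task runs it or on what ran before it in that task.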
Acceptance criteria
- --reps 25 overrides REPS_FULL
- --task-id 0 --num-tasks 4 runs ~1/4 of the grid
- … _Trial/_CoverageTrial
- --merge produces identical results to a single-process run (modulo float ordering)
- --smoke flag for quick validation