Draft
49 commits
e75339c
Upgrade to aiida-workgraph 0.7.2
GeigerJ2 Sep 30, 2025
6e94e52
Support streaming submission via SLURM
GeigerJ2 Oct 27, 2025
9371fe2
Merge create-symlink-tree CLI endpoint
GeigerJ2 Oct 29, 2025
abb6e95
Refactor to purely functional approach
GeigerJ2 Oct 29, 2025
a5d4905
remove leftover files
GeigerJ2 Oct 31, 2025
a4181b0
try to add rolling frontier
GeigerJ2 Oct 31, 2025
d5a0249
Revert "try to add rolling frontier"
GeigerJ2 Nov 4, 2025
62ddca0
files for demo
GeigerJ2 Nov 4, 2025
bf93ae5
fix top-level `create-symlink-tree` directory name and command
GeigerJ2 Nov 5, 2025
51c5859
cleanup of demo files
GeigerJ2 Nov 5, 2025
204e281
Merge branch 'workgraph-demo' into workgraph
GeigerJ2 Nov 5, 2025
834fa3f
upgrade wg dep and fix `account` passing
GeigerJ2 Nov 6, 2025
bac5ba7
wip
GeigerJ2 Nov 6, 2025
dc13c0c
Upgrade wg dep in pyproject.toml
GeigerJ2 Nov 11, 2025
20c4a69
add provenance display files
GeigerJ2 Nov 17, 2025
313878f
wip
GeigerJ2 Nov 19, 2025
d337cca
fix hatch fmt and types
GeigerJ2 Nov 19, 2025
cebcdf1
fix tests
GeigerJ2 Nov 19, 2025
e6cdad4
hatch, tests, and aquaplanet pass
GeigerJ2 Nov 20, 2025
39edc58
modify gitignore and pyproject.toml
GeigerJ2 Nov 26, 2025
03f66af
modify gitignore and pyproject.toml
GeigerJ2 Nov 26, 2025
7c622d9
implement dynamic task-level computation
GeigerJ2 Nov 26, 2025
ebdea5c
remove large-local files
GeigerJ2 Dec 8, 2025
3c967ed
remove demo svgs
GeigerJ2 Dec 8, 2025
2d16228
remove large config svg
GeigerJ2 Dec 8, 2025
2c2a296
fix code creation bug
GeigerJ2 Dec 11, 2025
e34e664
cleanup provenance graph exports
GeigerJ2 Dec 11, 2025
73a6929
Workgraph with dynamic levels refactor (#237)
GeigerJ2 Dec 17, 2025
650995e
Merge branch 'workgraph-update-levels' into workgraph
GeigerJ2 Dec 17, 2025
241268a
first hatch fmt; still many errors
GeigerJ2 Dec 17, 2025
11a6598
add DYAMOND example from Santis (doesn't run through yet)
GeigerJ2 Dec 17, 2025
b5b4f1b
wip
GeigerJ2 Jan 8, 2026
6236d2d
wip dyamond workflow and uenv support therefore
GeigerJ2 Jan 13, 2026
98a2e1e
state of last all-hands
GeigerJ2 Jan 20, 2026
88d37ba
`window_size` -> `front_depth`
GeigerJ2 Jan 20, 2026
3ee2ff1
fix overwriting of `front_depth`
GeigerJ2 Jan 20, 2026
93f4b45
python -> python3 in scripts
GeigerJ2 Jan 20, 2026
98c1d5b
update deps in pyproject.toml
GeigerJ2 Jan 20, 2026
897923c
fix scripts of `dynamic-complex` dummy workflow
GeigerJ2 Jan 20, 2026
c7a834c
rm DYAMOND_leclairm
GeigerJ2 Jan 20, 2026
6a5a788
rm ADR files
GeigerJ2 Jan 20, 2026
caf7445
rm CHANGES_90s_TEST.md
GeigerJ2 Jan 20, 2026
9030b71
move python scripts to scripts/
GeigerJ2 Jan 20, 2026
ddb038c
cleanup scripts
GeigerJ2 Jan 20, 2026
e66b527
fix script to plot gantt chart of process timings to account for slur…
GeigerJ2 Jan 21, 2026
484f85f
wip; dyamond and aiida-icon-link-dir-contents race condition
GeigerJ2 Jan 21, 2026
5fa41bb
compatibility with IconBaseRestartWorkChain
GeigerJ2 Jan 22, 2026
d668b5d
expand docs, and add case distinction in get_job_data for shell task
GeigerJ2 Jan 22, 2026
eabee4e
wip; killing processes and pausing
GeigerJ2 Jan 22, 2026
5 changes: 5 additions & 0 deletions .gitignore
@@ -165,3 +165,8 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
html/

tests/cases/APE_R02B04/config/AQUAPLANET.svg
examples/DYAMOND_aiida/CHANGES_90s_TEST.md
**/*.svg
131 changes: 131 additions & 0 deletions aiida-icon-link-dir-contents-race-condition.md
@@ -0,0 +1,131 @@
# aiida-icon `link_dir_contents` Race Condition with SLURM Pre-submission

## Issue Summary

The `link_dir_contents` feature in aiida-icon is incompatible with SLURM pre-submission when using job dependencies. The IconCalculation fails during `prepare_for_submission` because it tries to list remote directory contents that don't exist yet (upstream jobs are still creating them).

## Root Cause

### How SLURM Pre-submission Works

1. Upstream job (e.g., `prepare_input`) is submitted to SLURM and returns `job_id` + `remote_folder` PK immediately
2. Downstream job (e.g., `icon`) gets this information via `get_job_data`
3. Downstream job's `prepare_for_submission` runs **locally** to prepare submission files
4. Job is submitted to SLURM with `--dependency=afterok:<upstream_job_id>`
5. SLURM holds the job until upstream finishes
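
Step 4 can be sketched as follows. This is only an illustration of how the dependency flag is assembled, not the code path AiiDA actually uses; the helper name `build_sbatch_command` and the script name `icon.sh` are hypothetical. The `--dependency=afterok:<id>[:<id>...]` syntax is standard `sbatch` usage.

```python
import shlex


def build_sbatch_command(script_path: str, upstream_job_ids: list[int]) -> str:
    """Return an sbatch command line that waits for all upstream jobs to succeed.

    Hypothetical helper for illustration only.
    """
    parts = ["sbatch"]
    if upstream_job_ids:
        # SLURM accepts several job IDs separated by colons after "afterok".
        dependency = "afterok:" + ":".join(str(job_id) for job_id in upstream_job_ids)
        parts.append(f"--dependency={dependency}")
    parts.append(script_path)
    return shlex.join(parts)


# Job ID taken from the timeline in this document.
command = build_sbatch_command("icon.sh", [545483])
```

The dependency only controls when SLURM starts the job. Everything that happens before the `sbatch` call, including step 3, runs without waiting for the upstream job.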

### The Problem

aiida-icon's `prepare_for_submission` method (line 190 in `calculations.py`) calls `remotedata.listdir()` to enumerate files in `link_dir_contents`:

```python
if "link_dir_contents" in self.inputs:
    for remotedata in self.inputs.link_dir_contents.values():
        for subpath in remotedata.listdir():  # <-- FAILS HERE
            calcinfo.remote_symlink_list.append(...)
```

This happens **locally before SLURM submission**, but the remote directory doesn't exist yet because the upstream job is still running.

**SLURM dependencies prevent jobs from starting on compute nodes, but they don't prevent AiiDA from running local preparation steps.**
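
The ordering problem can be mimicked on a local filesystem. In this minimal sketch, `os.listdir` stands in for `remotedata.listdir()` and a temporary directory stands in for the remote work directory. Both are assumptions made for illustration, and the file name `grid.nc` is hypothetical.

```python
import os
import tempfile


def try_list(path):
    """Return the sorted directory listing, or the exception name on failure."""
    try:
        return sorted(os.listdir(path))
    except OSError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as workdir:
    icon_input = os.path.join(workdir, "icon_input")

    # "Downstream" preparation runs first: the directory does not exist yet.
    before = try_list(icon_input)

    # The "upstream" job finishes later and creates the directory.
    os.mkdir(icon_input)
    open(os.path.join(icon_input, "grid.nc"), "w").close()
    after = try_list(icon_input)
```

Here `before` records the failure and `after` holds the listing, which mirrors the order of events in the timeline below.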

## Timeline Example (from DYAMOND workflow)

| Time | Event | Status |
|------|-------|--------|
| 06:36:54 | `prepare_input` job submitted to SLURM (job 545483) | Running on SLURM |
| 06:37:40 | `remote_folder` node created (PK 43155) | - |
| 06:37:50 | `get_job_data` returns job_id + remote_folder PK | Correct behavior |
| 06:37:52 | `icon` job's `prepare_for_submission` starts **locally** | - |
| 06:37:52-06:38:18 | `icon` calls `listdir()` on `icon_input/` directory | **FAILS** - directory doesn't exist |
| 06:38:18 | IconCalculation excepted with OSError | ❌ Failure |
| 06:39:43 | `prepare_input` finishes, creates `icon_input/` | Too late |

## Error Message

```
OSError: The required remote path /capstor/scratch/cscs/jgeiger/aiida/6a/5c/345a-8d48-44ca-a10d-6e1c0fb44d27/icon_input/.
on santis-async-ssh does not exist, is not a directory or has been deleted.
```

## Proposed Solutions

### Option 1: Skip Non-existing Directories (Simple)

Modify `aiida-icon/src/aiida_icon/calculations.py`, lines 188-197:

```python
if "link_dir_contents" in self.inputs:
    for remotedata in self.inputs.link_dir_contents.values():
        try:
            subpaths = remotedata.listdir()
        except OSError:
            # Directory doesn't exist yet - skip for now
            # It will be created by upstream job before this job starts (SLURM dependency)
            self.logger.warning(
                f"Directory {remotedata.get_remote_path()} does not exist yet, "
                "skipping link_dir_contents enumeration. Will be created by upstream job."
            )
            continue

        for subpath in subpaths:
            calcinfo.remote_symlink_list.append(
                (
                    remotedata.computer.uuid,
                    str(pathlib.Path(remotedata.get_remote_path()) / subpath),
                    subpath,
                )
            )
```

**Pros:**
- Simple fix
- Maintains SLURM dependency semantics
- Job will still fail if directory truly doesn't exist when it starts running

**Cons:**
- Can't validate directory contents during submission
- Symlinks for skipped directories are not added to `remote_symlink_list`, so they would have to be created by a different mechanism (still an open question)

### Option 2: Defer Symlink Creation to Compute Node

Instead of enumerating files during `prepare_for_submission`, create a wrapper script that:
1. Waits for upstream jobs (handled by SLURM)
2. Creates symlinks from the remote directory contents on the compute node
3. Runs ICON

This would require more extensive changes to the workflow.
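
Step 2 could look roughly like the sketch below, which would run on the compute node after SLURM releases the job. The function name `link_dir_contents_at_runtime` and both paths are hypothetical. This is not existing aiida-icon code, only an illustration of deferring the enumeration to job start.

```python
import pathlib


def link_dir_contents_at_runtime(source_dir, target_dir):
    """Symlink every entry of source_dir into target_dir and return the names.

    Intended to run when the job starts, once the upstream job has finished
    and source_dir is guaranteed to exist.
    """
    source = pathlib.Path(source_dir)
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    linked = []
    for entry in sorted(source.iterdir()):
        link = target / entry.name
        # Leave existing files and links (including broken ones) untouched.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(entry.resolve())
        linked.append(entry.name)
    return linked
```

Because the listing happens at job start rather than during `prepare_for_submission`, a missing directory at that point indicates a real upstream failure rather than a timing issue.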

### Option 3: Wait for Directory in prepare_for_submission (Not Recommended)

Poll until the directory exists before calling `listdir()`.

**Problems:**
- Defeats the purpose of SLURM pre-submission
- Introduces arbitrary delays in workflow submission
- Could cause deadlocks if upstream job fails
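
For comparison, a polling loop might look like the sketch below. It uses a local path check as a stand-in for a remote one, and the function name, timeout, and interval are hypothetical. The timeout is what prevents an indefinite wait when the upstream job fails, but the call still blocks submission for up to that long.

```python
import os
import time


def wait_for_directory(path, timeout=5.0, interval=0.1):
    """Poll until path is a directory; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isdir(path):
            return True
        time.sleep(interval)
    # One final check so a directory created during the last sleep is not missed.
    return os.path.isdir(path)
```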

## Recommendation

**Option 1** is the best approach:
- Skip non-existing directories with a warning
- Trust SLURM dependencies to ensure directory exists when job starts
- Simple, minimal change to aiida-icon
- Preserves the benefits of SLURM pre-submission

## Context

This issue was discovered in the DYAMOND workflow where:
- `prepare_input` task creates an `icon_input/` directory with symlinks
- `icon` task uses `link_dir_contents` to reference this directory
- With SLURM pre-submission enabled, the race condition manifests

The workflow configuration:
```yaml
- icon:
    inputs:
      - icon_link_input:
          port: link_dir_contents
```

Where `icon_link_input` is a GeneratedData output from `prepare_input`.
29 changes: 25 additions & 4 deletions pyproject.toml
@@ -33,16 +33,29 @@ dependencies = [
"isoduration",
"pydantic",
"ruamel.yaml",
"aiida-core>=2.5",
"aiida-icon>=0.4.0",
"aiida-workgraph==0.5.2",
# "aiida-core==2.7.1",
# "aiida-core@git+https://github.com/aiidateam/aiida-core.git@f21bcd49d60b2e35b8b4df417f46ac15bd5bc861",
"aiida-core@git+https://github.com/aiidateam/aiida-core.git",
# "aiida-firecrest@git+https://github.com/aiidateam/aiida-firecrest.git@edab99ac6808c0ccfc63329d365654f54deacf5e",
"aiida-firecrest@git+https://github.com/aiidateam/aiida-firecrest.git@c4e287ffe40ae57db7032fe1c80d2c4def0dc7fa",
# "pyfirecrest",
# "aiida-icon @ git+https://github.com/aiida-icon/aiida-icon.git@add-arbitrary-inputs",
"aiida-icon @ git+https://github.com/aiida-icon/aiida-icon.git@dyamond-changes",
# "aiida-workgraph==1.0.0b4", # need latest for presubmission
# "aiida-workgraph @ git+https://github.com/GeigerJ2/aiida-workgraph.git@task-window-dynamic", # <-- now in `patches/` in Sirocco
"aiida-workgraph @git+https://github.com/aiidateam/aiida-workgraph.git",
"aiida-gui",
"aiida-gui-workgraph",
"ipdb",
"termcolor",
"pygraphviz",
"lxml",
"f90nml",
"aiida-shell>=0.8.1",
"rich~=14.0",
"typer~=0.16.0",
"aiida-gui-workgraph>=0.1.3",
"jinja2>=3.0",
]
license = {file = "LICENSE"}

@@ -99,6 +112,9 @@ include = [
[tool.hatch.version]
path = "src/sirocco/__init__.py"

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.envs.default]
installer = "uv"
python = "3.12"
@@ -145,12 +161,17 @@ extra-dependencies = [
"types-colorama",
"types-Pygments",
"types-termcolor",
"types-requests"
"types-requests",
"types-PyYAML"

]

[tool.hatch.envs.types.scripts]
check = "mypy --exclude 'tests/cases/*' --no-incremental {args:.}"

[tool.mypy]
disable_error_code = ["import-untyped"]

[[tool.mypy.overrides]]
module = ["isoduration", "isoduration.*"]
follow_untyped_imports = true