Describe the bug
When running a workflow in which step A has a stop_if condition on the output of step B, and step B runs for only a very short time, step A seems never to receive a proper stop signal. It eventually times out based on its closure_wait_timeout value.
To reproduce
This seems to reproduce consistently with a workflow in which a pcp step starts before a stress-ng step, the stress-ng timeout is set to a low value of 2 seconds, and the workflow runs on a particularly slow system, similar to a Raspberry Pi. In this case, the engine kills the pcp step after it times out, and the workflow therefore fails. Increasing the stress-ng timeout in this configuration to 10 seconds (which generates significantly more pcp data) results in a successful workflow.
workflow.yaml:
steps:
  # Start the PCP data collection
  pcp:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-pcp:0.10.1
    step: run-pcp
    deploy:
      deployer_name: podman
      deployment:
        host:
          NetworkMode: host
          Binds:
            - /etc/system-release:/etc/system-release
    input: !expr $.input.constant
    closure_wait_timeout: 60000
    # Stop the PCP data collection after the post_wait step completes
    stop_if: !expr $.steps.post_wait.outputs
  # Wait the specified milliseconds before starting the stress-ng workload
  pre_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the pcp step has started
    wait_for: !expr $.steps.pcp.starting.started
  # Start the stress-ng workload
  stressng:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-stressng:0.8.1
    step: workload
    input: !expr $.input.item
    # Don't start this step until after the pre_wait has completed
    wait_for: !expr $.steps.pre_wait.outputs
  # Wait the specified milliseconds after the stress-ng workload succeeds
  post_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the stressng step completes
    wait_for: !expr $.steps.stressng.outputs
input.yaml:
constant:
  flatten: true
  pmlogger_interval: 1.0
  pmlogger_metrics: |
    kernel.cpu.util.user, kernel.cpu.util.nice, kernel.cpu.util.sys,
    kernel.cpu.util.wait, kernel.cpu.util.steal, kernel.cpu.util.idle
item:
  timeout: 2
  stressors:
    - stressor: cpu
      workers: 16
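For comparison, the run that succeeds differs only in the stress-ng timeout. A sketch of the `item` block for that case, assuming every other value in input.yaml and workflow.yaml stays exactly as shown:

```yaml
item:
  # Only change from the failing input: 10 seconds instead of 2
  timeout: 10
  stressors:
    - stressor: cpu
      workers: 16
```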
Additional context
Complete workflow: https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf