Likely race condition with stop_if

## Describe the bug

In a workflow where step A has a `stop_if` condition on the output of step B, step A seems never to receive a proper stop signal if step B runs for only a very short time. Step A then eventually times out based on its `closure_wait_timeout` value.

## To reproduce

This seems to reproduce consistently with a workflow in which a pcp step starts before a stress-ng step, the stress-ng step is set to a low timeout of 2 seconds, and the workflow runs on a particularly slow system, similar to a Raspberry Pi. In this case, the engine kills the pcp step after it times out, and the workflow therefore fails. Increasing the stress-ng timeout in this configuration to 10 seconds (thus generating significantly more pcp data) results in a successful workflow.

workflow.yaml:
```
steps:
  # Start the PCP data collection
  pcp:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-pcp:0.10.1
    step: run-pcp
    deploy:
      deployer_name: podman
      deployment:
        host:
          NetworkMode: host
          Binds:
            - /etc/system-release:/etc/system-release
    input: !expr $.input.constant
    closure_wait_timeout: 60000
    # Stop the PCP data collection after the post_wait step completes
    stop_if: !expr $.steps.post_wait.outputs

  # Wait the specified milliseconds before starting the stress-ng workload
  pre_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the pcp step has started
    wait_for: !expr $.steps.pcp.starting.started

  # Start the stress-ng workload
  stressng:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-stressng:0.8.1
    step: workload
    input: !expr $.input.item
    # Don't start this step until after the pre_wait has completed
    wait_for: !expr $.steps.pre_wait.outputs

  # Wait the specified milliseconds after the stress-ng workload succeeds
  post_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the stressng step completes
    wait_for: !expr $.steps.stressng.outputs
```

input.yaml:
```
constant:
  flatten: true
  pmlogger_interval: 1.0
  pmlogger_metrics: |
    kernel.cpu.util.user, kernel.cpu.util.nice, kernel.cpu.util.sys,
    kernel.cpu.util.wait, kernel.cpu.util.steal, kernel.cpu.util.idle
item:
  timeout: 2
  stressors:
    - stressor: cpu
      workers: 16
```
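
For comparison, the passing variant described above should differ from the failing input only in the stress-ng `timeout` value. A sketch of that input.yaml, assuming every other field stays exactly as in the failing case:

```
constant:
  flatten: true
  pmlogger_interval: 1.0
  pmlogger_metrics: |
    kernel.cpu.util.user, kernel.cpu.util.nice, kernel.cpu.util.sys,
    kernel.cpu.util.wait, kernel.cpu.util.steal, kernel.cpu.util.idle
item:
  # Only change from the failing input: 10 seconds instead of 2
  timeout: 10
  stressors:
    - stressor: cpu
      workers: 16
```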

## Additional context

Complete workflow: https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf
