Likely race condition with stop_if #240

@dustinblack

Describe the bug

When running a workflow in which step A has a stop_if condition on the output of step B, and step B runs for only a very short time, step A seems never to receive a proper stop signal. It eventually times out based on the closure_wait_timeout value.

To reproduce

This appears to reproduce consistently with a workflow in which a pcp step starts before a stress-ng step, the stress-ng timeout is set to a low value of 2 seconds, and the workflow runs on a particularly slow system, similar to a Raspberry Pi. In this case, the engine kills the pcp step after it times out, and the workflow therefore fails. Increasing the stress-ng timeout in this configuration to 10 seconds (thus generating significantly more pcp data) results in a successful workflow.

workflow.yaml:

steps:
  # Start the PCP data collection
  pcp:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-pcp:0.10.1
    step: run-pcp
    deploy:
      deployer_name: podman
      deployment:
        host:
          NetworkMode: host
          Binds:
            - /etc/system-release:/etc/system-release
    input: !expr $.input.constant
    closure_wait_timeout: 60000
    # Stop the PCP data collection after the post_wait step completes
    stop_if: !expr $.steps.post_wait.outputs

  # Wait the specified milliseconds before starting the stress-ng workload
  pre_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the pcp step has started
    wait_for: !expr $.steps.pcp.starting.started

  # Start the stress-ng workload
  stressng:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-stressng:0.8.1
    step: workload
    input: !expr $.input.item
    # Don't start this step until after the pre_wait has completed
    wait_for: !expr $.steps.pre_wait.outputs

  # Wait the specified milliseconds after the stress-ng workload succeeds
  post_wait:
    plugin:
      deployment_type: image
      src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.1
    step: wait
    input:
      wait_time_ms: 10000
    # Don't start this step until after the stressng step completes
    wait_for: !expr $.steps.stressng.outputs

input.yaml:

constant:
  flatten: true
  pmlogger_interval: 1.0
  pmlogger_metrics: |
    kernel.cpu.util.user, kernel.cpu.util.nice, kernel.cpu.util.sys,
    kernel.cpu.util.wait, kernel.cpu.util.steal, kernel.cpu.util.idle
item:
  timeout: 2
  stressors:
    - stressor: cpu
      workers: 16

Additional context

Complete workflow: https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf

Labels: bug, v1.0
