
Add a global job timeout #1305

@Alex17Li

Describe the bug

I don't know the cause, but when using this action I've started to see failures where the credential job crashes and then neither retries nor exits; it just hangs (5+ hours).
This had been working consistently for a long time (about a year?), but now it sometimes fails like this, which ties up all of our runners.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

It should never hang. If the network connection fails, it is fine for the step to crash.

Current Behavior

Run aws-actions/configure-aws-credentials@v4
  with:
    role-session-name: GithubActionsRoleSession
    role-to-assume: arn:aws:iam::425642425116:role/github
    aws-region: us-west-2
    role-duration-seconds: 21600
    output-credentials: true
    audience: sts.amazonaws.com
  env:
    HOME: /root
    ADK_GITHUB_TOKEN: ***
    REMOTE_ROOT: /mnt/ssd/bazeltest_github
    VPU_ADDR: bazelvpu
Error: getIDToken call failed: Error message: Failed to get ID Token. 
 
        Error Code : undefined
 
        Error Message: read ECONNRESET
context canceled
Error: The operation was canceled.

The error above appeared only after I cancelled the job.

Reproduction Steps

It probably will not be easy to reproduce. We are running in a company Docker container, though I don't see why that would be an issue. With everything irrelevant removed, the job looks like this:

jobs:
  run_embedded_vpu_bundle:
    runs-on: adk-vpu2-jp5
    container:
      image: artifactory.bluerivertech.com/dev-adk-docker/autonomy/adk/ubuntu_2204_build:2025-02-26
      volumes:
        - ghrunner_ci_cache_adk-vpu2-jp5:/ci_cache
      options: --shm-size 32G
    steps:
      - name: Assume AWS-Github role using OIDC (prod)
        id: aws-role
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-session-name: GithubActionsRoleSession
          role-to-assume: arn:aws:iam::425642425116:role/github
          aws-region: us-west-2
          role-duration-seconds: 21600
          output-credentials: true
          retry-max-attempts: 50

Possible Solution

The "Error Message: read ECONNRESET" line suggests that the network connection is breaking at an inopportune time during the step. Perhaps there is a point where the action waits for a packet and does not fail if the packet doesn't arrive within a few seconds.
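Until the action itself enforces a timeout, a possible workaround is GitHub Actions' built-in `timeout-minutes` key, which can be set on a job or on an individual step. The sketch below applies it to the job from the reproduction steps. The values 60 and 5 are illustrative assumptions, not recommendations from the action's maintainers, and this only bounds how long a hang can last; it does not fix the underlying cause.

```yaml
jobs:
  run_embedded_vpu_bundle:
    runs-on: adk-vpu2-jp5
    # Job-level cap: GitHub cancels the whole job after this many minutes.
    # Without it, the default limit is 360 minutes.
    # (60 is an assumed value for illustration.)
    timeout-minutes: 60
    steps:
      - name: Assume AWS-Github role using OIDC (prod)
        id: aws-role
        uses: aws-actions/configure-aws-credentials@v4
        # Step-level cap: the step fails if it runs longer than this.
        # (5 is an assumed value for illustration.)
        timeout-minutes: 5
        with:
          role-session-name: GithubActionsRoleSession
          role-to-assume: arn:aws:iam::425642425116:role/github
          aws-region: us-west-2
```

With a step-level cap in place, a hung credentials step should release the runner after a few minutes instead of holding it until the default job limit is reached.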

Additional Information/Context

No response

Labels: feature-request (A feature should be added or improved.), p2
