5 changes: 5 additions & 0 deletions features/README.md
@@ -0,0 +1,5 @@
# features

This directory contains feature descriptions written in
[Gherkin](https://cucumber.io/docs/gherkin/). Please note that we should have
one feature per file.
22 changes: 22 additions & 0 deletions features/code_coverage.feature
@@ -0,0 +1,22 @@
Feature: Code coverage

Our workflow should be able to report code coverage to external
services. (For testing, we'll just be sure we can integrate with
CodeCov.)

Scenario: Report default coverage
# Note that this is probably NOT what most users will want. Imagine
# that our runner, because it is on GPU, runs more code paths than
# the basic runs, and runs less frequently. This means that PRs (not
# using our runner) will see a spurious decrease in coverage.
Given a workflow that uses CodeCov for coverage
When I run the workflow
Then coverage should successfully be updated on CodeCov

Scenario: Report coverage with CodeCov flags
# Using CodeCov flags may help solve the problem mentioned in the
# default coverage scenario, but we should play with it a bit to
# determine a recommended practice. (Out of scope for MVP.)
Given a workflow that uses CodeCov flags for coverage
When I run the workflow
Then the correct flag should be updated on CodeCov
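For reference, a CodeCov flags upload step might look like the following sketch (the `gpu` flag name and coverage file path are illustrative assumptions, not settled choices):

```yaml
# Hypothetical sketch: upload coverage under a dedicated "gpu" flag so
# that self-hosted GPU runs are tracked separately from ordinary CI
# coverage, rather than moving the default coverage number.
- name: Upload coverage to CodeCov
  uses: codecov/codecov-action@v4
  with:
    files: coverage.xml
    flags: gpu
```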
12 changes: 12 additions & 0 deletions features/external_users.feature
@@ -0,0 +1,12 @@
Feature: Allow external contributors to use resources

...

# NOTE: this is essentially the same as a scenario from the run_pr
# feature; might not ever fill it in
#Scenario: Authorized user permits a PR from unauthorized user to run

Scenario: Adding a new authorized user
Given an unauthorized user who should become authorized
When I give the user committer access to the repository
Then the user should have the ability to launch self-hosted workflows
12 changes: 12 additions & 0 deletions features/hard_kill.feature
@@ -0,0 +1,12 @@
Feature: Hard kill a runaway workflow job

A user should be able to kill a running job, and that should also
terminate the associated instance.

Scenario: Manual kill
Given a long-running workflow
And I am logged in as an authorized user
And the workflow is running
When I kill the workflow using the GitHub UI
Then the workflow should stop
And the instance should terminate
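One way the "kill also terminates the instance" behavior could be wired is a cleanup job that runs even on cancellation; the job names and instance-id plumbing below are assumptions, not a settled design:

```yaml
# Hypothetical sketch: a cleanup job that runs even when the main job
# is cancelled from the GitHub UI, and terminates the EC2 instance the
# workflow started. Job wiring and the instance-id output are assumed.
cleanup:
  needs: [start-runner, benchmark]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - name: Terminate the self-hosted instance
      run: |
        aws ec2 terminate-instances \
          --instance-ids "${{ needs.start-runner.outputs.instance-id }}"
```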
20 changes: 20 additions & 0 deletions features/physical_cost.feature
@@ -0,0 +1,20 @@
Feature: Track physical cost of running

The amount of time that has been used (or ideally, the actual cost
incurred) should be easily accessible.
[Possible mechanisms: (1) Refer to AWS billing info; (2) use an API to
extract stuff from AWS billing / CloudTrail; (3) have some custom
cloud-independent approach -- probably (1) or (2)]

# TODO: having trouble with this one because I feel like it depends
# on the specific mechanism

# WIP: I think this is the generic form of this information;
# the mechanism for tracking the cost is not specified here.
Scenario: When I run a test, I can see how much it costs
Scenario: When I run a test, I can see how much it costs
Given I have a test that runs for X amount of time
And I have a cost of Y per unit time
And I have a mechanism for tracking the cost
When I run the test
Then I receive a calculated cost of running the test
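As a point of reference for mechanism (2), AWS Cost Explorer can be queried from a workflow step; the date range and granularity below are illustrative placeholders:

```yaml
# Hypothetical sketch of mechanism (2): query AWS Cost Explorer for
# cost incurred over a date range. Dates shown are placeholders; a real
# step would compute them from the run's start and end times.
- name: Report cost
  run: |
    aws ce get-cost-and-usage \
      --time-period Start=2024-01-01,End=2024-01-02 \
      --granularity DAILY \
      --metrics UnblendedCost
```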
40 changes: 40 additions & 0 deletions features/prevent_abuse.feature
@@ -0,0 +1,40 @@
Feature: Safeguards to prevent abuse of self-hosted runners

Compute resources should be protected from use outside of intended runs,
whether due to accidental triggering or intentional abuse by malicious
actors. This includes preventing forks from accessing our resources and
preventing runs on untrusted PRs.

Scenario: Forks should not be able to use our runners
# This should be guaranteed by the fact that secrets don't propagate
# to forks.
Given a fork of a repository with a self-hosted workflow
When the fork owner tries to run (within fork) using workflow dispatch
Then the workflow should give an error due to authorization
And the workflow should fail to start instances on AWS

Scenario: Pull requests from first-time contributors should not start runners
# With default repo settings, first-time contributors should require
# approval to run CI at all.
Given a fork of a repository with a self-hosted workflow
And the fork owner has not previously contributed to the repository
And the fork owner has changed our workflow to run on PRs
When the fork owner creates a pull request to our repository
Then the workflow should give an error due to authorization
And the workflow should fail to start instances on AWS

Scenario: Pull requests from previous contributors should not start runners
# With default repo settings, an external contributor who has
# previously contributed no longer requires approval for CI to run.
# However, this should be guaranteed because PRs from forks don't
# have access to secrets.
Given a fork of a repository with a self-hosted workflow
And the fork owner has previously contributed to the repository
And the fork owner has changed our workflow to run on PRs
When the fork owner creates a pull request to our repository
Then the workflow should give an error due to authorization
And the workflow should fail to start instances on AWS

# Non-tested scenario: AWS tokens (as secrets) should not leak in PRs
# from forks because forks don't see secrets. (Leaking AWS tokens is
# a different attack vector from the ones described above.)
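In addition to the secrets-don't-propagate guarantee, a job-level guard condition could be added as defense in depth; the event and repository check below are assumptions (the repository name is a placeholder):

```yaml
# Hypothetical sketch: restrict the runner-starting job to explicitly
# dispatched runs from this repository (not a fork). The condition is
# an assumption; the primary safeguard remains that secrets (e.g. AWS
# credentials) are never available to forks at all.
start-runner:
  if: github.event_name == 'workflow_dispatch' && github.repository == 'our-org/our-repo'
  runs-on: ubuntu-latest
```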
18 changes: 18 additions & 0 deletions features/quickstart.feature
@@ -0,0 +1,18 @@
Feature: Quickstart guide

There should be a quick and easy way to set up workflows, and a simple
demo workflow.

# TODO: There should be a scenario here about documentation, maybe? or
# is that another feature? Up-to-date getting started documentation.

Scenario: Easy set-up for first-time users
Given I have AWS credentials
And I have not previously set up AWS infra for this tool
When I use the quickstart command
Then I should have a working workflow

Scenario: Up-to-date documentation
Given I have the latest version of the tool
When I look at the documentation
Then I should see up-to-date and tested information
10 changes: 10 additions & 0 deletions features/reproducible_env.feature
@@ -0,0 +1,10 @@
Feature: Reproducible workflow environment

Within a version of our tool and a specific cloud machine image, the
starting environment for all workflows should be the same.

Scenario: Reproducible workflow environment
Given a fixed version of our tool and of a cloud machine image
When I start the workflow
Then the versions of important libraries should be as expected
And the versions of important software tools should be as expected
17 changes: 17 additions & 0 deletions features/retrieve_results.feature
@@ -0,0 +1,17 @@
Feature: Retrieve results of a benchmarking run

A user may generate data during a run that they want to save somewhere
long-term. This will require that the user explicitly store that data
somewhere; in this feature, we test that we can store it.

Scenario: Store results to an S3 bucket
Given a workflow that intends to upload a file to an S3 bucket
When I run the workflow
Then the file should be uploaded to the S3 bucket

Scenario: Store results to Dropbox
# we do a separate test for Dropbox just to ensure that there's
# nothing special happening because S3 and EC2 are both AWS
Given a workflow that intends to upload a file to Dropbox
When I run the workflow
Then the file should be uploaded to Dropbox
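The explicit-upload step for the S3 scenario could be as simple as the sketch below; the bucket name and file path are placeholders:

```yaml
# Hypothetical sketch: an explicit upload step at the end of a run.
# Bucket name and results path are placeholder assumptions.
- name: Upload results to S3
  run: aws s3 cp results/benchmarks.json s3://example-results-bucket/
```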
26 changes: 26 additions & 0 deletions features/run_manual.feature
@@ -0,0 +1,26 @@
Feature: Manual runs of the workflow

A user should be able to manually launch a workflow from the web UI.
[Mechanism: workflow_dispatch and run workflow]

Scenario: Authorized users should see the run workflow button
Given I have a workflow generated with our tool
And I am logged in as an authorized user
When I load the workflow's page
Then I should see the Run Workflow button

Scenario: Unauthorized users should not see the run workflow button
Given I have a workflow generated with our tool
And I am logged in as an unauthorized user
When I load the workflow's page
Then I should not see the Run Workflow button

Scenario: Pressing the Run Workflow button should run the workflow
Given I have a workflow generated with our tool
And I am logged in as an authorized user
When I load the workflow's page
And I press the Run Workflow button
Then the workflow should complete a manual run
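For reference, GitHub shows the Run Workflow button only for workflows that declare a `workflow_dispatch` trigger:

```yaml
# Workflows become manually runnable from the web UI by declaring the
# workflow_dispatch trigger.
on:
  workflow_dispatch:
```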
11 changes: 11 additions & 0 deletions features/run_matrix.feature
@@ -0,0 +1,11 @@
Feature: Run a matrix build

A user should be able to run a full build matrix (ideally in parallel).

Scenario: Run a matrix
Given a workflow that involves a complicated matrix
When I run the workflow
Then all builds in the matrix should complete
# maybe this too:
# And an instance should be launched for each job
# And all jobs should run on different instances
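A matrix of the kind this scenario describes might be declared as in the sketch below; the matrix dimensions and runner labels are illustrative assumptions:

```yaml
# Hypothetical sketch: a two-dimensional matrix, with each job routed to
# a self-hosted runner. Dimension names and label values are placeholders.
jobs:
  benchmark:
    strategy:
      matrix:
        python-version: ["3.10", "3.11"]
        gpu: [single, multi]
    runs-on: [self-hosted, "${{ matrix.gpu }}"]
    steps:
      - run: ./run_benchmarks.sh --python "${{ matrix.python-version }}"
```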
Contributor comment:

This is kind of inherent to the architecture we seek to design. Not sure if we need to add this.

Member Author reply:

The two "And" statements here are the test that we ran all jobs in parallel. If we want running the matrix in parallel to be a requirement, then we should probably test the requirement. The initial "Then" statement would also be satisfied if the matrix was run serially.

It might be better Gherkin to combine the two "And" statements into a single "And all matrix jobs should run in parallel"? There's a trade-off between making the statement represent less code (better for the developer) and making the statement's purpose clearer to readers (better for the client).

14 changes: 14 additions & 0 deletions features/run_pr.feature
@@ -0,0 +1,14 @@
Feature: Run on pull requests

A user should be able to run a workflow on self-hosted runners prior to
merging a pull request. NOTE: This will *not* use the normal
pull_request trigger for workflows. Instead, this will be a
workflow_dispatch caused by some external decision. This is because we
don't expect to want to run expensive CI on every commit, but rather
when an admin chooses to.

Scenario: Choose to run a workflow on a PR
Given I have a workflow generated with our tool
And a pull request is open against that repository
When I [trigger the workflow to run on the PR] (how? TBD)
Then the workflow runs on our runner using code in the PR
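One possible shape for the TBD trigger is a `workflow_dispatch` input naming the PR, checked out via its merge ref; this is an assumption, not a decided mechanism:

```yaml
# Hypothetical sketch (trigger mechanism is TBD in the feature above):
# an admin dispatches the workflow with a PR number, and the job checks
# out that PR's merge ref.
on:
  workflow_dispatch:
    inputs:
      pr-number:
        description: "Pull request number to test"
        required: true

jobs:
  test-pr:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ inputs.pr-number }}/merge
```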
9 changes: 9 additions & 0 deletions features/run_scheduled.feature
@@ -0,0 +1,9 @@
Feature: Scheduled runs of the workflow

A user should be able to schedule recurring runs of a workflow.

Scenario: A scheduled run should run
Given I have a workflow generated with our tool
When I wait until after the scheduled run time
Then the workflow should have completed a scheduled run
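Scheduled runs use GitHub's cron-style trigger; the daily 03:00 UTC schedule shown is only an example:

```yaml
# Scheduled runs are declared with a cron trigger (times are UTC).
# The schedule shown here is an illustrative example.
on:
  schedule:
    - cron: "0 3 * * *"
```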

63 changes: 63 additions & 0 deletions features/select_platform.feature
@@ -0,0 +1,63 @@
Feature: Select platform to run on

A user should be able to select the hardware that suits the needs of
their run.

Scenario: Running with large memory
Given a workflow that requires and requests a large-memory host
When I run the workflow
Then it should run on the appropriate large-memory host

Scenario: Running with a single CUDA GPU
Given a workflow that requires and requests a single CUDA GPU
When I run the workflow
Then it should run on hardware with a GPU
And my software should be able to interact with the CUDA drivers

Scenario: Running with multiple GPUs
Given a workflow that requires and requests multiple GPUs
When I run the workflow
Then it should run on hardware with multiple GPUs
And my software should be able to interact with all requested GPUs

Scenario: Running with smaller hardware
Given a workflow that requests lower-cost hardware
When I run the workflow
Then it should run on the appropriate hardware

Scenario: Running with preemptible instances
Given a workflow that can run on preemptible hosts
When I run the workflow
Then it should run on a preemptible host
# NOTE: anything about continuing from preemption is the
# responsibility of the workflow writer

Scenario: A run on a preemptible instance is preempted
Given a workflow that can run on preemptible hosts
And the workflow is running
When the workflow is preempted
Then the workflow should be retried (up to a specified retry limit)

Scenario: True failures should not be retried on preemptible instances
Given a workflow that can run on preemptible hosts
And the workflow is running
When the workflow fails
Then the workflow should not be retried

# NOTE: This is not an MVP requirement
#Scenario: Running with a ROCm stack
# Given a workflow that requires a ROCm stack
# When I run the workflow
# Then it should run on hardware with the appropriate ROCm stack

Scenario: Running with an inference stack with various hardware
Given a workflow that requires an inference stack
When I run the workflow
Then it should run on hardware with the appropriate inference stack
And my software should be able to interact with the inference stack

Scenario: Running a small ML training run
Given a workflow that requires an inference stack
And the workflow is a small ML training run
When I run the workflow
Then it should run on hardware with the appropriate inference stack
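Hardware selection of the kind these scenarios describe is commonly expressed through self-hosted runner labels; the label names below are placeholder assumptions:

```yaml
# Hypothetical sketch: route a job to specific hardware via self-hosted
# runner labels. Label names ("large-memory") are placeholders for
# whatever labeling scheme the tool settles on.
jobs:
  large-memory-job:
    runs-on: [self-hosted, large-memory]
    steps:
      - run: ./run_benchmarks.sh --suite big-memory
```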
12 changes: 12 additions & 0 deletions features/set_gpu_mode.feature
@@ -0,0 +1,12 @@
Feature: Workflow should be able to set the GPU compute mode

A given workflow should be able to use different GPU compute modes
(e.g., EXCLUSIVE_PROCESS).
[Mechanism: This might be either via machine selection or by setting
mode in the workflow]

Scenario: Run in EXCLUSIVE_PROCESS
Given a workflow that should run with EXCLUSIVE_PROCESS set
When I run the workflow
Then my main process should take the GPU
And any other process should error if it tries to use the GPU
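The "set mode in the workflow" mechanism could be a step like the sketch below (GPU index 0 and the use of sudo are assumptions; changing compute mode requires root):

```yaml
# Hypothetical sketch of the in-workflow mechanism: switch GPU 0 to
# EXCLUSIVE_PROCESS before the run so only one process can hold the GPU.
- name: Set GPU compute mode
  run: sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
```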