feat(xqwatcher): migrate from EC2 ASG to Kubernetes Deployment#4287

Open
blarghmatey wants to merge 40 commits into main from feat/xqwatcher-kubernetes-migration

blarghmatey commented Mar 11, 2026

Summary

Migrates xqueue-watcher infrastructure from EC2 Auto Scaling Groups with AppArmor/codejail sandboxing to a Kubernetes Deployment using container-based grading. This is the infrastructure companion to mitodl/xqueue-watcher#14 which implements the ContainerGrader backend.

Changes

src/ol_infrastructure/lib/ol_types.py

  • Added xqwatcher to both Services and Application enums for consistent K8s label generation.

src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl

  • Added read access to secret-DEPLOYMENT/edx-xqueue so the grader handler config (stored in Vault) can embed the xqueue server URL and authentication password.

src/ol_infrastructure/applications/xqwatcher/__main__.py

Complete rewrite replacing EC2 resources with Kubernetes resources:

| Old (EC2) | New (K8s) |
| --- | --- |
| IAM instance profile + Vault AWS auth | OLEKSAuthBinding (IRSA + Vault K8s auth) |
| EC2 Launch Template + ASG | Kubernetes Deployment |
| AMI with codejail/AppArmor | mitodl/xqueue-watcher (DockerHub) container image |
| Consul config distribution | ConfigMap + OLVaultK8SSecret CRD |

New Kubernetes resources created:

  • OLEKSAuthBinding — IRSA role + Vault Kubernetes auth backend role
  • OLVaultK8SSecret — syncs grader handler config from Vault KV to a K8s Secret via Vault Secrets Operator
  • ConfigMap — base poll settings (xqwatcher.json) and stdout-only structured logging (logging.json)
  • Role + RoleBinding — grants xqwatcher pods permission to create/delete Jobs and read pod logs (required by ContainerGrader's Kubernetes backend)
  • Deployment — runs xqueue-watcher with non-root security context, resource limits, liveness probe, and topology spread for HA
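
As a rough illustration of the Role + RoleBinding item above, the rules could look like the following Kubernetes manifest expressed as a plain Python dict. The Role name, the example namespace, and the `get`/`list` verbs are assumptions for illustration; the description only states that pods may create/delete Jobs and read pod logs.

```python
def build_grader_role(namespace: str, name: str = "xqwatcher-grader-jobs") -> dict:
    """Return a Role manifest allowing Job create/delete and pod log reads.

    The default name and the "get"/"list" verbs are hypothetical additions.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                # Jobs live in the "batch" API group.
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "delete", "get", "list"],
            },
            {
                # Pods and their logs live in the core ("") API group.
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list"],
            },
        ],
    }


role = build_grader_role("mitx-openedx")
```

A Role is namespaced, so it only covers Jobs created in the namespace it is defined in.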

Stack configs (9 files)

Removed EC2-specific keys (consul:address, auto_scale, instance_type) and added K8s-specific keys:

  • xqwatcher:cluster — EKS cluster name
  • xqwatcher:namespace — Kubernetes namespace
  • xqwatcher:min_replicas / max_replicas
  • xqwatcher:docker_tag

Deployment Prerequisites

Before applying this stack:

  1. Build and push mitodl/xqueue-watcher image to DockerHub (from mitodl/xqueue-watcher#14)
  2. Build and push course grader images (e.g. from MITx/graders-mit-600x#10)
  3. Update Vault secret secret-xqwatcher/{env}-grader-config with confd_json containing a ContainerGrader handler config
  4. Ensure Vault Secrets Operator is installed in the target cluster

Related PRs

blarghmatey and others added 5 commits March 11, 2026 12:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add read access to secret-DEPLOYMENT/edx-xqueue so the xqwatcher
service can retrieve the xqueue server URL and authentication
password needed by the ContainerGrader handler config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Completely rewrite the xqwatcher Pulumi stack to deploy on Kubernetes
instead of EC2 Auto Scaling Groups with AppArmor/codejail.

Changes:
- Replace IAM instance profile + Vault AWS auth with OLEKSAuthBinding
  (IRSA + Vault K8s auth backend)
- Add OLVaultK8SSecret to sync grader handler config from Vault KV
  to a Kubernetes Secret via the Vault Secrets Operator CRD
- Add a ConfigMap for base poll settings and structured JSON logging
  to stdout (no log rotation in containers)
- Add RBAC Role + RoleBinding granting the xqwatcher service account
  permission to create/delete Kubernetes Jobs and read pod logs,
  required by ContainerGrader's kubernetes backend
- Create a Kubernetes Deployment with:
  - ghcr.io/mitodl/xqueue-watcher image
  - Security context (non-root, drop ALL capabilities)
  - Resource requests + memory limit
  - Liveness probe via python -c "import xqueue_watcher"
  - Topology spread for HA across nodes
  - Vault grader config + base config mounted into /xqwatcher/conf.d/
- Preserve vault.kv.SecretV2 write so grader config remains managed
  in Pulumi
- Export k8s_deployment_name and k8s_namespace

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove EC2-specific settings (consul:address, auto_scale, instance_type)
and add Kubernetes-specific settings for all stacks:

- xqwatcher:cluster — EKS cluster name (residential or applications)
- xqwatcher:namespace — target Kubernetes namespace
- xqwatcher:min_replicas — minimum pod count (maps from auto_scale.desired)
- xqwatcher:max_replicas — maximum pod count (maps from auto_scale.max)
- xqwatcher:docker_tag — container image tag (default: latest)

Cluster assignments:
- mitx, mitx-staging → residential cluster
- mitxonline → applications cluster

Namespace assignments follow xqueue convention:
- mitx → mitx-openedx
- mitxonline → mitxonline-openedx
- mitx-staging → mitx-staging-openedx

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey requested a review from Copilot March 18, 2026 18:50

Copilot AI left a comment

Pull request overview

Migrates the xqueue-watcher (xqwatcher) infrastructure in ol-infrastructure from an EC2/ASG-based deployment to a Kubernetes Deployment on EKS, aligning with the ContainerGrader-based runtime introduced in the application repo.

Changes:

  • Adds xqwatcher to shared enum types to support consistent labeling.
  • Updates Vault policy to allow reading xqueue server credentials.
  • Replaces the xqwatcher EC2 stack with Kubernetes resources (Vault auth binding + VSO-synced secret + ConfigMap + RBAC + Deployment) and updates stack configs accordingly.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to Services/Application enums for consistent labels. |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Extends Vault policy to read xqueue server secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Full rewrite: provisions Vault+IRSA binding, VSO secret sync, ConfigMap, RBAC, and a Deployment for xqueue-watcher. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config to K8s-focused settings (cluster/namespace/replicas/docker tag). |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Same as above for Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Same as above for CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates config for residential mitx QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates config for residential mitx Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates config for residential mitx CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates config for mitx-staging QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates config for mitx-staging Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates config for mitx-staging CI. |

blarghmatey and others added 3 commits March 18, 2026 15:35
- Add create_irsa_service_account flag to OLEKSAuthBinding to
  optionally create the K8s ServiceAccount with IRSA annotation;
  use it in xqwatcher to fix 'serviceaccount not found' pod error
- Add XQWATCHER_* env vars to Deployment matching env_settings.py;
  expose http_basic_auth from Vault-synced secret via VSO template
- Fix image reference from ghcr.io to mitodl/ (DockerHub)
- Change imagePullPolicy to Always for mutable 'latest' tag
- Rename XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG
- Remove unused network_stack StackReference
- Remove dead xqwatcher:target_vpc config key from all 9 stacks
- Remove unimplemented xqwatcher:max_replicas from all 9 stacks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The manager CLI only accepts -d/--config_root; it auto-discovers
xqwatcher.json and logging.json from that directory. Remove the
non-existent --config and --logging-config flags.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 4 commits March 18, 2026 16:44
…pods

ContainerGrader calls k8s_config.load_incluster_config() which reads
the service account token from the projected volume at
/var/run/secrets/kubernetes.io/serviceaccount/token. The xqwatcher
ServiceAccount has automount_service_account_token=False (secure
default), so the PodSpec must explicitly opt in to have the token
mounted, otherwise all Kubernetes Job API calls will fail with a
ConfigException.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
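
A minimal sketch of the opt-in described in the commit above, written as a plain pod spec fragment in Python. The service account name, container name, and image placeholder are assumptions; only the `automountServiceAccountToken` field reflects the change itself.

```python
def pod_spec_with_token(service_account: str, image: str) -> dict:
    """Build a pod spec fragment that explicitly mounts the service account token."""
    return {
        "serviceAccountName": service_account,
        # The pod-level setting takes precedence over the ServiceAccount's
        # automountServiceAccountToken=False default.
        "automountServiceAccountToken": True,
        "containers": [{"name": "xqwatcher", "image": image}],
    }


# Hypothetical values for illustration only.
spec = pod_spec_with_token("xqwatcher", "mitodl/xqueue-watcher:example")
```

With the token mounted, an in-cluster client can read it from the projected volume path named in the commit message.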
…e tag

When the Concourse pipeline populates XQWATCHER_DOCKER_DIGEST, build
the image ref as mitodl/xqueue-watcher@sha256:... (immutable digest)
so Kubernetes always pulls exactly the image that was built and tested.
Fall back to :tag from stack config only when the digest is unavailable
(e.g. manual deploys). imagePullPolicy: Always is retained so new
digests are always pulled on rollout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The uv virtualenv bin directory is not on PATH in the container, so
the 'xqueue-watcher' console script can't be found directly. Use
'uv run xqueue-watcher' to invoke it through uv's environment, which
correctly resolves the script installed in the project virtualenv.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
uv run without --no-sync attempts to sync the virtualenv at startup,
which fails in the container (no write access / network). Use
--no-sync to run the already-installed entrypoint as-is.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 11 commits March 18, 2026 17:22
configure_from_directory(path) reads xqwatcher.json and logging.json
directly from path, then globs path/conf.d/*.json for queue watcher
configs. We were passing -d /xqwatcher/conf.d and mounting everything
flat there, so the manager looked for watchers at
/xqwatcher/conf.d/conf.d/*.json (not found).

Fix: pass -d /xqwatcher and restructure mounts:
  /xqwatcher/xqwatcher.json      <- manager config (ConfigMap)
  /xqwatcher/logging.json        <- logging config (ConfigMap)
  /xqwatcher/conf.d/grader_config.json  <- queue watchers (Vault secret)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
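
The path mix-up in the commit above can be reproduced with a small stand-in for the discovery logic. This is a sketch of the layout as described in the commit message, not the actual `configure_from_directory` implementation; `discover_config` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def discover_config(root: Path) -> dict:
    """Mimic the described lookup: two files in root, watchers in root/conf.d."""
    return {
        "manager": root / "xqwatcher.json",
        "logging": root / "logging.json",
        "watchers": sorted((root / "conf.d").glob("*.json")),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "conf.d").mkdir()
    for rel in ("xqwatcher.json", "logging.json", "conf.d/grader_config.json"):
        (root / rel).write_text("{}")

    # Fixed invocation: -d points at the parent directory.
    good = discover_config(root)
    # Original invocation: -d points at conf.d, so the glob targets conf.d/conf.d/.
    bad = discover_config(root / "conf.d")

    found_good = [p.name for p in good["watchers"]]
    found_bad = [p.name for p in bad["watchers"]]
```

In this sketch the fixed invocation finds `grader_config.json`, while the original one finds no watcher configs at all.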
VSO renders secret values via Go templates: {{ .Secrets.confd_json }}.
When confd_json is stored as a nested object, VSO renders a Go map
literal (map[...]) rather than valid JSON, causing a JSONDecodeError
at startup. Pre-serialize confd_json to a JSON string so the template
renders parseable JSON.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
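
The pre-serialization step in the commit above amounts to storing a JSON string rather than a nested object. The following sketch uses a placeholder queue config; the queue name and values are illustrative, not taken from the real secret.

```python
import json

# Illustrative queue config; the queue name and values are placeholders.
confd = {
    "example-queue": {
        "CONNECTIONS": 1,
        "HANDLERS": [{"KWARGS": {"image": "mitodl/mit-600x-grader:latest"}}],
    }
}

# Stored as a nested object: a Go template prints this as a map[...] literal.
nested_secret = {"confd_json": confd}

# Stored as a pre-serialized string: the template emits the JSON text verbatim.
string_secret = {"confd_json": json.dumps(confd)}

# The string form round-trips through a JSON parser, which is what startup needs.
parsed = json.loads(string_secret["confd_json"])
```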
…llback

Match the keycloak pattern: require the digest env var so the image is
always pinned to an immutable digest. Remove the mutable :latest tag
fallback that allowed manual pulumi-up runs to silently deploy an
uncontrolled image. Also remove the unused xqwatcher:docker_tag config
key from all stack YAML files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gh cache

When the SOPS secret's confd_json contains a ContainerGrader handler
whose KWARGS include an 'image' key, rewrite that value through
cached_image_uri() before writing to Vault. This means the SOPS secret
stores a plain DockerHub reference (e.g. mitodl/mit-600x-grader:latest)
and Pulumi transforms it to the ECR pull-through cache URI at deploy
time, keeping grading Jobs free from DockerHub rate limits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CodeQL 'Analyze (actions)' job (exit code 32) fails because the
extractor finds .github/workflows/*.yml and .github/actions/**/*.yml
but cannot process any of them. This is a known extractor-level issue
with CodeQL 2.24.x on Erk agent workflow patterns.

Excluding .github from CodeQL's path analysis silences the fatal error
while leaving Python and JavaScript/TypeScript scans unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add src/ol_concourse/pipelines/open_edx/grader_images/ with three pipeline
definitions for building and publishing containerized course grader images
to private ECR.

base_image_pipeline.py:
  Builds grader_support/Dockerfile.base from the xqueue-watcher repo and
  pushes to both DockerHub (mitodl/xqueue-watcher-grader-base, public) and
  ECR (610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/xqueue-watcher-
  grader-base, private). Triggered by changes to grader_support/ in the
  xqueue-watcher repo. The ECR push is the trigger source for downstream
  per-grader build pipelines.

build_pipeline.py:
  GraderPipelineConfig dataclass and grader_image_pipeline() factory for
  per-grader-repo build pipelines. Triggered by new commits to the grader
  repo OR a new base image digest in ECR. The Docker build receives
  GRADER_BASE_IMAGE=repo@sha256:... resolved at runtime via a sh wrapper
  around oci-build-task's build script (the only way to inject a
  file-derived BUILD_ARG in Concourse; params are static strings).
  Pushes to private ECR only. GRADER_PIPELINES list seeded with
  graders-mit-600x.

meta.py:
  Self-updating meta pipeline that creates and maintains the base image
  pipeline and one build pipeline per GRADER_PIPELINES entry. Triggered
  by changes to the grader_images/ pipeline code in ol-infrastructure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ines

- base_image_pipeline: use chore/migrate-to-uv-and-k8s-container-grader
  branch of xqueue-watcher (where Dockerfile.base updates live)
- build_pipeline: track feat/containerized-grader for graders-mit-600x
- Fix E501 in both files: split long strings to stay within 88-char limit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CONTEXT was grader_support/ which caused the COPY grader_support/
instruction in Dockerfile.base to fail (no nested grader_support/ inside
the context). Use the repo root as CONTEXT so the COPY can locate the
directory relative to it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…images

Add ensure_ecr_task() helper to ol_concourse/lib/containers.py (mirrors the
pattern used in the dagster docker_pulumi_pipeline). The task runs the AWS
CLI to check for the ECR repository and creates it if missing, so the first
pipeline run does not fail on a missing registry.

Apply to both grader image pipelines:
- base_image_pipeline: ensures mitodl/xqueue-watcher-grader-base exists
  before pushing to ECR
- build_pipeline: ensures the per-grader ECR repo (config.ecr_repo_name)
  exists before pushing the course grader image

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
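
The check-then-create behaviour described in the commit above could be expressed as a single shell command. The helper below only builds that command string; `ensure_ecr_repo_command` is a hypothetical name and this is not the `ensure_ecr_task()` implementation, which wraps the step in a Concourse task.

```python
import shlex


def ensure_ecr_repo_command(repo_name: str) -> str:
    """Build a shell command that creates the ECR repository only if it is missing."""
    quoted = shlex.quote(repo_name)
    return (
        # describe-repositories exits non-zero when the repository does not exist,
        # in which case the right-hand side of || creates it.
        f"aws ecr describe-repositories --repository-names {quoted} "
        f"|| aws ecr create-repository --repository-name {quoted}"
    )


command = ensure_ecr_repo_command("mitodl/xqueue-watcher-grader-base")
```

The AWS CLI still needs a region and credentials from its environment for this command to run.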
When ecr_region is set, the registry-image resource automatically
constructs the full ECR URI as {account}.dkr.ecr.{region}.amazonaws.com/{repository}.
Passing the full URI in image_repository caused the hostname to be doubled
in API calls, resulting in NAME_UNKNOWN errors.

- Remove ecr_image_uri property from GraderPipelineConfig
- Fix grader_base_ecr_repo default to use repo-name-only string
- Change registry_image(image_repository=config.ecr_image_uri) to
  registry_image(image_repository=config.ecr_repo_name)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
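
The doubled hostname described in the commit above follows directly from the URI template. This sketch mimics that template with a hypothetical `ecr_uri` helper; it is not the registry-image resource's code.

```python
def ecr_uri(account: str, region: str, repository: str) -> str:
    """Compose a URI the way the commit message describes for ecr_region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}"


host = "610119931565.dkr.ecr.us-east-1.amazonaws.com"
repo = "mitodl/xqueue-watcher-grader-base"

# Passing only the repository name yields a single, valid hostname.
correct = ecr_uri("610119931565", "us-east-1", repo)

# Passing the full URI as the repository repeats the hostname in the path.
doubled = ecr_uri("610119931565", "us-east-1", f"{host}/{repo}")
```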
The grader-images-pipeline-code git resource was tracking 'main', but
the pipeline files don't exist on main yet. Switch to the feature branch
until this work is merged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 19, 2026 15:48
The graders-mit-600x repository is private. Switch the git resource from
an HTTPS git_repo to an ssh_git_repo so Concourse can clone it. The SSH
private key is read from Vault at ((github.ssh_private_key)).

- Import ssh_git_repo instead of git_repo
- Add github_private_key field to GraderPipelineConfig (defaults to
  ((github.ssh_private_key)))
- Update grader_repo_url in GRADER_PIPELINES to use SSH form
  (git@github.com:mitodl/graders-mit-600x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
infrastructure/github has no generic SSH key. The correct key for
cloning private mitodl repos from the infrastructure Concourse team
is odlbot_private_ssh_key in infrastructure/open_api_clients.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 3 commits March 19, 2026 15:58
Switch the grader-base-image registry-image resource from ECR to
DockerHub (mitodl/xqueue-watcher-grader-base). The base image pipeline
pushes to both DockerHub and ECR; DockerHub is public and simpler to
poll as a trigger without needing AWS credentials.

- Rename GraderPipelineConfig.grader_base_ecr_repo to grader_base_dockerhub_repo
- Remove ecr_region from the base image resource
- Add DockerHub credentials ((dockerhub.username/password))

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three related fixes for the containerized grader deployment:

1. Add XQWATCHER_GRADER_NAMESPACE env var set to the deployment namespace.
   Without this, ContainerGrader defaults to spawning Jobs in 'default',
   breaking the RBAC Role binding and landing Jobs in the wrong namespace.

2. Add XQWATCHER_GRADER_BACKEND, CPU_LIMIT, MEMORY_LIMIT, TIMEOUT env vars
   driven by new stack config keys (grader_namespace, grader_cpu_limit,
   grader_memory_limit, grader_timeout). These set deployment-wide defaults
   so individual conf.d queue files don't need to repeat them.

3. Fix the DockerHub pull-through cache rewrite to skip images that already
   have a registry hostname (e.g. private ECR URIs, ghcr.io). Previously
   cached_image_uri() was called unconditionally, which would mangle a
   full ECR URI into an invalid doubled-host path. Images are now only
   rewritten if the first path component contains no '.'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
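
The third fix in the commit above can be sketched as a small guard around the rewrite. `rewrite_through_cache` is a hypothetical stand-in for `cached_image_uri()`, and the cache prefix format is an assumption. The extra requirement that the image contain a `/` is also an assumption, added so a dotted tag on a bare image name is not mistaken for a hostname.

```python
def has_registry_host(image: str) -> bool:
    """True if the first path component looks like a hostname (contains '.')."""
    first, sep, _rest = image.partition("/")
    return bool(sep) and "." in first


def rewrite_through_cache(image: str, cache_prefix: str) -> str:
    """Prefix plain DockerHub references; leave fully qualified images alone."""
    if has_registry_host(image):
        return image  # already has a registry host (ECR, ghcr.io, ...)
    return f"{cache_prefix}/{image}"


# Hypothetical pull-through cache prefix, for illustration only.
PREFIX = "123456789012.dkr.ecr.us-east-1.amazonaws.com/dockerhub"

plain = rewrite_through_cache("mitodl/mit-600x-grader:latest", PREFIX)
ecr = rewrite_through_cache(
    "610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/graders-mit-686x:latest",
    PREFIX,
)
ghcr = rewrite_through_cache("ghcr.io/mitodl/xqueue-watcher:latest", PREFIX)
```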
blarghmatey and others added 2 commits March 20, 2026 11:28
…onfig + SERVER_REF

Queue configs (CONNECTIONS, HANDLERS, ContainerGrader KWARGS) are now
stored as plaintext in Pulumi stack YAML files under xqwatcher:queues.
The xqueue server URL is stored under xqwatcher:xqueue_server_url.

SERVER_REF is injected at deploy time so xqueue-watcher resolves
credentials at runtime from xqueue_servers.json, which is mounted from
a Vault-synced Kubernetes Secret.  The secret is sourced from the same
secret-{env_prefix}/edx-xqueue Vault KV path already used by the xqueue
and edxapp deployments (xqwatcher_password field), eliminating the
separate xqwatcher-specific KV mount and SOPS secrets files.

Changes:
- __main__.py: remove SOPS read, vault.kv.SecretV2, vault_mount_stack
  StackReference, and XQWATCHER_HTTP_BASIC_AUTH env var; read queues
  config from Pulumi config; inject SERVER_REF into each queue entry;
  move grader_config.json into ConfigMap; add xqueue_servers.json
  Vault-synced secret from secret-{env_prefix}/edx-xqueue; update
  Deployment volumes/mounts accordingly
- xqwatcher_server_policy.hcl: remove secret-xqwatcher/* path
- All 9 stack YAML files: add xqueue_server_url and queues config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add AWS_DEFAULT_REGION=us-east-1 to ensure_ecr_task params so the
  AWS CLI knows which region to use without relying on worker defaults
- Remove spurious service_account_name kwarg from OLVaultK8SResourcesConfig
  instantiation in OLEKSAuthBinding; the field does not exist on the
  model and the name is derived internally from application_name
- Fix liveness probe to use 'uv run --no-sync python' instead of bare
  'python', which would fail with ModuleNotFoundError because
  xqueue_watcher is only available inside the uv virtual environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Migrates the xqueue-watcher infrastructure from an EC2/ASG deployment to Kubernetes, updating Vault access and stack configuration, and adding Concourse pipelines to build/publish grader container images used by the new ContainerGrader flow.

Changes:

  • Add xqwatcher to shared enums used for labeling.
  • Replace the xqwatcher stack’s EC2 resources with Kubernetes resources (Deployment, RBAC, ConfigMap, Vault Secrets Operator integration).
  • Add Concourse pipelines to build a grader base image and course-specific grader images, and update stack YAML configs for the new K8s-based deployment.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to enums used for consistent label generation. |
| src/ol_infrastructure/components/applications/eks.py | Extends OLEKSAuthBinding to optionally create IRSA ServiceAccount(s). |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Adjusts Vault policy to allow reading xqueue credentials from the shared secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Replaces EC2-based deployment with K8s Deployment + RBAC + ConfigMap + VSO-managed secrets. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_concourse/pipelines/open_edx/grader_images/meta.py | Adds a self-updating meta pipeline that creates/updates grader image pipelines. |
| src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py | Adds reusable pipeline generator for course-specific grader images. |
| src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py | Adds pipeline generator for building/publishing the shared grader base image. |
| src/ol_concourse/pipelines/open_edx/grader_images/__init__.py | Initializes the new grader_images pipeline package. |
| src/ol_concourse/lib/containers.py | Adds a reusable task step to ensure an ECR repository exists before pushing. |
| src/bridge/secrets/xqwatcher/secrets.mitx.ci.yaml | Updates encrypted xqwatcher grading configuration secrets for the new backend. |
| .github/codeql/codeql-config.yml | Adds CodeQL config to exclude .github from actions extraction failures. |

blarghmatey and others added 2 commits March 20, 2026 13:22
Replace the old Packer-based xqwatcher pipeline with a Docker+Pulumi
pipeline that mirrors the xqueue pattern:

- Watches mitodl/xqueue-watcher (main) for new commits
- Builds and pushes the Docker image to DockerHub as
  mitodl/xqueue-watcher:{release}
- Passes the built image digest as XQWATCHER_DOCKER_DIGEST to each
  Pulumi stack so the Deployment rolls to the exact image SHA

Update meta.py to generate docker-pulumi-xqwatcher-{release} pipelines
instead of the retired packer-pulumi ones.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Unpin grader_images meta pipeline from feature branch; track main
- Unpin xqueue-watcher base image source from dev branch; track main
- Unpin graders-mit-600x grader repo from feature branch; track main
- Fix base_image_pipeline.py docstring: downstream pipelines trigger off
  the DockerHub push, not the ECR push
- Add xqwatcher:docker_tag config fallback for XQWATCHER_DOCKER_DIGEST
  so pulumi up can run without the env var set (matches xqueue pattern)
- Remove env vars that duplicate xqwatcher.json ConfigMap values
  (POLL_TIME, REQUESTS_TIMEOUT, POLL_INTERVAL, FOLLOW_CLIENT_REDIRECTS);
  keep only LOGIN_POLL_INTERVAL and GRADER_* which are not in the ConfigMap
- Update PR description: image is on DockerHub, not GHCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 20, 2026 14:06
Register the MIT 6.686x course-specific grader image in GRADER_PIPELINES
so the meta pipeline creates a build-graders-mit-686x-image Concourse
pipeline that tracks the graders-mit-686x repo and pushes to ECR at
mitodl/graders-mit-686x.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
if not docker_image_digest:
    msg = "Either XQWATCHER_DOCKER_DIGEST env var or xqwatcher:docker_tag config must be set"  # noqa: E501
    raise ValueError(msg)
docker_image_ref = f"mitodl/xqueue-watcher@{docker_image_digest}"

Bug: The code incorrectly uses @ to build the Docker image reference, which is only for digests. If the xqwatcher:docker_tag config provides a tag, the deployment will fail.
Severity: HIGH

Suggested Fix

Modify the image reference construction to use a colon (:) instead of an at-symbol (@). This will correctly handle both tags and digests, as Docker's syntax image:tag@digest prioritizes the digest if both are present. The code should be changed to docker_image_ref = f"mitodl/xqueue-watcher:{docker_image_digest}". This aligns with the behavior of other applications like xqueue.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/ol_infrastructure/applications/xqwatcher/__main__.py#L75

Potential issue: The code constructs a Docker image reference by hardcoding the `@`
symbol, which is reserved for image digests. However, the value for the image identifier
can be sourced from a Pulumi configuration named `xqwatcher:docker_tag`, which implies a
tag can be used. If a tag (e.g., `latest`) is provided through this configuration, the
resulting image reference, such as `mitodl/xqueue-watcher@latest`, will be syntactically
invalid. This will cause Kubernetes to fail when pulling the container image, preventing
the application pod from starting.
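
One way to sidestep the ambiguity raised in this comment is to keep the digest and the tag as separate inputs. A digest reference takes the form `repo@sha256:...` and a tag reference takes the form `repo:tag`, so neither separator works for both kinds of value. The sketch below uses a hypothetical `image_ref` helper and is not the code in `__main__.py`.

```python
from typing import Optional


def image_ref(
    repository: str, digest: Optional[str] = None, tag: Optional[str] = None
) -> str:
    """Prefer an immutable digest; fall back to a tag when no digest is given."""
    if digest:
        return f"{repository}@{digest}"  # e.g. repo@sha256:<hex>
    if tag:
        return f"{repository}:{tag}"  # e.g. repo:latest
    msg = "Either a digest or a tag must be provided"
    raise ValueError(msg)


example_digest = "sha256:" + "a" * 64  # placeholder digest, not a real image
by_digest = image_ref("mitodl/xqueue-watcher", digest=example_digest)
by_tag = image_ref("mitodl/xqueue-watcher", tag="latest")
```

Keeping the two inputs apart means the caller never has to guess which separator a given value needs.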

blarghmatey and others added 2 commits March 20, 2026 16:04
Add the edxorg-686x queue to the mitxonline production xqwatcher stack
using the ContainerGrader handler, replacing the legacy JailedGrader
configuration in confd_json. This is in preparation for deployment of
the xqueue-watcher changes in mitodl/xqueue-watcher#14.

The memory limit is set to 1Gi (vs 512Mi for 600x) to accommodate the
torch dependency used by the mnist problem set graders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an "edxorg" entry to the xqueue_servers.json Vault template so that
queues using SERVER_REF "edxorg" resolve credentials for
https://xqueue.edx.org. The template variables edxorg_xqueue_username
and edxorg_xqueue_password must be added to the existing edx-xqueue
Vault KV secret.

Update the queue config loop to use setdefault so that queues can
declare their own SERVER_REF in the Pulumi stack config rather than
always being assigned "default".

Set SERVER_REF: edxorg on the edxorg-686x queue in the mitxonline
production stack config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
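
The `setdefault` change in the commit above could look roughly like this. The shape of the queues mapping and the `mitx-600x` queue name are assumptions for illustration; `inject_server_ref` is a hypothetical helper, not the loop in `__main__.py`.

```python
import copy


def inject_server_ref(queues: dict, default_ref: str = "default") -> dict:
    """Assign SERVER_REF only to queues that do not declare their own."""
    result = copy.deepcopy(queues)  # avoid mutating the stack config in place
    for queue_config in result.values():
        queue_config.setdefault("SERVER_REF", default_ref)
    return result


queues = {
    "mitx-600x": {"CONNECTIONS": 1},  # hypothetical queue without SERVER_REF
    "edxorg-686x": {"CONNECTIONS": 1, "SERVER_REF": "edxorg"},
}
resolved = inject_server_ref(queues)
```

A plain assignment would overwrite the `edxorg` value, which is the behaviour the commit moves away from.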

2 participants