feat(xqwatcher): migrate from EC2 ASG to Kubernetes Deployment by blarghmatey · Pull Request #4287 · mitodl/ol-infrastructure

blarghmatey · 2026-03-11T16:23:52Z

Summary

Migrates xqueue-watcher infrastructure from EC2 Auto Scaling Groups with AppArmor/codejail sandboxing to a Kubernetes Deployment using container-based grading. This is the infrastructure companion to mitodl/xqueue-watcher#14 which implements the ContainerGrader backend.

Changes

`src/ol_infrastructure/lib/ol_types.py`

Added xqwatcher to both Services and Application enums for consistent K8s label generation.

`src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl`

Added read access to secret-DEPLOYMENT/edx-xqueue so the grader handler config (stored in Vault) can embed the xqueue server URL and authentication password.

`src/ol_infrastructure/applications/xqwatcher/__main__.py`

Complete rewrite replacing EC2 resources with Kubernetes resources:

| Old (EC2) | New (K8s) |
| --- | --- |
| IAM instance profile + Vault AWS auth | `OLEKSAuthBinding` (IRSA + Vault K8s auth) |
| EC2 Launch Template + ASG | Kubernetes `Deployment` |
| AMI with codejail/AppArmor | `mitodl/xqueue-watcher` (DockerHub) container image |
| Consul config distribution | `ConfigMap` + `OLVaultK8SSecret` CRD |

New Kubernetes resources created:

OLEKSAuthBinding — IRSA role + Vault Kubernetes auth backend role
OLVaultK8SSecret — syncs grader handler config from Vault KV to a K8s Secret via Vault Secrets Operator
ConfigMap — base poll settings (xqwatcher.json) and stdout-only structured logging (logging.json)
Role + RoleBinding — grants xqwatcher pods permission to create/delete Jobs and read pod logs (required by ContainerGrader's Kubernetes backend)
Deployment — runs xqueue-watcher with non-root security context, resource limits, liveness probe, and topology spread for HA
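The Role in the list above can be sketched as a plain manifest dictionary. This is a minimal sketch, not the PR's Pulumi code: the Role name `xqwatcher-grader-jobs` and the exact verb lists are assumptions, chosen to cover only the permissions the description names (create/delete Jobs, read pod logs).

```python
# Hypothetical name: the PR description does not state the Role's name.
ROLE_NAME = "xqwatcher-grader-jobs"


def grader_role_manifest(namespace: str) -> dict:
    """Build a Role manifest allowing Job create/delete and pod log reads."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": ROLE_NAME, "namespace": namespace},
        "rules": [
            # Jobs belong to the "batch" API group.
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "delete"],
            },
            # Pod logs are the "pods/log" subresource of the core ("") group.
            {
                "apiGroups": [""],
                "resources": ["pods/log"],
                "verbs": ["get"],
            },
        ],
    }


role = grader_role_manifest("mitxonline-openedx")
```

A real grader backend may also need read verbs such as `get` or `watch` on Jobs and pods; those are left out here because the description does not mention them.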

Stack configs (9 files)

Removed EC2-specific keys (consul:address, auto_scale, instance_type) and added K8s-specific keys:

xqwatcher:cluster — EKS cluster name
xqwatcher:namespace — Kubernetes namespace
xqwatcher:min_replicas / max_replicas
xqwatcher:docker_tag

Deployment Prerequisites

Before applying this stack:

Build and push mitodl/xqueue-watcher image to DockerHub (from mitodl/xqueue-watcher#14)
Build and push course grader images (e.g. from MITx/graders-mit-600x#10)
Update Vault secret secret-xqwatcher/{env}-grader-config with confd_json containing a ContainerGrader handler config
Ensure Vault Secrets Operator is installed in the target cluster

Related PRs

mitodl/xqueue-watcher#14: ContainerGrader implementation + uv migration ("feat: migrate to uv + add ContainerGrader for Kubernetes/Docker sandboxed grading")
MITx/graders-mit-600x#10: Course grader Dockerfile

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add read access to secret-DEPLOYMENT/edx-xqueue so the xqwatcher service can retrieve the xqueue server URL and authentication password needed by the ContainerGrader handler config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Completely rewrite the xqwatcher Pulumi stack to deploy on Kubernetes instead of EC2 Auto Scaling Groups with AppArmor/codejail.

Changes:
- Replace IAM instance profile + Vault AWS auth with OLEKSAuthBinding (IRSA + Vault K8s auth backend)
- Add OLVaultK8SSecret to sync grader handler config from Vault KV to a Kubernetes Secret via the Vault Secrets Operator CRD
- Add a ConfigMap for base poll settings and structured JSON logging to stdout (no log rotation in containers)
- Add RBAC Role + RoleBinding granting the xqwatcher service account permission to create/delete Kubernetes Jobs and read pod logs, required by ContainerGrader's kubernetes backend
- Create a Kubernetes Deployment with:
  - ghcr.io/mitodl/xqueue-watcher image
  - Security context (non-root, drop ALL capabilities)
  - Resource requests + memory limit
  - Liveness probe via python -c import xqueue_watcher
  - Topology spread for HA across nodes
  - Vault grader config + base config mounted into /xqwatcher/conf.d/
- Preserve vault.kv.SecretV2 write so grader config remains managed in Pulumi
- Export k8s_deployment_name and k8s_namespace

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove EC2-specific settings (consul:address, auto_scale, instance_type) and add Kubernetes-specific settings for all stacks:
- xqwatcher:cluster: EKS cluster name (residential or applications)
- xqwatcher:namespace: target Kubernetes namespace
- xqwatcher:min_replicas: minimum pod count (maps from auto_scale.desired)
- xqwatcher:max_replicas: maximum pod count (maps from auto_scale.max)
- xqwatcher:docker_tag: container image tag (default: latest)

Cluster assignments:
- mitx, mitx-staging → residential cluster
- mitxonline → applications cluster

Namespace assignments follow xqueue convention:
- mitx → mitx-openedx
- mitxonline → mitxonline-openedx
- mitx-staging → mitx-staging-openedx

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

for more information, see https://pre-commit.ci

Copilot

Pull request overview

Migrates the xqueue-watcher (xqwatcher) infrastructure in ol-infrastructure from an EC2/ASG-based deployment to a Kubernetes Deployment on EKS, aligning with the ContainerGrader-based runtime introduced in the application repo.

Changes:

Adds xqwatcher to shared enum types to support consistent labeling.
Updates Vault policy to allow reading xqueue server credentials.
Replaces the xqwatcher EC2 stack with Kubernetes resources (Vault auth binding + VSO-synced secret + ConfigMap + RBAC + Deployment) and updates stack configs accordingly.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| `src/ol_infrastructure/lib/ol_types.py` | Adds `xqwatcher` to `Services`/`Application` enums for consistent labels. |
| `src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl` | Extends Vault policy to read xqueue server secret path. |
| `src/ol_infrastructure/applications/xqwatcher/__main__.py` | Full rewrite: provisions Vault+IRSA binding, VSO secret sync, ConfigMap, RBAC, and a Deployment for xqueue-watcher. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml` | Updates stack config to K8s-focused settings (cluster/namespace/replicas/docker tag). |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml` | Same as above for Production. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml` | Same as above for CI. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml` | Updates config for residential mitx QA. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml` | Updates config for residential mitx Production. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml` | Updates config for residential mitx CI. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml` | Updates config for mitx-staging QA. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml` | Updates config for mitx-staging Production. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml` | Updates config for mitx-staging CI. |



- Add create_irsa_service_account flag to OLEKSAuthBinding to optionally create the K8s ServiceAccount with IRSA annotation; use it in xqwatcher to fix 'serviceaccount not found' pod error
- Add XQWATCHER_* env vars to Deployment matching env_settings.py; expose http_basic_auth from Vault-synced secret via VSO template
- Fix image reference from ghcr.io to mitodl/ (DockerHub)
- Change imagePullPolicy to Always for mutable 'latest' tag
- Rename XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG
- Remove unused network_stack StackReference
- Remove dead xqwatcher:target_vpc config key from all 9 stacks
- Remove unimplemented xqwatcher:max_replicas from all 9 stacks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The manager CLI only accepts -d/--config_root; it auto-discovers xqwatcher.json and logging.json from that directory. Remove the non-existent --config and --logging-config flags. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


…pods ContainerGrader calls k8s_config.load_incluster_config() which reads the service account token from the projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token. The xqwatcher ServiceAccount has automount_service_account_token=False (secure default), so the PodSpec must explicitly opt in to have the token mounted, otherwise all Kubernetes Job API calls will fail with a ConfigException. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e tag When the Concourse pipeline populates XQWATCHER_DOCKER_DIGEST, build the image ref as mitodl/xqueue-watcher@sha256:... (immutable digest) so Kubernetes always pulls exactly the image that was built and tested. Fall back to :tag from stack config only when the digest is unavailable (e.g. manual deploys). imagePullPolicy: Always is retained so new digests are always pulled on rollout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The uv virtualenv bin directory is not on PATH in the container, so the 'xqueue-watcher' console script can't be found directly. Use 'uv run xqueue-watcher' to invoke it through uv's environment, which correctly resolves the script installed in the project virtualenv. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

uv run without --no-sync attempts to sync the virtualenv at startup, which fails in the container (no write access / network). Use --no-sync to run the already-installed entrypoint as-is. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


configure_from_directory(path) reads xqwatcher.json and logging.json directly from path, then globs path/conf.d/*.json for queue watcher configs. We were passing -d /xqwatcher/conf.d and mounting everything flat there, so the manager looked for watchers at /xqwatcher/conf.d/conf.d/*.json (not found).

Fix: pass -d /xqwatcher and restructure mounts:

/xqwatcher/xqwatcher.json <- manager config (ConfigMap)
/xqwatcher/logging.json <- logging config (ConfigMap)
/xqwatcher/conf.d/grader_config.json <- queue watchers (Vault secret)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
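The directory lookup described in this commit message can be reproduced with a small stand-in. This is a sketch of the described behavior only, assuming nothing about xqueue-watcher's real implementation; `discover_config` is a hypothetical helper, and the files are created in a temporary directory rather than under `/xqwatcher`.

```python
import json
import tempfile
from pathlib import Path


def discover_config(root: Path) -> dict:
    """Mimic the lookup described above; not the real xqueue-watcher code."""
    return {
        "manager": root / "xqwatcher.json",
        "logging": root / "logging.json",
        # Queue watcher configs are globbed from <root>/conf.d/*.json.
        "watchers": sorted((root / "conf.d").glob("*.json")),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "xqwatcher"
    (root / "conf.d").mkdir(parents=True)
    (root / "xqwatcher.json").write_text(json.dumps({}))
    (root / "logging.json").write_text(json.dumps({}))
    (root / "conf.d" / "grader_config.json").write_text(json.dumps({}))

    # Equivalent of "-d /xqwatcher": watchers are found under conf.d/.
    good_names = [p.name for p in discover_config(root)["watchers"]]
    # Equivalent of "-d /xqwatcher/conf.d": the glob targets conf.d/conf.d/.
    bad_names = [p.name for p in discover_config(root / "conf.d")["watchers"]]
```

With the corrected root, `good_names` contains `grader_config.json`, while the old root leaves `bad_names` empty because `conf.d/conf.d/` does not exist.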

VSO renders secret values via Go templates: {{ .Secrets.confd_json }}. When confd_json is stored as a nested object, VSO renders a Go map literal (map[...]) rather than valid JSON, causing a JSONDecodeError at startup. Pre-serialize confd_json to a JSON string so the template renders parseable JSON. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
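The pre-serialization step can be illustrated in a few lines. This is a minimal sketch under stated assumptions: the queue name and handler value in `confd` are placeholders, and `vault_payload` is a hypothetical helper rather than a function from the PR.

```python
import json

# Placeholder handler config; the real confd_json contents are not shown here.
confd = {"example-queue": {"HANDLERS": [{"HANDLER": "ContainerGrader"}]}}


def vault_payload(confd: dict) -> dict:
    """Store confd_json as a JSON string rather than a nested object."""
    return {"confd_json": json.dumps(confd)}


payload = vault_payload(confd)

# A template that interpolates the stored string verbatim now emits text
# that a JSON parser accepts.
rendered = payload["confd_json"]
roundtrip = json.loads(rendered)
```

Because the stored value is already a string, the template has no map to format, which is the failure mode the commit message describes.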

…llback Match the keycloak pattern: require the digest env var so the image is always pinned to an immutable digest. Remove the mutable :latest tag fallback that allowed manual pulumi-up runs to silently deploy an uncontrolled image. Also remove the unused xqwatcher:docker_tag config key from all stack YAML files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…gh cache When the SOPS secret's confd_json contains a ContainerGrader handler whose KWARGS include an 'image' key, rewrite that value through cached_image_uri() before writing to Vault. This means the SOPS secret stores a plain DockerHub reference (e.g. mitodl/mit-600x-grader:latest) and Pulumi transforms it to the ECR pull-through cache URI at deploy time, keeping grading Jobs free from DockerHub rate limits. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CodeQL 'Analyze (actions)' job (exit code 32) fails because the extractor finds .github/workflows/*.yml and .github/actions/**/*.yml but cannot process any of them. This is a known extractor-level issue with CodeQL 2.24.x on Erk agent workflow patterns. Excluding .github from CodeQL's path analysis silences the fatal error while leaving Python and JavaScript/TypeScript scans unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add src/ol_concourse/pipelines/open_edx/grader_images/ with three pipeline definitions for building and publishing containerized course grader images to private ECR.

base_image_pipeline.py: Builds grader_support/Dockerfile.base from the xqueue-watcher repo and pushes to both DockerHub (mitodl/xqueue-watcher-grader-base, public) and ECR (610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/xqueue-watcher-grader-base, private). Triggered by changes to grader_support/ in the xqueue-watcher repo. The ECR push is the trigger source for downstream per-grader build pipelines.

build_pipeline.py: GraderPipelineConfig dataclass and grader_image_pipeline() factory for per-grader-repo build pipelines. Triggered by new commits to the grader repo OR a new base image digest in ECR. The Docker build receives GRADER_BASE_IMAGE=repo@sha256:... resolved at runtime via a sh wrapper around oci-build-task's build script (the only way to inject a file-derived BUILD_ARG in Concourse; params are static strings). Pushes to private ECR only. GRADER_PIPELINES list seeded with graders-mit-600x.

meta.py: Self-updating meta pipeline that creates and maintains the base image pipeline and one build pipeline per GRADER_PIPELINES entry. Triggered by changes to the grader_images/ pipeline code in ol-infrastructure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ines

- base_image_pipeline: use chore/migrate-to-uv-and-k8s-container-grader branch of xqueue-watcher (where Dockerfile.base updates live)
- build_pipeline: track feat/containerized-grader for graders-mit-600x
- Fix E501 in both files: split long strings to stay within 88-char limit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CONTEXT was grader_support/ which caused the COPY grader_support/ instruction in Dockerfile.base to fail (no nested grader_support/ inside the context). Use the repo root as CONTEXT so the COPY can locate the directory relative to it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…images

Add ensure_ecr_task() helper to ol_concourse/lib/containers.py (mirrors the pattern used in the dagster docker_pulumi_pipeline). The task runs the AWS CLI to check for the ECR repository and creates it if missing, so the first pipeline run does not fail on a missing registry.

Apply to both grader image pipelines:
- base_image_pipeline: ensures mitodl/xqueue-watcher-grader-base exists before pushing to ECR
- build_pipeline: ensures the per-grader ECR repo (config.ecr_repo_name) exists before pushing the course grader image

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When ecr_region is set, the registry-image resource automatically constructs the full ECR URI as {account}.dkr.ecr.{region}.amazonaws.com/{repository}. Passing the full URI in image_repository caused the hostname to be doubled in API calls, resulting in NAME_UNKNOWN errors.

- Remove ecr_image_uri property from GraderPipelineConfig
- Fix grader_base_ecr_repo default to use repo-name-only string
- Change registry_image(image_repository=config.ecr_image_uri) to registry_image(image_repository=config.ecr_repo_name)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
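The doubled hostname follows directly from the URI pattern quoted in this commit message. The sketch below only models that string composition; `ecr_uri` is a hypothetical helper, not the registry-image resource's code, and the account ID and repository name are the ones that appear elsewhere in this PR.

```python
def ecr_uri(account: str, region: str, repository: str) -> str:
    """Compose an ECR URI using the pattern quoted in the commit message."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}"


ACCOUNT = "610119931565"
REGION = "us-east-1"
REPO_NAME = "mitodl/xqueue-watcher-grader-base"

# Passing the bare repository name yields the intended URI.
correct = ecr_uri(ACCOUNT, REGION, REPO_NAME)

# Passing an already-complete URI as the repository repeats the hostname.
doubled = ecr_uri(ACCOUNT, REGION, correct)
```

Here `doubled` contains the registry hostname twice, which matches the doubled-host symptom the commit fixes by passing `config.ecr_repo_name` instead.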

The grader-images-pipeline-code git resource was tracking 'main', but the pipeline files don't exist on main yet. Switch to the feature branch until this work is merged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


The graders-mit-600x repository is private. Switch the git resource from an HTTPS git_repo to an ssh_git_repo so Concourse can clone it. The SSH private key is read from Vault at ((github.ssh_private_key)).

- Import ssh_git_repo instead of git_repo
- Add github_private_key field to GraderPipelineConfig (defaults to ((github.ssh_private_key)))
- Update grader_repo_url in GRADER_PIPELINES to use SSH form (git@github.com:mitodl/graders-mit-600x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

infrastructure/github has no generic SSH key. The correct key for cloning private mitodl repos from the infrastructure Concourse team is odlbot_private_ssh_key in infrastructure/open_api_clients. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Switch the grader-base-image registry-image resource from ECR to DockerHub (mitodl/xqueue-watcher-grader-base). The base image pipeline pushes to both DockerHub and ECR; DockerHub is public and simpler to poll as a trigger without needing AWS credentials.

- Rename GraderPipelineConfig.grader_base_ecr_repo to grader_base_dockerhub_repo
- Remove ecr_region from the base image resource
- Add DockerHub credentials ((dockerhub.username/password))

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three related fixes for the containerized grader deployment:

1. Add XQWATCHER_GRADER_NAMESPACE env var set to the deployment namespace. Without this, ContainerGrader defaults to spawning Jobs in 'default', breaking the RBAC Role binding and landing Jobs in the wrong namespace.
2. Add XQWATCHER_GRADER_BACKEND, CPU_LIMIT, MEMORY_LIMIT, TIMEOUT env vars driven by new stack config keys (grader_namespace, grader_cpu_limit, grader_memory_limit, grader_timeout). These set deployment-wide defaults so individual conf.d queue files don't need to repeat them.
3. Fix the DockerHub pull-through cache rewrite to skip images that already have a registry hostname (e.g. private ECR URIs, ghcr.io). Previously cached_image_uri() was called unconditionally, which would mangle a full ECR URI into an invalid doubled-host path. Images are now only rewritten if the first path component contains no '.'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
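The third fix can be sketched as a small guard around the rewrite. This is an illustration, not the PR's code: `rewrite_for_cache` is a hypothetical stand-in for the call to `cached_image_uri()`, whose signature is not shown in this PR, and `CACHE_PREFIX` is an invented pull-through cache prefix.

```python
# Hypothetical cache prefix; the real value comes from the ECR configuration.
CACHE_PREFIX = "123456789012.dkr.ecr.us-east-1.amazonaws.com/dockerhub"


def has_registry_host(image: str) -> bool:
    """Apply the rule from the commit: a '.' in the first path component."""
    first_component = image.split("/")[0]
    return "." in first_component


def rewrite_for_cache(image: str) -> str:
    """Prefix plain DockerHub references; leave fully qualified ones alone."""
    if has_registry_host(image):
        return image
    return f"{CACHE_PREFIX}/{image}"


dockerhub_ref = rewrite_for_cache("mitodl/mit-600x-grader:latest")
ghcr_ref = rewrite_for_cache("ghcr.io/mitodl/example:1")
ecr_ref = rewrite_for_cache(f"{CACHE_PREFIX}/mitodl/mit-600x-grader:latest")
```

Applied literally, this rule also skips a single-component reference whose tag contains a dot (for example `python:3.12`), since the whole string is then the first path component.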


…onfig + SERVER_REF

Queue configs (CONNECTIONS, HANDLERS, ContainerGrader KWARGS) are now stored as plaintext in Pulumi stack YAML files under xqwatcher:queues. The xqueue server URL is stored under xqwatcher:xqueue_server_url. SERVER_REF is injected at deploy time so xqueue-watcher resolves credentials at runtime from xqueue_servers.json, which is mounted from a Vault-synced Kubernetes Secret. The secret is sourced from the same secret-{env_prefix}/edx-xqueue Vault KV path already used by the xqueue and edxapp deployments (xqwatcher_password field), eliminating the separate xqwatcher-specific KV mount and SOPS secrets files.

Changes:
- __main__.py: remove SOPS read, vault.kv.SecretV2, vault_mount_stack StackReference, and XQWATCHER_HTTP_BASIC_AUTH env var; read queues config from Pulumi config; inject SERVER_REF into each queue entry; move grader_config.json into ConfigMap; add xqueue_servers.json Vault-synced secret from secret-{env_prefix}/edx-xqueue; update Deployment volumes/mounts accordingly
- xqwatcher_server_policy.hcl: remove secret-xqwatcher/* path
- All 9 stack YAML files: add xqueue_server_url and queues config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


- Add AWS_DEFAULT_REGION=us-east-1 to ensure_ecr_task params so the AWS CLI knows which region to use without relying on worker defaults
- Remove spurious service_account_name kwarg from OLVaultK8SResourcesConfig instantiation in OLEKSAuthBinding; the field does not exist on the model and the name is derived internally from application_name
- Fix liveness probe to use 'uv run --no-sync python' instead of bare 'python', which would fail with ModuleNotFoundError because xqueue_watcher is only available inside the uv virtual environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Migrates the xqueue-watcher infrastructure from an EC2/ASG deployment to Kubernetes, updating Vault access and stack configuration, and adding Concourse pipelines to build/publish grader container images used by the new ContainerGrader flow.

Changes:

Add xqwatcher to shared enums used for labeling.
Replace the xqwatcher stack’s EC2 resources with Kubernetes resources (Deployment, RBAC, ConfigMap, Vault Secrets Operator integration).
Add Concourse pipelines to build a grader base image and course-specific grader images, and update stack YAML configs for the new K8s-based deployment.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| `src/ol_infrastructure/lib/ol_types.py` | Adds `xqwatcher` to enums used for consistent label generation. |
| `src/ol_infrastructure/components/applications/eks.py` | Extends `OLEKSAuthBinding` to optionally create IRSA ServiceAccount(s). |
| `src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl` | Adjusts Vault policy to allow reading xqueue credentials from the shared secret path. |
| `src/ol_infrastructure/applications/xqwatcher/__main__.py` | Replaces EC2-based deployment with K8s Deployment + RBAC + ConfigMap + VSO-managed secrets. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml` | Updates stack config from EC2 params to K8s params + queue definitions. |
| `src/ol_concourse/pipelines/open_edx/grader_images/meta.py` | Adds a self-updating meta pipeline that creates/updates grader image pipelines. |
| `src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py` | Adds reusable pipeline generator for course-specific grader images. |
| `src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py` | Adds pipeline generator for building/publishing the shared grader base image. |
| `src/ol_concourse/pipelines/open_edx/grader_images/__init__.py` | Initializes the new grader_images pipeline package. |
| `src/ol_concourse/lib/containers.py` | Adds a reusable task step to ensure an ECR repository exists before pushing. |
| `src/bridge/secrets/xqwatcher/secrets.mitx.ci.yaml` | Updates encrypted xqwatcher grading configuration secrets for the new backend. |
| `.github/codeql/codeql-config.yml` | Adds CodeQL config to exclude `.github` from actions extraction failures. |



Replace the old Packer-based xqwatcher pipeline with a Docker+Pulumi pipeline that mirrors the xqueue pattern:
- Watches mitodl/xqueue-watcher (main) for new commits
- Builds and pushes the Docker image to DockerHub as mitodl/xqueue-watcher:{release}
- Passes the built image digest as XQWATCHER_DOCKER_DIGEST to each Pulumi stack so the Deployment rolls to the exact image SHA

Update meta.py to generate docker-pulumi-xqwatcher-{release} pipelines instead of the retired packer-pulumi ones.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Unpin grader_images meta pipeline from feature branch; track main
- Unpin xqueue-watcher base image source from dev branch; track main
- Unpin graders-mit-600x grader repo from feature branch; track main
- Fix base_image_pipeline.py docstring: downstream pipelines trigger off the DockerHub push, not the ECR push
- Add xqwatcher:docker_tag config fallback for XQWATCHER_DOCKER_DIGEST so pulumi up can run without the env var set (matches xqueue pattern)
- Remove env vars that duplicate xqwatcher.json ConfigMap values (POLL_TIME, REQUESTS_TIMEOUT, POLL_INTERVAL, FOLLOW_CLIENT_REDIRECTS); keep only LOGIN_POLL_INTERVAL and GRADER_* which are not in the ConfigMap
- Update PR description: image is on DockerHub, not GHCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Register the MIT 6.686x course-specific grader image in GRADER_PIPELINES so the meta pipeline creates a build-graders-mit-686x-image Concourse pipeline that tracks the graders-mit-686x repo and pushes to ECR at mitodl/graders-mit-686x. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sentry · 2026-03-20T18:46:14Z

src/ol_infrastructure/applications/xqwatcher/__main__.py

+if not docker_image_digest:
+    msg = "Either XQWATCHER_DOCKER_DIGEST env var or xqwatcher:docker_tag config must be set"  # noqa: E501
+    raise ValueError(msg)
+docker_image_ref = f"mitodl/xqueue-watcher@{docker_image_digest}"


Bug: The code incorrectly uses @ to build the Docker image reference, which is only for digests. If the xqwatcher:docker_tag config provides a tag, the deployment will fail.
Severity: HIGH

Suggested Fix

Modify the image reference construction to use a colon (:) instead of an at-symbol (@). This will correctly handle both tags and digests, as Docker's syntax image:tag@digest prioritizes the digest if both are present. The code should be changed to docker_image_ref = f"mitodl/xqueue-watcher:{docker_image_digest}". This aligns with the behavior of other applications like xqueue.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/ol_infrastructure/applications/xqwatcher/__main__.py#L75 Potential issue: The code constructs a Docker image reference by hardcoding the `@` symbol, which is reserved for image digests. However, the value for the image identifier can be sourced from a Pulumi configuration named `xqwatcher:docker_tag`, which implies a tag can be used. If a tag (e.g., `latest`) is provided through this configuration, the resulting image reference, such as `mitodl/xqueue-watcher@latest`, will be syntactically invalid. This will cause Kubernetes to fail when pulling the container image, preventing the application pod from starting.
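One way to accept both kinds of input is to pick the separator from the value itself. This is a hedged sketch, not the fix adopted in the PR: `image_ref` is a hypothetical helper, and it assumes digests always arrive with a `sha256:` prefix. Note that joining a `sha256:...` value with a colon would not form a valid reference either, because a tag cannot contain a colon.

```python
IMAGE_REPO = "mitodl/xqueue-watcher"


def image_ref(value: str) -> str:
    """Build an image reference from either a digest or a tag."""
    if not value:
        msg = (
            "Either XQWATCHER_DOCKER_DIGEST env var or "
            "xqwatcher:docker_tag config must be set"
        )
        raise ValueError(msg)
    # Digests use "@" (repo@sha256:...); tags use ":" (repo:tag).
    separator = "@" if value.startswith("sha256:") else ":"
    return f"{IMAGE_REPO}{separator}{value}"


pinned = image_ref("sha256:" + "0" * 64)
tagged = image_ref("latest")
```

The all-zero digest is a placeholder used only to exercise the digest branch.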

Add the edxorg-686x queue to the mitxonline production xqwatcher stack using the ContainerGrader handler, replacing the legacy JailedGrader configuration in confd_json. This is in preparation for deployment of the xqueue-watcher changes in mitodl/xqueue-watcher#14. The memory limit is set to 1Gi (vs 512Mi for 600x) to accommodate the torch dependency used by the mnist problem set graders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add an "edxorg" entry to the xqueue_servers.json Vault template so that queues using SERVER_REF "edxorg" resolve credentials for https://xqueue.edx.org. The template variables edxorg_xqueue_username and edxorg_xqueue_password must be added to the existing edx-xqueue Vault KV secret. Update the queue config loop to use setdefault so that queues can declare their own SERVER_REF in the Pulumi stack config rather than always being assigned "default". Set SERVER_REF: edxorg on the edxorg-686x queue in the mitxonline production stack config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
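The `setdefault` change can be sketched as follows. This is an illustration under assumptions: the shape of `xqwatcher:queues` (a mapping of queue name to config) is guessed, `inject_server_ref` is a hypothetical helper, and `example-queue` is an invented entry.

```python
import copy


def inject_server_ref(queues: dict, default_ref: str = "default") -> dict:
    """Give each queue a SERVER_REF unless it already declares one."""
    resolved = copy.deepcopy(queues)  # leave the stack config untouched
    for queue_config in resolved.values():
        # setdefault only writes the key when it is missing.
        queue_config.setdefault("SERVER_REF", default_ref)
    return resolved


queues = {
    "edxorg-686x": {"SERVER_REF": "edxorg"},
    "example-queue": {},  # hypothetical queue with no explicit SERVER_REF
}
resolved = inject_server_ref(queues)
```

A plain assignment (`queue_config["SERVER_REF"] = "default"`) would overwrite the `edxorg` value, which is the behavior this commit moves away from.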

blarghmatey and others added 5 commits March 11, 2026 12:22

feat(ol_types): add xqwatcher to Services and Application enums

de52643

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5613460

for more information, see https://pre-commit.ci

blarghmatey requested a review from Copilot March 18, 2026 18:50

Copilot AI reviewed Mar 18, 2026

blarghmatey and others added 3 commits March 18, 2026 15:35

chore: Use xqwatcher image from dockerhub pull-through cache

3b62704

sentry bot reviewed Mar 18, 2026

src/ol_infrastructure/components/applications/eks.py (resolved)

blarghmatey and others added 4 commits March 18, 2026 16:44

sentry bot reviewed Mar 18, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (resolved)

blarghmatey and others added 11 commits March 18, 2026 17:22

sentry bot reviewed Mar 19, 2026

src/ol_concourse/lib/containers.py (resolved)

blarghmatey and others added 2 commits March 19, 2026 15:48

blarghmatey and others added 3 commits March 19, 2026 15:58

config: Update MITx CI watcher config for use on K8s

0f8f7f5

sentry bot reviewed Mar 19, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (outdated, resolved)

blarghmatey added 2 commits March 19, 2026 17:03

config: Get grader path to strip erroneous prefix

5e38539

fix: Set proper grader root for dockerized graders

23298ed

sentry bot reviewed Mar 19, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (resolved)

fix: Don't strip path components either

3392b9f

sentry bot reviewed Mar 19, 2026

src/ol_infrastructure/lib/ol_types.py (resolved)

blarghmatey and others added 2 commits March 20, 2026 11:28

fix: Update mitx CI watcher password to match xqueue

a4fc006

sentry bot reviewed Mar 20, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (resolved)

blarghmatey requested a review from Copilot March 20, 2026 17:00

Copilot started reviewing on behalf of blarghmatey March 20, 2026 17:01

Copilot AI reviewed Mar 20, 2026

blarghmatey and others added 2 commits March 20, 2026 13:22

sentry bot reviewed Mar 20, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (resolved)

blarghmatey and others added 2 commits March 20, 2026 14:06

Delete .github/codeql/codeql-config.yml

d1f4d77

sentry bot reviewed Mar 20, 2026

blarghmatey and others added 2 commits March 20, 2026 16:04
