feat(xqwatcher): migrate from EC2 ASG to Kubernetes Deployment #4287
blarghmatey wants to merge 40 commits into main
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add read access to secret-DEPLOYMENT/edx-xqueue so the xqwatcher service can retrieve the xqueue server URL and authentication password needed by the ContainerGrader handler config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Completely rewrite the xqwatcher Pulumi stack to deploy on Kubernetes instead of EC2 Auto Scaling Groups with AppArmor/codejail.
Changes:
- Replace IAM instance profile + Vault AWS auth with OLEKSAuthBinding (IRSA + Vault K8s auth backend)
- Add OLVaultK8SSecret to sync grader handler config from Vault KV to a Kubernetes Secret via the Vault Secrets Operator CRD
- Add a ConfigMap for base poll settings and structured JSON logging to stdout (no log rotation in containers)
- Add RBAC Role + RoleBinding granting the xqwatcher service account permission to create/delete Kubernetes Jobs and read pod logs, required by ContainerGrader's kubernetes backend
- Create a Kubernetes Deployment with:
  - ghcr.io/mitodl/xqueue-watcher image
  - Security context (non-root, drop ALL capabilities)
  - Resource requests + memory limit
  - Liveness probe via python -c import xqueue_watcher
  - Topology spread for HA across nodes
  - Vault grader config + base config mounted into /xqwatcher/conf.d/
- Preserve vault.kv.SecretV2 write so grader config remains managed in Pulumi
- Export k8s_deployment_name and k8s_namespace
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
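For readers unfamiliar with the RBAC grant this commit describes, the following is a rough sketch of what such a Role and RoleBinding could contain, written as plain Python dicts rather than the Pulumi resources the stack uses. The helper name `grader_role_manifests`, the role naming scheme, and the `get`/`list`/`watch` verbs are assumptions for illustration; only "create/delete Jobs and read pod logs" comes from the commit message.

```python
def grader_role_manifests(namespace: str, service_account: str) -> list[dict]:
    """Build a Role and RoleBinding letting a service account manage grading Jobs."""
    role_name = f"{service_account}-grader"  # hypothetical naming scheme
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            # Create and clean up grading Jobs (get/list/watch are assumed extras).
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "delete", "get", "list", "watch"],
            },
            # Locate the pods a Job spawned and read their logs.
            {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }
    return [role, binding]
```

Because a Role is namespaced, the grant only covers Jobs created in the same namespace as the Role, which is why a later commit in this PR pins the grader namespace explicitly.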
Remove EC2-specific settings (consul:address, auto_scale, instance_type) and add Kubernetes-specific settings for all stacks:
- xqwatcher:cluster: EKS cluster name (residential or applications)
- xqwatcher:namespace: target Kubernetes namespace
- xqwatcher:min_replicas: minimum pod count (maps from auto_scale.desired)
- xqwatcher:max_replicas: maximum pod count (maps from auto_scale.max)
- xqwatcher:docker_tag: container image tag (default: latest)
Cluster assignments:
- mitx, mitx-staging → residential cluster
- mitxonline → applications cluster
Namespace assignments follow xqueue convention:
- mitx → mitx-openedx
- mitxonline → mitxonline-openedx
- mitx-staging → mitx-staging-openedx
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
for more information, see https://pre-commit.ci
Pull request overview
Migrates the xqueue-watcher (xqwatcher) infrastructure in ol-infrastructure from an EC2/ASG-based deployment to a Kubernetes Deployment on EKS, aligning with the ContainerGrader-based runtime introduced in the application repo.
Changes:
- Adds `xqwatcher` to shared enum types to support consistent labeling.
- Updates Vault policy to allow reading xqueue server credentials.
- Replaces the xqwatcher EC2 stack with Kubernetes resources (Vault auth binding + VSO-synced secret + ConfigMap + RBAC + Deployment) and updates stack configs accordingly.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to Services/Application enums for consistent labels. |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Extends Vault policy to read xqueue server secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Full rewrite: provisions Vault+IRSA binding, VSO secret sync, ConfigMap, RBAC, and a Deployment for xqueue-watcher. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config to K8s-focused settings (cluster/namespace/replicas/docker tag). |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Same as above for Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Same as above for CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates config for residential mitx QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates config for residential mitx Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates config for residential mitx CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates config for mitx-staging QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates config for mitx-staging Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates config for mitx-staging CI. |
- Add create_irsa_service_account flag to OLEKSAuthBinding to optionally create the K8s ServiceAccount with IRSA annotation; use it in xqwatcher to fix 'serviceaccount not found' pod error
- Add XQWATCHER_* env vars to Deployment matching env_settings.py; expose http_basic_auth from Vault-synced secret via VSO template
- Fix image reference from ghcr.io to mitodl/ (DockerHub)
- Change imagePullPolicy to Always for mutable 'latest' tag
- Rename XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG
- Remove unused network_stack StackReference
- Remove dead xqwatcher:target_vpc config key from all 9 stacks
- Remove unimplemented xqwatcher:max_replicas from all 9 stacks
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The manager CLI only accepts -d/--config_root; it auto-discovers xqwatcher.json and logging.json from that directory. Remove the non-existent --config and --logging-config flags. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pods ContainerGrader calls k8s_config.load_incluster_config() which reads the service account token from the projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token. The xqwatcher ServiceAccount has automount_service_account_token=False (secure default), so the PodSpec must explicitly opt in to have the token mounted, otherwise all Kubernetes Job API calls will fail with a ConfigException. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e tag When the Concourse pipeline populates XQWATCHER_DOCKER_DIGEST, build the image ref as mitodl/xqueue-watcher@sha256:... (immutable digest) so Kubernetes always pulls exactly the image that was built and tested. Fall back to :tag from stack config only when the digest is unavailable (e.g. manual deploys). imagePullPolicy: Always is retained so new digests are always pulled on rollout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The uv virtualenv bin directory is not on PATH in the container, so the 'xqueue-watcher' console script can't be found directly. Use 'uv run xqueue-watcher' to invoke it through uv's environment, which correctly resolves the script installed in the project virtualenv. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
uv run without --no-sync attempts to sync the virtualenv at startup, which fails in the container (no write access / network). Use --no-sync to run the already-installed entrypoint as-is. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
configure_from_directory(path) reads xqwatcher.json and logging.json directly from path, then globs path/conf.d/*.json for queue watcher configs. We were passing -d /xqwatcher/conf.d and mounting everything flat there, so the manager looked for watchers at /xqwatcher/conf.d/conf.d/*.json (not found).
Fix: pass -d /xqwatcher and restructure mounts:
/xqwatcher/xqwatcher.json <- manager config (ConfigMap)
/xqwatcher/logging.json <- logging config (ConfigMap)
/xqwatcher/conf.d/grader_config.json <- queue watchers (Vault secret)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
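The lookup order described in this commit can be sketched as follows. This is an illustration of the directory layout only, not the actual `configure_from_directory` implementation; the function name `discover_config`, the returned dict shape, and the choice to merge watcher files with `dict.update` are all assumptions.

```python
import json
from pathlib import Path


def discover_config(config_root: str) -> dict:
    """Mimic the layout described above: two top-level files plus conf.d/*.json."""
    root = Path(config_root)
    # Manager and logging settings sit directly under the config root.
    manager = json.loads((root / "xqwatcher.json").read_text())
    logging_cfg = json.loads((root / "logging.json").read_text())
    # Queue watcher configs are globbed from the conf.d subdirectory.
    watchers: dict = {}
    for path in sorted((root / "conf.d").glob("*.json")):
        watchers.update(json.loads(path.read_text()))
    return {"manager": manager, "logging": logging_cfg, "watchers": watchers}
```

With this layout, passing `/xqwatcher/conf.d` as the root makes the glob look in `/xqwatcher/conf.d/conf.d`, which matches the failure the commit fixes.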
VSO renders secret values via Go templates: {{ .Secrets.confd_json }}.
When confd_json is stored as a nested object, VSO renders a Go map
literal (map[...]) rather than valid JSON, causing a JSONDecodeError
at startup. Pre-serialize confd_json to a JSON string so the template
renders parseable JSON.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
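A minimal sketch of the pre-serialization step, assuming the Vault payload is built as a Python dict before being written. The helper name `vault_secret_payload` and the example queue name are hypothetical.

```python
import json


def vault_secret_payload(confd: dict) -> dict:
    """Store confd_json as a JSON string so a text template emits valid JSON."""
    # A nested object would be rendered by the template as a Go map literal;
    # a string value is substituted verbatim.
    return {"confd_json": json.dumps(confd)}


payload = vault_secret_payload({"example-queue": {"HANDLERS": []}})
# The value substituted for {{ .Secrets.confd_json }} is now a plain string
# that the watcher can parse at startup.
rendered = payload["confd_json"]
parsed = json.loads(rendered)
```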
…llback Match the keycloak pattern: require the digest env var so the image is always pinned to an immutable digest. Remove the mutable :latest tag fallback that allowed manual pulumi-up runs to silently deploy an uncontrolled image. Also remove the unused xqwatcher:docker_tag config key from all stack YAML files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gh cache When the SOPS secret's confd_json contains a ContainerGrader handler whose KWARGS include an 'image' key, rewrite that value through cached_image_uri() before writing to Vault. This means the SOPS secret stores a plain DockerHub reference (e.g. mitodl/mit-600x-grader:latest) and Pulumi transforms it to the ECR pull-through cache URI at deploy time, keeping grading Jobs free from DockerHub rate limits. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CodeQL 'Analyze (actions)' job (exit code 32) fails because the extractor finds .github/workflows/*.yml and .github/actions/**/*.yml but cannot process any of them. This is a known extractor-level issue with CodeQL 2.24.x on Erk agent workflow patterns. Excluding .github from CodeQL's path analysis silences the fatal error while leaving Python and JavaScript/TypeScript scans unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add src/ol_concourse/pipelines/open_edx/grader_images/ with three pipeline definitions for building and publishing containerized course grader images to private ECR.
base_image_pipeline.py: Builds grader_support/Dockerfile.base from the xqueue-watcher repo and pushes to both DockerHub (mitodl/xqueue-watcher-grader-base, public) and ECR (610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/xqueue-watcher-grader-base, private). Triggered by changes to grader_support/ in the xqueue-watcher repo. The ECR push is the trigger source for downstream per-grader build pipelines.
build_pipeline.py: GraderPipelineConfig dataclass and grader_image_pipeline() factory for per-grader-repo build pipelines. Triggered by new commits to the grader repo OR a new base image digest in ECR. The Docker build receives GRADER_BASE_IMAGE=repo@sha256:... resolved at runtime via a sh wrapper around oci-build-task's build script (the only way to inject a file-derived BUILD_ARG in Concourse; params are static strings). Pushes to private ECR only. GRADER_PIPELINES list seeded with graders-mit-600x.
meta.py: Self-updating meta pipeline that creates and maintains the base image pipeline and one build pipeline per GRADER_PIPELINES entry. Triggered by changes to the grader_images/ pipeline code in ol-infrastructure.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ines
- base_image_pipeline: use chore/migrate-to-uv-and-k8s-container-grader branch of xqueue-watcher (where Dockerfile.base updates live)
- build_pipeline: track feat/containerized-grader for graders-mit-600x
- Fix E501 in both files: split long strings to stay within 88-char limit
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CONTEXT was grader_support/ which caused the COPY grader_support/ instruction in Dockerfile.base to fail (no nested grader_support/ inside the context). Use the repo root as CONTEXT so the COPY can locate the directory relative to it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…images
Add ensure_ecr_task() helper to ol_concourse/lib/containers.py (mirrors the pattern used in the dagster docker_pulumi_pipeline). The task runs the AWS CLI to check for the ECR repository and creates it if missing, so the first pipeline run does not fail on a missing registry.
Apply to both grader image pipelines:
- base_image_pipeline: ensures mitodl/xqueue-watcher-grader-base exists before pushing to ECR
- build_pipeline: ensures the per-grader ECR repo (config.ecr_repo_name) exists before pushing the course grader image
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When ecr_region is set, the registry-image resource automatically
constructs the full ECR URI as {account}.dkr.ecr.{region}.amazonaws.com/{repository}.
Passing the full URI in image_repository caused the hostname to be doubled
in API calls, resulting in NAME_UNKNOWN errors.
- Remove ecr_image_uri property from GraderPipelineConfig
- Fix grader_base_ecr_repo default to use repo-name-only string
- Change registry_image(image_repository=config.ecr_image_uri) to
registry_image(image_repository=config.ecr_repo_name)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
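The doubled hostname is easy to reproduce with string formatting alone. The sketch below only imitates the URI construction the commit describes; `registry_image_uri` is a hypothetical stand-in, not the registry-image resource's code.

```python
def registry_image_uri(repository: str, account: str, region: str) -> str:
    """Prepend the ECR hostname to a repository name, as described above."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}"


ACCOUNT, REGION = "610119931565", "us-east-1"

# Correct: pass only the repository name.
good = registry_image_uri("mitodl/xqueue-watcher-grader-base", ACCOUNT, REGION)

# Incorrect: passing an already-complete URI repeats the hostname.
doubled = registry_image_uri(good, ACCOUNT, REGION)
```

Here `doubled` contains the registry hostname twice, which is the shape of path that produced the NAME_UNKNOWN errors.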
The grader-images-pipeline-code git resource was tracking 'main', but the pipeline files don't exist on main yet. Switch to the feature branch until this work is merged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The graders-mit-600x repository is private. Switch the git resource from an HTTPS git_repo to an ssh_git_repo so Concourse can clone it. The SSH private key is read from Vault at ((github.ssh_private_key)).
- Import ssh_git_repo instead of git_repo
- Add github_private_key field to GraderPipelineConfig (defaults to ((github.ssh_private_key)))
- Update grader_repo_url in GRADER_PIPELINES to use SSH form (git@github.com:mitodl/graders-mit-600x)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
infrastructure/github has no generic SSH key. The correct key for cloning private mitodl repos from the infrastructure Concourse team is odlbot_private_ssh_key in infrastructure/open_api_clients. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch the grader-base-image registry-image resource from ECR to DockerHub (mitodl/xqueue-watcher-grader-base). The base image pipeline pushes to both DockerHub and ECR; DockerHub is public and simpler to poll as a trigger without needing AWS credentials.
- Rename GraderPipelineConfig.grader_base_ecr_repo to grader_base_dockerhub_repo
- Remove ecr_region from the base image resource
- Add DockerHub credentials ((dockerhub.username/password))
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three related fixes for the containerized grader deployment:
1. Add XQWATCHER_GRADER_NAMESPACE env var set to the deployment namespace. Without this, ContainerGrader defaults to spawning Jobs in 'default', breaking the RBAC Role binding and landing Jobs in the wrong namespace.
2. Add XQWATCHER_GRADER_BACKEND, CPU_LIMIT, MEMORY_LIMIT, TIMEOUT env vars driven by new stack config keys (grader_namespace, grader_cpu_limit, grader_memory_limit, grader_timeout). These set deployment-wide defaults so individual conf.d queue files don't need to repeat them.
3. Fix the DockerHub pull-through cache rewrite to skip images that already have a registry hostname (e.g. private ECR URIs, ghcr.io). Previously cached_image_uri() was called unconditionally, which would mangle a full ECR URI into an invalid doubled-host path. Images are now only rewritten if the first path component contains no '.'.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
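The hostname check in the third fix can be sketched as below. `cached_image_uri_stub` is a hypothetical stand-in for the repo's `cached_image_uri()`, and its prefix is made up; only the "first path component contains no '.'" rule comes from the commit message.

```python
def cached_image_uri_stub(image: str) -> str:
    """Hypothetical stand-in for cached_image_uri(); the real prefix differs."""
    return f"pull-through.example.invalid/dockerhub/{image}"


def rewrite_if_dockerhub(image: str) -> str:
    """Rewrite only references whose first path component has no '.'."""
    first_component = image.split("/", 1)[0]
    if "." in first_component:
        # Looks like a registry hostname (ECR, ghcr.io, ...): leave untouched.
        return image
    return cached_image_uri_stub(image)


dockerhub = rewrite_if_dockerhub("mitodl/mit-600x-grader:latest")
ecr = rewrite_if_dockerhub(
    "610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/graders-mit-686x"
)
ghcr = rewrite_if_dockerhub("ghcr.io/mitodl/xqueue-watcher")
```

Under this literal reading of the rule, a single-component reference with a dotted tag (for example `python:3.12`) would also be skipped; the commit message does not say how that case is handled.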
…onfig + SERVER_REF
Queue configs (CONNECTIONS, HANDLERS, ContainerGrader KWARGS) are now
stored as plaintext in Pulumi stack YAML files under xqwatcher:queues.
The xqueue server URL is stored under xqwatcher:xqueue_server_url.
SERVER_REF is injected at deploy time so xqueue-watcher resolves
credentials at runtime from xqueue_servers.json, which is mounted from
a Vault-synced Kubernetes Secret. The secret is sourced from the same
secret-{env_prefix}/edx-xqueue Vault KV path already used by the xqueue
and edxapp deployments (xqwatcher_password field), eliminating the
separate xqwatcher-specific KV mount and SOPS secrets files.
Changes:
- __main__.py: remove SOPS read, vault.kv.SecretV2, vault_mount_stack
StackReference, and XQWATCHER_HTTP_BASIC_AUTH env var; read queues
config from Pulumi config; inject SERVER_REF into each queue entry;
move grader_config.json into ConfigMap; add xqueue_servers.json
Vault-synced secret from secret-{env_prefix}/edx-xqueue; update
Deployment volumes/mounts accordingly
- xqwatcher_server_policy.hcl: remove secret-xqwatcher/* path
- All 9 stack YAML files: add xqueue_server_url and queues config
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add AWS_DEFAULT_REGION=us-east-1 to ensure_ecr_task params so the AWS CLI knows which region to use without relying on worker defaults
- Remove spurious service_account_name kwarg from OLVaultK8SResourcesConfig instantiation in OLEKSAuthBinding; the field does not exist on the model and the name is derived internally from application_name
- Fix liveness probe to use 'uv run --no-sync python' instead of bare 'python', which would fail with ModuleNotFoundError because xqueue_watcher is only available inside the uv virtual environment
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Migrates the xqueue-watcher infrastructure from an EC2/ASG deployment to Kubernetes, updating Vault access and stack configuration, and adding Concourse pipelines to build/publish grader container images used by the new ContainerGrader flow.
Changes:
- Add `xqwatcher` to shared enums used for labeling.
- Replace the `xqwatcher` stack’s EC2 resources with Kubernetes resources (Deployment, RBAC, ConfigMap, Vault Secrets Operator integration).
- Add Concourse pipelines to build a grader base image and course-specific grader images, and update stack YAML configs for the new K8s-based deployment.
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to enums used for consistent label generation. |
| src/ol_infrastructure/components/applications/eks.py | Extends OLEKSAuthBinding to optionally create IRSA ServiceAccount(s). |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Adjusts Vault policy to allow reading xqueue credentials from the shared secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Replaces EC2-based deployment with K8s Deployment + RBAC + ConfigMap + VSO-managed secrets. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_concourse/pipelines/open_edx/grader_images/meta.py | Adds a self-updating meta pipeline that creates/updates grader image pipelines. |
| src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py | Adds reusable pipeline generator for course-specific grader images. |
| src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py | Adds pipeline generator for building/publishing the shared grader base image. |
| src/ol_concourse/pipelines/open_edx/grader_images/__init__.py | Initializes the new grader_images pipeline package. |
| src/ol_concourse/lib/containers.py | Adds a reusable task step to ensure an ECR repository exists before pushing. |
| src/bridge/secrets/xqwatcher/secrets.mitx.ci.yaml | Updates encrypted xqwatcher grading configuration secrets for the new backend. |
| .github/codeql/codeql-config.yml | Adds CodeQL config to exclude .github from actions extraction failures. |
Replace the old Packer-based xqwatcher pipeline with a Docker+Pulumi
pipeline that mirrors the xqueue pattern:
- Watches mitodl/xqueue-watcher (main) for new commits
- Builds and pushes the Docker image to DockerHub as
mitodl/xqueue-watcher:{release}
- Passes the built image digest as XQWATCHER_DOCKER_DIGEST to each
Pulumi stack so the Deployment rolls to the exact image SHA
Update meta.py to generate docker-pulumi-xqwatcher-{release} pipelines
instead of the retired packer-pulumi ones.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Unpin grader_images meta pipeline from feature branch; track main
- Unpin xqueue-watcher base image source from dev branch; track main
- Unpin graders-mit-600x grader repo from feature branch; track main
- Fix base_image_pipeline.py docstring: downstream pipelines trigger off the DockerHub push, not the ECR push
- Add xqwatcher:docker_tag config fallback for XQWATCHER_DOCKER_DIGEST so pulumi up can run without the env var set (matches xqueue pattern)
- Remove env vars that duplicate xqwatcher.json ConfigMap values (POLL_TIME, REQUESTS_TIMEOUT, POLL_INTERVAL, FOLLOW_CLIENT_REDIRECTS); keep only LOGIN_POLL_INTERVAL and GRADER_* which are not in the ConfigMap
- Update PR description: image is on DockerHub, not GHCR
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Register the MIT 6.686x course-specific grader image in GRADER_PIPELINES so the meta pipeline creates a build-graders-mit-686x-image Concourse pipeline that tracks the graders-mit-686x repo and pushes to ECR at mitodl/graders-mit-686x. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    if not docker_image_digest:
        msg = "Either XQWATCHER_DOCKER_DIGEST env var or xqwatcher:docker_tag config must be set"  # noqa: E501
        raise ValueError(msg)
    docker_image_ref = f"mitodl/xqueue-watcher@{docker_image_digest}"
Bug: The code incorrectly uses @ to build the Docker image reference, which is only for digests. If the xqwatcher:docker_tag config provides a tag, the deployment will fail.
Severity: HIGH
Suggested Fix
Modify the image reference construction to use a colon (:) instead of an at-symbol (@). This will correctly handle both tags and digests, as Docker's syntax image:tag@digest prioritizes the digest if both are present. The code should be changed to docker_image_ref = f"mitodl/xqueue-watcher:{docker_image_digest}". This aligns with the behavior of other applications like xqueue.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/ol_infrastructure/applications/xqwatcher/__main__.py#L75
Potential issue: The code constructs a Docker image reference by hardcoding the `@`
symbol, which is reserved for image digests. However, the value for the image identifier
can be sourced from a Pulumi configuration named `xqwatcher:docker_tag`, which implies a
tag can be used. If a tag (e.g., `latest`) is provided through this configuration, the
resulting image reference, such as `mitodl/xqueue-watcher@latest`, will be syntactically
invalid. This will cause Kubernetes to fail when pulling the container image, preventing
the application pod from starting.
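One way to handle both inputs is to pick the separator from the value itself. Replacing `@` with `:` unconditionally would not be enough, because a digest after a colon (`repo:sha256:...`) is not a valid tag. The sketch below is an assumption about how this could be done, not the fix adopted in the PR; `build_image_ref` is a hypothetical helper.

```python
def build_image_ref(repository: str, identifier: str) -> str:
    """Join a repository with either a digest (via '@') or a tag (via ':')."""
    if not identifier:
        msg = "An image digest or tag must be provided"
        raise ValueError(msg)
    if identifier.startswith("sha256:"):
        # Immutable digest, e.g. as supplied by the pipeline.
        return f"{repository}@{identifier}"
    # Anything else is treated as a mutable tag from stack config.
    return f"{repository}:{identifier}"


digest_ref = build_image_ref("mitodl/xqueue-watcher", "sha256:" + "a" * 64)
tag_ref = build_image_ref("mitodl/xqueue-watcher", "latest")
```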
Add the edxorg-686x queue to the mitxonline production xqwatcher stack using the ContainerGrader handler, replacing the legacy JailedGrader configuration in confd_json. This is in preparation for deployment of the xqueue-watcher changes in mitodl/xqueue-watcher#14. The memory limit is set to 1Gi (vs 512Mi for 600x) to accommodate the torch dependency used by the mnist problem set graders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an "edxorg" entry to the xqueue_servers.json Vault template so that queues using SERVER_REF "edxorg" resolve credentials for https://xqueue.edx.org. The template variables edxorg_xqueue_username and edxorg_xqueue_password must be added to the existing edx-xqueue Vault KV secret. Update the queue config loop to use setdefault so that queues can declare their own SERVER_REF in the Pulumi stack config rather than always being assigned "default". Set SERVER_REF: edxorg on the edxorg-686x queue in the mitxonline production stack config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
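The `setdefault` behaviour described above can be sketched as follows, assuming the queues config is a dict keyed by queue name. The helper name `inject_server_ref` and the second queue name are hypothetical.

```python
import copy


def inject_server_ref(queues: dict, default_ref: str = "default") -> dict:
    """Fill in SERVER_REF only for queues that do not declare their own."""
    result = copy.deepcopy(queues)  # avoid mutating the stack config in place
    for queue_config in result.values():
        queue_config.setdefault("SERVER_REF", default_ref)
    return result


queues = {
    "edxorg-686x": {"SERVER_REF": "edxorg"},  # declared in stack config
    "example-queue": {},  # hypothetical queue with no explicit reference
}
resolved = inject_server_ref(queues)
```

Plain assignment (`queue_config["SERVER_REF"] = "default"`) would overwrite the `edxorg` value, which is the behaviour this commit moves away from.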
Summary
Migrates xqueue-watcher infrastructure from EC2 Auto Scaling Groups with AppArmor/codejail sandboxing to a Kubernetes Deployment using container-based grading. This is the infrastructure companion to mitodl/xqueue-watcher#14 which implements the ContainerGrader backend.
Changes
- `src/ol_infrastructure/lib/ol_types.py`: Adds `xqwatcher` to both `Services` and `Application` enums for consistent K8s label generation.
- `src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl`: Adds read access to `secret-DEPLOYMENT/edx-xqueue` so the grader handler config (stored in Vault) can embed the xqueue server URL and authentication password.
- `src/ol_infrastructure/applications/xqwatcher/__main__.py`: Complete rewrite replacing EC2 resources with Kubernetes resources:
  - `OLEKSAuthBinding` (IRSA + Vault K8s auth)
  - `Deployment`
  - `mitodl/xqueue-watcher` (DockerHub) container image
  - `ConfigMap` + `OLVaultK8SSecret` CRD
New Kubernetes resources created:
- `OLEKSAuthBinding`: IRSA role + Vault Kubernetes auth backend role
- `OLVaultK8SSecret`: syncs grader handler config from Vault KV to a K8s Secret via Vault Secrets Operator
- `ConfigMap`: base poll settings (`xqwatcher.json`) and stdout-only structured logging (`logging.json`)
- `Role` + `RoleBinding`: grants xqwatcher pods permission to create/delete Jobs and read pod logs (required by ContainerGrader's Kubernetes backend)
- `Deployment`: runs xqueue-watcher with non-root security context, resource limits, liveness probe, and topology spread for HA
Stack configs (9 files)
Removed EC2-specific keys (`consul:address`, `auto_scale`, `instance_type`) and added K8s-specific keys:
- `xqwatcher:cluster`: EKS cluster name
- `xqwatcher:namespace`: Kubernetes namespace
- `xqwatcher:min_replicas` / `max_replicas`
- `xqwatcher:docker_tag`
Deployment Prerequisites
Before applying this stack:
- `mitodl/xqueue-watcher` image pushed to DockerHub (from PR #14)
- `secret-xqwatcher/{env}-grader-config` with `confd_json` containing a ContainerGrader handler config
Related PRs