feat(compute): add EC2 fleet compute strategy#32
MichaelWalker-git wants to merge 7 commits into main
Add a third compute backend (EC2 fleet with SSM Run Command) alongside the existing AgentCore and ECS strategies. This provides maximum flexibility with no image size limits, configurable instance types (including GPU), and full control over the compute environment.

New files:
- ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for instance tracking and SSM RunShellScript for task dispatch
- ec2-agent-fleet.ts: CDK construct with ASG, launch template, security group, S3 payload bucket, and IAM role
- ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage

Wiring:
- repo-config.ts: add 'ec2' to ComputeType, add instance_type field
- compute-strategy.ts: add EC2 SessionHandle variant and resolver case
- task-orchestrator.ts: add ec2Config prop with env vars and IAM grants
- orchestrate-task.ts: enable compute polling for EC2
- cancel-task.ts: add SSM CancelCommand for EC2 tasks
- task-api.ts: add ssm:CancelCommand permission for cancel Lambda
- agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
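The wiring for compute-strategy.ts (a new SessionHandle variant plus a resolver case) can be pictured roughly as below. This is only a sketch: the field names (`sessionId`, `taskArn`, `instanceId`, `commandId`), the `'agentcore'` and `'ecs'` literals, and the `makeStrategy` helper are assumptions for illustration, not the PR's actual definitions.

```typescript
// Hypothetical shapes; the real ComputeType, SessionHandle, and ComputeStrategy may differ.
type ComputeType = 'agentcore' | 'ecs' | 'ec2';

// Discriminated union: each backend carries the identifiers it needs to poll or cancel.
type SessionHandle =
  | { type: 'agentcore'; sessionId: string }
  | { type: 'ecs'; taskArn: string }
  | { type: 'ec2'; instanceId: string; commandId: string };

interface ComputeStrategy {
  readonly type: ComputeType;
  describe(handle: SessionHandle): string;
}

// Stand-in factory so the sketch is self-contained (assumption, not the PR's code).
function makeStrategy(type: ComputeType): ComputeStrategy {
  return {
    type,
    describe(handle: SessionHandle): string {
      switch (handle.type) {
        case 'agentcore':
          return `agentcore session ${handle.sessionId}`;
        case 'ecs':
          return `ecs task ${handle.taskArn}`;
        case 'ec2':
          return `ec2 command ${handle.commandId} on ${handle.instanceId}`;
      }
    },
  };
}

// Resolver with an exhaustiveness check: adding a new ComputeType without a case
// becomes a compile-time error at the `never` assignment.
function resolveComputeStrategy(computeType: ComputeType): ComputeStrategy {
  switch (computeType) {
    case 'agentcore':
    case 'ecs':
    case 'ec2':
      return makeStrategy(computeType);
    default: {
      const unreachable: never = computeType;
      throw new Error(`Unknown compute type: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment in the default branch is a common TypeScript idiom for keeping a union and its resolver in sync.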
Pull request overview
This PR adds a third compute backend (“ec2” fleet) alongside AgentCore and ECS, enabling task dispatch via SSM Run Command to tagged EC2 instances and adding the supporting CDK construct and runtime wiring.
Changes:
- Introduces `Ec2ComputeStrategy` (S3 payload upload + EC2 instance selection/tagging + SSM dispatch/poll/cancel).
- Adds `Ec2AgentFleet` CDK construct (ASG + instance role/SG + payload bucket + user data) and optional stack wiring.
- Extends config/types and handlers to support `compute_type: 'ec2'` and EC2 cancellation/polling.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| yarn.lock | Locks new AWS SDK client dependencies used by EC2/SSM/S3 integration. |
| cdk/package.json | Adds @aws-sdk/client-ec2, client-ssm, client-s3 dependencies. |
| cdk/test/handlers/shared/strategies/ec2-strategy.test.ts | Unit tests for EC2 strategy start/poll/stop behavior. |
| cdk/test/handlers/shared/compute-strategy.test.ts | Ensures compute_type: ec2 resolves to Ec2ComputeStrategy. |
| cdk/test/constructs/ec2-agent-fleet.test.ts | Asserts key resources/permissions created by Ec2AgentFleet. |
| cdk/src/stacks/agent.ts | Adds commented wiring to enable EC2 fleet backend. |
| cdk/src/handlers/shared/strategies/ec2-strategy.ts | Implements EC2 fleet compute strategy via S3 + EC2 tags + SSM. |
| cdk/src/handlers/shared/repo-config.ts | Extends ComputeType to include ec2; adds instance_type. |
| cdk/src/handlers/shared/orchestrator.ts | Propagates instance_type into BlueprintConfig. |
| cdk/src/handlers/shared/compute-strategy.ts | Adds EC2 handle type + resolver case. |
| cdk/src/handlers/orchestrate-task.ts | Stores EC2 compute metadata and enables compute polling for EC2. |
| cdk/src/handlers/cancel-task.ts | Adds SSM CancelCommand support for EC2-backed tasks. |
| cdk/src/constructs/task-orchestrator.ts | Adds EC2 env vars and IAM policies for orchestrator Lambda. |
| cdk/src/constructs/task-api.ts | Adds ssm:CancelCommand permission option for cancel-task Lambda. |
| cdk/src/constructs/ec2-agent-fleet.ts | New ASG-based fleet construct with IAM/SG/S3/log group/user data. |
| cdk/src/constructs/blueprint.ts | Allows blueprint compute.type to be ec2. |
- Remove unnecessary iam:PassRole from orchestrator (EC2 strategy never passes a role to any API)
- Simplify ec2FleetConfig in task-api to empty object (instanceRoleArn was unused)
- Use CDK Tags.of() for ASG fleet tag propagation instead of no-op user-data tagging; instances are now tagged at launch
- Fix missing AWS_REGION in boot script by deriving from IMDS
- Eliminate shell injection risk by reading all task data from S3 payload at runtime instead of interpolating into bash exports
- Add cleanup trap in boot script to always retag instance as idle on exit (success, error, or signal)
- Add try/catch rollback in startSession to retag instance as idle when SSM dispatch fails
- Generalize ECS-specific log messages in poll loop to be compute-backend-agnostic (uses strategy type label)
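To illustrate the shell-injection fix, here is a minimal sketch of building an SSM `SendCommand` input in which the only values interpolated into the script are a validated S3 bucket and key, while all task data stays in the payload file. The `buildDispatchInput` name, the validation patterns, the `/tmp/task-payload.json` path, and the `/opt/agent/run-task.sh` entry point are assumptions for illustration; the PR's boot script is not shown in this conversation.

```typescript
interface PayloadLocation {
  bucket: string;
  key: string;
}

// Subset of the SSM SendCommand request fields used here.
interface SendCommandInputSketch {
  DocumentName: string;
  InstanceIds: string[];
  Parameters: Record<string, string[]>;
}

// Conservative allow-lists (assumed): anything outside them is rejected rather than escaped.
const SAFE_BUCKET = /^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/;
const SAFE_KEY = /^[A-Za-z0-9\/_.-]+$/;

function buildDispatchInput(instanceId: string, payload: PayloadLocation): SendCommandInputSketch {
  if (!SAFE_BUCKET.test(payload.bucket) || !SAFE_KEY.test(payload.key)) {
    throw new Error('Payload location contains characters that are unsafe in a shell script');
  }
  const s3Uri = `s3://${payload.bucket}/${payload.key}`;
  return {
    DocumentName: 'AWS-RunShellScript',
    InstanceIds: [instanceId],
    Parameters: {
      commands: [
        'set -eu',
        // Only the validated S3 URI is interpolated; task fields are never placed in the script.
        `aws s3 cp '${s3Uri}' /tmp/task-payload.json`,
        // Hypothetical entry point that reads every task field from the downloaded JSON.
        '/opt/agent/run-task.sh /tmp/task-payload.json',
      ],
    },
  };
}
```

Rejecting unexpected characters outright is simpler to reason about than quoting, since the orchestrator controls how payload keys are generated.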
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.
- Fix malformed sed quoting in AWS_REGION derivation (ec2-strategy.ts)
- Remove unused blueprintConfig destructuring (ec2-strategy.ts)
- Scope EC2/SSM IAM permissions: condition ec2:CreateTags on fleet tag, scope ssm:SendCommand to fleet-tagged instances and AWS-RunShellScript document, separate DescribeInstances (requires resource '*')
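The IAM scoping described above might translate into a policy document along these lines. This is a sketch written as a plain object rather than CDK code; the tag key `agent-fleet`, its value `true`, and the `buildOrchestratorPolicy` helper are hypothetical, and the PR's real statements may be structured differently.

```typescript
// Hypothetical fleet tag; the PR's actual tag key and value are not shown in this conversation.
const FLEET_TAG_KEY = 'agent-fleet';
const FLEET_TAG_VALUE = 'true';

interface PolicyStatement {
  Sid: string;
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
  Condition?: { StringEquals: Record<string, string> };
}

function buildOrchestratorPolicy(region: string, accountId: string): { Version: string; Statement: PolicyStatement[] } {
  const instances = `arn:aws:ec2:${region}:${accountId}:instance/*`;
  return {
    Version: '2012-10-17',
    Statement: [
      // DescribeInstances does not support resource-level permissions, so it stays on '*'.
      { Sid: 'DescribeInstances', Effect: 'Allow', Action: ['ec2:DescribeInstances'], Resource: ['*'] },
      // Tag changes are allowed only on instances that already carry the fleet tag.
      {
        Sid: 'TagFleetInstances',
        Effect: 'Allow',
        Action: ['ec2:CreateTags'],
        Resource: [instances],
        Condition: { StringEquals: { [`ec2:ResourceTag/${FLEET_TAG_KEY}`]: FLEET_TAG_VALUE } },
      },
      // SendCommand is limited to fleet-tagged instances...
      {
        Sid: 'SendCommandToFleet',
        Effect: 'Allow',
        Action: ['ssm:SendCommand'],
        Resource: [instances],
        Condition: { StringEquals: { [`ssm:resourceTag/${FLEET_TAG_KEY}`]: FLEET_TAG_VALUE } },
      },
      // ...and to the single AWS-managed document the strategy uses.
      {
        Sid: 'SendCommandDocument',
        Effect: 'Allow',
        Action: ['ssm:SendCommand'],
        Resource: [`arn:aws:ssm:${region}::document/AWS-RunShellScript`],
      },
    ],
  };
}
```

`ssm:SendCommand` is authorized against both the target instances and the document, which is why the sketch uses two statements for it.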
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
1. TOCTOU race in instance selection: after tagging an instance as busy, re-describe to verify our task-id stuck. If another orchestrator won the race, try the next idle candidate instead of double-dispatching.
2. Heartbeat false-positive: EC2/ECS tasks invoke run_task() directly and may not send continuous heartbeats. Suppress sessionUnhealthy checks when compute-level crash detection (pollSession) is active, preventing premature task failure after ~6 minutes.
3. SSM Cancelling status: map to 'running' (transient) instead of 'failed' to avoid premature failure while cancel propagates.
4. Fix babel parse errors in test mocks (remove `: unknown` annotations from jest.mock factory callbacks).
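The tag-then-verify claim from item 1 can be sketched as follows. The `FleetApi` interface and its method names are hypothetical stand-ins for the EC2 `DescribeInstances` and `CreateTags` calls; the real strategy's structure may differ. Because tag writes are last-writer-wins, this approach should be read as narrowing the race window rather than acting as a strict lock.

```typescript
// Hypothetical abstraction over the EC2 calls so the sketch runs without the AWS SDK.
interface FleetApi {
  listIdleInstanceIds(): Promise<string[]>;
  tagBusy(instanceId: string, taskId: string): Promise<void>;
  readTaskIdTag(instanceId: string): Promise<string | undefined>;
}

// Try each idle candidate in turn: tag it, then re-read the tag to confirm the claim stuck.
async function claimInstance(api: FleetApi, taskId: string): Promise<string | undefined> {
  const candidates = await api.listIdleInstanceIds();
  for (const instanceId of candidates) {
    await api.tagBusy(instanceId, taskId);
    const owner = await api.readTaskIdTag(instanceId);
    if (owner === taskId) {
      return instanceId; // our task-id is the one recorded, so we own this instance
    }
    // Another orchestrator overwrote the tag. Leave its tags alone and try the next candidate.
  }
  return undefined; // no instance could be claimed
}
```

Returning `undefined` instead of throwing leaves it to the caller to decide whether an exhausted fleet is an error or a reason to retry later.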
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
1. Add rollback on verify failure: if DescribeInstances throws during the tag-then-verify claim, roll back the busy/task-id tags so the instance isn't stuck.
2. Use docker container prune instead of docker system prune in cleanup trap to preserve cached images and avoid re-pulling on next task.
3. Add ecr:BatchCheckLayerAvailability to instance role ECR permissions (required for docker pull from ECR).
4. InvocationDoesNotExist now rethrows instead of returning failed, letting the orchestrator's consecutiveComputePollFailures counter handle transient propagation delays (fails after 3 consecutive).
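The polling behavior from item 4, together with the earlier change that treats the SSM `Cancelling` status as transient, could look roughly like this. The `SessionState` values other than `'running'` and `'failed'`, the treatment of `Cancelled`, and the `PollFailureCounter` class are assumptions for illustration; only the threshold of 3 consecutive failures comes from the commit message.

```typescript
// Hypothetical session states; only 'running' and 'failed' are named in the PR conversation.
type SessionState = 'running' | 'succeeded' | 'failed' | 'cancelled';

// Map a GetCommandInvocation status string to a session state.
function mapSsmStatus(status: string): SessionState {
  switch (status) {
    case 'Pending':
    case 'InProgress':
    case 'Delayed':
    case 'Cancelling': // transient while the cancel propagates
      return 'running';
    case 'Success':
      return 'succeeded';
    case 'Cancelled': // assumed mapping
      return 'cancelled';
    default: // 'Failed', 'TimedOut', and anything unrecognized
      return 'failed';
  }
}

const MAX_CONSECUTIVE_POLL_FAILURES = 3;

// Tracks consecutive poll errors (for example a rethrown InvocationDoesNotExist)
// so one transient error does not fail the task.
class PollFailureCounter {
  private consecutive = 0;

  recordSuccess(): void {
    this.consecutive = 0;
  }

  // Returns true once the failure budget is exhausted.
  recordFailure(): boolean {
    this.consecutive += 1;
    return this.consecutive >= MAX_CONSECUTIVE_POLL_FAILURES;
  }
}
```

Resetting the counter on every successful poll is what makes the limit apply to consecutive failures rather than total failures.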
… container

Two bugs prevented the EC2 compute strategy from working end-to-end:
1. Python sys.path used /app but the Docker image places modules at /app/src; fixed to sys.path.insert(0, "/app/src").
2. GITHUB_TOKEN_SECRET_ARN was not passed to the Docker container, causing the agent to fail with "github_token is required"; it is now exported in the boot script and forwarded via docker run -e.

Also enables the EC2 fleet construct in agent.ts with blueprints for krokoko/agent-plugins and aws-samples/sample-autonomous-cloud-coding-agents.
End-to-End EC2 Compute Strategy Test Results
Deployed the stack with EC2 fleet enabled and ran a live test task against
Bugs Found and Fixed
Test Task:
Summary
- `Ec2ComputeStrategy` handler: finds idle instances via tags, uploads payload to S3, dispatches via SSM `AWS-RunShellScript`, polls `GetCommandInvocation`, cancels with `CancelCommand`
- `Ec2AgentFleet` CDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling images
- `compute_type: 'ec2'`
- Adds `instance_type` field to `RepoConfig` and `BlueprintConfig` for future GPU/custom instance type support

Test plan
- `mise //cdk:compile`: no TypeScript errors
- `mise //cdk:test`: 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)
- `mise //cdk:synth`: synthesizes without errors (EC2 block commented out)
- `mise //cdk:build`: full build including lint passes
- `compute_type: 'ec2'`