Skip to content

feat(compute): add EC2 fleet compute strategy#32

Open
MichaelWalker-git wants to merge 7 commits intomainfrom
feat/ec2-fleet-strategy
Open

feat(compute): add EC2 fleet compute strategy#32
MichaelWalker-git wants to merge 7 commits intomainfrom
feat/ec2-fleet-strategy

Conversation

@MichaelWalker-git
Copy link
Copy Markdown
Contributor

@MichaelWalker-git MichaelWalker-git commented Apr 14, 2026

Summary

  • Add EC2 fleet compute strategy with SSM Run Command dispatch — a third compute backend alongside AgentCore (default) and ECS Fargate
  • New Ec2ComputeStrategy handler: finds idle instances via tags, uploads payload to S3, dispatches via SSM AWS-RunShellScript, polls GetCommandInvocation, cancels with CancelCommand
  • New Ec2AgentFleet CDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling images
  • Wire orchestrator polling, cancel-task SSM dispatch, and task-api SSM permissions for EC2
  • Stack wiring is commented-out (same pattern as ECS) — ready to enable per-repo via blueprint compute_type: 'ec2'
  • Add instance_type field to RepoConfig and BlueprintConfig for future GPU/custom instance type support
image

Test plan

  • mise //cdk:compile — no TypeScript errors
  • mise //cdk:test — 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)
  • mise //cdk:synth — synthesizes without errors (EC2 block commented out)
  • mise //cdk:build — full build including lint passes
  • Deploy with EC2 block uncommented and run an end-to-end task with compute_type: 'ec2'

Add a third compute backend (EC2 fleet with SSM Run Command) alongside
the existing AgentCore and ECS strategies. This provides maximum
flexibility with no image size limits, configurable instance types
(including GPU), and full control over the compute environment.

New files:
- ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for
  instance tracking and SSM RunShellScript for task dispatch
- ec2-agent-fleet.ts: CDK construct with ASG, launch template,
  security group, S3 payload bucket, and IAM role
- ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage

Wiring:
- repo-config.ts: add 'ec2' to ComputeType, add instance_type field
- compute-strategy.ts: add EC2 SessionHandle variant and resolver case
- task-orchestrator.ts: add ec2Config prop with env vars and IAM grants
- orchestrate-task.ts: enable compute polling for EC2
- cancel-task.ts: add SSM CancelCommand for EC2 tasks
- task-api.ts: add ssm:CancelCommand permission for cancel Lambda
- agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
@MichaelWalker-git MichaelWalker-git requested a review from a team as a code owner April 14, 2026 20:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a third compute backend (“ec2” fleet) alongside AgentCore and ECS, enabling task dispatch via SSM Run Command to tagged EC2 instances and adding the supporting CDK construct and runtime wiring.

Changes:

  • Introduces Ec2ComputeStrategy (S3 payload upload + EC2 instance selection/tagging + SSM dispatch/poll/cancel).
  • Adds Ec2AgentFleet CDK construct (ASG + instance role/SG + payload bucket + user data) and optional stack wiring.
  • Extends config/types and handlers to support compute_type: 'ec2' and EC2 cancellation/polling.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
yarn.lock Locks new AWS SDK client dependencies used by EC2/SSM/S3 integration.
cdk/package.json Adds @aws-sdk/client-ec2, client-ssm, client-s3 dependencies.
cdk/test/handlers/shared/strategies/ec2-strategy.test.ts Unit tests for EC2 strategy start/poll/stop behavior.
cdk/test/handlers/shared/compute-strategy.test.ts Ensures compute_type: ec2 resolves to Ec2ComputeStrategy.
cdk/test/constructs/ec2-agent-fleet.test.ts Asserts key resources/permissions created by Ec2AgentFleet.
cdk/src/stacks/agent.ts Adds commented wiring to enable EC2 fleet backend.
cdk/src/handlers/shared/strategies/ec2-strategy.ts Implements EC2 fleet compute strategy via S3 + EC2 tags + SSM.
cdk/src/handlers/shared/repo-config.ts Extends ComputeType to include ec2; adds instance_type.
cdk/src/handlers/shared/orchestrator.ts Propagates instance_type into BlueprintConfig.
cdk/src/handlers/shared/compute-strategy.ts Adds EC2 handle type + resolver case.
cdk/src/handlers/orchestrate-task.ts Stores EC2 compute metadata and enables compute polling for EC2.
cdk/src/handlers/cancel-task.ts Adds SSM CancelCommand support for EC2-backed tasks.
cdk/src/constructs/task-orchestrator.ts Adds EC2 env vars and IAM policies for orchestrator Lambda.
cdk/src/constructs/task-api.ts Adds ssm:CancelCommand permission option for cancel-task Lambda.
cdk/src/constructs/ec2-agent-fleet.ts New ASG-based fleet construct with IAM/SG/S3/log group/user data.
cdk/src/constructs/blueprint.ts Allows blueprint compute.type to be ec2.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Remove unnecessary iam:PassRole from orchestrator (EC2 strategy
  never passes a role to any API)
- Simplify ec2FleetConfig in task-api to empty object (instanceRoleArn
  was unused)
- Use CDK Tags.of() for ASG fleet tag propagation instead of no-op
  user-data tagging — instances are now tagged at launch
- Fix missing AWS_REGION in boot script by deriving from IMDS
- Eliminate shell injection risk by reading all task data from S3
  payload at runtime instead of interpolating into bash exports
- Add cleanup trap in boot script to always retag instance as idle
  on exit (success, error, or signal)
- Add try/catch rollback in startSession to retag instance as idle
  when SSM dispatch fails
- Generalize ECS-specific log messages in poll loop to be
  compute-backend-agnostic (uses strategy type label)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix malformed sed quoting in AWS_REGION derivation (ec2-strategy.ts)
- Remove unused blueprintConfig destructuring (ec2-strategy.ts)
- Scope EC2/SSM IAM permissions: condition ec2:CreateTags on fleet tag,
  scope ssm:SendCommand to fleet-tagged instances and AWS-RunShellScript
  document, separate DescribeInstances (requires resource '*')
@MichaelWalker-git
Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

1. TOCTOU race in instance selection: after tagging an instance as busy,
   re-describe to verify our task-id stuck. If another orchestrator won
   the race, try the next idle candidate instead of double-dispatching.

2. Heartbeat false-positive: EC2/ECS tasks invoke run_task() directly
   and may not send continuous heartbeats. Suppress sessionUnhealthy
   checks when compute-level crash detection (pollSession) is active,
   preventing premature task failure after ~6 minutes.

3. SSM Cancelling status: map to 'running' (transient) instead of
   'failed' to avoid premature failure while cancel propagates.

4. Fix babel parse errors in test mocks (remove `: unknown` annotations
   from jest.mock factory callbacks).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

1. Add rollback on verify failure: if DescribeInstances throws during
   the tag-then-verify claim, roll back the busy/task-id tags so the
   instance isn't stuck.
2. Use docker container prune instead of docker system prune in cleanup
   trap to preserve cached images and avoid re-pulling on next task.
3. Add ecr:BatchCheckLayerAvailability to instance role ECR permissions
   — required for docker pull from ECR.
4. InvocationDoesNotExist now rethrows instead of returning failed,
   letting the orchestrator's consecutiveComputePollFailures counter
   handle transient propagation delays (fails after 3 consecutive).
… container

Two bugs prevented the EC2 compute strategy from working end-to-end:

1. Python sys.path used /app but the Docker image places modules at
   /app/src — fixed to sys.path.insert(0, "/app/src").
2. GITHUB_TOKEN_SECRET_ARN was not passed to the Docker container,
   causing the agent to fail with "github_token is required" — now
   exported in the boot script and forwarded via docker run -e.

Also enables the EC2 fleet construct in agent.ts with blueprints for
krokoko/agent-plugins and aws-samples/sample-autonomous-cloud-coding-agents.
@MichaelWalker-git
Copy link
Copy Markdown
Contributor Author

End-to-End EC2 Compute Strategy Test Results

Deployed the stack with EC2 fleet enabled and ran a live test task against aws-samples/sample-autonomous-cloud-coding-agents using compute_type: 'ec2'.

Bugs Found and Fixed

  1. Python module path (/app -> /app/src): The Docker image copies src/ to /app/src/, so entrypoint.py lives at /app/src/entrypoint.py. The boot script's sys.path.insert was pointing to /app, causing ModuleNotFoundError: No module named 'entrypoint'.

  2. Missing GITHUB_TOKEN_SECRET_ARN: The ECS strategy passes this env var to the container so the agent can fetch the GitHub PAT from Secrets Manager. The EC2 strategy was missing it, causing ValueError: github_token is required. Fixed by extracting the ARN from blueprintConfig and passing it via docker run -e.

Test Task: 01KP78GJ6P8G559VH2VGW5CGNJ

Task:   "Add a CONTRIBUTORS.md file listing project contributors"
Repo:   aws-samples/sample-autonomous-cloud-coding-agents
Status: RUNNING (agent completed work, push failed due to PAT permissions on aws-samples org)

What worked end-to-end:

  • Task submitted via CLI with compute_type: 'ec2' blueprint
  • Orchestrator found idle EC2 instance, tagged it busy, dispatched SSM command
  • SSM boot script: S3 payload download, ECR login, docker pull all succeeded
  • Agent container started, resolved GitHub token from Secrets Manager
  • Agent cloned repo, explored git history, created CONTRIBUTORS.md, committed locally
  • Ran 11 turns, cost $0.16

What didn't work (not an infra issue):

  • git push and gh pr create failed with Permission denied to MichaelWalker-git — the configured PAT doesn't have write access to the aws-samples org. This is a GitHub permissions issue, not an EC2 strategy bug.

SSM Command Output (key lines)

[00:26:40] TASK Repository: aws-samples/sample-autonomous-cloud-coding-agents
[00:26:40] TASK Model: us.anthropic.claude-sonnet-4-6
[00:26:40] CMD clone: gh repo clone aws-samples/sample-autonomous-cloud-coding-agents /workspace/...
[00:26:41] CMD clone: OK
[00:26:46] CMD mise-install: OK
[00:27:13] TOOL Write: /workspace/.../CONTRIBUTORS.md
[00:27:26] TOOL Bash: git add CONTRIBUTORS.md && git commit -m "..."
[00:27:26] RESULT [ok] 1 file changed, 17 insertions(+)
[00:27:28] TOOL Bash: git push ... 
[00:27:28] RESULT [ERROR] Permission denied to MichaelWalker-git

Conclusion

The EC2 compute strategy works end-to-end. The two bugs in this commit were the only blockers. Instance lifecycle (idle -> busy -> idle tagging) and cleanup trap both function correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants