feat(aws): optionally provision managed ElastiCache Redis #186

Draft: BimaPangestu28 wants to merge 1 commit into main from fix/aws-provision-managed-redis

@BimaPangestu28

Member

Summary

Adds opt-in ElastiCache provisioning to the AWS operator module so deploys whose bundles include state-redis actually get a working Redis endpoint without the user having to stand one up out-of-band.

  • New top-level variable aws_provision_redis (bool, default false).
  • When aws_provision_redis = true and redis_url is empty, the operator module creates:
    • aws_elasticache_subnet_group in the same subnets the ECS service uses.
    • aws_security_group with a single ingress on TCP/6379 from the ECS task security group.
    • aws_elasticache_cluster (engine redis, cache.t3.micro × 1, port 6379, default.redis7 parameter group).
  • A new local effective_redis_url picks the external redis_url when supplied, otherwise the managed cluster's primary endpoint. The container's REDIS_URL env var now reads from it.
  • aws_redis_node_type / aws_redis_engine_version exposed for future sizing without code change.
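
The resources listed above could look roughly like the following inside the operator module. This is a minimal sketch, not the PR's actual diff. The resource labels, the inputs `var.name_prefix`, `var.vpc_id`, `var.subnet_ids`, `var.ecs_task_security_group_id`, `var.redis_node_type`, and `var.redis_engine_version`, and the `redis://host:port` URL form are all assumptions. Only `provision_redis`, `redis_url`, `managed_redis_url`, and `effective_redis_url` are names taken from this PR.

```hcl
# Sketch only: anything marked "assumed" is illustrative, not taken from the PR.
locals {
  # Create the managed cluster only when opted in and no external URL is given.
  create_redis = var.provision_redis && var.redis_url == ""
}

resource "aws_elasticache_subnet_group" "managed" {
  count      = local.create_redis ? 1 : 0
  name       = "${var.name_prefix}-redis" # assumed naming input
  subnet_ids = var.subnet_ids             # assumed: same subnets as the ECS service
}

resource "aws_security_group" "redis" {
  count  = local.create_redis ? 1 : 0
  name   = "${var.name_prefix}-redis"
  vpc_id = var.vpc_id # assumed

  # Single ingress rule: Redis port, reachable only from the ECS task SG.
  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [var.ecs_task_security_group_id] # assumed
  }
}

resource "aws_elasticache_cluster" "managed" {
  count                = local.create_redis ? 1 : 0
  cluster_id           = "${var.name_prefix}-redis"
  engine               = "redis"
  engine_version       = var.redis_engine_version # assumed module-level name
  node_type            = var.redis_node_type      # assumed module-level name
  num_cache_nodes      = 1
  port                 = 6379
  parameter_group_name = "default.redis7"
  subnet_group_name    = aws_elasticache_subnet_group.managed[0].name
  security_group_ids   = [aws_security_group.redis[0].id]
}

locals {
  # Empty string when no cluster was created; URL scheme is an assumption.
  managed_redis_url = (
    local.create_redis
    ? "redis://${aws_elasticache_cluster.managed[0].cache_nodes[0].address}:6379"
    : ""
  )

  # An externally supplied URL always wins over the managed one.
  effective_redis_url = var.redis_url != "" ? var.redis_url : local.managed_redis_url
}
```

For a single-node `aws_elasticache_cluster`, `cache_nodes[0].address` is the node's hostname, which is what the sketch treats as the primary endpoint.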

Why

The bundle's state-redis pack reads its connection string from the runtime secrets and expects a reachable Redis. The AWS scaffold only shipped a modules/redis/main.tf null_resource stub, so nothing was ever provisioned, and any bundle that resolved to state-redis at runtime got secrets://dev/<tenant>/_/state-redis/redis_url pointed at a host that didn't exist. Combined with the placeholder-expansion gap in runtime_secrets.rs (see #185), the deep-research demo's button-click flow surfaced "pack execution failed: failed to render node input template" because state writes silently failed.

The companion PR in greentic-demo (greenticai/greentic-demo#166) drops state-redis from the deep-research AWS bundle and unblocks 3Point. This PR is the proper fix for any caller that wants a working state-redis on AWS.

Design notes

  • Resources live inside the operator module. ElastiCache needs the operator's VPC, subnets, and ECS task SG. Keeping the cluster co-located with the operator avoids a circular cross-module dependency (Redis needs the operator's VPC; the operator needs Redis's URL).
  • External redis_url still wins. effective_redis_url = var.redis_url != "" ? var.redis_url : local.managed_redis_url preserves backward compatibility for callers that already point at an outside-of-this-stack Redis.
  • Strictly opt-in. aws_provision_redis defaults to false, so every new resource resolves to count = 0 and existing deploys see no change. A separate Rust-side change can later wire a GREENTIC_DEPLOY_TERRAFORM_VAR_AWS_PROVISION_REDIS env switch when we want the deployer CLI to drive this.
  • Stub left intact. modules/redis/main.tf (the null_resource) stays because tests/pr04_terraform_pack.rs asserts the file exists in the pack. Cleanup belongs to a separate change.
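
For reference, the top-level wiring described in these notes might be declared along these lines. This is a hedged sketch: the variable names and the `cache.t3.micro` / Redis 7.1 defaults come from this PR, while the module `source` path, the root-level `var.redis_url`, and the module input names `redis_node_type` and `redis_engine_version` are assumptions.

```hcl
# Root module (sketch). Defaults mirror the values stated in this PR.
variable "aws_provision_redis" {
  type        = bool
  default     = false
  description = "Provision a managed ElastiCache Redis when redis_url is empty."
}

variable "aws_redis_node_type" {
  type    = string
  default = "cache.t3.micro"
}

variable "aws_redis_engine_version" {
  type    = string
  default = "7.1"
}

module "operator_aws" {
  source = "./modules/operator" # assumed path; not shown in the PR text
  # count and the existing inputs (VPC, subnets, image, ...) are omitted here.

  redis_url            = var.redis_url # assumed to exist at the root already
  provision_redis      = var.aws_provision_redis
  redis_node_type      = var.aws_redis_node_type      # assumed module input name
  redis_engine_version = var.aws_redis_engine_version # assumed module input name
}
```

Because Terraform reads any `TF_VAR_<name>` environment variable as the value for variable `<name>`, `TF_VAR_aws_provision_redis=true` is enough to flip the switch without editing the configuration.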

Test plan

  • terraform init -backend=false && terraform validate in fixtures/packs/aws/terraform/ — clean (only pre-existing data.aws_region.current.name deprecation warnings).
  • cargo test --workspace — 0 failures.
  • cargo fmt --all --check — clean.
  • cargo clippy --workspace -- -D warnings — clean.
  • Live AWS deploy with TF_VAR_aws_provision_redis=true. Expect: terraform apply takes ~5–10 min extra for the cluster, REDIS_URL env on the ECS task points at the cluster endpoint, state-redis pack connects, deep-research button-click flow works end-to-end. Not exercised in CI.

Out of scope

  • Multi-AZ replication group with automatic failover (current is single-node).
  • AUTH token / transit encryption (current cluster is open inside the VPC).
  • Removing the dead modules/redis/main.tf stub.
  • Deployer CLI flag to flip aws_provision_redis from the command line.

When `var.aws_provision_redis = true` and no external `redis_url` is supplied,
the AWS operator module now stands up a single-node ElastiCache cluster
(`cache.t3.micro`, Redis 7.1 by default), restricts ingress on port 6379 to
traffic from the ECS service security group, and feeds the resulting
endpoint into the ECS container as `REDIS_URL`. An external `redis_url`
(when non-empty) still takes
precedence, so existing deploys that target an outside-of-this-stack Redis
keep working without any change.

Why
---
The bundle's state-redis pack expects a reachable Redis endpoint, but the
AWS scaffold only shipped a `modules/redis/main.tf` `null_resource` stub —
nothing was ever provisioned. The companion fix in greentic-demo PR #166
drops state-redis from the deep-research demo so the demo path is unblocked,
but multi-instance scaling (or any future bundle that legitimately needs a
shared state-kv backend) needs a real Redis. This makes that opt-in.

Design notes
------------
- Resources live inside the operator module so VPC/subnet/security-group
  references stay local; no cross-module data flow with the redis stub.
- `local.effective_redis_url` prefers an externally-supplied URL over the
  managed one. That preserves backward compatibility for callers that pass
  `redis_url` from outside (e.g. an existing ElastiCache in another stack).
- Default `count = 0` keeps it strictly opt-in. The existing
  `modules/redis/main.tf` stub is left in place because tests in
  `tests/pr04_terraform_pack.rs` assert the file exists; cleanup belongs to
  a separate change.
- `provision_redis` is propagated from the top-level
  `var.aws_provision_redis` so the deployer Rust side can later wire a
  `GREENTIC_DEPLOY_TERRAFORM_VAR_AWS_PROVISION_REDIS` env switch without
  touching this module.

Test plan
---------
- `terraform init -backend=false && terraform validate` in
  `fixtures/packs/aws/terraform/` is clean (only pre-existing
  `data.aws_region.current.name` deprecation warnings).
- `cargo test --workspace` is green.
- `cargo fmt --all --check` and `cargo clippy --workspace -- -D warnings`
  are clean.
- A live AWS deploy with `TF_VAR_aws_provision_redis=true` will need
  ~5–10 min extra `terraform apply` for the cluster; not exercised in CI.
@BimaPangestu28

Member Author

Heads-up before this lands

Live-tested this PR end-to-end. ElastiCache provisioning itself works: with TF_VAR_aws_provision_redis=true and an empty redis_url, terraform stands up cache.t3.micro, wires the SG, and local.effective_redis_url populates REDIS_URL on the ECS container.

But the state-redis pack doesn't read the REDIS_URL env var directly. It reads secrets://dev/<tenant>/_/state-redis/redis_url from the runtime secrets store (AWS Secrets Manager via valueFrom), and that secret's value is whatever the deployer promoted at apply time (after #185 and #187, the resolved env-var lookup of ${REDIS_URL}). So even after this PR provisions a Redis cluster, the pack still tries to connect to whatever the user supplied in REDIS_URL, or fails fast on a missing value if they supplied nothing.

In other words: this PR is correct at the infra layer but the demo path doesn't benefit until one of these lands as a follow-up:

  1. state-redis pack reads REDIS_URL env directly (or has it as a fallback) instead of going through the secrets:// URI for the connection string.
  2. Or the deployer overwrites the runtime secret with module.operator_aws[0].managed_redis_url after terraform apply so the AWS SM value matches the auto-provisioned endpoint.
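
Option 2 depends on the managed endpoint being readable after apply. A minimal sketch of the outputs that would make that possible, assuming the module output simply re-exports `local.managed_redis_url` (the output name follows the `module.operator_aws[0].managed_redis_url` reference above; the bodies, descriptions, and the root-level re-export are assumptions):

```hcl
# Inside the operator module (sketch).
output "managed_redis_url" {
  description = "URL of the auto-provisioned Redis, or an empty string when none was created."
  value       = local.managed_redis_url
}

# In the root module (a separate file): re-export the value so it shows up
# in `terraform output`. one() returns null when the module has count = 0.
output "managed_redis_url" {
  value = one(module.operator_aws[*].managed_redis_url)
}
```

The deployer could then read the value after `terraform apply` and write it into the runtime secret. How that write happens is left open here.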

Marking do-not-merge until that follow-up is decided — happy to keep this branch alive so it doesn't bit-rot, but it shouldn't ship in isolation.

BimaPangestu28 marked this pull request as draft on May 11, 2026, 02:38