Prerequisites

What to install, configure, and verify before running terraform apply against this reference architecture.

Tools on your PATH

Required for terraform apply (invoked automatically by TF provisioners):

| Tool | Minimum | Used for |
|---|---|---|
| Terraform | 1.5.7 | All reference architecture operations |
| skopeo | 1.14+ | ECR image push (bundle containers/ → private ECR); invoked by the ecr module |
| python3 | 3.9+ | (Optional) Model checkpoint streaming uploader; invoked by the model-checkpoints module (conditional on enable_model_s3_upload = true, default true for the full profile) |
| boto3 | 1.30+ | (Optional) Imported by the checkpoint uploader. Must be importable by the python3 that's first on PATH when Terraform runs. See the boto3 note below. |

Required for the documented operator workflow (run in your shell, NOT invoked by TF):

| Tool | Minimum | Used for |
|---|---|---|
| AWS CLI | v2 | Shell auth (aws sts get-caller-identity), kubectl setup (aws eks update-kubeconfig), ad-hoc verification (aws ecr describe-images, aws s3 ls, etc.), operational escape hatches (aws ecr batch-delete-image, etc.). Not invoked by any Terraform provisioner. |
| kubectl | 1.30+ | Post-apply verification, day-2 troubleshooting |

The reference architecture does not require Docker, Helm, or envsubst on the operator's box. Helm runs via the Terraform Helm provider; ECR push uses skopeo (no Docker daemon).
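
The minimum versions in the tables above can be checked with a short script. The following is a minimal sketch, not part of the reference architecture: the names MINIMUMS, parse_version, meets_minimum, and missing_tools are hypothetical, and the sample version strings in the usage note are assumptions about what each tool prints.

```python
import re
import shutil

# Minimums copied from the tables above; the dict itself is illustrative.
MINIMUMS = {"terraform": (1, 5, 7), "skopeo": (1, 14), "kubectl": (1, 30)}


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    padded_found = found + (0,) * (width - len(found))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

For example, meets_minimum(parse_version("Terraform v1.5.7"), MINIMUMS["terraform"]) evaluates to True, and missing_tools(["terraform", "skopeo", "kubectl"]) lists whichever of those tools is absent from PATH.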

AWS credentials

The Terraform providers do NOT pin an AWS profile; they inherit credentials from your shell. Before running anything:

export AWS_PROFILE=<your-admin-profile>
aws sts get-caller-identity     # confirm you're logged in and as whom

Required permissions

This reference architecture creates IAM roles, an OIDC provider, and KMS keys, so it needs admin-equivalent access on the target account. A read-only or under-privileged developer profile will fail partway through the apply.

Specifically, the operator identity needs:

  • IAM: create/update/delete roles, policies, instance profiles, OIDC providers; tag IAM resources
  • EKS: create cluster, access entries, addons
  • EC2: create VPCs, subnets, NAT gateways, and security groups; consume capacity reservations
  • RDS: create instances, subnet groups, parameter groups, secrets
  • S3 + KMS: create buckets, keys, policies
  • ECR: create repositories, push images
  • Secrets Manager: managed database master password
  • Cognito: create user pools (if enable_cognito = true)

Same profile for TF and subprocesses

Several modules spawn subprocesses via local-exec: the ECR push helper (skopeo) and the model-checkpoints uploader (Python + boto3). These inherit AWS_PROFILE from your shell. If you pin a profile in the Terraform provider block but the subprocess uses a different one, you'll get confusing permission errors that surface only at apply time.

Rule: export AWS_PROFILE in your shell once, then run terraform. This reference architecture's providers.tf deliberately doesn't override it.
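
The inheritance behind this rule can be seen with a small Python sketch. It is an illustration only, not code from the reference architecture: parent_profile and profile_seen_by_child are hypothetical names, and the child here is a throwaway Python process standing in for skopeo or the uploader.

```python
import os
import subprocess
import sys


def parent_profile():
    """Return the AWS_PROFILE set in the current process, or '' if unset."""
    return os.environ.get("AWS_PROFILE", "")


def profile_seen_by_child(env=None):
    """Spawn a child process and report the AWS_PROFILE it sees.

    With env=None the child inherits the parent's environment, which is
    the situation the rule above relies on.
    """
    code = "import os; print(os.environ.get('AWS_PROFILE', ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

When AWS_PROFILE is exported before the parent starts, profile_seen_by_child() returns the same value as parent_profile(). A child started with an environment that lacks the variable sees an empty value, which is the mismatch that produces apply-time permission errors.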

Poolside Helm bundle

The reference architecture is bundle-driven. Before the first apply, you need an extracted Poolside Helm bundle somewhere on disk, outside this repo:

~/poolside/helm/poolside-helm-<version>/
├── charts/
│   ├── poolside-deployment/
│   └── inference-stack/
├── containers/
│   ├── <image>__<tag>__<arch>.tar   (one per container image)
│   └── ...
└── scripts/

This layout comes from extracting the bundle tarball Poolside ships.

Set containers_dir (ECR source) and bundle_root (Helm chart source) in your terraform.tfvars to the appropriate paths. For example:

containers_dir = "/home/ops/poolside/helm/poolside-helm-<version>/containers"
bundle_root    = "/home/ops/poolside/helm/poolside-helm-<version>"

The reference architecture never assumes this lives inside the git repo. Treat the bundle as a vendor artifact you stage separately.
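
Before pointing bundle_root at a directory, it can help to confirm the layout shown above. The sketch below is a hedged example, not part of the bundle or the Terraform modules: bundle_problems is a hypothetical helper, and it checks only charts/ and containers/ on the assumption that those are the directories the two variables depend on.

```python
from pathlib import Path


def bundle_problems(bundle_root):
    """Return a list of layout problems; an empty list means the layout looks right."""
    root = Path(bundle_root).expanduser()
    problems = []
    # charts/ feeds bundle_root (Helm), containers/ feeds containers_dir (ECR push).
    for sub in ("charts", "containers"):
        if not (root / sub).is_dir():
            problems.append(f"missing directory: {root / sub}")
    containers = root / "containers"
    # Only look for image archives if the directory exists at all.
    if containers.is_dir() and not any(containers.glob("*.tar")):
        problems.append(f"no .tar image archives in {containers}")
    return problems
```

Calling bundle_problems with the value you plan to use for bundle_root and printing the result gives a quick check before the first plan.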

Model checkpoint tarballs (full profile only)

If you're running the full profile and using architecture-managed model uploads (enable_model_s3_upload = true, the default), you also need model checkpoint tarballs on disk:

~/poolside/models/
├── laguna_xs-20250427_int4.tar
├── malibu-v2.20251021_int4.tar
├── point-v2.20250403.tar
└── ... (one per model you want to deploy)

Filename convention: <model>-<version>[_<quant>].tar. The reference architecture splits each filename on the first hyphen to derive a model alias (e.g. malibu-v2.20251021_int4.tar → alias malibu). See model-checkpoints.md for the full details, including the BYO-bucket alternative.
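
The first-hyphen rule can be sketched in a few lines of Python. This illustrates the convention as described here and is not the module's actual implementation; model_alias is a hypothetical name, and the error handling is an assumption.

```python
def model_alias(filename):
    """Derive a model alias by splitting a checkpoint tarball name on its first hyphen."""
    if not filename.endswith(".tar"):
        raise ValueError(f"expected a .tar file, got {filename!r}")
    # partition() splits on the first hyphen only, so hyphens in the
    # version part would not affect the alias.
    alias, separator, _version_and_quant = filename.partition("-")
    if not separator or not alias:
        raise ValueError(f"{filename!r} does not match <model>-<version>[_<quant>].tar")
    return alias
```

Applied to the example files above, this yields laguna_xs, malibu, and point. Underscores stay in the alias because only the hyphen acts as the separator.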

Public hostname and ACM certificate

The reference architecture is HTTPS-only at the edge: the ALB terminates TLS against an ACM certificate, and Cognito (when enabled) uses your public hostname for callback URLs. Both must be ready before terraform apply:

  • A public DNS hostname you've chosen for the deployment, for example poolside.example.com. You'll set this as public_hostname in terraform.tfvars. The hostname does not have to resolve yet; you'll point it at the ALB after the apply.
  • An ACM certificate covering that hostname, issued in the target region (var.region). The example root modules look it up by domain name via data "aws_acm_certificate", so the lookup fails if the certificate hasn't been issued by plan time.

If you don't already have an ACM certificate, request one in the AWS console (or via your usual cert-issuance path) and complete DNS validation before continuing.

Target cluster access

After terraform apply creates the EKS cluster, configure kubectl:

eval "$(terraform output -raw kubeconfig_command)"
kubectl get nodes

The cluster is created with both a public API endpoint (gated by the CIDRs in cluster_endpoint_public_access_cidrs) and a private endpoint. Terraform communicates with the API via the public endpoint; in-cluster workloads use the private one.

If your organization requires private-only EKS API access, you'll need to run Terraform from inside the VPC (bastion, peered VPC, or transit gateway). See the limitations note in architecture.md.
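
Because the public endpoint is gated by CIDR, it is worth confirming that the address you run Terraform from falls inside one of the allowed ranges. The following is a minimal sketch using Python's standard ipaddress module; ip_allowed is a hypothetical helper, and the addresses in the usage note are documentation-range placeholders.

```python
import ipaddress


def ip_allowed(ip, cidrs):
    """Return True if `ip` falls inside at least one CIDR block in `cidrs`."""
    address = ipaddress.ip_address(ip)
    # strict=False tolerates entries with host bits set, such as 203.0.113.7/24.
    return any(
        address in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs
    )
```

For instance, ip_allowed("203.0.113.7", ["203.0.113.0/24"]) is True, while ip_allowed("203.0.113.7", ["198.51.100.0/24"]) is False. Pass your egress IP and the list you set for cluster_endpoint_public_access_cidrs.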

Sanity checklist before terraform apply

  • aws sts get-caller-identity returns the identity of your admin profile
  • terraform version ≥ 1.5.7
  • skopeo --version works
  • python3 -c "import boto3; print(boto3.__version__)" succeeds for the same python3 that's first on your PATH (full profile only)
  • Bundle extracted; containers/<something>.tar visible
  • Model checkpoint tarballs at the expected path (full profile, upload mode enabled)
  • cluster_endpoint_public_access_cidrs includes the IP you'll run Terraform from (curl -s ifconfig.me to check)
  • admin_principal_arns includes the role Terraform is running as
  • Public hostname chosen, ACM certificate issued in var.region