Hello! This is a high-quality containerized Python API → managed Kafka cluster → AWS Lambda consumer function reference architecture, provisioned with Terraform. (CloudFormation is used indirectly, for a modular Kafka consumer stack.) I hope you will be able to adapt it for your own projects, under the terms of the license.
Jump to: Installation • Recommendations • Licenses
## Low-cost
- Expensive AWS resources can be toggled off during development
- Spot pricing reduces compute costs up to 70% even without a long-term, always-on Savings Plan commitment
- ARM CPU architecture offers a better price/performance ratio than Intel
## Secure Docker container

- Amazon Linux starts with fewer vulnerabilities, is updated frequently by AWS staff, and uses deterministic operating system package versions
- AWS CloudShell or EC2 provides a controlled, auditable environment for building container images
- The API server process runs as a non-root user, reducing the impact if it is compromised
## Secure private network

- Security group rules reference named security groups rather than ranges of numeric addresses; only known pairs of resources can communicate
- PrivateLink endpoints keep AWS API traffic off the public Internet
- No public Internet access from private subnets
## Compatible with continuous integration/continuous deployment (CI/CD)

- Getting container image build properties from Terraform variables allows separate versions for development, testing and blue/green deployment
- AWS IP Address Manager (IPAM) takes a single address range input and divides the space flexibly, accommodating multiple environments of different sizes
- An AWS Lambda function test event in the shared registry allows realistic, centralized testing
- Amazon Linux on EC2 provides a consistent, central build platform
## Small Docker container image

- Docker cache mounts prevent image bloat and avoid slow re-downloading on re-build (other people needlessly disable or empty operating system package and Python module caches)
- Temporary software is installed, used and removed in the same step, avoiding extra layers and multi-stage build complexity
- Temporary Python modules are uninstalled, just like temporary operating system packages (other people leave `pip`, which will never be used again!)
## Low-code

- API methods, parameters and input validation rules are defined declaratively, in a standard OpenAPI specification; API code need only process requests (see the sketch after this list)
- A managed container service (ECS) and a serverless computing option (Fargate) reduce infrastructure-as-code lines and eliminate scripts
- The AWS event source mapping interacts with Kafka, so that the consumer Lambda function need only process JSON input (I re-used a simple SQS consumer CloudFormation template from my other projects!)
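To show how little request-handling code the declarative approach leaves, here is a minimal sketch (not this repository's actual code; the spec file name, handler names, and port are assumptions). connexion, the module this project uses to serve the OpenAPI specification, performs routing and input validation, so the handlers only build responses:

```python
# Minimal sketch, not this repository's code. Assumes an OpenAPI file named
# hello_api.yaml in the same directory, whose operations reference these
# handlers via operationId. connexion applies the spec's routing and input
# validation rules before a handler ever runs.

from datetime import datetime, timezone

import connexion


def hello() -> dict:
    return {"message": "Hello!"}


def current_time(name: str) -> dict:
    # "name" has already passed the validation rules declared in the spec
    return {
        "message": f"Hello, {name}!",
        "time": datetime.now(timezone.utc).isoformat(),
    }


app = connexion.FlaskApp(__name__, specification_dir=".")
app.add_api("hello_api.yaml")

if __name__ == "__main__":
    app.run(port=8080)
```
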
## Installation

Jump to: Recommendations • Licenses

1. Choose between AWS CloudShell or an EC2 instance for building the Docker image and running Terraform.

    **CloudShell** (Easy ✓)

    - Authenticate to the AWS Console. Use a non-production AWS account and a privileged role.

    - Open an AWS CloudShell terminal.

    - Prepare for a cross-platform container image build. CloudShell seems to provide Intel CPUs. The following instructions are from "Multi-platform builds" in the Docker Build manual.

      ```bash
      sudo docker buildx create --name 'container-builder' --driver 'docker-container' --bootstrap --use
      sudo docker run --privileged --rm 'tonistiigi/binfmt' --install all
      ```

    - Review the Terraform S3 backend documentation and create an S3 bucket to store Terraform state.

    - If at any time you find that your previous CloudShell session has expired, repeat any necessary software installation steps. Your home directory is preserved between sessions, subject to CloudShell persistent storage limitations.

    **EC2 instance**

    EC2 instructions...

    - Create and/or connect to an EC2 instance. I recommend:

      - An `arm64` (`t4g.micro`) instance type ⚠ The ARM-based AWS Graviton architecture avoids multi-platform build complexity.
      - Amazon Linux 2023
      - A 30 GiB EBS volume, with default encryption (supports hibernation)
      - No key pair; connect through Session Manager
      - A custom security group with no ingress rules (yay for Session Manager!)
      - A `sched-stop` = `d=_ H:M=07:00` tag for automatic nightly shutdown (this example corresponds to midnight Pacific Daylight Time) with sqlxpert/lights-off-aws

    - During the instance creation workflow (Advanced details → IAM instance profile → Create new IAM profile) or afterward, give your EC2 instance a custom role. Terraform must be able to list/describe, get tags for, create, tag, untag, update, and delete all of the AWS resource types included in this project's `.tf` files.

    - Update operating system packages (thanks to AWS's deterministic upgrade philosophy, there shouldn't be any updates if you chose the latest Amazon Linux 2023 image), install Docker, and start it.

      ```bash
      sudo dnf check-update
      sudo dnf --releasever=latest update
      sudo dnf install docker
      sudo systemctl start docker
      ```

2. Install Terraform. I've standardized on Terraform v1.10.0 (2024-11-27) as the minimum supported version for my open-source projects.

    ```bash
    sudo dnf --assumeyes install 'dnf-command(config-manager)'
    sudo dnf config-manager --add-repo 'https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo'
    # sudo dnf --assumeyes install terraform-1.10.0-1
    sudo dnf --assumeyes install terraform
    ```

3. Clone this repository and create `terraform.tfvars` to customize variables.

    ```bash
    git clone 'https://github.com/sqlxpert/docker-python-openapi-kafka-terraform-cloudformation-aws.git' ~/docker-python-openapi-kafka
    cd ~/docker-python-openapi-kafka/terraform
    touch terraform.tfvars
    ```

    **Generate a `terraform.tfvars` skeleton...**

    ```bash
    # Requires an up-to-date GNU sed (not the MacOS default!)
    sed --regexp-extended --silent \
      --expression='s/^variable "(.+)" \{$/\n\n# \1 =/p' \
      --expression='s/^ description = "(.+)"$/#\n# \1/p' \
      --expression='s/^ default = (.+)$/#\n# Default: \1/p' variables.tf
    ```

    Optional: To save money while building the Docker container image, set `hello_api_aws_ecs_service_desired_count_tasks = 0` and `create_vpc = false`.

4. In CloudShell (optional if you chose EC2), create an override file to configure your Terraform S3 backend.

    ```bash
    cat > terraform_override.tf << 'EOF'
    terraform {
      backend "s3" {
        insecure     = false
        region       = "RegionCodeForYourS3Bucket"
        bucket       = "NameOfYourS3Bucket"
        key          = "DesiredTerraformStateFileName"
        use_lockfile = true  # No more DynamoDB; now S3-native!
      }
    }
    EOF
    ```

5. Initialize Terraform and create the AWS infrastructure. There's no need for a separate `terraform plan` step. `terraform apply` outputs the plan and gives you a chance to approve before anything is done. If you don't like the plan, don't type `yes`!

    ```bash
    terraform init

    terraform apply -target='aws_vpc_ipam_pool_cidr_allocation.hello_vpc_subnets'
    ```

    **About this two-stage process...**

    CloudPosse's otherwise excellent dynamic-subnets module isn't dynamic enough to co-operate with AWS IP Address Manager (IPAM), so you have to let IPAM finalize subnet IP address range allocations beforehand.

    ```bash
    terraform apply
    ```

    **In case of "already exists" errors...**

    - If you receive a "RepositoryAlreadyExistsException: The repository with name 'hello_api' already exists", set `create_aws_ecr_repository = false`.
    - If you receive a "ConflictException: Registry with name lambda-testevent-schemas already exists", set `create_lambda_testevent_schema_registry = false`.

    After changing the variable(s), run `terraform apply` again.

6. Set environment variables needed for building, tagging and pushing up the Docker container image, then build it.

    ```bash
    AMAZON_LINUX_BASE_VERSION=$(terraform output -raw 'amazon_linux_base_version')
    AMAZON_LINUX_BASE_DIGEST=$(terraform output -raw 'amazon_linux_base_digest')
    AWS_ECR_REGISTRY_REGION=$(terraform output -raw 'hello_api_aws_ecr_registry_region')
    AWS_ECR_REGISTRY_URI=$(terraform output -raw 'hello_api_aws_ecr_registry_uri')
    AWS_ECR_REPOSITORY_URL=$(terraform output -raw 'hello_api_aws_ecr_repository_url')
    HELLO_API_AWS_ECR_IMAGE_TAG=$(terraform output -raw 'hello_api_aws_ecr_image_tag')

    aws ecr get-login-password --region "${AWS_ECR_REGISTRY_REGION}" | sudo docker login --username 'AWS' --password-stdin "${AWS_ECR_REGISTRY_URI}"

    cd ../python_docker

    sudo docker buildx build \
      --build-arg AMAZON_LINUX_BASE_VERSION="${AMAZON_LINUX_BASE_VERSION}" \
      --build-arg AMAZON_LINUX_BASE_DIGEST="${AMAZON_LINUX_BASE_DIGEST}" \
      --platform='linux/arm64' \
      --tag "${AWS_ECR_REPOSITORY_URL}:${HELLO_API_AWS_ECR_IMAGE_TAG}" \
      --output 'type=docker' .

    sudo docker push "${AWS_ECR_REPOSITORY_URL}:${HELLO_API_AWS_ECR_IMAGE_TAG}"
    ```

    **Scanning and updating the container image...**

    In case you have not configured ECR for automatic security scanning on image push, you may be able to initiate a free operating system-level vulnerability scan once per image per day. If you have opted in to paid, enhanced scanning, you cannot initiate a scan manually. See "Scan images for software vulnerabilities in Amazon ECR" for all options.

    ```bash
    aws ecr start-image-scan --repository-name 'hello_api' --image-id "imageTag=${HELLO_API_AWS_ECR_IMAGE_TAG}"
    ```

    Carefully review findings from a manual or automatic vulnerability scan.

    ```bash
    aws ecr describe-image-scan-findings --repository-name 'hello_api' --image-id "imageTag=${HELLO_API_AWS_ECR_IMAGE_TAG}"
    ```

    You can resolve most or all operating system-level findings by specifying the version number and digest that correspond to the latest Amazon Linux 2023 release. Its tag is `2023`, not the usual "latest". Resolving Python-level findings (from a paid, enhanced scan) might be as simple as re-building to pick up newer versions of secondary dependencies, or it might require updating primary module version numbers, in:

    Note: For the Kafka consumer function, AWS Lambda automatically applies security updates to the Lambda runtime.

    Set the `amazon_linux_base_version` and `amazon_linux_base_digest` variables in Terraform, run `terraform apply`, and re-set the environment variables. Then, to re-build the image, run `HELLO_API_AWS_ECR_IMAGE_TAG='1.0.1'` (choose an appropriate new version number, taking semantic versioning into account) in the shell and repeat the build and push commands. To deploy the new image version, set `hello_api_aws_ecr_image_tag = "1.0.1"` (for example) in Terraform and run `terraform apply` one more time.

7. If you changed Terraform variables at the end of Step 3, revert the changes and run both `terraform apply` commands from Step 5.

8. In the Amazon Elastic Container Service section of the AWS Console, check the `hello_api` cluster. Eventually, you should see 2 tasks running.

    - It will take a few minutes for ECS to notice, and then deploy, the container image. Relax, and let it happen. If you are impatient, or if there is a problem, you can navigate to the `hello_api` service, open the orange "Update service" pop-up menu, and select "Force new deployment".

9. Generate the URLs and then test your API.

    ```bash
    cd ../terraform

    HELLO_API_DOMAIN_NAME=$(terraform output -raw 'hello_api_load_balander_domain_name')

    echo -e "curl --location --insecure 'http://${HELLO_API_DOMAIN_NAME}/"{'healthcheck','hello','current_time?name=Paul','current_time?name=;echo','error'}"'\n"
    ```

    Try the different URLs using your Web browser or `curl --location --insecure` (these options allow redirection and self-signed TLS certificates).

    | Method, parameters | Result expected |
    |---|---|
    | `/healthcheck` | Empty response |
    | `/hello` | Fixed greeting, in a JSON object |
    | `/current_time?name=Paul` | Reflected greeting and timestamp, in a JSON object |
    | `/current_time?name=;echo` | HTTP 400 "bad request" error; demonstrates protection from command injection |
    | `/error` | HTTP 404 "not found" error |

    **About redirection to HTTPS, and certificates...**

    Your Web browser should redirect you from `http:` to `https:` and (let's hope!) warn you about the untrusted, self-signed TLS certificate used in this system (which of course is not tied to a pre-determined domain name). Proceed to view the responses from your new API...

    If your browser configuration does not allow accessing Web sites with untrusted certificates, change the `enable_https` variable to `false` and run `terraform apply`. Now, `http:` links will work without redirection. After you have used `https:` with a particular domain, your browser might no longer allow `http:`. Try with another browser.

10. Access the `/hello/hello_api_web_log` CloudWatch log group in the AWS Console. Periodic internal health checks, plus your occasional Web requests, should appear.

    **API access log limitations...**

    The Python connexion module, which I chose because it serves an API from a precise OpenAPI-format specification, uses uvicorn workers. Unfortunately, uvicorn has lousy log format customization support.

11. If you wish to run commands remotely, or to open an interactive shell inside a `hello_api` container, use ECS Exec.

    **ECS Exec instructions...**

    Change the `enable_ecs_exec` variable to `true`, run `terraform apply`, and replace the container(s) using "Force new deployment", as explained at the end of Step 8.

    In the Amazon Elastic Container Service section of the AWS Console, click `hello_api` to open the cluster's page. Open the "Tasks" tab and click an identifier in the "Task" column. Under "Containers", select the container, then click "Connect". Confirm the command that will be executed.

    You can also use the AWS command-line interface from your main CloudShell session (or, with sufficient permissions, from an EC2 instance if you chose to deploy from EC2).

    ```bash
    aws ecs list-tasks --cluster 'hello_api' --query 'taskArns' --output text

    read -p 'Task ID: ' HELLO_API_ECS_TASK_ID

    aws ecs execute-command --cluster 'hello_api' --task "${HELLO_API_ECS_TASK_ID}" --interactive --command '/bin/bash'
    ```

    Activities are logged in the `/hello/hello_api_ecs_exec_log` CloudWatch log group.

12. If you don't wish to use Kafka, skip to Step 14.

    If you wish to enable Kafka, set `enable_kafka = true` and run `terraform apply`. AWS MSK is expensive, so enable Kafka only after confirming that the rest of the system is working for you.

    **In case HelloApiKafkaConsumer CloudFormation stack creation fails...**

    Creation of the Kafka consumer might fail for various reasons. Once the `HelloApiKafkaConsumer` CloudFormation stack is in `ROLLBACK_COMPLETE` status, delete it, then run `terraform apply` again.

13. Access the `/current_time?name=Paul` method several times (adjust the `name` parameter as you wish). The first use of this method prompts creation of the `events` Kafka topic. From now on, use of this method (not the other methods) will send a message to the `events` Kafka topic.

    The AWS MSK event source mapping reads from the Kafka topic and triggers the consumer Lambda function, which logs decoded Kafka messages to the HelloApiKafkaConsumer CloudWatch log group.

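    To make the consumer's role concrete, here is a minimal sketch of such a handler. It is not the project's actual function (which comes from the re-used CloudFormation template), and the log fields shown are illustrative. The event source mapping delivers batches of records grouped by topic and partition, with base64-encoded values, so the function only has to decode the JSON and log it:

    ```python
    # Illustrative sketch only, not the project's consumer. The AWS event
    # source mapping has already polled Kafka; each record's value arrives
    # base64-encoded in the event payload.
    import base64
    import json
    import logging

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)


    def lambda_handler(event, context):
        for topic_partition, records in event.get("records", {}).items():
            for record in records:
                payload = json.loads(base64.b64decode(record["value"]))
                logger.info(
                    "topic=%s partition=%s offset=%s payload=%s",
                    record["topic"], record["partition"],
                    record["offset"], payload,
                )
    ```
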
14. If you wish to continue experimenting, set the `enable_kafka`, `hello_api_aws_ecs_service_desired_count_tasks` and `create_vpc` variables to their cost-saving values and run `terraform apply`.

    When you are finished, delete all resources; the minimum configuration carries a cost.

    If you will be using the container image again soon, you can preserve the Elastic Container Registry image repository (at a cost) by removing it from Terraform state.

    ```bash
    cd ../terraform

    terraform state rm 'aws_schemas_registry.lambda_testevent'
    # terraform state rm 'aws_ecr_repository.hello' 'aws_ecr_lifecycle_policy.hello' 'data.aws_ecr_lifecycle_policy_document.hello'

    terraform apply -destroy
    ```

    **Deletion delays and errors...**

    - Harmless "Invalid target address" errors will occur in some configurations.

    - A newly-created ECR repository is deleted along with any images (unless you explicitly removed it from Terraform state), but if you imported your previously-created ECR repository and it contains images, you will receive a "RepositoryNotEmptyException". Either delete the images or remove the ECR repository from Terraform state. Run `terraform apply -destroy` again.

    - Deleting a VPC Lambda function takes a long time because of the network association; expect 30 minutes if `enable_kafka` was `true`.

    - Expect an error message about retiring KMS encryption key grants (harmless, in this case).

    - If you cancel and re-run `terraform apply -destroy`, a bug in CloudPosse's `dynamic-subnets` module might cause a "value depends on resource attributes that cannot be determined until apply" error. For a work-around, edit the cached module file indicated in the error message. Comment out the indicated line and force `count = 0`. Be sure to revert this temporary patch later.

## Recommendations
This is my own original work, produced without the use of artificial intelligence (AI) and large language model (LLM) code generation. Code from other sources is acknowledged.
I write long option names in my instructions so that other people don't have to look up unfamiliar single-letter options — assuming they can find them!
Here's an example that shows why I go to the trouble, even at the expense of being laughed at by macho Linux users. I started using UNICOS in 1991, so it's not for lack of experience.
Search for the literal text `-t` in docs.docker.com/reference/cli/docker/buildx/build, using Command-F, Control-F, `/`, or `grep`. Only 2 of 41 occurrences of `-t` are relevant!
Where available, full-text (that is, not strictly literal) search engines can't make sense of a 1-letter search term and are also likely to ignore a 2-character term as a "stop-word" that's too short to search for.
My professional and ethical commitment is simple: Only as much technology as a business...
- needs,
- can afford,
- understands (or can learn), and
- can maintain.
Having worked for startups since 2013, I always recommend focusing software engineering effort. It is not possible to do everything, let alone to be good at everything. Managed services, serverless technology, and low-code architecture free software engineers to focus on the core product, that is, on what the company actually sells. Avoid complex infrastructure and tooling unless it offers a unique, tangible, and substantial benefit. Simplicity pays!
Security is easier and cheaper to incorporate at the start than to graft on after the architecture has been finalized, the infrastructure has been templated and created, and the executable code has been written and deployed.
Specialized knowledge of the chosen cloud provider is indispensable. I call it "idiomatic" knowledge, a good part of which is awareness of the range of options supported by your cloud provider. Building generically would mean giving up some performance, some security, and some cloud cost savings. Optimizing later is difficult. "Learn to steer the ship you're on."
## Licenses

| Scope | Link | Included Copy |
|---|---|---|
| Source code, and source code in documentation | GNU General Public License (GPL) 3.0 | LICENSE_CODE.md |
| Documentation, including this ReadMe file | GNU Free Documentation License (FDL) 1.3 | LICENSE_DOC.md |
Copyright Paul Marcelin
Contact: marcelin at cmu.edu (replace "at" with @)