diff --git a/terraform-dynamodb-glue-s3-integration/Images/pattern.png b/terraform-dynamodb-glue-s3-integration/Images/pattern.png
new file mode 100644
index 000000000..426e3d128
Binary files /dev/null and b/terraform-dynamodb-glue-s3-integration/Images/pattern.png differ
diff --git a/terraform-dynamodb-glue-s3-integration/README.md b/terraform-dynamodb-glue-s3-integration/README.md
new file mode 100644
index 000000000..adccf6f6a
--- /dev/null
+++ b/terraform-dynamodb-glue-s3-integration/README.md
@@ -0,0 +1,53 @@
+# Amazon DynamoDB to S3 with zero-ETL using AWS Glue with Terraform
+
+This pattern demonstrates how to create a zero-ETL integration between Amazon DynamoDB and Amazon S3 using an AWS Glue transformation job. The AWS Glue job copies the data in the specified format, which can then be queried using Amazon Athena.
+
+Learn more about this pattern at Serverless Land Patterns: https://serverlessland.com/patterns/terraform-dynamodb-glue-s3-integration
+
+Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.
+
+## Requirements
+
+* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
+* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured
+* [Git installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
+* [Terraform](https://www.terraform.io/) installed
+
+## Deployment Instructions
+
+1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:
+    ```
+    git clone https://github.com/aws-samples/serverless-patterns
+    ```
+2. Change directory to the pattern directory:
+    ```
+    cd terraform-dynamodb-glue-s3-integration
+    ```
+3. Run the Terraform commands below to deploy to your AWS account in the desired region (the default is us-east-1). Note that the region variable is named `aws_region`:
+    ```
+    terraform init
+    terraform validate
+    terraform plan -var "aws_region=<your-region>"
+    terraform apply -var "aws_region=<your-region>"
+    ```
+
+## How it works
+
+This Terraform pattern creates a zero-ETL integration that automatically exports DynamoDB data to S3 using AWS Glue. The AWS Glue job reads from the Amazon DynamoDB table and writes the data in the specified format (Parquet in the included script) to an encrypted Amazon S3 bucket, where it can be used for analytics or kept in long-term storage. The entire infrastructure is provisioned with the required IAM permissions and includes an automated test script to validate the data pipeline.
+
+![pattern](Images/pattern.png)
+
+## Testing
+
+After deployment, run `./test.sh`. The script adds rows to the Amazon DynamoDB table and then triggers the AWS Glue job. Once the job completes, check the Amazon S3 bucket for the exported files.
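+
+The exported Parquet files can be queried with Amazon Athena once they are registered as a table. The sketch below is a minimal boto3 example and is not part of the pattern: the bucket and database values shown are the defaults from `terraform.tfvars` and should be replaced with your `terraform output` values, and the `ddb_export` table name and column list (matching the rows inserted by `test.sh`) are illustrative placeholders.
+
+```python
+import time
+import boto3
+
+# Placeholder names -- substitute the values reported by `terraform output`.
+BUCKET = "sandbox-glue-zero-etl-bucket-us-east-1"
+DATABASE = "sandbox_zero_etl_database"
+
+athena = boto3.client("athena", region_name="us-east-1")
+
+def run_query(sql: str) -> str:
+    """Submit a query to Athena and return its execution ID."""
+    resp = athena.start_query_execution(
+        QueryString=sql,
+        QueryExecutionContext={"Database": DATABASE},
+        ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
+    )
+    return resp["QueryExecutionId"]
+
+def wait(query_id: str) -> str:
+    """Poll until the query reaches a terminal state, then return it."""
+    while True:
+        status = athena.get_query_execution(QueryExecutionId=query_id)
+        state = status["QueryExecution"]["Status"]["State"]
+        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
+            return state
+        time.sleep(1)
+
+# One-time step: register the exported Parquet files as an external table.
+wait(run_query(f"""
+    CREATE EXTERNAL TABLE IF NOT EXISTS ddb_export (id string, name string, age bigint)
+    STORED AS PARQUET
+    LOCATION 's3://{BUCKET}/data/'
+"""))
+
+# Query the exported data; results land under the OutputLocation prefix.
+query_id = run_query("SELECT id, name, age FROM ddb_export ORDER BY age")
+print(f"Query {query_id} finished with state {wait(query_id)}.")
+```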
+
+## Cleanup
+
+1. Delete the stack
+    ```
+    terraform destroy -var "aws_region=<your-region>"
+    ```
+----
+Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+SPDX-License-Identifier: MIT-0
diff --git a/terraform-dynamodb-glue-s3-integration/example-pattern.json b/terraform-dynamodb-glue-s3-integration/example-pattern.json
new file mode 100644
index 000000000..e32d99b14
--- /dev/null
+++ b/terraform-dynamodb-glue-s3-integration/example-pattern.json
@@ -0,0 +1,64 @@
+{
+  "title": "Amazon DynamoDB and Amazon S3 zero-ETL integration using AWS Glue",
+  "description": "Create an Amazon DynamoDB table and an Amazon S3 bucket, and integrate them using an AWS Glue job for zero-ETL data transfer.",
+  "language": "Python",
+  "level": "200",
+  "framework": "Terraform",
+  "introBox": {
+    "headline": "How it works",
+    "text": [
+      "This pattern sets up an Amazon DynamoDB table and an Amazon S3 bucket, and integrates them with an AWS Glue job. With this setup you can move data from Amazon DynamoDB to Amazon S3 (triggers are not implemented in the pattern and need to be added as required) and use it for analytics or long-term storage. Typical scenarios are when there are large amounts of data in Amazon DynamoDB that you want to move to a data lake as part of a data strategy, or when data has to be moved to long-term storage for regulatory reasons. The AWS Glue job copies the data into an encrypted Amazon S3 bucket and stores it in the specified format; in this pattern the format is set to Parquet.",
+      "This pattern also creates the required roles and policies for the services, following the principle of least privilege. The roles and policies can be expanded if additional services come into play."
+    ]
+  },
+  "gitHub": {
+    "template": {
+      "repoURL": "https://github.com/aws-samples/serverless-patterns/tree/main/terraform-dynamodb-glue-s3-integration",
+      "templateURL": "serverless-patterns/terraform-dynamodb-glue-s3-integration",
+      "projectFolder": "terraform-dynamodb-glue-s3-integration",
+      "templateFile": "main.tf"
+    }
+  },
+  "resources": {
+    "bullets": [
+      {
+        "text": "AWS Glue",
+        "link": "https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html"
+      },
+      {
+        "text": "Amazon DynamoDB",
+        "link": "https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html"
+      },
+      {
+        "text": "Amazon S3",
+        "link": "https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html"
+      }
+    ]
+  },
+  "deploy": {
+    "text": [
+      "terraform init",
+      "terraform plan",
+      "terraform apply"
+    ]
+  },
+  "testing": {
+    "text": [
+      "See the GitHub repo for testing instructions."
+    ]
+  },
+  "cleanup": {
+    "text": [
+      "terraform destroy"
+    ]
+  },
+  "authors": [
+    {
+      "name": "Kiran Ramamurthy",
+      "image": "n/a",
+      "bio": "I am a Senior Partner Solutions Architect for Enterprise Transformation. I work predominantly with partners and specialize in migrations and modernization.",
+      "linkedin": "kiran-ramamurthy-a96341b",
+      "twitter": "twitter-handle"
+    }
+  ]
+}
diff --git a/terraform-dynamodb-glue-s3-integration/glue-zero-etl-script.py b/terraform-dynamodb-glue-s3-integration/glue-zero-etl-script.py
new file mode 100644
index 000000000..d72d464eb
--- /dev/null
+++ b/terraform-dynamodb-glue-s3-integration/glue-zero-etl-script.py
@@ -0,0 +1,40 @@
+#############################################
+# This script is used by AWS Glue for the
+# zero-ETL copy of data from DynamoDB to S3.
+#############################################
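+
+# Note: the job arguments resolved below ("JOB_NAME", "source_table",
+# "target_bucket") are supplied through the Glue job's default_arguments
+# in main.tf ("--source_table", "--target_bucket").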
+
+import sys
+from awsglue.transforms import *
+from awsglue.utils import getResolvedOptions
+from pyspark.context import SparkContext
+from awsglue.context import GlueContext
+from awsglue.job import Job
+
+# The argument names here must match the keys used in the code below.
+args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_table', 'target_bucket'])
+
+sc = SparkContext()
+glueContext = GlueContext(sc)
+spark = glueContext.spark_session
+job = Job(glueContext)
+job.init(args['JOB_NAME'], args)
+
+# Read data from DynamoDB, consuming at most 50% of the table's read capacity
+datasource = glueContext.create_dynamic_frame.from_options(
+    connection_type="dynamodb",
+    connection_options={
+        "dynamodb.input.tableName": args['source_table'],
+        "dynamodb.throughput.read.percent": "0.5"
+    }
+)
+
+# Write data to S3 in the specified format (Parquet)
+glueContext.write_dynamic_frame.from_options(
+    frame=datasource,
+    connection_type="s3",
+    connection_options={
+        "path": f"s3://{args['target_bucket']}/data/"
+    },
+    format="parquet"
+)
+
+job.commit()
diff --git a/terraform-dynamodb-glue-s3-integration/main.tf b/terraform-dynamodb-glue-s3-integration/main.tf
new file mode 100644
index 000000000..07028dfb8
--- /dev/null
+++ b/terraform-dynamodb-glue-s3-integration/main.tf
@@ -0,0 +1,153 @@
+terraform {
+  required_version = ">= 1.0"
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = "~> 5.0"
+    }
+  }
+}
+
+provider "aws" {
+  region = var.aws_region
+}
+
+locals {
+  name_prefix = var.environment
+}
+
+# S3 bucket for data storage
+resource "aws_s3_bucket" "data_bucket" {
+  bucket        = "${local.name_prefix}-${var.s3_bucket_name}"
+  force_destroy = true
+}
+
+resource "aws_s3_bucket_public_access_block" "data_bucket_pab" {
+  bucket = aws_s3_bucket.data_bucket.id
+
+  block_public_acls       = true
+  block_public_policy     = true
+  ignore_public_acls      = true
+  restrict_public_buckets = true
+}
+
+# Upload the script file to S3
+resource "aws_s3_object" "file_upload" {
+  bucket = aws_s3_bucket.data_bucket.id
+  key    = "scripts/glue-zero-etl-script.py"
+  source = "glue-zero-etl-script.py"
+}
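+
+# Note: the object above is uploaded once at creation. If you expect to edit
+# the script, adding an etag to the resource re-uploads it whenever the file
+# changes, e.g.:
+#
+#   etag = filemd5("glue-zero-etl-script.py")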
"logs:PutLogEvents" + ] + Resource = "*" + } + ] + }) +} + +# Glue Catalog Database +resource "aws_glue_catalog_database" "zero_etl_database" { + name = "${local.name_prefix}_zero_etl_database" +} + +# Glue Job for Zero-ETL +resource "aws_glue_job" "zero_etl_job" { + name = "${local.name_prefix}-dynamodb-to-s3-zero-etl" + role_arn = aws_iam_role.glue_zero_etl_role.arn + glue_version = "4.0" + + command { + script_location = "s3://${aws_s3_bucket.data_bucket.bucket}/scripts/glue-zero-etl-script.py" + python_version = "3" + } + + default_arguments = { + "--job-language" = "python" + "--job-bookmark-option" = "job-bookmark-enable" + "--enable-continuous-cloudwatch-log" = "true" + "--source-table" = aws_dynamodb_table.source_table.name + "--target-bucket" = aws_s3_bucket.data_bucket.bucket + "--database-name" = aws_glue_catalog_database.zero_etl_database.name + } + + max_retries = 1 + timeout = 30 +} diff --git a/terraform-dynamodb-glue-s3-integration/outputs.tf b/terraform-dynamodb-glue-s3-integration/outputs.tf new file mode 100644 index 000000000..52756db60 --- /dev/null +++ b/terraform-dynamodb-glue-s3-integration/outputs.tf @@ -0,0 +1,24 @@ +output "dynamodb_table_name" { + description = "Name of the DynamoDB table" + value = aws_dynamodb_table.source_table.name +} + +output "s3_bucket_name" { + description = "Name of the S3 bucket" + value = aws_s3_bucket.data_bucket.bucket +} + +output "glue_job_name" { + description = "Name of the Glue job" + value = aws_glue_job.zero_etl_job.name +} + +output "glue_database_name" { + description = "Name of the Glue database" + value = aws_glue_catalog_database.zero_etl_database.name +} + +output "iam_role_arn" { + description = "ARN of the IAM role for Glue" + value = aws_iam_role.glue_zero_etl_role.arn +} diff --git a/terraform-dynamodb-glue-s3-integration/terraform.tfvars b/terraform-dynamodb-glue-s3-integration/terraform.tfvars new file mode 100644 index 000000000..597ddbab9 --- /dev/null +++ b/terraform-dynamodb-glue-s3-integration/terraform.tfvars @@ -0,0 +1,7 @@ +# Use this file to set the variables to override the default +# values in variables.tf + +aws_region = "us-east-1" +table_name = "glue-zero-etl-table" +s3_bucket_name = "glue-zero-etl-bucket-us-east-1" +environment = "sandbox" diff --git a/terraform-dynamodb-glue-s3-integration/test.sh b/terraform-dynamodb-glue-s3-integration/test.sh new file mode 100755 index 000000000..380bebe4b --- /dev/null +++ b/terraform-dynamodb-glue-s3-integration/test.sh @@ -0,0 +1,33 @@ +# This script should be used to test the pattern. It does the following tasks: +# 1. Gathers output values from Terraform +# 2. Adds test data to DynamoDB +# 3. Starts the Glue job +# 4. Runs a CLI command to check the contents of S3 + +#!/bin/bash +set -e + +echo "Testing DynamoDB to S3 Glue Zero-ETL Integration..." + +# Get the required resource names from terraform outputs +BUCKET_NAME=$(terraform output -raw s3_bucket_name) +TABLE_NAME=$(terraform output -raw dynamodb_table_name) +JOB_NAME=$(terraform output -raw glue_job_name) + +# Add data to DynamoDB +echo "Adding test data to DynamoDB..." +aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test1"},"name":{"S":"John"},"age":{"N":"30"}}' +aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test2"},"name":{"S":"Jane"},"age":{"N":"25"}}' +aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test3"},"name":{"S":"Julie"},"age":{"N":"35"}}' + +# Run the Glue job +echo "Starting Glue job..." 
+JOB_RUN_ID=$(aws glue start-job-run --job-name "$JOB_NAME" --query 'JobRunId' --output text)
+
+echo "✅ Job started with ID: $JOB_RUN_ID"
+echo ""
+echo "Run this command to monitor job status:"
+echo "  aws glue get-job-run --job-name $JOB_NAME --run-id $JOB_RUN_ID"
+echo ""
+echo "Run this command to check the results in S3:"
+echo "  aws s3 ls s3://$BUCKET_NAME/data/ --recursive"
diff --git a/terraform-dynamodb-glue-s3-integration/variables.tf b/terraform-dynamodb-glue-s3-integration/variables.tf
new file mode 100644
index 000000000..be70dccce
--- /dev/null
+++ b/terraform-dynamodb-glue-s3-integration/variables.tf
@@ -0,0 +1,23 @@
+variable "aws_region" {
+  description = "AWS region"
+  type        = string
+  default     = "us-east-1"
+}
+
+variable "table_name" {
+  description = "DynamoDB table name"
+  type        = string
+  default     = "source-table"
+}
+
+variable "s3_bucket_name" {
+  description = "S3 bucket name for data export"
+  type        = string
+  default     = "glue-zero-etl-bucket"
+}
+
+variable "environment" {
+  description = "Environment name"
+  type        = string
+  default     = "dev"
+}