53 changes: 53 additions & 0 deletions terraform-dynamodb-glue-s3-integration/README.md
@@ -0,0 +1,53 @@
# Amazon DynamoDB to S3 with zero-ETL using AWS Glue with Terraform

This pattern demonstrates how to create a zero-ETL integration between Amazon DynamoDB and Amazon S3 using an AWS Glue transformation job. The AWS Glue job copies data to Amazon S3 in the specified format, which can then be queried using Amazon Athena.

Learn more about this pattern at Serverless Land Patterns: https://serverlessland.com/patterns/terraform-dynamodb-glue-s3-integration

Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.

## Requirements

* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured
* [Git Installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
* [Terraform](https://www.terraform.io/) installed

## Deployment Instructions

1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:
```
git clone https://github.com/aws-samples/serverless-patterns
```
2. Change directory to the pattern directory:
```
cd terraform-dynamodb-glue-s3-integration
```
3. Run the following Terraform commands to deploy to your AWS account in the desired region (default is us-east-1):
```
terraform init
terraform validate
terraform plan -var aws_region=<YOUR_REGION>
terraform apply -var aws_region=<YOUR_REGION>
```
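
Once the apply completes, the names of the provisioned resources are available as Terraform outputs, for example:
```
terraform output -raw dynamodb_table_name
terraform output -raw s3_bucket_name
terraform output -raw glue_job_name
```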

## How it works

This Terraform pattern creates a zero-ETL integration that automatically exports DynamoDB data to S3 using AWS Glue. The AWS Glue job reads from the Amazon DynamoDB table and writes the data in the specified format (currently Parquet, as set in the script) to an encrypted Amazon S3 bucket for use in analytics and/or long-term storage. The entire infrastructure is provisioned with the required IAM permissions, and includes an automated testing script to validate the data pipeline functionality.

![pattern](Images/pattern.png)
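
The job's source table and target bucket are passed in as job arguments (`--source-table`, `--target-bucket`), so a run can also be started manually with overridden values. Below is a minimal sketch using the AWS CLI; the table and bucket names are placeholders:
```
aws glue start-job-run \
  --job-name "$(terraform output -raw glue_job_name)" \
  --arguments '{"--source-table":"<TABLE_NAME>","--target-bucket":"<BUCKET_NAME>"}'
```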

## Testing

After deployment, run `./test.sh`. This script adds items to the Amazon DynamoDB table and then starts the AWS Glue job. Once the job completes, check Amazon S3 for the target files.
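
Optionally, you can query the exported Parquet files with Amazon Athena. Below is a minimal sketch using the AWS CLI; the table name `zero_etl_export`, the column types, and the `athena-results/` output prefix are illustrative assumptions based on the test data, while `<BUCKET_NAME>` and `<GLUE_DATABASE_NAME>` come from the Terraform outputs:
```
aws athena start-query-execution \
  --query-string "CREATE EXTERNAL TABLE IF NOT EXISTS zero_etl_export (id string, name string, age bigint) STORED AS PARQUET LOCATION 's3://<BUCKET_NAME>/data/'" \
  --query-execution-context Database=<GLUE_DATABASE_NAME> \
  --result-configuration OutputLocation=s3://<BUCKET_NAME>/athena-results/
```
You can then run `SELECT * FROM zero_etl_export` against the same database from the Athena console or CLI.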

## Cleanup

1. Delete the stack
```
terraform destroy -var aws_region=<YOUR_REGION>
```
----
Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0
64 changes: 64 additions & 0 deletions terraform-dynamodb-glue-s3-integration/example-pattern.json
@@ -0,0 +1,64 @@
{
"title": "Amazon DynamoDB and Amazon S3 zero-ETL ingegration using AWS Glue",
"description": "Create a Amazon DynamoDB and S3 bucket and integrate them using a AWS Glue job for zero-ETL data transfer.",
"language": "Python",
"level": "200",
"framework": "Terraform",
"introBox": {
"headline": "How it works",
"text": [
"This pattern sets up Amazon DynamoDB and Amazon S3 buckets, and integrates them with an AWS Glue job. Using this setup, you can move data from Amazon DynamoDB to Amazon S3 buckets (triggers are not implemented in the pattern, need to be added as required) and use that for analytics or long term storage. The scenarios where this pattern can be used are when there are large amounts of data on Amazon DynamoDB and you want to move them to a data lake as part of data strategy, or if the data has to be moved to long term storage due to refulatory reasons. The AWS Glue job copies the data into an encrypted Amazon S3 bucket and stores them in the specified format. In this pattern the format has been set to Parquet."
"This pattern also creates the required roles and policies for the services, with the right level of permissions required. The roles and policies can be expanded if additional services come into play, based on principle of least privilege."
]
},
"gitHub": {
"template": {
"repoURL": "https://github.com/aws-samples/serverless-patterns/tree/main/terraform-dynamodb-glue-s3-integration",
"templateURL": "serverless-patterns/terraform-dynamodb-glue-s3-integration",
"projectFolder": "terraform-dynamodb-glue-s3-integration",
"templateFile": "main.tf"
}
},
"resources": {
"bullets": [
{
"text": "AWS Glue",
"link": "https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html"
},
{
"text": "Amazon DynamoDB",
"link": "https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html"
},
{
"text": "Amazon S3",
"link": "https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html"
}
]
},
"deploy": {
"text": [
"terraform init",
"terraform plan",
"terraform apply"
]
},
"testing": {
"text": [
"See the GitHub repo for testing instructions."
]
},
"cleanup": {
"text": [
"terraform destroy"
]
},
"authors": [
{
"name": "Kiran Ramamurthy",
"image": "n/a",
"bio": "I am a Senior Partner Solutions Architect for Enterprise Transformation. I work predominantly with partners and specialize in migrations and modernization.",
"linkedin": "kiran-ramamurthy-a96341b",
"twitter": "twitter-handle"
}
]
}
40 changes: 40 additions & 0 deletions terraform-dynamodb-glue-s3-integration/glue-zero-etl-script.py
@@ -0,0 +1,40 @@
#############################################
# This script will be used by Glue for zero
# ETL copy of data from DynamoDB to S3.
#############################################

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

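# Job parameters are supplied as --source-table / --target-bucket (see the Glue job's
# default_arguments in main.tf); getResolvedOptions is argparse-based, so hyphenated
# parameter names are exposed with underscores (args['source_table'], args['target_bucket']).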
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source-table', 'target-bucket'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from DynamoDB
datasource = glueContext.create_dynamic_frame.from_options(
connection_type="dynamodb",
connection_options={
"dynamodb.input.tableName": args['source_table'],
"dynamodb.throughput.read.percent": "0.5"
}
)

# Write data to S3 in the specified format
glueContext.write_dynamic_frame.from_options(
frame=datasource,
connection_type="s3",
connection_options={
"path": f"s3://{args['target_bucket']}/data/"
},
format="parquet"
)

job.commit()
153 changes: 153 additions & 0 deletions terraform-dynamodb-glue-s3-integration/main.tf
@@ -0,0 +1,153 @@
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}

provider "aws" {
region = var.aws_region
}

locals {
name_prefix = var.environment
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
bucket = "${local.name_prefix}-${var.s3_bucket_name}"
force_destroy = true
}

resource "aws_s3_bucket_public_access_block" "data_bucket_pab" {
bucket = aws_s3_bucket.data_bucket.id

block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}

# Upload the script file to S3
resource "aws_s3_object" "file_upload" {
bucket = aws_s3_bucket.data_bucket.id
key = "scripts/glue-zero-etl-script.py"
source = "glue-zero-etl-script.py"
}

# DynamoDB source table
resource "aws_dynamodb_table" "source_table" {
name = var.table_name
billing_mode = "PAY_PER_REQUEST"
hash_key = "id"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"

attribute {
name = "id"
type = "S"
}

point_in_time_recovery {
enabled = true
}
}

# IAM role for Glue Zero-ETL
resource "aws_iam_role" "glue_zero_etl_role" {
name = "${local.name_prefix}-glue-zero-etl-role"

assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "glue.amazonaws.com"
}
}
]
})
}

resource "aws_iam_role_policy" "glue_zero_etl_policy" {
name = "${local.name_prefix}-glue-zero-etl-policy"
role = aws_iam_role.glue_zero_etl_role.id

policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"dynamodb:DescribeTable",
"dynamodb:GetRecords",
"dynamodb:ListStreams",
"dynamodb:ExportTableToPointInTime",
"dynamodb:Scan",
"dynamodb:Query"
]
Resource = [
aws_dynamodb_table.source_table.arn,
"${aws_dynamodb_table.source_table.arn}/stream/*"
]
},
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
]
Resource = [
aws_s3_bucket.data_bucket.arn,
"${aws_s3_bucket.data_bucket.arn}/*"
]
},
{
Effect = "Allow"
Action = [
"glue:*",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
Resource = "*"
}
]
})
}

# Glue Catalog Database
resource "aws_glue_catalog_database" "zero_etl_database" {
name = "${local.name_prefix}_zero_etl_database"
}

# Glue Job for Zero-ETL
resource "aws_glue_job" "zero_etl_job" {
name = "${local.name_prefix}-dynamodb-to-s3-zero-etl"
role_arn = aws_iam_role.glue_zero_etl_role.arn
glue_version = "4.0"

command {
script_location = "s3://${aws_s3_bucket.data_bucket.bucket}/scripts/glue-zero-etl-script.py"
python_version = "3"
}

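# Default job arguments; --source-table and --target-bucket are read by
# glue-zero-etl-script.py via getResolvedOptions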
default_arguments = {
"--job-language" = "python"
"--job-bookmark-option" = "job-bookmark-enable"
"--enable-continuous-cloudwatch-log" = "true"
"--source-table" = aws_dynamodb_table.source_table.name
"--target-bucket" = aws_s3_bucket.data_bucket.bucket
"--database-name" = aws_glue_catalog_database.zero_etl_database.name
}

max_retries = 1
timeout = 30
}
24 changes: 24 additions & 0 deletions terraform-dynamodb-glue-s3-integration/outputs.tf
@@ -0,0 +1,24 @@
output "dynamodb_table_name" {
description = "Name of the DynamoDB table"
value = aws_dynamodb_table.source_table.name
}

output "s3_bucket_name" {
description = "Name of the S3 bucket"
value = aws_s3_bucket.data_bucket.bucket
}

output "glue_job_name" {
description = "Name of the Glue job"
value = aws_glue_job.zero_etl_job.name
}

output "glue_database_name" {
description = "Name of the Glue database"
value = aws_glue_catalog_database.zero_etl_database.name
}

output "iam_role_arn" {
description = "ARN of the IAM role for Glue"
value = aws_iam_role.glue_zero_etl_role.arn
}
7 changes: 7 additions & 0 deletions terraform-dynamodb-glue-s3-integration/terraform.tfvars
@@ -0,0 +1,7 @@
# Use this file to set the variables to override the default
# values in variables.tf

aws_region = "us-east-1"
table_name = "glue-zero-etl-table"
s3_bucket_name = "glue-zero-etl-bucket-us-east-1"
environment = "sandbox"
33 changes: 33 additions & 0 deletions terraform-dynamodb-glue-s3-integration/test.sh
@@ -0,0 +1,33 @@
#!/bin/bash
set -e

# This script should be used to test the pattern. It does the following tasks:
# 1. Gathers output values from Terraform
# 2. Adds test data to DynamoDB
# 3. Starts the Glue job
# 4. Prints CLI commands for checking the job status and the contents of S3

echo "Testing DynamoDB to S3 Glue Zero-ETL Integration..."

# Get the required resource names from terraform outputs
BUCKET_NAME=$(terraform output -raw s3_bucket_name)
TABLE_NAME=$(terraform output -raw dynamodb_table_name)
JOB_NAME=$(terraform output -raw glue_job_name)

# Add data to DynamoDB
echo "Adding test data to DynamoDB..."
aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test1"},"name":{"S":"John"},"age":{"N":"30"}}'
aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test2"},"name":{"S":"Jane"},"age":{"N":"25"}}'
aws dynamodb put-item --table-name $TABLE_NAME --item '{"id":{"S":"test3"},"name":{"S":"Julie"},"age":{"N":"35"}}'

# Run the Glue job
echo "Starting Glue job..."
JOB_RUN_ID=$(aws glue start-job-run --job-name $JOB_NAME --query 'JobRunId' --output text)

echo "✅ Job started with ID: $JOB_RUN_ID"
echo ""
echo "Run this command to monitor job status:"
echo " aws glue get-job-run --job-name $JOB_NAME --run-id $JOB_RUN_ID"
echo ""
echo "Run this command to check the results in S3:"
echo " aws s3 ls s3://$BUCKET_NAME/data/ --recursive"