Amazon S3 Find and Forget is a solution you deploy in your own AWS account using AWS CloudFormation. There is no charge for the solution: you pay only for the AWS services used to run the solution. This page outlines the services used by the solution, and examples of the charges you should expect for typical usage of the solution.
Disclaimer
You are responsible for the cost of the AWS services used while running this deployment. There is no additional cost for using the solution. For full details, see the following pricing pages for each AWS service you will be using. Prices are subject to change.
The Amazon S3 Find and Forget solution uses a serverless computing architecture. This model minimises costs when you're not actively using the solution, and allows the solution to scale while only paying for what you use.
The sample VPC provided in this solution makes use of VPC Endpoints, which have an hourly cost as well as data transfer cost. All the other costs depend on the usage of the API, and for typical usage, the greatest proportion of what you pay will be for use of Amazon Athena, Amazon S3 and AWS Fargate.
The Forget phase of the solution uses AWS Fargate. Using Fargate, you pay for the duration that Fargate tasks run during the Forget phase.
The AWS Fargate cost is affected by the number of Fargate tasks you choose to run concurrently, and their configuration (vCPU and memory). You can configure these parameters when deploying the Solution.
AWS Glue Data Catalog is used by the solution to define data mappers. You pay a monthly fee based on the number of objects stored in the data catalog, and for requests made to the AWS Glue service when the solution runs.
AWS Lambda Functions are used throughout the solution. You pay for the requests to, and execution time of, these functions. Functions execute when using the solution web interface, API, and when a deletion job runs.
AWS Step Functions Standard Workflows are used when a deletion job runs. You pay for the amount of state transitions in the Step Function Workflow. The number of state transitions will increase with the number of data mappers, and partitions in those data mappers, included in a deletion job.
Amazon API Gateway is used to provide the solution web interface and API. You pay for requests made when using the web interface or API, and any data transferred out.
Amazon Athena scans your data lake during the Find phase of a deletion job. You pay for the Athena queries run based on the amount of data scanned.
You can achieve significant cost savings and performance gains by reducing the quantity of data Athena needs to scan per query by using compression, partitioning and conversion of your data to a columnar format. See Supported Data Formats for more information regarding supported data and compression formats.
The Amazon Athena Pricing page contains an overview of prices and provides a calculator to estimate the Athena query cost for each deletion job run based on the Data Lake size. See Using Workgroups to Control Query Access and Costs for more information on using workgroups to set limits on the amount of data each query or the entire workgroup can process, and to track costs.
If you choose to deploy a CloudFront distribution for the solution interface, you will pay CloudFront charges for requests and data transferred when you access the web interface.
Amazon Cognito provides authentication to secure access to the API using an administrative user created during deployment. You pay a monthly fee for active users in the Cognito User Pool.
Amazon DynamoDB stores internal state data for the solution. All tables created by the solution use the on-demand capacity mode of pricing. You pay for storage used by these tables, and DynamoDB capacity used when interacting with the solution web interface, API, or running a deletion job.
Four types of charges occur when working with Amazon S3: Storage, Requests and data retrievals, Data Transfer, and Management.
Uses of Amazon S3 in the solution include:
- The solution web interface is deployed to, and served, from an S3 Bucket
- During the Find phase, Amazon Athena will:
- Retrieve data from Amazon S3 for the columns defined in the data mapper
- Store its results in an S3 bucket
- During the Forget phase, a program run in AWS Fargate processes each object
identified in the Find phase will:
- Retrieve the entire object and its metadata
- Create a new version of the file, and PUT this object to a staging bucket
- Delete the original object
- Copy the updated object from the staging bucket to the data bucket, and sets any metadata identified from the original object
- Delete the object from the staging bucket
- Some artefacts, and state data relating to AWS Step Functions Workflows may be stored in S3
The solution uses standard and FIFO SQS queues to handle internal state during a deletion job. You pay for the number of requests made to SQS. The number of requests increases with the number of data mappers, partitions in those data mappers, and the number of Amazon S3 objects processed in a deletion job.
Amazon VPC provides network connectivity for AWS Fargate tasks that run during the Forget phase.
How you build the VPC will determine the prices you pay. For example, VPC Endpoints and NAT Gateways are two different ways to provide network access to the solutions' dependencies. Both ways have different hourly prices and costs for data transferred.
The sample VPC provided in this solution makes use of VPC Endpoints, which have an hourly cost as well as data transfer cost. You can choose to use this sample VPC, however it may be more cost-efficient to use an existing suitable VPC in your account if you have one.
During deployment, the solution uses AWS CodeBuild, AWS CodePipeline and AWS Lambda custom resources to deploy the frontend and the backend. AWS Fargate uses Amazon Elastic Container Registry to store container images.
You are responsible for the cost of the AWS services used while running this solution. As of the date of publication of this version of the source code, the estimated cost to run a job with different Data Lake configurations in the Europe (Ireland) region is shown in the tables below. The estimates do not include VPC costs.
Summary | |
---|---|
Scenario 1 | 100GB Snappy Parquet |
Scenario 2 | 750GB Snappy Parquet |
Scenario 3 | 10TB Snappy Parquet |
Scenario 4 | 50TB Snappy Parquet |
Scenario 5 | 100GB Gzip JSON |
This example shows how the charges would be calculated for a deletion job where:
- Your dataset is 100GB of Snappy compressed Parquet objects that are distributed across 2 Partitions
- The S3 bucket containing the objects is in the same region as the S3 Find and Forget Solution
- The total size of the data held in the column queried by Athena is 6.8GB
- The Find phase returns 15 objects which need to be modified
- The Forget phase uses 3 Fargate tasks with 4 vCPUs and 30GB of memory each, running concurrently for 60 minutes
Service | Spending | Notes |
---|---|---|
Amazon Athena | $0.03 | 6.8GB of data scanned |
AWS Fargate | $0.89 | 3 tasks x 4 vCPUs, 30GB memory x 1 hour |
Amazon S3 | $0.01 | $0.01 of requests and data retrieval. $0 of data transfer |
Other services | $0.05 | n/a |
Total | $0.98 | n/a |
Note: This estimate doesn't include the costs for Amazon VPC
This example shows how the charges would be calculated for a deletion job where:
- Your dataset is 750GB of Snappy compressed Parquet objects that are distributed across 1000 Partitions
- The S3 bucket containing the objects is in the same region as the S3 Find and Forget Solution
- The total size of the data held in the column queried by Athena is 10GB
- The Find phase returns 1000 objects which need to be modified
- The Forget phase uses 50 Fargate tasks with 4 vCPUs and 30GB of memory each, running concurrently for 45 minutes
Service | Spending | Notes |
---|---|---|
Amazon Athena | $0.05 | 10GB of data scanned |
AWS Fargate | $11.07 | 50 tasks x 4 vCPUs, 30GB memory x 0.75 hours |
Amazon S3 | $0.01 | $0.01 of requests and data retrieval. $0 of data transfer |
Other services | $0.01 | n/a |
Total | $11.14 | n/a |
Note: This estimate doesn't include the costs for Amazon VPC
This example shows how the charges would be calculated for a deletion job where:
- Your dataset is 10TB of Snappy compressed Parquet objects that are distributed across 2000 Partitions
- The S3 bucket containing the objects is in the same region as the S3 Find and Forget Solution
- The total size of the data held in the column queried by Athena is 156GB
- The Find phase returns 11000 objects which need to be modified
- The Forget phase uses 100 Fargate tasks with 4 vCPUs and 30GB of memory each, running concurrently for 150 minutes
Service | Spending | Notes |
---|---|---|
Amazon Athena | $0.76 | 156GB of data scanned |
AWS Fargate | $73.82 | 100 tasks x 4 vCPUs, 30GB memory x 2.5 hours |
Amazon S3 | $0.11 | $0.11 of requests and data retrieval. $0 of data transfer |
Other services | $1 | n/a |
Total | $75.69 | n/a |
Note: This estimate doesn't include the costs for Amazon VPC
This example shows how the charges would be calculated for a deletion job where:
- Your dataset is 50TB of Snappy compressed Parquet objects that are distributed across 5300 Partitions
- The S3 bucket containing the objects is in the same region as the S3 Find and Forget Solution
- The total size of the data held in the column queried by Athena is 671GB
- The Find phase returns 45300 objects which need to be modified
- The Forget phase uses 100 Fargate tasks with 4 vCPUs and 30GB of memory each, running concurrently for 10.5 hours
Service | Spending | Notes |
---|---|---|
Amazon Athena | $3.28 | 671GB of data scanned |
AWS Fargate | $310.03 | 100 tasks x 4 vCPUs, 30GB memory x 10.5 hours |
Amazon S3 | $0.49 | $0.49 of requests and data retrieval. $0 of data transfer |
Other services | $3 | n/a |
Total | $316.80 | n/a |
Note: This estimate doesn't include the costs for Amazon VPC
This example shows how the charges would be calculated for a deletion job where:
- Your dataset is 100GB of Gzip compressed JSON objects that are distributed across 310 Partitions
- The S3 bucket containing the objects is in the same region as the S3 Find and Forget Solution
- The Find phase returns 3500 objects which need to be modified
- The Forget phase uses 50 Fargate tasks with 4 vCPUs and 30GB of memory each, running concurrently for 22 minutes
Service | Spending | Notes |
---|---|---|
Amazon Athena | $0.50 | 100GB of data scanned |
AWS Fargate | $5.31 | 50 tasks x 4 vCPUs, 30GB memory x 0.36 hours |
Amazon S3 | $0.03 | $0.03 of requests and data retrieval. $0 of data transfer |
Other services | $0.05 | n/a |
Total | $5.89 | n/a |
Note: This estimate doesn't include the costs for Amazon VPC