This repository contains the supporting infrastructure (VPC, IAM roles, OIDC provider, etc.) needed to grant DataChain Studio enough permissions to manage compute clusters on AWS, and for those clusters to access the designated object storage buckets.
- Update `variables.tf` with values specific to your deployment.
- Install Terraform
- Run `terraform init`
- Run `terraform apply`
- Run `terraform output` and copy the output values
- VPC (`aws_vpc.datachain_cluster`): A Virtual Private Cloud with CIDR block `172.18.0.0/16` to isolate the compute cluster.
- Subnets (`aws_subnet.datachain_cluster`): Public subnets are created in all available availability zones to ensure high availability.
- Internet Gateway (`aws_internet_gateway.datachain_cluster`): Provides internet access for resources within the VPC.
- Route Table (`aws_route_table.datachain_cluster`): Configures routing for internet-bound traffic through the Internet Gateway.
- Security Group (`aws_security_group.datachain_cluster`): Controls inbound and outbound traffic for the cluster, allowing all traffic within the VPC and outbound traffic to the internet.
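The security group behavior described above (all traffic within the VPC, all outbound traffic to the internet) could be expressed roughly as follows. This is a minimal sketch, not the repository's actual definition: the resource names `example` and the group name are hypothetical, and only the CIDR block comes from the description above.

```hcl
# Hypothetical names ("example"); the repository's own resources may be defined differently.
resource "aws_vpc" "example" {
  cidr_block = "172.18.0.0/16" # CIDR block stated in the list above
}

resource "aws_security_group" "example" {
  name   = "example-cluster"
  vpc_id = aws_vpc.example.id

  # Allow all protocols and ports for traffic originating inside the VPC.
  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [aws_vpc.example.cidr_block]
  }

  # Allow all outbound traffic to the internet.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Setting `protocol = "-1"` with ports `0` is the AWS provider's way of saying "all protocols, all ports"; no ingress rule admits traffic from outside the VPC.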
- Cluster Role (`aws_iam_role.datachain_cluster`): Assumed by the EKS cluster to manage AWS resources.
- Node Role (`aws_iam_role.datachain_cluster_node`): Assumed by EC2 instances (worker nodes) in the cluster.
- Pod Role (`aws_iam_role.datachain_cluster_pod`): Assumed by EKS pods for accessing AWS resources.
- OIDC Compute Role (`aws_iam_role.datachain_oidc_compute`): Assumed by DataChain Studio (SaaS) to create, delete and manage DataChain clusters.
- OIDC Storage Role (`aws_iam_role.datachain_oidc_storage`): Assumed by DataChain Studio (SaaS) to read and write object storage (S3) buckets.
- OIDC Provider (`aws_iam_openid_connect_provider.datachain_oidc`): Configures an OIDC provider for the cluster, enabling secure authentication for pods.
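To illustrate how an OIDC-assumed role is typically restricted to a specific subject, here is a minimal sketch of a web identity trust policy. It assumes the `oidc_provider` and `oidc_condition_compute` variables are plain strings; the variable `example_oidc_provider_arn` and all `example_*` resource names are hypothetical and are not taken from this repository.

```hcl
variable "oidc_provider" {
  type = string # e.g. "studio.datachain.ai/api"
}

variable "oidc_condition_compute" {
  type = string # e.g. "credentials:example-team/datachain-compute"
}

# Hypothetical input: ARN of the IAM OIDC provider that federates the issuer above.
variable "example_oidc_provider_arn" {
  type = string
}

data "aws_iam_policy_document" "example_compute_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.example_oidc_provider_arn]
    }

    # Only tokens whose subject claim matches the configured string may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider}:sub"
      values   = [var.oidc_condition_compute]
    }
  }
}

resource "aws_iam_role" "example_compute" {
  name               = "example-datachain-oidc-compute"
  assume_role_policy = data.aws_iam_policy_document.example_compute_trust.json
}
```

The `StringEquals` condition on the `sub` claim is what ties the role to a single team and credential name, so other tokens from the same issuer cannot assume it.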
- IAM Policies: The roles are attached to AWS-managed policies to ensure least privilege access, and restrict allowed actions to DataChain-managed resources.
- OIDC Integration: The use of OIDC allows DataChain Studio to securely manage cloud resources in the target account, eliminating the need for static credentials.
- Network Isolation: The VPC and security groups ensure that the cluster is isolated from external networks, with controlled ingress and egress rules.
| Name | Description | Example |
|---|---|---|
| `aws_region` | AWS region where resources will be deployed | `"eu-north-1"` |
| `oidc_provider` | OIDC issuer URL (used in federated identity) | `"studio.datachain.ai/api"` |
| `oidc_condition_compute` | OIDC subject string for compute role | `"credentials:example-team/datachain-compute"` |
| `oidc_condition_storage` | OIDC subject string for storage role | `"credentials:example-team/datachain-storage"` |
| `storage_buckets` | List of S3 bucket names accessible to DataChain Studio Jobs | `["example-bucket"]` |
| `secrets` | List of Secrets Manager secret ARNs accessible to Studio Jobs | `["arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef"]` |
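As an illustration of what editing these values might look like, the sketch below declares two of the variables with the example values from the table as defaults. The `type` constraints are assumptions inferred from the examples, and the actual declarations in `variables.tf` may be structured differently.

```hcl
# Sketch only: types are inferred from the example values in the table.
variable "aws_region" {
  description = "AWS region where resources will be deployed"
  type        = string
  default     = "eu-north-1"
}

variable "storage_buckets" {
  description = "List of S3 bucket names accessible to DataChain Studio Jobs"
  type        = list(string)
  default     = ["example-bucket"]
}
```

After changing a default, `terraform apply` picks up the new value on the next run.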
| Name | Description |
|---|---|
| `datachain_aws_region` | AWS region where resources are deployed |
| `datachain_oidc_compute_role_arn` | ARN of the OIDC role assumed by Studio to manage compute clusters |
| `datachain_oidc_storage_role_arn` | ARN of the OIDC role assumed by Studio to access S3 buckets |
| `datachain_cluster_role_arn` | ARN of the IAM role assumed by the EKS cluster |
| `datachain_cluster_node_role_arn` | ARN of the IAM role assumed by EKS worker nodes |
| `datachain_cluster_vpc_id` | ID of the VPC hosting the compute cluster |
| `datachain_cluster_subnet_ids` | IDs of the subnets used by the compute cluster |
| `datachain_cluster_security_group_ids` | IDs of the security groups attached to the compute cluster |
DataChain Studio is split into 2 main components:
- Control Plane — typically hosted by us as a fully managed service
- Compute & Data Plane — typically hosted on your cloud accounts
Compute resources will be provisioned through managed Kubernetes clusters we automatically deploy on your account, using the permissions described in this repository.
Update the `storage_buckets` list in `variables.tf` with the list of S3 bucket names DataChain Studio Jobs should have access to, and run `terraform apply`.
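One way a bucket list like this can be turned into IAM permissions is sketched below. It assumes `var.storage_buckets` is a list of bucket names as described above; the data source name `example_storage` and the exact set of S3 actions are assumptions for illustration, not the policy this repository actually applies.

```hcl
# var.storage_buckets is the variable described above; everything named "example" is hypothetical.
data "aws_iam_policy_document" "example_storage" {
  # Listing applies to the bucket itself.
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [for bucket in var.storage_buckets : "arn:aws:s3:::${bucket}"]
  }

  # Reading and writing apply to the objects inside each bucket.
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = [for bucket in var.storage_buckets : "arn:aws:s3:::${bucket}/*"]
  }
}
```

Bucket-level and object-level actions are split into two statements because they target different ARN forms (`bucket` versus `bucket/*`).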
You can securely inject sensitive configuration (such as tokens, passwords, or private URLs) into your compute jobs by referencing AWS Secrets Manager secrets through environment variables. This avoids hardcoding credentials and allows fine-grained secret management.
- Create a JSON Secret in AWS Secrets Manager

  Store your secret as a JSON object. For example: `{ "EXAMPLE_SECRET": "your-secret-value-or-url" }`

- Grant Access to the Secret through Terraform

  Update the `secrets` list in `variables.tf` to grant access to the created secret, and run `terraform apply`.

- Set an Environment Variable in the Studio Job Settings

  In DataChain Studio, configure your job with an environment variable that references the secret using the `awssecret://` syntax: `EXAMPLE_SECRET=awssecret://arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef#EXAMPLE_SECRET`

  - Replace `arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef` with the full ARN of your secret.
  - The part after the `#` (e.g., `#EXAMPLE_SECRET`) refers to the key in your JSON secret.
  - Add the full ARN of your secret to `variables.tf`.
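Granting access to the listed secrets could look roughly like the following policy document. This is a sketch under the assumption that `var.secrets` is a non-empty list of full secret ARNs; the data source name `example_secrets` is hypothetical, and the repository may grant a different set of actions.

```hcl
# var.secrets is the list of secret ARNs described above; "example_secrets" is a hypothetical name.
data "aws_iam_policy_document" "example_secrets" {
  statement {
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = var.secrets # must be non-empty, or the resulting statement has no resources
  }
}
```

Scoping `resources` to explicit ARNs means a job can only read the secrets that were deliberately listed, rather than every secret in the account.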
