Supporting infrastructure for DataChain Compute Clusters on Azure

This repository contains the supporting infrastructure (OIDC, roles, resource group, etc.) needed to grant DataChain Studio enough permissions to manage compute clusters on Azure, and for those clusters to access the designated storage accounts.

Setup

1. Update `variables.tf` with values specific to your deployment.
2. Install Terraform.
3. Run `terraform init`.
4. Run `terraform apply`.
5. Run `terraform output` and copy the output values.

Overview

1. Resource Group

azurerm_resource_group.datachain: Central resource group to organize and manage all related compute and storage resources for the DataChain Compute Clusters.

2. Identity and Access Management (IAM)

Azure AD Applications:
- datachain_oidc_compute: Represents the identity used by DataChain Studio to provision and manage compute resources.
- datachain_oidc_storage: Represents the identity used by DataChain Studio to access storage resources.
Service Principals:
- Created for both OIDC applications above, allowing authentication and role assignment.
Federated Identity Credentials:
- Define OIDC-based trust between DataChain Studio and Azure AD applications, using a specific issuer and subject claim.
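
As a rough illustration of how these three pieces fit together, the sketch below wires up an application, its service principal, and a federated identity credential for the compute identity. The resource labels mirror the names used in this README, but every argument value is an assumption rather than a copy of `main.tf`: the display names are hypothetical, the `https://` prefix on the issuer assumes `oidc_provider` is supplied without a scheme, and the audience is the value Microsoft documents for token exchange, which may not be what this repository uses. Argument names follow recent `azuread` provider releases; older releases name some of them differently (for example `application_object_id`).

```hcl
# Sketch only: not the repository's actual configuration.
resource "azuread_application" "datachain_oidc_compute" {
  display_name = "datachain-oidc-compute" # hypothetical display name
}

# The service principal is what role assignments are attached to.
resource "azuread_service_principal" "datachain_oidc_compute" {
  client_id = azuread_application.datachain_oidc_compute.client_id
}

# Trust tokens from the given issuer whose subject claim matches exactly.
resource "azuread_application_federated_identity_credential" "datachain_oidc_compute" {
  application_id = azuread_application.datachain_oidc_compute.id
  display_name   = "datachain-oidc-compute"         # hypothetical
  audiences      = ["api://AzureADTokenExchange"]   # assumed audience
  issuer         = "https://${var.oidc_provider}"   # assumes the variable omits the scheme
  subject        = var.oidc_condition_compute
}
```

The storage identity would presumably follow the same pattern with `var.oidc_condition_storage` as the subject.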

3. Role Definitions and Assignments

Custom Role: Compute (azurerm_role_definition.datachain_oidc_compute):
- Grants permissions to manage AKS clusters and assign managed identities within the resource group.
- Assigned to the datachain_oidc_compute service principal on the resource group scope.
Custom Role: Storage (azurerm_role_definition.datachain_oidc_storage):
- Grants permissions to read/write/delete storage accounts and containers.
- Assigned to the datachain_oidc_storage service principal on each authorized storage account.
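
To make the scoping concrete, here is a minimal sketch of a custom role definition and its assignment for the compute identity. The two actions listed are illustrative examples of real Azure operations, not the actual permission list in `main.tf`, and the role name is hypothetical. The fragment refers to `azurerm_resource_group.datachain` and the compute service principal described above, so it is not meant to be applied on its own.

```hcl
# Sketch only: the action list is illustrative, not the repository's.
resource "azurerm_role_definition" "datachain_oidc_compute" {
  name  = "datachain-oidc-compute" # hypothetical role name
  scope = azurerm_resource_group.datachain.id

  permissions {
    actions = [
      "Microsoft.ContainerService/managedClusters/*",
      "Microsoft.ManagedIdentity/userAssignedIdentities/assign/action",
    ]
  }

  # The role can only be assigned within this resource group.
  assignable_scopes = [azurerm_resource_group.datachain.id]
}

resource "azurerm_role_assignment" "datachain_oidc_compute" {
  scope              = azurerm_resource_group.datachain.id
  role_definition_id = azurerm_role_definition.datachain_oidc_compute.role_definition_resource_id
  principal_id       = azuread_service_principal.datachain_oidc_compute.object_id
}
```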

4. Storage Accounts

azurerm_storage_account.datachain_oidc_storage:
- Retrieves information from the storage accounts that DataChain jobs should have access to.
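
One plausible way to look up the accounts, assuming they are read through a data source driven by the `storage_buckets` map (resource group name as key, storage account name as value), is sketched below. The `example` and `example_storage` labels are hypothetical, and the role assignment assumes the storage role definition and service principal described in the previous sections.

```hcl
# Sketch only: labels and wiring are assumptions, not copied from main.tf.
data "azurerm_storage_account" "example" {
  for_each            = var.storage_buckets
  name                = each.value # storage account name
  resource_group_name = each.key   # resource group name
}

# One role assignment per looked-up storage account.
resource "azurerm_role_assignment" "example_storage" {
  for_each           = data.azurerm_storage_account.example
  scope              = each.value.id
  role_definition_id = azurerm_role_definition.datachain_oidc_storage.role_definition_resource_id
  principal_id       = azuread_service_principal.datachain_oidc_storage.object_id
}
```

Because the map is keyed by resource group name, this shape allows one storage account per resource group.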

Security Considerations

Least Privilege Access: Custom role definitions limit actions to only those required for compute and storage operations.
OIDC-based Federation: Federated identity credentials eliminate the need for long-lived secrets, enabling secure and auditable access from DataChain Studio.
Scoped Role Assignments: Storage permissions are granted only to explicitly defined accounts. Compute permissions are limited to a single resource group.
AKS Resource Group Boundaries: Permissions for the compute role are restricted to a single Resource Group, under which the AKS Compute clusters are created. AKS will automatically create additional resource groups prefixed with MC_ to host internal infrastructure like virtual networks and managed node pools.

Variables

| Name | Description | Example |
| --- | --- | --- |
| `az_subscription_id` | Azure subscription ID | `"00000000-0000-0000-0000-000000000000"` |
| `az_location` | Azure region where resources will be deployed | `"East US"` |
| `oidc_provider` | OIDC issuer URL (used in federated identity) | `"studio.datachain.ai/api"` |
| `oidc_condition_compute` | OIDC subject string for compute role | `"credentials:example-team/datachain-compute"` |
| `oidc_condition_storage` | OIDC subject string for storage role | `"credentials:example-team/datachain-storage"` |
| `storage_buckets` | Map of resource group names to storage account names | `{ "example-resource-group" = "examplestorageaccount" }` |
| `secret_stores` | Map of resource group names to Key Vault names | `{ "example-resource-group" = "example-key-vault" }` |
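
The Setup steps edit `variables.tf` directly. As an alternative sketch, the same values could be placed in a `terraform.tfvars` file, which Terraform loads automatically and which takes precedence over any defaults declared in `variables.tf`. This assumes the variables are declared with the names in the table above; the values are the placeholder examples from that table, not working settings.

```hcl
# terraform.tfvars (hypothetical file; placeholder values from the table above)
az_subscription_id     = "00000000-0000-0000-0000-000000000000"
az_location            = "East US"
oidc_provider          = "studio.datachain.ai/api"
oidc_condition_compute = "credentials:example-team/datachain-compute"
oidc_condition_storage = "credentials:example-team/datachain-storage"

# Resource group name => storage account name
storage_buckets = {
  "example-resource-group" = "examplestorageaccount"
}

# Resource group name => Key Vault name
secret_stores = {
  "example-resource-group" = "example-key-vault"
}
```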

Outputs

| Name | Description |
| --- | --- |
| `datachain_compute_azure_subscription_id` | Subscription ID used for compute |
| `datachain_compute_azure_tenant_id` | Azure tenant ID for compute resources |
| `datachain_compute_azure_client_id` | Client ID of the compute Azure AD application |
| `datachain_storage_azure_subscription_id` | Subscription ID used for storage |
| `datachain_storage_azure_tenant_id` | Azure tenant ID for storage resources |
| `datachain_storage_azure_client_id` | Client ID of the storage Azure AD application |
| `datachain_compute_resource_group` | Name of the resource group used for AKS and compute |

Architecture

DataChain Studio is split into 2 main components:

- Control Plane: typically hosted by us as a fully managed service
- Compute & Data Plane: typically hosted on your cloud accounts

Compute resources will be provisioned through managed Kubernetes clusters we automatically deploy on your account, using the permissions described in this repository.

Guidance

Granting Access to Azure Storage Accounts in DataChain Studio Jobs

Update the storage_buckets map in variables.tf with the resource groups and storage account names DataChain Studio Jobs should have access to, and run terraform apply.

Granting Access to Azure Key Vault Secrets in DataChain Studio Jobs

You can securely inject sensitive configuration (such as tokens, passwords, or private URLs) into your compute jobs by referencing Azure Key Vault secrets through environment variables. This avoids hardcoding credentials and allows fine-grained secret management.

1. Create a Secret in Azure Key Vault

   Store your secret value under a named key in an existing Key Vault.

2. Grant Access to the Key Vault through Terraform

   Update the `secret_stores` map in `variables.tf` with the resource group and name of the Key Vault, and run `terraform apply`. This grants the storage service principal the Key Vault Secrets User role on the vault.

3. Set an Environment Variable in the Studio Job Settings

   In DataChain Studio, configure your job with an environment variable that references the secret using the `azsecret://` syntax:
```
EXAMPLE_SECRET=azsecret://example-key-vault.vault.azure.net/secrets/example-secret#EXAMPLE_SECRET
```
- Replace example-key-vault.vault.azure.net with the hostname of your Key Vault.
- Replace example-secret with the name of the secret inside the vault.
- The part after the # (e.g., #EXAMPLE_SECRET) refers to the key in your JSON secret payload (omit it if the secret is a plain string).
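
For reference, the Terraform side of the Key Vault grant could look roughly like the sketch below: each vault named in `secret_stores` is looked up, and the built-in Key Vault Secrets User role is assigned on it to the storage service principal. The `example` and `example_secrets` labels are hypothetical and the wiring is an assumption, not a copy of `main.tf`.

```hcl
# Sketch only: labels and wiring are assumptions.
data "azurerm_key_vault" "example" {
  for_each            = var.secret_stores
  name                = each.value # Key Vault name
  resource_group_name = each.key   # resource group name
}

# Assign the built-in role on each vault to the storage identity.
resource "azurerm_role_assignment" "example_secrets" {
  for_each             = data.azurerm_key_vault.example
  scope                = each.value.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azuread_service_principal.datachain_oidc_storage.object_id
}
```

A role assignment like this only takes effect on vaults that use the Azure RBAC permission model. Vaults configured with access policies would need a different grant.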
