jkroll-deepgram (Contributor) commented Nov 7, 2025

Proposed changes

Adds optional support for the Billing container, providing airgapped license management and usage tracking for Deepgram self-hosted deployments.

This implementation provides enterprise customers with a robust, HA-capable, airgapped licensing solution while maintaining full backward compatibility with existing cloud-connected deployments.

Key Features & Implementation Choices

1. Deployment Flexibility

  • Three supported deployment patterns:
    • Pattern 1 (Cloud/Connected): API/Engine → License Proxy → license.deepgram.com (default, no changes)
    • Pattern 2 (Airgapped Direct): API/Engine → Billing (License Proxy disabled)
    • Pattern 3 (Airgapped with Caching): API/Engine → License Proxy → Billing (optional caching layer)
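
As a rough illustration, Patterns 2 and 3 might be selected with values overrides along these lines. `billing.enabled` and `licenseProxy.upstream.useBilling` are the flags named in this PR; `licenseProxy.enabled` is an assumed toggle name, so check the chart's values.yaml for the exact key.

```yaml
# Pattern 2 (Airgapped Direct): API/Engine talk to Billing directly.
billing:
  enabled: true
licenseProxy:
  enabled: false        # assumed key for disabling the License Proxy
```

```yaml
# Pattern 3 (Airgapped with Caching): License Proxy sits in front of Billing.
billing:
  enabled: true
licenseProxy:
  enabled: true         # assumed key, as above
  upstream:
    useBilling: true    # forward to Billing instead of license.deepgram.com
```

Pattern 1 needs no overrides, since `billing.enabled: false` leaves existing deployments unchanged.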

2. High Availability (HA) Support

  • StatefulSet architecture with stable network identities for multi-replica deployments
  • Per-pod persistent journal storage via volumeClaimTemplates (ReadWriteOnce by default)
  • Optional shared storage via billing.journal.existingPvcName for EFS/NFS (ReadWriteMany)
  • Configurable replica count (default: 1, supports N replicas for redundancy)
  • Each replica maintains its own usage journal to avoid write conflicts (EBS), or writes to separate subdirectories (EFS)
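
To make the per-pod storage model concrete, here is a minimal StatefulSet sketch. It is not the chart's rendered output: the resource name, labels, image, and replica count are placeholders, and only `billing-internal`, the `/journal` mount path, and the 1Gi ReadWriteOnce default come from this PR.

```yaml
# Illustrative only: names, labels, and image are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: billing
spec:
  serviceName: billing-internal      # headless Service providing stable pod DNS names
  replicas: 2
  selector:
    matchLabels:
      app: billing
  template:
    metadata:
      labels:
        app: billing
    spec:
      containers:
        - name: billing
          image: registry.example.internal/billing:placeholder
          volumeMounts:
            - name: journal
              mountPath: /journal    # journal file lives at /journal/journal
  volumeClaimTemplates:
    - metadata:
        name: journal
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```

With these placeholder names, Kubernetes creates one PVC per replica, named `journal-billing-0` and `journal-billing-1`, and each pod reattaches to its own claim after a restart.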

3. Persistent Usage Tracking

  • Journal file (/journal/journal) persists all usage data for billing and compliance
  • EBS-backed PVCs (default: 1Gi, configurable storage class and size)
  • Critical data protection: journals survive pod restarts, node failures, and cluster migrations
  • Flexible storage: Supports both EBS (per-pod PVCs) and EFS (shared PVC) for different HA patterns

4. Secure Secret Management

  • Two-tier secret architecture:
    • global.deepgramLicenseSecretRef: License key (env var: DEEPGRAM_LICENSE_KEY)
    • billing.licenseFile.secretRef: License file (mounted as /license/license.dg)
  • Runtime config rendering via sed in initContainers for secure key injection
  • Configurable init container image (global.initContainer.image) with ubuntu:22.04 default
  • Airgapped-friendly: init container image can be mirrored to private registries
  • Secrets never exposed in ConfigMaps or logs
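
A sketch of what sed-based rendering in an init container can look like. This is an assumption about the general shape, not a copy of the chart's template: the volume names, file paths, Secret name and key, and the `__DEEPGRAM_LICENSE_KEY__` placeholder token are all hypothetical.

```yaml
# Pod spec fragment (hypothetical names throughout).
initContainers:
  - name: render-config
    image: ubuntu:22.04              # default of global.initContainer.image per this PR
    command: ["/bin/sh", "-c"]
    args:
      # The shell expands ${DEEPGRAM_LICENSE_KEY} at runtime; the ConfigMap
      # template only ever contains the placeholder token.
      - >-
        sed "s|__DEEPGRAM_LICENSE_KEY__|${DEEPGRAM_LICENSE_KEY}|g"
        /config-template/billing.toml > /config/billing.toml
    env:
      - name: DEEPGRAM_LICENSE_KEY
        valueFrom:
          secretKeyRef:
            name: dg-license-key     # hypothetical Secret name
            key: DEEPGRAM_LICENSE_KEY
    volumeMounts:
      - name: config-template        # ConfigMap holding the template
        mountPath: /config-template
      - name: rendered-config        # emptyDir shared with the main container
        mountPath: /config
```

One caveat with this approach: a key containing `|`, `&`, or a backslash would be misinterpreted by sed, so the delimiter and escaping need to suit the key format.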

5. Minimal Container Design

  • Ultra-minimal billing image (no shell, no tar) for security and size optimization
  • Debug workflow provided via ephemeral debug pods for journal file access (documented in samples/airgapped.md)
  • Consistent with Deepgram's security-first container design philosophy
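
Because the billing image has no shell or tar, the journal has to be read from a separate pod. One possible shape for such a debug pod is sketched below; it is not necessarily the workflow documented in samples/airgapped.md, and the pod name and PVC name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: billing-journal-debug
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: ubuntu:22.04            # mirror to a private registry when airgapped
      command: ["sleep", "3600"]     # keep the pod alive for an hour
      volumeMounts:
        - name: journal
          mountPath: /journal
          readOnly: true
  volumes:
    - name: journal
      persistentVolumeClaim:
        claimName: journal-billing-0 # hypothetical PVC name for replica 0
        readOnly: true
```

With a ReadWriteOnce volume, this pod can only mount the claim if it is scheduled onto the same node as the Billing pod that holds it. Once it is running, `kubectl cp billing-journal-debug:/journal/journal ./journal` copies the file out, since the debug image includes tar.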

6. Seamless License Proxy Integration

  • New configuration flag: licenseProxy.upstream.useBilling
    • When true: License Proxy forwards requests to Billing
    • When false: License Proxy uses license.deepgram.com (default)
  • Allows optional License Proxy as a caching/request aggregation layer in airgapped environments

7. Automatic Configuration Management

  • Conditional TOML generation in API/Engine ConfigMaps based on billing.enabled
  • Dynamic service discovery for billing-internal headless service
  • Automatic placeholder injection: DEEPGRAM_API_KEY=airgapped-mode for API/Engine/License Proxy to satisfy internal checks
  • Backward compatibility: existing deployments unaffected when billing.enabled: false

8. Resource Management

  • Default resource limits match License Proxy (configurable per-deployment)
  • Node affinity/selectors for workload isolation (e.g., k8s.deepgram.com/node-type=billing)
  • Tolerations for specialized node pools
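
For illustration, pinning Billing to a dedicated node pool might look like the fragment below. The label `k8s.deepgram.com/node-type=billing` is the example from this PR; the placement of `nodeSelector` and `tolerations` directly under `billing`, and the taint key, are assumptions to verify against values.yaml.

```yaml
billing:
  nodeSelector:
    k8s.deepgram.com/node-type: billing
  tolerations:
    - key: "k8s.deepgram.com/node-type"   # hypothetical taint on the billing node pool
      operator: "Equal"
      value: "billing"
      effect: "NoSchedule"
```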

9. Comprehensive Documentation

  • New airgapped guide: samples/airgapped.md with step-by-step instructions
    • Quick start with minimal configuration examples
    • Manual journal retrieval (debug pod method)
    • Automated backup via CronJob (with S3/local storage options)
    • Multi-replica considerations (EBS vs EFS)
    • Troubleshooting and best practices
  • Sample configurations:
    • 06-basic-setup-aws-airgapped.values.yaml - Complete airgapped deployment values
    • 06-basic-setup-aws-airgapped.cluster-config.yaml - EKS cluster setup for airgapped deployments

Migration Path

  • Zero breaking changes for existing customers
  • Opt-in only: set billing.enabled: true and configure secrets
  • Side-by-side compatibility: License Proxy and Billing can coexist during testing
  • Clear migration guide in samples/airgapped.md

Technical Highlights

  • Kubernetes Resources: StatefulSet, Headless Service, ConfigMap, RBAC, PVCs
  • Storage Requirements: EBS CSI driver (AWS) or equivalent storage provisioner, configurable storage class
  • Secret Types: generic (license key), generic (license file as file mount)
  • Init Container Pattern: Configurable image (default ubuntu:22.04) with sed for runtime secret injection
  • Journal Format: Newline-delimited JSON (NDJSON) with base64-encoded, signed usage events

Testing Completed

  • End-to-end airgapped deployment on EKS 1.32
  • Successful transcription requests with billing tracking
  • Journal file generation and retrieval (both manual and automated methods)
  • Multi-replica StatefulSet behavior
  • License Proxy integration (Pattern 3)
  • Init container secret injection with sed
  • Backward compatibility ("cloud" self-hosted deployments unaffected)

Types of changes

What types of changes does your code introduce to the Deepgram self-hosted resources?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update or tests (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have tested my changes in my local self-hosted environment
    • Please describe your testing setup and methodology here

I brought up a deployment with this Helm chart and the 06 airgapped samples, served a test STT request, and confirmed that the usage was logged in the journal file.

  • I have added necessary documentation (if appropriate)

Further comments

jkroll-deepgram marked this pull request as ready for review on November 13, 2025, 23:58
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
Collaborator

It would make more sense to explicitly define maxSurge alongside this to clearly describe the behavior.

However, I'm not sure this will work with the way you've set up the volumes. If there are extra Billing containers during a deployment, they can't all read from and write to the same volume when it is configured as ReadWriteOnce. You would need ReadWriteMany, and you'd want to set it up as a PersistentVolume so that it can still be accessed after the containers are spun down.

Contributor Author

The billing.deployment.yaml uses a StatefulSet rather than a Deployment, so each Billing pod gets its own dedicated journal file via volumeClaimTemplates. That does mean there will be N journal files corresponding to N Billing pods. The StatefulSet uses ordered, one-at-a-time rollouts, so a maxSurge doesn't apply. The ReadWriteOnce is okay here since each Billing pod owns its exclusive PVC. Let me know if you see any downside to Billing using a StatefulSet instead of a Deployment like API/Engine/LP are using, as well as implications of producing multiple journal files for multiple Billing pods.

Collaborator

Okay, I missed that. That makes sense, as long as we've tested it.

It looks like we have the client provide a custom persistent volume to store that journal file. Could we add documentation to the README on how to do that, to provide guardrails that guide folks toward persisting it correctly?

Contributor Author

Agreed on the need for documentation overall.

To answer your specific point, the journal persistence is automatic with no manual PV creation needed. The StatefulSet uses volumeClaimTemplates, so Kubernetes automatically provisions the PVCs/PVs via the cluster's StorageClass (like gp2 on AWS). Customers configure billing.journal.storageClass (optional, defaults to cluster default), and billing.journal.size (optional, defaults to 1Gi).
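
For example, a values override along these lines would request a 1Gi gp2-backed volume per Billing pod. The keys are the ones named above; the `existingPvcName` value is a hypothetical claim name.

```yaml
billing:
  enabled: true
  journal:
    storageClass: gp2      # optional; omit to use the cluster's default StorageClass
    size: 1Gi              # optional; 1Gi is the default
    # existingPvcName: billing-journal-efs   # hypothetical; shared ReadWriteMany claim (EFS/NFS)
```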

I will add documentation with guidance on how to back up the PVC outside of the chart (such as with Velero or cloud-specific snapshots).

It's also important that customers know how to exfiltrate the journal file. I had to create a debugging container to do it. Some kind of sidecar container is needed, or else a cron job that copies the file to S3 on a set cadence. Do you have any feedback on a journal file exfiltration method? S3 might be difficult in an airgapped context, but a sidecar container also adds complexity.
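
One possible shape for the cron-based option that stays inside the cluster is sketched below. It is a rough idea rather than a tested design: both PVC names, the schedule, and the image are assumptions.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: billing-journal-backup
spec:
  schedule: "0 * * * *"              # hourly
  concurrencyPolicy: Forbid          # never run two copies at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: copy-journal
              image: ubuntu:22.04    # mirror to a private registry when airgapped
              command: ["/bin/sh", "-c"]
              args:
                # Timestamped copy so earlier backups are not overwritten.
                - cp /journal/journal "/backup/journal-$(date +%Y%m%dT%H%M%S)"
              volumeMounts:
                - name: journal
                  mountPath: /journal
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: journal
              persistentVolumeClaim:
                claimName: journal-billing-0        # hypothetical per-pod journal PVC
                readOnly: true
            - name: backup
              persistentVolumeClaim:
                claimName: billing-journal-backup   # hypothetical backup PVC
```

Two caveats apply. With ReadWriteOnce, the job pod has to land on the same node as the Billing pod, and copying a file that is still being appended to can capture a partial final line.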

Contributor Author

Added documentation at a few levels: 82fe595

@jkroll-deepgram I see you already fixed the envsubst and API key errors. I also got one more error, related to flux. I think the config expects script_path and socket_path to be present:

Error: TOML parse error at line 26, column 1
   |
26 | [flux]
   | ^^^^^^
missing field `script_path`

@@ -0,0 +1,137 @@
{{- if .Values.billing.enabled -}}
apiVersion: apps/v1
kind: StatefulSet

@jkroll-deepgram Since Billing is a StatefulSet, I think it would be better to name this file billing.statefulset.yaml so people don't get confused by the name.
