diff --git a/gfmstudio/amo/a-model-01-name-chart/README.md b/gfmstudio/amo/a-model-01-name-chart/README.md new file mode 100644 index 0000000..ee9a1e9 --- /dev/null +++ b/gfmstudio/amo/a-model-01-name-chart/README.md @@ -0,0 +1,259 @@ +# Automated Model Deployment Guide + +This guide explains how to use the vllm-inference-server template for automated model deployments. + +## Overview + +The deployment system uses a template-based approach where: +- All services are prefixed with `gfm-amo-{MODEL_NAME}` +- KServe can be toggled on/off during deployment +- Resource requirements are configurable per deployment +- Model names are dynamically injected into the deployment + +## Files + +1. **vllm-inference-server-template.yaml** - The main template with placeholders +2. **deploy_model.py** - Python script for automated deployments +3. **vllm-inference-server-all-in-one.yaml** - Reference with Helm variables +4. **vllm-inference-server-standalone.yaml** - Example with filled values + +## Template Placeholders + +The template uses the following placeholders that must be replaced: + +| Placeholder | Description | Example | +|------------|-------------|---------| +| `${MODEL_NAME}` | Model name (prefixed with gfm-amo-) | `my-model` | +| `${NAMESPACE}` | Kubernetes namespace | `geospatial-studio` | +| `${ENABLE_KSERVE}` | Enable KServe (true/false) | `false` | +| `${IMAGE_REPOSITORY}` | Container image repository | `us.icr.io/gfmaas/vllm-small` | +| `${IMAGE_TAG}` | Container image tag | `v0.0.6` | +| `${IMAGE_PULL_SECRET}` | Image pull secret name | `my-registry-secret` | +| `${SERVICE_ACCOUNT}` | Service account name | `default` | +| `${MODELS_PVC}` | Models storage PVC | `vllm-models-pvc` | +| `${INFERENCE_SHARED_PVC}` | Shared inference PVC | `inference-shared-pvc` | +| `${GPU_COUNT}` | Number of GPUs | `1` | +| `${CPU_LIMIT}` | CPU limit | `2000m` | +| `${MEMORY_LIMIT}` | Memory limit | `8Gi` | +| `${CPU_REQUEST}` | CPU request | `1000m` | +| `${MEMORY_REQUEST}` | Memory 
request | `4Gi` | + +## Deployment Modes + +### Standard Deployment (KServe Disabled) + +Creates a standard Kubernetes Deployment with: +- Fixed replica count (default: 1) +- Always-on pods +- Direct service access + +**Use when:** +- You need consistent availability +- Scale-to-zero is not required +- You want simpler networking + +### KServe InferenceService (KServe Enabled) + +Creates a KServe InferenceService with: +- Scale-to-zero capability (minReplicas: 0) +- Automatic scaling based on traffic +- Advanced serving features + +**Use when:** +- You want to save resources with scale-to-zero +- You need automatic scaling +- KServe is installed in your cluster + +## Usage Examples + +### Using Python Script + +#### 1. Deploy with Standard Deployment (Dry Run) + +```bash +python deploy_model.py my-flood-model \ + --namespace geospatial-studio \ + --dry-run +``` + +#### 2. Deploy with KServe InferenceService + +```bash +python deploy_model.py my-flood-model \ + --namespace geospatial-studio \ + --enable-kserve +``` + +#### 3. Deploy with Custom Resources + +```bash +python deploy_model.py my-large-model \ + --namespace geospatial-studio \ + --gpu-count 2 \ + --memory-limit 16Gi \ + --cpu-limit 4000m \ + --memory-request 8Gi \ + --cpu-request 2000m +``` + +#### 4. 
Generate YAML File Without Deploying
+
+```bash
+python deploy_model.py my-model \
+    --namespace geospatial-studio \
+    --enable-kserve \
+    --dry-run \
+    --output my-model-deployment.yaml
+```
+
+### Using Shell Script (sed/envsubst)
+
+```bash
+#!/bin/bash
+
+# Set variables
+export MODEL_NAME="my-flood-model"
+export NAMESPACE="geospatial-studio"
+export ENABLE_KSERVE="false"
+export IMAGE_REPOSITORY="us.icr.io/gfmaas/vllm-small"
+export IMAGE_TAG="v0.0.6"
+export IMAGE_PULL_SECRET="my-registry-secret"
+export SERVICE_ACCOUNT="default"
+export MODELS_PVC="vllm-models-pvc"
+export INFERENCE_SHARED_PVC="inference-shared-pvc"
+export GPU_COUNT="1"
+export CPU_LIMIT="2000m"
+export MEMORY_LIMIT="8Gi"
+export CPU_REQUEST="1000m"
+export MEMORY_REQUEST="4Gi"
+
+# Generate YAML
+envsubst < vllm-inference-server-template.yaml > deployment.yaml
+
+# Filter based on KServe mode (the in-place -i flag as used here requires GNU sed)
+if [ "$ENABLE_KSERVE" = "true" ]; then
+    # Remove Deployment section, keep InferenceService
+    sed -i '/# Deployment (used when ENABLE_KSERVE=false)/,/^---$/d' deployment.yaml
+else
+    # Remove InferenceService section, keep Deployment; the InferenceService
+    # is the last document in the template, so delete through end of file
+    sed -i '/# InferenceService (used when ENABLE_KSERVE=true)/,$d' deployment.yaml
+fi
+
+# Apply
+kubectl apply -f deployment.yaml
+```
+
+### Programmatic Integration (Python)
+
+```python
+from deploy_model import deploy_model
+
+# Deploy a model programmatically
+success = deploy_model(
+    model_name="my-burn-scar-model",
+    namespace="geospatial-studio",
+    enable_kserve=True,
+    image_repository="us.icr.io/gfmaas/vllm-small",
+    image_tag="v0.0.6",
+    gpu_count="1",
+    memory_limit="8Gi",
+    dry_run=False
+)
+
+if success:
+    print("Model deployed successfully!")
+else:
+    print("Deployment failed!")
+```
+
+## Service Naming Convention
+
+All deployed services follow this naming pattern:
+- Service name: `gfm-amo-{MODEL_NAME}`
+- Example: `gfm-amo-my-flood-model`
+
+This ensures:
+- Consistent naming across deployments
+- Easy identification of automated 
deployments
+- No naming conflicts with manual deployments
+
+## Verification
+
+After deployment, verify the resources:
+
+```bash
+# Check deployment/inferenceservice
+kubectl get deployment -n geospatial-studio gfm-amo-{MODEL_NAME}
+# OR
+kubectl get inferenceservice -n geospatial-studio gfm-amo-{MODEL_NAME}
+
+# Check service
+kubectl get svc -n geospatial-studio gfm-amo-{MODEL_NAME}
+
+# Check pods
+kubectl get pods -n geospatial-studio -l app.kubernetes.io/name=gfm-amo-{MODEL_NAME}
+
+# View logs
+kubectl logs -n geospatial-studio -l app.kubernetes.io/name=gfm-amo-{MODEL_NAME}
+```
+
+## Cleanup
+
+To remove a deployed model:
+
+```bash
+# Delete all core resources for a model (note: "kubectl delete all" does not
+# cover custom resources such as KServe InferenceServices)
+kubectl delete all -n geospatial-studio -l app.kubernetes.io/name=gfm-amo-{MODEL_NAME}
+
+# Or delete specific resources
+kubectl delete deployment gfm-amo-{MODEL_NAME} -n geospatial-studio
+kubectl delete svc gfm-amo-{MODEL_NAME} -n geospatial-studio
+
+# For KServe deployments, delete the InferenceService
+kubectl delete inferenceservice gfm-amo-{MODEL_NAME} -n geospatial-studio
+```
+
+## Integration with Deployment Services
+
+For automated deployment services, the recommended approach is:
+
+1. **Store the template** in your deployment service
+2. **Accept user inputs** for model name and KServe toggle
+3. **Replace placeholders** using your preferred method (Python, Go, etc.)
+4. **Filter resources** based on KServe mode
+5. **Apply to cluster** using kubectl or a Kubernetes client library
+
+Example workflow:
+```
+User Request → Model Name + KServe Toggle
+        ↓
+Load Template
+        ↓
+Replace Placeholders
+        ↓
+Filter by KServe Mode
+        ↓
+Apply to Kubernetes
+        ↓
+Return Service URL: gfm-amo-{MODEL_NAME}.{NAMESPACE}.svc.cluster.local
+```
+
+## Troubleshooting
+
+### Issue: Pods not starting
+
+Check:
+1. GPU availability: `kubectl describe node | grep nvidia.com/gpu`
+2. Image pull secrets: `kubectl get secret {IMAGE_PULL_SECRET} -n {NAMESPACE}`
+3. PVC status: `kubectl get pvc -n {NAMESPACE}`
+
+### Issue: KServe InferenceService not working
+
+Verify:
+1. 
KServe is installed: `kubectl get crd inferenceservices.serving.kserve.io` +2. Knative Serving is running: `kubectl get pods -n knative-serving` +3. Check InferenceService status: `kubectl describe inferenceservice gfm-amo-{MODEL_NAME} -n {NAMESPACE}` + +### Issue: Service not accessible + +Check: +1. Service exists: `kubectl get svc gfm-amo-{MODEL_NAME} -n {NAMESPACE}` +2. Endpoints are ready: `kubectl get endpoints gfm-amo-{MODEL_NAME} -n {NAMESPACE}` +3. Pod is running: `kubectl get pods -n {NAMESPACE} -l app.kubernetes.io/name=gfm-amo-{MODEL_NAME}` diff --git a/gfmstudio/amo/a-model-01-name-chart/deploy_model.py b/gfmstudio/amo/a-model-01-name-chart/deploy_model.py new file mode 100644 index 0000000..ec021df --- /dev/null +++ b/gfmstudio/amo/a-model-01-name-chart/deploy_model.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python3 +""" +© Copyright IBM Corporation 2026 +SPDX-License-Identifier: Apache-2.0 + +Automated Model Deployment Script +This script demonstrates how to deploy models using the vllm-inference-server template. +""" + +import argparse +import subprocess +import sys +from pathlib import Path +from typing import Dict, Optional + + +def load_template(template_path: str) -> str: + """Load the YAML template file.""" + with open(template_path, "r") as f: + return f.read() + + +def replace_placeholders(template: str, config: Dict[str, str]) -> str: + """Replace all placeholders in the template with actual values.""" + result = template + for key, value in config.items(): + placeholder = f"${{{key}}}" + result = result.replace(placeholder, str(value)) + return result + + +def filter_by_kserve_mode(yaml_content: str, enable_kserve: bool) -> str: + """ + Filter the YAML content based on KServe mode. + Remove Deployment if KServe is enabled, or remove InferenceService if disabled. 
+ """ + lines = yaml_content.split("\n") + filtered_lines = [] + skip_section = False + current_section = None + + for line in lines: + # Detect section starts + if line.startswith("# Deployment (used when ENABLE_KSERVE=false)"): + current_section = "deployment" + skip_section = enable_kserve # Skip if KServe is enabled + elif line.startswith("# InferenceService (used when ENABLE_KSERVE=true)"): + current_section = "inferenceservice" + skip_section = not enable_kserve # Skip if KServe is disabled + elif line.startswith("---") and current_section: + # End of section + current_section = None + skip_section = False + + # Add line if not skipping + if not skip_section: + filtered_lines.append(line) + + return "\n".join(filtered_lines) + + +def deploy_model( + model_name: str, + namespace: str, + enable_kserve: bool = False, + image_repository: str = "us.icr.io/gfmaas/vllm-small", + image_tag: str = "v0.0.6", + image_pull_secret: str = "my-registry-secret", + service_account: str = "default", + models_pvc: str = "vllm-models-pvc", + inference_shared_pvc: str = "inference-shared-pvc", + gpu_count: str = "1", + cpu_limit: str = "2000m", + memory_limit: str = "8Gi", + cpu_request: str = "1000m", + memory_request: str = "4Gi", + dry_run: bool = False, + output_file: Optional[str] = None, +) -> bool: + """ + Deploy a model using the template. 
+ + Args: + model_name: Name of the model (will be prefixed with gfm-amo-) + namespace: Kubernetes namespace + enable_kserve: Whether to use KServe InferenceService + image_repository: Container image repository + image_tag: Container image tag + image_pull_secret: Name of image pull secret + service_account: Service account name + models_pvc: PVC name for models storage + inference_shared_pvc: PVC name for shared inference data + gpu_count: Number of GPUs to request + cpu_limit: CPU limit + memory_limit: Memory limit + cpu_request: CPU request + memory_request: Memory request + dry_run: If True, only generate YAML without applying + output_file: If provided, save generated YAML to this file + + Returns: + True if successful, False otherwise + """ + # Configuration dictionary + config = { + "MODEL_NAME": model_name, + "NAMESPACE": namespace, + "ENABLE_KSERVE": str(enable_kserve).lower(), + "IMAGE_REPOSITORY": image_repository, + "IMAGE_TAG": image_tag, + "IMAGE_PULL_SECRET": image_pull_secret, + "SERVICE_ACCOUNT": service_account, + "MODELS_PVC": models_pvc, + "INFERENCE_SHARED_PVC": inference_shared_pvc, + "GPU_COUNT": gpu_count, + "CPU_LIMIT": cpu_limit, + "MEMORY_LIMIT": memory_limit, + "CPU_REQUEST": cpu_request, + "MEMORY_REQUEST": memory_request, + } + + # Load template + template_path = Path(__file__).parent / "vllm-inference-server-template.yaml" + if not template_path.exists(): + print(f"Error: Template file not found at {template_path}", file=sys.stderr) + return False + + template = load_template(str(template_path)) + + # Replace placeholders + yaml_content = replace_placeholders(template, config) + + # Filter based on KServe mode + yaml_content = filter_by_kserve_mode(yaml_content, enable_kserve) + + # Save to file if requested + if output_file: + with open(output_file, "w") as f: + f.write(yaml_content) + print(f"Generated YAML saved to: {output_file}") + + # Print or apply + if dry_run: + print("=" * 80) + print("DRY RUN - Generated YAML:") + 
print("=" * 80) + print(yaml_content) + print("=" * 80) + deployment_type = ( + "KServe InferenceService" if enable_kserve else "Standard Deployment" + ) + print(f"\nDeployment type: {deployment_type}") + print(f"Service name: gfm-amo-{model_name}") + print(f"Namespace: {namespace}") + return True + else: + # Apply to Kubernetes + try: + result = subprocess.run( + ["kubectl", "apply", "-f", "-"], + input=yaml_content.encode(), + capture_output=True, + check=True, + ) + print(result.stdout.decode()) + deployment_type = ( + "KServe InferenceService" if enable_kserve else "Standard Deployment" + ) + print( + f"\n✓ Successfully deployed {deployment_type} for model: gfm-amo-{model_name}" + ) + return True + except subprocess.CalledProcessError as e: + print(f"Error deploying model: {e.stderr.decode()}", file=sys.stderr) + return False + + +def main(): + parser = argparse.ArgumentParser( + description="Deploy GFM inference server for a model", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Deploy with standard Deployment + python deploy_model.py my-model --namespace geospatial-studio --dry-run + + # Deploy with KServe InferenceService + python deploy_model.py my-model --namespace geospatial-studio --enable-kserve --dry-run + + # Deploy with custom resources + python deploy_model.py my-model --namespace geospatial-studio \\ + --gpu-count 2 --memory-limit 16Gi --cpu-limit 4000m + + # Generate YAML file without deploying + python deploy_model.py my-model --namespace geospatial-studio \\ + --dry-run --output my-model-deployment.yaml + """, + ) + + parser.add_argument("model_name", help="Name of the model to deploy") + parser.add_argument("--namespace", required=True, help="Kubernetes namespace") + parser.add_argument( + "--enable-kserve", + action="store_true", + help="Use KServe InferenceService instead of standard Deployment", + ) + parser.add_argument( + "--image-repository", + default="us.icr.io/gfmaas/vllm-small", + help="Container 
image repository", + ) + parser.add_argument("--image-tag", default="v0.0.6", help="Container image tag") + parser.add_argument( + "--image-pull-secret", + default="my-registry-secret", + help="Name of image pull secret", + ) + parser.add_argument( + "--service-account", default="default", help="Service account name" + ) + parser.add_argument( + "--models-pvc", default="vllm-models-pvc", help="PVC name for models storage" + ) + parser.add_argument( + "--inference-shared-pvc", + default="inference-shared-pvc", + help="PVC name for shared inference data", + ) + parser.add_argument("--gpu-count", default="1", help="Number of GPUs to request") + parser.add_argument("--cpu-limit", default="2000m", help="CPU limit") + parser.add_argument("--memory-limit", default="8Gi", help="Memory limit") + parser.add_argument("--cpu-request", default="1000m", help="CPU request") + parser.add_argument("--memory-request", default="4Gi", help="Memory request") + parser.add_argument( + "--dry-run", + action="store_true", + help="Generate YAML without applying to cluster", + ) + parser.add_argument("--output", help="Save generated YAML to file") + + args = parser.parse_args() + + success = deploy_model( + model_name=args.model_name, + namespace=args.namespace, + enable_kserve=args.enable_kserve, + image_repository=args.image_repository, + image_tag=args.image_tag, + image_pull_secret=args.image_pull_secret, + service_account=args.service_account, + models_pvc=args.models_pvc, + inference_shared_pvc=args.inference_shared_pvc, + gpu_count=args.gpu_count, + cpu_limit=args.cpu_limit, + memory_limit=args.memory_limit, + cpu_request=args.cpu_request, + memory_request=args.memory_request, + dry_run=args.dry_run, + output_file=args.output, + ) + + sys.exit(0 if success else 1) + + +if __name__ == "__main__": + main() diff --git a/gfmstudio/amo/a-model-01-name-chart/vllm-inference-server-template.yaml b/gfmstudio/amo/a-model-01-name-chart/vllm-inference-server-template.yaml new file mode 100644 
index 0000000..59ae2ba --- /dev/null +++ b/gfmstudio/amo/a-model-01-name-chart/vllm-inference-server-template.yaml @@ -0,0 +1,208 @@ +# © Copyright IBM Corporation 2026 +# SPDX-License-Identifier: Apache-2.0 + +# GFM Inference Server - Automated Deployment Template +# This template is designed for automated model deployment services +# +# PLACEHOLDERS TO REPLACE: +# - ${MODEL_NAME} - Name of the model being deployed (e.g., "my-model") +# - ${NAMESPACE} - Target Kubernetes namespace +# - ${ENABLE_KSERVE} - "true" or "false" to toggle KServe InferenceService +# - ${IMAGE_REPOSITORY} - Container image repository +# - ${IMAGE_TAG} - Container image tag +# - ${IMAGE_PULL_SECRET} - Name of the image pull secret +# - ${SERVICE_ACCOUNT} - Service account name +# - ${MODELS_PVC} - PVC name for models storage +# - ${INFERENCE_SHARED_PVC} - PVC name for shared inference data +# - ${GPU_COUNT} - Number of GPUs to request (e.g., "1") +# - ${CPU_LIMIT} - CPU limit (e.g., "2000m") +# - ${MEMORY_LIMIT} - Memory limit (e.g., "8Gi") +# - ${CPU_REQUEST} - CPU request (e.g., "1000m") +# - ${MEMORY_REQUEST} - Memory request (e.g., "4Gi") +# +# Service names will be prefixed with "gfm-amo-" automatically + +--- +# Service +apiVersion: v1 +kind: Service +metadata: + name: gfm-amo-${MODEL_NAME} + namespace: ${NAMESPACE} + labels: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + app.kubernetes.io/component: inference-server + app.kubernetes.io/managed-by: automated-deployment +spec: + type: ClusterIP + ports: + - port: 80 + targetPort: http + protocol: TCP + name: http + selector: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + +--- +# Deployment (used when ENABLE_KSERVE=false) +# CONDITIONAL: Only deploy if ${ENABLE_KSERVE} == "false" +apiVersion: apps/v1 +kind: Deployment +metadata: + name: gfm-amo-${MODEL_NAME} + namespace: ${NAMESPACE} + labels: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + app.kubernetes.io/component: inference-server + app.kubernetes.io/managed-by: 
automated-deployment + annotations: + deployment.type: "standard" +spec: + replicas: 1 + strategy: + type: Recreate + selector: + matchLabels: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + template: + metadata: + labels: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + app.kubernetes.io/component: inference-server + app.kubernetes.io/managed-by: automated-deployment + spec: + imagePullSecrets: + - name: ${IMAGE_PULL_SECRET} + serviceAccountName: ${SERVICE_ACCOUNT} + affinity: + nodeAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + preference: + matchExpressions: + - key: nvidia.com/gpu + operator: Exists + initContainers: + - name: setup-directory + image: busybox + command: ['sh', '-c', 'mkdir -p /data/outputs && chmod 777 /data/outputs'] + volumeMounts: + - name: inference-shared-pvc + mountPath: /data + containers: + - name: inference-server + image: "${IMAGE_REPOSITORY}:${IMAGE_TAG}" + imagePullPolicy: Always + ports: + - name: http + containerPort: 8000 + protocol: TCP + env: + - name: MODELS_PATH + value: "/models" + - name: HF_HOME + value: "/models/huggingface_cache" + - name: TERRATORCH_SEGMENTATION_IO_PROCESSOR_CONFIG + value: "{\"output_path\": \"/data/outputs/\"}" + - name: MODEL_NAME + value: "${MODEL_NAME}" + volumeMounts: + - name: models-storage + mountPath: /models + - name: inference-shared-pvc + mountPath: /data + resources: + limits: + cpu: "${CPU_LIMIT}" + memory: "${MEMORY_LIMIT}" + nvidia.com/gpu: "${GPU_COUNT}" + requests: + cpu: "${CPU_REQUEST}" + memory: "${MEMORY_REQUEST}" + nvidia.com/gpu: "${GPU_COUNT}" + volumes: + - name: models-storage + persistentVolumeClaim: + claimName: ${MODELS_PVC} + - name: inference-shared-pvc + persistentVolumeClaim: + claimName: ${INFERENCE_SHARED_PVC} + +--- +# InferenceService (used when ENABLE_KSERVE=true) +# CONDITIONAL: Only deploy if ${ENABLE_KSERVE} == "true" +# Provides KServe integration with scale-to-zero capabilities +apiVersion: serving.kserve.io/v1beta1 +kind: 
InferenceService +metadata: + name: gfm-amo-${MODEL_NAME} + namespace: ${NAMESPACE} + labels: + app.kubernetes.io/name: gfm-amo-${MODEL_NAME} + app.kubernetes.io/component: inference-server + app.kubernetes.io/managed-by: automated-deployment + annotations: + deployment.type: "kserve" +spec: + predictor: + minReplicas: 0 + maxReplicas: 3 + scaleTarget: 100 + scaleMetric: concurrency + containerConcurrency: 0 + timeout: 600 + containers: + - name: kserve-container + image: "${IMAGE_REPOSITORY}:${IMAGE_TAG}" + imagePullPolicy: Always + ports: + - name: http + containerPort: 8000 + protocol: TCP + env: + - name: MODELS_PATH + value: "/models" + - name: HF_HOME + value: "/models/huggingface_cache" + - name: TERRATORCH_SEGMENTATION_IO_PROCESSOR_CONFIG + value: "{\"output_path\": \"/data/outputs/\"}" + - name: MODEL_NAME + value: "${MODEL_NAME}" + volumeMounts: + - name: models-storage + mountPath: /models + - name: inference-shared-pvc + mountPath: /data + resources: + limits: + cpu: "${CPU_LIMIT}" + memory: "${MEMORY_LIMIT}" + nvidia.com/gpu: "${GPU_COUNT}" + requests: + cpu: "${CPU_REQUEST}" + memory: "${MEMORY_REQUEST}" + nvidia.com/gpu: "${GPU_COUNT}" + initContainers: + - name: setup-directory + image: busybox + command: ['sh', '-c', 'mkdir -p /data/outputs && chmod 777 /data/outputs'] + volumeMounts: + - name: inference-shared-pvc + mountPath: /data + volumes: + - name: models-storage + persistentVolumeClaim: + claimName: ${MODELS_PVC} + - name: inference-shared-pvc + persistentVolumeClaim: + claimName: ${INFERENCE_SHARED_PVC} + affinity: + nodeAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + preference: + matchExpressions: + - key: nvidia.com/gpu + operator: Exists + imagePullSecrets: + - name: ${IMAGE_PULL_SECRET}
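+
+# ---------------------------------------------------------------------------
+# Illustrative usage (assumes the companion deploy_model.py in this
+# directory; the model name "flood-demo" is a hypothetical example):
+#
+#   python deploy_model.py flood-demo --namespace geospatial-studio --dry-run
+#
+# This renders the template and prints the result without applying it; the
+# generated resources are named "gfm-amo-flood-demo".
+# ---------------------------------------------------------------------------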