huggingface · naveenkumar982 · Apr 4, 2026
diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml
@@ -114,6 +114,9 @@ jobs:
           - name: openspiel-env
             dockerfile: envs/openspiel_env/server/Dockerfile
             context: envs/openspiel_env
+          - name: cloud-sre-env
+            dockerfile: envs/cloud_sre_env/server/Dockerfile
+            context: .
 
     steps:
       - name: Checkout code

diff --git a/envs/cloud_sre_env/README.md b/envs/cloud_sre_env/README.md
@@ -0,0 +1,138 @@
+# Cloud SRE & FinOps Environment
+
+An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment for training and evaluating AI agents on **Cloud Site Reliability Engineering (SRE)** and **Financial Operations (FinOps)** tasks.
+
+The agent manages a simulated cloud infrastructure: diagnosing outages, terminating idle resources, scaling services, and optimizing costs — all without causing collateral damage to production workloads.
+
+## Features
+
+| Feature | Description |
+|---|---|
+| **3 Difficulty Tiers** | Easy → Medium → Hard tasks covering cost optimization, scaling, and incident response |
+| **Deterministic Grading** | Fine-grained scoring breakdowns for agent debugging |
+| **Seeded Procedural Gen** | Reproducible yet varied infrastructure layouts for RL training |
+| **Chaos Injection** | Random cost spikes, CPU anomalies, and spurious alerts |
+| **No External Services** | No API keys or cloud accounts needed; the infrastructure is fully simulated |
+
+## Tasks
+
+### 1. Phantom Volume Cleanup (Easy)
+- **Goal:** Find and terminate unattached EBS volumes wasting money
+- **Trap:** Do NOT touch running EC2 instances or in-use volumes
+- **Scoring:** +1/N per orphan terminated, −0.5 per active resource destroyed
+
+### 2. Latency Spike Remediation (Medium)
+- **Goal:** Scale up an under-provisioned RDS instance to fix high API latency
+- **Trap:** Stay within the budget limit
+- **Scoring:** 40% RDS scaled + 30% latency resolved + 30% under budget
+
+### 3. Noisy Neighbor Incident (Hard)
+- **Goal:** Inspect a rogue test instance, terminate it, then reboot the crashed production backend
+- **Trap:** Don't terminate production infrastructure
+- **Scoring:** 20% inspect + 30% terminate rogue + 30% reboot backend + 20% alerts resolved
+
+## Quick Start
+
+### From Docker
+```bash
+# Build base image
+docker build -t openenv-base:latest -f src/openenv/core/containers/images/Dockerfile .
+
+# Build Cloud SRE environment
+docker build -t cloud-sre-env:latest -f envs/cloud_sre_env/server/Dockerfile .
+```
+
+### Using the Client
+```python
+from envs.cloud_sre_env import SREAction, CloudSREEnv
+
+# Start a container from the image and connect to it
+client = CloudSREEnv.from_docker_image("cloud-sre-env:latest")
+
+# Start a new episode
+result = client.reset()
+
+# Take actions
+result = client.step(SREAction(command="inspect", resource_id="ec2-web-001"))
+result = client.step(SREAction(command="terminate", resource_id="ebs-orphan-001"))
+
+# Get state
+state = client.state()
+print(f"Step: {state.current_step}, Reward: {state.cumulative_reward}")
+
+# Cleanup
+client.close()
+```
+
+### Direct Python Usage (no Docker)
+```python
+from cloud_sre_env.server.cloud_sre_environment import CloudSREEnvironment
+from cloud_sre_env.models import SREAction
+
+env = CloudSREEnvironment()
+
+# Reset with seeded procedural generation
+obs = env.reset(task_id="phantom_volume_cleanup", seed=42)
+print(f"Resources: {len(obs.resources)}, Alerts: {len(obs.alerts)}")
+
+# Agent loop (placeholder policy; a real agent would choose each action from obs)
+for step in range(15):
+    action = SREAction(command="terminate", resource_id="ebs-orphan-001")
+    obs = env.step(action)
+    if env.state.done:
+        break
+
+# Grade the agent
+score, breakdown = env.grade()
+print(f"Score: {score}, Breakdown: {breakdown}")
+```
+
+## Action Space
+
+| Command | Description | Parameters |
+|---------|-------------|------------|
+| `terminate` | Remove a resource permanently | `resource_id` |
+| `scale` | Change instance size | `resource_id`, `params.target_size` |
+| `reboot` | Restart a stopped/running instance | `resource_id` |
+| `inspect` | View detailed resource info (no side-effects) | `resource_id` |
+| `wait` | Do nothing for this step | — |
+
+## Observation Space
+
+```json
+{
+  "resources": [{"id": "ec2-web-001", "type": "ec2_instance", "status": "running", ...}],
+  "alerts": [{"alert_id": "alert-cost-001", "severity": "warning", "message": "..."}],
+  "total_hourly_cost": 4.52,
+  "system_uptime": 78.0,
+  "budget_limit": 12.00,
+  "task_description": "Your cloud account has unattached EBS volumes..."
+}
+```
+
+## Project Structure
+
+```
+cloud_sre_env/
+├── __init__.py                          # Exports SREAction, SREObservation, SREState, CloudSREEnv
+├── models.py                            # Typed models (Pydantic Action/Observation/State, helper dataclasses)
+├── client.py                            # EnvClient subclass for HTTP/Docker communication
+├── openenv.yaml                         # Environment manifest
+├── README.md                            # This file
+└── server/
+    ├── __init__.py
+    ├── cloud_sre_environment.py         # Core environment logic + 3 tasks + grading
+    ├── app.py                           # FastAPI application
+    ├── Dockerfile                       # Container image
+    └── requirements.txt                 # Server dependencies
+```
+
+## Testing
+
+```bash
+pytest tests/test_cloud_sre_environment.py -v
+```
+
+## License
+
+BSD-3-Clause — same as OpenEnv.
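Review note on the README's Task 1 scoring rule ("+1/N per orphan terminated, −0.5 per active resource destroyed"): the grading code itself lives in `cloud_sre_environment.py`, which is not part of the diff shown here, so the following is only a minimal standalone sketch of how that rule could be read against the documented observation format. The helper names `find_orphan_volumes` and `phantom_cleanup_score`, the resource IDs `ebs-data-001` and `ebs-orphan-002`, the clamping of the result to [0, 1], and the choice to count every terminated non-orphan as an "active resource" are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def find_orphan_volumes(resources: List[Dict]) -> List[str]:
    """Return IDs of EBS volumes whose status is 'available' (unattached)."""
    return [
        r["id"]
        for r in resources
        if r.get("type") == "ebs_volume" and r.get("status") == "available"
    ]


def phantom_cleanup_score(resources: List[Dict], terminated_ids: Iterable[str]) -> float:
    """Hypothetical reading of the Task 1 rule: +1/N per orphan, -0.5 per other resource."""
    orphans = set(find_orphan_volumes(resources))
    if not orphans:
        return 0.0
    known_ids = {r["id"] for r in resources}
    terminated = set(terminated_ids) & known_ids  # ignore IDs that do not exist
    hits = len(terminated & orphans)
    collateral = len(terminated - orphans)
    raw = hits / len(orphans) - 0.5 * collateral
    # Assumption: the final score is clamped to the range [0, 1].
    return max(0.0, min(1.0, raw))


# Illustrative observation fragment (IDs partly invented for this sketch).
resources = [
    {"id": "ec2-web-001", "type": "ec2_instance", "status": "running"},
    {"id": "ebs-data-001", "type": "ebs_volume", "status": "in-use"},
    {"id": "ebs-orphan-001", "type": "ebs_volume", "status": "available"},
    {"id": "ebs-orphan-002", "type": "ebs_volume", "status": "available"},
]

print(find_orphan_volumes(resources))
print(phantom_cleanup_score(resources, ["ebs-orphan-001", "ebs-orphan-002"]))
```

Under this reading, terminating one orphan together with the running EC2 instance cancels out to zero, which matches the README's stated intent of penalizing collateral damage.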
diff --git a/envs/cloud_sre_env/__init__.py b/envs/cloud_sre_env/__init__.py
@@ -0,0 +1,12 @@
+"""
+Cloud SRE & FinOps Environment for OpenEnv.
+
+An OpenEnv-compliant environment simulating Cloud SRE operations:
+diagnosing outages, terminating idle resources, scaling services,
+and optimizing costs without causing collateral damage.
+"""
+
+from .models import SREAction, SREObservation, SREState
+from .client import CloudSREEnv
+
+__all__ = ["SREAction", "SREObservation", "SREState", "CloudSREEnv"]
diff --git a/envs/cloud_sre_env/client.py b/envs/cloud_sre_env/client.py
@@ -0,0 +1,65 @@
+"""
+Cloud SRE OpenEnv Client.
+
+Implements EnvClient for communicating with the Cloud SRE environment
+via WebSocket/HTTP when deployed as a Docker container or HF Space.
+"""
+
+from openenv.core import EnvClient, StepResult
+from .models import SREAction, SREObservation, SREState
+
+
+class CloudSREEnv(EnvClient[SREAction, SREObservation, SREState]):
+    """
+    Client for the Cloud SRE & FinOps environment.
+
+    Usage (async):
+        async with CloudSREEnv(base_url="https://your-space.hf.space") as client:
+            result = await client.reset()
+            result = await client.step(SREAction(command="inspect", resource_id="ec2-web-001"))
+
+    Usage (sync):
+        with CloudSREEnv(base_url="...").sync() as client:
+            result = client.reset()
+            result = client.step(SREAction(command="terminate", resource_id="ebs-orphan-001"))
+    """
+
+    def _step_payload(self, action: SREAction) -> dict:
+        """Serialize an SREAction into the JSON payload for the server."""
+        payload = {"command": action.command}
+        if action.resource_id:
+            payload["resource_id"] = action.resource_id
+        if action.params:
+            payload["params"] = action.params
+        return payload
+
+    def _parse_result(self, payload: dict) -> StepResult[SREObservation]:
+        """Deserialize the server response into a typed StepResult."""
+        obs_data = payload.get("observation", {})
+        obs = SREObservation(
+            resources=obs_data.get("resources", []),
+            alerts=obs_data.get("alerts", []),
+            total_hourly_cost=obs_data.get("total_hourly_cost", 0.0),
+            system_uptime=obs_data.get("system_uptime", 100.0),
+            step_number=obs_data.get("step_number", 0),
+            max_steps=obs_data.get("max_steps", 15),
+            budget_limit=obs_data.get("budget_limit"),
+            task_description=obs_data.get("task_description", ""),
+        )
+        return StepResult(
+            observation=obs,
+            reward=payload.get("reward", 0.0),
+            done=payload.get("done", False),
+        )
+
+    def _parse_state(self, payload: dict) -> SREState:
+        """Deserialize the server state response into a typed SREState."""
+        return SREState(
+            episode_id=payload.get("episode_id", ""),
+            step_count=payload.get("step_count", 0),
+            task_id=payload.get("task_id", ""),
+            current_step=payload.get("current_step", 0),
+            done=payload.get("done", False),
+            cumulative_reward=payload.get("cumulative_reward", 0.0),
+            action_count=payload.get("action_count", 0),
+        )
diff --git a/envs/cloud_sre_env/models.py b/envs/cloud_sre_env/models.py
@@ -0,0 +1,124 @@
+"""
+Typed models for the Cloud SRE OpenEnv environment.
+
+Follows OpenEnv conventions: Action, Observation, and State are Pydantic models
+inheriting from openenv.core base classes; ResourceInfo and AlertInfo are dataclasses.
+"""
+
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional
+from enum import Enum
+
+from openenv.core.env_server import Action, Observation, State
+
+
+# ─── Enums ────────────────────────────────────────────────────────────────────
+
+
+class ResourceType(str, Enum):
+    """Types of cloud resources available in the simulation."""
+    EC2 = "ec2_instance"
+    RDS = "rds_database"
+    EBS = "ebs_volume"
+    ALB = "alb_load_balancer"
+
+
+class ResourceStatus(str, Enum):
+    """Possible statuses for a cloud resource."""
+    RUNNING = "running"
+    STOPPED = "stopped"
+    AVAILABLE = "available"       # EBS: unattached
+    IN_USE = "in-use"            # EBS: attached
+    REBOOTING = "rebooting"
+    TERMINATED = "terminated"
+
+
+class AlertSeverity(str, Enum):
+    """Severity levels for monitoring alerts."""
+    INFO = "info"
+    WARNING = "warning"
+    CRITICAL = "critical"
+
+
+class ActionCommand(str, Enum):
+    """Discrete commands the agent can issue."""
+    TERMINATE = "terminate"
+    SCALE = "scale"
+    REBOOT = "reboot"
+    INSPECT = "inspect"
+    WAIT = "wait"
+
+
+# ─── Data Structures ─────────────────────────────────────────────────────────
+
+
+@dataclass
+class ResourceInfo:
+    """Represents a single cloud resource (EC2, RDS, EBS, ALB)."""
+    id: str
+    name: str = ""
+    type: str = ""            # ResourceType value
+    status: str = ""          # ResourceStatus value
+    instance_size: str = ""   # e.g., "t3.micro", "db.t3.medium"
+    cpu_utilization: float = 0.0
+    memory_utilization: float = 0.0
+    cost_per_hour: float = 0.0
+    attached_to: Optional[str] = None
+    tags: Dict[str, str] = field(default_factory=dict)
+
+
+@dataclass
+class AlertInfo:
+    """Represents an active monitoring alert."""
+    alert_id: str
+    severity: str = "info"
+    message: str = ""
+    resource_id: Optional[str] = None
+    metric_name: Optional[str] = None
+    metric_value: Optional[float] = None
+
+
+# ─── OpenEnv Action / Observation / State ────────────────────────────────────
+
+
+from pydantic import Field  # default_factory for mutable defaults on the models below
+
+class SREAction(Action):
+    """
+    An action the agent submits to step().
+
+    Attributes:
+        command: One of 'terminate', 'scale', 'reboot', 'inspect', 'wait'.
+        resource_id: The target resource ID (optional for 'wait').
+        params: Additional parameters, e.g. {"target_size": "db.t3.medium"}.
+    """
+    command: str = "wait"
+    resource_id: Optional[str] = None
+    params: Dict[str, str] = Field(default_factory=dict)
+
+
+class SREObservation(Observation):
+    """
+    The full observation returned by reset() and step().
+    Contains everything the agent can see about the infrastructure.
+    """
+    resources: List[dict] = Field(default_factory=list)
+    alerts: List[dict] = Field(default_factory=list)
+    total_hourly_cost: float = 0.0
+    system_uptime: float = 100.0
+    step_number: int = 0
+    max_steps: int = 15
+    budget_limit: Optional[float] = None
+    task_description: str = ""
+
+
+class SREState(State):
+    """
+    Episode state tracking for the Cloud SRE environment.
+    Extends the base State with SRE-specific metadata.
+    """
+    task_id: str = ""
+    current_step: int = 0
+    done: bool = False
+    cumulative_reward: float = 0.0
+    action_count: int = 0
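Review note on `SREAction.command`: the field is declared as a plain `str`, so the model as shown does not by itself reject values outside `ActionCommand` (whether the server validates them is not visible in this diff). One possible way to normalize and check a raw command string against the enum is sketched below. `parse_command` is a hypothetical helper, not part of the PR, and the enum is redeclared only so the snippet runs on its own.

```python
from enum import Enum


class ActionCommand(str, Enum):
    """Copy of the enum from models.py, repeated here to keep the sketch standalone."""
    TERMINATE = "terminate"
    SCALE = "scale"
    REBOOT = "reboot"
    INSPECT = "inspect"
    WAIT = "wait"


def parse_command(raw: str) -> ActionCommand:
    """Normalize a raw command string and map it to an ActionCommand member."""
    try:
        return ActionCommand(raw.strip().lower())
    except ValueError:
        valid = ", ".join(c.value for c in ActionCommand)
        raise ValueError(f"unknown command {raw!r}; expected one of: {valid}") from None


print(parse_command(" Terminate ").value)
# → terminate
```

Because `ActionCommand` subclasses `str`, a parsed member still compares equal to its string value, so it could be passed wherever the existing code expects the raw command.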