3 changes: 3 additions & 0 deletions .github/workflows/docker-build.yml
@@ -114,6 +114,9 @@ jobs:
- name: openspiel-env
  dockerfile: envs/openspiel_env/server/Dockerfile
  context: envs/openspiel_env
- name: cloud-sre-env
  dockerfile: envs/cloud_sre_env/server/Dockerfile
  context: .

steps:
- name: Checkout code
138 changes: 138 additions & 0 deletions envs/cloud_sre_env/README.md
@@ -0,0 +1,138 @@
# Cloud SRE & FinOps Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment for training and evaluating AI agents on **Cloud Site Reliability Engineering (SRE)** and **Financial Operations (FinOps)** tasks.

The agent manages a simulated cloud infrastructure: diagnosing outages, terminating idle resources, scaling services, and optimizing costs — all without causing collateral damage to production workloads.

## Features

| Feature | Description |
|---|---|
| **3 Difficulty Tiers** | Easy → Medium → Hard tasks covering cost optimization, scaling, and incident response |
| **Deterministic Grading** | Fine-grained scoring breakdowns for agent debugging |
| **Seeded Procedural Gen** | Reproducible yet varied infrastructure layouts for RL training |
| **Chaos Injection** | Random cost spikes, CPU anomalies, and spurious alerts |
| **Zero Dependencies** | No external API keys or cloud accounts needed |
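
The "Seeded Procedural Gen" row can be illustrated with a minimal sketch. This is not the environment's actual generator: `generate_layout`, its ID scheme, and the cost range are hypothetical. Only the property it demonstrates (the same seed yields the same layout) comes from the table above.

```python
import random
from typing import Dict, List


def generate_layout(seed: int, n_volumes: int = 5) -> List[Dict]:
    """Hypothetical generator: build a list of EBS volumes from a seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    layout = []
    for i in range(n_volumes):
        attached = rng.random() < 0.5
        layout.append({
            "id": f"ebs-{i:03d}",  # hypothetical ID scheme
            "type": "ebs_volume",
            "status": "in-use" if attached else "available",
            "cost_per_hour": round(rng.uniform(0.01, 0.20), 2),
        })
    return layout


# The same seed reproduces the same layout, which is what makes episodes replayable.
assert generate_layout(42) == generate_layout(42)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the generator independent of any other randomness in the training loop.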

## Tasks

### 1. Phantom Volume Cleanup (Easy)
- **Goal:** Find and terminate unattached EBS volumes wasting money
- **Trap:** Do NOT touch running EC2 instances or in-use volumes
- **Scoring:** +1/N per orphan terminated, −0.5 per active resource destroyed
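
As a rough illustration of the scoring rule above, the sketch below computes the Task 1 score from three counts. The function name and arguments are hypothetical, and clamping the result to [0, 1] is an assumption; the README does not say how negative totals are handled.

```python
def phantom_cleanup_score(orphans_terminated: int, total_orphans: int,
                          active_destroyed: int) -> float:
    """+1/N per orphan terminated, -0.5 per active resource destroyed."""
    if total_orphans <= 0:
        raise ValueError("total_orphans must be positive")
    raw = orphans_terminated / total_orphans - 0.5 * active_destroyed
    # Assumption: clamp to [0, 1]; the actual grader may behave differently.
    return max(0.0, min(1.0, raw))


print(phantom_cleanup_score(3, 4, 0))  # 0.75
print(phantom_cleanup_score(4, 4, 1))  # 0.5
```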

### 2. Latency Spike Remediation (Medium)
- **Goal:** Scale an under-provisioned RDS to fix high API latency
- **Trap:** Stay within the budget limit
- **Scoring:** 40% RDS scaled + 30% latency resolved + 30% under budget

### 3. Noisy Neighbor Incident (Hard)
- **Goal:** Investigate a rogue test instance, terminate it, reboot crashed production
- **Trap:** Don't terminate production infrastructure
- **Scoring:** 20% inspect + 30% terminate rogue + 30% reboot backend + 20% alerts resolved
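
Tasks 2 and 3 are both scored as weighted checklists. The sketch below shows one way such a score and its breakdown could be computed. The weights are copied from the scoring lines above; the key names, the `weighted_score` helper, and the all-or-nothing treatment of each criterion are assumptions, and the real grader may award partial credit.

```python
from typing import Dict, Tuple

# Weights from the Task 2 and Task 3 scoring lines; key names are hypothetical.
LATENCY_WEIGHTS = {"rds_scaled": 0.4, "latency_resolved": 0.3, "under_budget": 0.3}
NOISY_WEIGHTS = {
    "inspected": 0.2,
    "rogue_terminated": 0.3,
    "backend_rebooted": 0.3,
    "alerts_resolved": 0.2,
}


def weighted_score(weights: Dict[str, float],
                   achieved: Dict[str, bool]) -> Tuple[float, Dict[str, float]]:
    """Return (total, per-criterion breakdown) for a binary checklist."""
    breakdown = {
        name: (weight if achieved.get(name, False) else 0.0)
        for name, weight in weights.items()
    }
    # Round to hide floating-point noise from summing decimal weights.
    return round(sum(breakdown.values()), 2), breakdown


score, breakdown = weighted_score(
    LATENCY_WEIGHTS, {"rds_scaled": True, "latency_resolved": True}
)
print(score)      # 0.7
print(breakdown)  # {'rds_scaled': 0.4, 'latency_resolved': 0.3, 'under_budget': 0.0}
```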

## Quick Start

### From Docker
```bash
# Build base image
docker build -t openenv-base:latest -f src/openenv/core/containers/images/Dockerfile .

# Build Cloud SRE environment
docker build -t cloud-sre-env:latest -f envs/cloud_sre_env/server/Dockerfile .
```

### Using the Client
```python
from envs.cloud_sre_env import SREAction, CloudSREEnv

# Connect to the Docker container
client = CloudSREEnv.from_docker_image("cloud-sre-env:latest")

# Reset to a task
result = client.reset()

# Take actions
result = client.step(SREAction(command="inspect", resource_id="ec2-web-001"))
result = client.step(SREAction(command="terminate", resource_id="ebs-orphan-001"))

# Get state
state = client.state()
print(f"Step: {state.current_step}, Reward: {state.cumulative_reward}")

# Cleanup
client.close()
```

### Direct Python Usage (no Docker)
```python
from cloud_sre_env.server.cloud_sre_environment import CloudSREEnvironment
from cloud_sre_env.models import SREAction

env = CloudSREEnvironment()

# Reset with seeded procedural generation
obs = env.reset(task_id="phantom_volume_cleanup", seed=42)
print(f"Resources: {len(obs.resources)}, Alerts: {len(obs.alerts)}")

# Agent loop
for step in range(15):
    action = SREAction(command="terminate", resource_id="ebs-orphan-001")
    obs = env.step(action)
    if env.state.done:
        break

# Grade the agent
score, breakdown = env.grade()
print(f"Score: {score}, Breakdown: {breakdown}")
```

## Action Space

| Command | Description | Parameters |
|---------|-------------|------------|
| `terminate` | Remove a resource permanently | `resource_id` |
| `scale` | Change instance size | `resource_id`, `params.target_size` |
| `reboot` | Restart a stopped/running instance | `resource_id` |
| `inspect` | View detailed resource info (no side-effects) | `resource_id` |
| `wait` | Do nothing for this step | — |
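
To show how the commands in this table map onto a request payload, here is a self-contained sketch that mirrors the serialization in `client.py` (`_step_payload`). `ActionSketch` is a stand-in for `SREAction` so the example runs without OpenEnv installed. The command validation and the `rds-main-001` resource ID are illustrative assumptions, not part of the PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

VALID_COMMANDS = {"terminate", "scale", "reboot", "inspect", "wait"}


@dataclass
class ActionSketch:
    """Stand-in for SREAction, for illustration only."""
    command: str = "wait"
    resource_id: Optional[str] = None
    params: Dict[str, str] = field(default_factory=dict)


def to_payload(action: ActionSketch) -> dict:
    """Build the JSON-style payload, omitting empty optional fields."""
    # Assumption: reject unknown commands client-side; the PR does not do this.
    if action.command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {action.command!r}")
    payload: dict = {"command": action.command}
    if action.resource_id:
        payload["resource_id"] = action.resource_id
    if action.params:
        payload["params"] = action.params
    return payload


print(to_payload(ActionSketch("scale", "rds-main-001", {"target_size": "db.t3.medium"})))
# {'command': 'scale', 'resource_id': 'rds-main-001', 'params': {'target_size': 'db.t3.medium'}}
print(to_payload(ActionSketch()))
# {'command': 'wait'}
```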

## Observation Space

```json
{
  "resources": [{"id": "ec2-web-001", "type": "ec2_instance", "status": "running", ...}],
  "alerts": [{"alert_id": "alert-cost-001", "severity": "warning", "message": "..."}],
  "total_hourly_cost": 4.52,
  "system_uptime": 78.0,
  "budget_limit": 12.00,
  "task_description": "Your cloud account has unattached EBS volumes..."
}
```
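
An agent typically reads this observation before choosing an action. The sketch below, which assumes the observation is available as a plain dict shaped like the JSON above, picks out unattached EBS volumes (status `available`) and checks the budget. The helper names and the sample `ebs-data-001` entry are hypothetical.

```python
from typing import List


def find_orphan_volumes(observation: dict) -> List[str]:
    """Return IDs of EBS volumes whose status is 'available' (unattached)."""
    return [
        r["id"]
        for r in observation.get("resources", [])
        if r.get("type") == "ebs_volume" and r.get("status") == "available"
    ]


def over_budget(observation: dict) -> bool:
    """True only when a budget limit is set and the hourly cost exceeds it."""
    limit = observation.get("budget_limit")
    return limit is not None and observation.get("total_hourly_cost", 0.0) > limit


sample = {
    "resources": [
        {"id": "ec2-web-001", "type": "ec2_instance", "status": "running"},
        {"id": "ebs-orphan-001", "type": "ebs_volume", "status": "available"},
        {"id": "ebs-data-001", "type": "ebs_volume", "status": "in-use"},
    ],
    "total_hourly_cost": 4.52,
    "budget_limit": 12.00,
}

print(find_orphan_volumes(sample))  # ['ebs-orphan-001']
print(over_budget(sample))          # False
```

Filtering on both `type` and `status` matters because `available` alone does not identify the resource as a volume, and terminating anything else is penalized.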

## Project Structure

```
cloud_sre_env/
├── __init__.py                      # Exports SREAction, SREObservation, SREState, CloudSREEnv
├── models.py                        # Typed models (Action, Observation, State)
├── client.py                        # EnvClient subclass for HTTP/Docker communication
├── openenv.yaml                     # Environment manifest
├── README.md                        # This file
└── server/
    ├── __init__.py
    ├── cloud_sre_environment.py     # Core environment logic + 3 tasks + grading
    ├── app.py                       # FastAPI application
    ├── Dockerfile                   # Container image
    └── requirements.txt             # Server dependencies
```

## Testing

```bash
pytest tests/test_cloud_sre_environment.py -v
```

## License

BSD-3-Clause — same as OpenEnv.
12 changes: 12 additions & 0 deletions envs/cloud_sre_env/__init__.py
@@ -0,0 +1,12 @@
"""
Cloud SRE & FinOps Environment for OpenEnv.

An OpenEnv-compliant environment simulating Cloud SRE operations:
diagnosing outages, terminating idle resources, scaling services,
and optimizing costs without causing collateral damage.
"""

from .models import SREAction, SREObservation, SREState
from .client import CloudSREEnv

__all__ = ["SREAction", "SREObservation", "SREState", "CloudSREEnv"]
65 changes: 65 additions & 0 deletions envs/cloud_sre_env/client.py
@@ -0,0 +1,65 @@
"""
Cloud SRE OpenEnv Client.

Implements EnvClient for communicating with the Cloud SRE environment
via WebSocket/HTTP when deployed as a Docker container or HF Space.
"""

from openenv.core import EnvClient, StepResult
from .models import SREAction, SREObservation, SREState


class CloudSREEnv(EnvClient[SREAction, SREObservation, SREState]):
    """
    Client for the Cloud SRE & FinOps environment.

    Usage (async):
        async with CloudSREEnv(base_url="https://your-space.hf.space") as client:
            result = await client.reset()
            result = await client.step(SREAction(command="inspect", resource_id="ec2-web-001"))

    Usage (sync):
        with CloudSREEnv(base_url="...").sync() as client:
            result = client.reset()
            result = client.step(SREAction(command="terminate", resource_id="ebs-orphan-001"))
    """

    def _step_payload(self, action: SREAction) -> dict:
        """Serialize an SREAction into the JSON payload for the server."""
        payload = {"command": action.command}
        if action.resource_id:
            payload["resource_id"] = action.resource_id
        if action.params:
            payload["params"] = action.params
        return payload

    def _parse_result(self, payload: dict) -> StepResult[SREObservation]:
        """Deserialize the server response into a typed StepResult."""
        obs_data = payload.get("observation", {})
        obs = SREObservation(
            resources=obs_data.get("resources", []),
            alerts=obs_data.get("alerts", []),
            total_hourly_cost=obs_data.get("total_hourly_cost", 0.0),
            system_uptime=obs_data.get("system_uptime", 100.0),
            step_number=obs_data.get("step_number", 0),
            max_steps=obs_data.get("max_steps", 15),
            budget_limit=obs_data.get("budget_limit"),
            task_description=obs_data.get("task_description", ""),
        )
        return StepResult(
            observation=obs,
            reward=payload.get("reward", 0.0),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: dict) -> SREState:
        """Deserialize the server state response into a typed SREState."""
        return SREState(
            episode_id=payload.get("episode_id", ""),
            step_count=payload.get("step_count", 0),
            task_id=payload.get("task_id", ""),
            current_step=payload.get("current_step", 0),
            done=payload.get("done", False),
            cumulative_reward=payload.get("cumulative_reward", 0.0),
            action_count=payload.get("action_count", 0),
        )
124 changes: 124 additions & 0 deletions envs/cloud_sre_env/models.py
@@ -0,0 +1,124 @@
"""
Typed models for the Cloud SRE OpenEnv environment.

Follows OpenEnv conventions: Action, Observation, and State subclass the
openenv.core base classes and declare their fields with pydantic's Field.
ResourceInfo and AlertInfo are plain dataclasses.
"""

from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

from openenv.core.env_server import Action, Observation, State


# ─── Enums ────────────────────────────────────────────────────────────────────


class ResourceType(str, Enum):
    """Types of cloud resources available in the simulation."""
    EC2 = "ec2_instance"
    RDS = "rds_database"
    EBS = "ebs_volume"
    ALB = "alb_load_balancer"


class ResourceStatus(str, Enum):
    """Possible statuses for a cloud resource."""
    RUNNING = "running"
    STOPPED = "stopped"
    AVAILABLE = "available"  # EBS: unattached
    IN_USE = "in-use"  # EBS: attached
    REBOOTING = "rebooting"
    TERMINATED = "terminated"


class AlertSeverity(str, Enum):
    """Severity levels for monitoring alerts."""
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


class ActionCommand(str, Enum):
    """Discrete commands the agent can issue."""
    TERMINATE = "terminate"
    SCALE = "scale"
    REBOOT = "reboot"
    INSPECT = "inspect"
    WAIT = "wait"


# ─── Data Structures ─────────────────────────────────────────────────────────


@dataclass
class ResourceInfo:
    """Represents a single cloud resource (EC2, RDS, EBS, ALB)."""
    id: str
    name: str = ""
    type: str = ""  # ResourceType value
    status: str = ""  # ResourceStatus value
    instance_size: str = ""  # e.g., "t3.micro", "db.t3.medium"
    cpu_utilization: float = 0.0
    memory_utilization: float = 0.0
    cost_per_hour: float = 0.0
    attached_to: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class AlertInfo:
    """Represents an active monitoring alert."""
    alert_id: str
    severity: str = "info"
    message: str = ""
    resource_id: Optional[str] = None
    metric_name: Optional[str] = None
    metric_value: Optional[float] = None


# ─── OpenEnv Action / Observation / State ────────────────────────────────────


from pydantic import Field

class SREAction(Action):
    """
    An action the agent submits to step().

    Attributes:
        command: One of 'terminate', 'scale', 'reboot', 'inspect', 'wait'.
        resource_id: The target resource ID (optional for 'wait').
        params: Additional parameters, e.g. {"target_size": "db.t3.medium"}.
    """
    command: str = "wait"
    resource_id: Optional[str] = None
    params: Dict[str, str] = Field(default_factory=dict)


class SREObservation(Observation):
    """
    The full observation returned by reset() and step().
    Contains everything the agent can see about the infrastructure.
    """
    resources: List[dict] = Field(default_factory=list)
    alerts: List[dict] = Field(default_factory=list)
    total_hourly_cost: float = 0.0
    system_uptime: float = 100.0
    step_number: int = 0
    max_steps: int = 15
    budget_limit: Optional[float] = None
    task_description: str = ""


class SREState(State):
    """
    Episode state tracking for the Cloud SRE environment.
    Extends the base State with SRE-specific metadata.
    """
    task_id: str = ""
    current_step: int = 0
    done: bool = False
    cumulative_reward: float = 0.0
    action_count: int = 0