AgentOS - Comprehensive Test Strategy

AI Agent Platform | Production Validation Report


1. System Architecture Summary

1.1 Core Architecture

AgentOS is a multi-agent AI operating system built as a monorepo with:

Layer | Technology | Port
Web Client | Next.js 14 | 3000
Mobile Client | Expo React Native | 8081
AI Backend | Express.js + TypeScript | 4000
LLM Engine | Ollama (local) | 11434

1.2 Main Components

Backend Services (services/ai-backend/)

├── src/
│   ├── index.ts                    # Express server entry
│   ├── api/routes/
│   │   ├── chat.ts                 # POST /chat, /chat/stream
│   │   ├── agents.ts               # POST /agents/execute, GET /agents/status
│   │   ├── health.ts               # GET /health, /health/logs
│   │   ├── models.ts               # GET /models
│   │   └── tasks.ts                # GET /tasks
│   ├── agents/
│   │   ├── orchestrator.ts         # Agent coordination & task routing
│   │   ├── baseAgent.ts            # Agent interface
│   │   ├── planner/index.ts         # Task decomposition
│   │   ├── coding/index.ts          # Code generation
│   │   ├── research/index.ts        # Information gathering
│   │   ├── execution/index.ts       # Command execution
│   │   └── taskQueue.ts             # Task lifecycle management
│   ├── router/
│   │   └── selectModel.ts           # Model routing logic
│   ├── memory/
│   │   ├── index.ts                 # Memory storage interface
│   │   └── conversationStore.ts     # Chat history
│   ├── tools/
│   │   ├── index.ts                 # Tool registry
│   │   └── webSearch.ts             # Web search tool
│   ├── lib/
│   │   └── ollama.ts                # Ollama client wrapper
│   └── utils/
│       └── logger.ts                # Logging utility

Web Dashboard (apps/web/)

  • Dashboard page with system stats
  • AI Chat interface with streaming
  • Agent fleet status monitoring
  • Task queue visualization
  • Model availability display
  • Real-time logs viewer

Mobile App (apps/mobile/)

  • Expo React Native with tab navigation
  • API integration for chat and monitoring

2. Full Flow Analysis

2.1 Data Flow Diagram

┌──────────────┐     ┌──────────────┐
│   Web App    │     │ Mobile App   │
│  (Next.js)   │     │   (Expo)     │
└──────┬───────┘     └──────┬───────┘
       │                    │
       └────────┬───────────┘
                │
                ▼
        ┌──────────────────┐
        │   REST API       │
        │ (Express :4000)  │
        └────────┬─────────┘
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
   ┌───────┐ ┌───────┐ ┌───────┐
   │ /chat │ │/agents│ │/health│
   └───┬───┘ └───┬───┘ └───┬───┘
       │         │         │
       ▼         ▼         ▼
┌──────────────────────────────────────┐
│         AI Gateway Layer             │
│  ┌────────────┐  ┌─────────────────┐ │
│  │   Model    │  │  Orchestrator   │ │
│  │   Router   │  │                 │ │
│  └─────┬──────┘  └────────┬────────┘ │
└────────┼─────────────────┼───────────┘
         │                 │
         ▼                 ▼
   ┌─────────────────────────────┐
   │      Agent Execution        │
   │  ┌─────┐ ┌─────┐ ┌─────┐   │
   │  │Plan │ │Cod  │ │Rese │   │
   │  │ ner │ │ ing │ │arch │   │
   │  └──┬──┘ └──┬──┘ └──┬──┘   │
   └─────┼───────┼───────┼───────┘
         │       │       │
         ▼       ▼       ▼
   ┌─────────────────────────────────┐
   │         Ollama (Local LLM)      │
   │  ┌────────┐ ┌────────┐ ┌─────┐  │
   │  │ Llama3 │ │Mistral │ │Deep │  │
   │  │        │ │        │ │Seek │  │
   │  └────────┘ └────────┘ └─────┘  │
   └─────────────────────────────────┘
                │
                ▼
   ┌─────────────────────────────────┐
   │       Memory System             │
   │ ┌────────────┐ ┌──────────────┐ │
   │ │Conversation│ │ Vector Store │ │
   │ │   Store    │ │  (Chroma)    │ │
   │ └────────────┘ └──────────────┘ │
   └─────────────────────────────────┘

2.2 User Flows

Flow 1: Chat Interaction

  1. User enters message in web/mobile chat
  2. POST /api/chat with message payload
  3. Model router classifies task type
  4. Ollama generates response
  5. Response stored in conversation history
  6. UI updates with assistant message

Flow 2: Agent Task Execution

  1. User submits task via POST /api/agents/execute
  2. If specific agent requested → execute directly
  3. Otherwise → planner decomposes task
  4. Subtasks distributed to worker agents
  5. Results aggregated and returned
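
The branching in Flow 2 can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: the `WorkerId` type, the stub `workers`, and the keyword-based `plan` function are hypothetical stand-ins for the LLM-backed agents, chosen so the flow is deterministic.

```typescript
// Hypothetical types; the real interfaces in baseAgent.ts may differ.
type WorkerId = "coding" | "research" | "execution";

interface Subtask {
  agentId: WorkerId;
  prompt: string;
}

interface Agent {
  run(prompt: string): string;
}

// Stub workers stand in for LLM-backed agents.
const workers: Record<WorkerId, Agent> = {
  coding: { run: (p) => `[coding] ${p}` },
  research: { run: (p) => `[research] ${p}` },
  execution: { run: (p) => `[execution] ${p}` },
};

// Stub planner (assumption): splits on ";" and assigns agents with a naive keyword rule.
function plan(prompt: string): Subtask[] {
  return prompt
    .split(";")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part): Subtask => ({
      agentId: /code|function|implement/i.test(part) ? "coding" : "research",
      prompt: part,
    }));
}

// Step 2: a requested agent runs directly. Steps 3-5: otherwise plan, distribute, aggregate.
function execute(prompt: string, agentId?: WorkerId): string[] {
  if (agentId) {
    return [workers[agentId].run(prompt)];
  }
  return plan(prompt).map((sub) => workers[sub.agentId].run(sub.prompt));
}
```

Calling `execute("write hello world", "coding")` bypasses the planner, while `execute("implement a parser; summarize the spec")` yields one result per subtask.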

Flow 3: Dashboard Monitoring

  1. Web app polls GET /api/health every 5s
  2. Status displayed: Ollama connection, agent statuses
  3. Task stats, conversation count, uptime shown
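
As a rough sketch of the payload the dashboard might poll, the function below derives a health report from the Ollama connection state and agent statuses. The field names and the `AgentState` values are assumptions; only the `healthy` and `degraded` statuses come from the API test cases in this document.

```typescript
// Hypothetical shapes; the real /api/health response may differ.
type AgentState = "idle" | "busy" | "error";

interface HealthInput {
  ollamaConnected: boolean;
  agents: Record<string, AgentState>;
  startedAtMs: number;
  nowMs: number;
}

interface HealthReport {
  status: "healthy" | "degraded";
  ollama: "connected" | "disconnected";
  agents: Record<string, AgentState>;
  uptimeSeconds: number;
}

// Reports "degraded" whenever Ollama is unreachable, mirroring API-008 and API-009.
function buildHealthReport(input: HealthInput): HealthReport {
  return {
    status: input.ollamaConnected ? "healthy" : "degraded",
    ollama: input.ollamaConnected ? "connected" : "disconnected",
    agents: input.agents,
    uptimeSeconds: Math.floor((input.nowMs - input.startedAtMs) / 1000),
  };
}
```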

2.3 Agent Interactions

Scenario | Path
Simple task | User → Orchestrator → Single Agent → Ollama → Response
Complex task | User → Planner → Subtasks → Worker Agents → Aggregated Result
Model fallback | Primary model fails → Fallback model → Retry logic
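
One way to express the model fallback path is a preference chain resolved against the set of installed models. The chains below are hypothetical, apart from deepseek-coder falling back to llama3, which test case MR-007 expects.

```typescript
// Hypothetical fallback chains (assumption); only deepseek-coder -> llama3 is stated in this document.
const FALLBACKS: Record<string, string[]> = {
  "deepseek-coder": ["llama3", "mistral"],
  llama3: ["mistral"],
  mistral: ["llama3"],
};

// Returns the primary model if installed, otherwise the first installed fallback.
// Returns undefined when nothing in the chain is available.
function resolveModel(primary: string, installed: ReadonlySet<string>): string | undefined {
  const candidates = [primary, ...(FALLBACKS[primary] ?? [])];
  return candidates.find((model) => installed.has(model));
}
```

Returning `undefined` rather than throwing leaves the caller free to decide whether a missing model is a validation error or a degraded-health condition.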

2.4 Tool Execution Flows

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Agent     │────▶│ Tool        │────▶│   Tool      │
│  Request    │     │   Registry  │     │  Execution  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                                │
                     ┌─────────────┐            │
                     │   Result    │◀───────────┘
                     │  Aggregation│
                     └─────────────┘
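
The registry, execution, and aggregation stages above might look like the sketch below. `ToolRegistry`, `ToolResult`, and `aggregate` are illustrative names, not the API exported by tools/index.ts, and the handlers are synchronous to keep the example small.

```typescript
type ToolHandler = (input: string) => string;

interface ToolResult {
  tool: string;
  ok: boolean;
  output: string;
}

class ToolRegistry {
  private readonly tools = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    if (this.tools.has(name)) {
      throw new Error(`Tool already registered: ${name}`);
    }
    this.tools.set(name, handler);
  }

  // Never throws: unknown tools and handler errors become failed results.
  execute(name: string, input: string): ToolResult {
    const handler = this.tools.get(name);
    if (!handler) {
      return { tool: name, ok: false, output: `Unknown tool: ${name}` };
    }
    try {
      return { tool: name, ok: true, output: handler(input) };
    } catch (err) {
      return { tool: name, ok: false, output: err instanceof Error ? err.message : String(err) };
    }
  }
}

// Joins results into one text block an agent could feed back to the model.
function aggregate(results: ToolResult[]): string {
  return results
    .map((r) => `${r.tool}: ${r.ok ? r.output : "ERROR " + r.output}`)
    .join("\n");
}
```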

2.5 Model Routing Flow

User Prompt → Keyword Analysis → Task Classification → Model Selection → Ollama Call
                                              │
                              ┌───────────────┼───────────────┐
                              ▼               ▼               ▼
                         deepseek-coder    llama3          mistral
                         (coding)         (reasoning)    (conversation)
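
A keyword-based router along these lines would satisfy the routing flow above. The keyword lists are assumptions for illustration; the real logic in router/selectModel.ts may classify differently.

```typescript
type TaskType = "coding" | "reasoning" | "conversation";

// Hypothetical keyword lists (assumption).
const CODING_KEYWORDS = /\b(code|function|bug|implement|refactor|compile)\b/i;
const REASONING_KEYWORDS = /\b(explain|analyze|why|plan|roadmap|compare)\b/i;

const MODEL_FOR_TASK: Record<TaskType, string> = {
  coding: "deepseek-coder",
  reasoning: "llama3",
  conversation: "mistral",
};

// Coding keywords are checked first, so "explain this code" routes to the coding model.
function classifyTask(prompt: string): TaskType {
  if (CODING_KEYWORDS.test(prompt)) return "coding";
  if (REASONING_KEYWORDS.test(prompt)) return "reasoning";
  return "conversation";
}

// A valid preferred model overrides classification; an unknown one is ignored.
function selectModel(prompt: string, preferred?: string): string {
  const knownModels = Object.values(MODEL_FOR_TASK);
  if (preferred && knownModels.includes(preferred)) {
    return preferred;
  }
  return MODEL_FOR_TASK[classifyTask(prompt)];
}
```

The check order is a design choice: prompts that mix coding and reasoning keywords resolve to the coding model, which is exactly the kind of ambiguity the router unit tests should pin down.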

3. Testing Strategy

3.1 Test Pyramid

                    ┌─────────────┐
                    │     E2E     │  ← Few, slow, comprehensive
                    │   Tests     │
              ┌─────┴─────────────┴─────┐
              │    Integration Tests    │  ← Moderate, critical paths
              └─────┬─────────────┬─────┘
                    │             │
            ┌───────┴──────┐ ┌───┴────────┐
            │  Unit Tests  │ │  API Tests │  ← Many, fast, isolated
            └──────────────┘ └────────────┘

3.2 Recommended Testing Tools

Category | Tool | Purpose
Unit Tests | Vitest | Fast, modern test runner
API Tests | Supertest | HTTP assertions
E2E Tests | Playwright | Cross-browser automation
Load Tests | k6 or Autocannon | Performance testing
Mobile Tests | Detox | React Native E2E
Mocking | MSW | API mocking
Coverage | @vitest/coverage-v8 | Code coverage (V8 provider, matching the package list in 8.3)

4. Detailed Test Cases

4.1 Unit Tests

Model Router Tests

Test ID | Scenario | Steps | Expected Result | Edge Cases
MR-001 | Coding task routing | Input: "write a function to sort array" | deepseek-coder selected | Keywords like "explain" combined with code
MR-002 | Reasoning task routing | Input: "explain how binary search works" | llama3 selected | Ambiguous prompts
MR-003 | Conversation routing | Input: "hello, how are you?" | mistral selected | Non-English input
MR-004 | Planning task routing | Input: "create a project roadmap" | llama3 selected | Multi-word planning terms
MR-005 | Analysis task routing | Input: "analyze the performance metrics" | llama3 selected | Data-related keywords
MR-006 | Preferred model override | Input: "code function" + preferred=mistral | mistral selected | Invalid preferred model
MR-007 | Fallback model selection | deepseek-coder unavailable | llama3 selected | N/A (tested in integration)

Task Queue Tests

Test ID | Scenario | Steps | Expected Result
TQ-001 | Add task | Add new QueuedTask | Task in getAllTasks()
TQ-002 | Update status | Update task to 'completed' | Status changed, completedAt set
TQ-003 | Get by status | Filter running tasks | Only running tasks returned
TQ-004 | Task statistics | Multiple tasks in various states | Accurate counts
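
To make the four task queue cases concrete, here is a minimal in-memory queue they could run against. Only `QueuedTask`, `getAllTasks()`, and `completedAt` appear in this document; the remaining method and field names are assumptions, and the real taskQueue.ts may differ.

```typescript
type TaskStatus = "pending" | "running" | "completed" | "failed";

interface QueuedTask {
  id: string;
  prompt: string;
  status: TaskStatus;
  createdAt: number;
  completedAt?: number;
}

class TaskQueue {
  private readonly tasks = new Map<string, QueuedTask>();

  add(task: QueuedTask): void {
    this.tasks.set(task.id, task);
  }

  // Sets completedAt when the task reaches a terminal state (TQ-002).
  updateStatus(id: string, status: TaskStatus, now: number = Date.now()): QueuedTask {
    const task = this.tasks.get(id);
    if (!task) {
      throw new Error(`Unknown task: ${id}`);
    }
    task.status = status;
    if (status === "completed" || status === "failed") {
      task.completedAt = now;
    }
    return task;
  }

  getAllTasks(): QueuedTask[] {
    return Array.from(this.tasks.values());
  }

  getByStatus(status: TaskStatus): QueuedTask[] {
    return this.getAllTasks().filter((t) => t.status === status);
  }

  getStats(): Record<TaskStatus, number> {
    const stats: Record<TaskStatus, number> = { pending: 0, running: 0, completed: 0, failed: 0 };
    for (const task of this.getAllTasks()) {
      stats[task.status] += 1;
    }
    return stats;
  }
}
```

Passing `now` explicitly keeps timestamp assertions deterministic in unit tests.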

Memory System Tests

Test ID | Scenario | Steps | Expected Result
MS-001 | Store memory | Store entry with type 'conversation' | ID returned, entry retrievable
MS-002 | Retrieve by query | Store entries, search with query | Relevant entries returned
MS-003 | Conversation history | Store multiple messages | Most recent first, limit respected

Conversation Store Tests

Test ID | Scenario | Steps | Expected Result
CS-001 | Create conversation | New UUID | Conversation created with empty messages
CS-002 | Get existing | Use existing ID | Existing conversation returned
CS-003 | List all | Multiple conversations | Sorted by updatedAt DESC
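
A store with the behavior these cases describe could be sketched as below. The class shape and method names are hypothetical; the default limit of 100 messages is taken from the mitigation listed for EC-005.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface Conversation {
  id: string;
  messages: Message[];
  updatedAt: number;
}

class ConversationStore {
  private readonly conversations = new Map<string, Conversation>();
  private readonly maxMessages: number;

  constructor(maxMessages: number = 100) {
    this.maxMessages = maxMessages;
  }

  // CS-001 / CS-002: create on first use, return the same object afterwards.
  getOrCreate(id: string, now: number): Conversation {
    let conversation = this.conversations.get(id);
    if (!conversation) {
      conversation = { id, messages: [], updatedAt: now };
      this.conversations.set(id, conversation);
    }
    return conversation;
  }

  // Appends a message and drops the oldest ones beyond the limit.
  append(id: string, message: Message, now: number): void {
    const conversation = this.getOrCreate(id, now);
    conversation.messages.push(message);
    const overflow = conversation.messages.length - this.maxMessages;
    if (overflow > 0) {
      conversation.messages.splice(0, overflow);
    }
    conversation.updatedAt = now;
  }

  // CS-003: most recently updated conversation first.
  list(): Conversation[] {
    return Array.from(this.conversations.values()).sort((a, b) => b.updatedAt - a.updatedAt);
  }
}
```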

4.2 Integration Tests

API Endpoint Tests

Test ID | Endpoint | Scenario | Steps | Expected Result
API-001 | POST /api/chat | Valid message | Send message with valid payload | 200, success=true, message returned
API-002 | POST /api/chat | Empty message | Send "" | 400, validation error
API-003 | POST /api/chat | Invalid UUID | conversationId="invalid" | 400, validation error
API-004 | POST /api/chat/stream | Streaming enabled | Send stream=true | text/event-stream response
API-005 | POST /api/agents/execute | Execute coding task | Send prompt="write hello world" | 200, task executed
API-006 | POST /api/agents/execute | Invalid agent | agentId="invalid" | 400, validation error
API-007 | GET /api/agents/status | Get all statuses | Call endpoint | 200, all agent statuses
API-008 | GET /api/health | Healthy system | Ollama running | status=healthy
API-009 | GET /api/health | Ollama down | Stop Ollama | status=degraded
API-010 | GET /api/tasks | Get all tasks | Call endpoint | 200, task list with stats
API-011 | GET /api/models | List models | Call endpoint | 200, available models
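
The validation behavior expected by API-001 to API-003 can be isolated in a pure function and unit tested without starting the server. This is a sketch under assumptions: the function name, the error messages, and the hand-rolled UUID check are illustrative, and the backend may well use a schema library instead.

```typescript
// Matches RFC 4122-style UUIDs (versions 1-8, standard variant).
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

interface ValidationResult {
  status: 200 | 400;
  errors: string[];
}

// Hypothetical validator for the POST /api/chat body.
function validateChatRequest(body: unknown): ValidationResult {
  const errors: string[] = [];
  const fields = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;
  const message = fields["message"];
  const conversationId = fields["conversationId"];

  if (typeof message !== "string" || message.trim().length === 0) {
    errors.push("message must be a non-empty string");
  }
  // conversationId is treated as optional here (assumption); when present it must be a UUID.
  if (conversationId !== undefined && (typeof conversationId !== "string" || !UUID_RE.test(conversationId))) {
    errors.push("conversationId must be a UUID");
  }
  return { status: errors.length === 0 ? 200 : 400, errors };
}
```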

Agent Orchestrator Tests

Test ID | Scenario | Steps | Expected Result
AO-001 | Single agent execution | Execute with agentId="coding" | Single task in queue, completed
AO-002 | Planner decomposition | Complex task without agentId | Planner creates subtasks
AO-003 | Agent status retrieval | Get agent statuses | All 4 agents returned
AO-004 | Invalid agent ID | Execute with invalid ID | 404 or fallback to planner

Ollama Integration Tests

Test ID | Scenario | Steps | Expected Result
OLL-001 | Chat completion | Send messages to llama3 | Response content returned
OLL-002 | Streaming | Stream chat response | Multiple chunks received
OLL-003 | Model list | List available models | Array of model names
OLL-004 | Connection check | Check Ollama connectivity | True/False
OLL-005 | Embedding generation | Generate embedding | Float array returned
OLL-005 Embedding generation Generate embedding Float array returned

4.3 End-to-End Tests

E2E-001: Complete Chat Flow

Step | Action | Expected
1 | Open web app | Dashboard loads
2 | Navigate to Chat | Chat page visible
3 | Type message "Hello" | Message appears in UI
4 | Press send | Loading indicator shows
5 | Receive response | Assistant message appears
6 | Check conversation history | GET /api/chat/conversations shows new entry

E2E-002: Agent Task Execution

Step | Action | Expected
1 | POST /api/agents/execute with coding task | Task queued
2 | GET /api/tasks | Task in running state
3 | Wait for completion | Task completed
4 | GET /api/agents/status | Agent status updated

E2E-003: Model Routing Verification

Step | Action | Expected
1 | POST /api/chat with coding prompt | deepseek-coder used
2 | Check response metadata | modelUsed="deepseek-coder"
3 | POST /api/chat with conversational prompt | mistral used

E2E-004: Dashboard Monitoring

Step | Action | Expected
1 | Navigate to Dashboard | Stats visible
2 | Trigger agent task | Running tasks count updates
3 | Wait for completion | Completed count increments
4 | Check agent statuses | All agent statuses displayed

4.4 Mobile App Tests

Test ID | Scenario | Steps | Expected Result
MOB-001 | App launch | Start Expo app | Home screen renders
MOB-002 | API connectivity | Fetch /api/health | Connection status shown
MOB-003 | Tab navigation | Switch between tabs | Content switches
MOB-004 | Chat interface | Send message | Response received
MOB-005 | Offline handling | Disconnect backend | Error message shown

5. Edge Case Analysis

5.1 Critical Edge Cases

ID | Edge Case | Risk Level | Mitigation
EC-001 | Ollama not running | 🔴 Critical | Health check, graceful degradation
EC-002 | Model not installed | 🔴 Critical | Validate model availability before execution
EC-003 | LLM timeout | 🔴 Critical | Request timeout (30 s), retry logic
EC-004 | Agent execution loop | 🔴 Critical | Max iteration limit, task complexity check
EC-005 | Memory overflow | 🟠 High | Conversation history limit (100 messages)
EC-006 | Concurrent requests | 🟡 Medium | Rate limiting, queue management
EC-007 | Invalid task decomposition | 🟡 Medium | Validate subtask structure
EC-008 | Network interruption | 🟡 Medium | Retry with exponential backoff
EC-009 | Large payload | 🟡 Medium | Request size limit (10 MB)
EC-010 | Model hallucination | 🟡 Medium | Validate output format
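
The timeout and backoff mitigations for EC-003 and EC-008 could be combined roughly as follows. The 30 s timeout comes from the table; the base delay, cap, attempt count, and all function names are assumptions for illustration.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Rejects if `work` does not settle within timeoutMs. The underlying work is not cancelled.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs} ms`)), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries fn up to maxAttempts times, sleeping with exponential backoff between attempts.
// `sleep` is injectable so tests can run without real delays.
async function retry<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  timeoutMs: number = 30000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Note that a timed-out request keeps running in the background in this sketch; a production client would likely pass an abort signal to the underlying HTTP call as well.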

5.2 Race Condition Scenarios

Scenario | Test Approach
Multiple agents accessing queue | Concurrent task submission
Simultaneous model requests | Load test with multiple requests
Memory race conditions | Rapid store/retrieve operations
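
A concurrent-submission check might look like the sketch below. `InMemoryQueue` and `submitConcurrently` are hypothetical helpers; in a single Node.js process, races arise from interleaving around `await` points, so the example reserves each ID synchronously before yielding and the test verifies that no submission is lost or duplicated.

```typescript
interface Submitted {
  id: number;
  prompt: string;
}

class InMemoryQueue {
  private nextId = 1;
  private readonly tasks: Submitted[] = [];

  async submit(prompt: string): Promise<number> {
    const id = this.nextId++; // reserved synchronously, before any await
    await Promise.resolve(); // stands in for async work such as persistence
    this.tasks.push({ id, prompt });
    return id;
  }

  size(): number {
    return this.tasks.length;
  }
}

// Fires n submissions at once and reports how many distinct IDs came back.
async function submitConcurrently(
  queue: InMemoryQueue,
  n: number,
): Promise<{ ids: number[]; uniqueIds: number }> {
  const pending = Array.from({ length: n }, (_, i) => queue.submit(`task ${i}`));
  const ids = await Promise.all(pending);
  return { ids, uniqueIds: new Set(ids).size };
}
```

If the ID were read before the `await` and incremented after it, the same harness would report fewer unique IDs than submissions, which is the failure this scenario is meant to catch.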

6. Automation Strategy

6.1 Test Execution Matrix

Test Type | Frequency | Environment | CI/CD
Unit Tests | Every PR | Local + CI | ✅ GitHub Actions
Integration | Every PR | Staging | ✅ GitHub Actions
E2E (Web) | Every Release | Production-like | ✅ GitHub Actions
E2E (Mobile) | Every Release | Device Farm | ⚠️ Manual/Detox
Performance | Weekly | Dedicated | ⚠️ Manual
Security | Monthly | CI | ⚠️ Manual

6.2 CI/CD Pipeline

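# .github/workflows/test.yml
name: Test Suite

on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit
      - run: npm run test:coverage

  integration-tests:
    runs-on: ubuntu-latest
    services:
      # Ollama container; required models must be pulled before the tests call them.
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434
    steps:
      # Each job starts on a fresh runner, so it needs its own checkout and install.
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration

  e2e-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      # Playwright needs its browser binaries installed on the runner.
      - run: npx playwright install --with-deps
      - run: npm run test:e2e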

6.3 Test Organization

tests/
├── unit/
│   ├── router/
│   │   └── selectModel.test.ts
│   ├── agents/
│   │   ├── orchestrator.test.ts
│   │   └── taskQueue.test.ts
│   ├── memory/
│   │   ├── memorySystem.test.ts
│   │   └── conversationStore.test.ts
│   └── api/
│       └── validation.test.ts
├── integration/
│   ├── api/
│   │   ├── chat.test.ts
│   │   ├── agents.test.ts
│   │   └── health.test.ts
│   └── ollama/
│       └── ollama.test.ts
├── e2e/
│   ├── web/
│   │   ├── chat.test.ts
│   │   ├── dashboard.test.ts
│   │   └── agents.test.ts
│   └── mobile/
│       └── app.test.ts
└── fixtures/
    ├── conversations.json
    └── tasks.json

7. Risk Assessment

7.1 Component Risk Matrix

Component | Failure Impact | Probability | Priority | Test Focus
AI Gateway | System down | Medium | P0 | Health, connectivity
Model Router | Wrong model | Medium | P0 | Routing logic
Orchestrator | Task failures | Medium | P0 | Execution flow
Ollama Client | No LLM responses | High | P0 | Connection, fallback
Task Queue | Lost tasks | Low | P1 | Persistence
Memory System | Data loss | Medium | P1 | Storage, retrieval
Web Dashboard | UI broken | Low | P2 | E2E tests
Mobile App | App crash | Low | P2 | Basic functionality

7.2 Critical Success Criteria

  1. ✅ All API endpoints return correct status codes
  2. ✅ Model routing accuracy > 95%
  3. ✅ Agent task completion rate > 99%
  4. ✅ Health check reflects actual system state
  5. ✅ No data loss in conversation history
  6. ✅ Response time < 5s for simple queries
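
The numeric criteria (2, 3, and 6) lend themselves to an automated gate. The sketch below is one possible check; the `RunMetrics` shape is hypothetical, and interpreting "response time < 5s" as the slowest observed simple query (rather than a percentile) is an assumption.

```typescript
// Hypothetical metrics collected from a test run.
interface RunMetrics {
  routedCorrectly: number;
  routedTotal: number;
  tasksCompleted: number;
  tasksTotal: number;
  simpleQueryLatenciesMs: number[];
}

function ratio(part: number, total: number): number {
  return total === 0 ? 0 : part / total;
}

// Returns pass/fail per criterion using the thresholds listed above.
function evaluateCriteria(m: RunMetrics): Record<string, boolean> {
  const slowestMs = Math.max(0, ...m.simpleQueryLatenciesMs);
  return {
    routingAccuracy: ratio(m.routedCorrectly, m.routedTotal) > 0.95,
    completionRate: ratio(m.tasksCompleted, m.tasksTotal) > 0.99,
    simpleQueryLatency: slowestMs < 5000,
  };
}
```

Because the thresholds are strict inequalities, a completion rate of exactly 99% fails the gate.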

8. Architecture Improvements & Recommendations

8.1 Missing Requirements

ID | Requirement | Impact | Recommendation
REQ-001 | Authentication | Security | Add JWT/API key auth
REQ-002 | Rate Limiting | Stability | Implement rate limiter
REQ-003 | Input Sanitization | Security | Add XSS protection
REQ-004 | Persistent Storage | Data | Implement SQLite/Chroma
REQ-005 | WebSocket Support | Real-time | Add WebSocket for live updates
REQ-006 | Error Recovery | Reliability | Implement retry logic
REQ-007 | Metrics/Tracing | Observability | Add OpenTelemetry
REQ-008 | Caching | Performance | Add Redis cache layer
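
For REQ-002, a fixed-window limiter is one of the simplest options and is easy to unit test. The class below is an illustrative sketch, not a recommendation of a specific library; the name, the per-client keying, and the choice of a fixed window over a sliding window or token bucket are all assumptions.

```typescript
class FixedWindowRateLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the client exceeded its quota.
  allow(clientId: string, now: number): boolean {
    const entry = this.hits.get(clientId);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First request, or the previous window expired: start a new window.
      this.hits.set(clientId, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false;
  }
}
```

In an Express backend this check would typically sit in middleware that responds with HTTP 429 when `allow` returns false.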

8.2 Testing Gaps

Gap | Current State | Recommended
Test Coverage | None | Target 80%
Mobile Tests | Manual | Add Detox
Performance Tests | None | Add k6
Security Tests | None | Add OWASP tests
Visual Regression | None | Add Chromatic

8.3 Suggested Package Additions

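{
  "devDependencies": {
    "vitest": "^2.0.0",
    "@vitest/coverage-v8": "^2.0.0",
    "supertest": "^7.0.0",
    "@playwright/test": "^1.45.0",
    "msw": "^2.3.0",
    "detox": "^20.0.0"
  }
}

k6 is omitted from this list because it is distributed as a standalone binary rather than an npm dependency; install it separately on the machine or CI runner that executes the load tests.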

9. Test Case Summary

Category | Count | Priority
Unit Tests | 30+ | High
Integration Tests | 25+ | High
E2E Tests (Web) | 15+ | Medium
E2E Tests (Mobile) | 5+ | Medium
Edge Case Tests | 10+ | High
Total | 85+ |

10. Next Steps

  1. Set up testing infrastructure - Add Vitest, Supertest, Playwright
  2. Write unit tests - Start with model router, task queue
  3. Add integration tests - Test all API endpoints
  4. Implement E2E tests - Cover critical user flows
  5. Configure CI/CD - GitHub Actions pipeline
  6. Add test coverage reporting - Target 80%+
  7. Mobile testing - Set up Detox for React Native

Report generated by Senior Staff Engineer - AI Systems Architect
AgentOS v0.1.0 | March 2026