AgentOS - Comprehensive Test Strategy
AI Agent Platform | Production Validation Report
1. System Architecture Summary
AgentOS is a multi-agent AI operating system built as a monorepo with:
| Layer | Technology | Port |
|---|---|---|
| Web Client | Next.js 14 | 3000 |
| Mobile Client | Expo React Native | 8081 |
| AI Backend | Express.js + TypeScript | 4000 |
| LLM Engine | Ollama (local) | 11434 |
Backend Services (services/ai-backend/)

```
src/
├── index.ts                  # Express server entry
├── api/routes/
│   ├── chat.ts               # POST /chat, /chat/stream
│   ├── agents.ts             # POST /agents/execute, GET /agents/status
│   ├── health.ts             # GET /health, /health/logs
│   ├── models.ts             # GET /models
│   └── tasks.ts              # GET /tasks
├── agents/
│   ├── orchestrator.ts       # Agent coordination & task routing
│   ├── baseAgent.ts          # Agent interface
│   ├── planner/index.ts      # Task decomposition
│   ├── coding/index.ts       # Code generation
│   ├── research/index.ts     # Information gathering
│   ├── execution/index.ts    # Command execution
│   └── taskQueue.ts          # Task lifecycle management
├── router/
│   └── selectModel.ts        # Model routing logic
├── memory/
│   ├── index.ts              # Memory storage interface
│   └── conversationStore.ts  # Chat history
├── tools/
│   ├── index.ts              # Tool registry
│   └── webSearch.ts          # Web search tool
├── lib/
│   └── ollama.ts             # Ollama client wrapper
└── utils/
    └── logger.ts             # Logging utility
```
Web Dashboard (apps/web/)

- Dashboard page with system stats
- AI Chat interface with streaming
- Agent fleet status monitoring
- Task queue visualization
- Model availability display
- Real-time logs viewer

Mobile App (apps/mobile/)

- Expo React Native with tab navigation
- API integration for chat and monitoring
```
┌──────────────┐      ┌──────────────┐
│   Web App    │      │  Mobile App  │
│  (Next.js)   │      │    (Expo)    │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └─────────┬───────────┘
                 │
                 ▼
        ┌──────────────────┐
        │     REST API     │
        │ (Express :4000)  │
        └────────┬─────────┘
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
   ┌───────┐ ┌───────┐ ┌───────┐
   │ /chat │ │/agents│ │/health│
   └───┬───┘ └───┬───┘ └───┬───┘
       │         │         │
       ▼         ▼         ▼
┌──────────────────────────────────────┐
│           AI Gateway Layer           │
│  ┌────────────┐  ┌────────────────┐  │
│  │   Model    │  │  Orchestrator  │  │
│  │   Router   │  │                │  │
│  └─────┬──────┘  └────────┬───────┘  │
└────────┼──────────────────┼──────────┘
         │                  │
         ▼                  ▼
   ┌─────────────────────────────┐
   │       Agent Execution       │
   │  ┌─────┐  ┌─────┐  ┌─────┐  │
   │  │Plan-│  │Coder│  │Rsrch│  │
   │  │ ner │  │     │  │     │  │
   │  └──┬──┘  └──┬──┘  └──┬──┘  │
   └─────┼────────┼────────┼─────┘
         │        │        │
         ▼        ▼        ▼
  ┌─────────────────────────────────┐
  │       Ollama (Local LLM)        │
  │ ┌────────┐ ┌────────┐ ┌─────┐   │
  │ │ Llama3 │ │Mistral │ │Deep │   │
  │ │        │ │        │ │Seek │   │
  │ └────────┘ └────────┘ └─────┘   │
  └────────────────┬────────────────┘
                   │
                   ▼
  ┌─────────────────────────────────┐
  │          Memory System          │
  │ ┌────────────┐ ┌──────────────┐ │
  │ │Conversation│ │ Vector Store │ │
  │ │   Store    │ │   (Chroma)   │ │
  │ └────────────┘ └──────────────┘ │
  └─────────────────────────────────┘
```
Flow 1: Chat Message

1. User enters message in web/mobile chat
2. POST /api/chat with message payload
3. Model router classifies task type
4. Ollama generates response
5. Response stored in conversation history
6. UI updates with assistant message
Flow 2: Agent Task Execution

1. User submits task via POST /api/agents/execute
2. If a specific agent is requested → execute directly
3. Otherwise → planner decomposes the task
4. Subtasks distributed to worker agents
5. Results aggregated and returned
Flow 3: Dashboard Monitoring

1. Web app polls GET /api/health every 5s
2. Status displayed: Ollama connection, agent statuses
3. Task stats, conversation count, and uptime shown
| Scenario | Path |
|---|---|
| Simple task | User → Orchestrator → Single Agent → Ollama → Response |
| Complex task | User → Planner → Subtasks → Worker Agents → Aggregated Result |
| Model fallback | Primary model fails → Fallback model → Retry logic |
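The "model fallback" path above can be sketched as a small helper: try the primary model a few times, then hand the prompt to the fallback. This is illustrative only; the function and type names are assumptions, not the actual AgentOS API.

```typescript
// Hypothetical sketch of primary → fallback model selection with retries.
type Generate = (prompt: string) => Promise<string>;

async function generateWithFallback(
  primary: Generate,
  fallback: Generate,
  prompt: string,
  retries = 2,
): Promise<string> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await primary(prompt);
    } catch {
      // primary failed: loop retries it, then we fall through
    }
  }
  // all primary attempts failed; use the fallback model
  return fallback(prompt);
}
```

Keeping the retry budget on the primary small matters here: each failed attempt adds latency before the user sees any response at all.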
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Agent    │────▶│    Tool     │────▶│    Tool     │
│   Request   │     │  Registry   │     │  Execution  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐                                │
│   Result    │◀───────────────────────────────┘
│ Aggregation │
└─────────────┘
```
```
User Prompt → Keyword Analysis → Task Classification → Model Selection → Ollama Call
                                        │
                        ┌───────────────┼───────────────┐
                        ▼               ▼               ▼
                 deepseek-coder       llama3         mistral
                    (coding)        (reasoning)   (conversation)
```
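The keyword-based routing above can be sketched in a few lines. The keyword lists and function signature here are assumptions for illustration; the real selectModel.ts may classify differently.

```typescript
// Illustrative keyword classifier mirroring the routing diagram.
// A valid preferred model always overrides classification (MR-006).
function selectModel(prompt: string, preferred?: string): string {
  const available = ["deepseek-coder", "llama3", "mistral"];
  if (preferred && available.includes(preferred)) return preferred;

  const p = prompt.toLowerCase();
  if (/\b(code|function|implement|debug|refactor)\b/.test(p)) {
    return "deepseek-coder"; // coding tasks
  }
  if (/\b(explain|analyze|analysis|plan|roadmap)\b/.test(p)) {
    return "llama3"; // reasoning / planning / analysis
  }
  return "mistral"; // default: general conversation
}
```

Checking coding keywords before reasoning keywords resolves the MR-001 edge case ("explain" combined with code terms) in favor of the coding model; an invalid preferred model simply falls through to classification.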
3.1 Test Pyramid

```
        ┌─────────────┐
        │     E2E     │  ← Few, slow, comprehensive
        │    Tests    │
    ┌───┴─────────────┴───┐
    │  Integration Tests  │  ← Moderate, critical paths
    └───┬─────────────┬───┘
        │             │
┌───────┴──────┐  ┌───┴────────┐
│  Unit Tests  │  │ API Tests  │  ← Many, fast, isolated
└──────────────┘  └────────────┘
```
3.2 Recommended Testing Tools

| Category | Tool | Purpose |
|---|---|---|
| Unit Tests | Vitest | Fast, modern test runner |
| API Tests | Supertest | HTTP assertions |
| E2E Tests | Playwright | Cross-browser automation |
| Load Tests | k6 or Autocannon | Performance testing |
| Mobile Tests | Detox | React Native E2E |
| Mocking | MSW | API mocking |
| Coverage | c8 | Code coverage |
Unit Tests: Model Router

| Test ID | Scenario | Steps | Expected Result | Edge Cases |
|---|---|---|---|---|
| MR-001 | Coding task routing | Input: "write a function to sort array" | deepseek-coder selected | Keywords like "explain" combined with code |
| MR-002 | Reasoning task routing | Input: "explain how binary search works" | llama3 selected | Ambiguous prompts |
| MR-003 | Conversation routing | Input: "hello, how are you?" | mistral selected | Non-English input |
| MR-004 | Planning task routing | Input: "create a project roadmap" | llama3 selected | Multi-word planning terms |
| MR-005 | Analysis task routing | Input: "analyze the performance metrics" | llama3 selected | Data-related keywords |
| MR-006 | Preferred model override | Input: "code function" + preferred=mistral | mistral selected | Invalid preferred model |
| MR-007 | Fallback model selection | deepseek-coder unavailable | llama3 selected | N/A (tested in integration) |
Unit Tests: Task Queue

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| TQ-001 | Add task | Add new QueuedTask | Task in getAllTasks() |
| TQ-002 | Update status | Update task to 'completed' | Status changed, completedAt set |
| TQ-003 | Get by status | Filter running tasks | Only running tasks returned |
| TQ-004 | Task statistics | Multiple tasks in various states | Accurate counts |
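The TQ-001 through TQ-004 behaviors can be exercised against a minimal in-memory queue. The field and method names below are assumptions inferred from the scenarios, not the actual taskQueue.ts API.

```typescript
// Minimal in-memory task queue sketch covering the four scenarios.
type TaskStatus = "queued" | "running" | "completed" | "failed";

interface QueuedTask {
  id: string;
  status: TaskStatus;
  completedAt?: Date;
}

class TaskQueue {
  private tasks = new Map<string, QueuedTask>();

  add(id: string): QueuedTask {
    const task: QueuedTask = { id, status: "queued" };
    this.tasks.set(id, task);
    return task;
  }

  updateStatus(id: string, status: TaskStatus): void {
    const task = this.tasks.get(id);
    if (!task) throw new Error(`unknown task: ${id}`);
    task.status = status;
    if (status === "completed") task.completedAt = new Date(); // TQ-002
  }

  getAllTasks(): QueuedTask[] {
    return [...this.tasks.values()];
  }

  getByStatus(status: TaskStatus): QueuedTask[] {
    return this.getAllTasks().filter((t) => t.status === status); // TQ-003
  }

  stats(): Record<TaskStatus, number> {
    const counts = { queued: 0, running: 0, completed: 0, failed: 0 };
    for (const t of this.tasks.values()) counts[t.status]++; // TQ-004
    return counts;
  }
}
```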
Unit Tests: Memory System

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| MS-001 | Store memory | Store entry with type 'conversation' | ID returned, entry retrievable |
| MS-002 | Retrieve by query | Store entries, search with query | Relevant entries returned |
| MS-003 | Conversation history | Store multiple messages | Most recent first, limit respected |
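A toy store is enough to pin down the MS-001 through MS-003 contracts. Retrieval here is plain substring matching; the real memory system would presumably use embeddings, so treat every name below as illustrative.

```typescript
// Sketch of the memory interface behaviors: store returns an ID,
// retrieve matches by query, history is newest-first with a limit.
interface MemoryEntry {
  id: string;
  type: string;
  content: string;
}

class MemoryStore {
  private entries: MemoryEntry[] = []; // insertion order = chronological
  private nextId = 1;

  store(type: string, content: string): string {
    const id = String(this.nextId++);
    this.entries.push({ id, type, content });
    return id; // MS-001
  }

  retrieve(query: string): MemoryEntry[] {
    const q = query.toLowerCase();
    return this.entries.filter((e) => e.content.toLowerCase().includes(q)); // MS-002
  }

  history(limit: number): MemoryEntry[] {
    // MS-003: most recent first, limit respected
    return [...this.entries].reverse().slice(0, limit);
  }
}
```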
Unit Tests: Conversation Store

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| CS-001 | Create conversation | New UUID | Conversation created with empty messages |
| CS-002 | Get existing | Use existing ID | Existing conversation returned |
| CS-003 | List all | Multiple conversations | Sorted by updatedAt DESC |
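CS-001 through CS-003 amount to a get-or-create map plus a sorted listing. The shape below is an assumption based on the test table, not the actual conversationStore.ts.

```typescript
// Sketch: get-or-create by ID, list sorted by updatedAt descending.
import { randomUUID } from "node:crypto";

interface Conversation {
  id: string;
  messages: { role: string; content: string }[];
  updatedAt: number;
}

class ConversationStore {
  private conversations = new Map<string, Conversation>();

  getOrCreate(id: string = randomUUID()): Conversation {
    let conv = this.conversations.get(id);
    if (!conv) {
      // CS-001: new conversation starts with empty messages
      conv = { id, messages: [], updatedAt: Date.now() };
      this.conversations.set(id, conv);
    }
    return conv; // CS-002: existing ID returns the same conversation
  }

  list(): Conversation[] {
    // CS-003: most recently updated first
    return [...this.conversations.values()].sort(
      (a, b) => b.updatedAt - a.updatedAt,
    );
  }
}
```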
Integration Tests: API Endpoints

| Test ID | Endpoint | Scenario | Steps | Expected Result |
|---|---|---|---|---|
| API-001 | POST /api/chat | Valid message | Send message with valid payload | 200, success=true, message returned |
| API-002 | POST /api/chat | Empty message | Send "" | 400, validation error |
| API-003 | POST /api/chat | Invalid UUID | conversationId="invalid" | 400, validation error |
| API-004 | POST /api/chat/stream | Streaming enabled | Send stream=true | text/event-stream response |
| API-005 | POST /api/agents/execute | Execute coding task | Send prompt="write hello world" | 200, task executed |
| API-006 | POST /api/agents/execute | Invalid agent | agentId="invalid" | 400, validation error |
| API-007 | GET /api/agents/status | Get all statuses | Call endpoint | 200, all agent statuses |
| API-008 | GET /api/health | Healthy system | Ollama running | status=healthy |
| API-009 | GET /api/health | Ollama down | Stop Ollama | status=degraded |
| API-010 | GET /api/tasks | Get all tasks | Call endpoint | 200, task list with stats |
| API-011 | GET /api/models | List models | Call endpoint | 200, available models |
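The validation rules behind API-001 through API-003 can be isolated as a pure function, which keeps the unit tests free of HTTP plumbing. The real routes likely use a schema library; this sketch only encodes the two rules named in the table.

```typescript
// Pure validation sketch: non-empty message, optional conversationId
// must be a UUID. Returns a list of errors (empty = valid, i.e. 200 path).
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateChatRequest(body: {
  message?: unknown;
  conversationId?: unknown;
}): string[] {
  const errors: string[] = [];
  if (typeof body.message !== "string" || body.message.trim() === "") {
    errors.push("message must be a non-empty string"); // API-002 → 400
  }
  if (
    body.conversationId !== undefined &&
    (typeof body.conversationId !== "string" ||
      !UUID_RE.test(body.conversationId))
  ) {
    errors.push("conversationId must be a valid UUID"); // API-003 → 400
  }
  return errors;
}
```

The route handler would map a non-empty error list to a 400 response; Supertest then only needs to verify the status-code mapping, not re-test the rules.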
Integration Tests: Agent Orchestrator

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| AO-001 | Single agent execution | Execute with agentId="coding" | Single task in queue, completed |
| AO-002 | Planner decomposition | Complex task without agentId | Planner creates subtasks |
| AO-003 | Agent status retrieval | Get agent statuses | All 4 agents returned |
| AO-004 | Invalid agent ID | Execute with invalid ID | 404 or fallback to planner |
Integration Tests: Ollama Client

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| OLL-001 | Chat completion | Send messages to llama3 | Response content returned |
| OLL-002 | Streaming | Stream chat response | Multiple chunks received |
| OLL-003 | Model list | List available models | Array of model names |
| OLL-004 | Connection check | Check Ollama connectivity | True/False |
| OLL-005 | Embedding generation | Generate embedding | Float array returned |
E2E-001: Complete Chat Flow

| Step | Action | Expected |
|---|---|---|
| 1 | Open web app | Dashboard loads |
| 2 | Navigate to Chat | Chat page visible |
| 3 | Type message "Hello" | Message appears in UI |
| 4 | Press send | Loading indicator shows |
| 5 | Receive response | Assistant message appears |
| 6 | Check conversation history | GET /api/chat/conversations shows new entry |
E2E-002: Agent Task Execution

| Step | Action | Expected |
|---|---|---|
| 1 | POST /api/agents/execute with coding task | Task queued |
| 2 | GET /api/tasks | Task in running state |
| 3 | Wait for completion | Task completed |
| 4 | GET /api/agents/status | Agent status updated |
E2E-003: Model Routing Verification

| Step | Action | Expected |
|---|---|---|
| 1 | POST /api/chat with coding prompt | deepseek-coder used |
| 2 | Check response metadata | modelUsed="deepseek-coder" |
| 3 | POST /api/chat with conversational prompt | mistral used |
E2E-004: Dashboard Monitoring

| Step | Action | Expected |
|---|---|---|
| 1 | Navigate to Dashboard | Stats visible |
| 2 | Trigger agent task | Running tasks count updates |
| 3 | Wait for completion | Completed count increments |
| 4 | Check agent statuses | All agent statuses displayed |
Mobile Tests

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| MOB-001 | App launch | Start Expo app | Home screen renders |
| MOB-002 | API connectivity | Fetch /api/health | Connection status shown |
| MOB-003 | Tab navigation | Switch between tabs | Content switches |
| MOB-004 | Chat interface | Send message | Response received |
| MOB-005 | Offline handling | Disconnect backend | Error message shown |
5.1 Edge Cases

| ID | Edge Case | Risk Level | Mitigation |
|---|---|---|---|
| EC-001 | Ollama not running | 🔴 Critical | Health check, graceful degradation |
| EC-002 | Model not installed | 🔴 Critical | Validate model availability before execution |
| EC-003 | LLM timeout | 🔴 Critical | Request timeout (30s), retry logic |
| EC-004 | Agent execution loop | 🔴 Critical | Max iteration limit, task complexity check |
| EC-005 | Memory overflow | 🟠 High | Conversation history limit (100 messages) |
| EC-006 | Concurrent requests | 🟡 Medium | Rate limiting, queue management |
| EC-007 | Invalid task decomposition | 🟡 Medium | Validate subtask structure |
| EC-008 | Network interruption | 🟡 Medium | Retry with exponential backoff |
| EC-009 | Large payload | 🟡 Medium | Request size limit (10 MB) |
| EC-010 | Model hallucination | 🟡 Medium | Validate output format |
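EC-008's mitigation (retry with exponential backoff) is small enough to sketch directly. The attempt count and base delay are illustrative defaults, not values from the codebase.

```typescript
// Retry with exponential backoff: delays grow as base, 2*base, 4*base, …
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // back off before the next attempt (100ms, 200ms, 400ms, …)
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // all attempts exhausted
}
```

In a test, a flaky stub that fails a fixed number of times verifies both the retry count and the eventual success.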
5.2 Race Condition Scenarios

| Scenario | Test Approach |
|---|---|
| Multiple agents accessing queue | Concurrent task submission |
| Simultaneous model requests | Load test with multiple requests |
| Memory race conditions | Rapid store/retrieve operations |
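The "concurrent task submission" approach boils down to firing N submissions in parallel and asserting none are lost. The queue below is a stand-in (Node's single-threaded event loop makes true lost-write races unlikely here, so this mainly guards against async bookkeeping bugs); names are illustrative.

```typescript
// Concurrency smoke test: submit N tasks in parallel, expect N stored.
class SimpleQueue {
  private tasks: string[] = [];

  async submit(id: string): Promise<void> {
    // simulate an async hop (e.g. validation) before the write
    await Promise.resolve();
    this.tasks.push(id);
  }

  size(): number {
    return this.tasks.length;
  }
}

async function concurrentSubmissionTest(n = 50): Promise<number> {
  const queue = new SimpleQueue();
  await Promise.all(
    Array.from({ length: n }, (_, i) => queue.submit(`task-${i}`)),
  );
  return queue.size(); // expect n: no submissions dropped
}
```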
6.1 Test Execution Matrix

| Test Type | Frequency | Environment | CI/CD |
|---|---|---|---|
| Unit Tests | Every PR | Local + CI | ✅ GitHub Actions |
| Integration | Every PR | Staging | ✅ GitHub Actions |
| E2E (Web) | Every Release | Production-like | ✅ GitHub Actions |
| E2E (Mobile) | Every Release | Device Farm | ⚠️ Manual/Detox |
| Performance | Weekly | Dedicated | ⚠️ Manual |
| Security | Monthly | CI | ⚠️ Manual |
```yaml
# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit
      - run: npm run test:coverage
  integration-tests:
    runs-on: ubuntu-latest
    services:
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration
  e2e-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e
```
```
tests/
├── unit/
│   ├── router/
│   │   └── selectModel.test.ts
│   ├── agents/
│   │   ├── orchestrator.test.ts
│   │   └── taskQueue.test.ts
│   ├── memory/
│   │   ├── memorySystem.test.ts
│   │   └── conversationStore.test.ts
│   └── api/
│       └── validation.test.ts
├── integration/
│   ├── api/
│   │   ├── chat.test.ts
│   │   ├── agents.test.ts
│   │   └── health.test.ts
│   └── ollama/
│       └── ollama.test.ts
├── e2e/
│   ├── web/
│   │   ├── chat.test.ts
│   │   ├── dashboard.test.ts
│   │   └── agents.test.ts
│   └── mobile/
│       └── app.test.ts
└── fixtures/
    ├── conversations.json
    └── tasks.json
```
7.1 Component Risk Matrix

| Component | Failure Impact | Probability | Priority | Test Focus |
|---|---|---|---|---|
| AI Gateway | System down | Medium | P0 | Health, connectivity |
| Model Router | Wrong model | Medium | P0 | Routing logic |
| Orchestrator | Task failures | Medium | P0 | Execution flow |
| Ollama Client | No LLM responses | High | P0 | Connection, fallback |
| Task Queue | Lost tasks | Low | P1 | Persistence |
| Memory System | Data loss | Medium | P1 | Storage, retrieval |
| Web Dashboard | UI broken | Low | P2 | E2E tests |
| Mobile App | App crash | Low | P2 | Basic functionality |
7.2 Critical Success Criteria

- ✅ All API endpoints return correct status codes
- ✅ Model routing accuracy > 95%
- ✅ Agent task completion rate > 99%
- ✅ Health check reflects actual system state
- ✅ No data loss in conversation history
- ✅ Response time < 5s for simple queries
8. Architecture Improvements & Recommendations
8.1 Missing Production Requirements

| ID | Requirement | Impact | Recommendation |
|---|---|---|---|
| REQ-001 | Authentication | Security | Add JWT/API key auth |
| REQ-002 | Rate Limiting | Stability | Implement rate limiter |
| REQ-003 | Input Sanitization | Security | Add XSS protection |
| REQ-004 | Persistent Storage | Data | Implement SQLite/Chroma |
| REQ-005 | WebSocket Support | Real-time | Add WebSocket for live updates |
| REQ-006 | Error Recovery | Reliability | Implement retry logic |
| REQ-007 | Metrics/Tracing | Observability | Add OpenTelemetry |
| REQ-008 | Caching | Performance | Add Redis cache layer |
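For REQ-002, a token bucket is the usual starting point: each request consumes a token, and tokens refill at a steady rate. This is an illustrative sketch; in the Express app a middleware package such as express-rate-limit would more likely be used.

```typescript
// Token-bucket rate limiter sketch. Time is passed in explicitly so the
// refill logic is deterministic and testable.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,      // burst size
    private refillPerSec: number,  // sustained rate
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  allow(now = Date.now()): boolean {
    // refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request admitted
    }
    return false; // over limit → respond 429
  }
}
```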
8.2 Testing Gaps

| Gap | Current State | Recommended |
|---|---|---|
| Test Coverage | None | Target 80% |
| Mobile Tests | Manual | Add Detox |
| Performance Tests | None | Add k6 |
| Security Tests | None | Add OWASP tests |
| Visual Regression | None | Add Chromatic |
8.3 Suggested Package Additions
```json
{
  "devDependencies": {
    "vitest": "^2.0.0",
    "@vitest/coverage-v8": "^2.0.0",
    "supertest": "^7.0.0",
    "@playwright/test": "^1.45.0",
    "msw": "^2.3.0",
    "detox": "^20.0.0"
  }
}
```

Note: k6 is distributed as a standalone binary, not an npm package, so it is installed separately rather than listed in devDependencies.
9. Test Case Summary

| Category | Count | Priority |
|---|---|---|
| Unit Tests | 30+ | High |
| Integration Tests | 25+ | High |
| E2E Tests (Web) | 15+ | Medium |
| E2E Tests (Mobile) | 5+ | Medium |
| Edge Case Tests | 10+ | High |
| **Total** | **85+** | |
Next Steps

1. Set up testing infrastructure: add Vitest, Supertest, Playwright
2. Write unit tests: start with the model router and task queue
3. Add integration tests: cover all API endpoints
4. Implement E2E tests: cover critical user flows
5. Configure CI/CD: GitHub Actions pipeline
6. Add test coverage reporting: target 80%+
7. Mobile testing: set up Detox for React Native
Report generated by Senior Staff Engineer - AI Systems Architect
AgentOS v0.1.0 | March 2026