AgentOS - Comprehensive Test Strategy
AI Agent Platform | Production Validation Report
1. System Architecture Summary
AgentOS is a multi-agent AI operating system built as a monorepo with:
| Layer | Technology | Port |
|---|---|---|
| Web Client | Next.js 14 | 3000 |
| Mobile Client | Expo React Native | 8081 |
| AI Backend | Express.js + TypeScript | 4000 |
| LLM Engine | Ollama (local) | 11434 |
Backend Services (services/ai-backend/)

```
src/
├── index.ts                  # Express server entry
├── api/routes/
│   ├── chat.ts               # POST /chat, /chat/stream
│   ├── agents.ts             # POST /agents/execute, GET /agents/status
│   ├── health.ts             # GET /health, /health/logs
│   ├── models.ts             # GET /models
│   └── tasks.ts              # GET /tasks
├── agents/
│   ├── orchestrator.ts       # Agent coordination & task routing
│   ├── baseAgent.ts          # Agent interface
│   ├── planner/index.ts      # Task decomposition
│   ├── coding/index.ts       # Code generation
│   ├── research/index.ts     # Information gathering
│   ├── execution/index.ts    # Command execution
│   └── taskQueue.ts          # Task lifecycle management
├── router/
│   └── selectModel.ts        # Model routing logic
├── memory/
│   ├── index.ts              # Memory storage interface
│   └── conversationStore.ts  # Chat history
├── tools/
│   ├── index.ts              # Tool registry
│   └── webSearch.ts          # Web search tool
├── lib/
│   └── ollama.ts             # Ollama client wrapper
└── utils/
    └── logger.ts             # Logging utility
```
Web Dashboard (apps/web/)

- Dashboard page with system stats
- AI Chat interface with streaming
- Agent fleet status monitoring
- Task queue visualization
- Model availability display
- Real-time logs viewer

Mobile App (apps/mobile/)

- Expo React Native with tab navigation
- API integration for chat and monitoring
```
┌──────────────┐      ┌──────────────┐
│   Web App    │      │  Mobile App  │
│  (Next.js)   │      │    (Expo)    │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └─────────┬───────────┘
                 │
                 ▼
        ┌──────────────────┐
        │     REST API     │
        │ (Express :4000)  │
        └────────┬─────────┘
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
   ┌───────┐ ┌───────┐ ┌───────┐
   │ /chat │ │/agents│ │/health│
   └───┬───┘ └───┬───┘ └───┬───┘
       │         │         │
       ▼         ▼         ▼
┌──────────────────────────────────────┐
│           AI Gateway Layer           │
│  ┌────────────┐  ┌────────────────┐  │
│  │   Model    │  │  Orchestrator  │  │
│  │   Router   │  │                │  │
│  └─────┬──────┘  └────────┬───────┘  │
└────────┼──────────────────┼──────────┘
         │                  │
         ▼                  ▼
   ┌─────────────────────────────┐
   │       Agent Execution       │
   │  ┌─────┐  ┌─────┐  ┌─────┐  │
   │  │Plan-│  │Coder│  │Rsrch│  │
   │  │ ner │  │     │  │     │  │
   │  └──┬──┘  └──┬──┘  └──┬──┘  │
   └─────┼────────┼────────┼─────┘
         │        │        │
         ▼        ▼        ▼
  ┌─────────────────────────────────┐
  │       Ollama (Local LLM)        │
  │ ┌────────┐ ┌────────┐ ┌─────┐   │
  │ │ Llama3 │ │Mistral │ │Deep │   │
  │ │        │ │        │ │Seek │   │
  │ └────────┘ └────────┘ └─────┘   │
  └────────────────┬────────────────┘
                   │
                   ▼
  ┌─────────────────────────────────┐
  │          Memory System          │
  │ ┌────────────┐ ┌──────────────┐ │
  │ │Conversation│ │ Vector Store │ │
  │ │   Store    │ │   (Chroma)   │ │
  │ └────────────┘ └──────────────┘ │
  └─────────────────────────────────┘
```
Flow 1: Chat Message

1. User enters message in web/mobile chat
2. POST /api/chat with message payload
3. Model router classifies task type
4. Ollama generates response
5. Response stored in conversation history
6. UI updates with assistant message
Flow 2: Agent Task Execution

1. User submits task via POST /api/agents/execute
2. If a specific agent is requested → execute directly
3. Otherwise → planner decomposes the task
4. Subtasks distributed to worker agents
5. Results aggregated and returned
Flow 3: Dashboard Monitoring

1. Web app polls GET /api/health every 5s
2. Status displayed: Ollama connection, agent statuses
3. Task stats, conversation count, and uptime shown
| Scenario | Path |
|---|---|
| Simple task | User → Orchestrator → Single Agent → Ollama → Response |
| Complex task | User → Planner → Subtasks → Worker Agents → Aggregated Result |
| Model fallback | Primary model fails → Fallback model → Retry logic |
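The "model fallback" path above can be sketched as a small helper: try the primary model a few times, then hand the prompt to the fallback. This is illustrative only; the function and type names are assumptions, not the actual AgentOS API.

```typescript
// Hypothetical sketch of primary → fallback model selection with retries.
type Generate = (prompt: string) => Promise<string>;

async function generateWithFallback(
  primary: Generate,
  fallback: Generate,
  prompt: string,
  retries = 2,
): Promise<string> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await primary(prompt);
    } catch {
      // primary failed: loop retries it, then we fall through
    }
  }
  // all primary attempts failed; use the fallback model
  return fallback(prompt);
}
```

Keeping the retry budget on the primary small matters here: each failed attempt adds latency before the user sees any response at all.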
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Agent    │────▶│    Tool     │────▶│    Tool     │
│   Request   │     │  Registry   │     │  Execution  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐                                │
│   Result    │◀───────────────────────────────┘
│ Aggregation │
└─────────────┘
```
```
User Prompt → Keyword Analysis → Task Classification → Model Selection → Ollama Call
                                        │
                        ┌───────────────┼───────────────┐
                        ▼               ▼               ▼
                 deepseek-coder       llama3         mistral
                    (coding)        (reasoning)   (conversation)
```
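The keyword-based routing above can be sketched in a few lines. The keyword lists and function signature here are assumptions for illustration; the real selectModel.ts may classify differently.

```typescript
// Illustrative keyword classifier mirroring the routing diagram.
// A valid preferred model always overrides classification (MR-006).
function selectModel(prompt: string, preferred?: string): string {
  const available = ["deepseek-coder", "llama3", "mistral"];
  if (preferred && available.includes(preferred)) return preferred;

  const p = prompt.toLowerCase();
  if (/\b(code|function|implement|debug|refactor)\b/.test(p)) {
    return "deepseek-coder"; // coding tasks
  }
  if (/\b(explain|analyze|analysis|plan|roadmap)\b/.test(p)) {
    return "llama3"; // reasoning / planning / analysis
  }
  return "mistral"; // default: general conversation
}
```

Checking coding keywords before reasoning keywords resolves the MR-001 edge case ("explain" combined with code terms) in favor of the coding model; an invalid preferred model simply falls through to classification.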
3.1 Test Pyramid

```
        ┌─────────────┐
        │     E2E     │  ← Few, slow, comprehensive
        │    Tests    │
    ┌───┴─────────────┴───┐
    │  Integration Tests  │  ← Moderate, critical paths
    └───┬─────────────┬───┘
        │             │
┌───────┴──────┐  ┌───┴────────┐
│  Unit Tests  │  │ API Tests  │  ← Many, fast, isolated
└──────────────┘  └────────────┘
```
3.2 Recommended Testing Tools

| Category | Tool | Purpose |
|---|---|---|
| Unit Tests | Vitest | Fast, modern test runner |
| API Tests | Supertest | HTTP assertions |
| E2E Tests | Playwright | Cross-browser automation |
| Load Tests | k6 or Autocannon | Performance testing |
| Mobile Tests | Detox | React Native E2E |
| Mocking | MSW | API mocking |
| Coverage | c8 | Code coverage |
Unit Tests: Model Router

| Test ID | Scenario | Steps | Expected Result | Edge Cases |
|---|---|---|---|---|
| MR-001 | Coding task routing | Input: "write a function to sort array" | deepseek-coder selected | Keywords like "explain" combined with code |
| MR-002 | Reasoning task routing | Input: "explain how binary search works" | llama3 selected | Ambiguous prompts |
| MR-003 | Conversation routing | Input: "hello, how are you?" | mistral selected | Non-English input |
| MR-004 | Planning task routing | Input: "create a project roadmap" | llama3 selected | Multi-word planning terms |
| MR-005 | Analysis task routing | Input: "analyze the performance metrics" | llama3 selected | Data-related keywords |
| MR-006 | Preferred model override | Input: "code function" + preferred=mistral | mistral selected | Invalid preferred model |
| MR-007 | Fallback model selection | deepseek-coder unavailable | llama3 selected | N/A (tested in integration) |
Unit Tests: Task Queue

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| TQ-001 | Add task | Add new QueuedTask | Task in getAllTasks() |
| TQ-002 | Update status | Update task to 'completed' | Status changed, completedAt set |
| TQ-003 | Get by status | Filter running tasks | Only running tasks returned |
| TQ-004 | Task statistics | Multiple tasks in various states | Accurate counts |
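The TQ-001 through TQ-004 behaviors can be exercised against a minimal in-memory queue. The field and method names below are assumptions inferred from the scenarios, not the actual taskQueue.ts API.

```typescript
// Minimal in-memory task queue sketch covering the four scenarios.
type TaskStatus = "queued" | "running" | "completed" | "failed";

interface QueuedTask {
  id: string;
  status: TaskStatus;
  completedAt?: Date;
}

class TaskQueue {
  private tasks = new Map<string, QueuedTask>();

  add(id: string): QueuedTask {
    const task: QueuedTask = { id, status: "queued" };
    this.tasks.set(id, task);
    return task;
  }

  updateStatus(id: string, status: TaskStatus): void {
    const task = this.tasks.get(id);
    if (!task) throw new Error(`unknown task: ${id}`);
    task.status = status;
    if (status === "completed") task.completedAt = new Date(); // TQ-002
  }

  getAllTasks(): QueuedTask[] {
    return [...this.tasks.values()];
  }

  getByStatus(status: TaskStatus): QueuedTask[] {
    return this.getAllTasks().filter((t) => t.status === status); // TQ-003
  }

  stats(): Record<TaskStatus, number> {
    const counts = { queued: 0, running: 0, completed: 0, failed: 0 };
    for (const t of this.tasks.values()) counts[t.status]++; // TQ-004
    return counts;
  }
}
```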
Unit Tests: Memory System

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| MS-001 | Store memory | Store entry with type 'conversation' | ID returned, entry retrievable |
| MS-002 | Retrieve by query | Store entries, search with query | Relevant entries returned |
| MS-003 | Conversation history | Store multiple messages | Most recent first, limit respected |
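A toy store is enough to pin down the MS-001 through MS-003 contracts. Retrieval here is plain substring matching; the real memory system would presumably use embeddings, so treat every name below as illustrative.

```typescript
// Sketch of the memory interface behaviors: store returns an ID,
// retrieve matches by query, history is newest-first with a limit.
interface MemoryEntry {
  id: string;
  type: string;
  content: string;
}

class MemoryStore {
  private entries: MemoryEntry[] = []; // insertion order = chronological
  private nextId = 1;

  store(type: string, content: string): string {
    const id = String(this.nextId++);
    this.entries.push({ id, type, content });
    return id; // MS-001
  }

  retrieve(query: string): MemoryEntry[] {
    const q = query.toLowerCase();
    return this.entries.filter((e) => e.content.toLowerCase().includes(q)); // MS-002
  }

  history(limit: number): MemoryEntry[] {
    // MS-003: most recent first, limit respected
    return [...this.entries].reverse().slice(0, limit);
  }
}
```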
Unit Tests: Conversation Store

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| CS-001 | Create conversation | New UUID | Conversation created with empty messages |
| CS-002 | Get existing | Use existing ID | Existing conversation returned |
| CS-003 | List all | Multiple conversations | Sorted by updatedAt DESC |
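CS-001 through CS-003 amount to a get-or-create map plus a sorted listing. The shape below is an assumption based on the test table, not the actual conversationStore.ts.

```typescript
// Sketch: get-or-create by ID, list sorted by updatedAt descending.
import { randomUUID } from "node:crypto";

interface Conversation {
  id: string;
  messages: { role: string; content: string }[];
  updatedAt: number;
}

class ConversationStore {
  private conversations = new Map<string, Conversation>();

  getOrCreate(id: string = randomUUID()): Conversation {
    let conv = this.conversations.get(id);
    if (!conv) {
      // CS-001: new conversation starts with empty messages
      conv = { id, messages: [], updatedAt: Date.now() };
      this.conversations.set(id, conv);
    }
    return conv; // CS-002: existing ID returns the same conversation
  }

  list(): Conversation[] {
    // CS-003: most recently updated first
    return [...this.conversations.values()].sort(
      (a, b) => b.updatedAt - a.updatedAt,
    );
  }
}
```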
Integration Tests: API Endpoints

| Test ID | Endpoint | Scenario | Steps | Expected Result |
|---|---|---|---|---|
| API-001 | POST /api/chat | Valid message | Send message with valid payload | 200, success=true, message returned |
| API-002 | POST /api/chat | Empty message | Send "" | 400, validation error |
| API-003 | POST /api/chat | Invalid UUID | conversationId="invalid" | 400, validation error |
| API-004 | POST /api/chat/stream | Streaming enabled | Send stream=true | text/event-stream response |
| API-005 | POST /api/agents/execute | Execute coding task | Send prompt="write hello world" | 200, task executed |
| API-006 | POST /api/agents/execute | Invalid agent | agentId="invalid" | 400, validation error |
| API-007 | GET /api/agents/status | Get all statuses | Call endpoint | 200, all agent statuses |
| API-008 | GET /api/health | Healthy system | Ollama running | status=healthy |
| API-009 | GET /api/health | Ollama down | Stop Ollama | status=degraded |
| API-010 | GET /api/tasks | Get all tasks | Call endpoint | 200, task list with stats |
| API-011 | GET /api/models | List models | Call endpoint | 200, available models |
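The validation rules behind API-001 through API-003 can be isolated as a pure function, which keeps the unit tests free of HTTP plumbing. The real routes likely use a schema library; this sketch only encodes the two rules named in the table.

```typescript
// Pure validation sketch: non-empty message, optional conversationId
// must be a UUID. Returns a list of errors (empty = valid, i.e. 200 path).
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateChatRequest(body: {
  message?: unknown;
  conversationId?: unknown;
}): string[] {
  const errors: string[] = [];
  if (typeof body.message !== "string" || body.message.trim() === "") {
    errors.push("message must be a non-empty string"); // API-002 → 400
  }
  if (
    body.conversationId !== undefined &&
    (typeof body.conversationId !== "string" ||
      !UUID_RE.test(body.conversationId))
  ) {
    errors.push("conversationId must be a valid UUID"); // API-003 → 400
  }
  return errors;
}
```

The route handler would map a non-empty error list to a 400 response; Supertest then only needs to verify the status-code mapping, not re-test the rules.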
Integration Tests: Agent Orchestrator

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| AO-001 | Single agent execution | Execute with agentId="coding" | Single task in queue, completed |
| AO-002 | Planner decomposition | Complex task without agentId | Planner creates subtasks |
| AO-003 | Agent status retrieval | Get agent statuses | All 4 agents returned |
| AO-004 | Invalid agent ID | Execute with invalid ID | 404 or fallback to planner |
Integration Tests: Ollama Client

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| OLL-001 | Chat completion | Send messages to llama3 | Response content returned |
| OLL-002 | Streaming | Stream chat response | Multiple chunks received |
| OLL-003 | Model list | List available models | Array of model names |
| OLL-004 | Connection check | Check Ollama connectivity | True/False |
| OLL-005 | Embedding generation | Generate embedding | Float array returned |
E2E-001: Complete Chat Flow

| Step | Action | Expected |
|---|---|---|
| 1 | Open web app | Dashboard loads |
| 2 | Navigate to Chat | Chat page visible |
| 3 | Type message "Hello" | Message appears in UI |
| 4 | Press send | Loading indicator shows |
| 5 | Receive response | Assistant message appears |
| 6 | Check conversation history | GET /api/chat/conversations shows new entry |
E2E-002: Agent Task Execution

| Step | Action | Expected |
|---|---|---|
| 1 | POST /api/agents/execute with coding task | Task queued |
| 2 | GET /api/tasks | Task in running state |
| 3 | Wait for completion | Task completed |
| 4 | GET /api/agents/status | Agent status updated |
E2E-003: Model Routing Verification

| Step | Action | Expected |
|---|---|---|
| 1 | POST /api/chat with coding prompt | deepseek-coder used |
| 2 | Check response metadata | modelUsed="deepseek-coder" |
| 3 | POST /api/chat with conversational prompt | mistral used |
E2E-004: Dashboard Monitoring

| Step | Action | Expected |
|---|---|---|
| 1 | Navigate to Dashboard | Stats visible |
| 2 | Trigger agent task | Running tasks count updates |
| 3 | Wait for completion | Completed count increments |
| 4 | Check agent statuses | All agent statuses displayed |
Mobile Tests

| Test ID | Scenario | Steps | Expected Result |
|---|---|---|---|
| MOB-001 | App launch | Start Expo app | Home screen renders |
| MOB-002 | API connectivity | Fetch /api/health | Connection status shown |
| MOB-003 | Tab navigation | Switch between tabs | Content switches |
| MOB-004 | Chat interface | Send message | Response received |
| MOB-005 | Offline handling | Disconnect backend | Error message shown |
5.1 Edge Cases

| ID | Edge Case | Risk Level | Mitigation |
|---|---|---|---|
| EC-001 | Ollama not running | 🔴 Critical | Health check, graceful degradation |
| EC-002 | Model not installed | 🔴 Critical | Validate model availability before execution |
| EC-003 | LLM timeout | 🔴 Critical | Request timeout (30s), retry logic |
| EC-004 | Agent execution loop | 🔴 Critical | Max iteration limit, task complexity check |
| EC-005 | Memory overflow | 🟠 High | Conversation history limit (100 messages) |
| EC-006 | Concurrent requests | 🟡 Medium | Rate limiting, queue management |
| EC-007 | Invalid task decomposition | 🟡 Medium | Validate subtask structure |
| EC-008 | Network interruption | 🟡 Medium | Retry with exponential backoff |
| EC-009 | Large payload | 🟡 Medium | Request size limit (10 MB) |
| EC-010 | Model hallucination | 🟡 Medium | Validate output format |
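EC-008's mitigation (retry with exponential backoff) is small enough to sketch directly. The attempt count and base delay are illustrative defaults, not values from the codebase.

```typescript
// Retry with exponential backoff: delays grow as base, 2*base, 4*base, …
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // back off before the next attempt (100ms, 200ms, 400ms, …)
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // all attempts exhausted
}
```

In a test, a flaky stub that fails a fixed number of times verifies both the retry count and the eventual success.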
5.2 Race Condition Scenarios

| Scenario | Test Approach |
|---|---|
| Multiple agents accessing queue | Concurrent task submission |
| Simultaneous model requests | Load test with multiple requests |
| Memory race conditions | Rapid store/retrieve operations |
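The "concurrent task submission" approach boils down to firing N submissions in parallel and asserting none are lost. The queue below is a stand-in (Node's single-threaded event loop makes true lost-write races unlikely here, so this mainly guards against async bookkeeping bugs); names are illustrative.

```typescript
// Concurrency smoke test: submit N tasks in parallel, expect N stored.
class SimpleQueue {
  private tasks: string[] = [];

  async submit(id: string): Promise<void> {
    // simulate an async hop (e.g. validation) before the write
    await Promise.resolve();
    this.tasks.push(id);
  }

  size(): number {
    return this.tasks.length;
  }
}

async function concurrentSubmissionTest(n = 50): Promise<number> {
  const queue = new SimpleQueue();
  await Promise.all(
    Array.from({ length: n }, (_, i) => queue.submit(`task-${i}`)),
  );
  return queue.size(); // expect n: no submissions dropped
}
```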
6.1 Test Execution Matrix

| Test Type | Frequency | Environment | CI/CD |
|---|---|---|---|
| Unit Tests | Every PR | Local + CI | ✅ GitHub Actions |
| Integration | Every PR | Staging | ✅ GitHub Actions |
| E2E (Web) | Every Release | Production-like | ✅ GitHub Actions |
| E2E (Mobile) | Every Release | Device Farm | ⚠️ Manual/Detox |
| Performance | Weekly | Dedicated | ⚠️ Manual |
| Security | Monthly | CI | ⚠️ Manual |
```yaml
# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit
      - run: npm run test:coverage
  integration-tests:
    runs-on: ubuntu-latest
    services:
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration
  e2e-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e
```
```
tests/
├── unit/
│   ├── router/
│   │   └── selectModel.test.ts
│   ├── agents/
│   │   ├── orchestrator.test.ts
│   │   └── taskQueue.test.ts
│   ├── memory/
│   │   ├── memorySystem.test.ts
│   │   └── conversationStore.test.ts
│   └── api/
│       └── validation.test.ts
├── integration/
│   ├── api/
│   │   ├── chat.test.ts
│   │   ├── agents.test.ts
│   │   └── health.test.ts
│   └── ollama/
│       └── ollama.test.ts
├── e2e/
│   ├── web/
│   │   ├── chat.test.ts
│   │   ├── dashboard.test.ts
│   │   └── agents.test.ts
│   └── mobile/
│       └── app.test.ts
└── fixtures/
    ├── conversations.json
    └── tasks.json
```
7.1 Component Risk Matrix

| Component | Failure Impact | Probability | Priority | Test Focus |
|---|---|---|---|---|
| AI Gateway | System down | Medium | P0 | Health, connectivity |
| Model Router | Wrong model | Medium | P0 | Routing logic |
| Orchestrator | Task failures | Medium | P0 | Execution flow |
| Ollama Client | No LLM responses | High | P0 | Connection, fallback |
| Task Queue | Lost tasks | Low | P1 | Persistence |
| Memory System | Data loss | Medium | P1 | Storage, retrieval |
| Web Dashboard | UI broken | Low | P2 | E2E tests |
| Mobile App | App crash | Low | P2 | Basic functionality |
7.2 Critical Success Criteria

- ✅ All API endpoints return correct status codes
- ✅ Model routing accuracy > 95%
- ✅ Agent task completion rate > 99%
- ✅ Health check reflects actual system state
- ✅ No data loss in conversation history
- ✅ Response time < 5s for simple queries
8. Architecture Improvements & Recommendations
8.1 Missing Production Requirements

| ID | Requirement | Impact | Recommendation |
|---|---|---|---|
| REQ-001 | Authentication | Security | Add JWT/API key auth |
| REQ-002 | Rate Limiting | Stability | Implement rate limiter |
| REQ-003 | Input Sanitization | Security | Add XSS protection |
| REQ-004 | Persistent Storage | Data | Implement SQLite/Chroma |
| REQ-005 | WebSocket Support | Real-time | Add WebSocket for live updates |
| REQ-006 | Error Recovery | Reliability | Implement retry logic |
| REQ-007 | Metrics/Tracing | Observability | Add OpenTelemetry |
| REQ-008 | Caching | Performance | Add Redis cache layer |
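For REQ-002, a token bucket is the usual starting point: each request consumes a token, and tokens refill at a steady rate. This is an illustrative sketch; in the Express app a middleware package such as express-rate-limit would more likely be used.

```typescript
// Token-bucket rate limiter sketch. Time is passed in explicitly so the
// refill logic is deterministic and testable.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,      // burst size
    private refillPerSec: number,  // sustained rate
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  allow(now = Date.now()): boolean {
    // refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request admitted
    }
    return false; // over limit → respond 429
  }
}
```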
8.2 Testing Gaps

| Gap | Current State | Recommended |
|---|---|---|
| Test Coverage | None | Target 80% |
| Mobile Tests | Manual | Add Detox |
| Performance Tests | None | Add k6 |
| Security Tests | None | Add OWASP tests |
| Visual Regression | None | Add Chromatic |
8.3 Suggested Package Additions
```json
{
  "devDependencies": {
    "vitest": "^2.0.0",
    "@vitest/coverage-v8": "^2.0.0",
    "supertest": "^7.0.0",
    "@playwright/test": "^1.45.0",
    "msw": "^2.3.0",
    "detox": "^20.0.0"
  }
}
```

Note: k6 is distributed as a standalone binary, not an npm package, so it is installed separately rather than listed in devDependencies.
9. Test Case Summary

| Category | Count | Priority |
|---|---|---|
| Unit Tests | 30+ | High |
| Integration Tests | 25+ | High |
| E2E Tests (Web) | 15+ | Medium |
| E2E Tests (Mobile) | 5+ | Medium |
| Edge Case Tests | 10+ | High |
| **Total** | **85+** | |
Next Steps

1. Set up testing infrastructure: add Vitest, Supertest, Playwright
2. Write unit tests: start with the model router and task queue
3. Add integration tests: cover all API endpoints
4. Implement E2E tests: cover critical user flows
5. Configure CI/CD: GitHub Actions pipeline
6. Add test coverage reporting: target 80%+
7. Mobile testing: set up Detox for React Native
Report generated by Senior Staff Engineer - AI Systems Architect
AgentOS v0.1.0 | March 2026