
Commit 9e18876

authored
feat(eval): add universal evaluation framework (#21)
* feat(eval): add universal evaluation framework

  Introduce a top-level eval/ package with composable Scorer interface, Observation types, experiment runner (A/B comparison), text quality metrics, LLM-as-judge, and on-disk format. Subsystem-specific scorers live alongside their domains (agent/eval/, knowledge/eval/, rag/eval/) while the core framework stays dependency-free.

* docs: add eval framework section to README

  Document the universal evaluation framework including core abstractions, built-in scorers across all subsystems, A/B experiment runner, stream timing collection, and on-disk format.
1 parent 29f7794 commit 9e18876

25 files changed

Lines changed: 2430 additions & 1 deletion

README.md

Lines changed: 122 additions & 1 deletion
@@ -39,6 +39,7 @@
- **4 LLM providers** (Ollama, OpenAI, Anthropic, Google) behind one `Provider` interface
- **Provider resilience** — retry + fallback composition out of the box
- **Structured output** — constrain LLM responses to JSON schema
- **Universal evaluation** — composable `Scorer` interface, A/B experiment runner, text quality metrics, LLM-as-judge, and subsystem-specific scorers for agent, RAG, and knowledge graph

### Why one SDK?

@@ -192,6 +193,7 @@ fmt.Println(result.AssembledContext.Prompt) // context with citations
- [agent — AI Agent Framework](#agent--ai-agent-framework) (providers, deltas, tools, sub-agents, markers, feedback/RLHF, compaction, tree, TUI)
- [kg — Knowledge Graph SDK](#kg--knowledge-graph-sdk)
- [rag — RAG Pipeline SDK](#rag--rag-pipeline-sdk)
- [eval — Universal Evaluation Framework](#eval--universal-evaluation-framework)
- [Examples](#examples)
- [Agent Skill](#agent-skill)

@@ -577,7 +579,7 @@ rag.WithHyDE(myLLM, 3) // generate 3 hypothetical docs

### Evaluation Metrics

9 metrics across retrieval, generation, and end-to-end evaluation. These are also available as composable `Scorer` adapters for the [universal eval framework](#eval--universal-evaluation-framework) — see `rag/eval` scorer functions like `ContextPrecisionScorer()`, `FaithfulnessScorer()`, etc.

| Metric | Type | Description |
|--------|------|-------------|
@@ -635,6 +637,125 @@ kgTools := kgtool.NewTools(graph)

---

## eval — Universal Evaluation Framework

Composable evaluation framework that works across all SAIGE subsystems. The core `eval/` package has zero subsystem dependencies — subsystem-specific scorers live alongside their domains.

### Core Abstractions

| Type | Purpose |
|------|---------|
| `Observation` | Universal eval case — Input, Output, GroundTruth as `json.RawMessage`, typed Annotations map |
| `Scorer` | Interface computing a named metric from an Observation |
| `Subject` | Function that populates an Observation's Output and Annotations |
| `Score` | Named metric value with optional reason |
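
A custom metric is just a function. Below is a minimal sketch of a `Scorer` built with `NewScorerFunc`, the same constructor the `agent/eval` scorers use; the byte-equality check itself is illustrative, not part of the framework:

```go
import (
	"bytes"
	"context"

	"github.com/urmzd/saige/eval"
)

// ExactMatchScorer scores 1.0 when Output equals GroundTruth byte-for-byte.
// Byte equality on json.RawMessage is deliberately strict: formatting
// differences (whitespace, key order) count as a miss.
func ExactMatchScorer() eval.Scorer {
	return eval.NewScorerFunc("exact_match", func(_ context.Context, obs eval.Observation) (eval.Score, error) {
		v := 0.0
		if bytes.Equal(obs.Output, obs.GroundTruth) {
			v = 1.0
		}
		return eval.Score{Name: "exact_match", Value: v}, nil
	})
}
```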

### Built-in Scorers

**Text Quality** (pure functions, no LLM):

| Scorer | Description |
|--------|-------------|
| `SequenceSimilarityScorer` | Character-level LCS ratio between output and ground truth |
| `TokenF1Scorer` | Word-token precision/recall/F1 |
| `RougeLScorer` | ROUGE-L F1 at the token level |
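
Scorers can also be invoked standalone on a single observation. The `Score(ctx, obs)` call below matches how the scorer tests exercise them:

```go
obs := eval.Observation{
	Output:      json.RawMessage(`"go is fast and simple"`),
	GroundTruth: json.RawMessage(`"go is fast"`),
}
score, err := eval.TokenF1Scorer().Score(context.Background(), obs)
if err != nil {
	// handle error
}
_ = score.Value // token-level F1 in [0, 1]
```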

**LLM-as-Judge**:

| Scorer | Description |
|--------|-------------|
| `NewJudgeScorer` | Pointwise scoring with customizable rubric |
| `NewPairwiseJudgeScorer` | A/B comparison between two outputs |

**Agent** (`agent/eval`); these scorers read well-known annotation keys from the observation, as the sketch after this table shows:

| Scorer | Description |
|--------|-------------|
| `TTFTScorer` | Time to first token (ms) |
| `TTLTScorer` | Time to last token (ms) |
| `MedianITLScorer` | Median inter-token latency (ms) |
| `ToolCallCountScorer` | Number of tool calls |
| `ToolSuccessRateScorer` | Fraction of successful tool calls |
| `TurnCountScorer` | Agent loop iterations |
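
Each agent scorer reads a well-known annotation key (`agent.stream_timing`, `agent.tool_calls`, `agent.turn_count`) that the `Subject` is expected to populate. A minimal sketch, with hard-coded records standing in for a real agent run:

```go
import (
	"context"
	"encoding/json"

	agenteval "github.com/urmzd/saige/agent/eval"
	"github.com/urmzd/saige/eval"
)

subject := eval.Subject(func(ctx context.Context, obs *eval.Observation) error {
	// ... invoke the agent, recording tool calls and loop turns ...
	calls := []agenteval.ToolCallRecord{
		{Name: "search", Result: "ok", DurationMs: 120},
		{Name: "fetch", Error: "timeout", DurationMs: 95},
	}
	callsJSON, err := json.Marshal(calls)
	if err != nil {
		return err
	}
	turnsJSON, _ := json.Marshal(3) // marshaling a literal int cannot fail

	if obs.Annotations == nil {
		obs.Annotations = map[string]json.RawMessage{}
	}
	obs.Annotations[agenteval.AnnotationToolCalls] = callsJSON
	obs.Annotations[agenteval.AnnotationTurnCount] = turnsJSON
	return nil
})
```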

**Knowledge Graph** (`knowledge/eval`):

| Scorer | Description |
|--------|-------------|
| `EntityRecallScorer` | Fraction of expected entities extracted |
| `EntityPrecisionScorer` | Fraction of extracted entities matching expected |
| `RelationRecallScorer` | Relation extraction recall |
| `RelationPrecisionScorer` | Relation extraction precision |
| `FactSearchRecallScorer` | Fraction of relevant facts found by search |

**RAG** (`rag/eval`):

The existing 9 RAG metrics are also available as composable `Scorer` adapters: `ContextPrecisionScorer`, `ContextRecallScorer`, `NDCGScorer`, `MRRScorer`, `HitRateScorer`, `FaithfulnessScorer`, `AnswerRelevancyScorer`, `AnswerCorrectnessScorer`.
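
Constructor arities below follow the snippets elsewhere in this README (`NDCGScorer(10)`, `MRRScorer()`, `ContextPrecisionScorer()`, `FaithfulnessScorer()`); a minimal sketch mixing RAG adapters into a run:

```go
import rageval "github.com/urmzd/saige/rag/eval"

result, err := eval.Run(ctx, "rag-eval", observations, []eval.Scorer{
	rageval.ContextPrecisionScorer(),
	rageval.FaithfulnessScorer(),
	rageval.NDCGScorer(10), // NDCG@10
})
```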

### Evaluate a Single System

```go
import (
	"context"
	"encoding/json"

	"github.com/urmzd/saige/eval"
)

ctx := context.Background()

observations := []eval.Observation{
	{ID: "q1", Input: json.RawMessage(`"What is Go?"`), GroundTruth: json.RawMessage(`"A programming language."`)},
}

// Define a subject that calls the system under test.
subject := eval.Subject(func(ctx context.Context, obs *eval.Observation) error {
	// Call your system, populate obs.Output, obs.Annotations, obs.Timing
	obs.Output = json.RawMessage(`"Go is a statically typed language."`)
	return nil
})

eval.Populate(ctx, observations, subject)

result, _ := eval.Run(ctx, "my-eval", observations, []eval.Scorer{
	eval.TokenF1Scorer(),
	eval.RougeLScorer(),
	eval.NewJudgeScorer(llm, eval.WithJudgeRubric("Score for accuracy.")),
})
```

### A/B Experiment

Compare two approaches on the same inputs:

```go
result, _ := eval.RunExperiment(ctx, inputs, baseSubject, expSubject,
	[]eval.Scorer{rageval.NDCGScorer(10), rageval.MRRScorer()},
	eval.WithOutputDir("experiments/bm25-vs-hyde"),
	eval.WithExperimentName("bm25-vs-hyde"),
)
// result.Deltas["ndcg"] shows the improvement
```

### Stream Timing (Agent)

Instrument a delta channel to collect TTFT, TTLT, and median ITL:

```go
import agenteval "github.com/urmzd/saige/agent/eval"

stream := myAgent.Invoke(ctx, messages)
timing, text, deltas := agenteval.CollectStreamTiming(stream.Deltas())
// timing.TTFTMs, timing.TTLTMs, timing.MedianITL
```
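
To make the collected timing visible to `TTFTScorer`, `TTLTScorer`, and `MedianITLScorer`, marshal it into the observation under the `agent.stream_timing` key inside your `Subject`. A minimal sketch:

```go
timingJSON, err := json.Marshal(timing)
if err != nil {
	return err
}
if obs.Annotations == nil {
	obs.Annotations = map[string]json.RawMessage{}
}
obs.Annotations[agenteval.AnnotationStreamTiming] = timingJSON
```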

### On-Disk Format

Experiment results persist as structured JSON for reproducibility:

```
experiments/bm25-vs-hyde/
  result.json
  inputs/000.json
  outputs/base/000.json
  outputs/exp/000.json
```
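
Everything on disk is plain JSON, so results can be inspected with the standard library alone. The exact schema of `result.json` is not documented here, so this sketch decodes it generically; the `deltas` key is an assumption mirrored from `result.Deltas` above:

```go
data, err := os.ReadFile("experiments/bm25-vs-hyde/result.json")
if err != nil {
	log.Fatal(err)
}
var result map[string]any
if err := json.Unmarshal(data, &result); err != nil {
	log.Fatal(err)
}
fmt.Println(result["deltas"]) // assumed key, mirroring result.Deltas in the Go API
```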

---

## Examples

| Example | Path | Description |

agent/eval/scorers.go

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
```go
package eval

import (
	"context"
	"encoding/json"

	topeval "github.com/urmzd/saige/eval"
)

// Annotation keys used by agent subjects.
const (
	AnnotationStreamTiming = "agent.stream_timing" // StreamTiming
	AnnotationToolCalls    = "agent.tool_calls"    // []ToolCallRecord
	AnnotationTurnCount    = "agent.turn_count"    // int
)

// ToolCallRecord captures a tool invocation for evaluation.
type ToolCallRecord struct {
	Name       string         `json:"name"`
	Arguments  map[string]any `json:"arguments"`
	Result     string         `json:"result"`
	Error      string         `json:"error,omitempty"`
	DurationMs int64          `json:"duration_ms"`
}

// TTFTScorer reports time-to-first-token in milliseconds.
func TTFTScorer() topeval.Scorer {
	return topeval.NewScorerFunc("ttft_ms", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		st, err := extractStreamTiming(obs)
		if err != nil || st == nil {
			return topeval.Score{}, err
		}
		return topeval.Score{Name: "ttft_ms", Value: float64(st.TTFTMs)}, nil
	})
}

// TTLTScorer reports time-to-last-token in milliseconds.
func TTLTScorer() topeval.Scorer {
	return topeval.NewScorerFunc("ttlt_ms", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		st, err := extractStreamTiming(obs)
		if err != nil || st == nil {
			return topeval.Score{}, err
		}
		return topeval.Score{Name: "ttlt_ms", Value: float64(st.TTLTMs)}, nil
	})
}

// MedianITLScorer reports median inter-token latency in milliseconds.
func MedianITLScorer() topeval.Scorer {
	return topeval.NewScorerFunc("median_itl_ms", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		st, err := extractStreamTiming(obs)
		if err != nil || st == nil {
			return topeval.Score{}, err
		}
		return topeval.Score{Name: "median_itl_ms", Value: st.MedianITL}, nil
	})
}

// ToolCallCountScorer reports the number of tool calls made.
func ToolCallCountScorer() topeval.Scorer {
	return topeval.NewScorerFunc("tool_call_count", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		calls, err := extractToolCalls(obs)
		if err != nil || calls == nil {
			return topeval.Score{}, err
		}
		return topeval.Score{Name: "tool_call_count", Value: float64(len(calls))}, nil
	})
}

// ToolSuccessRateScorer reports the fraction of tool calls without errors.
func ToolSuccessRateScorer() topeval.Scorer {
	return topeval.NewScorerFunc("tool_success_rate", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		calls, err := extractToolCalls(obs)
		if err != nil || calls == nil {
			return topeval.Score{}, err
		}
		if len(calls) == 0 {
			return topeval.Score{Name: "tool_success_rate", Value: 1.0}, nil
		}
		var success int
		for _, c := range calls {
			if c.Error == "" {
				success++
			}
		}
		return topeval.Score{Name: "tool_success_rate", Value: float64(success) / float64(len(calls))}, nil
	})
}

// TurnCountScorer reports the number of agent loop iterations.
func TurnCountScorer() topeval.Scorer {
	return topeval.NewScorerFunc("turn_count", func(_ context.Context, obs topeval.Observation) (topeval.Score, error) {
		raw, ok := obs.Annotations[AnnotationTurnCount]
		if !ok {
			return topeval.Score{}, nil
		}
		var count int
		if err := json.Unmarshal(raw, &count); err != nil {
			return topeval.Score{}, err
		}
		return topeval.Score{Name: "turn_count", Value: float64(count)}, nil
	})
}

// extractStreamTiming decodes the StreamTiming annotation; it returns
// (nil, nil) when the annotation is absent so scorers can skip silently.
func extractStreamTiming(obs topeval.Observation) (*StreamTiming, error) {
	raw, ok := obs.Annotations[AnnotationStreamTiming]
	if !ok {
		return nil, nil
	}
	var st StreamTiming
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	return &st, nil
}

// extractToolCalls decodes the ToolCallRecord slice annotation; it returns
// (nil, nil) when the annotation is absent.
func extractToolCalls(obs topeval.Observation) ([]ToolCallRecord, error) {
	raw, ok := obs.Annotations[AnnotationToolCalls]
	if !ok {
		return nil, nil
	}
	var calls []ToolCallRecord
	if err := json.Unmarshal(raw, &calls); err != nil {
		return nil, err
	}
	return calls, nil
}
```

agent/eval/scorers_test.go

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
```go
package eval

import (
	"context"
	"encoding/json"
	"math"
	"testing"

	topeval "github.com/urmzd/saige/eval"
)

func assertClose(t *testing.T, name string, got, want, eps float64) {
	t.Helper()
	if math.Abs(got-want) > eps {
		t.Errorf("%s: got %f, want %f (±%f)", name, got, want, eps)
	}
}

func TestTTFTScorer(t *testing.T) {
	st := StreamTiming{TTFTMs: 42}
	stJSON, _ := json.Marshal(st)

	obs := topeval.Observation{
		ID:          "t1",
		Annotations: map[string]json.RawMessage{AnnotationStreamTiming: stJSON},
	}

	score, err := TTFTScorer().Score(context.Background(), obs)
	if err != nil {
		t.Fatal(err)
	}
	assertClose(t, "ttft", score.Value, 42.0, 0.001)
}

func TestTTFTScorerMissingAnnotation(t *testing.T) {
	obs := topeval.Observation{ID: "t2"}
	score, err := TTFTScorer().Score(context.Background(), obs)
	if err != nil {
		t.Fatal(err)
	}
	if score.Name != "" {
		t.Errorf("expected empty score for missing annotation, got %q", score.Name)
	}
}

func TestToolSuccessRateScorer(t *testing.T) {
	calls := []ToolCallRecord{
		{Name: "search", Result: "ok"},
		{Name: "fetch", Error: "timeout"},
		{Name: "parse", Result: "done"},
	}
	callsJSON, _ := json.Marshal(calls)

	obs := topeval.Observation{
		ID:          "t3",
		Annotations: map[string]json.RawMessage{AnnotationToolCalls: callsJSON},
	}

	score, err := ToolSuccessRateScorer().Score(context.Background(), obs)
	if err != nil {
		t.Fatal(err)
	}
	// 2 out of 3 succeeded.
	assertClose(t, "success_rate", score.Value, 2.0/3.0, 0.001)
}

func TestToolCallCountScorer(t *testing.T) {
	calls := []ToolCallRecord{{Name: "a"}, {Name: "b"}}
	callsJSON, _ := json.Marshal(calls)

	obs := topeval.Observation{
		ID:          "t4",
		Annotations: map[string]json.RawMessage{AnnotationToolCalls: callsJSON},
	}

	score, err := ToolCallCountScorer().Score(context.Background(), obs)
	if err != nil {
		t.Fatal(err)
	}
	assertClose(t, "count", score.Value, 2.0, 0.001)
}

func TestTurnCountScorer(t *testing.T) {
	countJSON, _ := json.Marshal(5)
	obs := topeval.Observation{
		ID:          "t5",
		Annotations: map[string]json.RawMessage{AnnotationTurnCount: countJSON},
	}

	score, err := TurnCountScorer().Score(context.Background(), obs)
	if err != nil {
		t.Fatal(err)
	}
	assertClose(t, "turns", score.Value, 5.0, 0.001)
}
```
