Now that the initial prompt unit tests have been merged (#13),
there is an opportunity to improve the evaluation framework by:
• Introducing acceptance metrics (e.g. ≥70% correct extraction across N receipts; first sketch below)
• Classifying failure types (OCR error vs. hallucination vs. missing field; also covered in the first sketch)
• Decoupling test execution more cleanly from app packaging
• Making prompt evaluation reproducible & comparable across model versions (second sketch below)
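
A rough sketch of what the first two points could look like in the test suite. Everything here is illustrative: `FailureType`, `EvalResult`, `acceptance_rate`, and the 70% threshold are placeholder names, not existing code:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FailureType(Enum):
    """Coarse taxonomy for a wrong or absent extraction."""
    OCR_ERROR = auto()      # upstream OCR misread the source text
    HALLUCINATION = auto()  # model invented a value not on the receipt
    MISSING_FIELD = auto()  # model omitted a required field


@dataclass
class EvalResult:
    receipt_id: str
    correct: bool
    failure: Optional[FailureType] = None


def acceptance_rate(results: list[EvalResult]) -> float:
    """Fraction of receipts whose extraction was fully correct."""
    return sum(r.correct for r in results) / len(results)


# Toy run: 3 of 4 receipts correct -> 75%, which clears a 70% bar.
results = [
    EvalResult("r1", correct=True),
    EvalResult("r2", correct=True),
    EvalResult("r3", correct=False, failure=FailureType.HALLUCINATION),
    EvalResult("r4", correct=True),
]
assert acceptance_rate(results) >= 0.70
```

Keeping the failure taxonomy as an enum means each failing receipt carries a queryable diagnosis, so a regression can be traced to OCR quality vs. prompt behaviour instead of just showing up as a dropped aggregate score.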
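For the last point, a minimal sketch of persisting each eval run so results stay comparable across model versions. JSON on disk is an assumption here, and `record_eval_run` with its field names is hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_eval_run(model: str, prompt: str, accuracy: float,
                    out_dir: Path = Path("eval_runs")) -> Path:
    """Persist one eval run so results are comparable across model versions."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run = {
        "model": model,
        # Hash the prompt so a silent prompt edit shows up as a new run key.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "accuracy": accuracy,
        "timestamp": stamp,
    }
    path = out_dir / f"{stamp}_{model}.json"
    path.write_text(json.dumps(run, indent=2))
    return path


# Usage: record_eval_run("model-vX", "Extract the total from this receipt...", 0.75)
```

Hashing the prompt alongside the model identifier means two runs are only comparable when both the prompt and the model match, which is exactly the guarantee needed for tracking regressions across model upgrades.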
This would move prompt testing from “smoke tests” to a measurable evaluation framework.
Happy to take this on if it aligns with the roadmap.