Quality Metrics
FAI Evaluation Dashboard
Automated quality scoring for every solution play. These metrics run in CI and must pass before any play ships.
- Groundedness (≥ 0.95): percentage of claims backed by source documents, measured via citation verification.
- Coherence (≥ 0.90): logical flow and consistency of multi-turn responses.
- Relevance (≥ 0.90): how well the response addresses the user's actual question.
- Fluency (≥ 0.95): grammatical correctness and natural language quality.
- Safety (0 violations): content safety score; harmful, hateful, sexual, and violent content must be blocked.
- Cost / Query (< $0.01): average token cost per query, including retrieval and generation.
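
These thresholds translate directly into a CI gate. Below is a minimal sketch, assuming scores arrive as a flat dict keyed by metric name; the `THRESHOLDS` table, the key names, and the `gate` function are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of the CI gate implied by the thresholds above.
# THRESHOLDS, the metric key names, and the score-dict shape are all
# assumptions for illustration, not the project's actual eval code.

# Each entry: (limit, direction). "min" metrics must meet or exceed the
# limit; "max" metrics must stay at or below it. The strict "< $0.01"
# cost bound is approximated here with <=.
THRESHOLDS = {
    "groundedness": (0.95, "min"),
    "coherence": (0.90, "min"),
    "relevance": (0.90, "min"),
    "fluency": (0.95, "min"),
    "safety_violations": (0, "max"),      # zero tolerance
    "cost_per_query_usd": (0.01, "max"),
}

def gate(scores: dict[str, float]) -> list[str]:
    """Return failure messages; an empty list means the play may ship."""
    failures = []
    for metric, (limit, direction) in THRESHOLDS.items():
        value = scores[metric]
        ok = value >= limit if direction == "min" else value <= limit
        if not ok:
            failures.append(f"{metric}: got {value}, need {direction} {limit}")
    return failures
```

A non-empty return is treated as a hard failure, which is what blocks deployment in the Gate step of the pipeline below.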
Evaluation Pipeline
- Test Set: 50+ question/answer pairs per play, covering edge cases
- Run: `python evaluation/eval.py` scores each metric (a sketch of one possible shape follows this list)
- Gate: CI blocks deployment if any metric falls below its threshold
- Report: results saved to `evaluation/results.json`
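
As a companion sketch, here is one possible shape for the Run and Report steps. The test-set path, the JSONL layout, and `score_response` are hypothetical; the source only specifies the entry point `python evaluation/eval.py` and the output file `evaluation/results.json`.

```python
# Illustrative skeleton for evaluation/eval.py. score_response, the
# test-set path, and the per-case result keys are assumptions; only the
# script location and evaluation/results.json come from this document.
import json
from pathlib import Path
from statistics import mean

def score_response(question: str, reference: str) -> dict[str, float]:
    """Hypothetical per-case scorer: call the deployed play with the
    question, then score the answer against the reference on each
    metric (groundedness, coherence, relevance, fluency, safety, cost)."""
    raise NotImplementedError

def main() -> None:
    # One JSON object per line: {"question": ..., "answer": ...}
    lines = Path("evaluation/test_set.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines]
    per_case = [score_response(c["question"], c["answer"]) for c in cases]

    # Average quality and cost metrics across the 50+ cases; sum safety
    # violations, since the gate requires exactly zero of them.
    results = {
        key: (sum(r[key] for r in per_case) if key == "safety_violations"
              else mean(r[key] for r in per_case))
        for key in per_case[0]
    }
    Path("evaluation/results.json").write_text(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```

The gate sketch above can then read `evaluation/results.json` and fail the CI job whenever any metric misses its threshold.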