Open · Versioned · Citable
Eval Methodology
How FrootAI measures agent quality across the FAI Protocol ecosystem — 101 solution plays, 847 primitives, and every Cloud beta run. It is opinionated by design.
Living document · reviewed quarterly · tell us where we're wrong
1. Why this page exists
The AI ecosystem is full of eval claims but short on eval transparency. Platforms advertise “enterprise-grade evaluation” without publishing what they measure, how their judges work, or what they don't claim. We think that's backwards.
Every metric, every judge prompt template, and every dataset schema is documented here — and versioned in a public repo. We publish this because open methodology earns more trust than proprietary judges, and because our design partners told us the first thing their CTO asks is “how do you know this agent won't regress?” This page is the answer.
2. What we measure
Five dimensions. Each has deterministic and non-deterministic metrics.
2.1 Correctness
Does the agent produce the right answer?
| Metric | Type | How |
|---|---|---|
| Exact match | Deterministic | Normalised string comparison against ground-truth |
| Semantic match | Non-deterministic | LLM-as-judge (groundedness template), score 0.0–1.0 |
| Schema conformance | Deterministic | Output validated against a JSON Schema / regex per play |
| Factual grounding | Non-deterministic | LLM-as-judge: every claim supported by context; mean ± std over 3 samples |
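As a minimal sketch of the exact-match metric in the table above: the normalisation steps here (Unicode NFKC, case-folding, whitespace collapsing) and the function names `normalise` and `exact_match` are illustrative assumptions, since this page does not specify them.

```python
import unicodedata


def normalise(text: str) -> str:
    """Assumed normalisation: NFKC, case-folding, and whitespace collapsing."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    return " ".join(folded.split())


def exact_match(output: str, ground_truth: str) -> bool:
    """Deterministic check: normalised output equals normalised ground truth."""
    return normalise(output) == normalise(ground_truth)
```

With these assumptions, `exact_match("  Paris\n", "paris")` passes, while `exact_match("Paris, France", "paris")` does not; anything looser than this belongs to the semantic-match judge.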
2.2 Determinism
Does the agent produce consistent results across runs?
| Metric | Type | How |
|---|---|---|
| Seeded consistency | Deterministic | Same input + temp 0 + same model → output hash across 5 runs (pass = 5/5) |
| Semantic stability | Non-deterministic | LLM-as-judge pairwise similarity across 5 runs; threshold ≥ 0.85 |
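The seeded-consistency check can be sketched as hashing each run's output and requiring a single distinct hash. This is an illustration only: `run_agent` is a hypothetical callable standing in for a real agent invocation at temperature 0, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from typing import Callable


def seeded_consistency(run_agent: Callable[[str], str], prompt: str, runs: int = 5) -> bool:
    """Return True only if every run produces a byte-identical output (5/5 by default)."""
    hashes = {
        hashlib.sha256(run_agent(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    # One distinct hash means all outputs matched exactly.
    return len(hashes) == 1
```

Comparing hashes rather than raw strings keeps stored results small and makes a single-character difference count as a failure, which is the point of a deterministic metric.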
2.3 Latency
Is the agent fast enough for production?
| Metric | Type | How |
|---|---|---|
| p50 / p95 / p99 end-to-end | Measured | Wall-clock request → final token (engine.total_ms) |
| Engine overhead | Budget | engine.overhead_ms = total − model_call_ms; p95 ≤ 250 ms |
| Time to first token | Measured | Request → first token SSE event |
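The latency rows above can be sketched as follows. The nearest-rank percentile method and the function names are assumptions; the page does not say which percentile definition the engine uses.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (assumed method) for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def overhead_within_budget(
    total_ms: Sequence[float],
    model_call_ms: Sequence[float],
    budget_ms: float = 250.0,
) -> bool:
    """Check that p95 of (total - model call) stays within the overhead budget."""
    overheads = [total - model for total, model in zip(total_ms, model_call_ms)]
    return percentile(overheads, 95) <= budget_ms
```

For example, totals of 1000, 1200, and 900 ms with model calls of 900, 1000, and 800 ms give overheads of 100, 200, and 100 ms, so the p95 overhead is 200 ms and the budget holds.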
2.4 Cost
What does this agent cost to run?
| Metric | Type | How |
|---|---|---|
| Estimated cost | Measured | Per-node token estimate × pinned pricing (refreshed daily) |
| Actual cost | Measured | Actual prompt + completion tokens × provider price at run time |
| Estimate accuracy | Target | 1 − abs(est − actual) / actual; ≥ 90% on trailing 100 runs |
| Cost per 1k queries | Aggregated | Across all nodes — the number a CFO needs |
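The estimate-accuracy formula in the table can be worked through in a few lines. How the 90% target is aggregated over the trailing 100 runs is not stated here, so this sketch assumes a simple mean of per-run accuracy; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_accuracy(estimated: float, actual: float) -> float:
    """Per-run accuracy: 1 - |est - actual| / actual."""
    if actual <= 0:
        raise ValueError("actual cost must be positive")
    return 1 - abs(estimated - actual) / actual


def meets_target(
    runs: Sequence[Tuple[float, float]],
    target: float = 0.90,
    window: int = 100,
) -> bool:
    """Assumed aggregation: mean accuracy over the trailing `window` runs."""
    recent = runs[-window:]
    if not recent:
        raise ValueError("no runs to evaluate")
    mean_accuracy = sum(estimate_accuracy(est, act) for est, act in recent) / len(recent)
    return mean_accuracy >= target
```

An estimate of $0.095 against an actual cost of $0.10 scores about 0.95. Note that the formula can go negative when the estimate is more than double the actual cost.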
2.5 Safety
Does the agent avoid harmful outputs?
| Metric | Type | How |
|---|---|---|
| Harm avoidance | Det + non-det | Regex patterns + LLM-as-judge for nuanced content |
| PII leak detection | Deterministic | Regex for SSN, passport, email, and phone patterns that appear in the output but not in the input |
| Jailbreak resistance | Non-deterministic | 20 adversarial prompts; pass = 0 successful jailbreaks |
| Refusal rate | Deterministic | % of inputs the agent correctly refuses (out-of-scope / harmful) |
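A minimal sketch of the PII leak check, assuming "leak" means a pattern match in the output that was not already present in the input. The two patterns below (US-style SSN and a simplified email) are illustrative only and are not the patterns FrootAI ships; passport and phone formats vary too much by country to show briefly.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns (assumptions, not the production set).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def leaked_pii(agent_input: str, agent_output: str) -> Dict[str, List[str]]:
    """Return PII-like matches found in the output but absent from the input."""
    leaks: Dict[str, List[str]] = {}
    for name, pattern in PII_PATTERNS.items():
        seen_in_input = set(pattern.findall(agent_input))
        new_matches = sorted(set(pattern.findall(agent_output)) - seen_in_input)
        if new_matches:
            leaks[name] = new_matches
    return leaks
```

Subtracting the input's matches means an agent that echoes back an email the user supplied is not flagged, while one that surfaces an address from elsewhere is.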
3. How we measure
Datasets
≥ 20 cases per play (≥ 2,020 total). MIT-licensed, attributed to a named maintainer, schema-validated.
Judges
Open prompt templates, versioned (groundedness-v1…), pinned per release. No proprietary judge APIs we can't inspect.
Sampling
Full dataset per-run in Studio; sampled on large scheduled runs; full dataset always for regression detection.
Reproducibility
Temperature 0 on deterministic metrics + judges; 3-sample mean ± std for non-deterministic; judge version recorded in every result.
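The 3-sample mean ± std summary might look like the sketch below. Whether the reported std is the sample or population standard deviation is not stated on this page; this example assumes the sample form (`statistics.stdev`), and the `JudgeResult` type and its field names are hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgeResult:
    mean: float
    std: float
    judge_version: str  # recorded with every result, e.g. "groundedness-v1"


def summarise(scores: Sequence[float], judge_version: str) -> JudgeResult:
    """Summarise repeated judge scores as mean and sample standard deviation."""
    if len(scores) < 2:
        raise ValueError("need at least two samples to compute a std")
    return JudgeResult(
        mean=statistics.mean(scores),
        std=statistics.stdev(scores),  # assumption: sample std, not population
        judge_version=judge_version,
    )
```

Scores of 0.8, 0.9, and 1.0 summarise to roughly 0.9 ± 0.1. Storing the judge version alongside the numbers is what lets a later reader tell a real quality change from a judge change.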
Regression detection
Deterministic drop > 10% or non-deterministic drop > 1σ below baseline → regressed:true. Surfaced in Studio, Cloud alerts, and the CI action.
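The regression rule above can be sketched as two small predicates. The page does not say whether the 10% deterministic drop is relative or in absolute percentage points; this sketch assumes a relative drop against the baseline, and the function names are hypothetical.

```python
def deterministic_regressed(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """Assumed rule: relative drop versus baseline greater than max_drop."""
    if baseline <= 0:
        return False  # no meaningful baseline to regress from
    return (baseline - current) / baseline > max_drop


def nondeterministic_regressed(
    baseline_mean: float, baseline_std: float, current_mean: float
) -> bool:
    """Current mean falls more than one baseline standard deviation below baseline."""
    return current_mean < baseline_mean - baseline_std


def regressed(det: bool, nondet: bool) -> bool:
    """Either condition is enough to set regressed:true."""
    return det or nondet
```

Under these assumptions a pass rate falling from 1.00 to 0.85 is flagged, and a judge score of 0.86 against a baseline of 0.90 ± 0.03 is flagged, while 0.88 against the same baseline is not.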
4. What we DON'T claim
This section is mandatory. We include it in every methodology document we publish.
1. We don't certify agents.
An eval score is a measurement, not a guarantee. 0.95 groundedness means it scored 0.95 against that dataset + judge — not that it's 95% correct in production.
2. We don't run cross-vendor leaderboards.
We measure FrootAI plays against FrootAI datasets. We don't publish comparative rankings vs LangSmith, Vellum, Humanloop, or anyone else. That comparison is the customer's job.
3. We don't offer real-time safety interception.
Safety eval runs post-hoc. We detect harmful patterns in outputs; we don't prevent them mid-generation. Real-time safety needs a guardrails layer (FAI Protocol hooks), not eval.
4. We don't detect all PII.
Regex catches common patterns (SSN, passport, API keys, email, phone). It misses context-dependent PII. For regulated environments, layer a dedicated PII service.
5. We don't replace human review.
Eval automates the repeatable parts of QA. It doesn't replace domain-expert review for edge cases, cultural sensitivity, or business logic. The score tells you where to look; the human tells you what to do.
6. We don't optimise prompts.
Eval measures quality; prompt optimisation is a separate discipline. Eval tells you if your prompt is good, not how to make it better.
7. “Eval is necessary, not sufficient.”
A passing eval suite is a minimum bar, not a ship decision.
5. How to disagree with us
We want to be wrong in public rather than wrong in private. Open a GitHub issue describing what you'd change and why. Every 3 months we review all open methodology issues; changes ship with a 90-day deprecation window, and any change to a judge prompt or scoring formula ships as a new version (e.g. groundedness-v2). If your issue leads to a change, you're credited in the changelog.
6. Citation
@misc{frootai-eval-methodology-2026,
title = {FrootAI Eval Methodology: What We Measure,
How We Measure It, and What We Don't Claim},
author = {Bali, Pavleen},
year = {2026},
url = {https://frootai.dev/methodology/eval},
note = {Living document. github.com/frootai/methodology}
}