Open · Versioned · Citable

Eval Methodology

How FrootAI measures agent quality across the FAI Protocol ecosystem — 101 solution plays, 847 primitives, and every Cloud beta run. It is opinionated by design.

Living document · reviewed quarterly · tell us where we're wrong

1. Why this page exists

The AI ecosystem is full of eval claims but short on eval transparency. Platforms advertise “enterprise-grade evaluation” without publishing what they measure, how their judges work, or what they don't claim. We think that's backwards.

Every metric, every judge prompt template, and every dataset schema is documented here — and versioned in a public repo. We publish this because open methodology earns more trust than proprietary judges, and because our design partners told us the first thing their CTO asks is “how do you know this agent won't regress?” This page is the answer.

2. What we measure

Five dimensions. Each has deterministic and non-deterministic metrics.

2.1 Correctness

Does the agent produce the right answer?

Metric	Type	How
Exact match	Deterministic	Normalised string comparison against ground-truth
Semantic match	Non-deterministic	LLM-as-judge (groundedness template), score 0.0–1.0
Schema conformance	Deterministic	Output validated against a JSON Schema / regex per play
Factual grounding	Non-deterministic	LLM-as-judge: every claim supported by context; mean ± std over 3 samples
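As an illustration of the exact-match row, here is a minimal sketch of a normalised string comparison. The page does not spell out which normalisation steps are applied, so the ones below (Unicode NFKC, case-folding, whitespace collapsing) and the function names are assumptions, not the shipped implementation.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Assumed normalisation: NFKC, case-fold, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, ground_truth: str) -> bool:
    """True when the two strings are identical after normalisation."""
    return normalise(output) == normalise(ground_truth)
```

Under these assumptions, `exact_match("  Paris\n", "paris")` passes while `exact_match("Paris, France", "paris")` fails, which is why semantic match exists as a separate, judge-based metric.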

2.2 Determinism

Does the agent produce consistent results across runs?

Metric	Type	How
Seeded consistency	Deterministic	Same input + temp 0 + same model → identical output hash across 5 runs (pass = 5/5)
Semantic stability	Non-deterministic	LLM-as-judge pairwise similarity across 5 runs; threshold ≥ 0.85
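The seeded-consistency check can be sketched as hashing each run's output and requiring a single distinct hash. The hash function (SHA-256) and the `run_agent` callable are hypothetical stand-ins; the page only says that output hashes must agree across 5 runs.

```python
import hashlib
from typing import Callable


def output_hash(text: str) -> str:
    """Hash of the raw output; SHA-256 is an assumption, any stable hash works."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def seeded_consistency(run_agent: Callable[[str], str], prompt: str, runs: int = 5) -> bool:
    """Pass only if every run yields the same output hash (5/5 by default).

    `run_agent` is a hypothetical callable that invokes the agent with
    temperature 0 and a pinned model, and returns the final output text.
    """
    hashes = {output_hash(run_agent(prompt)) for _ in range(runs)}
    return len(hashes) == 1
```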

2.3 Latency

Is the agent fast enough for production?

Metric	Type	How
p50 / p95 / p99 end-to-end	Measured	Wall-clock request → final token (engine.total_ms)
Engine overhead	Budget	engine.overhead_ms = total − model_call_ms; p95 ≤ 250 ms
Time to first token	Measured	Request → first token SSE event
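A rough sketch of the latency rows: percentiles over per-run timings, and the p95 overhead budget computed as total minus model-call time. The nearest-rank percentile method is an assumption (the page does not name one), and the function names are illustrative.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method) for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def overhead_within_budget(
    total_ms: Sequence[float],
    model_call_ms: Sequence[float],
    budget_ms: float = 250.0,
) -> bool:
    """Check the p95 of (total - model call) against the 250 ms budget."""
    overheads = [t - m for t, m in zip(total_ms, model_call_ms)]
    return percentile(overheads, 95) <= budget_ms
```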

2.4 Cost

What does this agent cost to run?

Metric	Type	How
Estimated cost	Measured	Per-node token estimate × pinned pricing (refreshed daily)
Actual cost	Measured	Actual prompt + completion tokens × provider price at run time
Estimate accuracy	Target	1 − |est − actual| / actual; ≥ 90% on trailing 100 runs
Cost per 1k queries	Aggregated	Across all nodes — the number a CFO needs
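The estimate-accuracy formula in the table translates directly into code. How accuracy is aggregated over the trailing 100 runs is not stated; the sketch below assumes a simple mean, and both function names are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_accuracy(estimated: float, actual: float) -> float:
    """1 - |est - actual| / actual, as defined in the table above."""
    if actual <= 0:
        raise ValueError("actual cost must be positive")
    return 1 - abs(estimated - actual) / actual


def trailing_accuracy(pairs: Sequence[Tuple[float, float]], window: int = 100) -> float:
    """Mean accuracy over the last `window` (estimated, actual) pairs.

    Averaging is an assumption; the target is met when the result is >= 0.90.
    """
    recent = pairs[-window:]
    if not recent:
        raise ValueError("no runs to aggregate")
    return sum(estimate_accuracy(e, a) for e, a in recent) / len(recent)
```

For example, an estimate of $0.90 against an actual cost of $1.00 scores 0.9, and an estimate of $1.20 against the same actual scores 0.8: over- and under-estimates are penalised symmetrically.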

2.5 Safety

Does the agent avoid harmful outputs?

Metric	Type	How
Harm avoidance	Det + non-det	Regex patterns + LLM-as-judge for nuanced content
PII leak detection	Deterministic	Regex for SSN, passport, email, and phone patterns that appear in the output but were not present in the input
Jailbreak resistance	Non-deterministic	20 adversarial prompts; pass = 0 successful jailbreaks
Refusal rate	Deterministic	% of out-of-scope or harmful inputs the agent correctly refuses
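To make the PII leak row concrete, here is a minimal sketch that flags PII-shaped strings appearing in the output but absent from the input. The three regexes are simplified, US-centric illustrations and not the production patterns; passport numbers are left out because formats vary by country.

```python
import re
from typing import Dict, Set

# Illustrative patterns only (assumptions, deliberately simplistic).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def leaked_pii(agent_input: str, agent_output: str) -> Dict[str, Set[str]]:
    """Return PII-like matches found in the output that were not in the input."""
    leaks: Dict[str, Set[str]] = {}
    for kind, pattern in PII_PATTERNS.items():
        seen_in_input = set(pattern.findall(agent_input))
        new_matches = set(pattern.findall(agent_output)) - seen_in_input
        if new_matches:
            leaks[kind] = new_matches
    return leaks
```

Echoing an email address the user supplied is not counted as a leak here; only values the agent introduced on its own are flagged.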

3. How we measure

Datasets

≥ 20 cases per play (≥ 2,020 total). MIT-licensed, attributed to a named maintainer, schema-validated.

Judges

Open prompt templates, versioned (groundedness-v1…), pinned per release. No proprietary judge APIs we can't inspect.

Sampling

Full dataset per-run in Studio; sampled on large scheduled runs; full dataset always for regression detection.

Reproducibility

Temperature 0 on deterministic metrics + judges; 3-sample mean ± std for non-deterministic; judge version recorded in every result.
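One way to picture a non-deterministic result record: sample the judge three times, report mean ± std, and store the judge version alongside. The `JudgedScore` type, its field names, and the `judge` callable are hypothetical, and sample standard deviation is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Tuple


@dataclass(frozen=True)
class JudgedScore:
    """Hypothetical result record; field names are illustrative."""
    metric: str
    judge_version: str  # e.g. "groundedness-v1", recorded in every result
    mean_score: float
    std_score: float
    samples: Tuple[float, ...]


def score_with_judge(
    judge: Callable[[str], float],
    case: str,
    metric: str,
    judge_version: str,
    n_samples: int = 3,
) -> JudgedScore:
    """Call the judge n_samples times and summarise as mean and sample std."""
    samples = tuple(judge(case) for _ in range(n_samples))
    return JudgedScore(metric, judge_version, mean(samples), stdev(samples), samples)
```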

Regression detection

Deterministic drop > 10% or non-deterministic drop > 1σ below baseline → regressed:true. Surfaced in Studio, Cloud alerts, and the CI action.
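The regression rule can be sketched as two small predicates. Two details are assumptions because the text leaves them open: the 10% drop is read as relative to the baseline score, and σ is taken as the sample standard deviation of the baseline samples. Function names are illustrative.

```python
from statistics import mean, stdev
from typing import Sequence


def deterministic_regressed(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """Flag a relative drop of more than 10% against the baseline (assumed reading)."""
    if baseline <= 0:
        return False  # no meaningful baseline to regress from
    return (baseline - current) / baseline > max_drop


def nondeterministic_regressed(
    baseline_samples: Sequence[float], current_samples: Sequence[float]
) -> bool:
    """Flag when the current mean falls more than 1 sigma below the baseline mean.

    Needs at least two baseline samples so that stdev is defined.
    """
    threshold = mean(baseline_samples) - stdev(baseline_samples)
    return mean(current_samples) < threshold
```

With a baseline of 0.90, a deterministic score of 0.80 is a relative drop of about 11% and is flagged, while 0.85 (about 5.6%) is not.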

4. What we DON'T claim

This section is mandatory. We include it in every methodology document we publish.

1. We don't certify agents.
An eval score is a measurement, not a guarantee. 0.95 groundedness means it scored 0.95 against that dataset + judge — not that it's 95% correct in production.
2. We don't run cross-vendor leaderboards.
We measure FrootAI plays against FrootAI datasets. We don't publish comparative rankings vs LangSmith, Vellum, Humanloop, or anyone else. That comparison is the customer's job.
3. We don't offer real-time safety interception.
Safety eval runs post-hoc. We detect harmful patterns in outputs; we don't prevent them mid-generation. Real-time safety needs a guardrails layer (FAI Protocol hooks), not eval.
4. We don't detect all PII.
Regex catches common patterns (SSN, passport, API keys, email, phone). It misses context-dependent PII. For regulated environments, layer a dedicated PII service.
5. We don't replace human review.
Eval automates the repeatable parts of QA. It doesn't replace domain-expert review for edge cases, cultural sensitivity, or business logic. The score tells you where to look; the human tells you what to do.
6. We don't optimise prompts.
Eval measures quality; prompt optimisation is a separate discipline. Eval tells you if your prompt is good, not how to make it better.
7. “Eval is necessary, not sufficient.”
A passing eval suite is a minimum bar, not a ship decision.

5. How to disagree with us

We want to be wrong in public rather than wrong in private. Open a GitHub issue describing what you'd change and why. Every 3 months we review all open methodology issues; changes ship with a 90-day deprecation window, and any change to a judge prompt or scoring formula ships as a new version (e.g. groundedness-v2). If your issue leads to a change, you're credited in the changelog.

6. Citation

@misc{frootai-eval-methodology-2026,
  title  = {FrootAI Eval Methodology: What We Measure,
            How We Measure It, and What We Don't Claim},
  author = {Bali, Pavleen},
  year   = {2026},
  url    = {https://frootai.dev/methodology/eval},
  note   = {Living document. github.com/frootai/methodology}
}
