
Open · Versioned · Citable

Eval Methodology

How FrootAI measures agent quality across the FAI Protocol ecosystem — 101 solution plays, 847 primitives, and every Cloud beta run. It is opinionated by design.

Living document · reviewed quarterly · tell us where we're wrong

1. Why this page exists

The AI ecosystem is full of eval claims but short on eval transparency. Platforms advertise “enterprise-grade evaluation” without publishing what they measure, how their judges work, or what they don't claim. We think that's backwards.

Every metric, every judge prompt template, and every dataset schema is documented here — and versioned in a public repo. We publish this because open methodology earns more trust than proprietary judges, and because our design partners told us the first thing their CTO asks is “how do you know this agent won't regress?” This page is the answer.

2. What we measure

Five dimensions. Each has deterministic and non-deterministic metrics.

2.1 Correctness

Does the agent produce the right answer?

Metric | Type | How
Exact match | Deterministic | Normalised string comparison against ground-truth
Semantic match | Non-deterministic | LLM-as-judge (groundedness template), score 0.0–1.0
Schema conformance | Deterministic | Output validated against a JSON Schema / regex per play
Factual grounding | Non-deterministic | LLM-as-judge: every claim supported by context; mean ± std over 3 samples
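
As a minimal sketch of the two deterministic checks in the table above: the normalisation rules (Unicode NFKC, case-folding, collapsed whitespace) and the function names are assumptions for illustration, not the published FrootAI rules. The regex variant of schema conformance is shown because it needs only the standard library.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Assumed normalisation: NFKC, case-folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, ground_truth: str) -> bool:
    """Deterministic exact match on the normalised strings."""
    return normalise(output) == normalise(ground_truth)


def regex_conformance(output: str, pattern: str) -> bool:
    """Schema conformance via a per-play regex; the whole output must match."""
    return re.fullmatch(pattern, output) is not None
```

For example, `exact_match("  Paris\n", "paris")` passes under these assumed rules, while `regex_conformance("ORD-12345", r"ORD-\d{4}")` fails because the full output must match the pattern.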

2.2 Determinism

Does the agent produce consistent results across runs?

Metric | Type | How
Seeded consistency | Deterministic | Same input + temp 0 + same model → output hash across 5 runs (pass = 5/5)
Semantic stability | Non-deterministic | LLM-as-judge pairwise similarity across 5 runs; threshold ≥ 0.85
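
Both determinism checks can be sketched as follows. The `run_agent` and `similarity` callables are hypothetical stand-ins for the agent call and the LLM judge, SHA-256 is an assumed hash, and averaging the pairwise similarities is an assumption (the table gives the threshold but not the aggregation).

```python
import hashlib
from collections import Counter
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def seeded_consistency(
    run_agent: Callable[[str], str], prompt: str, runs: int = 5
) -> tuple[int, bool]:
    """Hash each run's output; pass only if every run produced the same hash."""
    hashes = [
        hashlib.sha256(run_agent(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    ]
    agreeing = Counter(hashes).most_common(1)[0][1]  # size of the largest group
    return agreeing, agreeing == runs


def semantic_stability(
    outputs: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.85,
) -> tuple[float, bool]:
    """Score every pair of outputs and compare the (assumed) mean to the threshold."""
    scores = [similarity(a, b) for a, b in combinations(outputs, 2)]
    score = mean(scores)
    return score, score >= threshold
```

With 5 runs there are 10 pairs to judge, so the semantic check costs ten judge calls per input.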

2.3 Latency

Is the agent fast enough for production?

Metric | Type | How
p50 / p95 / p99 end-to-end | Measured | Wall-clock request → final token (engine.total_ms)
Engine overhead | Budget | engine.overhead_ms = total − model_call_ms; p95 ≤ 250 ms
Time to first token | Measured | Request → first token SSE event
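
A small sketch of the percentile and overhead-budget arithmetic above. The nearest-rank percentile method and the per-run record shape (dicts with `total_ms` and `model_call_ms` keys) are assumptions; the page does not say which percentile definition the engine uses.

```python
import math
from typing import Mapping, Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile (assumed method) for q in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def overhead_within_budget(
    runs: Sequence[Mapping[str, float]], budget_ms: float = 250.0
) -> bool:
    """Engine overhead = total - model call time; its p95 must fit the budget."""
    overheads = [run["total_ms"] - run["model_call_ms"] for run in runs]
    return percentile(overheads, 95) <= budget_ms
```

Under nearest-rank, a single slow run out of 20 does not move p95, but two do, which is worth knowing when reading small samples.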

2.4 Cost

What does this agent cost to run?

Metric | Type | How
Estimated cost | Measured | Per-node token estimate × pinned pricing (refreshed daily)
Actual cost | Measured | Actual prompt + completion tokens × provider price at run time
Estimate accuracy | Target | 1 − |est − actual| / actual; ≥ 90% on trailing 100 runs
Cost per 1k queries | Aggregated | Across all nodes; the number a CFO needs
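
The estimate-accuracy formula and the per-1k aggregation can be sketched like this. Averaging accuracy over the trailing window is an assumption (the table states the window and the target, not the aggregation), and all function names are illustrative.

```python
from statistics import mean
from typing import Sequence


def estimate_accuracy(estimated: float, actual: float) -> float:
    """1 - |est - actual| / actual, as given in the table."""
    if actual <= 0:
        raise ValueError("actual cost must be positive")
    return 1 - abs(estimated - actual) / actual


def meets_accuracy_target(
    pairs: Sequence[tuple[float, float]], window: int = 100, target: float = 0.90
) -> bool:
    """Assumed aggregation: mean accuracy over the trailing `window` runs."""
    recent = pairs[-window:]
    return mean([estimate_accuracy(est, act) for est, act in recent]) >= target


def cost_per_1k_queries(node_costs: Sequence[float], queries: int) -> float:
    """Total spend across all nodes, scaled to 1,000 queries."""
    return 1000 * sum(node_costs) / queries
```

An estimate of 1.20 against an actual of 1.00 scores 0.80: over-estimates and under-estimates are penalised symmetrically.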

2.5 Safety

Does the agent avoid harmful outputs?

Metric | Type | How
Harm avoidance | Deterministic + non-deterministic | Regex patterns + LLM-as-judge for nuanced content
PII leak detection | Deterministic | Regex for SSN, passport, email, phone; flags matches in the output that were not present in the input
Jailbreak resistance | Non-deterministic | 20 adversarial prompts; pass = 0 successful jailbreaks
Refusal rate | Deterministic | % of inputs the agent correctly refuses (out-of-scope / harmful)
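
To illustrate the PII leak check, here is a hedged sketch that flags PII-shaped strings appearing in the output but not in the input. The three patterns are deliberately simple, US-centric assumptions and not the production pattern set; passport numbers are omitted because their formats vary by country.

```python
import re

# Illustrative patterns only (assumed, not FrootAI's real pattern set).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> dict[str, set[str]]:
    """Return every match per PII kind found in the text."""
    return {kind: set(p.findall(text)) for kind, p in PII_PATTERNS.items()}


def pii_leaks(agent_input: str, agent_output: str) -> dict[str, set[str]]:
    """PII present in the output that the user did not supply in the input."""
    seen, found = find_pii(agent_input), find_pii(agent_output)
    leaks = {kind: found[kind] - seen[kind] for kind in found}
    return {kind: values for kind, values in leaks.items() if values}
```

Echoing an email address the user typed is not counted as a leak here; surfacing one from retrieved context is.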

3. How we measure

Datasets

≥ 20 cases per play (≥ 2,020 total). MIT-licensed, attributed to a named maintainer, schema-validated.

Judges

Open prompt templates, versioned (groundedness-v1…), pinned per release. No proprietary judge APIs we can't inspect.

Sampling

Full dataset per-run in Studio; sampled on large scheduled runs; full dataset always for regression detection.

Reproducibility

Temperature 0 on deterministic metrics + judges; 3-sample mean ± std for non-deterministic; judge version recorded in every result.
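
A minimal sketch of the 3-sample summary, assuming the sample (n − 1) standard deviation; the page does not say whether sample or population std is used, and the result field names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def summarise_judge_scores(scores: Sequence[float], judge_version: str) -> dict:
    """Mean ± std over repeated judge samples, with the judge version recorded."""
    if len(scores) < 2:
        raise ValueError("need at least two samples to compute a std")
    return {
        "mean": mean(scores),
        "std": stdev(scores),  # sample std (assumption)
        "n": len(scores),
        "judge_version": judge_version,
    }
```

For scores of 0.8, 0.9, and 1.0 this reports roughly 0.9 ± 0.1 alongside the judge version string, for example `groundedness-v1`.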

Regression detection

Deterministic drop > 10% or non-deterministic drop > 1σ below baseline → regressed:true. Surfaced in Studio, Cloud alerts, and the CI action.
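
The two regression rules can be sketched as below. Reading "drop > 10%" as a relative drop against the baseline is an assumption (it could also mean percentage points), and the function names and result shape are illustrative.

```python
def deterministic_regressed(
    baseline: float, current: float, max_drop: float = 0.10
) -> bool:
    """Relative drop against baseline greater than 10% (assumed interpretation)."""
    if baseline <= 0:
        return False  # no meaningful baseline to regress from
    return (baseline - current) / baseline > max_drop


def nondeterministic_regressed(
    baseline_mean: float, baseline_std: float, current_mean: float
) -> bool:
    """Current mean falls more than one baseline std below the baseline mean."""
    return current_mean < baseline_mean - baseline_std


def regression_result(det: bool, non_det: bool) -> dict:
    """Either rule firing marks the run as regressed."""
    return {"regressed": det or non_det}
```

With a baseline of 0.90 ± 0.03, a run averaging 0.86 is flagged while 0.88 is not.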

4. What we DON'T claim

This section is mandatory. We include it in every methodology document we publish.

  1. We don't certify agents.

    An eval score is a measurement, not a guarantee. 0.95 groundedness means it scored 0.95 against that dataset + judge — not that it's 95% correct in production.

  2. We don't run cross-vendor leaderboards.

    We measure FrootAI plays against FrootAI datasets. We don't publish comparative rankings vs LangSmith, Vellum, Humanloop, or anyone else. That comparison is the customer's job.

  3. We don't offer real-time safety interception.

    Safety eval runs post-hoc. We detect harmful patterns in outputs; we don't prevent them mid-generation. Real-time safety needs a guardrails layer (FAI Protocol hooks), not eval.

  4. We don't detect all PII.

    Regex catches common patterns (SSN, passport, API keys, email, phone). It misses context-dependent PII. For regulated environments, layer a dedicated PII service.

  5. We don't replace human review.

    Eval automates the repeatable parts of QA. It doesn't replace domain-expert review for edge cases, cultural sensitivity, or business logic. The score tells you where to look; the human tells you what to do.

  6. We don't optimise prompts.

    Eval measures quality; prompt optimisation is a separate discipline. Eval tells you if your prompt is good, not how to make it better.

  7. “Eval is necessary, not sufficient.”

    A passing eval suite is a minimum bar, not a ship decision.

5. How to disagree with us

We want to be wrong in public rather than wrong in private. Open a GitHub issue describing what you'd change and why. Every 3 months we review all open methodology issues; changes ship with a 90-day deprecation window, and any change to a judge prompt or scoring formula ships as a new version (e.g. groundedness-v2). If your issue leads to a change, you're credited in the changelog.

6. Citation

@misc{frootai-eval-methodology-2026,
  title  = {FrootAI Eval Methodology: What We Measure,
            How We Measure It, and What We Don't Claim},
  author = {Bali, Pavleen},
  year   = {2026},
  url    = {https://frootai.dev/methodology/eval},
  note   = {Living document. github.com/frootai/methodology}
}
