Play 98
Agent Evaluation Platform
High · ✅ Ready
Automated evaluation suite for any AI agent — standardized benchmarks, regression testing, A/B experimentation, human preference scoring, and leaderboard ranking. Evaluates agents across quality, safety, speed, cost, and user satisfaction dimensions.
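The five scoring dimensions suggest a per-run result record. A minimal sketch of such a schema (field names are assumptions, not the play's actual data model):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Scores for one agent on one benchmark run (hypothetical schema)."""
    agent_id: str
    benchmark: str
    quality: float        # task success / rubric score, 0-1
    safety: float         # guardrail pass rate, 0-1
    latency_ms: float     # speed dimension
    cost_usd: float       # per-run spend
    satisfaction: float   # human preference score, 0-1
```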
Architecture Pattern
Agent eval pipeline: benchmark selection → test execution → metric collection → A/B analysis → human scoring → leaderboard ranking
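A minimal sketch of this pipeline as a chain of stages sharing one context dict; the stage names and bodies are placeholders, and the last three stages are noted as comments rather than implemented:

```python
from typing import Any, Callable

Ctx = dict[str, Any]

# Each stage reads and extends a shared context dict; the bodies are stand-ins.
def select_benchmarks(ctx: Ctx) -> Ctx:
    ctx["benchmarks"] = ["qa_accuracy", "tool_use"]   # real suite lookup goes here
    return ctx

def execute_tests(ctx: Ctx) -> Ctx:
    ctx["runs"] = [{"benchmark": b, "score": 0.9} for b in ctx["benchmarks"]]
    return ctx

def collect_metrics(ctx: Ctx) -> Ctx:
    ctx["mean_score"] = sum(r["score"] for r in ctx["runs"]) / len(ctx["runs"])
    return ctx

# A/B analysis, human scoring, and leaderboard ranking would append here.
PIPELINE: list[Callable[[Ctx], Ctx]] = [select_benchmarks, execute_tests, collect_metrics]

def run_pipeline(agent_id: str) -> Ctx:
    ctx: Ctx = {"agent": agent_id}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx

print(run_pipeline("my-agent"))
```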
Azure Services
Azure OpenAI, Azure Container Apps, Azure Cosmos DB, Azure Machine Learning, Azure Functions
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs (sketched after this list)
- 3 agents — Eval Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (260 lines), evaluate (103 lines), tune (243 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with OpenAI key input + envFile
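The handoff chain can be pictured as an ordered list of (agent, model) steps. A sketch that only traces the routing, with agent and model names taken from the list above and all actual behavior stubbed out:

```python
# Hypothetical trace of agent.md's builder -> reviewer -> tuner handoff chain.
HANDOFFS = [
    ("eval-builder", "gpt-4o"),    # drafts the benchmark suite
    ("reviewer", "gpt-4o-mini"),   # checks coverage and scoring criteria
    ("tuner", "gpt-4o-mini"),      # adjusts thresholds and weights
]

def run_handoffs(task: str) -> str:
    artifact = task
    for agent, model in HANDOFFS:
        # A real orchestrator would call the model here; this only traces the chain.
        artifact = f"{artifact} -> [{agent}/{model}]"
    return artifact

print(run_handoffs("build eval suite"))
```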
TuneKit (AI Config)
- config/openai.json — evaluation prompts and scoring criteria
- config/evaluation.json — benchmark suites, thresholds, traffic splits (loading sketched after this list)
- config/guardrails.json — regression thresholds, minimum sample sizes
- evaluation/eval.py — targets: benchmark coverage >95%, regression detection >99%
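As referenced above, a sketch of how evaluation/eval.py might load config/evaluation.json and enforce a regression threshold; the regression_threshold key and its default are assumptions, not the actual config schema:

```python
import json

def load_eval_config(path: str = "config/evaluation.json") -> dict:
    """Read the TuneKit evaluation config (key names here are assumptions)."""
    with open(path) as f:
        return json.load(f)

def regression_gate(baseline: float, candidate: float, cfg: dict) -> bool:
    """Pass only if the candidate's score drop stays within the threshold."""
    threshold = cfg.get("regression_threshold", 0.02)  # e.g. allow a 2-point drop
    return (baseline - candidate) <= threshold

cfg = {"regression_threshold": 0.02}  # inline stand-in for config/evaluation.json
assert regression_gate(baseline=0.91, candidate=0.90, cfg=cfg)
assert not regression_gate(baseline=0.91, candidate=0.85, cfg=cfg)
```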
Tuning Parameters
Benchmark suite selection, regression threshold, A/B traffic split, human eval sample size, leaderboard scoring weights
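For the A/B traffic split, a common implementation (assumed here, not stated in the play) is deterministic hash bucketing, so each user consistently lands in the same arm:

```python
import hashlib

def assign_arm(user_id: str, treatment_share: float = 0.10) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign_arm("user-42"))        # stable across calls and processes
print(assign_arm("user-42", 0.5))   # widening the split reassigns some users
```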
Estimated Cost
- Dev/Test: $80-200/mo
- Production: $2K-8K/mo