Play 98
Agent Evaluation Platform
High · ✅ Ready
Automated evaluation suite for any AI agent — standardized benchmarks, regression testing, A/B experimentation, human preference scoring, and leaderboard ranking. Evaluates agents across quality, safety, speed, cost, and user satisfaction dimensions.
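The five scoring dimensions suggest a per-run result record. A minimal sketch of such a schema (field names are assumptions, not the play's actual data model):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Scores for one agent on one benchmark run (hypothetical schema)."""
    agent_id: str
    benchmark: str
    quality: float        # task success / rubric score, 0-1
    safety: float         # guardrail pass rate, 0-1
    latency_ms: float     # speed dimension
    cost_usd: float       # per-run spend
    satisfaction: float   # human preference score, 0-1
```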
Architecture Pattern
Agent eval pipeline: benchmark selection → test execution → metric collection → A/B analysis → human scoring → leaderboard ranking
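A minimal sketch of this pipeline as a chain of stages sharing one context dict; the stage names and bodies are placeholders, and the last three stages are noted as comments rather than implemented:

```python
from typing import Any, Callable

Ctx = dict[str, Any]

# Each stage reads and extends a shared context dict; the bodies are stand-ins.
def select_benchmarks(ctx: Ctx) -> Ctx:
    ctx["benchmarks"] = ["qa_accuracy", "tool_use"]   # real suite lookup goes here
    return ctx

def execute_tests(ctx: Ctx) -> Ctx:
    ctx["runs"] = [{"benchmark": b, "score": 0.9} for b in ctx["benchmarks"]]
    return ctx

def collect_metrics(ctx: Ctx) -> Ctx:
    ctx["mean_score"] = sum(r["score"] for r in ctx["runs"]) / len(ctx["runs"])
    return ctx

# A/B analysis, human scoring, and leaderboard ranking would append here.
PIPELINE: list[Callable[[Ctx], Ctx]] = [select_benchmarks, execute_tests, collect_metrics]

def run_pipeline(agent_id: str) -> Ctx:
    ctx: Ctx = {"agent": agent_id}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx

print(run_pipeline("my-agent"))
```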
Azure Services
Azure OpenAI, Azure Container Apps, Azure Cosmos DB, Azure Machine Learning, Azure Functions
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs (sketched after this list)
- 3 agents — Eval Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (260 lines), evaluate (103 lines), tune (243 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with OpenAI key input + envFile
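The handoff chain can be pictured as an ordered list of (agent, model) steps. A sketch that only traces the routing, with agent and model names taken from the list above and all actual behavior stubbed out:

```python
# Hypothetical trace of agent.md's builder -> reviewer -> tuner handoff chain.
HANDOFFS = [
    ("eval-builder", "gpt-4o"),    # drafts the benchmark suite
    ("reviewer", "gpt-4o-mini"),   # checks coverage and scoring criteria
    ("tuner", "gpt-4o-mini"),      # adjusts thresholds and weights
]

def run_handoffs(task: str) -> str:
    artifact = task
    for agent, model in HANDOFFS:
        # A real orchestrator would call the model here; this only traces the chain.
        artifact = f"{artifact} -> [{agent}/{model}]"
    return artifact

print(run_handoffs("build eval suite"))
```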
TuneKit (AI Config)
- config/openai.json — evaluation prompts and scoring criteria
- config/evaluation.json — benchmark suites, thresholds, traffic splits (loading sketched after this list)
- config/guardrails.json — regression thresholds, minimum sample sizes
- evaluation/eval.py — targets: benchmark coverage >95%, regression detection >99%
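As referenced above, a sketch of how evaluation/eval.py might load config/evaluation.json and enforce a regression threshold; the regression_threshold key and its default are assumptions, not the actual config schema:

```python
import json

def load_eval_config(path: str = "config/evaluation.json") -> dict:
    """Read the TuneKit evaluation config (key names here are assumptions)."""
    with open(path) as f:
        return json.load(f)

def regression_gate(baseline: float, candidate: float, cfg: dict) -> bool:
    """Pass only if the candidate's score drop stays within the threshold."""
    threshold = cfg.get("regression_threshold", 0.02)  # e.g. allow a 2-point drop
    return (baseline - candidate) <= threshold

cfg = {"regression_threshold": 0.02}  # inline stand-in for config/evaluation.json
assert regression_gate(baseline=0.91, candidate=0.90, cfg=cfg)
assert not regression_gate(baseline=0.91, candidate=0.85, cfg=cfg)
```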
Tuning Parameters
Benchmark suite selection, regression threshold, A/B traffic split, human eval sample size, leaderboard scoring weights
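For the A/B traffic split, a common implementation (assumed here, not stated in the play) is deterministic hash bucketing, so each user consistently lands in the same arm:

```python
import hashlib

def assign_arm(user_id: str, treatment_share: float = 0.10) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign_arm("user-42"))        # stable across calls and processes
print(assign_arm("user-42", 0.5))   # widening the split reassigns some users
```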
Estimated Cost
- Dev/Test: $80-200/mo
- Production: $2K-8K/mo