Play 36
Multimodal Agent
Medium · ✅ Ready
Vision + text + code — analyze images, screenshots, diagrams alongside natural language.
Vision + text + code agent that analyzes images, screenshots, diagrams, and documents alongside natural language input. GPT-4o Vision processes visual content, Azure AI Vision handles specialized image analysis, Blob Storage manages media assets, and Container Apps hosts the agent runtime. Supports use cases from UI testing to architectural diagram analysis to document verification with cross-modal reasoning.
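As a sketch of how such an agent might pass an image to GPT-4o alongside text, the helper below builds a chat message in the OpenAI content-parts format with an inline base64 data URL (the prompt, MIME type, and detail level are illustrative assumptions, not values from this play):

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/png",
                         detail: str = "high") -> dict:
    """Build one user message mixing text and an inline image, in the
    content-parts format accepted by GPT-4o vision chat endpoints."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{mime_type};base64,{b64}",
                    "detail": detail,  # "low" or "high"; see Tuning Parameters
                },
            },
        ],
    }

# Example: in this play the image bytes would normally be fetched from Blob Storage
msg = build_vision_message("Describe any UI errors in this screenshot.", b"fake-png-bytes")
```

The message can then be sent through the Azure OpenAI chat completions API against a GPT-4o deployment.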
Architecture Pattern
Multimodal agent: vision + text + code understanding, cross-modal reasoning
Azure Services
- Azure OpenAI (GPT-4o Vision)
- Azure AI Vision
- Blob Storage
- Container Apps
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs
- 3 agents — Multimodal Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (103 lines), evaluate (104 lines), tune (107 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with Vision + OpenAI key inputs + envFile
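A minimal sketch of what the `.vscode/mcp.json` described above might contain; the server command, input IDs, and environment variable names are assumptions, not values from this play:

```json
{
  "inputs": [
    { "id": "vision-key", "type": "promptString", "password": true },
    { "id": "openai-key", "type": "promptString", "password": true }
  ],
  "servers": {
    "frootai": {
      "command": "npx",
      "args": ["-y", "frootai-mcp"],
      "envFile": "${workspaceFolder}/.env",
      "env": {
        "AZURE_AI_VISION_KEY": "${input:vision-key}",
        "AZURE_OPENAI_KEY": "${input:openai-key}"
      }
    }
  }
}
```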
TuneKit (AI Config)
- config/openai.json — gpt-4o vision model config, image tokens
- config/vision.json — image processing params, resolution, formats
- config/guardrails.json — content safety for images, PII in screenshots
- evaluation/eval.py — Cross-modal accuracy >85%, Image understanding >80%
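The `eval.py` gate above reduces to a threshold check along these lines; the metric names and sample scores are illustrative, only the >85% and >80% floors come from this play:

```python
THRESHOLDS = {
    "cross_modal_accuracy": 0.85,  # from the play: >85%
    "image_understanding": 0.80,   # from the play: >80%
}

def passes_eval(metrics: dict) -> bool:
    """Return True only if every tracked metric clears its floor;
    a missing metric counts as a failure."""
    return all(metrics.get(name, 0.0) > floor
               for name, floor in THRESHOLDS.items())

# Illustrative run results
print(passes_eval({"cross_modal_accuracy": 0.91, "image_understanding": 0.84}))  # True
print(passes_eval({"cross_modal_accuracy": 0.91, "image_understanding": 0.78}))  # False
```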
Tuning Parameters
- Vision prompts
- Image resolution (low/medium/high)
- Multi-modal routing strategy
- Content safety thresholds for images
- Max image tokens per request
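The resolution and max-image-token knobs interact: per OpenAI's published GPT-4o formula (a flat charge at low detail, a base charge plus a per-512px-tile charge at high detail; the exact figures may drift across model versions), image token cost can be estimated as:

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image token cost: 85 tokens flat at low detail;
    at high detail, 85 base + 170 per 512px tile after downscaling."""
    if detail == "low":
        return 85  # flat charge regardless of image size
    # High detail: fit within 2048x2048, then shrink shortest side to 768px
    # (images are only scaled down, never up).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 765
```

This is one reason the low/high setting is a first-class tuning parameter: at high detail a single large screenshot can cost several hundred tokens before any text is processed.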
Estimated Cost
- Dev/Test: $100–250/mo
- Production: $1.5K–5K/mo