Play 15
Multi-Modal DocProc
Medium · 🔧 Skeleton
Process documents with text + images using GPT-4o multi-modal vision.
GPT-4o's vision capability processes documents that combine images, charts, tables, and text. Document Intelligence handles OCR; GPT-4o then interprets visual elements such as graphs, stamps, and signatures, and outputs structured JSON. Multi-page documents are handled with page-level processing.
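A minimal sketch of the per-page request this pipeline would send to GPT-4o: the OCR text from Document Intelligence plus the rendered page image, packaged as a multi-modal chat message. The function name `build_page_request` and the prompt wording are illustrative assumptions, not part of the play.

```python
import base64
import json

def build_page_request(page_image_bytes: bytes, ocr_text: str, schema: dict) -> list:
    """Build a multi-modal chat message list for one document page.

    Pairs the OCR text with the page image so the model can
    cross-check visual elements (stamps, charts, signatures)
    against the extracted text. Hypothetical helper for illustration.
    """
    image_b64 = base64.b64encode(page_image_bytes).decode("ascii")
    return [
        {
            "role": "system",
            "content": (
                "Extract the fields described by this JSON schema from the "
                "document page. Return only valid JSON.\n" + json.dumps(schema)
            ),
        },
        {
            "role": "user",
            "content": [
                # Text part: OCR output from Document Intelligence.
                {"type": "text", "text": f"OCR text for this page:\n{ocr_text}"},
                # Image part: the rendered page as a base64 data URL.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ]
```

The returned list can be passed as `messages` to an Azure OpenAI `chat.completions.create` call against a gpt-4o deployment.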
Architecture Pattern
Multi-modal extraction: images + text + tables → structured JSON
Azure Services
- Azure OpenAI (gpt-4o vision)
- Document Intelligence
- Blob Storage
- Cosmos DB
- Azure Functions
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs
- 3 agents — DocProc Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (124 lines), evaluate (100 lines), tune (112 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with OpenAI + Doc Intel inputs + envFile
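A sketch of what the `.vscode/mcp.json` entry could look like, combining prompted inputs for the OpenAI and Doc Intel keys with an `envFile`. The server command, package name, and input ids are assumptions for illustration; only the overall shape (inputs + envFile) comes from the play.

```json
{
  "inputs": [
    {
      "id": "openai-api-key",
      "type": "promptString",
      "description": "Azure OpenAI API key",
      "password": true
    },
    {
      "id": "docintel-api-key",
      "type": "promptString",
      "description": "Document Intelligence API key",
      "password": true
    }
  ],
  "servers": {
    "frootai": {
      "command": "npx",
      "args": ["-y", "frootai-mcp"],
      "envFile": "${workspaceFolder}/.env",
      "env": {
        "AZURE_OPENAI_API_KEY": "${input:openai-api-key}",
        "DOCINTEL_API_KEY": "${input:docintel-api-key}"
      }
    }
  }
}
```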
TuneKit (AI Config)
- config/openai.json — gpt-4o, vision prompts
- config/extraction.json — field schemas, image handling rules
- config/guardrails.json — PII in images
- evaluation/ — extraction accuracy per doc type
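One way the `evaluation/` check could score extraction accuracy per doc type: exact-match field comparison averaged within each document type. The function name and case layout (`doc_type`, `expected`, `predicted`) are assumptions, not the play's actual harness.

```python
from collections import defaultdict

def extraction_accuracy_by_type(cases: list[dict]) -> dict[str, float]:
    """Field-level extraction accuracy grouped by document type.

    For each test case, count how many expected fields were
    extracted with the exact expected value, then average the
    hit rate within each doc type. Illustrative sketch only.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        doc_type = case["doc_type"]
        predicted = case["predicted"]
        for field, expected in case["expected"].items():
            total[doc_type] += 1
            if predicted.get(field) == expected:
                correct[doc_type] += 1
    return {t: correct[t] / total[t] for t in total}
```

Exact match is the simplest scoring rule; a real harness might add normalization (dates, currency) before comparing.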
Tuning Parameters
- Image prompts
- Extraction schemas
- Confidence thresholds
- Page processing order
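The confidence-threshold parameter above could be applied as a simple post-extraction gate: fields at or above the threshold are auto-accepted, the rest are queued for human review. The function name and field layout (`value`, `confidence`) are hypothetical.

```python
def apply_confidence_threshold(
    fields: dict[str, dict], threshold: float = 0.85
) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and review queues.

    Each field carries a per-field confidence score; anything
    below the tunable threshold goes to human review instead of
    straight to Cosmos DB. Illustrative sketch only.
    """
    accepted: dict = {}
    review: dict = {}
    for name, info in fields.items():
        target = accepted if info["confidence"] >= threshold else review
        target[name] = info["value"]
    return accepted, review
```

Raising the threshold trades throughput for precision, which is why it is listed here as a tuning parameter rather than a constant.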
Estimated Cost
Dev/Test
$120–280/mo
Production
$1.5K–4K/mo