Play 27
AI Data Pipeline
High · ✅ Ready
ETL with LLM augmentation — classify, enrich, and redact at scale.
An ETL pipeline enhanced with LLM intelligence. Data flows through Azure Data Factory; at each stage, GPT-4o-mini (chosen for cost efficiency on high-volume processing) classifies records, extracts entities, scores quality, and redacts PII. Schema detection auto-maps incoming formats, Event Hubs handles real-time ingestion, and Cosmos DB stores the enriched output. Batch processing handles millions of records with automatic retries and dead-letter queues.
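The per-record flow above — enrich, retry with backoff, dead-letter on exhaustion — can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `classify_record` stands in for the real Azure OpenAI call, and all names are hypothetical.

```python
import time

DEAD_LETTER = []  # records that exhaust their retries land here

def classify_record(record: dict) -> dict:
    """Stand-in for the gpt-4o-mini call that classifies, extracts
    entities, scores quality, and redacts PII for one record."""
    if record.get("body") is None:
        raise ValueError("empty record")
    return {**record, "category": "invoice", "quality": 0.97}

def process_batch(records, max_retries=3, backoff=0.01):
    """Enrich a batch; retry transient failures, then dead-letter."""
    enriched = []
    for rec in records:
        for attempt in range(max_retries):
            try:
                enriched.append(classify_record(rec))
                break
            except Exception:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
        else:
            DEAD_LETTER.append(rec)  # retries exhausted → dead-letter queue
    return enriched

ok = process_batch([{"id": 1, "body": "..."}, {"id": 2, "body": None}])
```

In the real pipeline the batch would arrive from Event Hubs or Data Factory and the enriched output would be written to Cosmos DB; the retry/dead-letter shape stays the same.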
Architecture Pattern
LLM-augmented ETL: classify, extract, enrich, redact, lakehouse integration
Azure Services
Azure OpenAI (gpt-4o-mini) · Data Factory · Blob Storage · Cosmos DB · Event Hubs
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs
- 3 agents — Data Pipeline Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (104 lines), evaluate (105 lines), tune (103 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with Storage + OpenAI inputs + envFile
TuneKit (AI Config)
- config/openai.json — gpt-4o-mini for cost efficiency, batch mode
- config/pipeline.json — stage definitions, batch size, retry rules
- config/guardrails.json — PII redaction rules, quality thresholds
- evaluation/eval.py — Classification accuracy >90%, PII recall >95%
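The threshold gates in evaluation/eval.py might look like the sketch below (the actual script isn't shown here, so function names, data, and structure are illustrative — only the >90% accuracy and >95% PII-recall bars come from the config above):

```python
def accuracy(preds, labels):
    """Fraction of records classified into the correct category."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def pii_recall(detected: set, actual: set) -> float:
    """Fraction of true PII spans the redactor actually caught."""
    return len(detected & actual) / len(actual) if actual else 1.0

def gate(preds, labels, detected, actual, acc_min=0.90, recall_min=0.95):
    """Fail the evaluation if either quality bar is missed."""
    acc = accuracy(preds, labels)
    rec = pii_recall(detected, actual)
    return {"accuracy": acc, "pii_recall": rec,
            "passed": acc > acc_min and rec > recall_min}

result = gate(
    preds=["invoice", "po", "invoice", "receipt"],
    labels=["invoice", "po", "receipt", "receipt"],
    detected={"ssn:123-45-6789", "email:a@b.com"},
    actual={"ssn:123-45-6789", "email:a@b.com"},
)
```

Here accuracy is 0.75, so the gate fails even though PII recall is perfect — both bars must clear for the pipeline to pass.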
Tuning Parameters
- Classification prompts per data type
- PII detection rules (GDPR/HIPAA)
- Quality score thresholds
- Batch size (100→10K)
- Dead-letter retry policy
- Schema mapping rules
Estimated Cost
Dev/Test
$50–150/mo
Production
$800–3K/mo