FrootAI — AmpliFAI your AI Ecosystem

Play 12

Model Serving AKS

High · 🔧 Skeleton

Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.

Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference. The deployment auto-scales on request queue depth and ships with health checks and rolling deployments. Quantized models (GPTQ, AWQ) are supported for cost efficiency, and ACR stores the model containers.
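The queue-depth scaling rule above can be sketched as a simple replica calculation, in the spirit of an HPA/KEDA metric target. The target of 8 queued requests per replica and the replica bounds are illustrative assumptions, not values from this play:

```python
import math

def desired_replicas(queue_depth: int,
                     target_per_replica: int = 8,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Size the vLLM deployment so each pod handles roughly
    target_per_replica queued requests, clamped to the pool bounds."""
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(40))  # 40 queued requests -> 5 replicas
```

In production the same calculation would live in a KEDA scaler or HPA external metric rather than application code.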

Architecture Pattern

GPU cluster, custom model hosting, LLM inference, auto-scaling

Azure Services

  • AKS (GPU nodes)
  • NVIDIA GPU
  • Container Registry (ACR)
  • vLLM

DevKit (.github Agentic OS)

  • agent.md — root orchestrator with builder→reviewer→tuner handoffs
  • 3 agents — AKS Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
  • 3 skills — deploy (142 lines), evaluate (101 lines), tune (112 lines)
  • 4 prompts — /deploy, /test, /review, /evaluate with agent routing
  • .vscode/mcp.json — FrootAI MCP with AKS cluster + ACR inputs + envFile
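A minimal `.vscode/mcp.json` following VS Code's MCP configuration format might look like the sketch below. Only the overall shape (cluster/ACR inputs plus `envFile`) comes from the bullet above; the server package name and input ids are hypothetical:

```json
{
  "inputs": [
    { "id": "aks-cluster", "type": "promptString", "description": "AKS cluster name" },
    { "id": "acr-name", "type": "promptString", "description": "Container registry (ACR) name" }
  ],
  "servers": {
    "frootai": {
      "command": "npx",
      "args": ["-y", "@frootai/mcp-server", "--cluster", "${input:aks-cluster}", "--registry", "${input:acr-name}"],
      "envFile": "${workspaceFolder}/.env"
    }
  }
}
```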

TuneKit (AI Config)

  • config/aks.json — node pools, GPU config, scaling rules
  • config/vllm.json — quantization, batching, max concurrent
  • infra/main.bicep — AKS cluster definition
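As an illustration of the vLLM config, `config/vllm.json` could map onto vLLM's engine options (`--quantization`, `--max-num-seqs`, `--max-model-len`, and `--gpu-memory-utilization` are real vLLM flags); the key names, model, and values below are assumptions about this kit's schema, not its actual contents:

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "quantization": "awq",
  "max_num_seqs": 64,
  "max_model_len": 8192,
  "gpu_memory_utilization": 0.9
}
```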

Tuning Parameters

  • GPU node count
  • Quantization level (GPTQ/AWQ)
  • Batching params
  • Scaling rules
  • Model weights path
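To see why the quantization level is worth tuning, here is a back-of-the-envelope estimate of weight memory (weights only; the KV cache and activations need additional headroom):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate GPU memory consumed by model weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(8, 16))  # 8B model at fp16: 16.0 GB
print(weight_memory_gb(8, 4))   # 8B model at 4-bit GPTQ/AWQ: 4.0 GB
```

Dropping from fp16 to 4-bit roughly quarters the weight footprint, which is what lets smaller (cheaper) GPU SKUs serve the same model.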

Estimated Cost

Dev/Test

$300–600/mo

Production

$3K–20K+/mo
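The spread between the tiers mostly tracks GPU node-hours. A rough sketch of the compute-only math (the hourly rate is a placeholder, not Azure pricing; check the current rate for your GPU VM SKU):

```python
def monthly_gpu_cost(node_count: int,
                     hourly_rate_usd: float,
                     hours_per_month: int = 730) -> float:
    """Rough monthly cost of an always-on GPU node pool (compute only;
    excludes storage, egress, and the AKS control plane tier)."""
    return node_count * hourly_rate_usd * hours_per_month

# e.g. one node at a hypothetical $1.00/hr runs ~$730/mo
print(monthly_gpu_cost(1, 1.0))
```

Scaling the node pool to zero outside business hours is the usual lever for keeping a dev/test cluster near the low end of the range.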