Play 12
Model Serving on AKS
High · 🔧 Skeleton
Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.
Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference. Auto-scaling is driven by request queue depth, with health checks and rolling deployments for safe updates. Quantized models (GPTQ, AWQ) reduce GPU cost, and ACR stores the model containers.
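The queue-depth-driven auto-scaling above can be sketched as a simple control rule. The function name, thresholds, and replica bounds here are illustrative assumptions, not values from this playbook; in practice this logic would live in a KEDA scaler or custom metrics adapter.

```python
import math

# Illustrative sketch: size the vLLM replica count from request queue depth.
# target_per_replica, min_replicas, max_replicas are assumed tunables.
def desired_replicas(queue_depth: int,
                     target_per_replica: int = 8,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale so each replica serves roughly target_per_replica queued requests."""
    wanted = math.ceil(queue_depth / target_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, wanted))
```

For example, a queue depth of 30 with the defaults yields 4 replicas, and the result is always clamped to the configured min/max bounds.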
Architecture Pattern
GPU cluster, custom model hosting, LLM inference, auto-scaling
Azure Services
- AKS (GPU nodes)
- NVIDIA GPU
- Container Registry (ACR)
- vLLM
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs
- 3 agents — AKS Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (142 lines), evaluate (101 lines), tune (112 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with AKS cluster + ACR inputs + envFile
TuneKit (AI Config)
- config/aks.json — node pools, GPU config, scaling rules
- config/vllm.json — quantization, batching, max concurrent
- infra/main.bicep — AKS cluster definition
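A plausible shape for `config/vllm.json`, covering the quantization, batching, and concurrency knobs listed above. The key names track vLLM's engine arguments, but this exact schema is an assumption of this sketch, not one defined by the playbook:

```json
{
  "model": "path/to/model-weights",
  "quantization": "awq",
  "max_num_seqs": 64,
  "max_model_len": 8192,
  "gpu_memory_utilization": 0.9
}
```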
Tuning Parameters
- GPU node count
- Quantization level (GPTQ/AWQ)
- Batching params
- Scaling rules
- Model weights path
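Quantization level drives GPU node sizing. A rough weights-only estimate (ignoring KV cache and activation overhead, so purely illustrative arithmetic):

```python
# Approximate on-GPU size of model weights alone, in GB.
# Real memory use is higher: KV cache, activations, and CUDA overhead add to this.
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8
```

A 7B-parameter model is roughly 14 GB in FP16 but about 3.5 GB with 4-bit GPTQ/AWQ, which is the main lever for fitting models onto smaller (cheaper) GPU SKUs.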
Estimated Cost
- Dev/Test: $300–600/mo
- Production: $3K–20K+/mo
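The spread between the dev/test and production figures is mostly GPU node-hours. A back-of-envelope check, using hypothetical hourly rates (actual Azure GPU VM pricing varies by SKU, region, and reservation):

```python
# Rough monthly spend on GPU nodes alone; excludes ACR, storage, and egress.
# hourly_rate_usd values in the examples below are hypothetical, not Azure list prices.
def monthly_gpu_cost(node_count: int, hourly_rate_usd: float,
                     hours_per_month: int = 730) -> float:
    return node_count * hourly_rate_usd * hours_per_month
```

One node at a hypothetical $0.50/h runs about $365/mo (dev/test range); four nodes at a hypothetical $3/h run about $8,760/mo, squarely in the production band above.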