Play 12
Model Serving on AKS
High · 🔧 Skeleton
Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.
Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference. Auto-scaling is driven by request queue depth, with health checks and rolling deployments for safe updates. Quantized models (GPTQ, AWQ) reduce GPU cost, and ACR stores the model containers.
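The queue-depth-driven auto-scaling above can be sketched as a simple control rule. The function name, thresholds, and replica bounds here are illustrative assumptions, not values from this playbook; in practice this logic would live in a KEDA scaler or custom metrics adapter.

```python
import math

# Illustrative sketch: size the vLLM replica count from request queue depth.
# target_per_replica, min_replicas, max_replicas are assumed tunables.
def desired_replicas(queue_depth: int,
                     target_per_replica: int = 8,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale so each replica serves roughly target_per_replica queued requests."""
    wanted = math.ceil(queue_depth / target_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, wanted))
```

For example, a queue depth of 30 with the defaults yields 4 replicas, and the result is always clamped to the configured min/max bounds.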
Architecture Pattern
GPU cluster, custom model hosting, LLM inference, auto-scaling
Azure Services
- AKS (GPU nodes)
- NVIDIA GPU
- Container Registry (ACR)
- vLLM
DevKit (.github Agentic OS)
- agent.md — root orchestrator with builder→reviewer→tuner handoffs
- 3 agents — AKS Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
- 3 skills — deploy (142 lines), evaluate (101 lines), tune (112 lines)
- 4 prompts — /deploy, /test, /review, /evaluate with agent routing
- .vscode/mcp.json — FrootAI MCP with AKS cluster + ACR inputs + envFile
TuneKit (AI Config)
- config/aks.json — node pools, GPU config, scaling rules
- config/vllm.json — quantization, batching, max concurrent
- infra/main.bicep — AKS cluster definition
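A plausible shape for `config/vllm.json`, covering the quantization, batching, and concurrency knobs listed above. The key names track vLLM's engine arguments, but this exact schema is an assumption of this sketch, not one defined by the playbook:

```json
{
  "model": "path/to/model-weights",
  "quantization": "awq",
  "max_num_seqs": 64,
  "max_model_len": 8192,
  "gpu_memory_utilization": 0.9
}
```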
Tuning Parameters
- GPU node count
- Quantization level (GPTQ/AWQ)
- Batching params
- Scaling rules
- Model weights path
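Quantization level drives GPU node sizing. A rough weights-only estimate (ignoring KV cache and activation overhead, so purely illustrative arithmetic):

```python
# Approximate on-GPU size of model weights alone, in GB.
# Real memory use is higher: KV cache, activations, and CUDA overhead add to this.
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8
```

A 7B-parameter model is roughly 14 GB in FP16 but about 3.5 GB with 4-bit GPTQ/AWQ, which is the main lever for fitting models onto smaller (cheaper) GPU SKUs.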
Estimated Cost
- Dev/Test: $300–600/mo
- Production: $3K–20K+/mo
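The spread between the dev/test and production figures is mostly GPU node-hours. A back-of-envelope check, using hypothetical hourly rates (actual Azure GPU VM pricing varies by SKU, region, and reservation):

```python
# Rough monthly spend on GPU nodes alone; excludes ACR, storage, and egress.
# hourly_rate_usd values in the examples below are hypothetical, not Azure list prices.
def monthly_gpu_cost(node_count: int, hourly_rate_usd: float,
                     hours_per_month: int = 730) -> float:
    return node_count * hourly_rate_usd * hours_per_month
```

One node at a hypothetical $0.50/h runs about $365/mo (dev/test range); four nodes at a hypothetical $3/h run about $8,760/mo, squarely in the production band above.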