FAI Development Workflows

Azure AI-specific development patterns — RAG debugging, agent lifecycle, evaluation-driven development, cost tracking.

L14·14 min read·Medium

RAG Development Cycle

Building a production RAG pipeline follows a 5-phase cycle. Each phase has specific validation criteria before moving to the next:

1. DesignChoose chunking strategy, embedding model, vector store. Define search type (hybrid, pure vector, semantic). Set target metrics: groundedness ≥ 4.0, relevance ≥ 3.5.

2. IndexIngest documents, generate embeddings, populate the vector store. Verify document count, chunk distribution, and embedding dimensions match expectations.

3. RetrieveTest search queries against the index. Measure recall@5, precision@5, and MRR. Tune k-value, filters, and reranking threshold.

4. GenerateWire retrieval to the LLM. Craft system prompt with grounding instructions. Test with diverse queries — edge cases, adversarial inputs, multi-turn conversations.

5. EvaluateRun the full pipeline through FAI Engine evaluation. Check groundedness, relevance, coherence, fluency, safety. If below threshold, iterate on retrieval or prompt.

RAG Debugging Workflow

When RAG quality is low, debug each layer independently. Start from retrieval and work forward:

Debug: Check Retrieval Quality

# 1. Test retrieval directly — bypass the LLM
curl -X POST "$SEARCH_ENDPOINT/indexes/$INDEX/docs/search" \
  -H "api-key: $SEARCH_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "search": "How do I configure RBAC?",
    "queryType": "semantic",
    "semanticConfiguration": "default",
    "top": 5,
    "select": "title,content,chunk_id"
  }'

# 2. Check: Are the top-5 results relevant?
# If not → chunking strategy or embedding model is wrong
# If yes → problem is in the prompt or generation layer

Debug: Check Embedding Quality

# Compare embedding similarity between query and expected doc
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(...)
q_emb = client.embeddings.create(
    input="How do I configure RBAC?",
    model="text-embedding-3-large"
).data[0].embedding

doc_emb = client.embeddings.create(
    input="RBAC configuration requires assigning roles...",
    model="text-embedding-3-large"
).data[0].embedding

similarity = np.dot(q_emb, doc_emb)
print(f"Cosine similarity: {similarity:.4f}")
# Expected: > 0.80 for relevant pairs

Debug: Check Generation Grounding

# 3. Test generation with known-good context
node engine/index.js \
  solution-plays/01-enterprise-rag/fai-manifest.json \
  --eval --query "How do I configure RBAC?" \
  --context "RBAC configuration requires assigning roles..."

# Check groundedness score
# < 3.5 → system prompt needs stronger grounding instructions
# ≥ 4.0 → generation is properly grounded

Agent Development Cycle

FAI agents follow the Build → Review → Tune chain. Each phase has a dedicated agent pattern:

Build Phase

Create the agent file with frontmatter, implement using config/ values, wire into the target play. The builder agent generates initial structure.

Create Agent

# Scaffold a new agent
node scripts/scaffold-primitive.js agent

# Creates: agents/fai-<name>.agent.md
# With frontmatter: description, model, tools, waf, plays

Review Phase

Self-review against security, quality, and WAF compliance. Check tool access follows least privilege, description is accurate, and pillar alignment is correct.

Validate Agent

# Validate the new agent
node scripts/validate-primitives.js --verbose

# Check: description ≥ 10 chars?
# Check: tool names are valid?
# Check: WAF pillars are from the 6-pillar set?
# Check: filename is lowercase-hyphen?

Tune Phase

Verify config values are production-appropriate. Test the agent in Copilot Chat with real scenarios. Measure response quality and iterate.

Test Agent in Context

# Load the play that uses this agent
node engine/index.js \
  solution-plays/01-enterprise-rag/fai-manifest.json \
  --status

# Open Copilot Chat and invoke:
# @fai-rag-architect "Review my search index config"
# Verify: agent stays in scope, uses allowed tools only

Evaluation-Driven Development

Like test-driven development but for AI quality. Write evaluation criteria before building, then iterate until all metrics pass:

config/guardrails.json

{
  "evaluation": {
    "metrics": {
      "groundedness": { "threshold": 4.0, "weight": 0.3 },
      "relevance":    { "threshold": 3.5, "weight": 0.25 },
      "coherence":    { "threshold": 4.0, "weight": 0.2 },
      "fluency":      { "threshold": 4.0, "weight": 0.15 },
      "safety":       { "threshold": 1.0, "weight": 0.1 }
    },
    "min_weighted_score": 3.8,
    "test_queries": 50,
    "fail_fast": true
  }
}

The cycle: define thresholds → build pipeline → run eval → check scores → tune config → re-eval. Never ship a play where any metric is below threshold. The FAI Engine blocks deployment if fail_fast: true and any metric fails.

Cost Tracking Workflow

AI costs can spiral without monitoring. Track token usage, model routing efficiency, and caching hit rates:

Cost Monitoring Commands

# Estimate monthly cost for a play
npx frootai cost 01 --scale prod

# Sample output:
# Azure OpenAI (GPT-4o):    $340/mo  (850K tokens/day)
# Azure AI Search (S1):      $250/mo  (1 index, 50GB)
# Azure Container Apps:       $45/mo  (2 replicas, 0.5 vCPU)
# Total estimated:           $635/mo

# Track token usage over time
node engine/index.js \
  solution-plays/01-enterprise-rag/fai-manifest.json \
  --cost --period 7d

Model Routing

Route simple queries to GPT-4o-mini ($0.15/1M tokens), escalate complex ones to GPT-4o ($2.50/1M tokens). A 70/30 split can save 60% on model costs.

Semantic Caching

Cache responses keyed by embedding similarity. A 40% cache hit rate eliminates 40% of LLM calls. Use Azure Cache for Redis with vector similarity.

Token Budgets

Set max_tokens per request and daily token caps per play. Alert at 80% of budget. The FAI Engine enforces these limits at runtime.

Prompt Iteration Workflow

Systematic prompt improvement using versioned configs and A/B evaluation:

Prompt Version Control

# config/openai.json — version your system prompts
{
  "model": "gpt-4o",
  "temperature": 0.3,
  "max_tokens": 2048,
  "system_prompt_version": "v3",
  "system_prompt": "You are an Azure architecture expert. Answer ONLY from the provided context. If the context does not contain the answer, say 'I don't have information about that.' Never speculate beyond the provided documents."
}

# Iterate: change prompt → run eval → compare scores
# v1: Basic prompt           → groundedness: 3.2
# v2: + "answer ONLY from"   → groundedness: 3.8
# v3: + "never speculate"    → groundedness: 4.3 ✓

Debugging LLM Calls

When responses are unexpected, inspect the actual API calls:

Enable Debug Logging

# Set environment variable for verbose logging
$env:FAI_DEBUG = "true"

# Run the engine — all API calls are logged
node engine/index.js \
  solution-plays/01-enterprise-rag/fai-manifest.json \
  --eval --verbose

# Log output includes:
# → Request:  model, temperature, max_tokens, messages[]
# → Response: finish_reason, usage.total_tokens, content
# → Timing:   latency_ms, tokens_per_second

Common Issues

finish_reason: length — Response was truncated. Increase max_tokens.

finish_reason: content_filter — Content safety blocked the response. Check input for policy violations.

429 Too Many Requests — Rate limited. Implement retry with exponential backoff or increase TPM quota.

High latency (>5s) — Large context window. Reduce retrieved chunks or use streaming.

Security Audit Workflow

Run before every deployment. Covers secrets, identity, content safety, and prompt injection resistance:

Security Checklist Commands

# 1. Scan for secrets in the codebase
node hooks/fai-secrets-scanner/scan.js

# 2. Verify Managed Identity configuration
az identity show --name "$MI_NAME" --resource-group "$RG"

# 3. Check content safety filters are active
az cognitiveservices account show \
  --name "$AOAI_NAME" --resource-group "$RG" \
  --query "properties.contentFilter"

# 4. Test prompt injection resistance
node engine/index.js \
  solution-plays/01-enterprise-rag/fai-manifest.json \
  --eval --test-set "evaluation/adversarial-prompts.jsonl"

L13: Quick Start

L15: End-to-End Workshop

L11: Agentic Workflows L15: End-to-End Workshop

Learning Hub Browse Primitives Back to FrootAI