
Module 11: Quick Reference Cards – AI Essentials at a Glance

Type: Reference | Level: All Levels | Usage: Bookmark this page – your daily AI quick reference | Last Updated: March 2026


How to Use This Page

This module is designed as a fast-lookup reference. Every card is self-contained. Use your browser's Ctrl+F / Cmd+F to jump to any term instantly. No narrative – just facts, tables, and decision shortcuts.


Card 1: LLM Generation Parameters Cheat Sheet

Every parameter you can tune when calling a large language model.

| Parameter | Range | Typical Default | What It Controls | Rule of Thumb |
|---|---|---|---|---|
| Temperature | 0.0–2.0 | 1.0 | Randomness of token sampling. Lower = deterministic, higher = creative. | 0.0–0.3 for factual / extraction; 0.7–1.0 for creative writing. Never exceed 1.5 in production. |
| Top-P (nucleus sampling) | 0.0–1.0 | 1.0 | Cumulative probability cutoff; only tokens within the top-P mass are considered. | Use 0.9–0.95 for balanced output. Set to 1.0 and control via temperature, or vice-versa; avoid tuning both simultaneously. |
| Top-K | 1 – vocabulary size | Model-dependent | Limits sampling to the K most probable next tokens. | 40–100 is a safe range. Not exposed in Azure OpenAI; available in open-source / Hugging Face models. |
| Frequency Penalty | -2.0–2.0 | 0.0 | Penalizes tokens proportionally to how often they already appeared; reduces repetition. | 0.3–0.8 to reduce repetitive phrasing. Values above 1.0 can distort output. |
| Presence Penalty | -2.0–2.0 | 0.0 | Flat penalty applied once a token has appeared at all; encourages topic diversity. | 0.3–0.6 for varied topic coverage. Combine lightly with frequency penalty; don't max both. |
| Max Tokens (max_completion_tokens) | 1 – model context limit | Model-dependent | Hard ceiling on response length in tokens. | Set explicitly to avoid runaway costs. Estimate: 1 paragraph ≈ 80–120 tokens, 1 page ≈ 600–800 tokens. |
| Stop Sequences | Up to 4 strings | None | Generation halts when any stop sequence is emitted. | Use ["\n\n"] for single-paragraph answers; use ["```"] to stop after a code block. |
| Seed | Any integer | None | When set, the service attempts deterministic output (best-effort). | Use for reproducible evaluations and regression testing. Same seed + same prompt + same parameters = same output (mostly). |
| Response Format | text, json_object, json_schema | text | Forces structured output format. | Use json_schema for reliable structured extraction. Always include "respond in JSON" in the prompt when using json_object. |
| N | 1–128 | 1 | Number of completions to generate per request. | Use n > 1 only for ranking/voting strategies; multiplies token cost linearly. |
| Logprobs | true/false; top_logprobs 0–20 | false | Returns log-probabilities for each output token. | Use for confidence scoring, calibration, and classification thresholds. |
| Logit Bias | Token ID → bias (-100 to 100) | {} | Directly adjusts probability of specific tokens; -100 bans a token. | Ban unwanted tokens (e.g., profanity token IDs). Use sparingly; hard to maintain. |

Parameter Interaction Quick Rules

| Scenario | Temperature | Top-P | Freq. Penalty | Presence Penalty |
|---|---|---|---|---|
| Deterministic extraction | 0.0 | 1.0 | 0.0 | 0.0 |
| Conversational chatbot | 0.7 | 0.95 | 0.3 | 0.3 |
| Creative writing | 1.0 | 0.95 | 0.5 | 0.6 |
| Code generation | 0.2 | 0.95 | 0.0 | 0.0 |
| Brainstorming / ideation | 1.2 | 1.0 | 0.8 | 0.8 |
| Summarization | 0.3 | 0.95 | 0.0 | 0.0 |
| Translation | 0.3 | 0.95 | 0.0 | 0.0 |
| Customer support bot | 0.5 | 0.9 | 0.4 | 0.2 |
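As a convenience, scenario presets like these can be captured in a lookup dictionary and splatted into an API call. A minimal sketch; the names `SCENARIO_PRESETS` and `sampling_params` are illustrative, not part of any SDK:

```python
# Scenario presets from the table above, expressed as kwargs you could
# pass to a chat-completions call. Names here are illustrative only.
SCENARIO_PRESETS = {
    "deterministic_extraction": dict(temperature=0.0, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0),
    "conversational_chatbot":   dict(temperature=0.7, top_p=0.95, frequency_penalty=0.3, presence_penalty=0.3),
    "creative_writing":         dict(temperature=1.0, top_p=0.95, frequency_penalty=0.5, presence_penalty=0.6),
    "code_generation":          dict(temperature=0.2, top_p=0.95, frequency_penalty=0.0, presence_penalty=0.0),
    "brainstorming":            dict(temperature=1.2, top_p=1.0,  frequency_penalty=0.8, presence_penalty=0.8),
    "summarization":            dict(temperature=0.3, top_p=0.95, frequency_penalty=0.0, presence_penalty=0.0),
    "translation":              dict(temperature=0.3, top_p=0.95, frequency_penalty=0.0, presence_penalty=0.0),
    "customer_support":         dict(temperature=0.5, top_p=0.9,  frequency_penalty=0.4, presence_penalty=0.2),
}

def sampling_params(scenario: str) -> dict:
    """Return a copy of the sampling parameters for a named scenario."""
    return dict(SCENARIO_PRESETS[scenario])
```

Centralizing presets this way keeps parameter choices auditable instead of scattered across call sites.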

Common API Call Patterns

Python (Azure OpenAI SDK) – minimal call:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_key="<key>",  # or use DefaultAzureCredential
    api_version="2025-03-01-preview",
)

response = client.chat.completions.create(
    model="<deployment-name>",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=800,
)
print(response.choices[0].message.content)

Key API versions (Azure OpenAI):

| API Version | Status | Notes |
|---|---|---|
| 2025-03-01-preview | Latest preview | Newest features |
| 2024-12-01-preview | Preview | Structured outputs, reasoning |
| 2024-10-21 | GA (stable) | Recommended for production |
| 2024-06-01 | GA | Broadly supported |

Card 2: Token Quick Reference

What Is a Token?

| Fact | Value |
|---|---|
| Average token length (English) | ~4 characters |
| Tokens per word (English avg.) | ~1.33 tokens per word (~0.75 words per token) |
| Tokens per word (code) | ~2–3 tokens per word (symbols split aggressively) |
| Tokens per character (non-Latin scripts) | ~2–4 tokens per character for CJK languages |

Token Estimation Formulas

English text:   tokens ≈ word_count × 1.33
Code:           tokens ≈ character_count ÷ 3
Mixed content:  tokens ≈ character_count ÷ 4
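These heuristics can be wrapped in a quick estimator for budgeting when a real tokenizer (such as tiktoken) isn't at hand. A rough sketch of the formulas above; the results are approximate by design:

```python
def estimate_tokens(text: str, kind: str = "english") -> int:
    """Rough token estimate using simple heuristics (no tokenizer required).

    kind: "english" (words x 1.33), "code" (chars / 3), anything else = mixed (chars / 4).
    """
    if kind == "english":
        return round(len(text.split()) * 1.33)
    if kind == "code":
        return round(len(text) / 3)
    return round(len(text) / 4)  # mixed content
```

For billing-accurate counts, use the model's actual tokenizer; these formulas are for quick capacity planning only.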

Common Text Lengths in Tokens

| Content Type | Approximate Tokens |
|---|---|
| A short email (3–4 sentences) | ~100–200 |
| One A4 page of text | ~600–800 |
| A long blog post (2,000 words) | ~2,700 |
| A technical whitepaper (10 pages) | ~7,000–9,000 |
| A full novel (80,000 words) | ~107,000 |
| 1 hour of transcribed speech | ~8,000–10,000 |
| A typical Slack conversation (50 messages) | ~2,000–3,000 |
| JSON payload (1 KB) | ~300–400 |
| A complete React component file | ~500–1,500 |

Context Windows by Model (March 2026)

| Model | Provider | Context Window | Max Output Tokens |
|---|---|---|---|
| GPT-4.1 | Azure OpenAI | 1,047,576 (1M) | 32,768 |
| GPT-4.1 mini | Azure OpenAI | 1,047,576 (1M) | 32,768 |
| GPT-4.1 nano | Azure OpenAI | 1,047,576 (1M) | 32,768 |
| GPT-4o | Azure OpenAI | 128,000 | 16,384 |
| GPT-4o mini | Azure OpenAI | 128,000 | 16,384 |
| o3 | Azure OpenAI | 200,000 | 100,000 |
| o4-mini | Azure OpenAI | 200,000 | 100,000 |
| o3-mini | Azure OpenAI | 200,000 | 100,000 |
| o1 | Azure OpenAI | 200,000 | 100,000 |
| Claude Opus 4 | Anthropic | 200,000 | 32,000 |
| Claude Sonnet 4 | Anthropic | 200,000 | 64,000 |
| Gemini 2.5 Pro | Google | 1,048,576 (1M) | 65,536 |
| Gemini 2.5 Flash | Google | 1,048,576 (1M) | 65,536 |
| Llama 4 Maverick | Meta (via Azure) | 1,048,576 (1M) | 32,768 |
| DeepSeek-R1 | DeepSeek (via Azure) | 128,000 | 16,384 |
| Mistral Large | Mistral (via Azure) | 128,000 | 8,192 |
| Phi-4 | Microsoft | 16,384 | 4,096 |
| Phi-4-mini | Microsoft | 128,000 | 4,096 |

Azure OpenAI Pricing (Pay-As-You-Go, per 1M Tokens)

Prices reflect Global Standard deployment where available. Check the Azure OpenAI pricing page for the latest values.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1 mini | $0.40 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.40 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| o3 | $10.00 | $40.00 |
| o4-mini | $1.10 | $4.40 |
| o3-mini | $1.10 | $4.40 |
| text-embedding-3-large | $0.13 | – |
| text-embedding-3-small | $0.02 | – |
| DALL-E 3 (Standard) | $0.040 / image | – |
| DALL-E 3 (HD) | $0.080 / image | – |
| Whisper | $0.36 / audio hour | – |

Global Batch Pricing (50% Discount)

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4.1 | $1.00 | $4.00 |
| GPT-4.1 mini | $0.20 | $0.80 |
| GPT-4o | $1.25 | $5.00 |
| GPT-4o mini | $0.075 | $0.30 |

Cost rule of thumb: for a typical chatbot conversation (~1,500 input + 500 output tokens), GPT-4.1 nano costs ~$0.00035 per turn and GPT-4o ~$0.0088 per turn. That is roughly a 25x difference.
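The arithmetic behind this rule of thumb can be sketched as a small helper, with the per-1M-token prices taken from the PAYG table above:

```python
def cost_per_turn(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

nano = cost_per_turn(1500, 500, 0.10, 0.40)    # GPT-4.1 nano: ~$0.00035
gpt4o = cost_per_turn(1500, 500, 2.50, 10.00)  # GPT-4o: ~$0.00875
```

Multiplying the per-turn cost by expected daily traffic gives a quick monthly budget estimate before any load testing.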


Card 3: Model Selection Decision Tree

Use this table to pick the right model for your workload. Start from the need.

| Need | Recommended Model | Why | Fallback |
|---|---|---|---|
| Simple classification / routing | GPT-4.1 nano | Cheapest, fastest, sufficient for binary/multi-class | GPT-4o mini |
| Structured data extraction | GPT-4.1 mini | Great JSON mode, cost efficient | GPT-4.1 |
| General-purpose chatbot | GPT-4o | Strong general ability, broad knowledge | GPT-4.1 |
| Complex multi-step reasoning | o3 | Deep chain-of-thought, highest reasoning accuracy | o4-mini |
| Reasoning on a budget | o4-mini | 80% of o3 capability at ~10% cost | o3-mini |
| Code generation & review | GPT-4.1 | Optimized for code, instruction following | o4-mini |
| Long document analysis (>100K) | GPT-4.1 | 1M context window, strong recall | Gemini 2.5 Pro |
| Vision / image understanding | GPT-4o | Native multimodal, strong vision | GPT-4.1 (vision) |
| Embeddings | text-embedding-3-large | Best quality Azure embedding | text-embedding-3-small |
| On-device / edge | Phi-4-mini | Small footprint, strong for size | Phi-4 |
| Open-source self-hosted | Llama 4 Maverick | Strong open model, permissive license | DeepSeek-R1 |
| Batch processing (non-real-time) | GPT-4o (Global Batch) | 50% price discount for async | GPT-4.1 mini (Batch) |
| Audio transcription | Whisper | Purpose-built speech-to-text | Azure AI Speech |
| Text-to-speech | Azure AI Speech / GPT-4o Audio | High quality neural voices | – |
| Image generation | DALL-E 3 / GPT Image Gen | Native Azure integration | – |

Decision Flowchart (Text)

START
│
├─ Need reasoning/math/logic?
│    ├─ Budget sensitive? → o4-mini
│    └─ Maximum accuracy? → o3
│
├─ Need code generation?
│    └─ → GPT-4.1
│
├─ Need vision/images?
│    └─ → GPT-4o
│
├─ Simple task (classify, extract, route)?
│    ├─ High volume? → GPT-4.1 nano
│    └─ Moderate quality needed? → GPT-4.1 mini
│
├─ Long context (>128K)?
│    └─ → GPT-4.1 (1M context)
│
└─ General conversational?
     └─ → GPT-4o or GPT-4.1
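The flowchart above can be expressed as a first-match-wins function. An illustrative sketch only; the flag names are assumptions, and real model selection usually also weighs quota, latency targets, and cost ceilings:

```python
def pick_model(needs_reasoning: bool = False, budget_sensitive: bool = False,
               needs_code: bool = False, needs_vision: bool = False,
               simple_task: bool = False, high_volume: bool = False,
               long_context: bool = False) -> str:
    """Walk the decision flowchart top to bottom; first matching branch wins."""
    if needs_reasoning:
        return "o4-mini" if budget_sensitive else "o3"
    if needs_code:
        return "GPT-4.1"
    if needs_vision:
        return "GPT-4o"
    if simple_task:
        return "GPT-4.1 nano" if high_volume else "GPT-4.1 mini"
    if long_context:
        return "GPT-4.1"
    return "GPT-4o"  # general conversational default
```

Encoding the tree as code makes routing decisions testable and easy to update when a new model tier ships.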

Model Capabilities Matrix

| Capability | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o | GPT-4o mini | o3 | o4-mini |
|---|---|---|---|---|---|---|---|
| Text generation | Excellent | Very Good | Good | Excellent | Good | Excellent | Very Good |
| Code generation | Excellent | Good | Fair | Very Good | Good | Excellent | Very Good |
| Reasoning / math | Very Good | Good | Fair | Good | Fair | Excellent | Very Good |
| Vision / images | Yes | Yes | No | Yes | Yes | No | No |
| Structured output | Excellent | Excellent | Very Good | Excellent | Very Good | Good | Good |
| Instruction following | Excellent | Very Good | Good | Very Good | Good | Very Good | Good |
| Long context (>100K) | Excellent (1M) | Excellent (1M) | Good (1M) | Good (128K) | Good (128K) | Good (200K) | Good (200K) |
| Multilingual | Very Good | Good | Fair | Very Good | Good | Good | Good |
| Speed (tokens/sec) | Fast | Very Fast | Fastest | Fast | Very Fast | Slower (thinks) | Moderate |
| Function calling | Excellent | Very Good | Good | Excellent | Good | Good | Good |

Card 4: RAG Architecture Cheat Sheet

Chunking Strategies Comparison

| Strategy | Chunk Size | Overlap | Best For | Drawbacks |
|---|---|---|---|---|
| Fixed-size | 512–1024 tokens | 10–20% | Simple docs, uniform structure | Breaks mid-sentence |
| Sentence-based | 3–5 sentences | 1 sentence | Articles, natural prose | Inconsistent chunk sizes |
| Paragraph-based | 1 paragraph | None or 1 sentence | Well-structured docs | Large variance in size |
| Recursive character | 512–1024 tokens | 10–20% | General-purpose (LangChain default) | May split semantic units |
| Semantic chunking | Variable | Embedding-based boundaries | Research papers, mixed content | Slower, requires embeddings |
| Markdown/HTML-aware | By heading | None | Technical docs, wikis | Requires structured source |
| Sliding window | 256–512 tokens | 50% | Dense retrieval, high recall | 2x storage, more chunks |
| Document-level | Entire doc | N/A | Short docs (< 1 page) | Poor for long documents |

Rule of thumb: start with 512 tokens, 10% overlap, recursive character splitting. Optimize from there.
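That 512-token / 10%-overlap starting point can be sketched as a minimal fixed-size chunker. This operates on an already-tokenized list; plugging in a real tokenizer (and a smarter boundary heuristic) is left out for brevity:

```python
def chunk_tokens(tokens: list, size: int = 512, overlap_ratio: float = 0.10) -> list:
    """Split a token list into fixed-size chunks with proportional overlap.

    Each chunk starts `size * (1 - overlap_ratio)` tokens after the previous
    one, so consecutive chunks share roughly `overlap_ratio` of their content.
    """
    step = max(1, int(size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks
```

With the defaults, consecutive chunks overlap by 52 tokens (~10% of 512), which helps retrieval recover facts that would otherwise be cut at a chunk boundary.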

Document Pre-Processing Pipeline

Source Documents
│
├─ PDF → Extract text (PyMuPDF, Azure Document Intelligence)
├─ Word/PPTX → Extract text (python-docx, python-pptx)
├─ HTML → Strip tags, keep structure (BeautifulSoup)
├─ Markdown → Parse headings as section boundaries
└─ Scanned images → OCR (Azure Document Intelligence)
│
▼
Clean & Normalize
│  Remove headers/footers, fix encoding, normalize whitespace
▼
Chunk
│  Apply chunking strategy (see table above)
▼
Enrich (optional)
│  Add metadata: title, source, page, section, date
│  Generate summaries or hypothetical questions per chunk
▼
Embed
│  Generate vector embeddings for each chunk
▼
Index
│  Upload to Azure AI Search (or other vector store)
│  Configure vector fields, filterable metadata, semantic config
▼
Ready for Retrieval

Embedding Models Comparison

| Model | Dimensions | Max Tokens | Relative Quality | Cost (per 1M tokens) | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large | 3,072 (configurable) | 8,191 | Highest | $0.13 | Supports dimension reduction via dimensions param |
| text-embedding-3-small | 1,536 (configurable) | 8,191 | High | $0.02 | Best price/quality ratio |
| text-embedding-ada-002 | 1,536 | 8,191 | Good | $0.10 | Legacy; migrate to v3 |
| Cohere Embed v3 | 1,024 | 512 | High | Varies | Multi-language strength |
| E5-large-v2 | 1,024 | 512 | Good | Self-hosted | Open-source, no API cost |
| BGE-large-en-v1.5 | 1,024 | 512 | Good | Self-hosted | Open-source, MTEB top-tier |
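Regardless of which embedding model you pick, vectors are compared the same way at query time: cosine similarity between the query vector and each chunk vector. A dependency-free sketch:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Production vector stores compute this (or an approximation via HNSW) internally; a manual implementation is mainly useful for debugging and offline evaluation.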

Retrieval Strategy Comparison

| Strategy | How It Works | Precision | Recall | Latency | When to Use |
|---|---|---|---|---|---|
| Vector search | Embed query, find nearest neighbors | Medium-High | High | Low | Default starting point |
| Full-text / keyword (BM25) | Term frequency matching | High | Medium | Very Low | Exact term matching, codes, IDs |
| Hybrid (vector + keyword) | Combines both, fused ranking (RRF) | High | High | Low-Medium | Recommended default for production |
| Semantic ranker (L2 rerank) | Cross-encoder reranks top-N results | Very High | Depends on Stage 1 | Medium | When precision matters most |
| Multi-query | LLM rewrites query N ways, merges results | High | Very High | Higher | Ambiguous or complex queries |
| HyDE | LLM generates hypothetical doc, then searches | High | High | Higher | When queries differ from document style |
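The "fused ranking (RRF)" used by hybrid search is Reciprocal Rank Fusion: each result list contributes a score of 1/(k + rank) per document, and documents are re-sorted by the summed score. A sketch using the conventional k = 60:

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse multiple ranked lists of doc IDs via Reciprocal Rank Fusion.

    Documents that rank well in several lists (e.g., both the vector and
    BM25 results) accumulate the highest combined scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it fuses lists whose raw scores live on incompatible scales (cosine similarity vs. BM25) without any normalization step.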

Azure AI Search Tiers

| Tier | Price (approx/month) | Storage | Indexes | Replicas | Partitions | Semantic Ranker | Vector Search |
|---|---|---|---|---|---|---|---|
| Free | $0 | 50 MB | 3 | 1 | 1 | No | Yes (limited) |
| Basic | ~$75 | 2 GB | 15 | 3 | 1 | Yes | Yes |
| Standard S1 | ~$250 | 25 GB per partition | 50 | 12 | 12 | Yes | Yes |
| Standard S2 | ~$1,000 | 100 GB per partition | 200 | 12 | 12 | Yes | Yes |
| Standard S3 | ~$2,000 | 200 GB per partition | 200 | 12 | 12 | Yes | Yes |
| Storage Optimized L1 | ~$2,500 | 1 TB per partition | 10 | 12 | 12 | Yes | Yes |
| Storage Optimized L2 | ~$5,000 | 2 TB per partition | 10 | 12 | 12 | Yes | Yes |

RAG Evaluation Metrics

| Metric | What It Measures | Target | How to Calculate |
|---|---|---|---|
| Groundedness | Are answers supported by retrieved context? | > 4.0 / 5.0 | LLM-as-judge or NLI model |
| Relevance | Is the answer relevant to the question? | > 4.0 / 5.0 | LLM-as-judge |
| Coherence | Is the answer well-structured and logical? | > 4.0 / 5.0 | LLM-as-judge |
| Fluency | Is the language natural and grammatical? | > 4.0 / 5.0 | LLM-as-judge |
| Retrieval Precision | Are retrieved chunks relevant? | > 0.7 | Manual label or LLM-judge top-K |
| Retrieval Recall | Are all relevant chunks retrieved? | > 0.8 | Requires ground truth annotations |
| NDCG@K | Quality of ranking in top K results | > 0.7 | Standard IR formula |
| Answer Similarity | Closeness to ground truth answer | > 0.8 | Cosine similarity of embeddings |
| Faithfulness | No hallucinated facts beyond context | > 0.9 | Claim-level verification |
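Of these metrics, NDCG@K is the one with a closed-form "standard IR formula": discounted cumulative gain over the returned ranking, normalized by the ideal ranking. A sketch using the common log2 position discount:

```python
import math

def dcg(relevances: list) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances: list, k: int) -> float:
    """NDCG@K for graded relevance labels in retrieved order (0 = irrelevant)."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; penalties grow the further a relevant chunk sits below an irrelevant one.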

Card 5: Azure AI Foundry Deployment Types

| Property | Standard | Global Standard | Data Zone Standard | Provisioned (PTU) | Global Batch |
|---|---|---|---|---|---|
| Pricing model | Pay-per-token | Pay-per-token | Pay-per-token | Reserved throughput (PTU/hr) | Pay-per-token (50% discount) |
| Latency | Low | Low (optimized routing) | Low | Lowest (guaranteed) | High (async, up to 24h) |
| Data residency | Single region | Traffic routed globally | Within data zone (US/EU) | Single region | Traffic routed globally |
| Data processing | In-region | May process in any region | US or EU zone | In-region | May process in any region |
| Rate limits | Per-deployment TPM | Higher TPM quotas | Per-deployment TPM | Determined by PTU count | Very high (batch queue) |
| SLA | 99.9% | 99.9% | 99.9% | 99.9% | Best-effort (24h target) |
| Min commitment | None | None | None | 1-month or 1-hour reservation | None |
| Best for | Dev/test, moderate prod | High-scale prod, cost optimization | EU/US data residency requirements | Predictable high-throughput prod | Bulk scoring, evaluations, embeddings |
| Supported models | Most models | GPT-4o, GPT-4.1, o-series | GPT-4o, GPT-4.1, o-series | GPT-4o, GPT-4.1, o-series | GPT-4o, GPT-4.1 |

PTU Sizing Quick Reference

| Model | Approx. Tokens per Minute per PTU | Typical PTU for 100 chat users |
|---|---|---|
| GPT-4o | ~2,500 TPM | 50–80 PTU |
| GPT-4.1 | ~2,500 TPM | 50–80 PTU |
| GPT-4o mini | ~7,500 TPM | 15–25 PTU |

Break-even rule of thumb: if your monthly PAYG bill exceeds ~$5,000 for a single deployment, evaluate PTU pricing. PTU becomes cost-effective at sustained utilization above 60–70%.

Default Quota Limits (Tokens Per Minute)

Default quotas apply per subscription per region and can be increased via a support request.

| Model | Default TPM (Standard) | Default TPM (Global Standard) | Max RPM |
|---|---|---|---|
| GPT-4.1 | 450K | 2M | 2,700 |
| GPT-4.1 mini | 2M | 10M | 12,000 |
| GPT-4.1 nano | 2M | 10M | 12,000 |
| GPT-4o | 450K | 2M | 2,700 |
| GPT-4o mini | 2M | 10M | 12,000 |
| o3 | 100K | 500K | 600 |
| o4-mini | 450K | 2M | 2,700 |

Quota tip: use Global Standard deployments for higher default TPM limits. Request quota increases via Azure Portal > Azure OpenAI > Quotas.


Card 6: Prompt Engineering Patterns

| Pattern | When to Use | Template | Expected Improvement |
|---|---|---|---|
| Zero-shot | Simple, well-defined tasks | Classify this text as positive or negative: {text} | Baseline |
| Few-shot | When examples clarify the expected format/logic | Here are examples:\nInput: X → Output: Y\nInput: A → Output: B\nNow: Input: {text} → Output: | +10–25% accuracy |
| Chain-of-Thought (CoT) | Multi-step reasoning, math, logic | Solve step by step:\n{problem}\nLet's think through this: | +15–40% on reasoning tasks |
| Zero-shot CoT | Quick reasoning boost, no examples needed | {question}\nLet's think step by step. | +10–20% on reasoning tasks |
| ReAct | Tasks requiring external tools/actions | Think: {reasoning}\nAction: {tool_call}\nObservation: {result}\nThink: ... | Enables tool use reliably |
| Role / System Prompt | Setting persona, behavior constraints | You are a {role}. You always {constraint}. You never {restriction}. | Consistent tone and behavior |
| Self-consistency | High-stakes reasoning (run N times, majority vote) | Run CoT N times → pick most common answer | +5–15% on reasoning |
| Tree-of-Thought | Complex problem solving with branching paths | Generate multiple approaches → evaluate each → select best | +20–30% on complex planning |
| Structured Output | When you need predictable JSON/XML | Respond in JSON matching this schema: {schema} + response_format: json_schema | Near 100% format compliance |
| Decomposition | Break a hard task into subtasks | First: {subtask1}\nThen: {subtask2}\nFinally: {subtask3} | Reduces errors on complex tasks |
| Meta-prompting | When you want the LLM to write its own prompt | Write the optimal prompt for: {task_description} | Variable; good for prompt iteration |
| Retrieval-augmented | When current/private knowledge is needed | Context:\n{retrieved_docs}\n\nUsing ONLY the context above, answer: {question} | Reduces hallucination dramatically |

Prompt Structure Best Practice

[SYSTEM]
You are {role}. {behavioral constraints}. {output format}.

[USER]
## Context
{background information or retrieved documents}

## Task
{clear, specific instruction}

## Constraints
- {constraint 1}
- {constraint 2}

## Output Format
{expected structure}

## Examples (if few-shot)
Input: ... β†’ Output: ...
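The skeleton above can be assembled programmatically so every call uses the same structure. An illustrative helper; the function name and signature are assumptions, not part of any SDK:

```python
def build_user_prompt(context: str, task: str, constraints: list,
                      output_format: str, examples: list = None) -> str:
    """Assemble a user message following the Context/Task/Constraints/
    Output Format/Examples skeleton shown above."""
    parts = [
        f"## Context\n{context}",
        f"## Task\n{task}",
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        f"## Output Format\n{output_format}",
    ]
    if examples:  # optional few-shot section: (input, output) pairs
        parts.append("## Examples\n" + "\n".join(
            f"Input: {i} -> Output: {o}" for i, o in examples))
    return "\n\n".join(parts)
```

Building prompts from a single template function makes them easy to version, diff, and unit-test alongside application code.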

Common Prompt Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Vague instructions | "Do something with this data" → unpredictable output | Be specific: "Extract all dates and amounts from this invoice" |
| Conflicting constraints | "Be brief but include all details" → model oscillates | Prioritize: "Summarize in 3 bullet points. Include dollar amounts." |
| No output format | Response structure varies per call | Specify format: "Respond as JSON with keys: name, date, amount" |
| Prompt injection vulnerability | User input not delimited → hijack risk | Wrap user input in clear delimiters: """User message: {input}""" |
| Token waste in system prompt | 2,000-token system prompt on a classification task | Keep system prompts proportional to task complexity |
| Examples that contradict rules | Few-shot examples violate stated constraints | Audit examples against constraints before deploying |
| Asking multi-part questions | "Is this positive sentiment and extract the entities" → lower accuracy | Split into separate calls or use clear sub-sections |

Card 7: Agent Framework Comparison

| Feature | Azure AI Agent Service | AutoGen | Semantic Kernel | Copilot Studio |
|---|---|---|---|---|
| Type | Managed cloud service | Open-source framework | Open-source SDK | Low-code platform |
| Languages | Python, C#, JavaScript (REST) | Python, .NET | Python, C#, Java | No-code / low-code |
| Where it runs | Azure (fully managed) | Self-hosted (any infra) | Self-hosted (any infra) | Microsoft Cloud (managed) |
| Tool / function calling | Built-in (code interpreter, file search, Azure Functions, API) | Custom tool definitions | Plugin architecture (native + OpenAPI) | Connectors, Power Automate flows |
| Multi-agent | Orchestrated via threads | First-class multi-agent conversations | Experimental multi-agent | Single-agent (can call sub-flows) |
| Memory / state | Managed threads with file/vector store | Configurable memory backends | Chat history + plugin state | Conversation context (managed) |
| Knowledge / RAG | Built-in file search (vector store) | Custom RAG integration | Built-in text search plugin | Built-in knowledge sources (Dataverse, SharePoint, websites) |
| Enterprise security | Azure RBAC, managed identity, VNET | Bring your own | Bring your own | Microsoft Entra ID, DLP, environments |
| Observability | Azure Monitor, Application Insights | Custom logging, AutoGen Studio | Custom logging | Built-in analytics dashboard |
| Best for | Production AI agents on Azure | Research, complex multi-agent workflows | Integrating AI into existing apps | Business users, citizen developers, rapid prototyping |
| Learning curve | Medium | Medium-High | Medium | Low |
| Cost model | Pay-per-use (Azure resources) | Infrastructure only | Infrastructure only | Per-user licensing |

When to Use Which

| Scenario | Recommended |
|---|---|
| Enterprise chatbot with managed infra | Azure AI Agent Service |
| Multi-agent research or simulation | AutoGen |
| Adding AI to existing .NET/Java/Python app | Semantic Kernel |
| Business process automation by non-developers | Copilot Studio |
| Quick prototype with tool calling | Azure AI Agent Service |
| Full control over agent behavior and routing | AutoGen |

Card 8: AI Infrastructure Sizing

GPU VRAM Requirements by Model Size

| Model Parameters | FP16 VRAM | INT8 VRAM | INT4 (GPTQ/AWQ) VRAM | Example Models |
|---|---|---|---|---|
| 1–3B | 4–6 GB | 2–3 GB | 1–2 GB | Phi-4-mini, Gemma-3 1B |
| 7–8B | 14–16 GB | 7–8 GB | 4–5 GB | Llama 3.1 8B, Mistral 7B |
| 13–14B | 26–28 GB | 13–14 GB | 7–8 GB | Llama 3.1 13B (hypothetical), CodeLlama 13B |
| 34B | 68 GB | 34 GB | 18–20 GB | CodeLlama 34B |
| 70B | 140 GB | 70 GB | 36–40 GB | Llama 3.1 70B |
| 405B | 810 GB | 405 GB | ~200 GB | Llama 3.1 405B |
| MoE (e.g., Mixtral 8x22B) | ~280 GB | ~140 GB | ~72 GB | Mixtral 8x22B |

Formula: VRAM (GB) ≈ Parameters (B) × bytes per parameter. FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes. Add ~20% overhead for KV cache and runtime.
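The formula as code, handy for quick what-if checks (the 20% overhead factor is the rule of thumb above, not a measured constant, and real KV cache growth depends on context length):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 0.20) -> float:
    """Estimate inference VRAM: weights plus ~20% for KV cache and runtime.

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4.
    """
    return params_billion * bytes_per_param * (1 + overhead)
```

For example, a 70B model at FP16 lands at ~168 GB with overhead, which is why the table pairs it with multi-GPU A100 configurations.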

Azure GPU VM Comparison

| VM Series | GPU | GPU Count | GPU VRAM (total) | vCPUs | RAM (GB) | Approx. Price/hr | Best For |
|---|---|---|---|---|---|---|---|
| NC4as T4 v3 | T4 | 1 | 16 GB | 4 | 28 | ~$0.53 | Dev/test, small model inference |
| NC24ads A100 v4 | A100 80GB | 1 | 80 GB | 24 | 220 | ~$3.67 | Single-GPU training, 70B inference (quantized) |
| NC48ads A100 v4 | A100 80GB | 2 | 160 GB | 48 | 440 | ~$7.35 | 70B FP16 inference, medium training |
| NC96ads A100 v4 | A100 80GB | 4 | 320 GB | 96 | 880 | ~$14.69 | Large model training, 405B quantized inference |
| ND96asr v4 | A100 80GB | 8 | 640 GB | 96 | 900 | ~$27.20 | Distributed training, multi-GPU inference |
| ND96isr H100 v5 | H100 80GB | 8 | 640 GB | 96 | 900 | ~$36.00 | Cutting-edge training, fastest inference |
| NVadsA10 v5 | A10 | 1 | 24 GB | 6–72 | 55–880 | ~$0.45–$5.00 | Graphics + inference hybrid |

PTU vs. PAYG Break-Even Reference

Monthly PAYG cost      = (input_tokens × input_price) + (output_tokens × output_price)
Monthly PTU cost       = PTU_count × PTU_hourly_rate × 730 hours
Break-even utilization ≈ 60–70% sustained

Quick check:
If monthly PAYG spend > $5,000/deployment  → evaluate PTU
If monthly PAYG spend > $15,000/deployment → PTU almost certainly cheaper
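The quick check above can be sketched as a small comparator. The thresholds mirror the rule of thumb; the PTU hourly rate is whatever your agreement specifies, so the functions take it as an input:

```python
HOURS_PER_MONTH = 730

def ptu_monthly_cost(ptu_count: int, ptu_hourly_rate: float) -> float:
    """Reserved-capacity cost: PTUs x hourly rate x hours in a month."""
    return ptu_count * ptu_hourly_rate * HOURS_PER_MONTH

def ptu_recommendation(monthly_payg_spend: float) -> str:
    """Apply the break-even quick check to a monthly PAYG figure."""
    if monthly_payg_spend > 15_000:
        return "PTU almost certainly cheaper"
    if monthly_payg_spend > 5_000:
        return "evaluate PTU"
    return "stay on PAYG"
```

The final decision should still compare the actual PTU quote against measured PAYG spend at your sustained utilization, not just the thresholds.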

Azure AI Search Sizing Recommendations

| Workload Profile | Documents | Vectors | Recommended Tier | Replicas | Partitions |
|---|---|---|---|---|---|
| Prototype | < 10K | < 1M | Free or Basic | 1 | 1 |
| Small production | 10K–100K | 1M–10M | Basic or S1 | 2 | 1 |
| Medium production | 100K–1M | 10M–50M | S1 or S2 | 3 | 2–3 |
| Large production | 1M–10M | 50M–500M | S2 or S3 | 3–6 | 3–6 |
| Enterprise / big data | > 10M | > 500M | L1 or L2 | 6–12 | 6–12 |

High-availability rule: always use >= 2 replicas for the production read SLA (99.9%), and >= 3 replicas for the 99.9% read+write SLA.

Monthly Cost Estimation Formulas

Azure OpenAI (PAYG):

Monthly cost = (avg_input_tokens_per_request × requests_per_day × 30 × input_price_per_token)
             + (avg_output_tokens_per_request × requests_per_day × 30 × output_price_per_token)

Example: GPT-4.1 mini, 10K requests/day, 1,500 input + 500 output tokens each:
Input:  1,500 × 10,000 × 30 × ($0.40 / 1,000,000) = $180/month
Output:   500 × 10,000 × 30 × ($1.60 / 1,000,000) = $240/month
Total:  $420/month
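The same PAYG formula as a reusable function; running it with the GPT-4.1 mini example inputs reproduces the $420/month figure:

```python
def monthly_payg_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float,
                      days: int = 30) -> float:
    """Monthly Azure OpenAI PAYG estimate, with prices quoted per 1M tokens."""
    monthly_input = input_tokens * requests_per_day * days
    monthly_output = output_tokens * requests_per_day * days
    return (monthly_input * input_price_per_m
            + monthly_output * output_price_per_m) / 1_000_000
```

Swapping in the prices from the pricing card lets you compare models for the same traffic profile in one line.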

Azure AI Search:

Monthly cost = service_tier_base_price × partitions × replicas
             + semantic_ranker_queries × $0.01 per 1,000 queries (if S1+)

Example: S1 with 2 replicas, 1 partition:
$250 × 1 × 2 = $500/month (+ semantic ranker usage)

Embedding indexing (one-time):

Embedding cost = total_chunks × avg_tokens_per_chunk × price_per_token

Example: 100K chunks, 400 tokens avg, text-embedding-3-small:
100,000 × 400 × ($0.02 / 1,000,000) = $0.80 total

Card 9: Responsible AI Checklist

Pre-Deployment Checklist

| # | Category | Check | Status |
|---|---|---|---|
| 1 | Purpose | Documented intended use case and users | ☐ |
| 2 | Purpose | Identified out-of-scope uses | ☐ |
| 3 | Fairness | Tested across demographic groups | ☐ |
| 4 | Fairness | Checked for disparate performance or bias | ☐ |
| 5 | Reliability | Evaluated on diverse test set (> 200 samples) | ☐ |
| 6 | Reliability | Measured hallucination / groundedness rate | ☐ |
| 7 | Reliability | Conducted red-team / adversarial testing | ☐ |
| 8 | Safety | Azure AI Content Safety filters configured | ☐ |
| 9 | Safety | Jailbreak resistance tested | ☐ |
| 10 | Privacy | No PII in training data / prompts without consent | ☐ |
| 11 | Privacy | Data handling complies with GDPR/regional laws | ☐ |
| 12 | Transparency | Users informed they're interacting with AI | ☐ |
| 13 | Transparency | AI-generated content is labeled | ☐ |
| 14 | Transparency | System card / documentation written | ☐ |
| 15 | Accountability | Human escalation path exists | ☐ |
| 16 | Accountability | Monitoring and logging enabled | ☐ |
| 17 | Accountability | Incident response plan documented | ☐ |
| 18 | Security | API keys in Azure Key Vault, not in code | ☐ |
| 19 | Security | Managed Identity used for service-to-service auth | ☐ |
| 20 | Security | Network isolation (VNET/Private Endpoints) configured | ☐ |

Azure AI Content Safety – Filter Categories

| Category | What It Detects | Severity Levels | Default Setting |
|---|---|---|---|
| Hate & Fairness | Hate speech, discrimination, slurs | Low, Medium, High | Block Medium + High |
| Sexual | Sexually explicit content | Low, Medium, High | Block Medium + High |
| Violence | Violent content, graphic descriptions | Low, Medium, High | Block Medium + High |
| Self-Harm | Self-harm instructions or glorification | Low, Medium, High | Block Medium + High |
| Jailbreak (Prompt Shield) | Prompt injection, jailbreak attempts | Detected / Not Detected | Enabled |
| Protected Material | Copyrighted text, code licenses | Detected / Not Detected | Enabled |
| Groundedness Detection | Hallucinated or ungrounded claims | Grounded / Ungrounded | Available (opt-in) |

Required Evaluations Before Production

| Evaluation | Minimum Standard | Tool |
|---|---|---|
| Groundedness | > 4.0 / 5.0 on test set | Azure AI Evaluation SDK |
| Relevance | > 4.0 / 5.0 on test set | Azure AI Evaluation SDK |
| Red-team testing | No critical jailbreaks pass | Microsoft PyRIT / manual |
| Latency P95 | < application SLA (e.g., 5s) | Application Insights |
| Toxicity rate | < 0.1% of responses | Content Safety API |
| Bias audit | No statistical disparity > 5% across groups | Fairlearn / manual |

Security Checklist for AI Workloads

| Layer | Requirement | Azure Service |
|---|---|---|
| Identity | Managed Identity for all AI services | Entra ID / Managed Identity |
| Network | Private Endpoints for AI services | Azure Private Link |
| Secrets | API keys in Key Vault | Azure Key Vault |
| Data | Encryption at rest and in transit | Azure default (AES-256, TLS 1.2+) |
| Access control | RBAC on AI resource operations | Azure RBAC (Cognitive Services User, etc.) |
| Logging | Diagnostic logs to Log Analytics | Azure Monitor |
| Compliance | AI services in compliant region | Azure compliance documentation |
| Content | Content filters enabled on all deployments | Azure AI Content Safety |

Card 10: Microsoft Copilot Ecosystem Map

| Copilot | What It Does | Audience | Licensing / Cost | Key Feature |
|---|---|---|---|---|
| Microsoft 365 Copilot | AI assist in Word, Excel, PowerPoint, Outlook, Teams | Enterprise knowledge workers | $30/user/month add-on | Grounded in Microsoft Graph (your emails, docs, meetings) |
| Copilot in Windows | OS-level assistant for PC tasks, web search, file finding | All Windows users | Free (basic) / Copilot+ PC features | Deep OS integration, local model on Copilot+ PCs |
| GitHub Copilot | AI code completion, chat, code review, agents in IDE | Developers | $10–$39/user/month | Multi-file editing, agent mode, workspace context |
| Copilot Studio | Build custom copilots / chatbots with low-code tools | Citizen devs, IT admins | Included in some M365 plans / per-message | Generative AI + topics + connectors |
| Copilot for Azure | AI assistant for Azure portal (diagnose, troubleshoot, create) | Azure admins & engineers | Free (in preview for most) | Natural language to Azure CLI/ARM, resource diagnostics |
| Copilot for Security | Investigate threats, summarize incidents, reverse-eng malware | SecOps analysts | Standalone: $4/secured compute unit/hr | Grounded in Microsoft Threat Intelligence |
| Copilot in Power Platform | AI in Power Apps, Power Automate, Power BI | Makers, analysts | Included in Power Platform licenses | Natural language to app, flow, or DAX formula |
| Copilot in Dynamics 365 | AI assist across Sales, Service, Finance, Supply Chain | Business users | Included in Dynamics 365 licenses | Contextual to each Dynamics 365 module |
| Copilot in Fabric | AI for data engineering, data science, and analytics | Data professionals | Included with Fabric capacity | Natural language to SQL/KQL, auto-insights |
| Copilot for Sales | Summarize CRM data, draft emails, meeting prep | Salespeople | $50/user/month or with M365 Copilot | CRM integration (Dynamics 365 + Salesforce) |
| Copilot for Service | Summarize cases, draft replies, search knowledge bases | Support agents | $50/user/month or with M365 Copilot | Multi-source knowledge grounding |
| Copilot for Finance | Excel-heavy financial workflows, variance analysis | Finance teams | $30/user/month | Automated reconciliation, variance explanations |

Azure AI Services Quick Map

Beyond Copilots and OpenAI, Azure offers specialized AI services:

| Service | What It Does | Common Use Cases |
|---|---|---|
| Azure OpenAI Service | Host GPT, o-series, DALL-E, Whisper models | Chatbots, content generation, code assist |
| Azure AI Search | Vector + keyword + semantic search | RAG retrieval, enterprise search, e-commerce |
| Azure AI Document Intelligence | Extract text, tables, key-value pairs from documents | Invoice processing, form extraction, ID scanning |
| Azure AI Speech | Speech-to-text, text-to-speech, translation, speaker recognition | Call center analytics, accessibility, voice UX |
| Azure AI Vision | Image analysis, OCR, face detection, custom models | Product inspection, accessibility, content moderation |
| Azure AI Language | NER, sentiment analysis, summarization, PII detection | Text analytics, compliance, customer feedback |
| Azure AI Translator | Real-time text and document translation (100+ languages) | Multilingual apps, document localization |
| Azure AI Content Safety | Detect harmful content in text and images | Moderation pipelines, UGC platforms |
| Azure Machine Learning | Full ML lifecycle: train, deploy, manage models | Custom ML models, MLOps pipelines, AutoML |
| Azure AI Foundry | Unified AI development platform | End-to-end AI app development, evaluation, deployment |

Card 11: AI Acronym & Term Glossary

A comprehensive A–Z reference of terms an AI and infrastructure architect encounters daily.

| # | Term | Full Form / Definition |
|---|---|---|
| 1 | AGI | Artificial General Intelligence – hypothetical AI with human-level general reasoning across all domains. |
| 2 | AI Search | Azure AI Search – Microsoft's managed search service with vector, keyword, and semantic ranking capabilities. |
| 3 | BERT | Bidirectional Encoder Representations from Transformers – foundational encoder model for NLP tasks (classification, NER). |
| 4 | BM25 | Best Matching 25 – classic probabilistic ranking algorithm for keyword/full-text search. |
| 5 | CoT | Chain-of-Thought – prompting technique that asks the model to reason step-by-step before answering. |
| 6 | CUDA | Compute Unified Device Architecture – NVIDIA's parallel computing platform for GPU programming. |
| 7 | DAG | Directed Acyclic Graph – used in ML pipelines and agent orchestration to define task dependencies. |
| 8 | DPO | Direct Preference Optimization – alignment technique that fine-tunes LLMs using human preference pairs without a separate reward model. |
| 9 | Embedding | A dense vector representation of text (or images/audio) in a continuous vector space where semantic similarity maps to geometric proximity. |
| 10 | Fine-tuning | Continued training of a pre-trained model on a domain-specific dataset to improve performance on specialized tasks. |
| 11 | FP16 / BF16 | 16-bit floating point formats used in GPU training and inference to reduce memory while maintaining precision. |
| 12 | Function Calling | LLM capability to output structured JSON matching a tool/function schema, enabling the model to invoke external APIs. |
| 13 | GGUF | GPT-Generated Unified Format – file format for quantized models used by llama.cpp and other local inference engines. |
| 14 | GPT | Generative Pre-trained Transformer – autoregressive language model architecture. |
| 15 | Grounding | Connecting LLM responses to verified data sources (RAG, search results, databases) to reduce hallucination. |
| 16 | Guardrails | Safety mechanisms (content filters, input/output validation, topic restrictions) that constrain AI system behavior. |
| 17 | Hallucination | When an LLM generates plausible-sounding but factually incorrect or fabricated information. |
| 18 | HNSW | Hierarchical Navigable Small World – graph-based algorithm for approximate nearest-neighbor vector search. Used in Azure AI Search. |
| 19 | Inference | The process of running a trained model to generate predictions/outputs from new inputs. Contrast with training. |
| 20 | INT4 / INT8 | 4-bit / 8-bit integer quantization – reduces model size and VRAM usage at the cost of slight accuracy loss. |
| 21 | JSON Mode | Azure OpenAI feature that forces the model to return valid JSON. Use json_schema for strict schema adherence. |
| 22 | KV Cache | Key-Value Cache – stores attention key/value pairs from previous tokens to avoid recomputation during autoregressive generation. Dominates VRAM during long-context inference. |
| 23 | LoRA | Low-Rank Adaptation – parameter-efficient fine-tuning method that trains small rank-decomposition matrices instead of full weights. |
| 24 | LLM | Large Language Model – transformer-based models with billions of parameters trained on massive text corpora. |
| 25 | MCP | Model Context Protocol – open protocol for connecting LLMs to external data sources and tools via a standardized interface. |
| 26 | MoE | Mixture of Experts – architecture where only a subset of model parameters activate per token, improving efficiency. |
| 27 | NDCG | Normalized Discounted Cumulative Gain – ranking quality metric used in search evaluation. Range 0–1, higher is better. |
| 28 | NER | Named Entity Recognition – extracting structured entities (people, places, organizations) from text. |
| 29 | ONNX | Open Neural Network Exchange – open format for representing ML models, enabling cross-framework portability. |
| 30 | PEFT | Parameter-Efficient Fine-Tuning – umbrella term for methods (LoRA, QLoRA, adapters) that fine-tune a small subset of parameters. |
| 31 | PPO | Proximal Policy Optimization – reinforcement learning algorithm used in RLHF to align LLMs with human preferences. |
| 32 | Prompt Injection | Adversarial attack where malicious input in a prompt attempts to override the model's system instructions. |
| 33 | PTU | Provisioned Throughput Unit – Azure OpenAI's reserved capacity pricing model for guaranteed throughput. |
| 34 | QLoRA | Quantized LoRA – combines 4-bit quantization with LoRA for fine-tuning large models on consumer GPUs. |
| 35 | Quantization | Reducing model weight precision (e.g., FP16 → INT4) to shrink model size and speed up inference. |
| 36 | RAG | Retrieval-Augmented Generation – architecture that retrieves relevant documents and includes them in the LLM context before generation. |
37RBACRole-Based Access Control β€” security model where permissions are assigned to roles, used throughout Azure AI services.
38Reasoning ModelsLLMs that use internal chain-of-thought (thinking tokens) before answering. Azure examples: o3, o4-mini, o3-mini.
39RLHFReinforcement Learning from Human Feedback β€” alignment method using human preference rankings to train a reward model that guides LLM fine-tuning.
40RRFReciprocal Rank Fusion β€” algorithm for merging ranked results from multiple retrieval methods (used in hybrid search).
41Semantic KernelMicrosoft's open-source SDK for integrating AI models into applications. Supports plugins, planners, and memory.
42SFTSupervised Fine-Tuning β€” fine-tuning on labeled instruction-response pairs. First step before RLHF alignment.
43SLMSmall Language Model β€” models under ~4B parameters designed for efficiency and on-device deployment (e.g., Phi-4-mini).
44SoTAState of the Art β€” the current best-known performance on a benchmark or task.
45System PromptInstructions placed in the system message to define model behavior, persona, constraints, and output format.
46TemperatureGeneration parameter controlling output randomness. 0 = deterministic, higher = more random.
47TokenizerAlgorithm that splits text into tokens (subword units). Different models use different tokenizers (BPE, SentencePiece, etc.).
48Top-PNucleus sampling β€” limits token selection to the smallest set whose cumulative probability >= P.
49TPMTokens Per Minute β€” Azure OpenAI rate limit unit. Quota is allocated in TPM per deployment.
50TransformerNeural network architecture based on self-attention. Foundation of all modern LLMs.
51UpsamplingIncreasing resolution or representation quality. In AI data: generating synthetic examples to balance datasets.
52vLLMOpen-source high-throughput LLM serving engine. Uses PagedAttention for efficient KV cache management.
53Vector DatabaseDatabase optimized for storing and querying high-dimensional embedding vectors (Azure AI Search, Pinecone, Qdrant, etc.).
54Vector SearchFinding similar items by computing distance (cosine, dot product, L2) between embedding vectors.
55VNET IntegrationDeploying Azure AI services within a Virtual Network for network isolation and private connectivity.
56WASMWebAssembly β€” used in edge AI to run inference models in browsers or edge runtimes without native compilation.
57XAIExplainable AI β€” methods and tools for understanding and interpreting AI model decisions (SHAP, LIME, attention visualization).
58Zero-shotAsking a model to perform a task without any examples β€” relying on pre-trained knowledge alone.
59Few-shotProviding a small number of examples in the prompt to guide the model's output format and behavior.
60Agentic AIAI systems that can autonomously plan, use tools, and take multi-step actions to complete goals.
61AttentionCore mechanism in transformers that lets tokens attend to (weight) all other tokens in the sequence. Self-attention enables contextual understanding.
62BPEByte Pair Encoding β€” tokenization algorithm used by GPT models. Iteratively merges frequent character pairs into tokens.
63ChunkingSplitting documents into smaller segments for indexing and retrieval in RAG pipelines.
64Content SafetyAzure AI Content Safety β€” service for detecting harmful content (hate, violence, sexual, self-harm) in text and images.
65Cross-encoderModel that takes a query-document pair as input and outputs a relevance score. More accurate than bi-encoder but slower. Used for reranking.
66DistillationTraining a smaller (student) model to mimic a larger (teacher) model's outputs. Produces efficient models for deployment.
67Document IntelligenceAzure AI Document Intelligence β€” service for extracting text, tables, and structure from PDFs, forms, and images.
68Eval / EvaluationSystematic measurement of AI system quality using metrics (groundedness, relevance, etc.) and test datasets.
69FoundryAzure AI Foundry β€” Microsoft's unified platform for building, evaluating, and deploying AI applications.
70GANGenerative Adversarial Network β€” architecture with generator and discriminator networks. Largely superseded by diffusion models for image generation.
71GPTQPost-training quantization method that compresses LLMs to 4-bit with minimal quality loss. Popular for local deployment.
72Hybrid SearchCombining vector (semantic) and keyword (BM25) search with rank fusion (RRF) for best retrieval quality.
73ICLIn-Context Learning β€” the ability of LLMs to learn from examples provided in the prompt without weight updates.
74LatencyTime from sending a request to receiving the first (TTFT) or last (E2E) token of the response.
75MAUMonthly Active Users β€” common metric for sizing AI deployments and estimating costs.
76MultimodalModels that process multiple input types (text, images, audio, video) in a single architecture.
77NLINatural Language Inference β€” task of determining if a hypothesis is entailed by, contradicts, or is neutral to a premise. Used in groundedness evaluation.
78OrchestratorComponent that routes requests, manages conversation state, calls tools, and coordinates between models in an AI application.
79PagedAttentionMemory management technique (used in vLLM) that pages KV cache like OS virtual memory, reducing waste.
80Prompt CachingReusing computed prefixes across requests to reduce latency and cost for shared system prompts.
81Red TeamingAdversarial testing of AI systems to find safety vulnerabilities, jailbreaks, and failure modes before deployment.
82RetrieverComponent in RAG that searches a knowledge base and returns relevant documents/chunks for the LLM context.
83RPMRequests Per Minute β€” Azure OpenAI rate limit unit. Measured alongside TPM.
84SoftmaxActivation function that converts logits to a probability distribution. Final layer of token prediction in LLMs.
85StreamingReturning tokens incrementally as they're generated, reducing perceived latency. Enabled via stream=True in API calls.
86TTFTTime To First Token β€” latency metric measuring how quickly the first token of a response is returned.
87Tool UseLLM capability to decide when and how to call external tools (APIs, databases, code) during generation.
88TF-IDFTerm Frequency–Inverse Document Frequency β€” classic text representation weighting scheme. Predecessor to modern embeddings.
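The Softmax, Temperature, and Top-P entries above describe stages of a single sampling pipeline. A minimal illustrative sketch in plain Python (the function names are ours, not any model's actual decoder):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (glossary: Softmax)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits before softmax; lower temperature sharpens the distribution."""
    return [x / temperature for x in logits]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p (glossary: Top-P)."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]
cool = softmax(apply_temperature(logits, 0.5))  # sharper than temperature 1.0
warm = softmax(apply_temperature(logits, 1.5))  # flatter than temperature 1.0
print(top_p_filter(softmax(logits), 0.9))       # → [0, 1, 2]
```

Note how temperature and top-p interact: a low temperature concentrates probability on the top token, which in turn shrinks the nucleus that top-p keeps. This is why Card 1 recommends tuning one or the other, not both.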
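The RRF and Hybrid Search entries are easiest to grasp in code. A sketch of Reciprocal Rank Fusion using the commonly cited `k = 60` constant (the `reciprocal_rank_fusion` helper is ours, not an Azure AI Search API):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists (glossary: RRF).

    rankings: a list of doc-id lists, each ordered best-first.
    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so documents ranked well by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["doc3", "doc1", "doc7"]   # e.g. a BM25 keyword ranking
semantic = ["doc1", "doc5", "doc3"]  # e.g. a vector-search ranking
print(reciprocal_rank_fusion([keyword, semantic]))  # → ['doc1', 'doc3', 'doc5', 'doc7']
```

doc1 wins because it appears near the top of both lists, even though neither retriever ranked it first in isolation; that cross-retriever agreement is the point of hybrid search.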
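The Vector Search and Retriever entries boil down to a distance computation plus a ranking. A brute-force sketch with hypothetical helper names (production stores use ANN indexes such as HNSW rather than scanning every vector):

```python
import math

def cosine_similarity(a, b):
    """Cosine metric from the Vector Search entry: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Exhaustive nearest-neighbor search over an in-memory {doc_id: vector} index."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy 3-dimensional "embeddings"; real embedding models emit hundreds to
# thousands of dimensions per vector.
index = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "stocks": [0.0, 0.1, 0.9],
}
print(retrieve([1.0, 0.0, 0.0], index))  # → ['cats', 'dogs']
```

In a RAG pipeline this `retrieve` step runs first, and the returned chunks are pasted into the LLM context for grounding.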

## Card 12: Key Azure AI URLs & Resources

### Azure Portals & Services

| Resource | URL |
|---|---|
| Azure AI Foundry Portal | https://ai.azure.com |
| Azure Portal | https://portal.azure.com |
| Azure OpenAI Studio (legacy) | https://oai.azure.com |
| Azure AI Content Safety | https://contentsafety.cognitive.azure.com |

### Pricing Pages

| Service | URL |
|---|---|
| Azure OpenAI Pricing | https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/ |
| Azure AI Search Pricing | https://azure.microsoft.com/pricing/details/search/ |
| Azure AI Services Pricing | https://azure.microsoft.com/pricing/details/cognitive-services/ |
| Azure Virtual Machines Pricing (GPU) | https://azure.microsoft.com/pricing/details/virtual-machines/linux/ |

### Documentation

| Topic | URL |
|---|---|
| Azure OpenAI Documentation | https://learn.microsoft.com/azure/ai-services/openai/ |
| Azure AI Foundry Documentation | https://learn.microsoft.com/azure/ai-studio/ |
| Azure AI Search Documentation | https://learn.microsoft.com/azure/search/ |
| Azure AI Content Safety Documentation | https://learn.microsoft.com/azure/ai-services/content-safety/ |
| Azure OpenAI Model Catalog | https://learn.microsoft.com/azure/ai-studio/how-to/model-catalog |
| Responsible AI Principles | https://www.microsoft.com/ai/responsible-ai |
| Responsible AI Dashboard | https://learn.microsoft.com/azure/machine-learning/concept-responsible-ai-dashboard |
| Azure OpenAI Quotas & Limits | https://learn.microsoft.com/azure/ai-services/openai/quotas-limits |
| Azure OpenAI What's New | https://learn.microsoft.com/azure/ai-services/openai/whats-new |

### GitHub Repositories

| Repository | URL |
|---|---|
| Semantic Kernel | https://github.com/microsoft/semantic-kernel |
| AutoGen | https://github.com/microsoft/autogen |
| Azure AI Samples | https://github.com/Azure-Samples/azure-ai |
| Azure OpenAI Samples | https://github.com/Azure-Samples/openai |
| PyRIT (Red Teaming) | https://github.com/Azure/PyRIT |
| Prompty | https://github.com/microsoft/prompty |
| AI App Templates | https://github.com/Azure-Samples/ai-app-templates |
| Azure AI Evaluation SDK | https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation |

### Community & Learning

| Resource | URL |
|---|---|
| Microsoft Learn AI Training Paths | https://learn.microsoft.com/training/browse/?terms=AI |
| Azure AI Blog | https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/bg-p/Azure-AI-Services-blog |
| Microsoft AI (Corporate) | https://www.microsoft.com/ai |
| Azure AI Discord | https://aka.ms/azureaicommunity |

## Quick Lookup Index

Jump to any card by topic:

| Card | Topic | Key Questions Answered |
|---|---|---|
| Card 1 | LLM Parameters | What do temperature, top-p, penalties do? What values should I use? |
| Card 2 | Tokens | How many tokens is my text? What does each model cost? |
| Card 3 | Model Selection | Which model for my use case? |
| Card 4 | RAG Architecture | Chunking? Embeddings? Retrieval strategy? Evaluation? |
| Card 5 | Deployment Types | Standard vs. Provisioned vs. Global vs. Batch? |
| Card 6 | Prompt Patterns | Zero-shot vs. few-shot vs. CoT? Prompt template? |
| Card 7 | Agent Frameworks | Which agent framework should I use? |
| Card 8 | Infrastructure | GPU sizing? VM selection? PTU break-even? |
| Card 9 | Responsible AI | Pre-deployment checks? Content filters? Security? |
| Card 10 | Copilot Ecosystem | Which Copilot does what? Licensing? |
| Card 11 | Glossary | What does this acronym mean? |
| Card 12 | URLs & Links | Where is the pricing page? Where is the documentation? |

**Bookmark This Page**

Press Ctrl+D / Cmd+D to bookmark. This page is designed to be your daily quick-reference for Azure AI engineering decisions.


*Module 11 of 12 in the AI Nexus learning path. Designed as a living reference — updated as Azure AI services evolve.*