FROOT reference

AI Glossary A–Z

Every term an architect, engineer, or consultant will encounter in GenAI — defined clearly, with context for why it matters. 110 terms, tagged by FROOT layer.

Filter by layer:

A9 terms

Ablation Study
Transformation
Removing components of a model or system one at a time to measure which pieces contribute most to performance. Used during fine-tuning and evaluation to understand what matters.
Activation Function
Foundations
A mathematical function (ReLU, GELU, SiLU) applied to neuron outputs that introduces non-linearity. Without it, a neural network would just be linear algebra. **GELU** is the most common in modern transformers.
Agent
Orchestration
An AI system that can **perceive, plan, decide, and act** autonomously. Unlike a simple chat completion, an agent has a loop: observe → think → act → observe results → repeat. See Module O2 for the full deep dive.
Agent Framework (Microsoft)
Orchestration
Microsoft's SDK for building production AI agents. Supports tool calling, multi-agent orchestration, stateful conversations, and integration with Azure AI Foundry. Successor to AutoGen for production use cases. Compare with Semantic Kernel in Module O1.
AI Landing Zone
Operations
An enterprise-ready Azure environment pre-configured for AI workloads. Includes networking (private endpoints, VNets), identity (managed identities), governance (policies, RBAC), compute (GPU quotas), and data services (AI Search, storage). Built on the Cloud Adoption Framework.
Alignment
Transformation
Training a model to follow human intent, be helpful, and avoid harmful outputs. Techniques include RLHF, DPO, and constitutional AI. The reason ChatGPT says "I'd be happy to help" instead of producing raw completions.
Attention (Self-Attention)
Foundations
The core mechanism of transformers. For each token, attention computes how much "attention" to pay to every other token in the sequence. The formula: `Attention(Q,K,V) = softmax(QK^T / √d_k) × V`. This is what lets a model understand that "it" in "The cat sat on the mat because it was tired" refers to "the cat."
AutoGen
Orchestration
Microsoft's open-source framework for multi-agent conversations. Agents are defined with roles and can collaborate in group chats. Being succeeded by Microsoft Agent Framework for production use, but remains popular for research and prototyping.
Autoregressive Generation
Foundations
The process of generating one token at a time, where each new token is conditioned on all previous tokens. This is how GPT, Claude, and Llama generate text. It's inherently sequential — which is why inference latency scales with output length.

B4 terms

Batch Size
Foundations
The number of samples processed together during training or inference. Larger batches = better GPU utilization but more memory. For inference, **continuous batching** groups multiple requests to maximize throughput.
BERT (Bidirectional Encoder Representations from Transformers)
Foundations
An encoder-only transformer (2018). Unlike GPT which reads left-to-right, BERT reads in both directions. Used for classification, entity extraction, and embeddings — not for text generation. Still widely used for search and NLU tasks.
BPE (Byte-Pair Encoding)
Foundations
The most common tokenization algorithm. Starts with individual characters and iteratively merges the most frequent pairs. "unbelievable" → `["un", "believ", "able"]`. GPT-4 uses ~100K BPE tokens. Understanding BPE helps you estimate token costs.
Beam Search
Foundations
A decoding strategy that explores multiple candidate sequences simultaneously (keeping the top-k "beams"). Produces more coherent text than greedy decoding but is slower. Often used in translation and summarization.

C13 terms

Chain-of-Thought (CoT)
Reasoning
A prompting technique where you ask the model to "think step by step" before giving a final answer. Dramatically improves accuracy on math, logic, and reasoning tasks. Cost: more output tokens. See Module R1.
Chunking
Reasoning
Splitting documents into smaller pieces for RAG retrieval. Strategies include fixed-size (512 tokens), semantic (by paragraph/section), recursive, and sentence-based. Chunk size directly impacts retrieval quality. See Module R2.
Classification
Foundations
The task of assigning a label to an input. Examples: sentiment analysis ("positive"/"negative"), intent detection ("book_flight"/"cancel_order"), content moderation ("safe"/"unsafe"). Can be done via prompting or fine-tuning.
Completion
Foundations
The output generated by a language model. In the API world, a "completion" is the model's response to a prompt. **Chat completions** use a messages array; **text completions** (legacy) use a single prompt string.
Constitutional AI
Transformation
An alignment technique (Anthropic) where the model critiques and revises its own outputs based on a set of principles ("constitution"). Reduces the need for human feedback in alignment training.
Container Apps (Azure)
Operations
A serverless container platform ideal for hosting AI agents and APIs. Supports auto-scaling (including scale-to-zero), GPU workloads, Dapr sidecars, and built-in ingress. Popular for deploying agent backends that need to scale dynamically.
Content Safety (Azure AI)
Transformation
Azure's service for detecting harmful content in text and images. Categories: hate, self-harm, sexual, violence. Severity levels 0–6. Used as a guardrail before and after model responses. See Module T2.
Context Window
Foundations
The maximum number of tokens a model can process in a single request (input + output combined). GPT-4o: 128K, Claude Opus 4: 200K, Gemini 1.5 Pro: 2M. Larger windows enable longer documents but cost more and may reduce focus on relevant content.
Continuous Batching
Operations
An inference optimization where new requests are added to a running batch without waiting for all current requests to finish. Dramatically improves GPU utilization and throughput in production serving.
Copilot
Operations
Microsoft's brand for AI assistants embedded in products. **M365 Copilot** (Office), **GitHub Copilot** (code), **Copilot for Azure** (cloud ops), **Copilot Studio** (custom copilots). Each uses different models and architectures underneath.
Copilot Studio
Operations
Microsoft's low-code platform for building custom copilots. Connects to enterprise data (SharePoint, Dataverse), supports topics, actions, and plugins. No code required for basic scenarios, extensible with code for advanced ones.
Cosine Similarity
Reasoning
A measure of similarity between two vectors (0 = unrelated, 1 = identical). Used in RAG to compare query embeddings against document embeddings. Typical relevance threshold: 0.75–0.85. See Module R2.
Cross-Attention
Foundations
Attention where queries come from one sequence (e.g., decoder) and keys/values come from another (e.g., encoder). Used in encoder-decoder models like T5 and in multi-modal models where text attends to image patches.

D5 terms

Data Parallelism
Operations
Distributing training data across multiple GPUs, each holding a copy of the model. Gradients are synchronized after each step. The simplest multi-GPU strategy. Used when the model fits in one GPU's memory.
Decoder
Foundations
The part of a transformer that generates output tokens one at a time, each conditioned on previous outputs and (optionally) encoder output. GPT, Claude, and Llama are **decoder-only** architectures.
Deterministic AI
Reasoning
Making AI outputs reproducible and predictable. Techniques: `temperature=0`, fixed `seed` parameter, structured output schemas, constrained decoding, evaluation-driven guardrails. Even with `temperature=0`, GPU floating-point arithmetic can introduce tiny variations. See Module R3.
Distillation (Knowledge Distillation)
Transformation
Training a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's probability distributions rather than raw data. Produces much smaller models that retain 90-95% of the teacher's capability.
DPO (Direct Preference Optimization)
Transformation
An alignment technique that skips the reward model step of RLHF and directly optimizes the policy from preference data. Simpler, more stable than RLHF. Used in Llama 3, Zephyr, and many fine-tuned models.

E4 terms

Embeddings
Foundations
Dense vector representations of text (or images, audio) that capture semantic meaning. "King" and "Queen" have similar embeddings. Used in RAG for similarity search, in classification, and in clustering. Common dimensions: 768, 1536, 3072.
Encoder
Foundations
The part of a transformer that processes the full input sequence in parallel (bidirectionally). BERT is encoder-only. Encoder-decoder models (T5, BART) use the encoder to understand input and the decoder to generate output.
Endpoint
Operations
A URL that serves an AI model for inference. Azure AI Foundry provides **managed endpoints** (serverless, pay-per-token) and **dedicated endpoints** (reserved compute, predictable performance). Choice impacts cost, latency, and SLA.
Evaluation
Transformation
Measuring AI system quality. **Offline evaluation**: test set metrics (accuracy, F1, BLEU, ROUGE). **Online evaluation**: A/B testing in production. **LLM-as-judge**: using one model to score another's outputs. Azure AI Foundry has built-in evaluation tools.

F6 terms

Few-Shot Learning
Reasoning
Providing a few examples in the prompt to teach the model a task. Zero-shot = no examples. One-shot = one example. Few-shot = 2-10 examples. More examples improve consistency but use more tokens (and cost more).
Fine-Tuning
Transformation
Continuing the training of a pre-trained model on your specific dataset. Changes the model's weights. Use when prompting alone isn't enough — for domain-specific language, consistent formatting, or task specialization. See Module T1.
Flash Attention
Operations
An algorithm that makes attention computation faster and more memory-efficient by tiling and recomputing instead of materializing the full attention matrix. Enables longer context windows without quadratic memory growth.
Floating Point Formats
Operations
Numeric precision used for model weights. **FP32** (32-bit) = full precision, training. **FP16/BF16** (16-bit) = mixed precision training and inference. **INT8/INT4** = quantized inference, 2-4x smaller models. Lower precision = faster + cheaper but slightly less accurate.
Foundation Model
Foundations
A large model pre-trained on broad data that can be adapted to many tasks. GPT-4, Claude Opus 4, and Llama 3.1 are foundation models. The term emphasizes that these models serve as "foundations" for specialized applications.
Function Calling
Orchestration
A model capability where it outputs structured JSON describing a function to call, rather than plain text. The application executes the function and feeds results back. Enables models to interact with databases, APIs, and tools. See Module O3.

G4 terms

GGUF (GPT-Generated Unified Format)
Operations
A file format for quantized models optimized for CPU inference via llama.cpp. Common for running models locally. Variants: Q4_K_M (4-bit, medium quality), Q5_K_M, Q8_0 (8-bit, best quality).
GPU (Graphics Processing Unit)
Operations
The hardware that makes AI possible. AI workloads use **NVIDIA A100, H100, H200, B200** GPUs. Key specs: VRAM (memory), TFLOPS (compute), memory bandwidth. A single H100 has 80GB VRAM and can serve a 70B parameter model in FP16.
Grounding
Reasoning
Anchoring model responses in factual, verifiable information. Techniques: RAG (retrieve relevant docs), system messages with facts, structured data injection, citation requirements. The primary defense against hallucination. See Module R3.
Guardrails
Reasoning
Rules and filters that constrain AI behavior. Input guardrails filter harmful prompts. Output guardrails filter inappropriate responses. Can be rule-based (regex, blocklists), ML-based (classifiers), or LLM-based (a second model checks the first).

H3 terms

Hallucination
Reasoning
When a model generates confident-sounding but factually incorrect information. Causes: training data gaps, statistical pattern matching without understanding, high temperature. Mitigation: RAG, grounding, low temperature, structured output, evaluation. See Module R3.
Hybrid Search
Reasoning
Combining keyword search (BM25) with vector search (embeddings) in RAG retrieval. Typically outperforms either alone. Azure AI Search supports hybrid natively. Weight: usually 50-70% vector, 30-50% keyword.
Hyperparameter
Foundations
A parameter set before training begins (not learned). Examples: learning rate, batch size, number of epochs, LoRA rank. Hyperparameter tuning is a key part of fine-tuning. See Module T1.

I3 terms

Inference
Foundations
Running a trained model to generate predictions or outputs. Unlike training (which updates weights), inference only reads weights. Inference workloads are typically latency-sensitive and require different infrastructure than training.
In-Context Learning (ICL)
Reasoning
The ability of LLMs to learn tasks from examples provided in the prompt, without any weight updates. Few-shot prompting is a form of ICL. Remarkable because the model "learns" at inference time, not training time.
Instruction Tuning
Transformation
Fine-tuning a model on a dataset of (instruction, response) pairs. Teaches the model to follow instructions rather than just predict next tokens. The difference between base GPT-4 and ChatGPT is largely instruction tuning + RLHF.

J1 term

JSON Mode
Reasoning
A model configuration that constrains output to valid JSON. Essential for function calling, API integration, and structured data extraction. Supported by OpenAI, Azure OpenAI, and most modern model APIs. Reduces parsing errors in production.

K2 terms

KV Cache (Key-Value Cache)
Operations
During autoregressive generation, the model caches the key and value tensors from previous tokens so they don't need to be recomputed. This is what makes generation fast but **eats VRAM**. A 128K context window with a 70B model can consume 40+ GB of KV cache.
Knowledge Cutoff
Foundations
The date after which a model has no training data. GPT-4o: Oct 2023. Claude Opus 4: early 2025. Any question about events after the cutoff requires RAG or tool access to answer correctly.

L4 terms

LangChain
Orchestration
An open-source framework for building LLM applications. Provides chains, agents, memory, and tool integrations. Python and JavaScript versions. Compare with Semantic Kernel (O1) — LangChain is more community-driven, SK is more enterprise/Microsoft-integrated.
Large Language Model (LLM)
Foundations
A neural network with billions of parameters trained on massive text corpora to predict the next token. The "large" refers to parameter count (7B to 1T+). All modern GenAI applications are built on LLMs.
LoRA (Low-Rank Adaptation)
Transformation
A parameter-efficient fine-tuning technique that freezes original model weights and trains small rank-decomposition matrices alongside them. Reduces fine-tuning compute by 10-100x. LoRA adapters are typically 10-100MB vs the full model at 10-100GB. See Module T1.
Latency
Operations
Time from request to first response token (TTFT — Time To First Token) or full response (TTLR — Time To Last Response). Production targets: TTFT < 500ms, TTLR < 3s for interactive use. Affected by model size, input length, and infrastructure.

M7 terms

MCP (Model Context Protocol)
Orchestration
An open protocol (by Anthropic, now industry-wide) for connecting AI models to external tools and data sources. MCP servers expose tools via a standardized schema. MCP clients (in agents) discover and call these tools. See Module O3.
Memory (Agent)
Orchestration
How an agent retains information across interactions. **Short-term memory**: current conversation context. **Long-term memory**: persisted facts (vector store, database). **Episodic memory**: past conversation summaries. Critical for multi-turn and multi-session agents.
Mixed Precision
Operations
Using multiple floating-point formats during training/inference. Compute-heavy operations use FP16/BF16 for speed; accumulations use FP32 for accuracy. Standard for all modern AI training. BF16 preferred over FP16 for stability.
Model Catalog (Azure AI Foundry)
Operations
Azure's marketplace of 1,700+ AI models. Includes OpenAI (GPT-4o, o1), Meta (Llama), Mistral, Cohere, and more. Models can be deployed as serverless APIs (pay-per-token) or to dedicated compute.
Model Parallelism
Operations
Distributing a single model across multiple GPUs when it doesn't fit in one GPU's memory. **Tensor parallelism**: splits layers. **Pipeline parallelism**: assigns different layers to different GPUs. Complex in practice.
Multi-Agent
Orchestration
Systems where multiple AI agents collaborate, each with specialized roles. Patterns: supervisor (one agent orchestrates), swarm (peer-to-peer), pipeline (sequential handoff). See Module O2.
Multi-Modal
Foundations
Models that process multiple input types: text, images, audio, video. GPT-4o, Claude Opus 4, and Gemini are multi-modal. Enables use cases like image understanding, document extraction, video analysis.

N2 terms

Neural Network
Foundations
A computational model inspired by biological neurons. Layers of nodes (neurons) connected by weighted edges. Training adjusts weights to minimize a loss function. Transformers are a specific type of neural network architecture.
Next-Token Prediction
Foundations
The fundamental task of decoder LLMs: given a sequence of tokens, predict the probability distribution over the vocabulary for the next token. Every text generation capability — writing, reasoning, coding — emerges from this single task.

O2 terms

ONNX (Open Neural Network Exchange)
Operations
An open format for representing ML models. Enables training in one framework (PyTorch) and deploying in another (ONNX Runtime). Azure uses ONNX Runtime for optimized inference. Supports quantization and hardware acceleration.
Orchestrator
Orchestration
The component that coordinates between an LLM and other system components (tools, memory, RAG, other agents). Semantic Kernel's planner, LangChain's agent executor, and Microsoft Agent Framework's runtime are all orchestrators.

P9 terms

Parameters (Model Parameters)
Foundations
The learned weights of a neural network. GPT-4: ~1.8T (estimated, MoE), Llama 3.1: 405B, Phi-4: 14B. More parameters generally = more capability but require more compute and memory. Size ≠ quality (architecture and training data matter enormously).
Parameters (Generation Parameters)
Foundations
Settings that control text generation behavior: | Parameter | Range | What It Controls | Impact | |-----------|-------|-------------------|--------| | **Temperature** | 0.0–2.0 | Randomness of token selection | 0.0 = deterministic, 1.0 = balanced, 2.0 = creative chaos | | **Top-k** | 1–100 | Number of tokens to consider | Lower = more focused, higher = more diverse | | **Top-p** (nucleus) | 0.0–1.0 | Cumulative probability threshold | 0.1 = only top tokens, 0.95 = most tokens | | **Frequency penalty** | -2.0–2.0 | Penalty for repeated tokens | Higher = less repetition | | **Presence penalty** | -2.0–2.0 | Penalty for tokens already used | Higher = more topic diversity | | **Max tokens** | 1–model max | Maximum output length | Controls cost and response length | | **Seed** | Integer | Random seed for reproducibility | Same seed + same input = similar output | | **Stop sequences** | Strings | Tokens that halt generation | Control where output ends |
Pipeline (AI/ML)
Operations
A sequence of automated steps: data ingestion → preprocessing → model training/inference → evaluation → deployment. Azure AI Foundry, MLflow, and GitHub Actions are common orchestrators for AI pipelines.
Planner (Semantic Kernel)
Orchestration
A component that takes a user's goal and breaks it into a sequence of plugin function calls. **Handlebars Planner**: template-based. **Stepwise Planner**: iterative. In modern SK, planners are being replaced by function-calling models.
Plugin (Semantic Kernel)
Orchestration
A collection of functions that extend model capabilities. **Native plugins**: C#/Python code. **OpenAPI plugins**: API specifications. **OpenAI plugins**: compatible format. Plugins are how Semantic Kernel connects LLMs to business logic.
Private Endpoint
Operations
An Azure networking feature that gives a service a private IP address within your VNet. Critical for AI Landing Zones — keeps model endpoints, search services, and storage off the public internet.
Prompt
Reasoning
The input given to a language model. Composed of: **system message** (instructions/persona), **user message** (the request), **assistant message** (previous responses). Prompt quality is the single biggest lever for output quality. See Module R1.
Prompt Flow
Operations
An Azure AI Foundry tool for building and evaluating LLM workflows visually. Supports DAG-based flows with LLM nodes, Python nodes, tool nodes, and evaluation metrics. Being integrated into VS Code for local development.
PTU (Provisioned Throughput Units)
Operations
Azure OpenAI's dedicated capacity model. Instead of pay-per-token, you reserve a fixed amount of throughput. Predictable performance and cost. Best for: high-volume production workloads with predictable demand. Compare with pay-as-you-go (PAYG).

Q2 terms

QLoRA (Quantized LoRA)
Transformation
LoRA applied to a quantized (4-bit) base model. Enables fine-tuning 70B+ models on a single consumer GPU (24GB VRAM). Minimal quality loss compared to full LoRA. The most accessible fine-tuning technique. See Module T1.
Quantization
Operations
Reducing the numerical precision of model weights (FP32 → INT8 → INT4). Reduces model size by 2-8x and speeds up inference. Techniques: GPTQ (post-training), AWQ (activation-aware), GGUF (CPU-friendly). Some quality loss at INT4.

R4 terms

RAG (Retrieval-Augmented Generation)
Reasoning
A pattern that grounds LLM responses in retrieved documents. Flow: query → embed query → search vector store → retrieve relevant chunks → inject into prompt → generate response. The most common pattern for enterprise AI. See Module R2. ```mermaid %%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%% flowchart LR Q["User Query"] --> E["Embed Query"] E --> S["Search Vector Store"] S --> R["Retrieve Top-K Chunks"] R --> P["Augment Prompt"] P --> M["LLM Generate"] M --> A["Grounded Answer"] style Q fill:#f59e0b22,stroke:#f59e0b style S fill:#06b6d422,stroke:#06b6d4 style M fill:#7c3aed22,stroke:#7c3aed style A fill:#10b98122,stroke:#10b981 ```
Reranking
Reasoning
A second-pass ranking step after initial retrieval. A cross-encoder model scores each (query, document) pair more accurately than cosine similarity. Dramatically improves relevance. Azure AI Search supports semantic reranking natively.
Responsible AI
Transformation
Microsoft's framework for trustworthy AI: fairness, reliability, safety, privacy, inclusiveness, transparency, accountability. Not just ethics — it's engineering practices: content filters, red teaming, evaluation, human oversight. See Module T2.
RLHF (Reinforcement Learning from Human Feedback)
Transformation
An alignment technique: (1) collect human preferences on model outputs, (2) train a reward model from preferences, (3) optimize the LLM policy against the reward model using PPO. How ChatGPT was aligned. Being partially replaced by DPO.

S9 terms

Scaling Laws
Foundations
Empirical observations (Kaplan et al., Chinchilla) about how model performance improves with more parameters, data, and compute. Key insight: performance follows predictable power laws, enabling compute-optimal model sizing.
Semantic Kernel
Orchestration
Microsoft's open-source SDK for building AI applications. Core concepts: kernel (orchestrator), plugins (tools), memory (context), planners (goal decomposition), connectors (external services). C# and Python SDKs. See Module O1.
Semantic Ranking
Reasoning
Azure AI Search's built-in reranking capability that uses a cross-encoder model to reorder search results by semantic relevance. Activated with `queryType: semantic`. Significant quality improvement over BM25 alone.
Semantic Search
Reasoning
Searching by meaning rather than keywords. Uses embeddings to find documents that are semantically similar to the query, even if they use different words. "How to fix a flat tire" matches "tire puncture repair guide."
Serving (Model Serving)
Operations
The infrastructure that makes models available for inference. Options: Azure AI Foundry endpoints, vLLM, TGI (Text Generation Inference), Triton, ONNX Runtime. Key metrics: TTFT, tokens/second, concurrent requests, GPU utilization.
SLM (Small Language Model)
Foundations
Models with fewer parameters (1B–14B) optimized for specific tasks or edge deployment. Microsoft Phi-4 (14B), Phi-3.5-mini (3.8B). Trade raw capability for speed, cost, and privacy (can run on-device). Often outperform larger models on narrow tasks.
Stop Sequence
Foundations
One or more strings that cause the model to stop generating when encountered. Examples: `"\n\n"`, `"```"`, `"END"`. Essential for controlling output format in production applications.
Structured Output
Reasoning
Constraining model output to a specific format (JSON schema, XML, Markdown). Techniques: JSON mode, function calling with schema, regex-constrained generation. Critical for API integrations. See Module R1.
System Message
Reasoning
The first message in a chat completion that sets the model's behavior, persona, constraints, and context. The most powerful prompt engineering lever. Example: "You are a helpful Azure architect. Only answer questions about Azure. Cite sources."

T8 terms

Temperature
Foundations
A generation parameter (0.0–2.0) that controls randomness. At 0.0, the model always picks the most likely token (near-deterministic). At 1.0, probabilities are used as-is. At 2.0, the distribution is flattened (more random). For factual tasks: 0.0–0.3. For creative tasks: 0.7–1.0.
Tensor
Foundations
A multi-dimensional array of numbers. Scalars are 0D tensors, vectors are 1D, matrices are 2D. Neural networks process tensors. Model weights are stored as tensors. Understanding tensor shapes helps debug AI infrastructure issues.
Tokenization
Foundations
Converting text into a sequence of integer token IDs. Each model has its own tokenizer. "Hello, world!" might become `[9906, 11, 1917, 0]`. Token count determines cost (pay-per-token) and context limits. Rule of thumb: 1 token ≈ 4 characters in English.
Tool Use
Orchestration
A model's ability to invoke external tools (APIs, databases, code interpreters). The model generates a structured tool call → the application executes it → the result is fed back to the model. Fundamental to agents. See Module O3.
Top-k
Foundations
A generation parameter. At each step, only the top-k most likely tokens are considered. `top_k=1` = greedy decoding (always pick the most likely). `top_k=40` = consider 40 options. Lower values = more focused, higher = more diverse.
Top-p (Nucleus Sampling)
Foundations
A generation parameter. Instead of a fixed count, considers the smallest set of tokens whose cumulative probability exceeds p. `top_p=0.1` = very focused (only top few tokens). `top_p=0.95` = most tokens eligible. Typically used instead of top_k, not with it.
Transformer
Foundations
The neural network architecture behind all modern LLMs (Vaswani et al., 2017 — "Attention Is All You Need"). Key innovation: self-attention replaces recurrence, enabling massive parallelization. Variants: encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5).
Transfer Learning
Foundations
Using a pre-trained model as a starting point for a new task. Fine-tuning is a form of transfer learning. The pre-trained model has already learned general language understanding; you transfer that knowledge to your specific domain.

U1 term

Uncertainty
Reasoning
A model's lack of confidence in its output. LLMs are notoriously bad at expressing uncertainty — they generate fluent text regardless of confidence. Techniques to surface uncertainty: token probabilities (logprobs), calibration, abstention prompting ("say 'I don't know' if unsure").

V3 terms

Vector Database
Reasoning
A database optimized for storing and searching high-dimensional vectors (embeddings). Used in RAG for similarity search. Options: Azure AI Search (hybrid), Pinecone, Weaviate, Qdrant, Chroma, pgvector. Choice depends on scale, hybrid search needs, and cloud integration.
Vector Search
Reasoning
Finding similar items by comparing their vector embeddings. Algorithms: HNSW (most common, approximate but fast), IVF (inverted file index), flat (exact but slow). Azure AI Search uses HNSW for vector indexing.
vLLM
Operations
An open-source, high-performance LLM serving engine. Key innovation: PagedAttention (manages KV cache like virtual memory pages). Supports continuous batching, tensor parallelism, and OpenAI-compatible API. Popular for self-hosted model serving on AKS or Container Apps.

W2 terms

Weight
Foundations
A numerical parameter in a neural network that is learned during training. A 70B model has 70 billion weights. Weights encode everything the model "knows." Saving weights = saving the model. Fine-tuning = updating weights.
Window (Context Window)
Foundations
See **Context Window**.

X1 term

XAI (Explainable AI)
Transformation
Techniques for understanding why a model made a specific prediction or generation. Attention visualization, feature attribution, and chain-of-thought prompting all contribute to explainability. Important for regulated industries.

Z2 terms

Zero-Shot
Reasoning
Asking a model to perform a task without any examples. "Classify this review as positive or negative: {review}". Works well for tasks similar to the model's training data. Performance improves with few-shot examples.
When to Use What
Unspecified
| Use Case | Temperature | Top-p | Why | |----------|-------------|-------|-----| | Factual Q&A | 0.0 | 1.0 | Maximum determinism | | Classification | 0.0 | 1.0 | Consistent labels | | Code generation | 0.0–0.2 | 0.95 | Correct but slightly varied | | Summarization | 0.3 | 0.9 | Faithful but fluent | | Creative writing | 0.7–1.0 | 0.95 | Diverse and interesting | | Brainstorming | 1.0–1.5 | 0.95 | Maximum diversity |

A9 terms

Ablation Study

Activation Function

Agent

Agent Framework (Microsoft)

AI Landing Zone

Alignment

Attention (Self-Attention)

AutoGen

Autoregressive Generation

B4 terms

Batch Size

BERT (Bidirectional Encoder Representations from Transformers)

BPE (Byte-Pair Encoding)

Beam Search

C13 terms

Chain-of-Thought (CoT)

Chunking

Classification

Completion

Constitutional AI

Container Apps (Azure)

Content Safety (Azure AI)

Context Window

Continuous Batching

Copilot

Copilot Studio

Cosine Similarity

Cross-Attention

D5 terms

Data Parallelism

Decoder

Deterministic AI

Distillation (Knowledge Distillation)

DPO (Direct Preference Optimization)

E4 terms

Embeddings

Encoder

Endpoint

Evaluation

F6 terms

Few-Shot Learning

Fine-Tuning

Flash Attention

Floating Point Formats

Foundation Model

Function Calling

G4 terms

GGUF (GPT-Generated Unified Format)

GPU (Graphics Processing Unit)

Grounding

Guardrails

H3 terms

Hallucination

Hybrid Search

Hyperparameter

I3 terms

Inference

In-Context Learning (ICL)

Instruction Tuning

J1 term

JSON Mode

K2 terms

KV Cache (Key-Value Cache)

Knowledge Cutoff

L4 terms

LangChain

Large Language Model (LLM)

LoRA (Low-Rank Adaptation)

Latency

M7 terms

MCP (Model Context Protocol)

Memory (Agent)

Mixed Precision

Model Catalog (Azure AI Foundry)

Model Parallelism

Multi-Agent

Multi-Modal

N2 terms

Neural Network