FrootAI — AmpliFAI your AI Ecosystem

FROOT Foundations

AI Glossary A–Z

Every term an architect, engineer, or consultant will encounter in GenAI — defined clearly, with context for why it matters. 110 terms, tagged by FROOT layer.

A (9 terms)

  • Ablation Study

    🍎 Transformation

    Removing components of a model or system one at a time to measure which pieces contribute most to performance. Used during fine-tuning and evaluation to understand what matters.

  • Activation Function

    🌱 Foundations

    A mathematical function (ReLU, GELU, SiLU) applied to neuron outputs that introduces non-linearity. Without it, a stack of layers would collapse into a single linear transformation. **GELU** is common in GPT- and BERT-style transformers; SiLU-based gated variants (SwiGLU) are used in Llama-family models.

  • Agent

    🌿 Orchestration

    An AI system that can **perceive, plan, decide, and act** autonomously. Unlike a simple chat completion, an agent has a loop: observe → think → act → observe results → repeat. See Module O2 for the full deep dive.

  • Agent Framework (Microsoft)

    🌿 Orchestration

    Microsoft's SDK for building production AI agents. Supports tool calling, multi-agent orchestration, stateful conversations, and integration with Azure AI Foundry. Successor to AutoGen for production use cases. Compare with Semantic Kernel in Module O1.

  • AI Landing Zone

    🏗️ Operations

    An enterprise-ready Azure environment pre-configured for AI workloads. Includes networking (private endpoints, VNets), identity (managed identities), governance (policies, RBAC), compute (GPU quotas), and data services (AI Search, storage). Built on the Cloud Adoption Framework.

  • Alignment

    🍎 Transformation

    Training a model to follow human intent, be helpful, and avoid harmful outputs. Techniques include RLHF, DPO, and constitutional AI. The reason ChatGPT says "I'd be happy to help" instead of producing raw completions.

  • Attention (Self-Attention)

    🌱 Foundations

    The core mechanism of transformers. For each token, attention computes how much "attention" to pay to every other token in the sequence. The formula: `Attention(Q,K,V) = softmax(QK^T / √d_k) × V`. This is what lets a model understand that "it" in "The cat sat on the mat because it was tired" refers to "the cat."
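
    The formula above can be sketched in a few lines of plain Python. This is a minimal illustration on tiny hand-made matrices (the values of `Q`, `K`, and `V` are made up for the example), not how production libraries implement attention.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) x V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Illustrative inputs: two tokens, two dimensions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
```

    Each row of `w` sums to 1, and each token puts most of its weight on the key that matches its own query.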

  • AutoGen

    🌿 Orchestration

    Microsoft's open-source framework for multi-agent conversations. Agents are defined with roles and can collaborate in group chats. Being succeeded by Microsoft Agent Framework for production use, but remains popular for research and prototyping.

  • Autoregressive Generation

    🌱 Foundations

    The process of generating one token at a time, where each new token is conditioned on all previous tokens. This is how GPT, Claude, and Llama generate text. It's inherently sequential — which is why inference latency scales with output length.
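
    The sequential loop can be sketched as follows. The `NEXT` table and `next_token` function are hypothetical stand-ins for a real model: the toy looks only at the last token, whereas a real LLM conditions on the whole sequence.

```python
# Hypothetical lookup table standing in for a trained model.
NEXT = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def next_token(tokens):
    """Toy 'model': predicts from the last token only."""
    return NEXT.get(tokens[-1], "<eos>")

def generate(prompt_tokens, max_new_tokens, eos="<eos>"):
    """Generate one token per step until EOS or the token budget is reached."""
    tokens = list(prompt_tokens)
    steps = 0
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # one model call per generated token
        steps += 1
        if token == eos:
            break
        tokens.append(token)
    return tokens, steps

tokens, steps = generate(["the"], max_new_tokens=10)
```

    The number of model calls grows with the number of generated tokens, which is the reason latency tracks output length.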

B (4 terms)

  • Batch Size

    🌱 Foundations

    The number of samples processed together during training or inference. Larger batches = better GPU utilization but more memory. For inference, **continuous batching** groups multiple requests to maximize throughput.

  • BERT (Bidirectional Encoder Representations from Transformers)

    🌱 Foundations

    An encoder-only transformer (2018). Unlike GPT, which reads left-to-right, BERT reads in both directions. Used for classification, entity extraction, and embeddings, not for text generation. Still widely used for search and NLU tasks.

  • BPE (Byte-Pair Encoding)

    🌱 Foundations

    The most common tokenization algorithm. Starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pairs. For example, "unbelievable" might split into `["un", "believ", "able"]`; the exact split depends on the learned merges. GPT-4's tokenizer has a vocabulary of ~100K BPE tokens. Understanding BPE helps you estimate token costs.
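
    A toy version of the merge loop is sketched below. It works on characters rather than bytes, breaks ties by first-seen pair, and uses a made-up five-word corpus, so it only illustrates the idea; real tokenizers differ in many details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() returns the first pair seen among ties.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = train_bpe("low low low lower lowest", 2)
```

    After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into several pieces.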

C (13 terms)

  • Chain-of-Thought (CoT)

    🪵 Reasoning

    A prompting technique where you ask the model to "think step by step" before giving a final answer. Dramatically improves accuracy on math, logic, and reasoning tasks. Cost: more output tokens. See Module R1.

  • Chunking

    🪵 Reasoning

    Splitting documents into smaller pieces for RAG retrieval. Strategies include fixed-size (512 tokens), semantic (by paragraph/section), recursive, and sentence-based. Chunk size directly impacts retrieval quality. See Module R2.
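
    The fixed-size strategy is the simplest to sketch. The version below assumes the document is already tokenized into a list and adds an overlap between neighbouring chunks; the overlap parameter is an assumption of this example, not something the definition above prescribes.

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Small numbers so the result is easy to inspect.
chunks = chunk_fixed(list(range(10)), size=4, overlap=1)
```

    With these small values the ten tokens become three chunks, and each chunk starts with the last token of the previous one.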

  • Classification

    🌱 Foundations

    The task of assigning a label to an input. Examples: sentiment analysis ("positive"/"negative"), intent detection ("book_flight"/"cancel_order"), content moderation ("safe"/"unsafe"). Can be done via prompting or fine-tuning.

  • Completion

    🌱 Foundations

    The output generated by a language model. In the API world, a "completion" is the model's response to a prompt. **Chat completions** use a messages array; **text completions** (legacy) use a single prompt string.

  • Constitutional AI

    🍎 Transformation

    An alignment technique (Anthropic) where the model critiques and revises its own outputs based on a set of principles ("constitution"). Reduces the need for human feedback in alignment training.

  • Container Apps (Azure)

    🏗️ Operations

    A serverless container platform ideal for hosting AI agents and APIs. Supports auto-scaling (including scale-to-zero), GPU workloads, Dapr sidecars, and built-in ingress. Popular for deploying agent backends that need to scale dynamically.

  • Content Safety (Azure AI)

    🍎 Transformation

    Azure's service for detecting harmful content in text and images. Categories: hate, self-harm, sexual, violence. Each category is scored by severity: 0, 2, 4, or 6 by default, with an optional finer 0–7 scale for text. Used as a guardrail before and after model responses. See Module T2.

  • Context Window

    🌱 Foundations

    The maximum number of tokens a model can process in a single request (input + output combined). GPT-4o: 128K, Claude Opus 4: 200K, Gemini 1.5 Pro: 2M. Larger windows enable longer documents but cost more and may reduce focus on relevant content.

  • Continuous Batching

    🏗️ Operations

    An inference optimization where new requests are added to a running batch without waiting for all current requests to finish. Dramatically improves GPU utilization and throughput in production serving.

  • Copilot

    🏗️ Operations

    Microsoft's brand for AI assistants embedded in products. **M365 Copilot** (Office), **GitHub Copilot** (code), **Copilot for Azure** (cloud ops), **Copilot Studio** (custom copilots). Each uses different models and architectures underneath.

  • Copilot Studio

    🏗️ Operations

    Microsoft's low-code platform for building custom copilots. Connects to enterprise data (SharePoint, Dataverse), supports topics, actions, and plugins. No code required for basic scenarios, extensible with code for advanced ones.

  • Cosine Similarity

    🪵 Reasoning

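    A measure of similarity between two vectors based on the angle between them, ranging from -1 (opposite) through 0 (unrelated) to 1 (same direction). In practice, text embedding scores mostly fall between 0 and 1. Used in RAG to compare query embeddings against document embeddings. A typical relevance threshold is 0.75–0.85, but the right value depends on the embedding model. See Module R2.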
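
    The computation itself is short. The sketch below uses plain Python lists as vectors; real embeddings have hundreds or thousands of dimensions, and the two-dimensional inputs here are chosen only to make the results easy to check.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

    Only direction matters: `[1, 2]` and `[2, 4]` score as identical even though their lengths differ.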

  • Cross-Attention

    🌱 Foundations

    Attention where queries come from one sequence (e.g., decoder) and keys/values come from another (e.g., encoder). Used in encoder-decoder models like T5 and in multi-modal models where text attends to image patches.

D (5 terms)

  • Data Parallelism

    🏗️ Operations

    Distributing training data across multiple GPUs, each holding a copy of the model. Gradients are synchronized after each step. The simplest multi-GPU strategy. Used when the model fits in one GPU's memory.

  • Decoder

    🌱 Foundations

    The part of a transformer that generates output tokens one at a time, each conditioned on previous outputs and (optionally) encoder output. GPT, Claude, and Llama are **decoder-only** architectures.

  • Deterministic AI

    🪵 Reasoning

    Making AI outputs reproducible and predictable. Techniques: `temperature=0`, fixed `seed` parameter, structured output schemas, constrained decoding, evaluation-driven guardrails. Even with `temperature=0`, GPU floating-point arithmetic can introduce tiny variations. See Module R3.

  • Distillation (Knowledge Distillation)

    🍎 Transformation

    Training a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's probability distributions rather than raw data. Produces much smaller models that can retain most of the teacher's capability on the target tasks (figures of 90-95% are commonly cited).

  • DPO (Direct Preference Optimization)

    🍎 Transformation

    An alignment technique that skips the reward model step of RLHF and directly optimizes the policy from preference data. Simpler, more stable than RLHF. Used in Llama 3, Zephyr, and many fine-tuned models.

E (4 terms)

  • Embeddings

    🌱 Foundations

    Dense vector representations of text (or images, audio) that capture semantic meaning. "King" and "Queen" have similar embeddings. Used in RAG for similarity search, in classification, and in clustering. Common dimensions: 768, 1536, 3072.

  • Encoder

    🌱 Foundations

    The part of a transformer that processes the full input sequence in parallel (bidirectionally). BERT is encoder-only. Encoder-decoder models (T5, BART) use the encoder to understand input and the decoder to generate output.

  • Endpoint

    🏗️ Operations

    A URL that serves an AI model for inference. Azure AI Foundry provides **managed endpoints** (serverless, pay-per-token) and **dedicated endpoints** (reserved compute, predictable performance). Choice impacts cost, latency, and SLA.

  • Evaluation

    🍎 Transformation

    Measuring AI system quality. **Offline evaluation**: test set metrics (accuracy, F1, BLEU, ROUGE). **Online evaluation**: A/B testing in production. **LLM-as-judge**: using one model to score another's outputs. Azure AI Foundry has built-in evaluation tools.

F (6 terms)

  • Few-Shot Learning

    🪵 Reasoning

    Providing a few examples in the prompt to teach the model a task. Zero-shot = no examples. One-shot = one example. Few-shot = 2-10 examples. More examples improve consistency but use more tokens (and cost more).

  • Fine-Tuning

    🍎 Transformation

    Continuing the training of a pre-trained model on your specific dataset. Changes the model's weights. Use when prompting alone isn't enough — for domain-specific language, consistent formatting, or task specialization. See Module T1.

  • Flash Attention

    🏗️ Operations

    An algorithm that makes attention computation faster and more memory-efficient by tiling and recomputing instead of materializing the full attention matrix. Enables longer context windows without quadratic memory growth.

  • Floating Point Formats

    🏗️ Operations

    Numeric precision used for model weights. **FP32** (32-bit) = full precision, training. **FP16/BF16** (16-bit) = mixed precision training and inference. **INT8/INT4** = quantized inference, 2-4x smaller models. Lower precision = faster + cheaper but slightly less accurate.
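
    The size savings follow directly from bytes per weight. The sketch below estimates weight memory only (no activations, KV cache, or framework overhead) for a hypothetical 7B-parameter model; the helper name and the decimal-gigabyte convention are choices made for this example.

```python
# Bytes needed to store one weight in each format (INT4 packs two weights per byte).
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, fmt):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[fmt] / 1e9

# Hypothetical 7B-parameter model.
sizes_7b = {fmt: weight_memory_gb(7e9, fmt) for fmt in BYTES_PER_WEIGHT}
```

    For this model the weights take 28 GB in FP32, 14 GB in FP16/BF16, 7 GB in INT8, and 3.5 GB in INT4, which is where the "2-4x smaller" figure relative to 16-bit comes from.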

  • Foundation Model

    🌱 Foundations

    A large model pre-trained on broad data that can be adapted to many tasks. GPT-4, Claude Opus 4, and Llama 3.1 are foundation models. The term emphasizes that these models serve as "foundations" for specialized applications.

  • Function Calling

    🌿 Orchestration

    A model capability where it outputs structured JSON describing a function to call, rather than plain text. The application executes the function and feeds results back. Enables models to interact with databases, APIs, and tools. See Module O3.
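
    The application side of this loop can be sketched without any model API. In the example below, `get_weather`, the `TOOLS` registry, and the JSON shape with `name` and `arguments` keys are all illustrative assumptions; each provider defines its own exact schema.

```python
import json

def get_weather(city):
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "temp_c": 21}

# Registry mapping tool names the model may emit to Python callables.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(raw_call):
    """Parse the model's JSON tool call, run the function, return a JSON result."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    args = call.get("arguments", {})
    if isinstance(args, str):
        # Some APIs deliver arguments as a JSON-encoded string.
        args = json.loads(args)
    return json.dumps(TOOLS[name](**args))

# Stand-in for what a model might emit instead of plain text.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
tool_result = execute_tool_call(model_output)
```

    The string in `tool_result` is what the application would send back to the model as the tool's answer. Checking the name against a registry keeps the model from invoking arbitrary code.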

G (4 terms)

  • GGUF (GPT-Generated Unified Format)

    🏗️ Operations

    A file format for quantized models optimized for CPU inference via llama.cpp. Common for running models locally. Variants: Q4_K_M (4-bit, medium quality), Q5_K_M, Q8_0 (8-bit, best quality).

  • GPU (Graphics Processing Unit)

    🏗️ Operations

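    The hardware that makes AI possible. AI workloads use **NVIDIA A100, H100, H200, B200** GPUs. Key specs: VRAM (memory), TFLOPS (compute), memory bandwidth. A single H100 has 80GB VRAM. A 70B parameter model needs about 140GB for its weights alone in FP16, so it must be split across two or more H100s or quantized (about 70GB in INT8, 35GB in INT4) to fit on one.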

  • Grounding

    🪵 Reasoning

    Anchoring model responses in factual, verifiable information. Techniques: RAG (retrieve relevant docs), system messages with facts, structured data injection, citation requirements. The primary defense against hallucination. See Module R3.

  • Guardrails

    🪵 Reasoning

    Rules and filters that constrain AI behavior. Input guardrails filter harmful prompts. Output guardrails filter inappropriate responses. Can be rule-based (regex, blocklists), ML-based (classifiers), or LLM-based (a second model checks the first).

H (3 terms)

  • Hallucination

    🪵 Reasoning

    When a model generates confident-sounding but factually incorrect information. Causes: training data gaps, statistical pattern matching without understanding, high temperature. Mitigation: RAG, grounding, low temperature, structured output, evaluation. See Module R3.

  • Hyperparameter

    🌱 Foundations

    A parameter set before training begins (not learned). Examples: learning rate, batch size, number of epochs, LoRA rank. Hyperparameter tuning is a key part of fine-tuning. See Module T1.

I (3 terms)

  • Inference

    🌱 Foundations

    Running a trained model to generate predictions or outputs. Unlike training (which updates weights), inference only reads weights. Inference workloads are typically latency-sensitive and require different infrastructure than training.

  • In-Context Learning (ICL)

    🪵 Reasoning

    The ability of LLMs to learn tasks from examples provided in the prompt, without any weight updates. Few-shot prompting is a form of ICL. Remarkable because the model "learns" at inference time, not training time.

  • Instruction Tuning

    🍎 Transformation

    Fine-tuning a model on a dataset of (instruction, response) pairs. Teaches the model to follow instructions rather than just predict next tokens. The difference between base GPT-4 and ChatGPT is largely instruction tuning + RLHF.

J (1 term)

  • JSON Mode

    🪵 Reasoning

    A model configuration that constrains output to valid JSON. Essential for function calling, API integration, and structured data extraction. Supported by OpenAI, Azure OpenAI, and most modern model APIs. Reduces parsing errors in production.

K (2 terms)

  • KV Cache (Key-Value Cache)

    🏗️ Operations

    During autoregressive generation, the model caches the key and value tensors from previous tokens so they don't need to be recomputed. This is what makes generation fast but **eats VRAM**. A 128K context window with a 70B model can consume 40+ GB of KV cache.
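
    The 40+ GB figure can be reproduced with simple arithmetic. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 key/value heads via grouped-query attention, head dimension 128) and 16-bit cache entries; these numbers are assumptions of the example, and other 70B models will differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size

# Assumed model shape (Llama-3-70B-like) with a 128K-token context.
size_bytes = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                            seq_len=128 * 1024)
size_gib = size_bytes / 2**30
```

    Under these assumptions a single full-length sequence needs 40 GiB of cache, and the total scales linearly with batch size and context length.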

  • Knowledge Cutoff

    🌱 Foundations

    The date after which a model has no training data. GPT-4o: Oct 2023. Claude Opus 4: early 2025. Any question about events after the cutoff requires RAG or tool access to answer correctly.

L (4 terms)

  • LangChain

    🌿 Orchestration

    An open-source framework for building LLM applications. Provides chains, agents, memory, and tool integrations. Python and JavaScript versions. Compare with Semantic Kernel (O1) — LangChain is more community-driven, SK is more enterprise/Microsoft-integrated.

  • Large Language Model (LLM)

    🌱 Foundations

    A neural network with billions of parameters trained on massive text corpora to predict the next token. The "large" refers to parameter count (7B to 1T+). Most modern text-based GenAI applications are built on LLMs.

  • LoRA (Low-Rank Adaptation)

    🍎 Transformation

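    A parameter-efficient fine-tuning technique that freezes original model weights and trains small rank-decomposition matrices alongside them. Reduces the number of trainable parameters, and with it gradient and optimizer memory, by 100x or more; the frozen base model still has to run the forward and backward passes. LoRA adapters are typically 10-100MB vs the full model at 10-100GB. See Module T1.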
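
    A parameter count for one weight matrix shows where the savings come from. The sketch assumes a 4096 x 4096 projection and rank 8; both values are illustrative, not taken from a specific model.

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_in * d_out

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for the low-rank pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

# Illustrative sizes: one square projection matrix, rank-8 adapter.
d_in = d_out = 4096
rank = 8
full = full_param_count(d_in, d_out)
adapter = lora_param_count(d_in, d_out, rank)
reduction = full / adapter
```

    For this matrix the adapter trains 65,536 parameters instead of 16,777,216, a 256x reduction, and the ratio improves further as the matrix gets larger or the rank gets smaller.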

  • Latency

    🏗️ Operations

    Time from request to first response token (TTFT — Time To First Token) or full response (TTLR — Time To Last Response). Production targets: TTFT < 500ms, TTLR < 3s for interactive use. Affected by model size, input length, and infrastructure.

M (7 terms)

  • MCP (Model Context Protocol)

    🌿 Orchestration

    An open protocol (by Anthropic, now industry-wide) for connecting AI models to external tools and data sources. MCP servers expose tools via a standardized schema. MCP clients (in agents) discover and call these tools. See Module O3.

  • Memory (Agent)

    🌿 Orchestration

    How an agent retains information across interactions. **Short-term memory**: current conversation context. **Long-term memory**: persisted facts (vector store, database). **Episodic memory**: past conversation summaries. Critical for multi-turn and multi-session agents.

  • Mixed Precision

    🏗️ Operations

    Using multiple floating-point formats during training/inference. Compute-heavy operations use FP16/BF16 for speed; accumulations use FP32 for accuracy. Standard for all modern AI training. BF16 preferred over FP16 for stability.

  • Model Catalog (Azure AI Foundry)

    🏗️ Operations

    Azure's marketplace of 1,700+ AI models. Includes OpenAI (GPT-4o, o1), Meta (Llama), Mistral, Cohere, and more. Models can be deployed as serverless APIs (pay-per-token) or to dedicated compute.

  • Model Parallelism

    🏗️ Operations

    Distributing a single model across multiple GPUs when it doesn't fit in one GPU's memory. **Tensor parallelism**: splits the weight matrices inside each layer across GPUs. **Pipeline parallelism**: assigns different layers to different GPUs. Complex in practice.

  • Multi-Agent

    🌿 Orchestration

    Systems where multiple AI agents collaborate, each with specialized roles. Patterns: supervisor (one agent orchestrates), swarm (peer-to-peer), pipeline (sequential handoff). See Module O2.

  • Multi-Modal

    🌱 Foundations

    Models that process multiple input types: text, images, audio, video. GPT-4o, Claude Opus 4, and Gemini are multi-modal. Enables use cases like image understanding, document extraction, video analysis.

N (2 terms)

  • Neural Network

    🌱 Foundations

    A computational model inspired by biological neurons. Layers of nodes (neurons) connected by weighted edges. Training adjusts weights to minimize a loss function. Transformers are a specific type of neural network architecture.

  • Next-Token Prediction

    🌱 Foundations

    The fundamental task of decoder LLMs: given a sequence of tokens, predict the probability distribution over the vocabulary for the next token. Every text generation capability — writing, reasoning, coding — emerges from this single task.

O (2 terms)

  • ONNX (Open Neural Network Exchange)

    🏗️ Operations

    An open format for representing ML models. Enables training in one framework (PyTorch) and deploying in another (ONNX Runtime). Azure uses ONNX Runtime for optimized inference. Supports quantization and hardware acceleration.

  • Orchestrator

    🌿 Orchestration

    The component that coordinates between an LLM and other system components (tools, memory, RAG, other agents). Semantic Kernel's planner, LangChain's agent executor, and Microsoft Agent Framework's runtime are all orchestrators.

P (9 terms)

  • Parameters (Model Parameters)

    🌱 Foundations

    The learned weights of a neural network. GPT-4: ~1.8T (estimated, MoE), Llama 3.1: 405B, Phi-4: 14B. More parameters generally = more capability but require more compute and memory. Size ≠ quality (architecture and training data matter enormously).

  • Parameters (Generation Parameters)

    🌱 Foundations

    Settings that control text generation behavior:

    | Parameter | Range | What It Controls | Impact |
    |-----------|-------|------------------|--------|
    | **Temperature** | 0.0–2.0 | Randomness of token selection | 0.0 = deterministic, 1.0 = balanced, 2.0 = creative chaos |
    | **Top-k** | 1–100 | Number of tokens to consider | Lower = more focused, higher = more diverse |
    | **Top-p** (nucleus) | 0.0–1.0 | Cumulative probability threshold | 0.1 = only top tokens, 0.95 = most tokens |
    | **Frequency penalty** | -2.0–2.0 | Penalty for repeated tokens | Higher = less repetition |
    | **Presence penalty** | -2.0–2.0 | Penalty for tokens already used | Higher = more topic diversity |
    | **Max tokens** | 1–model max | Maximum output length | Controls cost and response length |
    | **Seed** | Integer | Random seed for reproducibility | Same seed + same input = similar output |
    | **Stop sequences** | Strings | Tokens that halt generation | Control where output ends |

  • Pipeline (AI/ML)

    🏗️ Operations

    A sequence of automated steps: data ingestion → preprocessing → model training/inference → evaluation → deployment. Azure AI Foundry, MLflow, and GitHub Actions are common orchestrators for AI pipelines.

  • Planner (Semantic Kernel)

    🌿 Orchestration

    A component that takes a user's goal and breaks it into a sequence of plugin function calls. **Handlebars Planner**: template-based. **Stepwise Planner**: iterative. In modern SK, planners are being replaced by function-calling models.

  • Plugin (Semantic Kernel)

    🌿 Orchestration

    A collection of functions that extend model capabilities. **Native plugins**: C#/Python code. **OpenAPI plugins**: API specifications. **OpenAI plugins**: compatible format. Plugins are how Semantic Kernel connects LLMs to business logic.

  • Private Endpoint

    🏗️ Operations

    An Azure networking feature that gives a service a private IP address within your VNet. Critical for AI Landing Zones — keeps model endpoints, search services, and storage off the public internet.

  • Prompt

    🪵 Reasoning

    The input given to a language model. Composed of: **system message** (instructions/persona), **user message** (the request), **assistant message** (previous responses). Prompt quality is the single biggest lever for output quality. See Module R1.

  • Prompt Flow

    🏗️ Operations

    An Azure AI Foundry tool for building and evaluating LLM workflows visually. Supports DAG-based flows with LLM nodes, Python nodes, tool nodes, and evaluation metrics. Being integrated into VS Code for local development.

  • PTU (Provisioned Throughput Units)

    🏗️ Operations

    Azure OpenAI's dedicated capacity model. Instead of pay-per-token, you reserve a fixed amount of throughput. Predictable performance and cost. Best for: high-volume production workloads with predictable demand. Compare with pay-as-you-go (PAYG).

Q (2 terms)

  • QLoRA (Quantized LoRA)

    🍎 Transformation

    LoRA applied to a quantized (4-bit) base model. Enables fine-tuning models of roughly 30B parameters on a single consumer GPU (24GB VRAM) and 65-70B models on a single 48GB GPU. Minimal quality loss compared to standard LoRA. One of the most accessible fine-tuning techniques. See Module T1.

  • Quantization

    🏗️ Operations

    Reducing the numerical precision of model weights (FP32 → INT8 → INT4). Reduces model size by 2-8x and speeds up inference. Techniques: GPTQ (post-training), AWQ (activation-aware), GGUF (CPU-friendly). Some quality loss at INT4.
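
    The basic idea can be shown with the simplest scheme, symmetric "absmax" INT8 quantization of a handful of weights. This is a sketch of the principle only; GPTQ and AWQ are considerably more sophisticated, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

# Made-up weights for illustration.
weights = [0.12, -0.5, 0.33, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

    Each weight now takes one byte instead of four, and the rounding error per weight is bounded by half the scale factor. Fewer bits mean a coarser grid, which is the source of the quality loss at INT4.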

R (4 terms)

  • RAG (Retrieval-Augmented Generation)

    🪵 Reasoning

    A pattern that grounds LLM responses in retrieved documents. Flow: query → embed query → search vector store → retrieve relevant chunks → inject into prompt → generate response. The most common pattern for enterprise AI. See Module R2.

    ```mermaid
    %%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
    flowchart LR
        Q["User Query"] --> E["Embed Query"]
        E --> S["Search<br/>Vector Store"]
        S --> R["Retrieve<br/>Top-K Chunks"]
        R --> P["Augment<br/>Prompt"]
        P --> M["LLM<br/>Generate"]
        M --> A["Grounded<br/>Answer"]
        style Q fill:#f59e0b22,stroke:#f59e0b
        style S fill:#06b6d422,stroke:#06b6d4
        style M fill:#7c3aed22,stroke:#7c3aed
        style A fill:#10b98122,stroke:#10b981
    ```
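
    The retrieval and prompt-augmentation steps of this flow can be sketched end to end with a toy embedding. Here `embed` is a word-count stand-in for a real embedding model, the chunk list replaces a vector store, and the prompt template is made up; the final LLM call is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Inject the retrieved chunks into the prompt sent to the model."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Illustrative document chunks.
chunks = [
    "Private endpoints keep services off the public internet.",
    "LoRA trains small adapter matrices.",
    "Temperature controls randomness of token selection.",
]
query = "what does temperature control"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

    The string in `prompt` is what would be sent to the model in the final "generate" step. A production system would swap the word-count embedding for a learned one and the list scan for a vector index.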

  • Reranking

    🪵 Reasoning

    A second-pass ranking step after initial retrieval. A cross-encoder model scores each (query, document) pair more accurately than cosine similarity. Dramatically improves relevance. Azure AI Search supports semantic reranking natively.

  • Responsible AI

    🍎 Transformation

    Microsoft's framework for trustworthy AI: fairness, reliability, safety, privacy, inclusiveness, transparency, accountability. Not just ethics — it's engineering practices: content filters, red teaming, evaluation, human oversight. See Module T2.

  • RLHF (Reinforcement Learning from Human Feedback)

    🍎 Transformation

    An alignment technique: (1) collect human preferences on model outputs, (2) train a reward model from preferences, (3) optimize the LLM policy against the reward model using PPO. How ChatGPT was aligned. Being partially replaced by DPO.

S (9 terms)

  • Scaling Laws

    🌱 Foundations

    Empirical observations (Kaplan et al., Chinchilla) about how model performance improves with more parameters, data, and compute. Key insight: performance follows predictable power laws, enabling compute-optimal model sizing.

  • Semantic Kernel

    🌿 Orchestration

    Microsoft's open-source SDK for building AI applications. Core concepts: kernel (orchestrator), plugins (tools), memory (context), planners (goal decomposition), connectors (external services). C# and Python SDKs. See Module O1.

  • Semantic Ranking

    🪵 Reasoning

    Azure AI Search's built-in reranking capability that uses a cross-encoder model to reorder search results by semantic relevance. Activated with `queryType: semantic`. Significant quality improvement over BM25 alone.

  • Serving (Model Serving)

    🏗️ Operations

    The infrastructure that makes models available for inference. Options: Azure AI Foundry endpoints, vLLM, TGI (Text Generation Inference), Triton, ONNX Runtime. Key metrics: TTFT, tokens/second, concurrent requests, GPU utilization.

  • SLM (Small Language Model)

    🌱 Foundations

    Models with fewer parameters (1B–14B) optimized for specific tasks or edge deployment. Microsoft Phi-4 (14B), Phi-3.5-mini (3.8B). Trade raw capability for speed, cost, and privacy (can run on-device). Often outperform larger models on narrow tasks.

  • Stop Sequence

    🌱 Foundations

    One or more strings that cause the model to stop generating when encountered. Examples: `"\n\n"`, `"```"`, `"END"`. Essential for controlling output format in production applications.
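
    Model APIs apply stop sequences on the server, but the effect can be illustrated on the client with a small helper. The function below is a hypothetical sketch that cuts a finished string at the earliest stop sequence.

```python
def apply_stop_sequences(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (the stop itself is dropped)."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at index 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "Answer: 42\n\nQuestion: what is next?"
trimmed = apply_stop_sequences(raw, ["\n\n", "END"])
```

    Here the blank line acts as the stop, so the follow-on "Question:" text that the model would otherwise keep generating is never returned.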

  • Structured Output

    🪵 Reasoning

    Constraining model output to a specific format (JSON schema, XML, Markdown). Techniques: JSON mode, function calling with schema, regex-constrained generation. Critical for API integrations. See Module R1.

  • System Message

    🪵 Reasoning

    The first message in a chat completion that sets the model's behavior, persona, constraints, and context. The most powerful prompt engineering lever. Example: "You are a helpful Azure architect. Only answer questions about Azure. Cite sources."

T (8 terms)

  • Temperature

    🌱 Foundations

    A generation parameter (0.0–2.0) that controls randomness. At 0.0, the model always picks the most likely token (near-deterministic). At 1.0, probabilities are used as-is. At 2.0, the distribution is flattened (more random). For factual tasks: 0.0–0.3. For creative tasks: 0.7–1.0.
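
    Mechanically, temperature divides the logits before the softmax. The sketch below uses three made-up logits to show the sharpening and flattening effect; temperature 0 is treated as a special case (plain argmax) because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
```

    The top token's probability is highest in `cold`, lower in `neutral`, and lowest in `hot`, while the ranking of the tokens never changes.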

  • Tensor

    🌱 Foundations

    A multi-dimensional array of numbers. Scalars are 0D tensors, vectors are 1D, matrices are 2D. Neural networks process tensors. Model weights are stored as tensors. Understanding tensor shapes helps debug AI infrastructure issues.

  • Tokenization

    🌱 Foundations

    Converting text into a sequence of integer token IDs. Each model has its own tokenizer. "Hello, world!" might become `[9906, 11, 1917, 0]`. Token count determines cost (pay-per-token) and context limits. Rule of thumb: 1 token ≈ 4 characters in English.

  • Tool Use

    🌿 Orchestration

    A model's ability to invoke external tools (APIs, databases, code interpreters). The model generates a structured tool call → the application executes it → the result is fed back to the model. Fundamental to agents. See Module O3.

  • Top-k

    🌱 Foundations

    A generation parameter. At each step, only the top-k most likely tokens are considered. `top_k=1` = greedy decoding (always pick the most likely). `top_k=40` = consider 40 options. Lower values = more focused, higher = more diverse.
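
    A minimal sketch of the filtering step, assuming the model's output has already been turned into a probability list (the four values are made up). Surviving tokens are renormalized so they sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Made-up next-token distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(probs, 2)
greedy = top_k_filter(probs, 1)
```

    With `k=2` only tokens 0 and 1 remain, at 0.625 and 0.375; with `k=1` token 0 is the only candidate, which is greedy decoding.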

  • Top-p (Nucleus Sampling)

    🌱 Foundations

    A generation parameter. Instead of a fixed count, considers the smallest set of tokens whose cumulative probability exceeds p. `top_p=0.1` = very focused (only top few tokens). `top_p=0.95` = most tokens eligible. Typically used instead of top_k, not with it.
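
    A minimal sketch of nucleus filtering over a made-up probability list. Tokens are added in order of likelihood until their cumulative probability reaches `p`; this version stops at "reaches or exceeds", a boundary choice that varies between implementations.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Made-up next-token distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)
focused = top_p_filter(probs, 0.1)
```

    Unlike a fixed count, the size of the candidate set adapts to the distribution: a confident model yields a small nucleus, an uncertain one a large nucleus.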

  • Transformer

    🌱 Foundations

    The neural network architecture behind all modern LLMs (Vaswani et al., 2017 — "Attention Is All You Need"). Key innovation: self-attention replaces recurrence, enabling massive parallelization. Variants: encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5).

  • Transfer Learning

    🌱 Foundations

    Using a pre-trained model as a starting point for a new task. Fine-tuning is a form of transfer learning. The pre-trained model has already learned general language understanding; you transfer that knowledge to your specific domain.

U (1 term)

  • Uncertainty

    🪵 Reasoning

    A model's lack of confidence in its output. LLMs are notoriously bad at expressing uncertainty — they generate fluent text regardless of confidence. Techniques to surface uncertainty: token probabilities (logprobs), calibration, abstention prompting ("say 'I don't know' if unsure").

V (3 terms)

  • Vector Database

    🪵 Reasoning

    A database optimized for storing and searching high-dimensional vectors (embeddings). Used in RAG for similarity search. Options: Azure AI Search (hybrid), Pinecone, Weaviate, Qdrant, Chroma, pgvector. Choice depends on scale, hybrid search needs, and cloud integration.

  • vLLM

    🏗️ Operations

    An open-source, high-performance LLM serving engine. Key innovation: PagedAttention (manages KV cache like virtual memory pages). Supports continuous batching, tensor parallelism, and OpenAI-compatible API. Popular for self-hosted model serving on AKS or Container Apps.

W (2 terms)

  • Weight

    🌱 Foundations

    A numerical parameter in a neural network that is learned during training. A 70B model has 70 billion weights. Weights encode everything the model "knows." Saving weights = saving the model. Fine-tuning = updating weights.

  • Window (Context Window)

    🌱 Foundations

    See **Context Window**.

X (1 term)

  • XAI (Explainable AI)

    🍎 Transformation

    Techniques for understanding why a model made a specific prediction or generation. Attention visualization, feature attribution, and chain-of-thought prompting all contribute to explainability. Important for regulated industries.

Z (2 terms)

  • Zero-Shot

    🪵 Reasoning

    Asking a model to perform a task without any examples. "Classify this review as positive or negative: {review}". Works well for tasks similar to the model's training data. Performance improves with few-shot examples.

  • When to Use What

    | Use Case | Temperature | Top-p | Why |
    |----------|-------------|-------|-----|
    | Factual Q&A | 0.0 | 1.0 | Maximum determinism |
    | Classification | 0.0 | 1.0 | Consistent labels |
    | Code generation | 0.0–0.2 | 0.95 | Correct but slightly varied |
    | Summarization | 0.3 | 0.9 | Faithful but fluent |
    | Creative writing | 0.7–1.0 | 0.95 | Diverse and interesting |
    | Brainstorming | 1.0–1.5 | 0.95 | Maximum diversity |