Module 12: AI Nexus — Quiz & Assessment
Duration: 20 minutes | Level: Assessment | Audience: All — test your AI Nexus knowledge | Last Updated: March 2026
Instructions
This assessment covers all modules in the AI Nexus curriculum. Each question is followed by an expandable answer with a detailed explanation. Use this to validate your understanding before discussing AI topics with application teams and stakeholders.
How to use this quiz:
- Read each question and formulate your answer before expanding the solution.
- Score yourself honestly — partial credit is fine.
- Track which modules need a second read based on where you struggle.
- Revisit in two weeks to measure retention.
Scoring Guide:
| Score | Level | Recommendation |
|---|---|---|
| 0-8 | Beginner | Review Modules 1-2 (Foundations + LLM Landscape) |
| 9-15 | Intermediate | Focus on weak areas, review relevant modules |
| 16-21 | Advanced | Ready for customer conversations |
| 22-25 | Expert | You can lead AI architecture discussions |
Section 1: GenAI Foundations (Modules 1-2)
Q1: What is a token in the context of large language models, and why should infrastructure architects care?
Click to reveal answer
Answer: A token is a subword unit that LLMs use to process text. It is not a word and not a character — it is a piece of text determined by the model's tokenizer. On average, one token is roughly 3-4 characters in English, or about 0.75 words. The word "infrastructure" might be split into two or three tokens depending on the tokenizer.
Why it matters for infra: Token count directly drives three cost and capacity dimensions:
- API cost — Azure OpenAI charges per 1,000 tokens (input and output priced separately).
- Latency — More tokens in the prompt means longer prefill time; more output tokens means longer generation time.
- Memory — Each token in the context window consumes GPU VRAM via the KV-cache. Larger context windows require more memory.
Key insight: Many architects confuse tokens with words. A 128K context window does not mean 128,000 words — it is closer to 96,000 words. Always convert to tokens when estimating costs and capacity.
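The word-to-token conversion above can be sanity-checked with a back-of-the-envelope helper. This is only the rough ~4-characters-per-token / ~0.75-words-per-token English heuristic; real counts depend on the specific tokenizer (e.g., tiktoken for OpenAI models), so treat it as an estimator, not ground truth:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from the ~4 chars/token English heuristic."""
    return max(1, round(len(text) / chars_per_token))

def tokens_to_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Convert a token budget into an approximate English word count."""
    return int(tokens * words_per_token)

# A 128K context window is roughly 96,000 words, not 128,000:
print(tokens_to_words(128_000))  # 96000
```

Use a real tokenizer for anything billing-sensitive; the heuristic is for quick capacity conversations only.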
Q2: Explain the difference between Temperature and Top-P. When would you set temperature to 0?
Click to reveal answer
Answer: Both temperature and Top-P control the randomness of model output, but they work differently:
| Parameter | Mechanism | Range | Effect |
|---|---|---|---|
| Temperature | Scales the probability distribution of the next token. Lower values sharpen the distribution (model picks the most likely token). | 0.0 - 2.0 | 0.0 = fully deterministic (greedy); 1.0 = default sampling; >1.0 = highly creative/chaotic |
| Top-P | Nucleus sampling — considers only the smallest set of tokens whose cumulative probability exceeds P. | 0.0 - 1.0 | 0.95 = considers top 95% probability mass; 0.1 = considers only the most likely tokens |
Set temperature to 0 (or near 0) when you need deterministic, factual, reproducible output — code generation, JSON extraction, classification tasks, compliance-sensitive responses.
Why it matters for infra: These parameters do not affect compute cost, but they affect output quality and length. Higher temperature produces longer, more varied outputs, which increases token consumption and latency. When building platform-level AI services, standardize these parameters in your APIM policies or prompt templates.
Common misconception: You should not tune both simultaneously. OpenAI recommends adjusting one and keeping the other at its default. Setting both temperature=0.2 and top_p=0.1 can produce overly constrained output.
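The two mechanisms from the table can be sketched in plain Python. The logits below are made-up illustrative values, not output from any real model; the point is to show how temperature reshapes the distribution and how Top-P trims the candidate set:

```python
import math

def sample_probs(logits, temperature=1.0):
    """Turn raw logits into next-token probabilities at a given temperature."""
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the single most likely token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_candidates(probs, p):
    """Nucleus sampling: smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

logits = [2.0, 1.0, 0.1]
print(sample_probs(logits, temperature=0.0))  # [1.0, 0.0, 0.0] (greedy)
```

Lowering the temperature sharpens the distribution (the top token's probability grows); Top-P then restricts sampling to the nucleus of most likely tokens.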
Q3: What is a context window, and what are the infrastructure implications of a 128K-token context window versus a 4K-token window?
Click to reveal answer
Answer: The context window is the maximum number of tokens (input + output combined) that a model can process in a single request. It is the model's "working memory" for a given conversation turn.
| Aspect | 4K Context | 128K Context |
|---|---|---|
| Input capacity | ~3,000 words | ~96,000 words |
| Use case | Short Q&A, classification | Long document analysis, multi-doc RAG |
| VRAM usage | Low | Very high (KV-cache scales linearly) |
| Latency (prefill) | Fast | Slow — must process all input tokens |
| Cost per request | Lower | Significantly higher |
Why it matters for infra: A 128K context window does not come for free. The KV-cache (key-value cache) that stores attention state grows linearly with context length and consumes GPU VRAM. A single 128K-token request on GPT-4o can consume several gigabytes of VRAM just for the KV-cache. This directly impacts how many concurrent requests a GPU can serve.
Key insight: Just because a model supports 128K tokens does not mean every request should use 128K tokens. Effective RAG architectures retrieve only the most relevant chunks to keep context small, fast, and cheap. Over-stuffing the context window is one of the most common (and expensive) mistakes.
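The KV-cache growth described above can be sized with the standard formula. The layer/head numbers below are Llama-3-70B-like illustrations (2 for keys+values, grouped-query attention with 8 KV heads, fp16); proprietary model internals such as GPT-4o's are not public, so treat the concrete figures as an assumption:

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 (keys + values) x layers x tokens x kv_heads x head_dim x precision."""
    return 2 * n_layers * context_tokens * n_kv_heads * head_dim * bytes_per_value

# Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes(1, 80, 8, 128)        # 327,680 bytes (~320 KB per token)
full_128k = kv_cache_bytes(131_072, 80, 8, 128)  # 40 GiB for a single request
print(per_token, full_128k / 2**30)
```

Linear scaling is the key takeaway: doubling the context doubles the KV-cache, which directly halves the number of concurrent long-context requests a GPU can hold.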
Q4: What are the two phases of LLM inference, and why does this distinction matter for capacity planning?
Click to reveal answer
Answer: LLM inference consists of two distinct phases:
1. Prefill (prompt processing): The model processes all input tokens in parallel. This phase is compute-bound (GPU FLOPs). Latency scales with input length but benefits from parallelism. This phase produces the "time to first token" (TTFT) metric.
2. Decode (token generation): The model generates output tokens one at a time, autoregressively. Each new token depends on all previous tokens. This phase is memory-bandwidth-bound. Latency is measured as "time per output token" (TPOT) or "inter-token latency."
Why it matters for infra:
- Prefill-heavy workloads (long documents, RAG with large context) need high GPU compute (FLOPs). Choose GPUs with strong compute throughput (A100, H100).
- Decode-heavy workloads (long-form generation, creative writing) need high memory bandwidth. Memory bandwidth is the bottleneck that determines tokens-per-second throughput.
- Capacity planning must account for both: a GPU serving a mix of prefill and decode work needs to balance compute and memory bandwidth.
Common misconception: Many architects think "bigger GPU = more tokens per second" for all workloads. In reality, decode-heavy workloads benefit more from memory bandwidth than raw compute. This is why the H100 (with 3.35 TB/s memory bandwidth) dramatically outperforms the A100 (2.0 TB/s) for generation-heavy tasks.
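A first-order roofline-style estimate makes the bandwidth argument concrete. It assumes batch size 1 and that generating each token streams the full model weights from VRAM once; this is an upper bound, not a benchmark:

```python
def decode_tokens_per_sec(model_bytes: float, mem_bw_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode rate: one full weight read per token."""
    return mem_bw_bytes_per_sec / model_bytes

GB, TB = 1e9, 1e12
model = 14 * GB  # e.g., a 7B-parameter model in fp16

h100 = decode_tokens_per_sec(model, 3.35 * TB)  # ~239 tok/s
a100 = decode_tokens_per_sec(model, 2.0 * TB)   # ~143 tok/s
print(round(h100), round(a100))
```

Note that the speedup ratio equals the bandwidth ratio (3.35 / 2.0 = 1.675x), independent of model size: exactly the "memory bandwidth, not raw compute" point above.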
Section 2: Models (Module 2)
Q5: What are the key trade-offs between open-weight models (like Llama, Phi) and proprietary models (like GPT-4o, Claude)?
Click to reveal answer
Answer:
| Dimension | Open-Weight Models | Proprietary Models |
|---|---|---|
| Data sovereignty | Full control — runs in your VNet, your GPU | Data sent to provider API (even if within Azure region) |
| Cost at scale | Lower marginal cost once GPU is provisioned | Pay-per-token, scales linearly with usage |
| Customization | Full fine-tuning, quantization, distillation | Limited to prompt engineering, some fine-tuning |
| Capability ceiling | Generally lower than frontier proprietary models | Highest benchmark performance (GPT-4o, Claude Opus, Gemini Ultra) |
| Operational burden | You manage serving, scaling, patching, monitoring | Fully managed by the provider |
| Licensing | Varies — check commercial use terms (Llama has restrictions above 700M MAU) | Governed by API terms of service |
Why it matters for infra: The choice between open and proprietary models fundamentally changes your infrastructure architecture. Open models require GPU compute provisioning (ND-series VMs, AKS with GPU node pools, or Managed Compute endpoints). Proprietary models require only API connectivity and APIM governance — no GPUs.
Key insight: This is not an either-or decision. Most production architectures use a tiered model strategy: a small open model (Phi-4-mini) for high-volume, simple tasks and a proprietary frontier model (GPT-4o) for complex reasoning. This can reduce costs by 60-80% while maintaining quality where it matters.
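A tiered-model router can start as a simple heuristic classifier in front of two deployments. The deployment names and the length/keyword heuristic below are purely illustrative placeholders; production routers typically use a small classifier model or explicit task metadata instead:

```python
def pick_deployment(prompt: str) -> str:
    """Route short, simple requests to a small cheap model; escalate complex ones."""
    complex_markers = ("analyze", "reason", "multi-step", "compare", "plan")
    is_complex = len(prompt) > 2000 or any(m in prompt.lower() for m in complex_markers)
    return "gpt-4o" if is_complex else "phi-4-mini"  # hypothetical deployment names

print(pick_deployment("Summarize this ticket in one line."))          # phi-4-mini
print(pick_deployment("Analyze the trade-offs between these designs."))  # gpt-4o
```

Putting this routing behind an APIM-fronted gateway means the tiering policy can evolve without changing any client.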
Q6: When should you recommend a reasoning model (like o1, o3, DeepSeek-R1) over a standard model (like GPT-4o)?
Click to reveal answer
Answer: Reasoning models use chain-of-thought (CoT) processing internally — they "think" before answering, consuming additional tokens and time. Use them when the task requires multi-step logical reasoning:
Use reasoning models for:
- Complex math, logic puzzles, or multi-step calculations
- Code analysis that requires understanding control flow across multiple files
- Scientific reasoning with multiple constraints
- Planning tasks that require evaluating trade-offs
- Tasks where accuracy matters far more than latency
Use standard models for:
- Summarization, translation, content generation
- Simple Q&A and classification
- High-throughput, low-latency scenarios (chatbots, autocomplete)
- Tasks where speed matters more than depth of reasoning
- Cost-sensitive workloads with high request volume
Why it matters for infra: Reasoning models consume 3-10x more tokens per request (the chain-of-thought tokens are billed even though the user does not see them). They also have significantly higher latency — a single o3 request can take 30-60 seconds. Your APIM timeout policies, retry logic, and client-side timeout configurations must account for this.
Common misconception: Reasoning models are not "better" at everything. For simple tasks like summarization, they are slower, more expensive, and sometimes worse (overthinking leads to overcomplication). Match the model to the task complexity.
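The hidden-token billing effect is easy to quantify. The per-1K prices below are made-up round numbers for illustration, not actual Azure OpenAI rates; substitute the current price sheet:

```python
def request_cost(input_toks: int, visible_out_toks: int, reasoning_toks: int,
                 in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Chain-of-thought (reasoning) tokens are billed as output tokens
    even though the user never sees them."""
    billed_out = visible_out_toks + reasoning_toks
    return input_toks / 1000 * in_price_per_1k + billed_out / 1000 * out_price_per_1k

# Same question, same visible answer length; the reasoning model bills 5x the output tokens.
standard = request_cost(500, 400, 0, in_price_per_1k=0.0025, out_price_per_1k=0.01)
reasoning = request_cost(500, 400, 1600, in_price_per_1k=0.0025, out_price_per_1k=0.01)
print(standard, reasoning)
```

At these illustrative rates the reasoning request costs roughly 4x the standard one for an identical visible answer, which is why per-request budgets and timeouts must be set per model tier.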
Q7: What are the Azure OpenAI deployment types, and when would you use each?
Click to reveal answer
Answer: Azure OpenAI offers multiple deployment types, each with different SLAs, pricing, and capacity guarantees:
| Deployment Type | Capacity Model | SLA | Best For |
|---|---|---|---|
| Standard | Shared, token-based billing | 99.9% availability (no latency SLA) | General workloads, variable demand |
| Provisioned (PTU) | Reserved throughput units | 99.9% | Predictable, high-throughput production workloads |
| Global Standard | Microsoft-managed routing across regions | 99.9% | Global applications needing lowest latency |
| Global Provisioned | Reserved PTUs with global routing | 99.9% | Enterprise global deployments with guaranteed capacity |
| Data Zone Standard | Data stays within geographic zone (e.g., EU) | 99.9% | Data residency compliance |
| Data Zone Provisioned | Reserved capacity within geographic zone | 99.9% | Regulated industries with data sovereignty needs |
Why it matters for infra:
- PTU (Provisioned Throughput Units) give guaranteed throughput but require commitment — right-sizing is critical. Over-provisioning wastes money; under-provisioning causes throttling.
- Standard deployments are simpler but subject to throttling under high load and noisy-neighbor effects.
- Data Zone deployments are essential for customers with EU data residency requirements (GDPR).
Key insight: PTU pricing is per-hour regardless of usage. If your workload is bursty (e.g., batch processing at night), you may save money with Standard (pay-per-token) during off-peak and PTU for peak hours. Some customers combine both deployment types in a single APIM-fronted architecture.
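The PTU-versus-Standard decision comes down to a break-even calculation. The hourly PTU rate and per-token price below are made-up round numbers; plug in real quotes from your pricing sheet:

```python
def monthly_costs(tokens_per_month: int, std_price_per_1k: float,
                  ptu_count: int, ptu_price_per_hour: float,
                  hours_per_month: int = 730) -> tuple:
    """Compare pay-per-token Standard vs always-on Provisioned (PTU) pricing."""
    standard = tokens_per_month / 1000 * std_price_per_1k
    provisioned = ptu_count * ptu_price_per_hour * hours_per_month  # billed regardless of usage
    return standard, provisioned

# Illustrative: 2B tokens/month at $0.005 per 1K vs 100 PTUs at $1/hour
std, ptu = monthly_costs(2_000_000_000, 0.005, 100, 1.0)
print(std, ptu)  # 10000.0 73000.0 -> Standard wins at this volume
```

Because PTU cost is flat while Standard cost scales with volume, there is a crossover token volume above which PTU becomes cheaper; bursty workloads that sit below it most of the month are candidates for the mixed Standard-plus-PTU pattern described above.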
Section 3: Azure AI Foundry (Module 3)
Q8: Explain the Hub-and-Project model in Azure AI Foundry. Why does it matter for enterprise governance?
Click to reveal answer
Answer: Azure AI Foundry uses a two-tier organizational model:
1. Hub: A shared governance container that holds common resources — Azure OpenAI connections, compute resources, storage, networking configuration, managed identity, and policy assignments. Think of it as the "platform layer." One Hub per region is the typical pattern.
2. Project: A workspace scoped to a specific team, application, or use case. Projects inherit connections and policies from their parent Hub but maintain their own assets (prompt flows, evaluations, datasets, deployments). Think of it as the "application layer."
```
Hub (Central IT owns)
|-- Shared AOAI connection (GPT-4o, GPT-4o-mini)
|-- Shared compute pool
|-- VNet integration, Private Endpoints
|-- Azure Policy assignments
|
|-- Project A (Team Alpha - Chatbot)
|-- Project B (Team Beta - Document Processing)
|-- Project C (Team Gamma - Code Assistant)
```
Why it matters for infra: This mirrors the landing zone pattern that cloud architects already use (Management Group = Hub, Subscription = Project). It enables centralized governance (networking, identity, cost controls) while giving teams autonomy to experiment. Without Hubs, every team creates its own AOAI instance, leading to sprawl, inconsistent security, and uncontrolled costs.
Key insight: The Hub-Project relationship maps directly to Azure RBAC. Hub Owners manage shared infrastructure; Project Contributors can deploy models and build flows without touching the underlying platform. This separation of concerns is critical for regulated industries.
Q9: What is the difference between a Serverless API endpoint and a Managed Compute endpoint in Azure AI Foundry?
Click to reveal answer
Answer:
| Dimension | Serverless API Endpoint | Managed Compute Endpoint |
|---|---|---|
| Infrastructure | No GPU management — fully managed by Microsoft | You provision dedicated VMs/GPUs |
| Billing | Pay-per-token (like a SaaS API) | Pay for compute uptime (VM hours) |
| Models available | Select models from Model Catalog (Llama, Mistral, Cohere, etc.) | Any model you can deploy (HuggingFace, custom) |
| Scaling | Automatic, managed by Microsoft | You configure autoscale rules |
| Customization | Limited — use the model as-is | Full control — custom containers, quantization |
| Networking | Public endpoint (Private Link coming) | VNet integration, Private Endpoints |
| Best for | Quick experimentation, variable/low-volume workloads | Production workloads needing full control, high volume, or VNet isolation |
Why it matters for infra: Serverless API endpoints are the fastest path from "I want to try Llama 3" to a working endpoint — no GPU provisioning, no capacity planning. But for production workloads requiring network isolation, predictable performance, or custom model configurations, Managed Compute endpoints are the right choice.
Common misconception: "Serverless" does not mean "no server." It means you do not manage the server. There are still GPUs running your inference — Microsoft manages them. The trade-off is less control for less operational overhead.
Section 4: Microsoft Copilot Ecosystem (Module 4)
Q10: When should a customer build a custom AI application versus using Copilot Studio?
Click to reveal answer
Answer:
| Factor | Copilot Studio | Custom AI Application |
|---|---|---|
| Builder persona | Citizen developers, business analysts | Professional developers |
| Time to deploy | Hours to days | Weeks to months |
| Customization | Low-code, topic-based flows, plugin connectors | Unlimited — custom models, custom UX, custom logic |
| Integration | Deep M365 integration out of the box | You build every integration |
| Data sources | Pre-built connectors (SharePoint, Dataverse, web) | Any data source you can code against |
| Hosting | Fully managed (Microsoft SaaS) | You manage infrastructure (App Service, AKS, etc.) |
| Best for | Internal helpdesks, HR bots, IT support, FAQ | Customer-facing products, complex multi-agent workflows, proprietary AI features |
Why it matters for infra: Copilot Studio requires zero infrastructure provisioning — it is pure SaaS. Recommending it when appropriate can save months of development and significant infrastructure cost. However, it has guardrails and limitations. If the customer needs custom model orchestration, multi-agent systems, or fine-grained control over the AI pipeline, a custom app (using Semantic Kernel, LangChain, or direct API calls) is the right path.
Key insight: Many enterprise scenarios start with Copilot Studio for the first version and migrate to a custom app as requirements grow. Design your architecture to support this evolution — use APIM as the AI gateway from day one so the backend can change without disrupting consumers.
Q11: What role does Microsoft Graph play in M365 Copilot, and why is it important for architects?
Click to reveal answer
Answer: Microsoft Graph is the data layer that powers M365 Copilot. When a user asks Copilot a question, the system does not search the internet — it queries Microsoft Graph to retrieve the user's relevant organizational data:
- Emails and calendar from Exchange Online
- Files and documents from SharePoint and OneDrive
- Chat messages from Teams
- People and org chart from Azure AD / Entra ID
- Tasks from Planner and To Do
The flow is: User prompt --> Copilot orchestrator --> Microsoft Graph (retrieval) --> LLM (grounding + generation) --> Response.
Why it matters for infra:
- Data governance is critical. Copilot respects existing Microsoft 365 permissions (RBAC). If a user can see a document in SharePoint, Copilot can surface it. This means overshared content becomes an AI risk — improperly permissioned SharePoint sites can leak sensitive data through Copilot responses.
- Network architecture. Graph API calls happen within the Microsoft cloud, but if you have Conditional Access policies, DLP rules, or network restrictions, they can impact Copilot's ability to retrieve data.
- Data quality. Copilot is only as good as the data in Graph. If SharePoint is a mess of duplicated, outdated documents, Copilot will retrieve and ground on bad data.
Key insight: The number one readiness task for M365 Copilot deployment is not technical infrastructure — it is data governance and permissions hygiene. Architects should collaborate with security teams to audit SharePoint permissions, sensitivity labels, and DLP policies before enabling Copilot.
Section 5: RAG Architecture (Module 5)
Q12: Why would you choose RAG over fine-tuning to give an LLM domain-specific knowledge?
Click to reveal answer
Answer:
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Data freshness | Real-time — index updates propagate immediately | Stale — must retrain to incorporate new data |
| Cost | Medium (embedding + vector DB) | High (GPU training compute + data prep) |
| Auditability | High — you can trace which documents were retrieved | Low — knowledge is baked into weights, not traceable |
| Hallucination control | Model is grounded on retrieved facts with citations | Model may still hallucinate from training data |
| Maintenance | Update the index, not the model | Retrain and redeploy the model |
| Data volume | Works with any corpus size | Needs hundreds to thousands of curated examples |
| Skill change | No — model behavior stays the same, just better informed | Yes — can change writing style, tone, domain vocabulary |
When fine-tuning is better: When you need to change the model's behavior, style, or tone (e.g., medical report writing in a specific format), or when you need the model to learn a specialized vocabulary that it handles poorly out of the box.
Why it matters for infra: RAG introduces infrastructure components that architects must design and manage: a vector database (Azure AI Search, Cosmos DB with vector), an embedding pipeline (batch or real-time), a document ingestion pipeline (chunking, parsing), and an orchestration layer. Fine-tuning introduces GPU training infrastructure but no runtime retrieval components.
Key insight: RAG and fine-tuning are not mutually exclusive. The best production systems often use both — fine-tune a model for domain-specific style, then use RAG to ground it on current data. But start with RAG; it solves 80% of use cases at a fraction of the cost.
Q13: How do you choose a chunking strategy, and why does chunk size matter?
Click to reveal answer
Answer: Chunking is how you split documents into pieces for embedding and retrieval. The strategy directly impacts retrieval quality:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens (e.g., 512) with overlap | Simple, predictable; works for homogeneous content |
| Semantic | Split based on meaning boundaries (paragraphs, sections) | Documents with clear structure (manuals, policies) |
| Recursive | Hierarchical splitting — try paragraph, then sentence, then character | General-purpose; good default for mixed content |
| Document-aware | Respect document structure (headings, tables, code blocks) | Technical docs, markdown, HTML, PDFs with tables |
Chunk size trade-offs:
| Small chunks (128-256 tokens) | Large chunks (1024-2048 tokens) |
|---|---|
| Precise retrieval — finds exact relevant passage | More context per chunk — fewer retrieval misses |
| Higher risk of losing context (answer split across chunks) | Higher risk of noise (irrelevant content in the chunk) |
| More chunks to embed and store (cost) | Fewer chunks, lower embedding cost |
| Works well with strong reranking | Works well for broad, exploratory queries |
Why it matters for infra: Chunk size affects storage costs (more chunks = more vectors = more index size), embedding compute costs (more chunks = more embedding API calls), and retrieval latency (larger index = slower search without proper optimization). A 1-million-page document corpus chunked at 256 tokens will produce significantly more vectors than the same corpus chunked at 1024 tokens.
Key insight: There is no universally correct chunk size. Always benchmark with your actual data and queries. Start with 512 tokens and 25% overlap, then adjust based on retrieval quality metrics (recall, precision, answer relevance).
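The 512-token / 25%-overlap starting point can be implemented as a minimal fixed-size chunker. This sketch operates on a pre-tokenized sequence (represented here as a plain list); a real pipeline would run it over tokenizer output and attach metadata per chunk:

```python
def chunk_fixed(tokens: list, chunk_size: int = 512, overlap: int = 128) -> list:
    """Split a token sequence into fixed-size chunks with overlap between neighbors."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
        start += step
    return chunks

chunks = chunk_fixed(list(range(1000)))
print(len(chunks))  # 3 chunks: [0:512], [384:896], [768:1000]
```

The overlap ensures an answer that straddles a chunk boundary still appears whole in at least one chunk, at the cost of the extra storage and embedding calls noted above.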
Q14: What is hybrid search, and why is it better than pure vector search for enterprise RAG?
Click to reveal answer
Answer: Hybrid search combines two retrieval methods:
1. Vector search (semantic): Converts the query into an embedding and finds chunks with similar meaning. Understands synonyms and intent ("cost reduction" matches "saving money").
2. Keyword search (lexical, BM25): Traditional full-text search based on exact term matching. Excels at finding specific identifiers, product codes, error messages, or proper nouns.
Hybrid search runs both in parallel and merges the results using Reciprocal Rank Fusion (RRF), getting the benefits of both approaches.
| Scenario | Vector Only | Keyword Only | Hybrid |
|---|---|---|---|
| "How to reduce Azure costs" | Excellent | Good | Excellent |
| "Error code KB-40178" | Poor (no semantic meaning) | Excellent | Excellent |
| "VNET peering timeout" | Good | Good | Excellent |
| Typos in query | Good (embeddings are fuzzy) | Poor | Good |
Why it matters for infra: Azure AI Search supports hybrid search natively — a single API call can combine the search parameter (keyword) with the vectorQueries parameter (vector). There is no additional infrastructure to deploy; it is a configuration choice. Not enabling hybrid search leaves retrieval quality on the table.
Key insight: In Microsoft's internal benchmarks, hybrid search with semantic ranking consistently outperforms pure vector search by 5-15% on relevance metrics across enterprise document corpora. Always default to hybrid unless you have a specific reason not to.
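Reciprocal Rank Fusion itself is only a few lines (k=60 is the constant from the original RRF paper; Azure AI Search performs this merge for you, so the sketch below is only to show the mechanism):

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Merge several ranked result lists: score(doc) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-c"]    # semantic ranking
keyword_hits = ["doc-c", "doc-a", "doc-d"]   # BM25 ranking
print(rrf_fuse([vector_hits, keyword_hits]))  # doc-a first: ranked high in both lists
```

Because RRF uses only ranks, not raw scores, it merges lists whose scoring scales are incomparable (cosine similarity vs BM25) without any normalization step.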
Q15: What is semantic ranking (reranking), and where does it fit in the RAG pipeline?
Click to reveal answer
Answer: Semantic ranking is a second-pass relevance scoring step that happens after initial retrieval:
Query --> [1. Retrieval: Get top 50 results] --> [2. Semantic Ranker: Rerank to top 5-10] --> [3. LLM: Generate answer]
The initial retrieval (vector + keyword) is fast but approximate — it uses embedding similarity or BM25 scores. The semantic ranker uses a cross-encoder model that reads the query AND each candidate document together, producing a much more accurate relevance score. It is slower (hence applied to a small candidate set, not the full index) but significantly more precise.
Why it matters for infra:
- Semantic ranking is a built-in feature of Azure AI Search (Semantic Ranker) — no additional infrastructure required. It is billed per 1,000 queries.
- It dramatically improves the quality of context passed to the LLM, which means better answers and fewer hallucinations.
- The alternative — retrieving more chunks to compensate for imprecise ranking — fills the context window with noise and increases LLM costs.
Key insight: Semantic ranking is one of the highest-impact, lowest-effort improvements you can make to a RAG pipeline. Many teams skip it and try to solve relevance problems by fine-tuning the LLM or increasing chunk overlap — both are more expensive and less effective than simply enabling the semantic ranker.
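The two-stage shape of the pipeline above can be sketched generically. The two scoring callables stand in for an ANN/BM25 retriever and a cross-encoder reranker; both are toy placeholders here, not real models:

```python
def retrieve_then_rerank(query: str, docs: list, cheap_score, precise_score,
                         retrieve_k: int = 50, final_k: int = 5) -> list:
    """Stage 1: cheap approximate scoring over the whole corpus.
    Stage 2: expensive precise scoring over the small candidate set only."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:final_k]

# Toy stand-ins: word overlap as the "retriever", overlap ratio as the "reranker".
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
ratio = lambda q, d: overlap(q, d) / max(1, len(d.split()))

docs = ["azure ai search hybrid", "azure pricing", "unrelated text"]
print(retrieve_then_rerank("azure ai search", docs, overlap, ratio,
                           retrieve_k=2, final_k=1))
```

The structural point is the asymmetry: the precise scorer runs only retrieve_k times per query, never over the full index, which is exactly why the slower cross-encoder is affordable.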