
T1: Fine-Tuning & Model Customization

Duration: 60–90 minutes | Level: Deep-Dive | Part of: 🍎 FROOT Transformation Layer | Prerequisites: F1 (GenAI Foundations), F2 (LLM Landscape) | Last Updated: March 2026




T1.1 The Customization Spectrum

Not every AI problem needs fine-tuning. Understanding the full spectrum of customization, from simple to complex, lets you pick the right tool:

| Method | Effort | Data Needed | Cost | When to Use |
|---|---|---|---|---|
| Prompt Engineering | Minutes | 0 | Free | Always start here |
| Few-Shot Examples | Hours | 5–20 examples | Minimal | When output format matters |
| RAG | Days | Your knowledge base | Medium | When accuracy on specific data matters |
| System Instructions | Hours | Rules & constraints | Free | When behavioral alignment matters |
| LoRA Fine-Tuning | Days–Weeks | 100–10K examples | $50–$500 | When domain language/style matters |
| Full Fine-Tuning | Weeks | 10K–100K examples | $1K–$50K | When deep behavioral change needed |
| Continued Pre-Training | Weeks–Months | Millions of tokens | $10K–$1M | New domain/language |
| Training From Scratch | Months–Years | Trillions of tokens | $1M–$100M+ | Only for frontier labs |

T1.2 When to Fine-Tune (and When Not To)

The Decision Framework

Fine-Tune When

| Use Case | Why Fine-Tuning Helps | Example |
|---|---|---|
| Domain-specific language | Model needs to understand jargon | Medical terminology, legal language |
| Consistent output format | Need reliable structure despite varying inputs | Always output XML in a specific schema |
| Behavioral alignment | Model should act in a specific way consistently | Always formal, always cites sources |
| Cost reduction | Replace long prompts with trained behavior | 2,000-token system message → fine-tuned behavior, saving per-call cost |
| Latency reduction | Shorter prompts = faster responses | Remove few-shot examples from every call |

Don't Fine-Tune When

| Situation | Better Alternative |
|---|---|
| Model needs current/changing data | RAG |
| You have <100 examples | Few-shot prompting |
| The issue is prompt clarity | Better prompts |
| You need quick iteration | Prompt engineering |
| You want to add new knowledge | RAG, not fine-tuning |

Critical Insight: Fine-tuning teaches the model how to respond, not what to know. For new knowledge, use RAG. For new behavior, use fine-tuning. For both, combine RAG + fine-tuning.
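The decision rules above can be condensed into a small helper. This is an illustrative sketch only; the function name and the <100-examples cutoff follow the rules of thumb in the tables, and real decisions also weigh cost, latency, and data quality:

```python
def recommend_customization(needs_new_knowledge: bool,
                            needs_behavior_change: bool,
                            num_examples: int) -> str:
    """Toy decision helper encoding this section's rules of thumb."""
    if needs_new_knowledge and needs_behavior_change:
        return "RAG + fine-tuning"       # new knowledge AND new behavior
    if needs_new_knowledge:
        return "RAG"                     # fine-tuning teaches how, not what
    if needs_behavior_change:
        if num_examples < 100:
            return "few-shot prompting"  # too little data to fine-tune
        return "LoRA fine-tuning"
    return "prompt engineering"          # always start here
```

Example: `recommend_customization(True, False, 0)` returns `"RAG"`, matching the "you want to add new knowledge" row above.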


T1.3 Fine-Tuning Methods

Full Fine-Tuning

Updates all model parameters. Most powerful, but most expensive.

| Aspect | Detail |
|---|---|
| Parameters updated | All (7B = 7 billion, 70B = 70 billion) |
| GPU requirements | 4–8x A100/H100 (80 GB) for 7B; 32+ for 70B |
| Training data | 10K–100K examples ideal |
| Risk | Catastrophic forgetting: the model loses general capabilities |
| When to use | When you need deep behavioral change AND have enough data |

LoRA (Low-Rank Adaptation)

The most practical fine-tuning technique. Freezes the original weights and trains small adapter matrices.

| LoRA Parameter | Typical Value | Effect |
|---|---|---|
| Rank (r) | 8–64 | Higher = more expressive, more compute |
| Alpha | 16–128 | Scaling factor; usually alpha = 2 × rank |
| Target modules | q_proj, v_proj, k_proj, o_proj | Which attention matrices to adapt |
| Dropout | 0.05–0.1 | Regularization to prevent overfitting |
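To see why adapters stay small, you can count LoRA parameters directly: each adapted d × d weight matrix gains two low-rank factors, A (d × r) and B (r × d), so 2 × d × r extra parameters. A quick sketch assuming illustrative 7B-class dimensions (hidden size 4096, 32 layers, square attention projections; real models with grouped-query attention differ):

```python
def lora_adapter_params(d_model: int, n_layers: int,
                        n_target_modules: int, r: int) -> int:
    # Each adapted d x d matrix gains A (d x r) + B (r x d) = 2*d*r params.
    return n_layers * n_target_modules * 2 * d_model * r

# Assumed dims: hidden size 4096, 32 layers, adapting q/k/v/o projections, r=16
params = lora_adapter_params(d_model=4096, n_layers=32, n_target_modules=4, r=16)
mb_fp16 = params * 2 / 1e6  # 2 bytes per FP16 parameter -> ~33.5 MB
```

About 16.8M trainable parameters, roughly 0.2% of a 7B model, which is why adapter files land in the tens of megabytes.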

QLoRA (Quantized LoRA)

LoRA applied to a 4-bit quantized base model. The democratizer of fine-tuning.

| Comparison | LoRA | QLoRA |
|---|---|---|
| Base model format | FP16/BF16 | NF4 (4-bit) |
| Memory for 7B model | ~14 GB | ~4 GB |
| Memory for 70B model | ~140 GB | ~35 GB |
| Quality loss | Minimal vs. full FT | <1% vs. LoRA |
| GPU required | A100 40GB+ | RTX 4090 (24 GB) |

T1.4 LoRA & QLoRA: The Practical Revolution

Why LoRA Changed Everything

Before LoRA (introduced in 2021): fine-tuning a 7B model required 4x A100 GPUs ($30K+ in hardware). Fine-tuning 70B? Enterprise-only.

After LoRA: fine-tune 7B on a single consumer GPU, and 70B on a single A100 (via QLoRA). Adapters are 10–100 MB, instead of the full model at 14–140 GB.

```
Full fine-tuning: 70B × 2 bytes (FP16) = 140 GB weights (+ gradients & optimizer state) → need 32x A100
LoRA:  70B frozen + ~0.1B trained = ~140 GB + ~200 MB → need 4x A100
QLoRA: 70B in 4-bit + ~0.1B trained = ~35 GB + ~200 MB → need 1x A100
```
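The arithmetic above is simple enough to sketch. This is a weight-only back-of-the-envelope estimate, ignoring activations, gradients, and quantization block overhead (which is why the QLoRA table quotes ~4 GB rather than 3.5 GB for a 7B model in NF4):

```python
# Bytes per parameter by storage format
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def base_model_vram_gb(n_params_billion: float, fmt: str) -> float:
    """Rough weight-only VRAM estimate in GB (1B params * bytes/param)."""
    return n_params_billion * BYTES_PER_PARAM[fmt]

full = base_model_vram_gb(70, "fp16")   # 140.0 GB, as in the comparison above
quant = base_model_vram_gb(70, "nf4")   # 35.0 GB, fits one 80 GB A100
```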

LoRA Best Practices

| Practice | Recommendation | Why |
|---|---|---|
| Start with low rank | r=8 or r=16 | Most tasks don't need high rank |
| Target attention layers | q_proj, v_proj minimum | Most impactful for behavioral change |
| Set alpha = 2 × rank | alpha=32 for r=16 | Good balance of adapter influence |
| Use a moderate learning rate | 1e-4 to 2e-4 | Lower than full FT to avoid instability |
| Train for 1–3 epochs | More risks overfitting | Monitor eval loss; stop when it rises |
| Validate frequently | Every 100–500 steps | Catch overfitting early |
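The "monitor eval loss, stop when it rises" practice is standard early stopping. A minimal sketch; the `patience` parameter (how many evals without improvement before stopping) is an assumption, not from the table:

```python
def should_stop(eval_losses: list[float], patience: int = 3) -> bool:
    """Stop when eval loss has not improved for `patience` consecutive evals."""
    if len(eval_losses) <= patience:
        return False  # not enough history yet
    best_so_far = min(eval_losses[:-patience])
    # Stop only if every recent eval failed to beat the earlier best
    return all(loss >= best_so_far for loss in eval_losses[-patience:])
```

With the table's 100–500-step eval cadence, this catches overfitting within a few hundred steps of eval loss turning upward.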

T1.5 Alignment: RLHF & DPO

RLHF (Reinforcement Learning from Human Feedback)

The technique that turned GPT-3 into ChatGPT. RLHF is a three-step pipeline: (1) supervised fine-tuning on demonstrations, (2) training a reward model on human preference rankings, and (3) reinforcement learning (typically PPO) against that reward model.

DPO (Direct Preference Optimization)

A simpler alternative that skips the reward model:

| Aspect | RLHF | DPO |
|---|---|---|
| Complexity | 3-step pipeline | Single training step |
| Reward model | Required (separate training) | Not needed |
| Stability | Can be finicky (PPO training) | More stable |
| Data needed | Preference pairs + reward model data | Preference pairs only |
| Quality | Gold standard | Comparable (sometimes better) |
| Adoption | ChatGPT, Claude | Llama 3, Zephyr, many open models |
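The "single training step" in the table works because DPO reduces preference learning to a supervised loss on each (chosen, rejected) pair. A per-pair sketch of the standard DPO objective; the log-prob arguments are placeholder values (summed over response tokens), and `beta` is the usual KL-strength hyperparameter:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - ref margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference model the margin is 0 and the loss is log 2; the loss falls as the policy favors the chosen response more strongly than the reference does.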

Preference Data Format

```json
{
  "prompt": "Explain what a token is in the context of LLMs.",
  "chosen": "A token is the smallest unit of text that a language model processes. Rather than reading individual characters, models break text into subword units called tokens using algorithms like Byte-Pair Encoding (BPE). For example, 'unbelievable' might become ['un', 'believ', 'able']. In GPT-4, one token averages about 4 English characters. Token count matters because it determines both the cost and the context window usage.",
  "rejected": "A token is basically a word. Like when you type 'hello world' that's 2 tokens. Tokens are used by AI."
}
```

T1.6 Data Preparation

Data quality is more important than data quantity for fine-tuning. A thousand excellent examples outperform ten thousand mediocre ones.

Data Format (Chat Format)

```json
{"messages": [{"role": "system", "content": "You are an Azure architect..."}, {"role": "user", "content": "What's the best VM for a SQL workload?"}, {"role": "assistant", "content": "For SQL workloads, I recommend..."}]}
{"messages": [{"role": "system", "content": "You are an Azure architect..."}, {"role": "user", "content": "Explain Azure Front Door."}, {"role": "assistant", "content": "Azure Front Door is a global..."}]}
```
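Before training, it is worth validating every JSONL line structurally. A minimal checker; the rules it encodes (known roles, non-empty content, record ends with an assistant turn) are common conventions, and individual providers add their own requirements:

```python
import json

def validate_chat_record(line: str) -> bool:
    """Minimal structural check for one JSONL chat-format training record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return False
    for m in messages:
        if not isinstance(m, dict) or m.get("role") not in {"system", "user", "assistant"}:
            return False
        if not m.get("content"):
            return False
    # The record must end with the assistant turn the model learns to produce
    return messages[-1]["role"] == "assistant"
```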

Data Quality Checklist

| Check | Why It Matters | How to Verify |
|---|---|---|
| Diverse inputs | Prevent overfitting to patterns | Cluster queries by type, ensure coverage |
| High-quality outputs | Model learns from examples | Expert review of assistant messages |
| Consistent format | Reinforces desired structure | Schema validation on outputs |
| No contradictions | Confuses the model | Deduplication and consistency check |
| Balanced classes | Prevents bias toward over-represented types | Category distribution analysis |
| Length variety | Handles both short and long responses | Histogram of response lengths |
| Edge cases | Teaches robust behavior | Include tricky/ambiguous examples (10–20%) |
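Part of the "no contradictions" check is catching repeated or near-identical prompts that might carry conflicting answers. A cheap first pass using only whitespace and case normalization (real pipelines often add embedding-based near-duplicate detection):

```python
def find_duplicate_prompts(examples: list[dict]) -> list[tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs of records whose
    normalized prompt text appears more than once."""
    seen: dict[str, int] = {}
    dupes: list[tuple[int, int]] = []
    for i, ex in enumerate(examples):
        key = " ".join(ex["prompt"].lower().split())  # normalize case/whitespace
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes
```

Any flagged pair should be reviewed by hand: keep one copy if the answers agree, and fix or drop the pair if they contradict.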

How Much Data?

| Model Size | Minimum | Good | Excellent |
|---|---|---|---|
| 7B | 100 examples | 500–1,000 | 5,000+ |
| 13B | 200 examples | 1,000–2,000 | 10,000+ |
| 70B | 500 examples | 2,000–5,000 | 20,000+ |

T1.7 The Fine-Tuning Pipeline

End to end, the pipeline follows the sections of this module: prepare and validate data → train (LoRA/QLoRA or a managed service) → evaluate against the base model → deploy → monitor and iterate.


T1.8 Evaluation: Did It Work?

Evaluation Framework

| Metric | What It Measures | How to Compute | Target |
|---|---|---|---|
| Loss (eval) | Training convergence | Built into training | Decreasing, not diverging |
| Accuracy | Correct answers on test set | Compare to ground truth | >85% for most tasks |
| Format compliance | Follows output format | Schema validation | >98% |
| Regression check | Didn't break general capability | Test on general benchmarks | <5% degradation |
| Human preference | Humans prefer fine-tuned vs. base | A/B blind evaluation | >60% prefer fine-tuned |
| Task-specific | Domain-relevant metrics | Task-dependent | Improvement vs. base |

A/B Evaluation Pattern

1. Prepare 100 test prompts (not in the training data).
2. Generate responses from BOTH the base model and the fine-tuned model.
3. Randomize order (the evaluator doesn't know which is which).
4. Expert evaluators score each response 1–5 on:
   - Accuracy
   - Relevance
   - Format compliance
   - Helpfulness
5. Compare aggregate scores.
6. The fine-tuned model should win on task-specific metrics without significant regression on general metrics.
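The ">60% prefer fine-tuned" target from the evaluation table can be computed from the blind scores collected in step 4. A sketch; counting ties as half-wins is one common convention, assumed here:

```python
def ab_win_rate(scores: list[tuple[float, float]]) -> float:
    """scores: (fine_tuned_score, base_score) pairs from blind review.
    Returns the fraction of comparisons won by the fine-tuned model,
    with ties counted as half a win."""
    wins = sum(1.0 if ft > base else 0.5 if ft == base else 0.0
               for ft, base in scores)
    return wins / len(scores)
```

For example, scores of [(5,3), (4,4), (2,5), (5,2)] give a win rate of 0.625, just clearing the 60% bar.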

T1.9 Azure AI Foundry Fine-Tuning

Azure AI Foundry supports fine-tuning several model families:

| Model | Fine-Tuning Support | Method | Min Data |
|---|---|---|---|
| GPT-4o | ✅ Managed | Full FT | 10 examples (50+ recommended) |
| GPT-4o-mini | ✅ Managed | Full FT | 10 examples |
| Phi-4 | ✅ Managed | LoRA | 100 examples |
| Llama 3.1 | ✅ Managed | LoRA | 100 examples |
| Mistral | ✅ Managed | LoRA | 100 examples |

Azure Fine-Tuning Flow

At a high level: prepare training data as JSONL in the chat format, upload it, create a fine-tuning job, monitor training metrics, then deploy the resulting model to an endpoint.

Cost Estimation

| Model | Training Cost | Hosting Cost |
|---|---|---|
| GPT-4o-mini | ~$3.00/1M training tokens | Standard PAYG pricing |
| GPT-4o | ~$25.00/1M training tokens | Standard PAYG pricing |
| Phi-4 (14B) | Compute hours (GPU reservation) | Managed compute pricing |
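Token-based training cost is simple arithmetic: billed tokens are roughly dataset tokens times epochs. A sketch using the GPT-4o-mini rate from the table above; epoch-multiplied billing is an assumption that matches common provider practice, so verify current pricing:

```python
def training_cost_usd(tokens_per_example: int, n_examples: int,
                      epochs: int, price_per_million: float) -> float:
    """Estimate token-billed fine-tuning cost: dataset tokens x epochs x rate."""
    total_tokens = tokens_per_example * n_examples * epochs
    return total_tokens / 1_000_000 * price_per_million

# e.g. 500 examples x 800 tokens, 3 epochs, at the ~$3/1M GPT-4o-mini rate:
cost = training_cost_usd(800, 500, 3, 3.00)  # ~= $3.60
```

The takeaway: at these rates, managed fine-tuning of a small model on a well-curated dataset costs dollars, not thousands; data preparation dominates the budget.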

T1.10 MLOps for LLMs

The LLMOps Lifecycle

Key LLMOps Tools

| Tool | Purpose | Integration |
|---|---|---|
| Azure AI Foundry | Model management, fine-tuning, evaluation | Native Azure |
| MLflow | Experiment tracking, model registry | Open source, Azure-integrated |
| Prompt Flow | Prompt development and evaluation | Azure AI Foundry, VS Code |
| GitHub Actions | CI/CD for model pipelines | Azure-integrated |
| Azure Monitor | Production observability | Native Azure |
| LangSmith / LangFuse | LLM-specific tracing and evaluation | Open source alternatives |

Key Takeaways

The Five Rules of Model Customization
  1. Start with prompts, not fine-tuning. 80% of problems are solved with better prompts + RAG. Fine-tune only when you've exhausted simpler options.
  2. LoRA is the default. Unless you have a specific reason for full fine-tuning, LoRA or QLoRA gives you 95% of the quality at 5% of the cost.
  3. Data quality > data quantity. 500 expert-curated examples beat 50,000 mediocre ones. Invest in data preparation.
  4. Always evaluate rigorously. Don't just check if fine-tuning "feels better" β€” measure accuracy, format compliance, and regression on general capability.
  5. Fine-tuning teaches HOW, not WHAT. For new knowledge, use RAG. For new behavior/style/format, use fine-tuning. For both, combine them.

FrootAI T1: Fine-tuning is the art of teaching a model your language. LoRA made it accessible. QLoRA made it affordable. Good data makes it work.