Master the techniques for fine-tuning large language models for your specific use case. Learn data preparation, training infrastructure, LoRA/QLoRA methods, and deployment strategies with production-ready code examples.
Fine-tuning large language models transforms general-purpose AI into specialised tools that understand your domain, follow your conventions, and deliver consistent results for your specific use cases. While prompting can achieve impressive results, fine-tuning creates models that inherently "know" your business context without lengthy prompts.
This technical guide covers the complete fine-tuning workflow—from data preparation and infrastructure setup through training, evaluation, and deployment. We focus on practical, cost-effective approaches including LoRA and QLoRA that make fine-tuning accessible without massive compute budgets. Whether you're building a domain expert, teaching consistent formatting, or optimising for latency, you'll learn the techniques to succeed.
The decision between fine-tuning and prompting significantly impacts project complexity, cost, and results. Understanding when each approach excels helps you choose the right strategy from the start.
from dataclasses import dataclass
from enum import Enum

class Approach(Enum):
    PROMPTING = "prompting"
    FINE_TUNING = "fine_tuning"
    HYBRID = "hybrid"

@dataclass
class UseCaseAnalysis:
    training_examples: int
    output_consistency_requirement: float  # 0-1
    domain_specificity: float  # 0-1
    expected_monthly_calls: int
    latency_requirement_ms: int

    def recommend_approach(self) -> Approach:
        """Recommend fine-tuning vs prompting based on use case."""

        # Strong indicators for fine-tuning
        if (self.output_consistency_requirement > 0.9 and
                self.training_examples > 500):
            return Approach.FINE_TUNING

        # Strong indicators for prompting
        if self.training_examples < 100:
            return Approach.PROMPTING

        # Cost analysis for borderline cases
        prompt_cost_monthly = self._estimate_prompt_cost()
        fine_tune_amortised = self._estimate_fine_tune_cost()

        if fine_tune_amortised < prompt_cost_monthly * 0.7:
            return Approach.FINE_TUNING
        elif prompt_cost_monthly < fine_tune_amortised * 0.5:
            return Approach.PROMPTING

        return Approach.HYBRID

    def _estimate_prompt_cost(self) -> float:
        """Estimate monthly cost with few-shot prompting."""
        avg_prompt_tokens = 2000  # Including examples
        avg_completion_tokens = 500
        cost_per_1k_input = 0.003
        cost_per_1k_output = 0.015

        return self.expected_monthly_calls * (
            (avg_prompt_tokens / 1000 * cost_per_1k_input) +
            (avg_completion_tokens / 1000 * cost_per_1k_output)
        )

    def _estimate_fine_tune_cost(self) -> float:
        """Estimate amortised fine-tuning cost over 6 months."""
        training_cost = self.training_examples * 0.008  # assumes ~$0.008 per example (~1k tokens each)
        inference_cost_monthly = self.expected_monthly_calls * 0.012
        return (training_cost / 6) + inference_cost_monthly


# Example usage
analysis = UseCaseAnalysis(
    training_examples=2000,
    output_consistency_requirement=0.95,
    domain_specificity=0.8,
    expected_monthly_calls=50000,
    latency_requirement_ms=500
)

recommendation = analysis.recommend_approach()
print(f"Recommended approach: {recommendation.value}")

Many production systems use a hybrid approach: fine-tuned models for core functionality, with prompt engineering for edge cases and recent context.
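As a minimal sketch of that hybrid pattern, assuming a hypothetical fine-tuned endpoint and a prompted fallback (neither of which is defined above), the routing logic might look like this:

# Hypothetical hybrid router: the callables and the 0.8 threshold are illustrative assumptions.
def route_request(query: str, confidence_estimator, call_fine_tuned, call_with_prompting):
    """Send routine queries to the fine-tuned model; fall back to few-shot prompting."""
    confidence = confidence_estimator(query)  # e.g. a lightweight classifier score in [0, 1]
    if confidence >= 0.8:
        return call_fine_tuned(query)
    # Edge cases, or queries needing fresh context, go through the prompted base model
    return call_with_prompting(query)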
Data quality determines fine-tuning success more than any other factor. A well-curated dataset of 1,000 examples often outperforms a noisy dataset of 10,000. This section covers building robust data preparation pipelines.
High-quality training data comes from multiple sources: existing documents, expert annotations, synthetic generation, and production logs (with appropriate consent and privacy handling).
import json
from dataclasses import dataclass, asdict
from typing import Optional
from pathlib import Path
import hashlib

@dataclass
class TrainingExample:
    """Single training example with metadata."""
    instruction: str
    input_text: str
    output: str
    source: str
    quality_score: Optional[float] = None
    tokens_estimate: Optional[int] = None

    def to_chat_format(self) -> dict:
        """Convert to OpenAI chat fine-tuning format."""
        messages = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

        if self.input_text:
            messages.append({
                "role": "user",
                "content": f"{self.instruction}\n\nInput: {self.input_text}"
            })
        else:
            messages.append({"role": "user", "content": self.instruction})

        messages.append({"role": "assistant", "content": self.output})

        return {"messages": messages}

    def compute_hash(self) -> str:
        """Generate hash for deduplication."""
        content = f"{self.instruction}{self.input_text}{self.output}"
        return hashlib.md5(content.encode()).hexdigest()


class DataPipeline:
    """Pipeline for preparing fine-tuning datasets."""

    def __init__(self, min_quality_score: float = 0.7):
        self.min_quality_score = min_quality_score
        self.examples: list[TrainingExample] = []
        self.seen_hashes: set[str] = set()

    def add_from_jsonl(self, path: Path) -> int:
        """Load examples from JSONL file."""
        added = 0
        with open(path) as f:
            for line in f:
                data = json.loads(line)
                example = TrainingExample(
                    instruction=data['instruction'],
                    input_text=data.get('input', ''),
                    output=data['output'],
                    source=str(path)
                )
                if self._add_example(example):
                    added += 1
        return added

    def add_from_documents(
        self,
        documents: list[str],
        instruction_generator: callable
    ) -> int:
        """Generate examples from documents using a generator function."""
        added = 0
        for doc in documents:
            examples = instruction_generator(doc)
            for ex in examples:
                if self._add_example(ex):
                    added += 1
        return added

    def _add_example(self, example: TrainingExample) -> bool:
        """Add example if it passes validation."""
        # Deduplication
        hash_val = example.compute_hash()
        if hash_val in self.seen_hashes:
            return False

        # Quality filtering
        if example.quality_score and example.quality_score < self.min_quality_score:
            return False

        # Length validation
        example.tokens_estimate = self._estimate_tokens(example)
        if example.tokens_estimate > 4096:
            return False

        self.seen_hashes.add(hash_val)
        self.examples.append(example)
        return True

    def _estimate_tokens(self, example: TrainingExample) -> int:
        """Rough token count estimation."""
        total_chars = len(example.instruction) + len(example.input_text) + len(example.output)
        return int(total_chars / 3.5)  # Rough approximation

    def export_openai_format(self, output_path: Path) -> None:
        """Export dataset in OpenAI fine-tuning format."""
        with open(output_path, 'w') as f:
            for example in self.examples:
                json.dump(example.to_chat_format(), f)
                f.write('\n')

    def get_statistics(self) -> dict:
        """Return dataset statistics."""
        if not self.examples:
            return {"count": 0}

        token_counts = [ex.tokens_estimate for ex in self.examples if ex.tokens_estimate]
        return {
            "count": len(self.examples),
            "avg_tokens": sum(token_counts) / len(token_counts),
            "max_tokens": max(token_counts),
            "min_tokens": min(token_counts),
            "sources": list(set(ex.source for ex in self.examples))
        }


# Usage
pipeline = DataPipeline(min_quality_score=0.7)
pipeline.add_from_jsonl(Path("./data/expert_annotations.jsonl"))
pipeline.add_from_jsonl(Path("./data/synthetic_examples.jsonl"))

stats = pipeline.get_statistics()
print(f"Dataset: {stats['count']} examples, avg {stats['avg_tokens']:.0f} tokens")

pipeline.export_openai_format(Path("./training_data.jsonl"))

Automated quality scoring helps filter problematic examples before they corrupt your model:
import re
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    coherence: float
    completeness: float
    formatting: float
    factual_consistency: float

    @property
    def overall(self) -> float:
        weights = [0.3, 0.25, 0.2, 0.25]
        scores = [self.coherence, self.completeness, self.formatting, self.factual_consistency]
        return sum(w * s for w, s in zip(weights, scores))


class QualityScorer:
    """Score training examples for quality."""

    def __init__(self):
        self.min_output_length = 50
        self.max_output_length = 4000

    def score(self, example: TrainingExample) -> QualityMetrics:
        return QualityMetrics(
            coherence=self._score_coherence(example),
            completeness=self._score_completeness(example),
            formatting=self._score_formatting(example),
            factual_consistency=self._score_factual_consistency(example)
        )

    def _score_coherence(self, example: TrainingExample) -> float:
        """Check if output logically follows from instruction."""
        score = 1.0

        # Penalise very short outputs
        if len(example.output) < self.min_output_length:
            score -= 0.5

        # Penalise truncated outputs
        if example.output.rstrip().endswith(('...', '[', '{')):
            score -= 0.3

        # Check for repetition (sign of generation issues)
        words = example.output.split()
        if len(words) > 10:
            unique_ratio = len(set(words)) / len(words)
            if unique_ratio < 0.3:
                score -= 0.4

        return max(0, score)

    def _score_completeness(self, example: TrainingExample) -> float:
        """Check if output fully addresses the instruction."""
        score = 1.0

        # Length appropriateness
        if len(example.output) < self.min_output_length:
            score -= 0.3

        # Check for common incomplete markers
        incomplete_markers = [
            'I cannot', 'I\'m not sure', 'I don\'t know',
            'TODO', 'TBD', '[placeholder]'
        ]
        for marker in incomplete_markers:
            if marker.lower() in example.output.lower():
                score -= 0.2

        return max(0, score)

    def _score_formatting(self, example: TrainingExample) -> float:
        """Check output formatting consistency."""
        score = 1.0

        # Check JSON validity if output looks like JSON
        if example.output.strip().startswith('{'):
            try:
                import json
                json.loads(example.output)
            except json.JSONDecodeError:
                score -= 0.5

        # Check for balanced brackets/quotes
        if example.output.count('(') != example.output.count(')'):
            score -= 0.2
        if example.output.count('[') != example.output.count(']'):
            score -= 0.2

        return max(0, score)

    def _score_factual_consistency(self, example: TrainingExample) -> float:
        """Basic consistency checks."""
        score = 1.0

        # Check for contradictory statements
        contradictions = [
            ('always', 'never'),
            ('all', 'none'),
            ('true', 'false')
        ]

        words_lower = example.output.lower()
        for word1, word2 in contradictions:
            if word1 in words_lower and word2 in words_lower:
                # Context-dependent, mild penalty
                score -= 0.1

        return max(0, score)


# Apply scoring to pipeline
scorer = QualityScorer()
for example in pipeline.examples:
    metrics = scorer.score(example)
    example.quality_score = metrics.overall

For production datasets, augment automated scoring with human review of a representative sample to calibrate thresholds.
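One possible shape for that calibration step is sketched below; the sample size, spread-based sampling, and the idea of tuning min_quality_score against reviewer labels are assumptions rather than prescriptions.

def sample_for_review(examples, n=50):
    """Pick a review sample that spans the automated score range."""
    ranked = sorted(examples, key=lambda ex: ex.quality_score or 0.0)
    step = max(1, len(ranked) // n)
    return ranked[::step][:n]  # every k-th example, so low, mid and high scores all appear

review_batch = sample_for_review(pipeline.examples)
# Reviewers label each sampled example pass/fail; comparing those labels with
# quality_score suggests where to set min_quality_score.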
Choosing between local and cloud training involves trade-offs between cost, convenience, and capability. This section covers both approaches with practical configurations.
Local training works well for smaller models (up to 7B parameters with QLoRA) and provides complete data privacy. You need a GPU with at least 24GB VRAM for efficient training.
# requirements.txt for local training
"""
torch>=2.1.0
transformers>=4.36.0
peft>=0.7.0
bitsandbytes>=0.41.0
datasets>=2.15.0
accelerate>=0.25.0
wandb>=0.16.0
trl>=0.7.0
"""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

def setup_local_model(
    model_name: str = "mistralai/Mistral-7B-v0.1",
    quantization: str = "4bit"
) -> tuple:
    """Setup model for local training with quantization."""

    # Configure quantization
    if quantization == "4bit":
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True
        )
    elif quantization == "8bit":
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        bnb_config = None

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Load model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    # Disable caching for training
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    # Print memory usage
    print(f"Model loaded. GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

    return model, tokenizer


# Check GPU availability
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available - training will be very slow")

Cloud platforms offer scalability and access to high-end GPUs. Here's a configuration for AWS SageMaker:
import sagemaker
from sagemaker.huggingface import HuggingFace

def create_sagemaker_training_job(
    training_data_s3: str,
    output_s3: str,
    instance_type: str = "ml.g5.2xlarge"
) -> HuggingFace:
    """Configure SageMaker training job for fine-tuning."""

    # Hyperparameters
    hyperparameters = {
        'model_id': 'mistralai/Mistral-7B-v0.1',
        'epochs': 3,
        'per_device_train_batch_size': 4,
        'gradient_accumulation_steps': 4,
        'learning_rate': 2e-4,
        'lora_r': 16,
        'lora_alpha': 32,
        'lora_dropout': 0.05,
        'bf16': True,
        'gradient_checkpointing': True
    }

    # Instance configuration
    instance_configs = {
        "ml.g5.xlarge": {"gpu_memory": 24, "cost_per_hour": 1.41},
        "ml.g5.2xlarge": {"gpu_memory": 24, "cost_per_hour": 1.69},
        "ml.g5.4xlarge": {"gpu_memory": 24, "cost_per_hour": 2.27},
        "ml.p4d.24xlarge": {"gpu_memory": 320, "cost_per_hour": 37.69}
    }

    print(f"Using {instance_type}: {instance_configs[instance_type]}")

    # Create estimator
    huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type=instance_type,
        instance_count=1,
        role=sagemaker.get_execution_role(),
        transformers_version='4.36',
        pytorch_version='2.1',
        py_version='py310',
        hyperparameters=hyperparameters,
        output_path=output_s3,
        disable_profiler=True,
        environment={
            'HUGGINGFACE_HUB_CACHE': '/tmp/hf_cache'
        }
    )

    return huggingface_estimator


# Alternative: RunPod configuration for cost-effective training
RUNPOD_CONFIG = """
# runpod.yaml
gpu: RTX_4090  # or A100_80GB for larger models
vcpu: 8
memory: 32
volume_size: 100
docker_image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
env:
  - WANDB_API_KEY=${WANDB_API_KEY}
  - HF_TOKEN=${HF_TOKEN}
"""

For most use cases, we recommend starting with a cloud provider offering spot or preemptible instances, which can reduce training costs by 60-80%.
Low-Rank Adaptation (LoRA) and its quantised variant QLoRA have revolutionised fine-tuning by reducing memory requirements by 90%+ while maintaining quality. These methods train small adapter layers rather than modifying all model weights.
LoRA works by decomposing weight updates into low-rank matrices. Instead of updating a weight matrix W directly, it learns two smaller matrices A and B such that the update ΔW = BA. This dramatically reduces trainable parameters.
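To make the reduction concrete, here is a quick back-of-the-envelope count for a single weight matrix (the dimensions are illustrative, not tied to any particular model):

# Adapting one weight matrix W of shape (d, k) with a rank-r LoRA update
d, k, r = 4096, 4096, 16            # illustrative sizes for one attention projection
full_update_params = d * k          # updating W directly: 16,777,216 parameters
lora_params = r * (d + k)           # training B (d x r) and A (r x k): 131,072 parameters
print(f"{full_update_params / lora_params:.0f}x fewer trainable parameters")  # ~128x

The same ratio applies to every layer you adapt, which is why LoRA runs typically train well under 1% of the model's parameters.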
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType
)
from transformers import TrainingArguments
from trl import SFTTrainer

def create_lora_config(
    r: int = 16,
    lora_alpha: int = 32,
    target_modules: list[str] = None
) -> LoraConfig:
    """Create LoRA configuration with best practices."""

    # Default target modules for LLaMA-style models
    if target_modules is None:
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
            "gate_proj", "up_proj", "down_proj"      # MLP
        ]

    return LoraConfig(
        r=r,                    # Rank of update matrices
        lora_alpha=lora_alpha,  # Scaling factor
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )


def setup_lora_training(
    model,
    tokenizer,
    train_dataset,
    output_dir: str = "./lora_output"
):
    """Setup complete LoRA training pipeline."""

    # Prepare model for training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA
    lora_config = create_lora_config(r=16, lora_alpha=32)
    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        weight_decay=0.01,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        max_grad_norm=0.3
    )

    # Create trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        args=training_args,
        max_seq_length=2048,
        dataset_text_field="text",
        packing=True  # Efficient packing of short sequences
    )

    return trainer


# Run training
trainer = setup_lora_training(model, tokenizer, train_dataset)
trainer.train()

# Save adapter weights only (small file size)
trainer.model.save_pretrained("./final_adapter")

QLoRA combines 4-bit quantisation with LoRA, enabling fine-tuning of 65B+ parameter models on a single GPU:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

def setup_qlora_model(model_name: str):
    """Setup model with QLoRA - 4-bit quantization + LoRA."""

    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # Normalised float 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True           # Nested quantization
    )

    # Load quantised model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )

    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True
    )

    # QLoRA-specific config
    qlora_config = LoraConfig(
        r=64,  # Higher rank for QLoRA
        lora_alpha=16,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )

    model = get_peft_model(model, qlora_config)

    return model


# Memory comparison
MEMORY_REQUIREMENTS = {
    "7B Full Fine-tune": "~140 GB",
    "7B LoRA (16-bit)": "~28 GB",
    "7B QLoRA (4-bit)": "~6 GB",
    "70B QLoRA (4-bit)": "~48 GB"
}

for method, memory in MEMORY_REQUIREMENTS.items():
    print(f"{method}: {memory}")

QLoRA achieves comparable results to full fine-tuning at a fraction of the compute cost—the sweet spot for most business applications.
Finding optimal hyperparameters significantly impacts training efficiency and model quality. While defaults work reasonably well, systematic optimisation can improve results by 10-30%.
Focus optimisation efforts on these high-impact parameters:
import optuna
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

def create_objective(model_init, train_dataset, eval_dataset, tokenizer):
    """Create Optuna objective function for hyperparameter search."""

    def objective(trial: optuna.Trial) -> float:
        # Sample hyperparameters
        learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
        batch_size = trial.suggest_categorical("batch_size", [2, 4, 8])
        warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.1)
        weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
        lora_r = trial.suggest_categorical("lora_r", [8, 16, 32, 64])
        lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64])

        # Calculate gradient accumulation for effective batch size of 16
        grad_accum = 16 // batch_size

        training_args = TrainingArguments(
            output_dir=f"./trials/trial_{trial.number}",
            num_train_epochs=1,  # Short for hyperparameter search
            per_device_train_batch_size=batch_size,
            gradient_accumulation_steps=grad_accum,
            learning_rate=learning_rate,
            warmup_ratio=warmup_ratio,
            weight_decay=weight_decay,
            evaluation_strategy="steps",
            eval_steps=50,
            logging_steps=10,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            bf16=True
        )

        # Initialize model with sampled LoRA config
        model = model_init()
        lora_config = LoraConfig(
            r=lora_r,
            lora_alpha=lora_alpha,
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.05,
            task_type=TaskType.CAUSAL_LM
        )
        model = get_peft_model(model, lora_config)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
        )

        trainer.train()

        # Return best eval loss
        return trainer.state.best_metric

    return objective


def run_hyperparameter_search(
    model_init,
    train_dataset,
    eval_dataset,
    tokenizer,
    n_trials: int = 20
):
    """Run hyperparameter optimisation with Optuna."""

    study = optuna.create_study(
        direction="minimize",
        study_name="llm_fine_tuning",
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5)
    )

    objective = create_objective(model_init, train_dataset, eval_dataset, tokenizer)

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=3600 * 4,  # 4 hour timeout
        show_progress_bar=True
    )

    print(f"Best trial: {study.best_trial.number}")
    print(f"Best params: {study.best_params}")
    print(f"Best loss: {study.best_value:.4f}")

    return study.best_params


# Recommended starting points by model size
RECOMMENDED_HYPERPARAMS = {
    "7B": {
        "learning_rate": 2e-4,
        "batch_size": 4,
        "lora_r": 16,
        "lora_alpha": 32,
        "warmup_ratio": 0.03
    },
    "13B": {
        "learning_rate": 1e-4,
        "batch_size": 2,
        "lora_r": 32,
        "lora_alpha": 64,
        "warmup_ratio": 0.05
    },
    "70B": {
        "learning_rate": 5e-5,
        "batch_size": 1,
        "lora_r": 64,
        "lora_alpha": 128,
        "warmup_ratio": 0.05
    }
}

Start with the recommended defaults, run a small hyperparameter search on 10-20% of your data, then train the final model with the optimal settings on the full dataset.
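A minimal sketch of that workflow, assuming train_dataset is a Hugging Face datasets.Dataset (the 10% fraction and the seed are arbitrary choices):

# Search on a small random slice, then train the final model on everything
search_subset = train_dataset.shuffle(seed=42).select(range(int(0.1 * len(train_dataset))))

best_params = run_hyperparameter_search(
    model_init, search_subset, eval_dataset, tokenizer, n_trials=20
)

# Merge the winning values over the size-appropriate defaults for the full run
final_config = {**RECOMMENDED_HYPERPARAMS["7B"], **best_params}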
Evaluating fine-tuned models requires both automated metrics and human assessment. Different use cases prioritise different metrics—instruction following needs different evaluation than code generation.
import torch
import numpy as np
from dataclasses import dataclass
from typing import Callable
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import evaluate

@dataclass
class EvaluationResult:
    metric_name: str
    score: float
    details: dict = None


class ModelEvaluator:
    """Comprehensive evaluation suite for fine-tuned models."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
        self.bleu = evaluate.load("bleu")

    def evaluate_all(
        self,
        test_examples: list[dict],
        custom_metrics: list[Callable] = None
    ) -> dict[str, EvaluationResult]:
        """Run all evaluation metrics on test set."""

        # Generate predictions
        predictions = []
        references = []

        for example in test_examples:
            pred = self._generate(example['input'])
            predictions.append(pred)
            references.append(example['expected_output'])

        results = {}

        # Perplexity
        results['perplexity'] = self._compute_perplexity(test_examples)

        # ROUGE scores
        results['rouge'] = self._compute_rouge(predictions, references)

        # BLEU score
        results['bleu'] = self._compute_bleu(predictions, references)

        # BERTScore for semantic similarity
        results['bertscore'] = self._compute_bertscore(predictions, references)

        # Task-specific accuracy
        results['exact_match'] = self._compute_exact_match(predictions, references)

        # Custom metrics
        if custom_metrics:
            for metric_fn in custom_metrics:
                name = metric_fn.__name__
                results[name] = metric_fn(predictions, references)

        return results

    def _generate(self, prompt: str, max_length: int = 512) -> str:
        """Generate response for evaluation."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_length,
                do_sample=False,
                pad_token_id=self.tokenizer.pad_token_id
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response[len(prompt):].strip()

    def _compute_perplexity(self, examples: list[dict]) -> EvaluationResult:
        """Compute perplexity on test set."""
        total_loss = 0
        total_tokens = 0

        for example in examples:
            text = f"{example['input']} {example['expected_output']}"
            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)

            with torch.no_grad():
                outputs = self.model(**inputs, labels=inputs['input_ids'])
                total_loss += outputs.loss.item() * inputs['input_ids'].shape[1]
                total_tokens += inputs['input_ids'].shape[1]

        perplexity = np.exp(total_loss / total_tokens)
        return EvaluationResult("perplexity", perplexity)

    def _compute_rouge(self, predictions: list, references: list) -> EvaluationResult:
        """Compute ROUGE scores."""
        scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

        for pred, ref in zip(predictions, references):
            result = self.rouge.score(ref, pred)
            for key in scores:
                scores[key].append(result[key].fmeasure)

        avg_scores = {k: np.mean(v) for k, v in scores.items()}
        return EvaluationResult("rouge", avg_scores['rougeL'], avg_scores)

    def _compute_bleu(self, predictions: list, references: list) -> EvaluationResult:
        """Compute BLEU score."""
        result = self.bleu.compute(
            predictions=predictions,
            references=[[r] for r in references]
        )
        return EvaluationResult("bleu", result['bleu'])

    def _compute_bertscore(self, predictions: list, references: list) -> EvaluationResult:
        """Compute BERTScore for semantic similarity."""
        P, R, F1 = bert_score(predictions, references, lang="en")
        return EvaluationResult("bertscore", F1.mean().item())

    def _compute_exact_match(self, predictions: list, references: list) -> EvaluationResult:
        """Compute exact match accuracy."""
        matches = sum(
            1 for p, r in zip(predictions, references)
            if p.strip().lower() == r.strip().lower()
        )
        return EvaluationResult("exact_match", matches / len(predictions))


# Domain-specific custom metrics
def json_validity_score(predictions: list, references: list) -> EvaluationResult:
    """Check if predictions are valid JSON."""
    import json
    valid = 0
    for pred in predictions:
        try:
            json.loads(pred)
            valid += 1
        except json.JSONDecodeError:
            pass
    return EvaluationResult("json_validity", valid / len(predictions))


def format_compliance_score(predictions: list, references: list) -> EvaluationResult:
    """Check compliance with expected output format."""
    import re
    pattern = r'^\{.*\}$'  # Example: must be wrapped in braces
    matches = sum(1 for p in predictions if re.match(pattern, p.strip(), re.DOTALL))
    return EvaluationResult("format_compliance", matches / len(predictions))


# Usage
evaluator = ModelEvaluator(model, tokenizer)
results = evaluator.evaluate_all(
    test_examples,
    custom_metrics=[json_validity_score, format_compliance_score]
)

for name, result in results.items():
    print(f"{name}: {result.score:.4f}")

Automated metrics capture only part of the picture. Human evaluation assesses qualities like helpfulness, safety, and task appropriateness:
interface EvaluationCriteria {
  name: string;
  description: string;
  scale: [number, number];
  guidelines: string[];
}

const EVALUATION_CRITERIA: EvaluationCriteria[] = [
  {
    name: "Accuracy",
    description: "Is the information factually correct?",
    scale: [1, 5],
    guidelines: [
      "1: Contains significant factual errors",
      "3: Mostly accurate with minor issues",
      "5: Completely accurate"
    ]
  },
  {
    name: "Relevance",
    description: "Does the response address the query?",
    scale: [1, 5],
    guidelines: [
      "1: Completely off-topic",
      "3: Partially addresses the query",
      "5: Fully addresses all aspects"
    ]
  },
  {
    name: "Coherence",
    description: "Is the response well-structured and logical?",
    scale: [1, 5],
    guidelines: [
      "1: Incoherent or contradictory",
      "3: Generally logical with some issues",
      "5: Clear, well-organised, logical"
    ]
  },
  {
    name: "Helpfulness",
    description: "Would this response help the user?",
    scale: [1, 5],
    guidelines: [
      "1: Not helpful at all",
      "3: Somewhat helpful",
      "5: Extremely helpful"
    ]
  }
];

interface HumanEvaluation {
  exampleId: string;
  evaluatorId: string;
  modelA: string;
  modelB: string;
  preference: 'A' | 'B' | 'tie';
  scores: Record<string, number>;
  notes?: string;
}

// Group evaluations by key so annotators of the same example can be compared
function groupBy<T>(items: T[], keyFn: (item: T) => string): Record<string, T[]> {
  return items.reduce((acc, item) => {
    const key = keyFn(item);
    (acc[key] = acc[key] || []).push(item);
    return acc;
  }, {} as Record<string, T[]>);
}

function calculateInterAnnotatorAgreement(
  evaluations: HumanEvaluation[]
): number {
  // Raw pairwise agreement on preferences (a simple proxy; use Cohen's kappa
  // or Krippendorff's alpha if you need chance-corrected agreement)
  const grouped = groupBy(evaluations, e => e.exampleId);
  let agreements = 0;
  let total = 0;

  for (const [_, evals] of Object.entries(grouped)) {
    if (evals.length >= 2) {
      for (let i = 0; i < evals.length - 1; i++) {
        for (let j = i + 1; j < evals.length; j++) {
          total++;
          if (evals[i].preference === evals[j].preference) {
            agreements++;
          }
        }
      }
    }
  }

  return total > 0 ? agreements / total : 0;
}

Aim for at least 3 evaluators per example and report inter-annotator agreement alongside results.
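If you prefer a chance-corrected agreement figure over the raw pairwise rate computed above, scikit-learn's cohen_kappa_score is one option; the labels below are made up purely for illustration.

from sklearn.metrics import cohen_kappa_score

# Preference labels from two annotators over the same five examples (hypothetical data)
annotator_1 = ["A", "B", "tie", "A", "A"]
annotator_2 = ["A", "B", "A", "A", "tie"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")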
Deploying fine-tuned models requires balancing latency, cost, and reliability. This section covers production deployment patterns from simple to sophisticated.
# vLLM provides high-throughput inference with PagedAttention
from vllm import LLM, SamplingParams

def deploy_with_vllm(
    base_model: str,
    lora_path: str,
    tensor_parallel_size: int = 1
):
    """Deploy fine-tuned model with vLLM for production."""

    # Load model with LoRA adapter support enabled
    llm = LLM(
        model=base_model,
        enable_lora=True,
        max_lora_rank=64,
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=0.9,
        max_model_len=4096
    )

    # Default sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=512,
        stop=["</s>", "\n\n"]
    )

    return llm, sampling_params


# FastAPI wrapper for production
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import asyncio

app = FastAPI()

# Assumes the engine was created at startup, e.g.:
# llm, default_params = deploy_with_vllm("mistralai/Mistral-7B-v0.1", "./final_adapter")

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7

class GenerationResponse(BaseModel):
    text: str
    tokens_generated: int
    latency_ms: float

@app.post("/generate", response_model=GenerationResponse)
async def generate(request: GenerationRequest):
    import time
    start = time.time()

    params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )

    outputs = llm.generate([request.prompt], params)

    latency = (time.time() - start) * 1000

    return GenerationResponse(
        text=outputs[0].outputs[0].text,
        tokens_generated=len(outputs[0].outputs[0].token_ids),
        latency_ms=latency
    )


# Docker deployment
DOCKERFILE = """
FROM vllm/vllm-openai:latest

# Copy model and adapter
COPY ./model /app/model
COPY ./adapter /app/adapter

# Set environment
ENV MODEL_PATH=/app/model
ENV LORA_PATH=/app/adapter

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "$MODEL_PATH", \
     "--enable-lora", \
     "--lora-modules", "custom=$LORA_PATH"]
"""

For Kubernetes deployments, the same container runs behind a Deployment with GPU resources, a horizontal autoscaler, and a load-balanced Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  labels:
    app: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: your-registry/llm-inference:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
          ports:
            - containerPort: 8000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: inference_queue_length
        target:
          type: AverageValue
          averageValue: "10"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
spec:
  selector:
    app: llm-inference
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer

For production workloads, implement request queuing, graceful degradation, and A/B testing infrastructure to safely roll out model updates.
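As a minimal sketch of the A/B piece, assuming two service endpoints named here purely for illustration, a weighted router can send a small share of traffic to the candidate model and tag each response for later comparison:

import random

# Hypothetical canary split between the current adapter and a new fine-tune under test
MODEL_VARIANTS = {
    "current": "http://llm-inference-service/generate",
    "candidate": "http://llm-inference-canary/generate",
}

def pick_variant(canary_share: float = 0.1) -> tuple[str, str]:
    """Return (variant_name, endpoint_url) using a weighted random split."""
    name = "candidate" if random.random() < canary_share else "current"
    return name, MODEL_VARIANTS[name]  # log the name with each request for offline comparison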
Fine-tuning costs can escalate quickly without careful management. These strategies help maximise value from your training budget.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CostOptimisedConfig:
    """Configuration optimised for cost-effective training."""

    # Use gradient checkpointing to reduce memory (allows larger batches)
    gradient_checkpointing: bool = True

    # Mixed precision training
    bf16: bool = True

    # Efficient optimiser
    optim: str = "paged_adamw_8bit"

    # Gradient accumulation to simulate larger batches
    gradient_accumulation_steps: int = 8

    # Early stopping to avoid overtraining
    early_stopping_patience: int = 3

    def estimate_training_cost(
        self,
        num_examples: int,
        epochs: int,
        instance_type: str = "g5.xlarge",
        cloud: str = "aws"
    ) -> dict:
        """Estimate training cost."""

        # Instance costs per hour (approximate)
        instance_costs = {
            "aws": {
                "g5.xlarge": 1.006,
                "g5.2xlarge": 1.212,
                "p4d.24xlarge": 32.77
            },
            "gcp": {
                "a2-highgpu-1g": 3.67,
                "a2-highgpu-8g": 29.39
            }
        }

        # Training speed estimates (examples per hour)
        throughput = {
            "g5.xlarge": 500,  # 7B model with QLoRA
            "g5.2xlarge": 700,
            "p4d.24xlarge": 5000
        }

        cost_per_hour = instance_costs[cloud][instance_type]
        examples_per_hour = throughput.get(instance_type, 500)

        total_examples = num_examples * epochs
        hours_needed = total_examples / examples_per_hour
        total_cost = hours_needed * cost_per_hour

        # Add 20% buffer for evaluation and checkpointing
        total_cost *= 1.2

        return {
            "estimated_hours": hours_needed,
            "cost_per_hour": cost_per_hour,
            "total_cost": total_cost,
            "cost_per_example": total_cost / total_examples
        }


def use_spot_instances(base_config: dict) -> dict:
    """Configure for spot/preemptible instances with checkpointing."""

    return {
        **base_config,
        # Save frequently for spot instance recovery
        "save_strategy": "steps",
        "save_steps": 100,
        "save_total_limit": 3,

        # Resume from checkpoint
        "resume_from_checkpoint": True,

        # Reduce per-checkpoint size
        "save_only_model": True
    }


# Cost comparison analysis
def compare_training_approaches(num_examples: int = 10000):
    """Compare costs between different training approaches."""

    approaches = {
        "Full Fine-tune (A100)": {
            "instance": "p4d.24xlarge",
            "hours": num_examples * 3 / 5000,  # 3 epochs
            "cost_per_hour": 32.77
        },
        "LoRA (A10G)": {
            "instance": "g5.2xlarge",
            "hours": num_examples * 3 / 700,
            "cost_per_hour": 1.212
        },
        "QLoRA (A10G)": {
            "instance": "g5.xlarge",
            "hours": num_examples * 3 / 500,
            "cost_per_hour": 1.006
        },
        "QLoRA Spot (A10G)": {
            "instance": "g5.xlarge",
            "hours": num_examples * 3 / 500,
            "cost_per_hour": 0.35  # ~65% discount
        }
    }

    print(f"Cost comparison for {num_examples} examples, 3 epochs:\n")
    for name, config in approaches.items():
        total = config['hours'] * config['cost_per_hour']
        print(f"{name}:")
        print(f"  Hours: {config['hours']:.1f}")
        print(f"  Total: ${total:.2f}")
        print()


compare_training_approaches()

Inference costs often exceed training costs over time. Key optimisation strategies:
import time
from dataclasses import dataclass, field
from collections import defaultdict
from datetime import datetime, timedelta

@dataclass
class InferenceCostTracker:
    """Track and optimise inference costs."""

    cost_per_1k_tokens: float = 0.002
    requests: list = field(default_factory=list)
    cache_hits: int = 0
    cache_misses: int = 0

    def log_request(
        self,
        input_tokens: int,
        output_tokens: int,
        latency_ms: float,
        cached: bool = False
    ):
        """Log a single inference request."""
        if cached:
            self.cache_hits += 1
            cost = 0
        else:
            self.cache_misses += 1
            cost = (input_tokens + output_tokens) / 1000 * self.cost_per_1k_tokens

        self.requests.append({
            'timestamp': datetime.now(),
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'latency_ms': latency_ms,
            'cost': cost,
            'cached': cached
        })

    def get_daily_report(self) -> dict:
        """Generate daily cost report."""
        today = datetime.now().date()
        today_requests = [
            r for r in self.requests
            if r['timestamp'].date() == today
        ]

        if not today_requests:
            return {"message": "No requests today"}

        total_cost = sum(r['cost'] for r in today_requests)
        total_tokens = sum(r['input_tokens'] + r['output_tokens'] for r in today_requests)
        avg_latency = sum(r['latency_ms'] for r in today_requests) / len(today_requests)

        return {
            'date': str(today),
            'total_requests': len(today_requests),
            'total_cost': f"${total_cost:.4f}",
            'total_tokens': total_tokens,
            'avg_latency_ms': f"{avg_latency:.1f}",
            'cache_hit_rate': f"{self.cache_hits / (self.cache_hits + self.cache_misses) * 100:.1f}%",
            'projected_monthly': f"${total_cost * 30:.2f}"
        }

    def get_optimisation_recommendations(self) -> list[str]:
        """Suggest cost optimisations based on usage patterns."""
        recommendations = []

        # Check cache effectiveness
        if self.cache_hits + self.cache_misses > 100:
            hit_rate = self.cache_hits / (self.cache_hits + self.cache_misses)
            if hit_rate < 0.2:
                recommendations.append(
                    "Low cache hit rate. Consider caching common queries."
                )

        # Check for long inputs that could be shortened
        recent = self.requests[-1000:] if len(self.requests) > 1000 else self.requests
        avg_input = sum(r['input_tokens'] for r in recent) / len(recent)
        if avg_input > 1000:
            recommendations.append(
                f"High avg input tokens ({avg_input:.0f}). "
                "Consider shorter prompts or context compression."
            )

        # Check for potential batching
        # (simplified - would need timestamp analysis in production)
        if len(recent) > 100:
            recommendations.append(
                "High request volume. Consider request batching for efficiency."
            )

        return recommendations


# Usage
tracker = InferenceCostTracker(cost_per_1k_tokens=0.002)

# Log some requests
tracker.log_request(500, 200, 150.5, cached=False)
tracker.log_request(500, 200, 15.2, cached=True)

print(tracker.get_daily_report())
print(tracker.get_optimisation_recommendations())

Fine-tuning transforms general-purpose language models into specialised tools that understand your domain and deliver consistent, high-quality results. The techniques covered—from data preparation through LoRA/QLoRA training to production deployment—provide a complete toolkit for building custom AI capabilities.
Key success factors include: starting with high-quality data rather than quantity, using parameter-efficient methods like QLoRA to reduce costs, implementing comprehensive evaluation that combines automated metrics with human assessment, and deploying with proper monitoring and cost tracking. For most business applications, fine-tuning on 1,000-10,000 carefully curated examples using QLoRA delivers excellent results at reasonable cost.