advanced
17 min read
20 December 2024

Custom Model Training & Fine-Tuning: A Technical Guide

Master the techniques for fine-tuning large language models for your specific use case. Learn data preparation, training infrastructure, LoRA/QLoRA methods, and deployment strategies with production-ready code examples.

Clever Ops Team

Fine-tuning large language models transforms general-purpose AI into specialised tools that understand your domain, follow your conventions, and deliver consistent results for your specific use cases. While prompting can achieve impressive results, fine-tuning creates models that inherently "know" your business context without lengthy prompts.

This technical guide covers the complete fine-tuning workflow—from data preparation and infrastructure setup through training, evaluation, and deployment. We focus on practical, cost-effective approaches including LoRA and QLoRA that make fine-tuning accessible without massive compute budgets. Whether you're building a domain expert, teaching consistent formatting, or optimising for latency, you'll learn the techniques to succeed.

Key Takeaways

  • Choose fine-tuning over prompting when you need consistent output formats, domain expertise, or cost efficiency at scale
  • Data quality matters more than quantity—1,000 high-quality examples often outperform 10,000 noisy ones
  • QLoRA enables fine-tuning 7B-70B parameter models on consumer GPUs by combining 4-bit quantisation with LoRA
  • Systematic hyperparameter optimisation can improve results by 10-30%—start with recommended defaults then search
  • Combine automated metrics (perplexity, ROUGE, BERTScore) with human evaluation for comprehensive assessment
  • Use spot instances for training and quantised models for inference to reduce costs by 60-80%

When to Fine-Tune vs Use Prompting

The decision between fine-tuning and prompting significantly impacts project complexity, cost, and results. Understanding when each approach excels helps you choose the right strategy from the start.

Choose Prompting When:

  • Rapid iteration needed - Prompts can be updated instantly without retraining
  • Task variability is high - Different queries need fundamentally different approaches
  • Training data is limited - Less than a few hundred high-quality examples
  • Cost is constrained - API calls may be cheaper than training compute
  • Domain changes frequently - News, regulations, or market conditions shift rapidly

Choose Fine-Tuning When:

  • Consistent output format required - JSON schemas, citation styles, report templates
  • Domain expertise needed - Legal, medical, or industry-specific terminology
  • Latency matters - Shorter prompts mean faster responses
  • Cost at scale - Reduced token usage adds up with high volume
  • Proprietary behaviour - Teaching patterns that can't be described in prompts
Decision Framework Implementation (Python)
from dataclasses import dataclass
from enum import Enum

class Approach(Enum):
    PROMPTING = "prompting"
    FINE_TUNING = "fine_tuning"
    HYBRID = "hybrid"

@dataclass
class UseCaseAnalysis:
    training_examples: int
    output_consistency_requirement: float  # 0-1
    domain_specificity: float  # 0-1
    expected_monthly_calls: int
    latency_requirement_ms: int

    def recommend_approach(self) -> Approach:
        """Recommend fine-tuning vs prompting based on use case."""

        # Strong indicators for fine-tuning
        if (self.output_consistency_requirement > 0.9 and
            self.training_examples > 500):
            return Approach.FINE_TUNING

        # Strong indicators for prompting
        if self.training_examples < 100:
            return Approach.PROMPTING

        # Cost analysis for borderline cases
        prompt_cost_monthly = self._estimate_prompt_cost()
        fine_tune_amortised = self._estimate_fine_tune_cost()

        if fine_tune_amortised < prompt_cost_monthly * 0.7:
            return Approach.FINE_TUNING
        elif prompt_cost_monthly < fine_tune_amortised * 0.5:
            return Approach.PROMPTING

        return Approach.HYBRID

    def _estimate_prompt_cost(self) -> float:
        """Estimate monthly cost with few-shot prompting."""
        avg_prompt_tokens = 2000  # Including examples
        avg_completion_tokens = 500
        cost_per_1k_input = 0.003
        cost_per_1k_output = 0.015

        return self.expected_monthly_calls * (
            (avg_prompt_tokens / 1000 * cost_per_1k_input) +
            (avg_completion_tokens / 1000 * cost_per_1k_output)
        )

    def _estimate_fine_tune_cost(self) -> float:
        """Estimate amortised fine-tuning cost over 6 months."""
        training_cost = self.training_examples * 0.008  # Rough training cost per example
        inference_cost_monthly = self.expected_monthly_calls * 0.012
        return (training_cost / 6) + inference_cost_monthly


# Example usage
analysis = UseCaseAnalysis(
    training_examples=2000,
    output_consistency_requirement=0.95,
    domain_specificity=0.8,
    expected_monthly_calls=50000,
    latency_requirement_ms=500
)

recommendation = analysis.recommend_approach()
print(f"Recommended approach: {recommendation.value}")

Many production systems use a hybrid approach: fine-tuned models for core functionality with prompt engineering for edge cases and recent context.
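
As a minimal sketch of that split (the router and its keyword heuristic below are illustrative placeholders, not a specific framework API), the fine-tuned model can serve the well-covered core tasks while a prompted general model handles flagged edge cases:

Hybrid Routing Sketch (Python)
from typing import Callable

def make_hybrid_router(
    fine_tuned_generate: Callable[[str], str],
    prompted_generate: Callable[[str], str],
    edge_case_keywords: set[str],
) -> Callable[[str], str]:
    """Return a router that prefers the fine-tuned model and falls back to prompting."""

    def route(query: str) -> str:
        # Simple keyword heuristic for edge cases; a real system might use a
        # classifier or a confidence signal from the fine-tuned model instead.
        if any(kw in query.lower() for kw in edge_case_keywords):
            return prompted_generate(query)
        return fine_tuned_generate(query)

    return route


# Usage with stand-in generation functions
router = make_hybrid_router(
    fine_tuned_generate=lambda q: f"[fine-tuned] {q}",
    prompted_generate=lambda q: f"[prompted with few-shot examples] {q}",
    edge_case_keywords={"breaking news", "regulation update"},
)
print(router("Summarise this contract clause"))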

Training Data Preparation Pipeline

Data quality determines fine-tuning success more than any other factor. A well-curated dataset of 1,000 examples often outperforms a noisy dataset of 10,000. This section covers building robust data preparation pipelines.

Data Collection Strategies

High-quality training data comes from multiple sources: existing documents, expert annotations, synthetic generation, and production logs (with appropriate consent and privacy handling).

Data Collection and Validation Pipeline (Python)
import json
from dataclasses import dataclass, asdict
from typing import Optional
from pathlib import Path
import hashlib

@dataclass
class TrainingExample:
    """Single training example with metadata."""
    instruction: str
    input_text: str
    output: str
    source: str
    quality_score: Optional[float] = None
    tokens_estimate: Optional[int] = None

    def to_chat_format(self) -> dict:
        """Convert to OpenAI chat fine-tuning format."""
        messages = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

        if self.input_text:
            messages.append({
                "role": "user",
                "content": f"{self.instruction}\n\nInput: {self.input_text}"
            })
        else:
            messages.append({"role": "user", "content": self.instruction})

        messages.append({"role": "assistant", "content": self.output})

        return {"messages": messages}

    def compute_hash(self) -> str:
        """Generate hash for deduplication."""
        content = f"{self.instruction}{self.input_text}{self.output}"
        return hashlib.md5(content.encode()).hexdigest()


class DataPipeline:
    """Pipeline for preparing fine-tuning datasets."""

    def __init__(self, min_quality_score: float = 0.7):
        self.min_quality_score = min_quality_score
        self.examples: list[TrainingExample] = []
        self.seen_hashes: set[str] = set()

    def add_from_jsonl(self, path: Path) -> int:
        """Load examples from JSONL file."""
        added = 0
        with open(path) as f:
            for line in f:
                data = json.loads(line)
                example = TrainingExample(
                    instruction=data['instruction'],
                    input_text=data.get('input', ''),
                    output=data['output'],
                    source=str(path)
                )
                if self._add_example(example):
                    added += 1
        return added

    def add_from_documents(
        self,
        documents: list[str],
        instruction_generator: callable
    ) -> int:
        """Generate examples from documents using a generator function."""
        added = 0
        for doc in documents:
            examples = instruction_generator(doc)
            for ex in examples:
                if self._add_example(ex):
                    added += 1
        return added

    def _add_example(self, example: TrainingExample) -> bool:
        """Add example if it passes validation."""
        # Deduplication
        hash_val = example.compute_hash()
        if hash_val in self.seen_hashes:
            return False

        # Quality filtering
        if example.quality_score and example.quality_score < self.min_quality_score:
            return False

        # Length validation
        example.tokens_estimate = self._estimate_tokens(example)
        if example.tokens_estimate > 4096:
            return False

        self.seen_hashes.add(hash_val)
        self.examples.append(example)
        return True

    def _estimate_tokens(self, example: TrainingExample) -> int:
        """Rough token count estimation."""
        total_chars = len(example.instruction) + len(example.input_text) + len(example.output)
        return int(total_chars / 3.5)  # Rough approximation

    def export_openai_format(self, output_path: Path) -> None:
        """Export dataset in OpenAI fine-tuning format."""
        with open(output_path, 'w') as f:
            for example in self.examples:
                json.dump(example.to_chat_format(), f)
                f.write('\n')

    def get_statistics(self) -> dict:
        """Return dataset statistics."""
        if not self.examples:
            return {"count": 0}

        token_counts = [ex.tokens_estimate for ex in self.examples if ex.tokens_estimate]
        return {
            "count": len(self.examples),
            "avg_tokens": sum(token_counts) / len(token_counts),
            "max_tokens": max(token_counts),
            "min_tokens": min(token_counts),
            "sources": list(set(ex.source for ex in self.examples))
        }


# Usage
pipeline = DataPipeline(min_quality_score=0.7)
pipeline.add_from_jsonl(Path("./data/expert_annotations.jsonl"))
pipeline.add_from_jsonl(Path("./data/synthetic_examples.jsonl"))

stats = pipeline.get_statistics()
print(f"Dataset: {stats['count']} examples, avg {stats['avg_tokens']:.0f} tokens")

pipeline.export_openai_format(Path("./training_data.jsonl"))

Quality Scoring and Filtering

Automated quality scoring helps filter problematic examples before they corrupt your model:

Automated Quality Scoring (Python)
import json
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    coherence: float
    completeness: float
    formatting: float
    factual_consistency: float

    @property
    def overall(self) -> float:
        weights = [0.3, 0.25, 0.2, 0.25]
        scores = [self.coherence, self.completeness, self.formatting, self.factual_consistency]
        return sum(w * s for w, s in zip(weights, scores))


class QualityScorer:
    """Score training examples for quality."""

    def __init__(self):
        self.min_output_length = 50
        self.max_output_length = 4000

    def score(self, example: TrainingExample) -> QualityMetrics:
        return QualityMetrics(
            coherence=self._score_coherence(example),
            completeness=self._score_completeness(example),
            formatting=self._score_formatting(example),
            factual_consistency=self._score_factual_consistency(example)
        )

    def _score_coherence(self, example: TrainingExample) -> float:
        """Check if output logically follows from instruction."""
        score = 1.0

        # Penalise very short outputs
        if len(example.output) < self.min_output_length:
            score -= 0.5

        # Penalise truncated outputs
        if example.output.rstrip().endswith(('...', '[', '{')):
            score -= 0.3

        # Check for repetition (sign of generation issues)
        words = example.output.split()
        if len(words) > 10:
            unique_ratio = len(set(words)) / len(words)
            if unique_ratio < 0.3:
                score -= 0.4

        return max(0, score)

    def _score_completeness(self, example: TrainingExample) -> float:
        """Check if output fully addresses the instruction."""
        score = 1.0

        # Length appropriateness
        if len(example.output) < self.min_output_length:
            score -= 0.3

        # Check for common incomplete markers
        incomplete_markers = [
            'I cannot', 'I\'m not sure', 'I don\'t know',
            'TODO', 'TBD', '[placeholder]'
        ]
        for marker in incomplete_markers:
            if marker.lower() in example.output.lower():
                score -= 0.2

        return max(0, score)

    def _score_formatting(self, example: TrainingExample) -> float:
        """Check output formatting consistency."""
        score = 1.0

        # Check JSON validity if output looks like JSON
        if example.output.strip().startswith('{'):
            try:
                json.loads(example.output)
            except json.JSONDecodeError:
                score -= 0.5

        # Check for balanced brackets/quotes
        if example.output.count('(') != example.output.count(')'):
            score -= 0.2
        if example.output.count('[') != example.output.count(']'):
            score -= 0.2

        return max(0, score)

    def _score_factual_consistency(self, example: TrainingExample) -> float:
        """Basic consistency checks."""
        score = 1.0

        # Check for contradictory statements
        contradictions = [
            ('always', 'never'),
            ('all', 'none'),
            ('true', 'false')
        ]

        words_lower = example.output.lower()
        for word1, word2 in contradictions:
            if word1 in words_lower and word2 in words_lower:
                # Context-dependent, mild penalty
                score -= 0.1

        return max(0, score)


# Apply scoring to pipeline
scorer = QualityScorer()
for example in pipeline.examples:
    metrics = scorer.score(example)
    example.quality_score = metrics.overall

For production datasets, augment automated scoring with human review of a representative sample to calibrate thresholds.
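
One way to do that with the pipeline above is to pull a small, score-stratified sample for reviewers; the band cut-offs and per-band sample size below are illustrative assumptions, not recommended values.

Stratified Sample for Human Review (Python)
import random

def sample_for_human_review(examples, per_band: int = 25, seed: int = 42):
    """Draw a score-stratified sample of training examples for manual review."""
    rng = random.Random(seed)
    bands = {"low": [], "mid": [], "high": []}
    for ex in examples:
        score = ex.quality_score or 0.0
        if score < 0.6:
            bands["low"].append(ex)
        elif score < 0.8:
            bands["mid"].append(ex)
        else:
            bands["high"].append(ex)
    # Reviewing the band around the threshold shows whether 0.7 is too strict or too lax
    return {name: rng.sample(items, min(per_band, len(items))) for name, items in bands.items()}


review_sets = sample_for_human_review(pipeline.examples)
print({name: len(items) for name, items in review_sets.items()})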


Training Infrastructure Setup

Choosing between local and cloud training involves trade-offs between cost, convenience, and capability. This section covers both approaches with practical configurations.

Local Training Setup

Local training works well for smaller models (up to 7B parameters with QLoRA) and provides complete data privacy. You need a GPU with at least 24GB VRAM for efficient training.

Local Training Environment Setup (Python)
# requirements.txt for local training
"""
torch>=2.1.0
transformers>=4.36.0
peft>=0.7.0
bitsandbytes>=0.41.0
datasets>=2.15.0
accelerate>=0.25.0
wandb>=0.16.0
trl>=0.7.0
"""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

def setup_local_model(
    model_name: str = "mistralai/Mistral-7B-v0.1",
    quantization: str = "4bit"
) -> tuple:
    """Setup model for local training with quantization."""

    # Configure quantization
    if quantization == "4bit":
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True
        )
    elif quantization == "8bit":
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        bnb_config = None

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Load model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    # Disable caching for training
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    # Print memory usage
    print(f"Model loaded. GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

    return model, tokenizer


# Check GPU availability
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available - training will be very slow")

Cloud Training Configuration

Cloud platforms offer scalability and access to high-end GPUs. Here's a configuration for AWS SageMaker:

AWS SageMaker Training Setup (Python)
import sagemaker
from sagemaker.huggingface import HuggingFace

def create_sagemaker_training_job(
    training_data_s3: str,
    output_s3: str,
    instance_type: str = "ml.g5.2xlarge"
) -> HuggingFace:
    """Configure SageMaker training job for fine-tuning."""

    # Hyperparameters
    hyperparameters = {
        'model_id': 'mistralai/Mistral-7B-v0.1',
        'epochs': 3,
        'per_device_train_batch_size': 4,
        'gradient_accumulation_steps': 4,
        'learning_rate': 2e-4,
        'lora_r': 16,
        'lora_alpha': 32,
        'lora_dropout': 0.05,
        'bf16': True,
        'gradient_checkpointing': True
    }

    # Instance configuration
    instance_configs = {
        "ml.g5.xlarge": {"gpu_memory": 24, "cost_per_hour": 1.41},
        "ml.g5.2xlarge": {"gpu_memory": 24, "cost_per_hour": 1.69},
        "ml.g5.4xlarge": {"gpu_memory": 24, "cost_per_hour": 2.27},
        "ml.p4d.24xlarge": {"gpu_memory": 320, "cost_per_hour": 37.69}
    }

    print(f"Using {instance_type}: {instance_configs[instance_type]}")

    # Create estimator
    huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type=instance_type,
        instance_count=1,
        role=sagemaker.get_execution_role(),
        transformers_version='4.36',
        pytorch_version='2.1',
        py_version='py310',
        hyperparameters=hyperparameters,
        output_path=output_s3,
        disable_profiler=True,
        environment={
            'HUGGINGFACE_HUB_CACHE': '/tmp/hf_cache'
        }
    )

    return huggingface_estimator


# Alternative: RunPod configuration for cost-effective training
RUNPOD_CONFIG = """
# runpod.yaml
gpu: RTX_4090  # or A100_80GB for larger models
vcpu: 8
memory: 32
volume_size: 100
docker_image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
env:
  - WANDB_API_KEY=${WANDB_API_KEY}
  - HF_TOKEN=${HF_TOKEN}
"""

For most use cases, we recommend starting with a cloud provider offering spot/preemptible instances, which can reduce costs by 60-80%.
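
If you train on SageMaker, managed spot training follows the same estimator pattern as the example above. The sketch below uses placeholder S3 bucket paths, and actual savings depend on spot availability in your region.

Managed Spot Training on SageMaker (Python)
import sagemaker
from sagemaker.huggingface import HuggingFace

# Same job shape as create_sagemaker_training_job above, with spot capacity enabled.
# Checkpointing to S3 lets an interrupted job resume rather than restart from scratch.
spot_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.g5.2xlarge',
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version='4.36',
    pytorch_version='2.1',
    py_version='py310',
    use_spot_instances=True,                          # run on spare capacity
    max_run=4 * 3600,                                 # cap on training seconds
    max_wait=8 * 3600,                                # cap including time waiting for capacity
    checkpoint_s3_uri='s3://my-bucket/checkpoints',   # placeholder bucket
    output_path='s3://my-bucket/output'               # placeholder bucket
)
spot_estimator.fit({'training': 's3://my-bucket/training_data'})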

LoRA and QLoRA Fine-Tuning Methods

Low-Rank Adaptation (LoRA) and its quantised variant QLoRA have revolutionised fine-tuning by reducing memory requirements by 90%+ while maintaining quality. These methods train small adapter layers rather than modifying all model weights.

Understanding LoRA

LoRA works by decomposing weight updates into low-rank matrices. Instead of updating a weight matrix W directly, it learns two smaller matrices A and B such that the update ΔW = BA. This dramatically reduces trainable parameters.
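
To make the saving concrete, here is the parameter arithmetic for a single 4096x4096 projection at rank r = 16 (illustrative numbers, not tied to a specific model):

LoRA Parameter Arithmetic (Python)
d_out, d_in, r = 4096, 4096, 16

full_update = d_out * d_in          # updating W directly
lora_update = r * d_in + d_out * r  # training A (r x d_in) and B (d_out x r)

print(f"Full update parameters: {full_update:,}")                 # 16,777,216
print(f"LoRA update parameters: {lora_update:,}")                 # 131,072
print(f"Trainable fraction:     {lora_update / full_update:.2%}") # 0.78%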

LoRA Configuration and Training (Python)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType
)
from transformers import TrainingArguments
from trl import SFTTrainer

def create_lora_config(
    r: int = 16,
    lora_alpha: int = 32,
    target_modules: list[str] = None
) -> LoraConfig:
    """Create LoRA configuration with best practices."""

    # Default target modules for LLaMA-style models
    if target_modules is None:
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
            "gate_proj", "up_proj", "down_proj"       # MLP
        ]

    return LoraConfig(
        r=r,                          # Rank of update matrices
        lora_alpha=lora_alpha,        # Scaling factor
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )


def setup_lora_training(
    model,
    tokenizer,
    train_dataset,
    output_dir: str = "./lora_output"
):
    """Setup complete LoRA training pipeline."""

    # Prepare model for training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA
    lora_config = create_lora_config(r=16, lora_alpha=32)
    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        weight_decay=0.01,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        max_grad_norm=0.3
    )

    # Create trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        args=training_args,
        max_seq_length=2048,
        dataset_text_field="text",
        packing=True  # Efficient packing of short sequences
    )

    return trainer


# Run training
trainer = setup_lora_training(model, tokenizer, train_dataset)
trainer.train()

# Save adapter weights only (small file size)
trainer.model.save_pretrained("./final_adapter")

QLoRA for Memory Efficiency

QLoRA combines 4-bit quantisation with LoRA, enabling fine-tuning of 65B+ parameter models on a single GPU:

QLoRA Configuration (Python)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    prepare_model_for_kbit_training
)

def setup_qlora_model(model_name: str):
    """Setup model with QLoRA - 4-bit quantization + LoRA."""

    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # Normalised float 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True       # Nested quantization
    )

    # Load quantised model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )

    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True
    )

    # QLoRA-specific config
    qlora_config = LoraConfig(
        r=64,              # Higher rank for QLoRA
        lora_alpha=16,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )

    model = get_peft_model(model, qlora_config)

    return model


# Memory comparison
MEMORY_REQUIREMENTS = {
    "7B Full Fine-tune": "~140 GB",
    "7B LoRA (16-bit)": "~28 GB",
    "7B QLoRA (4-bit)": "~6 GB",
    "70B QLoRA (4-bit)": "~48 GB"
}

for method, memory in MEMORY_REQUIREMENTS.items():
    print(f"{method}: {memory}")

QLoRA achieves comparable results to full fine-tuning at a fraction of the compute cost—the sweet spot for most business applications.


Hyperparameter Optimisation

Finding optimal hyperparameters significantly impacts training efficiency and model quality. While defaults work reasonably well, systematic optimisation can improve results by 10-30%.

Key Hyperparameters

Focus optimisation efforts on the highest-impact parameters: learning rate, effective batch size (via gradient accumulation), LoRA rank and alpha, warmup ratio, and weight decay.

Hyperparameter Search with Optuna (Python)
import optuna
from peft import LoraConfig, TaskType, get_peft_model
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

def create_objective(model_init, train_dataset, eval_dataset, tokenizer):
    """Create Optuna objective function for hyperparameter search."""

    def objective(trial: optuna.Trial) -> float:
        # Sample hyperparameters
        learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
        batch_size = trial.suggest_categorical("batch_size", [2, 4, 8])
        warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.1)
        weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
        lora_r = trial.suggest_categorical("lora_r", [8, 16, 32, 64])
        lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64])

        # Calculate gradient accumulation for effective batch size of 16
        grad_accum = 16 // batch_size

        training_args = TrainingArguments(
            output_dir=f"./trials/trial_{trial.number}",
            num_train_epochs=1,  # Short for hyperparameter search
            per_device_train_batch_size=batch_size,
            gradient_accumulation_steps=grad_accum,
            learning_rate=learning_rate,
            warmup_ratio=warmup_ratio,
            weight_decay=weight_decay,
            evaluation_strategy="steps",
            eval_steps=50,
            logging_steps=10,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            bf16=True
        )

        # Initialize model with sampled LoRA config
        model = model_init()
        lora_config = LoraConfig(
            r=lora_r,
            lora_alpha=lora_alpha,
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.05,
            task_type=TaskType.CAUSAL_LM
        )
        model = get_peft_model(model, lora_config)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
        )

        trainer.train()

        # Return best eval loss
        return trainer.state.best_metric

    return objective


def run_hyperparameter_search(
    model_init,
    train_dataset,
    eval_dataset,
    tokenizer,
    n_trials: int = 20
):
    """Run hyperparameter optimisation with Optuna."""

    study = optuna.create_study(
        direction="minimize",
        study_name="llm_fine_tuning",
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5)
    )

    objective = create_objective(model_init, train_dataset, eval_dataset, tokenizer)

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=3600 * 4,  # 4 hour timeout
        show_progress_bar=True
    )

    print(f"Best trial: {study.best_trial.number}")
    print(f"Best params: {study.best_params}")
    print(f"Best loss: {study.best_value:.4f}")

    return study.best_params


# Recommended starting points by model size
RECOMMENDED_HYPERPARAMS = {
    "7B": {
        "learning_rate": 2e-4,
        "batch_size": 4,
        "lora_r": 16,
        "lora_alpha": 32,
        "warmup_ratio": 0.03
    },
    "13B": {
        "learning_rate": 1e-4,
        "batch_size": 2,
        "lora_r": 32,
        "lora_alpha": 64,
        "warmup_ratio": 0.05
    },
    "70B": {
        "learning_rate": 5e-5,
        "batch_size": 1,
        "lora_r": 64,
        "lora_alpha": 128,
        "warmup_ratio": 0.05
    }
}

Start with recommended defaults, run a small hyperparameter search on 10-20% of your data, then train the final model with optimal settings on the full dataset.
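
As a concrete sketch of that workflow, assuming train_dataset is a Hugging Face datasets.Dataset and reusing the search and training helpers defined above:

Hyperparameter Search Workflow Sketch (Python)
# Run the Optuna search on a ~15% shuffled subset, then retrain on the full set.
search_size = int(0.15 * len(train_dataset))
search_subset = train_dataset.shuffle(seed=42).select(range(search_size))

best_params = run_hyperparameter_search(
    model_init, search_subset, eval_dataset, tokenizer, n_trials=20
)
print(f"Carrying forward to the full run: {best_params}")

# Feed best_params (learning rate, LoRA rank/alpha, warmup ratio) into the
# TrainingArguments and LoraConfig used by setup_lora_training for the final model.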

Model Evaluation and Metrics

Evaluating fine-tuned models requires both automated metrics and human assessment. Different use cases prioritise different metrics—instruction following needs different evaluation than code generation.

Automated Evaluation Suite

Comprehensive Evaluation Pipeline (Python)
import json
import re
from dataclasses import dataclass
from typing import Callable

import numpy as np
import torch
import evaluate
from rouge_score import rouge_scorer
from bert_score import score as bert_score

@dataclass
class EvaluationResult:
    metric_name: str
    score: float
    details: dict = None


class ModelEvaluator:
    """Comprehensive evaluation suite for fine-tuned models."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
        self.bleu = evaluate.load("bleu")

    def evaluate_all(
        self,
        test_examples: list[dict],
        custom_metrics: list[Callable] = None
    ) -> dict[str, EvaluationResult]:
        """Run all evaluation metrics on test set."""

        # Generate predictions
        predictions = []
        references = []

        for example in test_examples:
            pred = self._generate(example['input'])
            predictions.append(pred)
            references.append(example['expected_output'])

        results = {}

        # Perplexity
        results['perplexity'] = self._compute_perplexity(test_examples)

        # ROUGE scores
        results['rouge'] = self._compute_rouge(predictions, references)

        # BLEU score
        results['bleu'] = self._compute_bleu(predictions, references)

        # BERTScore for semantic similarity
        results['bertscore'] = self._compute_bertscore(predictions, references)

        # Task-specific accuracy
        results['exact_match'] = self._compute_exact_match(predictions, references)

        # Custom metrics
        if custom_metrics:
            for metric_fn in custom_metrics:
                name = metric_fn.__name__
                results[name] = metric_fn(predictions, references)

        return results

    def _generate(self, prompt: str, max_length: int = 512) -> str:
        """Generate response for evaluation."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_length,
                do_sample=False,
                pad_token_id=self.tokenizer.pad_token_id
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response[len(prompt):].strip()

    def _compute_perplexity(self, examples: list[dict]) -> EvaluationResult:
        """Compute perplexity on test set."""
        total_loss = 0
        total_tokens = 0

        for example in examples:
            text = f"{example['input']} {example['expected_output']}"
            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)

            with torch.no_grad():
                outputs = self.model(**inputs, labels=inputs['input_ids'])
                total_loss += outputs.loss.item() * inputs['input_ids'].shape[1]
                total_tokens += inputs['input_ids'].shape[1]

        perplexity = np.exp(total_loss / total_tokens)
        return EvaluationResult("perplexity", perplexity)

    def _compute_rouge(self, predictions: list, references: list) -> EvaluationResult:
        """Compute ROUGE scores."""
        scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

        for pred, ref in zip(predictions, references):
            result = self.rouge.score(ref, pred)
            for key in scores:
                scores[key].append(result[key].fmeasure)

        avg_scores = {k: np.mean(v) for k, v in scores.items()}
        return EvaluationResult("rouge", avg_scores['rougeL'], avg_scores)

    def _compute_bleu(self, predictions: list, references: list) -> EvaluationResult:
        """Compute BLEU score."""
        result = self.bleu.compute(
            predictions=predictions,
            references=[[r] for r in references]
        )
        return EvaluationResult("bleu", result['bleu'])

    def _compute_bertscore(self, predictions: list, references: list) -> EvaluationResult:
        """Compute BERTScore for semantic similarity."""
        P, R, F1 = bert_score(predictions, references, lang="en")
        return EvaluationResult("bertscore", F1.mean().item())

    def _compute_exact_match(self, predictions: list, references: list) -> EvaluationResult:
        """Compute exact match accuracy."""
        matches = sum(
            1 for p, r in zip(predictions, references)
            if p.strip().lower() == r.strip().lower()
        )
        return EvaluationResult("exact_match", matches / len(predictions))


# Domain-specific custom metrics
def json_validity_score(predictions: list, references: list) -> EvaluationResult:
    """Check if predictions are valid JSON."""
    valid = 0
    for pred in predictions:
        try:
            json.loads(pred)
            valid += 1
        except json.JSONDecodeError:
            pass
    return EvaluationResult("json_validity", valid / len(predictions))


def format_compliance_score(predictions: list, references: list) -> EvaluationResult:
    """Check compliance with expected output format."""
    pattern = r'^\{.*\}$'  # Example: must be wrapped in braces
    matches = sum(1 for p in predictions if re.match(pattern, p.strip(), re.DOTALL))
    return EvaluationResult("format_compliance", matches / len(predictions))


# Usage
evaluator = ModelEvaluator(model, tokenizer)
results = evaluator.evaluate_all(
    test_examples,
    custom_metrics=[json_validity_score, format_compliance_score]
)

for name, result in results.items():
    print(f"{name}: {result.score:.4f}")

Human Evaluation Framework

Automated metrics capture only part of the picture. Human evaluation assesses qualities like helpfulness, safety, and task appropriateness:

Human Evaluation Interface (TypeScript)
interface EvaluationCriteria {
  name: string;
  description: string;
  scale: [number, number];
  guidelines: string[];
}

const EVALUATION_CRITERIA: EvaluationCriteria[] = [
  {
    name: "Accuracy",
    description: "Is the information factually correct?",
    scale: [1, 5],
    guidelines: [
      "1: Contains significant factual errors",
      "3: Mostly accurate with minor issues",
      "5: Completely accurate"
    ]
  },
  {
    name: "Relevance",
    description: "Does the response address the query?",
    scale: [1, 5],
    guidelines: [
      "1: Completely off-topic",
      "3: Partially addresses the query",
      "5: Fully addresses all aspects"
    ]
  },
  {
    name: "Coherence",
    description: "Is the response well-structured and logical?",
    scale: [1, 5],
    guidelines: [
      "1: Incoherent or contradictory",
      "3: Generally logical with some issues",
      "5: Clear, well-organised, logical"
    ]
  },
  {
    name: "Helpfulness",
    description: "Would this response help the user?",
    scale: [1, 5],
    guidelines: [
      "1: Not helpful at all",
      "3: Somewhat helpful",
      "5: Extremely helpful"
    ]
  }
];

interface HumanEvaluation {
  exampleId: string;
  evaluatorId: string;
  modelA: string;
  modelB: string;
  preference: 'A' | 'B' | 'tie';
  scores: Record<string, number>;
  notes?: string;
}

// Small helper to group evaluations by key (e.g. exampleId)
function groupBy<T>(items: T[], keyFn: (item: T) => string): Record<string, T[]> {
  return items.reduce<Record<string, T[]>>((acc, item) => {
    const key = keyFn(item);
    if (!acc[key]) acc[key] = [];
    acc[key].push(item);
    return acc;
  }, {});
}

function calculateInterAnnotatorAgreement(
  evaluations: HumanEvaluation[]
): number {
  // Raw pairwise agreement on preferences (not chance-corrected like Cohen's kappa)
  const grouped = groupBy(evaluations, e => e.exampleId);
  let agreements = 0;
  let total = 0;

  for (const [_, evals] of Object.entries(grouped)) {
    if (evals.length >= 2) {
      for (let i = 0; i < evals.length - 1; i++) {
        for (let j = i + 1; j < evals.length; j++) {
          total++;
          if (evals[i].preference === evals[j].preference) {
            agreements++;
          }
        }
      }
    }
  }

  return total > 0 ? agreements / total : 0;
}

Aim for at least 3 evaluators per example and report inter-annotator agreement alongside results.
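
If you track the same (exampleId, evaluatorId, preference) fields in Python, chance-corrected agreement can be computed with scikit-learn's cohen_kappa_score averaged over evaluator pairs. This is a minimal sketch rather than a full reliability analysis.

Pairwise Cohen's Kappa (Python)
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_cohens_kappa(evaluations: list[dict]) -> float:
    """Average Cohen's kappa across all evaluator pairs that rated the same examples."""
    # ratings[evaluator_id][example_id] = preference ('A', 'B' or 'tie')
    ratings: dict[str, dict[str, str]] = {}
    for e in evaluations:
        ratings.setdefault(e["evaluatorId"], {})[e["exampleId"]] = e["preference"]

    kappas = []
    for rater_a, rater_b in combinations(ratings, 2):
        shared = sorted(set(ratings[rater_a]) & set(ratings[rater_b]))
        if len(shared) >= 2:
            kappas.append(cohen_kappa_score(
                [ratings[rater_a][ex] for ex in shared],
                [ratings[rater_b][ex] for ex in shared],
            ))
    return sum(kappas) / len(kappas) if kappas else 0.0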

Deployment Strategies

Deploying fine-tuned models requires balancing latency, cost, and reliability. This section covers production deployment patterns from simple to sophisticated.

Basic Deployment with vLLM

vLLM Deployment Setuppython
1# vLLM provides high-throughput inference with PagedAttention
2from vllm import LLM, SamplingParams
3
4def deploy_with_vllm(
5    base_model: str,
6    lora_path: str,
7    tensor_parallel_size: int = 1
8):
9    """Deploy fine-tuned model with vLLM for production."""
10
11    # Load model with LoRA adapter
12    llm = LLM(
13        model=base_model,
14        enable_lora=True,
15        max_lora_rank=64,
16        tensor_parallel_size=tensor_parallel_size,
17        gpu_memory_utilization=0.9,
18        max_model_len=4096
19    )
20
21    # Default sampling parameters
22    sampling_params = SamplingParams(
23        temperature=0.7,
24        top_p=0.9,
25        max_tokens=512,
26        stop=["</s>", "\n\n"]
27    )
28
29    return llm, sampling_params
30
31
32# FastAPI wrapper for production
33from fastapi import FastAPI, HTTPException
34from pydantic import BaseModel
35import asyncio
36
37app = FastAPI()
38
39class GenerationRequest(BaseModel):
40    prompt: str
41    max_tokens: int = 512
42    temperature: float = 0.7
43
44class GenerationResponse(BaseModel):
45    text: str
46    tokens_generated: int
47    latency_ms: float
48
49@app.post("/generate", response_model=GenerationResponse)
50async def generate(request: GenerationRequest):
51    import time
52    start = time.time()
53
54    params = SamplingParams(
55        temperature=request.temperature,
56        max_tokens=request.max_tokens
57    )
58
59    outputs = llm.generate([request.prompt], params)
60
61    latency = (time.time() - start) * 1000
62
63    return GenerationResponse(
64        text=outputs[0].outputs[0].text,
65        tokens_generated=len(outputs[0].outputs[0].token_ids),
66        latency_ms=latency
67    )
68
69
70# Docker deployment
71DOCKERFILE = """
72FROM vllm/vllm-openai:latest
73
74# Copy model and adapter
75COPY ./model /app/model
76COPY ./adapter /app/adapter
77
78# Set environment
79ENV MODEL_PATH=/app/model
80ENV LORA_PATH=/app/adapter
81
82# Health check
83HEALTHCHECK --interval=30s --timeout=10s \
84  CMD curl -f http://localhost:8000/health || exit 1
85
86CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
87     "--model", "$MODEL_PATH", \
88     "--enable-lora", \
89     "--lora-modules", "custom=$LORA_PATH"]
90"""

Kubernetes Deployment with Auto-scaling

Kubernetes Deployment Configurationyaml
1apiVersion: apps/v1
2kind: Deployment
3metadata:
4  name: llm-inference
5  labels:
6    app: llm-inference
7spec:
8  replicas: 2
9  selector:
10    matchLabels:
11      app: llm-inference
12  template:
13    metadata:
14      labels:
15        app: llm-inference
16    spec:
17      containers:
18      - name: vllm
19        image: your-registry/llm-inference:latest
20        resources:
21          limits:
22            nvidia.com/gpu: 1
23            memory: "32Gi"
24          requests:
25            nvidia.com/gpu: 1
26            memory: "24Gi"
27        ports:
28        - containerPort: 8000
29        env:
30        - name: CUDA_VISIBLE_DEVICES
31          value: "0"
32        readinessProbe:
33          httpGet:
34            path: /health
35            port: 8000
36          initialDelaySeconds: 60
37          periodSeconds: 10
38        livenessProbe:
39          httpGet:
40            path: /health
41            port: 8000
42          initialDelaySeconds: 120
43          periodSeconds: 30
44---
45apiVersion: autoscaling/v2
46kind: HorizontalPodAutoscaler
47metadata:
48  name: llm-inference-hpa
49spec:
50  scaleTargetRef:
51    apiVersion: apps/v1
52    kind: Deployment
53    name: llm-inference
54  minReplicas: 1
55  maxReplicas: 10
56  metrics:
57  - type: Resource
58    resource:
59      name: cpu
60      target:
61        type: Utilization
62        averageUtilization: 70
63  - type: Pods
64    pods:
65      metric:
66        name: inference_queue_length
67      target:
68        type: AverageValue
69        averageValue: "10"
70---
71apiVersion: v1
72kind: Service
73metadata:
74  name: llm-inference-service
75spec:
76  selector:
77    app: llm-inference
78  ports:
79  - port: 80
80    targetPort: 8000
81  type: LoadBalancer

For production workloads, implement request queuing, graceful degradation, and A/B testing infrastructure to safely roll out model updates.
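
As a minimal sketch of the A/B piece, a weighted router can sit in front of two generation callables. The variant names, traffic weights, and generator functions below are placeholders, not vLLM or Kubernetes APIs.

Weighted A/B Routing Sketch (Python)
import random
from typing import Awaitable, Callable

# Traffic split between the current production adapter and a candidate rollout.
VARIANTS: dict[str, float] = {"production": 0.9, "candidate": 0.1}

def pick_variant(weights: dict[str, float] = VARIANTS) -> str:
    """Pick a variant name proportionally to its traffic weight."""
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

async def generate_with_ab(
    prompt: str,
    generators: dict[str, Callable[[str], Awaitable[str]]],
) -> tuple[str, str]:
    """Route one request to a variant; return (variant, completion) for logging/metrics."""
    variant = pick_variant()
    completion = await generators[variant](prompt)
    return variant, completion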


Cost Optimisation Techniques

Fine-tuning costs can escalate quickly without careful management. These strategies help maximise value from your training budget.

Training Cost Reduction

Cost-Optimised Training Configurationpython
1from dataclasses import dataclass
2from typing import Optional
3
4@dataclass
5class CostOptimisedConfig:
6    """Configuration optimised for cost-effective training."""
7
8    # Use gradient checkpointing to reduce memory (allows larger batches)
9    gradient_checkpointing: bool = True
10
11    # Mixed precision training
12    bf16: bool = True
13
14    # Efficient optimiser
15    optim: str = "paged_adamw_8bit"
16
17    # Gradient accumulation to simulate larger batches
18    gradient_accumulation_steps: int = 8
19
20    # Early stopping to avoid overtraining
21    early_stopping_patience: int = 3
22
23    def estimate_training_cost(
24        self,
25        num_examples: int,
26        epochs: int,
27        instance_type: str = "g5.xlarge",
28        cloud: str = "aws"
29    ) -> dict:
30        """Estimate training cost."""
31
32        # Instance costs per hour (approximate)
33        instance_costs = {
34            "aws": {
35                "g5.xlarge": 1.006,
36                "g5.2xlarge": 1.212,
37                "p4d.24xlarge": 32.77
38            },
39            "gcp": {
40                "a2-highgpu-1g": 3.67,
41                "a2-highgpu-8g": 29.39
42            }
43        }
44
45        # Training speed estimates (examples per hour)
46        throughput = {
47            "g5.xlarge": 500,    # 7B model with QLoRA
48            "g5.2xlarge": 700,
49            "p4d.24xlarge": 5000
50        }
51
52        cost_per_hour = instance_costs[cloud][instance_type]
53        examples_per_hour = throughput.get(instance_type, 500)
54
55        total_examples = num_examples * epochs
56        hours_needed = total_examples / examples_per_hour
57        total_cost = hours_needed * cost_per_hour
58
59        # Add 20% buffer for evaluation and checkpointing
60        total_cost *= 1.2
61
62        return {
63            "estimated_hours": hours_needed,
64            "cost_per_hour": cost_per_hour,
65            "total_cost": total_cost,
66            "cost_per_example": total_cost / total_examples
67        }
68
69
70def use_spot_instances(base_config: dict) -> dict:
71    """Configure for spot/preemptible instances with checkpointing."""
72
73    return {
74        **base_config,
75        # Save frequently for spot instance recovery
76        "save_strategy": "steps",
77        "save_steps": 100,
78        "save_total_limit": 3,
79
80        # Resume from checkpoint
81        "resume_from_checkpoint": True,
82
83        # Reduce per-checkpoint size
84        "save_only_model": True
85    }
86
87
88# Cost comparison analysis
89def compare_training_approaches(num_examples: int = 10000):
90    """Compare costs between different training approaches."""
91
92    approaches = {
93        "Full Fine-tune (A100)": {
94            "instance": "p4d.24xlarge",
95            "hours": num_examples * 3 / 5000,  # 3 epochs
96            "cost_per_hour": 32.77
97        },
98        "LoRA (A10G)": {
99            "instance": "g5.2xlarge",
100            "hours": num_examples * 3 / 700,
101            "cost_per_hour": 1.212
102        },
103        "QLoRA (A10G)": {
104            "instance": "g5.xlarge",
105            "hours": num_examples * 3 / 500,
106            "cost_per_hour": 1.006
107        },
108        "QLoRA Spot (A10G)": {
109            "instance": "g5.xlarge",
110            "hours": num_examples * 3 / 500,
111            "cost_per_hour": 0.35  # ~65% discount
112        }
113    }
114
115    print(f"Cost comparison for {num_examples} examples, 3 epochs:\n")
116    for name, config in approaches.items():
117        total = config['hours'] * config['cost_per_hour']
118        print(f"{name}:")
119        print(f"  Hours: {config['hours']:.1f}")
120        print(f"  Total: ${total:.2f}")
121        print()
122
123
124compare_training_approaches()

Inference Cost Optimisation

Inference costs often exceed training costs over time. Key optimisation strategies:

  • Quantisation: Deploy 4-bit or 8-bit models for 2-4x cost reduction
  • Batching: Process multiple requests together for better GPU utilisation
  • Caching: Cache common queries and embeddings (see the cache sketch after this list)
  • Model distillation: Train smaller models on larger model outputs
  • Speculative decoding: Use small models to draft, large models to verify
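
For example, a minimal in-process cache keyed on the normalised prompt can short-circuit repeat queries. Production systems would more likely use Redis or a semantic cache, and the TTL below is an arbitrary illustration.

Simple Response Cache Sketch (Python)
import hashlib
import time

class ResponseCache:
    """Tiny in-memory cache for identical prompts (swap for Redis or a semantic cache in production)."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl_seconds = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get(self, prompt: str) -> str | None:
        entry = self._store.get(self._key(prompt))
        if entry and time.time() - entry[0] < self.ttl_seconds:
            return entry[1]
        return None

    def set(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (time.time(), response)


# Usage: check the cache before calling the model, store the result afterwards
cache = ResponseCache()
if (response := cache.get("Summarise invoice 123")) is None:
    response = "..."  # placeholder for the actual model call
    cache.set("Summarise invoice 123", response)
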
Inference Cost Tracking (Python)
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class InferenceCostTracker:
    """Track and optimise inference costs."""

    cost_per_1k_tokens: float = 0.002
    requests: list = field(default_factory=list)
    cache_hits: int = 0
    cache_misses: int = 0

    def log_request(
        self,
        input_tokens: int,
        output_tokens: int,
        latency_ms: float,
        cached: bool = False
    ):
        """Log a single inference request."""
        if cached:
            self.cache_hits += 1
            cost = 0
        else:
            self.cache_misses += 1
            cost = (input_tokens + output_tokens) / 1000 * self.cost_per_1k_tokens

        self.requests.append({
            'timestamp': datetime.now(),
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'latency_ms': latency_ms,
            'cost': cost,
            'cached': cached
        })

    def get_daily_report(self) -> dict:
        """Generate daily cost report."""
        today = datetime.now().date()
        today_requests = [
            r for r in self.requests
            if r['timestamp'].date() == today
        ]

        if not today_requests:
            return {"message": "No requests today"}

        total_cost = sum(r['cost'] for r in today_requests)
        total_tokens = sum(r['input_tokens'] + r['output_tokens'] for r in today_requests)
        avg_latency = sum(r['latency_ms'] for r in today_requests) / len(today_requests)

        return {
            'date': str(today),
            'total_requests': len(today_requests),
            'total_cost': f"${total_cost:.4f}",
            'total_tokens': total_tokens,
            'avg_latency_ms': f"{avg_latency:.1f}",
            'cache_hit_rate': f"{self.cache_hits / (self.cache_hits + self.cache_misses) * 100:.1f}%",
            'projected_monthly': f"${total_cost * 30:.2f}"
        }

    def get_optimisation_recommendations(self) -> list[str]:
        """Suggest cost optimisations based on usage patterns."""
        recommendations = []

        # Check cache effectiveness
        if self.cache_hits + self.cache_misses > 100:
            hit_rate = self.cache_hits / (self.cache_hits + self.cache_misses)
            if hit_rate < 0.2:
                recommendations.append(
                    "Low cache hit rate. Consider caching common queries."
                )

        # Check for long inputs that could be shortened
        recent = self.requests[-1000:] if len(self.requests) > 1000 else self.requests
        avg_input = sum(r['input_tokens'] for r in recent) / len(recent)
        if avg_input > 1000:
            recommendations.append(
                f"High avg input tokens ({avg_input:.0f}). "
                "Consider shorter prompts or context compression."
            )

        # Check for potential batching
        # (simplified - would need timestamp analysis in production)
        if len(recent) > 100:
            recommendations.append(
                "High request volume. Consider request batching for efficiency."
            )

        return recommendations


# Usage
tracker = InferenceCostTracker(cost_per_1k_tokens=0.002)

# Log some requests
tracker.log_request(500, 200, 150.5, cached=False)
tracker.log_request(500, 200, 15.2, cached=True)

print(tracker.get_daily_report())
print(tracker.get_optimisation_recommendations())

Conclusion

Fine-tuning transforms general-purpose language models into specialised tools that understand your domain and deliver consistent, high-quality results. The techniques covered—from data preparation through LoRA/QLoRA training to production deployment—provide a complete toolkit for building custom AI capabilities.

Key success factors include: starting with high-quality data rather than quantity, using parameter-efficient methods like QLoRA to reduce costs, implementing comprehensive evaluation that combines automated metrics with human assessment, and deploying with proper monitoring and cost tracking. For most business applications, fine-tuning on 1,000-10,000 carefully curated examples using QLoRA delivers excellent results at reasonable cost.

Frequently Asked Questions

  • How much training data do I need for fine-tuning?
  • Should I use LoRA or full fine-tuning?
  • What hardware do I need for fine-tuning?
  • How long does fine-tuning take?
  • How do I know if fine-tuning worked?
  • Can I fine-tune models from OpenAI or Anthropic?
  • What is the cost of fine-tuning?
  • How do I prevent the model from forgetting general knowledge?

Ready to Implement?

This guide provides the knowledge, but implementation requires expertise. Our team has done this 500+ times and can get you production-ready in weeks.

✓ FT Fast 500 APAC Winner  ✓ 500+ Implementations  ✓ Results in Weeks