Comprehensive guide to testing AI applications. Learn evaluation frameworks, test dataset creation, automated testing, regression detection, and quality assurance for production LLM systems.
Testing AI systems is fundamentally different from testing traditional software. You can't write a unit test that checks "assert response == expected_output" when the same input might produce multiple correct responses. LLMs are non-deterministic, outputs are natural language (subjective quality), and edge cases are infinite.
Yet testing is more important for AI systems, not less. A bug in traditional code affects predictable scenarios. A bad prompt or model degradation in an AI system affects thousands of users in unpredictable ways. The challenge is building test frameworks that account for AI's probabilistic nature while still catching regressions and ensuring quality.
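In practice this means asserting properties of the output instead of exact strings: required facts are present, the length is sane, and the response lands close enough to a reference answer in embedding space. Here is a minimal sketch of that idea, assuming a hypothetical generate() function for the system under test and using sentence-transformers for the similarity check:

from sentence_transformers import SentenceTransformer, util

# Hypothetical system under test; replace with your own client call.
def generate(prompt: str) -> str:
    ...

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def test_refund_answer_is_semantically_correct():
    response = generate("How do I request a refund?")
    reference = "You can request a refund from the billing page within 30 days of purchase."

    # Property checks instead of exact match
    assert "refund" in response.lower()
    assert len(response) < 1000

    # Semantic similarity against a known-good reference answer
    similarity = util.cos_sim(
        _embedder.encode(response, convert_to_tensor=True),
        _embedder.encode(reference, convert_to_tensor=True),
    ).item()
    assert similarity > 0.6  # threshold tuned on known-good and known-bad examples

The similarity threshold is not universal; calibrate it on examples you have already judged by hand.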
This guide provides a comprehensive testing strategy for AI applications: creating diverse test datasets that represent real usage, building automated evaluation pipelines, detecting regressions when models or prompts change, implementing A/B testing in production, and establishing human review processes for subjective quality. Whether you're building RAG systems, chatbots, or classification models, these testing patterns will help you ship with confidence.
Quality testing starts with quality test data. Let's build datasets that represent real-world usage.
A well-balanced test set includes happy-path examples, edge cases, adversarial inputs, and regression cases from past failures (the categories used in the TestCase dataclass below):
from dataclasses import dataclass
from typing import List, Dict, Any
import json

@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    category: str    # happy_path, edge_case, adversarial, regression
    difficulty: str  # easy, medium, hard
    metadata: Dict[str, Any]
    tags: List[str]
class TestDatasetBuilder:
    def __init__(self):
        self.test_cases = []

    def add_case(self, test_case: TestCase):
        """Add a test case to the dataset."""
        self.test_cases.append(test_case)

    def add_from_production(self, production_logs, num_samples=100):
        """Sample real production queries as test cases."""
        # Get diverse sample
        samples = self._diverse_sample(production_logs, num_samples)

        for i, log in enumerate(samples):
            # Get human verification of correct output
            expected = self._get_human_verified_output(log['input'])

            self.add_case(TestCase(
                id=f"prod_{i}",
                input=log['input'],
                expected_output=expected,
                category="production_sample",
                difficulty=self._estimate_difficulty(log['input']),
                metadata={"source": "production", "timestamp": log['timestamp']},
                tags=self._extract_tags(log['input'])
            ))
    def add_edge_cases(self):
        """Add common edge cases."""
        edge_cases = [
            # Empty/minimal input
            TestCase(
                id="edge_empty",
                input="",
                expected_output="I need more information to help you.",
                category="edge_case",
                difficulty="medium",
                metadata={},
                tags=["empty_input"]
            ),
            # Very long input
            TestCase(
                id="edge_long",
                input="a " * 5000,  # 5000 words
                expected_output="Your message is very long. Could you summarize your question?",
                category="edge_case",
                difficulty="medium",
                metadata={"length": 10000},
                tags=["long_input"]
            ),
            # Special characters
            TestCase(
                id="edge_special_chars",
                input="Hello!!! $$$ @#%^&* ???",
                expected_output="Hello! How can I help you today?",
                category="edge_case",
                difficulty="easy",
                metadata={},
                tags=["special_characters"]
            ),
            # Multiple languages
            TestCase(
                id="edge_multilingual",
                input="Hello, こんにちは, Bonjour",
                expected_output="Hello! I can help you in English. How can I assist?",
                category="edge_case",
                difficulty="hard",
                metadata={},
                tags=["multilingual"]
            )
        ]

        for case in edge_cases:
            self.add_case(case)

    def save(self, filepath):
        """Save test dataset to file."""
        with open(filepath, 'w') as f:
            json.dump([vars(tc) for tc in self.test_cases], f, indent=2)

    def load(self, filepath):
        """Load test dataset from file."""
        with open(filepath, 'r') as f:
            data = json.load(f)
        self.test_cases = [TestCase(**item) for item in data]
# Build test dataset
builder = TestDatasetBuilder()

# Add manual test cases
builder.add_case(TestCase(
    id="classify_billing_1",
    input="I was charged twice for my subscription",
    expected_output="billing",
    category="happy_path",
    difficulty="easy",
    metadata={"expected_category": "billing"},
    tags=["classification", "billing"]
))

# Add edge cases
builder.add_edge_cases()

# Sample from production
builder.add_from_production(production_logs, num_samples=50)

# Save
builder.save("test_dataset.json")

from pathlib import Path
from datetime import datetime

class TestDatasetVersion:
    """Version control for test datasets."""

    def __init__(self, base_path="./test_datasets"):
        self.base_path = Path(base_path)
        self.base_path.mkdir(exist_ok=True)

    def save_version(self, test_cases, version_name):
        """Save a versioned test dataset."""
        version_file = self.base_path / f"{version_name}.json"
        with open(version_file, 'w') as f:
            json.dump({
                "version": version_name,
                "created_at": datetime.now().isoformat(),
                "count": len(test_cases),
                "test_cases": [vars(tc) for tc in test_cases]
            }, f, indent=2)
        print(f"Saved version '{version_name}' with {len(test_cases)} test cases")

    def load_version(self, version_name):
        """Load a specific version."""
        version_file = self.base_path / f"{version_name}.json"
        with open(version_file, 'r') as f:
            data = json.load(f)
        return [TestCase(**tc) for tc in data["test_cases"]]
# Usage
versioner = TestDatasetVersion()

# Save current test set
versioner.save_version(test_cases, "v1.0_baseline")

# Later, after adding more cases
versioner.save_version(updated_test_cases, "v1.1_added_edge_cases")

# Load specific version for regression testing
baseline_tests = versioner.load_version("v1.0_baseline")

Automated testing enables continuous quality assurance. Let's build evaluation pipelines.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import numpy as np

class ClassificationEvaluator:
    def __init__(self, llm_client, prompt_template):
        self.client = llm_client
        self.prompt_template = prompt_template

    def evaluate(self, test_cases):
        """Evaluate classification performance."""
        predictions = []
        ground_truth = []

        for test in test_cases:
            # Get prediction
            prompt = self.prompt_template.format(input=test.input)
            prediction = self.client.generate(prompt).strip().lower()

            predictions.append(prediction)
            ground_truth.append(test.expected_output.lower())

        # Calculate metrics
        accuracy = accuracy_score(ground_truth, predictions)
        precision, recall, f1, _ = precision_recall_fscore_support(
            ground_truth, predictions, average='weighted'
        )

        # Confusion matrix
        cm = confusion_matrix(ground_truth, predictions)

        return {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
            "confusion_matrix": cm.tolist(),
            "predictions": list(zip(ground_truth, predictions))
        }

    def identify_failure_patterns(self, results):
        """Identify common failure patterns."""
        failures = [
            (gt, pred) for gt, pred in results["predictions"]
            if gt != pred
        ]

        # Group by failure type
        failure_patterns = {}
        for gt, pred in failures:
            key = f"{gt} -> {pred}"
            failure_patterns[key] = failure_patterns.get(key, 0) + 1

        # Sort by frequency
        sorted_failures = sorted(
            failure_patterns.items(),
            key=lambda x: x[1],
            reverse=True
        )

        return sorted_failures

# Usage
evaluator = ClassificationEvaluator(llm_client, prompt_template)
results = evaluator.evaluate(test_cases)

print(f"Accuracy: {results['accuracy']:.1%}")
print(f"F1 Score: {results['f1_score']:.3f}")

# Find common mistakes
patterns = evaluator.identify_failure_patterns(results)
print("\nMost common failures:")
for pattern, count in patterns[:5]:
print(f" {pattern}: {count} times")from rouge import Rouge
from bert_score import score as bert_score
class GenerationEvaluator:
def __init__(self, llm_client):
self.client = llm_client
self.rouge = Rouge()
def evaluate(self, test_cases):
"""Evaluate generation quality."""
predictions = []
references = []
for test in test_cases:
prediction = self.client.generate(test.input)
predictions.append(prediction)
references.append(test.expected_output)
# ROUGE scores (overlap-based)
rouge_scores = self.rouge.get_scores(predictions, references, avg=True)
# BERTScore (semantic similarity)
P, R, F1 = bert_score(predictions, references, lang="en", verbose=False)
return {
"rouge_1": rouge_scores["rouge-1"]["f"],
"rouge_2": rouge_scores["rouge-2"]["f"],
"rouge_l": rouge_scores["rouge-l"]["f"],
"bert_score_precision": P.mean().item(),
"bert_score_recall": R.mean().item(),
"bert_score_f1": F1.mean().item()
}
# Usage
gen_evaluator = GenerationEvaluator(llm_client)
results = gen_evaluator.evaluate(test_cases)

print(f"ROUGE-L: {results['rouge_l']:.3f}")
print(f"BERTScore F1: {results['bert_score_f1']:.3f}")

class RAGEvaluator:
    """Evaluate RAG system performance."""

    def evaluate_retrieval(self, test_cases, rag_system):
        """Evaluate retrieval quality."""
        metrics = {
            "recall_at_5": [],
            "mrr": [],  # Mean Reciprocal Rank
            "avg_relevance_score": []
        }

        for test in test_cases:
            # Get retrieved documents
            retrieved = rag_system.retrieve(test.input, top_k=5)

            # Check if relevant docs were retrieved
            relevant_doc_ids = test.metadata.get("relevant_docs", [])
            if not relevant_doc_ids:
                continue  # skip cases without labeled relevant documents
            retrieved_ids = [doc["id"] for doc in retrieved]

            # Calculate recall@5
            recall = len(set(retrieved_ids) & set(relevant_doc_ids)) / len(relevant_doc_ids)
            metrics["recall_at_5"].append(recall)

            # Calculate MRR
            for i, doc_id in enumerate(retrieved_ids, 1):
                if doc_id in relevant_doc_ids:
                    metrics["mrr"].append(1.0 / i)
                    break
            else:
                metrics["mrr"].append(0.0)

            # Average relevance score
            avg_score = np.mean([doc["score"] for doc in retrieved])
            metrics["avg_relevance_score"].append(avg_score)

        return {
            "recall_at_5": np.mean(metrics["recall_at_5"]),
            "mrr": np.mean(metrics["mrr"]),
            "avg_relevance_score": np.mean(metrics["avg_relevance_score"])
        }
    def evaluate_answer_quality(self, test_cases, rag_system):
        """Evaluate generated answer quality."""
        faithfulness_scores = []
        relevance_scores = []

        for test in test_cases:
            answer = rag_system.generate_answer(test.input)

            # Faithfulness: is the answer grounded in the retrieved context?
            faithfulness = self._check_faithfulness(
                answer,
                rag_system.last_retrieved_context
            )
            faithfulness_scores.append(faithfulness)

            # Relevance: does the answer address the question?
            relevance = self._check_relevance(answer, test.input)
            relevance_scores.append(relevance)

        return {
            "faithfulness": np.mean(faithfulness_scores),
            "relevance": np.mean(relevance_scores)
        }

    def _check_faithfulness(self, answer, context):
        """Use an LLM to check if the answer is faithful to the context."""
        prompt = f"""Does this answer contain only information from the context?

Context: {context}

Answer: {answer}

Respond with just a number 0-1 where:
- 1.0 = completely faithful, all claims supported by context
- 0.5 = partially faithful, some unsupported claims
- 0.0 = not faithful, makes claims not in context

Score:"""

        # Assumes a module-level `llm` client, as used elsewhere in this guide
        score = float(llm.generate(prompt).strip())
        return score

    def _check_relevance(self, answer, question):
        """Check if the answer addresses the question (assumed implementation, mirroring _check_faithfulness)."""
        prompt = f"""Does this answer address the question?

Question: {question}

Answer: {answer}

Respond with just a number between 0 (irrelevant) and 1 (fully addresses the question).

Score:"""
        return float(llm.generate(prompt).strip())

Catch regressions before they reach production. Automated regression testing is critical.
class RegressionDetector:
    def __init__(self, baseline_results_path):
        self.baseline = self._load_baseline(baseline_results_path)

    def _load_baseline(self, path):
        """Load baseline test results."""
        with open(path, 'r') as f:
            return json.load(f)

    def detect_regression(self, current_results, threshold=0.05):
        """Detect if performance regressed vs baseline."""
        regressions = []

        for metric in ["accuracy", "f1_score", "rouge_l"]:
            if metric not in self.baseline or metric not in current_results:
                continue

            baseline_value = self.baseline[metric]
            current_value = current_results[metric]

            # Check for regression
            delta = baseline_value - current_value
            if delta > threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_value,
                    "current": current_value,
                    "delta": delta,
                    "severity": "critical" if delta > threshold * 2 else "warning"
                })

        return regressions

    def generate_report(self, current_results):
        """Generate regression test report."""
        regressions = self.detect_regression(current_results)

        if not regressions:
            return "✅ No regressions detected. All metrics within acceptable range."

        report = "⚠️ REGRESSIONS DETECTED\n\n"
        for reg in regressions:
            report += f"{reg['severity'].upper()}: {reg['metric']}\n"
            report += f"  Baseline: {reg['baseline']:.3f}\n"
            report += f"  Current: {reg['current']:.3f}\n"
            report += f"  Delta: {reg['delta']:.3f}\n\n"

        return report
# Usage
detector = RegressionDetector("baseline_results_v1.0.json")

# Run tests with new prompt/model
current_results = run_evaluation(test_cases, new_prompt)

# Check for regressions
report = detector.generate_report(current_results)
print(report)

if detector.detect_regression(current_results):
    # Block deployment (RegressionError is a custom exception)
    raise RegressionError("Performance regression detected!")

These checks belong in CI: a GitHub Actions workflow can run the regression suite on every pull request that touches prompts or LLM code.

name: AI System Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install -r requirements-test.txt

      - name: Run regression tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m pytest tests/test_ai_system.py --verbose

      - name: Compare with baseline
        run: |
          python scripts/regression_check.py

      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ AI regression tests failed. Review the test results before merging.'
            })

class ProductionMonitor:
"""Monitor AI system performance in production."""
def __init__(self, test_suite):
self.test_suite = test_suite
self.results_history = []
def run_hourly_tests(self):
"""Run subset of tests every hour in production."""
# Select fast, representative test cases
quick_tests = [
tc for tc in self.test_suite
if tc.metadata.get("quick_test", False)
]
# Run evaluation
results = self.evaluate(quick_tests)
# Store results with timestamp
results["timestamp"] = datetime.now().isoformat()
self.results_history.append(results)
# Check for drift
if len(self.results_history) >= 24: # 1 day of hourly tests
self._detect_drift()
return results
def _detect_drift(self):
"""Detect performance drift over time."""
# Get last 24 hours of results
recent = self.results_history[-24:]
# Calculate average accuracy
recent_avg = np.mean([r["accuracy"] for r in recent])
# Compare with last week
week_ago = self.results_history[-168:-144] # 7 days ago, 24 hour window
if week_ago:
week_avg = np.mean([r["accuracy"] for r in week_ago])
# Check for drift
if recent_avg < week_avg * 0.95: # 5% drop
self._alert_drift(recent_avg, week_avg)
def _alert_drift(self, current, baseline):
"""Send alert about performance drift."""
message = f"""
🚨 AI Performance Drift Detected
Current accuracy: {current:.1%}
Baseline (1 week ago): {baseline:.1%}
Drop: {(baseline - current):.1%}
This may indicate:
- Model degradation
- Data distribution shift
- Prompt/config changes
"""
send_alert_to_slack(message)
# Schedule in production
import schedule
monitor = ProductionMonitor(test_suite)
schedule.every().hour.do(monitor.run_hourly_tests)
while True:
schedule.run_pending()
time.sleep(60)The ultimate test is real users. A/B testing validates changes in production.
import hashlib

class ABTest:
    def __init__(self, name, variant_a, variant_b, split=0.5):
        self.name = name
        self.variant_a = variant_a
        self.variant_b = variant_b
        self.split = split
        self.results = {"A": [], "B": []}

    def assign_variant(self, user_id):
        """Deterministically assign user to variant."""
        # Hash user_id to get a consistent assignment
        hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return "A" if (hash_value % 100) < (self.split * 100) else "B"

    def get_response(self, user_id, input_text):
        """Get response from the assigned variant."""
        variant = self.assign_variant(user_id)

        if variant == "A":
            response = self.variant_a.generate(input_text)
        else:
            response = self.variant_b.generate(input_text)

        # Track the assignment
        self.results[variant].append({
            "user_id": user_id,
            "input": input_text,
            "response": response,
            "timestamp": datetime.now()
        })

        return response, variant
    def collect_feedback(self, user_id, variant, feedback):
        """Collect user feedback (thumbs up/down, rating, etc.)."""
        # Find the interaction
        for interaction in self.results[variant]:
            if interaction["user_id"] == user_id:
                interaction["feedback"] = feedback
                break

    def analyze(self):
        """Analyze A/B test results."""
        analysis = {}

        for variant in ["A", "B"]:
            interactions = self.results[variant]
            feedback = [i.get("feedback") for i in interactions if "feedback" in i]

            if feedback:
                # Calculate metrics (assuming binary feedback: 1 = positive, 0 = negative)
                analysis[variant] = {
                    "total_interactions": len(interactions),
                    "feedback_count": len(feedback),
                    "response_rate": len(feedback) / len(interactions),
                    "positive_rate": sum(feedback) / len(feedback),
                    "sample_responses": [i["response"] for i in interactions[:3]]
                }

        # Statistical significance test (requires feedback for both variants)
        if (
            "A" in analysis and "B" in analysis
            and analysis["A"]["feedback_count"] > 30
            and analysis["B"]["feedback_count"] > 30
        ):
            p_value = self._calculate_significance(
                analysis["A"]["positive_rate"],
                analysis["B"]["positive_rate"],
                analysis["A"]["feedback_count"],
                analysis["B"]["feedback_count"]
            )
            analysis["p_value"] = p_value
            analysis["significant"] = p_value < 0.05

        return analysis

    def _calculate_significance(self, rate_a, rate_b, n_a, n_b):
        """Calculate statistical significance with a two-proportion z-test."""
        # proportions_ztest lives in statsmodels, not scipy
        from statsmodels.stats.proportion import proportions_ztest

        successes = [round(rate_a * n_a), round(rate_b * n_b)]
        trials = [n_a, n_b]

        z_stat, p_value = proportions_ztest(successes, trials)
        return p_value
# Usage
test = ABTest(
    name="prompt_v1_vs_v2",
    variant_a=PromptV1(),
    variant_b=PromptV2(),
    split=0.5
)

# In production
user_id = request.user.id
response, variant = test.get_response(user_id, user_input)

# Later, collect feedback
test.collect_feedback(user_id, variant, feedback=1)  # Positive

# Analyze after 1000+ interactions
results = test.analyze()
if results.get("significant"):
    winner = "A" if results["A"]["positive_rate"] > results["B"]["positive_rate"] else "B"
    print(f"Variant {winner} is statistically significantly better!")

Some aspects of quality require human judgment. Let's structure human evaluation effectively.
class HumanEvaluationTool:
    def __init__(self, test_cases):
        self.test_cases = test_cases
        self.evaluations = []

    def evaluate_sample(self, sample_size=20):
        """Present random sample for human evaluation."""
        import random
        sample = random.sample(self.test_cases, min(sample_size, len(self.test_cases)))

        for i, test in enumerate(sample, 1):
            print(f"\n{'='*80}")
            print(f"Evaluation {i}/{len(sample)}")
            print(f"{'='*80}")

            # Get AI response
            response = llm.generate(test.input)

            print(f"\nInput: {test.input}")
            print(f"\nAI Response: {response}")
            print(f"\nExpected: {test.expected_output}")

            # Collect ratings
            ratings = self._collect_ratings()

            self.evaluations.append({
                "test_id": test.id,
                "input": test.input,
                "response": response,
                "expected": test.expected_output,
                "ratings": ratings
            })

    def _collect_ratings(self):
        """Collect ratings from human evaluator."""
        criteria = {
            "accuracy": "Is the information correct? (1-5)",
            "helpfulness": "Does it solve the user's problem? (1-5)",
            "tone": "Is the tone appropriate? (1-5)",
            "clarity": "Is it clear and well-written? (1-5)"
        }

        ratings = {}
        for criterion, description in criteria.items():
            while True:
                try:
                    rating = int(input(f"{description}: "))
                    if 1 <= rating <= 5:
                        ratings[criterion] = rating
                        break
                except ValueError:
                    pass

        return ratings

    def generate_report(self):
        """Generate human evaluation report."""
        if not self.evaluations:
            return "No evaluations collected."

        # Calculate average scores
        avg_scores = {}
        for criterion in ["accuracy", "helpfulness", "tone", "clarity"]:
            scores = [e["ratings"][criterion] for e in self.evaluations]
            avg_scores[criterion] = sum(scores) / len(scores)

        report = "Human Evaluation Report\n"
        report += "="*50 + "\n\n"
        report += "Average Scores (1-5):\n"
        for criterion, score in avg_scores.items():
            report += f"  {criterion.capitalize()}: {score:.2f}\n"

        # Identify low-scoring examples
        low_scoring = [
            e for e in self.evaluations
            if sum(e["ratings"].values()) / len(e["ratings"]) < 3.0
        ]

        if low_scoring:
            report += f"\n{len(low_scoring)} examples scored below 3.0 average.\n"

        return report
# Usage
evaluator = HumanEvaluationTool(test_cases)
evaluator.evaluate_sample(sample_size=20)
print(evaluator.generate_report())

Testing AI systems requires a multi-faceted approach: comprehensive test datasets covering happy paths and edge cases, automated evaluation pipelines for continuous quality checks, regression detection to catch performance degradation, A/B testing with real users in production, and human evaluation for subjective quality dimensions.
The key is building testing into your development workflow from day one. Every prompt change should run through regression tests. Every model upgrade should be A/B tested. Every new feature should expand the test suite. Testing AI systems is more work than testing traditional software, but the investment pays off in reliability, user satisfaction, and confidence in your system.
Start simple: build a test set of 50-100 cases covering your most important scenarios, set up automated evaluation that runs on every change, and establish a baseline. Then incrementally add sophistication: expand test coverage, implement production monitoring, add A/B testing, and incorporate human evaluation. With robust testing, you can iterate quickly while maintaining quality.
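As a concrete starting point, a single pytest file can serve as that regression gate. The sketch below assumes the test_dataset.json, baseline results file, and ClassificationEvaluator from earlier sections; the my_app import paths are hypothetical placeholders for your own code:

import json
import pytest

from my_app.llm import llm_client, prompt_template                # hypothetical module path
from my_app.evaluation import ClassificationEvaluator, TestCase   # classes from this guide

with open("baseline_results_v1.0.json") as f:
    BASELINE = json.load(f)

@pytest.fixture(scope="session")
def dataset():
    with open("test_dataset.json") as f:
        return [TestCase(**item) for item in json.load(f)]

def test_no_accuracy_regression(dataset):
    evaluator = ClassificationEvaluator(llm_client, prompt_template)
    results = evaluator.evaluate(dataset)

    # Allow a small tolerance for run-to-run noise; block anything worse
    assert results["accuracy"] >= BASELINE["accuracy"] - 0.05
    assert results["f1_score"] >= BASELINE["f1_score"] - 0.05

Wire this into the CI workflow shown earlier and every prompt or model change gets checked against the baseline before it merges.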