Learn how to build a production-ready RAG (Retrieval Augmented Generation) system from scratch with practical code examples, architecture patterns, and best practices.
Building a RAG (Retrieval Augmented Generation) system might seem daunting, but with the right approach and understanding of the core components, you can have a functional system up and running in a matter of hours. This comprehensive guide walks you through every step of building your first RAG system, from document processing to production deployment.
Whether you're building a customer support chatbot that needs to reference your knowledge base, an internal search tool for company documents, or a research assistant that can query vast amounts of information, RAG provides the foundation for accurate, grounded AI responses.
In this guide, we'll build a complete RAG system using modern tools and best practices. You'll learn not just the "how" but also the "why" behind each decision, so you can adapt the approach to your specific needs.
Before diving into code, let's understand what we're building. A RAG system has five core components that work together to provide accurate, contextual responses:
Document processing (ingest PDFs, Word files, web pages) → chunking (split into semantic chunks) → embedding (convert to vector representations) → vector storage (store for fast similarity search) → query processing (retrieve and generate responses)
Document Processing Layer: Raw documents (PDFs, Word files, web pages) are ingested, cleaned, and prepared. The key challenge is extracting meaningful text while preserving important structure and context.
Chunking Strategy: Large documents are broken into smaller, semantically meaningful chunks. This is critical because LLMs have context limits. Common approaches include fixed-size chunking, semantic chunking (splitting at natural boundaries such as paragraphs and sections), and recursive chunking (splitting hierarchically, falling back from larger to smaller separators).
Embedding Generation: Each chunk is converted into a vector representing its semantic meaning. Similar concepts produce similar vectors, enabling semantic search. Use models like OpenAI's text-embedding-3-large or open-source BGE.
Vector Storage: Embeddings are stored in a vector database for fast similarity search across millions of vectors. Options include Pinecone (managed), Qdrant, Weaviate, and ChromaDB.
Query Processing: User questions are converted to embeddings, similar chunks retrieved, and provided as context to an LLM to generate grounded, accurate responses.
Here's how these components work together when a user asks a question:
User submits a question → the question is converted to a vector → similar document chunks are retrieved → a prompt is built with the retrieved chunks → the LLM creates a grounded answer → the response is returned with source references
This architecture ensures responses are grounded in your actual documents rather than the LLM's training data, dramatically reducing hallucinations and providing verifiable information.
Let's get your development environment ready. We'll use Python for this implementation, as it has the richest ecosystem of AI and ML libraries.
You'll need a recent version of Python 3 (3.9 or later is recommended) with pip installed on your system, plus an OpenAI API key for the embedding and chat models used below.
Create a new directory for your project and set up a virtual environment:
# Create project directory
mkdir my-rag-system
cd my-rag-system
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install required packages
pip install langchain chromadb openai pypdf python-dotenv tiktoken
Organize your project with a clean structure:
my-rag-system/
├── data/ # Raw documents to process
│ ├── documents/ # Your source documents
│ └── processed/ # Processed chunks (optional)
├── src/
│ ├── ingestion.py # Document processing
│ ├── embedding.py # Embedding generation
│ ├── retrieval.py # Search and retrieval
│ └── generation.py # LLM response generation
├── main.py # Main application entry point
├── .env # Environment variables (API keys)
└── requirements.txt # Python dependencies
Create a .env file to store your API keys securely:
OPENAI_API_KEY=your_openai_api_key_here
VECTOR_DB_PATH=./chroma_db
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
Never commit this file to version control. Add it to your .gitignore immediately.
The quality of your RAG system starts with proper document processing. Let's build a robust ingestion pipeline that handles multiple document types.
Start by creating src/ingestion.py:
import os
from pathlib import Path
from typing import List
from langchain.document_loaders import (
PyPDFLoader,
TextLoader,
UnstructuredWordDocumentLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
class DocumentProcessor:
def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
def load_documents(self, directory: str) -> List[Document]:
"""Load all documents from a directory."""
documents = []
path = Path(directory)
for file_path in path.rglob('*'):
if file_path.is_file():
try:
docs = self._load_single_file(str(file_path))
documents.extend(docs)
print(f"Loaded: {file_path.name}")
except Exception as e:
print(f"Error loading {file_path.name}: {e}")
return documents
def _load_single_file(self, file_path: str) -> List[Document]:
"""Load a single file based on its extension."""
extension = Path(file_path).suffix.lower()
if extension == '.pdf':
loader = PyPDFLoader(file_path)
elif extension == '.txt':
loader = TextLoader(file_path)
elif extension in ['.doc', '.docx']:
loader = UnstructuredWordDocumentLoader(file_path)
else:
raise ValueError(f"Unsupported file type: {extension}")
return loader.load()
def chunk_documents(self, documents: List[Document]) -> List[Document]:
"""Split documents into chunks."""
chunks = self.text_splitter.split_documents(documents)
# Add metadata to each chunk
for i, chunk in enumerate(chunks):
chunk.metadata['chunk_id'] = i
chunk.metadata['total_chunks'] = len(chunks)
return chunks
The RecursiveCharacterTextSplitter is one of the most effective chunking strategies. It progressively falls back through split methods: it first tries paragraph breaks ("\n\n"), then line breaks ("\n"), then sentence boundaries (". "), then spaces, and finally individual characters, so each chunk stays as semantically coherent as possible.
The overlap parameter is crucial. Setting chunk_overlap=200 means each chunk shares 200 characters with the previous chunk, ensuring that context isn't lost at chunk boundaries.
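To see the effect concretely, here's a tiny standalone experiment (the sample text and the deliberately small sizes are illustrative only); with settings this small you should see text from the end of one chunk repeated at the start of the next:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunk_size and chunk_overlap so the repeated text is easy to spot.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=30)
sample = (
    "Our return window is 30 days from delivery. Items must be in their original "
    "packaging. Refunds are issued to the original payment method within 5-7 "
    "business days of receiving the return."
)
for i, chunk in enumerate(splitter.split_text(sample)):
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk)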
Now let's use our document processor:
# In main.py
from src.ingestion import DocumentProcessor
from dotenv import load_dotenv
import os
load_dotenv()
def main():
# Initialize processor
processor = DocumentProcessor(
chunk_size=int(os.getenv('CHUNK_SIZE', 1000)),
chunk_overlap=int(os.getenv('CHUNK_OVERLAP', 200))
)
# Load documents
print("Loading documents...")
documents = processor.load_documents('./data/documents')
print(f"Loaded {len(documents)} documents")
# Chunk documents
print("Chunking documents...")
chunks = processor.chunk_documents(documents)
print(f"Created {len(chunks)} chunks")
return chunks
if __name__ == "__main__":
chunks = main()
Now that we have clean, chunked documents, we need to convert them into embeddings: vector representations that capture semantic meaning.
An embedding is a dense vector (typically 768 to 3072 dimensions) where each number represents some aspect of the text's meaning. Similar concepts produce similar vectors, which we can measure using cosine similarity or other distance metrics.
For example, questions like "How do I return an item?" and "What's your refund policy?" would produce very similar embeddings, even though they share almost no words.
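Under the hood, "similar" just means the vectors point in nearly the same direction. A minimal cosine-similarity helper in plain NumPy (independent of the rest of the project) looks like this:
import numpy as np

def cosine_similarity(a, b):
    """1.0 means identical direction, 0.0 unrelated, -1.0 opposite meaning."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))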
Several high-quality options are available: OpenAI's text-embedding-3-small and text-embedding-3-large, plus open-source models such as BGE that you can host yourself.
For this guide, we'll use OpenAI's text-embedding-3-small as it offers the best balance of cost and performance.
Create src/embedding.py:
from typing import List
from openai import OpenAI
import os
class EmbeddingGenerator:
def __init__(self, model: str = "text-embedding-3-small"):
self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
self.model = model
self.dimensions = 1536 # text-embedding-3-small dimension
def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
"""Generate embeddings for a list of texts."""
# OpenAI API can handle up to 2048 texts per request
# but we'll batch smaller to be safe
batch_size = 100
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
response = self.client.embeddings.create(
model=self.model,
input=batch
)
batch_embeddings = [item.embedding for item in response.data]
all_embeddings.extend(batch_embeddings)
print(f"Generated embeddings for {i + len(batch)}/{len(texts)} texts")
return all_embeddings
def generate_query_embedding(self, query: str) -> List[float]:
"""Generate embedding for a single query."""
response = self.client.embeddings.create(
model=self.model,
input=[query]
)
return response.data[0].embedding
Use the same model for documents and queries: this is essential. If you embed your documents with model A but queries with model B, the vectors won't be comparable, and retrieval will fail.
Handle rate limits: OpenAI enforces per-minute request and token limits that vary by account tier. The batching approach above helps stay within them.
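If you do hit the limits anyway, a simple exponential backoff is usually enough. A minimal sketch, assuming the openai v1 client (which raises openai.RateLimitError):
import time
import openai

def embed_batch_with_retry(client, model, batch, max_retries=5):
    """Call the embeddings endpoint, backing off exponentially when rate-limited."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(model=model, input=batch)
            return [item.embedding for item in response.data]
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying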
Cost optimization: Embedding 1 million tokens costs ~$0.02 with text-embedding-3-small. For a typical knowledge base of 10,000 chunks averaging 500 tokens each, that's just $0.10.
Caching embeddings: Once you generate embeddings, store them with your vectors. You only need to re-embed when documents change.
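A lightweight way to do that is to key a small on-disk cache on a hash of the chunk text; the cached_embedding helper and the .embedding_cache directory below are illustrative names, not part of the modules above:
import hashlib
import json
import os

def cached_embedding(text, embed_fn, cache_dir=".embedding_cache"):
    """Return a cached embedding for this exact text, computing and storing it on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    vector = embed_fn(text)  # e.g., embedder.generate_query_embedding
    with open(path, "w") as f:
        json.dump(vector, f)
    return vector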
With embeddings generated, we need a way to store them and perform fast similarity searches. This is where vector databases come in.
Traditional databases store and retrieve data by exact matches or ranges. Vector databases are optimized for finding "similar" vectors using distance metrics like cosine similarity or Euclidean distance. They use specialized indexing algorithms (HNSW, IVF, etc.) to search billions of vectors in milliseconds.
For this guide, we'll use ChromaDB because it:
For production systems, consider Pinecone (managed), Qdrant (self-hosted or managed), or Weaviate.
Create src/vector_store.py:
import chromadb
from chromadb.config import Settings
from typing import List, Dict, Any
import os
class VectorStore:
def __init__(self, persist_directory: str = "./chroma_db"):
# Initialize ChromaDB client with on-disk persistence
# (recent ChromaDB versions expose this via PersistentClient)
self.client = chromadb.PersistentClient(
path=persist_directory,
settings=Settings(anonymized_telemetry=False)
)
# Create or get collection
self.collection = self.client.get_or_create_collection(
name="documents",
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
def add_documents(
self,
documents: List[str],
embeddings: List[List[float]],
metadatas: List[Dict[str, Any]],
ids: List[str]
):
"""Add documents with their embeddings to the vector store."""
# ChromaDB can handle batches, but let's chunk to be safe
batch_size = 100
for i in range(0, len(documents), batch_size):
end_idx = min(i + batch_size, len(documents))
self.collection.add(
documents=documents[i:end_idx],
embeddings=embeddings[i:end_idx],
metadatas=metadatas[i:end_idx],
ids=ids[i:end_idx]
)
print(f"Added {end_idx}/{len(documents)} documents to vector store")
def similarity_search(
self,
query_embedding: List[float],
n_results: int = 5
) -> Dict[str, Any]:
"""Search for similar documents."""
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=n_results,
include=["documents", "metadatas", "distances"]
)
return results
def get_collection_stats(self) -> Dict[str, Any]:
"""Get statistics about the collection."""
return {
"count": self.collection.count(),
"name": self.collection.name
}
Now let's put it all together to ingest documents into our vector store. Update main.py:
from src.ingestion import DocumentProcessor
from src.embedding import EmbeddingGenerator
from src.vector_store import VectorStore
from dotenv import load_dotenv
import os
load_dotenv()
def ingest_documents():
# Initialize components
processor = DocumentProcessor()
embedder = EmbeddingGenerator()
vector_store = VectorStore(persist_directory=os.getenv('VECTOR_DB_PATH'))
# Load and chunk documents
print("Loading documents...")
documents = processor.load_documents('./data/documents')
chunks = processor.chunk_documents(documents)
# Extract text and metadata
texts = [chunk.page_content for chunk in chunks]
metadatas = [chunk.metadata for chunk in chunks]
ids = [f"chunk_{i}" for i in range(len(chunks))]
# Generate embeddings
print("Generating embeddings...")
embeddings = embedder.generate_embeddings(texts)
# Store in vector database
print("Storing in vector database...")
vector_store.add_documents(
documents=texts,
embeddings=embeddings,
metadatas=metadatas,
ids=ids
)
# Print stats
stats = vector_store.get_collection_stats()
print(f"\nIngestion complete!")
print(f"Total chunks in database: {stats['count']}")
if __name__ == "__main__":
ingest_documents()
Run this script once to populate your vector database. The data persists to disk, so you don't need to re-run it unless your documents change.
Now we can store and search vectors, but we need to make retrieval intelligent. This involves query processing, re-ranking, and managing context.
Create src/retrieval.py:
from typing import List, Dict, Any
from src.embedding import EmbeddingGenerator
from src.vector_store import VectorStore
class Retriever:
def __init__(self, vector_store: VectorStore, embedder: EmbeddingGenerator):
self.vector_store = vector_store
self.embedder = embedder
def retrieve(
self,
query: str,
top_k: int = 5,
similarity_threshold: float = 0.7
) -> List[Dict[str, Any]]:
"""Retrieve relevant documents for a query."""
# Generate query embedding
query_embedding = self.embedder.generate_query_embedding(query)
# Search vector store
results = self.vector_store.similarity_search(
query_embedding=query_embedding,
n_results=top_k * 2 # Retrieve more than needed for filtering
)
# Process and filter results
processed_results = []
for i in range(len(results['documents'][0])):
# ChromaDB returns cosine distance (lower is better)
# Convert to similarity score (higher is better)
distance = results['distances'][0][i]
similarity = 1 - distance # Cosine similarity
if similarity >= similarity_threshold:
processed_results.append({
'content': results['documents'][0][i],
'metadata': results['metadatas'][0][i],
'similarity': similarity
})
# Sort by similarity and return top_k
processed_results.sort(key=lambda x: x['similarity'], reverse=True)
return processed_results[:top_k]
def format_context(self, results: List[Dict[str, Any]]) -> str:
"""Format retrieved documents into context for LLM."""
if not results:
return "No relevant information found."
context_parts = []
for i, result in enumerate(results, 1):
source = result['metadata'].get('source', 'Unknown')
content = result['content']
context_parts.append(f"[Source {i}: {source}]\n{content}")
return "\n\n".join(context_parts)The basic retrieval above works, but several techniques can improve quality:
1. Hybrid Search
Combine vector similarity with keyword search (BM25). This catches both semantic matches and exact term matches. Many vector databases support hybrid search natively.
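If your vector database doesn't expose hybrid search directly, you can blend the two signals yourself. Here's a rough sketch using the rank-bm25 package (pip install rank-bm25); the hybrid_scores name, the naive whitespace tokenization, and the alpha weight are illustrative choices, not part of the project code above:
from rank_bm25 import BM25Okapi

def hybrid_scores(query, chunks, vector_scores, alpha=0.5):
    """Blend vector similarity with BM25 keyword relevance; higher combined score = better match."""
    bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
    keyword_scores = bm25.get_scores(query.lower().split())
    top = float(keyword_scores.max()) or 1.0  # avoid dividing by zero when nothing matches
    return [
        alpha * vec + (1 - alpha) * (kw / top)  # normalize BM25 into roughly [0, 1]
        for vec, kw in zip(vector_scores, keyword_scores)
    ]
Tune alpha against your evaluation set; equal weighting is only a starting point.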
2. Re-ranking
Retrieve more candidates (e.g., top 20), then use a cross-encoder model to re-rank based on query-document relevance. This is slower but more accurate than pure vector search.
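As a sketch, here's what cross-encoder re-ranking can look like with the sentence-transformers package (pip install sentence-transformers); the candidates are assumed to be the dicts returned by Retriever.retrieve above, and in a real service you'd load the model once at startup rather than per query:
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=5):
    """Score each (query, chunk) pair with a cross-encoder and keep the highest-scoring chunks."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, candidate["content"]) for candidate in candidates]
    scores = model.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]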
3. Query Expansion
Before embedding, expand the query with synonyms or related terms. For example, "return policy" becomes "return policy refund exchange money-back guarantee".
4. Metadata Filtering
Pre-filter by metadata before similarity search. For example, only search documents from the "returns" category or from the last 6 months.
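With ChromaDB this is just a where clause on the query call. A minimal variant of similarity_search, assuming you add a category field to chunk metadata during ingestion (the ingestion code above doesn't do this by default):
def similarity_search_filtered(self, query_embedding, n_results=5, category=None):
    """Like similarity_search, but optionally restrict results to chunks tagged with a category."""
    where = {"category": category} if category else None
    return self.collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where,  # metadata filter applied before similarity ranking
        include=["documents", "metadatas", "distances"]
    )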
Here's a simple query expansion technique:
def expand_query(self, query: str) -> str:
"""Expand query with LLM-generated variations."""
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": "Generate 2-3 semantic variations of the user's query. Return only the variations, one per line."
}, {
"role": "user",
"content": query
}],
temperature=0.3,
max_tokens=100
)
variations = response.choices[0].message.content.strip().split('\n')
# Combine original query with variations
expanded = query + " " + " ".join(variations)
return expanded
Use this technique sparingly, as it adds latency and cost. It works best for short, ambiguous queries.
This is where everything comes together. We retrieve relevant context and use it to generate accurate, grounded responses.
Create src/generation.py:
from typing import List, Dict, Any
from openai import OpenAI
import os
class ResponseGenerator:
def __init__(self, model: str = "gpt-4o-mini"):
self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
self.model = model
def generate_response(
self,
query: str,
context: str,
include_sources: bool = True
) -> Dict[str, Any]:
"""Generate a response using retrieved context."""
system_prompt = """You are a helpful AI assistant. Answer the user's question using ONLY the information provided in the context below.
If the context doesn't contain enough information to answer the question, say "I don't have enough information to answer that question" rather than making up an answer.
Always cite which source(s) you used by referencing [Source 1], [Source 2], etc.
Be concise but complete in your answers."""
user_prompt = f"""Context:
{context}
Question: {query}
Answer:"""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.3, # Lower temperature for more focused answers
max_tokens=500
)
answer = response.choices[0].message.content
return {
"answer": answer,
"model": self.model,
"tokens_used": response.usage.total_tokens
}
The system prompt is critical. Key elements: restrict the model to the provided context only, give it an explicit "I don't have enough information" escape hatch instead of guessing, require source citations ([Source 1], [Source 2], and so on), and ask for answers that are concise but complete.
Now let's create the full query pipeline that combines retrieval and generation:
# In main.py
from src.retrieval import Retriever
from src.generation import ResponseGenerator
from src.vector_store import VectorStore
from src.embedding import EmbeddingGenerator
def query_rag_system(question: str):
"""Query the RAG system and get an answer."""
# Initialize components
vector_store = VectorStore()
embedder = EmbeddingGenerator()
retriever = Retriever(vector_store, embedder)
generator = ResponseGenerator()
# Retrieve relevant documents
print("Retrieving relevant documents...")
results = retriever.retrieve(question, top_k=5)
print(f"Found {len(results)} relevant chunks")
for i, result in enumerate(results, 1):
print(f" {i}. Similarity: {result['similarity']:.3f}")
# Format context
context = retriever.format_context(results)
# Generate response
print("\nGenerating response...")
response = generator.generate_response(
query=question,
context=context
)
print(f"\nAnswer:\n{response['answer']}")
print(f"\nTokens used: {response['tokens_used']}")
return response
# Interactive query loop
if __name__ == "__main__":
print("RAG System Ready! Type 'quit' to exit.\n")
while True:
question = input("Your question: ")
if question.lower() in ['quit', 'exit']:
break
query_rag_system(question)
print("\n" + "-"*80 + "\n")Run this script and you have a fully functional RAG system!
A working system isn't enough—it needs to be reliable, fast, and accurate. Let's implement testing and optimization strategies.
Build a set of test questions with known correct answers:
# test_questions.py
test_cases = [
{
"question": "What is your return policy for electronics?",
"expected_info": ["30 days", "original packaging", "receipt"],
"expected_sources": ["returns_policy.pdf"]
},
{
"question": "How do I reset my password?",
"expected_info": ["click forgot password", "email link", "create new password"],
"expected_sources": ["user_guide.pdf"]
},
# Add more test cases
]
def evaluate_rag_system():
"""Evaluate RAG system performance."""
correct = 0
total = len(test_cases)
for test in test_cases:
response = query_rag_system(test["question"])
answer = response["answer"].lower()
# Check if expected information is present
info_present = sum(
1 for info in test["expected_info"]
if info.lower() in answer
)
score = info_present / len(test["expected_info"])
if score >= 0.7: # 70% threshold
correct += 1
print(f"Question: {test['question']}")
print(f"Score: {score:.2%}\n")
accuracy = correct / total
print(f"Overall accuracy: {accuracy:.2%}")
return accuracy
Retrieval Quality: are the right chunks being returned for each test question (precision and recall of retrieved sources against the expected sources)?
Generation Quality: are answers factually accurate, grounded in the retrieved context, and properly cited?
System Performance: how long does each query take end to end, and how many tokens (and how much money) does each answer consume?
1. Improve Chunking
If answers seem incomplete, try larger chunks (e.g., 1,500 characters with the splitter above) or different splitting strategies. If too much irrelevant information appears, try smaller chunks (e.g., 500 characters).
2. Adjust Retrieval Parameters
Increase top_k if relevant information is being missed. Increase similarity_threshold if too much irrelevant information is retrieved.
3. Enhance Metadata
Add section titles, document types, dates to chunks. Use metadata filtering to narrow search space.
4. Implement Caching
Cache embeddings for common queries, and cache LLM responses for identical questions with a TTL (see the sketch after this list).
5. Switch Models
Test different embedding models (text-embedding-3-large vs small). Try different LLMs (GPT-4o for complex questions, GPT-4o-mini for simple ones).
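Here's the response cache from item 4 as a minimal in-memory sketch; the cached_answer name and the one-hour TTL are arbitrary, and anything multi-process should use an external store such as Redis instead:
import hashlib
import time

_response_cache = {}

def cached_answer(question, answer_fn, ttl_seconds=3600):
    """Serve identical questions from cache until the TTL expires, then regenerate."""
    key = hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()
    hit = _response_cache.get(key)
    if hit and time.time() - hit["ts"] < ttl_seconds:
        return hit["answer"]
    answer = answer_fn(question)  # e.g., query_rag_system(question)["answer"]
    _response_cache[key] = {"answer": answer, "ts": time.time()}
    return answer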
Once deployed, track: query volume and latency, token usage and cost per query, retrieval hit rates, and user feedback ratings on answers.
Set up alerts for anomalies and review low-rated responses weekly to identify gaps in your knowledge base.
Building your first RAG system, you'll likely encounter these challenges. Here's how to avoid or solve them.
Symptoms: The system retrieves chunks that seem related but don't actually answer the question.
Causes and solutions: chunks that are too large dilute the signal and chunks that are too small lose context, so revisit chunk_size and chunk_overlap; a low similarity_threshold lets weak matches through, so raise it; and for queries that hinge on exact terms, add hybrid search or re-ranking as described earlier.
Symptoms: System says "I don't have enough information" even though the answer exists in your documents.
Causes and solutions: the similarity_threshold may be too strict or top_k too low, so loosen them; users often phrase questions very differently from the documents, which query expansion helps with; and confirm the relevant document was actually ingested and chunked sensibly in the first place.
Symptoms: Queries take 5+ seconds to return answers.
Causes and solutions: latency comes from the embedding call, the vector search, and above all the LLM call. Cache embeddings and repeated responses, reduce top_k and the amount of context you send, route simple questions to a faster model such as gpt-4o-mini, and only re-rank when you need the extra accuracy.
Symptoms: LLM provides information not in the retrieved context.
Causes and solutions: tighten the system prompt so the model answers only from the provided context, keep temperature low, require source citations, and spot-check generated answers against the retrieved chunks during evaluation.
Symptoms: API bills are higher than expected.
Causes and solutions: long contexts and large models drive most of the spend. Cap max_tokens, trim the number and size of retrieved chunks, cache embeddings and repeated answers, and reserve the more expensive models for questions that actually need them.
Symptoms: System provides outdated information when documents have been updated.
Causes and solutions: the vector database is still serving chunks from old document versions. Re-ingest documents whenever they change: delete the stale chunks for that source, then re-chunk, re-embed, and re-add the updated content. Tracking a content hash per source file makes changed documents easy to detect.
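A minimal sketch of that refresh flow, reusing the classes built earlier; the file_fingerprint helper is an assumption, and it relies on the loaders storing the file path under the source metadata key (which the LangChain loaders used above do):
import hashlib
from pathlib import Path

def file_fingerprint(path):
    # Hash the raw file bytes so a changed document gets fresh chunk IDs.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def refresh_document(vector_store, embedder, processor, source_path):
    """Delete stale chunks for one source file, then re-chunk, re-embed, and re-add it."""
    # Remove existing chunks whose metadata points at this file.
    vector_store.collection.delete(where={"source": source_path})
    # Re-process just this file with the same pipeline as the initial ingestion.
    docs = processor._load_single_file(source_path)
    chunks = processor.chunk_documents(docs)
    texts = [chunk.page_content for chunk in chunks]
    embeddings = embedder.generate_embeddings(texts)
    ids = [f"{file_fingerprint(source_path)}_{i}" for i in range(len(chunks))]
    vector_store.add_documents(texts, embeddings, [chunk.metadata for chunk in chunks], ids)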
Congratulations! You've built a complete, production-ready RAG system from scratch. You now understand not just how to implement RAG, but why each component works the way it does and how to optimize for your specific use case.
This foundation will serve you well as you scale. The patterns and practices covered here—intelligent chunking, effective retrieval, grounded generation, comprehensive testing—apply whether you're building a simple internal tool or a customer-facing application serving millions of users.
Remember that RAG is not a one-size-fits-all solution. The optimal configuration depends on your documents, query patterns, and accuracy requirements. Use the testing and optimization strategies we've covered to iteratively improve your system based on real user feedback.
The RAG system you've built is production-ready for many use cases, but there's always room for enhancement. Consider implementing hybrid search, re-ranking, query routing, and other advanced techniques as your needs grow.