Build a RAG Chatbot from Scratch: Step-by-Step Python Tutorial
Build a Retrieval Augmented Generation chatbot in Python from scratch. Covers document loading, chunking, embeddings, vector databases, retrieval chains, chat history, evaluation, and common pitfalls.
You have documents -- PDFs, markdown files, internal wikis -- and you want an AI chatbot that answers questions using those documents. Not hallucinated answers from training data, but grounded answers that reference your actual content. That's RAG: Retrieval Augmented Generation.
The concept is straightforward. When a user asks a question, you search your documents for relevant chunks, stuff those into the prompt as context, and let the LLM generate an answer based on that context. Let's build one from scratch.
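That loop can be sketched in a few lines, independent of any particular vector store or model. Here, search and generate are stand-ins for the retrieval and LLM calls we build in the steps below:

```python
def rag_answer(question: str, search, generate, top_k: int = 5) -> str:
    # search: question -> ranked list of relevant text chunks from your documents
    # generate: prompt -> answer string from the LLM
    context = "\n\n".join(search(question)[:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

Everything that follows is filling in real implementations of search and generate.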
Why RAG Beats Fine-Tuning
Fine-tuning modifies the model's weights. It's expensive, slow, requires ML expertise, and the model can still hallucinate. When your documents change, you fine-tune again. RAG keeps the model untouched and injects relevant context at query time: cheaper, faster to set up, easy to update, and the answers are grounded in actual source text.

For most use cases -- support bots, doc search, knowledge bases -- RAG is the right choice.
Setup
pip install openai chromadb langchain-text-splitters pypdf sentence-transformers
Step 1: Load Documents
from pathlib import Path
from pypdf import PdfReader
def load_documents(directory: str) -> list[dict]:
    documents = []
    for path in Path(directory).rglob("*"):
        if path.suffix == ".pdf":
            # extract_text() can return None for image-only pages
            text = "".join(p.extract_text() or "" for p in PdfReader(str(path)).pages)
        elif path.suffix in (".md", ".txt"):
            text = path.read_text(encoding="utf-8")
        else:
            continue
        documents.append({"text": text, "source": str(path)})
    return documents
docs = load_documents("./knowledge_base")
Step 2: Split into Chunks
This is more important than most tutorials let on. Chunks too large flood the LLM with irrelevant text. Too small and you lose context.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = []
for doc in docs:
    for i, text in enumerate(splitter.split_text(doc["text"])):
        chunks.append({"text": text, "source": doc["source"], "index": i})
| Chunk Size | Good For | Risk |
|---|---|---|
| 200-400 | FAQ-style docs, precise retrieval | Loses broader context |
| 500-1000 | General purpose | Good balance |
| 1000-2000 | Long-form technical docs | May include irrelevant content |
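To make overlap concrete, here's a minimal sliding-window splitter. This is a simplification -- RecursiveCharacterTextSplitter above prefers paragraph and sentence boundaries -- but the overlap arithmetic is the same idea:

```python
def sliding_chunks(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    # Advance by (size - overlap) so consecutive chunks share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

A fact that straddles a chunk boundary still appears intact in one of the two overlapping chunks, as long as it's shorter than the overlap.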
Step 3: Embed and Store in Vector DB
Embeddings convert text into vectors that capture semantic meaning. Similar texts end up close together in vector space.
from sentence_transformers import SentenceTransformer
import chromadb
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("knowledge_base")
texts = [c["text"] for c in chunks]
embeddings = embedding_model.encode(texts).tolist()
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=texts,
    metadatas=[{"source": c["source"]} for c in chunks],
)
We're using all-MiniLM-L6-v2 -- small, fast, free, runs locally. For production with higher accuracy, consider OpenAI's text-embedding-3-small.
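"Close together" usually means cosine similarity. A self-contained sketch of the metric -- with real embeddings you'd compare embedding_model.encode(...) vectors; the toy vectors below are just for illustration:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for embeddings of related vs. unrelated sentences:
related = cosine_sim([0.9, 0.1, 0.2], [0.8, 0.2, 0.3])
unrelated = cosine_sim([0.9, 0.1, 0.2], [0.1, 0.9, 0.1])
```

Note that Chroma reports distances rather than similarities, so in its query results lower means closer.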
Step 4: Retrieve Relevant Chunks
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_emb = embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    return [
        {"text": results["documents"][0][i],
         "source": results["metadatas"][0][i]["source"],
         "distance": results["distances"][0][i]}
        for i in range(len(results["documents"][0]))
    ]
Step 5: Generate Grounded Answers
from openai import OpenAI
llm = OpenAI()
def ask(question: str) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n---\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in chunks
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Answer using only the provided context. "
                "If the context doesn't contain the answer, say "
                "'I don't have enough information.' Cite sources."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
The system prompt is critical. Without explicit instructions to only use the provided context, the LLM will mix in training data. The "say I don't know" instruction prevents hallucination.
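One cheap way to verify the guardrail works: ask questions your documents can't answer and check for the refusal string. A sketch, where ask_fn is any function shaped like ask above and the refusal string must match whatever your system prompt dictates:

```python
def unguarded_questions(ask_fn, off_topic: list[str],
                        refusal: str = "I don't have enough information") -> list[str]:
    # Returns the off-topic questions the bot answered anyway --
    # each one is a potential hallucination worth investigating.
    return [q for q in off_topic if refusal not in ask_fn(q)]
```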
Step 6: Add Chat History
A chatbot that forgets the previous message isn't useful:
class RAGChatbot:
    def __init__(self):
        self.history: list[dict] = []

    def chat(self, message: str) -> str:
        chunks = retrieve(message, top_k=5)
        context = "\n\n---\n\n".join(c["text"] for c in chunks)
        messages = [
            {"role": "system", "content":
                "Answer using only the provided context. "
                "If unsure, say so. Cite sources."}
        ]
        for h in self.history[-10:]:
            messages.append({"role": "user", "content": h["user"]})
            messages.append({"role": "assistant", "content": h["assistant"]})
        messages.append({"role": "user",
                         "content": f"Context:\n{context}\n\nQuestion: {message}"})
        response = llm.chat.completions.create(
            model="gpt-4o-mini", messages=messages, temperature=0.1
        )
        answer = response.choices[0].message.content
        self.history.append({"user": message, "assistant": answer})
        return answer
bot = RAGChatbot()
print(bot.chat("How do I set up SSO?"))
print(bot.chat("What identity providers does it support?"))
Evaluating RAG Quality
Building it is one thing. Knowing if it's good is another.
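For the retrieval side, a small harness gives you a number to track: hand-label 20-30 (question, expected source) pairs and measure how often the expected source lands in the top-k results. A sketch, where retrieve_fn is any function shaped like retrieve above:

```python
def retrieval_hit_rate(retrieve_fn, labelled: list[tuple[str, str]],
                       top_k: int = 5) -> float:
    # labelled: (question, path of the document that should answer it) pairs
    hits = sum(
        1 for question, expected in labelled
        if any(r["source"] == expected for r in retrieve_fn(question, top_k))
    )
    return hits / len(labelled)
```

Re-run it whenever you change chunk size, overlap, or embedding model, and you'll see whether the change actually helped.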
Retrieval quality: are the right chunks being retrieved? Manually check 20-30 questions. Are irrelevant chunks polluting the context?

Answer quality: does the bot use only information from the context (faithfulness)? Does it address the question (relevance)? Does it cover all relevant information (completeness)?

Common Pitfalls
Chunk size too large. 2000+ character chunks mean the retriever returns broad, unfocused context. Start with 500-800.

Wrong embedding model. General-purpose models on specialized domain content (legal, medical) produce poor retrieval. Consider domain-specific models.

Not handling "I don't know." Without explicit instructions and low temperature, LLMs confidently make things up. Test by asking questions not in your documents.

No chunk overlap. If an answer spans two chunks with no overlap, neither chunk alone provides the full answer. Always use 10-20% overlap.

Stuffing too much context. More is not always better. Start with top-5 results and increase only if needed.

Putting It All Together
Chain the steps above into a single script: index() loads documents, chunks them, embeds, and stores in ChromaDB. ask() embeds the query, retrieves top-5, and generates a grounded answer. Run indexing once, then query as many times as you want.
RAG is one of the most practical LLM applications. The implementation is simpler than it looks -- the hard parts are choosing the right chunk size, picking good embedding models, and testing to make sure answers are grounded in your documents rather than hallucinated.
For more AI tutorials and Python development guides, check out CodeUp.