Build a RAG Chatbot from Scratch: Step-by-Step Python Tutorial
Build a Retrieval Augmented Generation chatbot in Python from scratch. Covers document loading, chunking, embeddings, vector databases, retrieval chains, chat history, evaluation, and common pitfalls.
You have documents -- PDFs, markdown files, internal wikis -- and you want an AI chatbot that answers questions using those documents. Not hallucinated answers from training data, but grounded answers that reference your actual content. That's RAG: Retrieval Augmented Generation.
The concept is straightforward. When a user asks a question, you search your documents for relevant chunks, stuff those into the prompt as context, and let the LLM generate an answer based on that context. Let's build one from scratch.
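That loop can be sketched in a few lines, independent of any particular vector store or model. Here, search and generate are stand-ins for the retrieval and LLM calls we build in the steps below:

```python
def rag_answer(question: str, search, generate, top_k: int = 5) -> str:
    # search: question -> ranked list of relevant text chunks from your documents
    # generate: prompt -> answer string from the LLM
    context = "\n\n".join(search(question)[:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

Everything that follows is filling in real implementations of search and generate.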
Why RAG Beats Fine-Tuning
Fine-tuning modifies the model's weights. It's expensive, slow, requires ML expertise, and the model can still hallucinate. When your documents change, you fine-tune again. RAG keeps the model untouched and injects relevant context at query time: cheaper, faster to set up, easy to update, and the answers are grounded in actual source text.

For most use cases -- support bots, doc search, knowledge bases -- RAG is the right choice.
Setup
pip install openai chromadb langchain-text-splitters pypdf sentence-transformers
Step 1: Load Documents
from pathlib import Path
from pypdf import PdfReader
def load_documents(directory: str) -> list[dict]:
    documents = []
    for path in Path(directory).rglob("*"):
        if path.suffix == ".pdf":
            # extract_text() can return None for image-only pages
            text = "".join(p.extract_text() or "" for p in PdfReader(str(path)).pages)
        elif path.suffix in (".md", ".txt"):
            text = path.read_text(encoding="utf-8")
        else:
            continue
        documents.append({"text": text, "source": str(path)})
    return documents
docs = load_documents("./knowledge_base")
Step 2: Split into Chunks
This is more important than most tutorials let on. Chunks too large flood the LLM with irrelevant text. Too small and you lose context.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = []
for doc in docs:
    for i, text in enumerate(splitter.split_text(doc["text"])):
        chunks.append({"text": text, "source": doc["source"], "index": i})
| Chunk Size | Good For | Risk |
|---|---|---|
| 200-400 | FAQ-style docs, precise retrieval | Loses broader context |
| 500-1000 | General purpose | Good balance |
| 1000-2000 | Long-form technical docs | May include irrelevant content |
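To make overlap concrete, here's a minimal sliding-window splitter. This is a simplification -- RecursiveCharacterTextSplitter above prefers paragraph and sentence boundaries -- but the overlap arithmetic is the same idea:

```python
def sliding_chunks(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    # Advance by (size - overlap) so consecutive chunks share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

A fact that straddles a chunk boundary still appears intact in one of the two overlapping chunks, as long as it's shorter than the overlap.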
Step 3: Embed and Store in Vector DB
Embeddings convert text into vectors that capture semantic meaning. Similar texts end up close together in vector space.
from sentence_transformers import SentenceTransformer
import chromadb
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("knowledge_base")
texts = [c["text"] for c in chunks]
embeddings = embedding_model.encode(texts).tolist()
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=texts,
    metadatas=[{"source": c["source"]} for c in chunks],
)
We're using all-MiniLM-L6-v2 -- small, fast, free, runs locally. For production with higher accuracy, consider OpenAI's text-embedding-3-small.
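"Close together" usually means cosine similarity. A self-contained sketch of the metric -- with real embeddings you'd compare embedding_model.encode(...) vectors; the toy vectors below are just for illustration:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for embeddings of related vs. unrelated sentences:
related = cosine_sim([0.9, 0.1, 0.2], [0.8, 0.2, 0.3])
unrelated = cosine_sim([0.9, 0.1, 0.2], [0.1, 0.9, 0.1])
```

Note that Chroma reports distances rather than similarities, so in its query results lower means closer.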
Step 4: Retrieve Relevant Chunks
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_emb = embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    return [
        {"text": results["documents"][0][i],
         "source": results["metadatas"][0][i]["source"],
         "distance": results["distances"][0][i]}
        for i in range(len(results["documents"][0]))
    ]
Step 5: Generate Grounded Answers
from openai import OpenAI
llm = OpenAI()
def ask(question: str) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n---\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in chunks
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Answer using only the provided context. "
                "If the context doesn't contain the answer, say "
                "'I don't have enough information.' Cite sources."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
The system prompt is critical. Without explicit instructions to only use the provided context, the LLM will mix in training data. The "say I don't know" instruction prevents hallucination.
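One cheap way to verify the guardrail works: ask questions your documents can't answer and check for the refusal string. A sketch, where ask_fn is any function shaped like ask above and the refusal string must match whatever your system prompt dictates:

```python
def unguarded_questions(ask_fn, off_topic: list[str],
                        refusal: str = "I don't have enough information") -> list[str]:
    # Returns the off-topic questions the bot answered anyway --
    # each one is a potential hallucination worth investigating.
    return [q for q in off_topic if refusal not in ask_fn(q)]
```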
Step 6: Add Chat History
A chatbot that forgets the previous message isn't useful:
class RAGChatbot:
    def __init__(self):
        self.history: list[dict] = []

    def chat(self, message: str) -> str:
        chunks = retrieve(message, top_k=5)
        context = "\n\n---\n\n".join(c["text"] for c in chunks)
        messages = [
            {"role": "system", "content":
                "Answer using only the provided context. "
                "If unsure, say so. Cite sources."}
        ]
        for h in self.history[-10:]:
            messages.append({"role": "user", "content": h["user"]})
            messages.append({"role": "assistant", "content": h["assistant"]})
        messages.append({"role": "user",
                         "content": f"Context:\n{context}\n\nQuestion: {message}"})
        response = llm.chat.completions.create(
            model="gpt-4o-mini", messages=messages, temperature=0.1
        )
        answer = response.choices[0].message.content
        self.history.append({"user": message, "assistant": answer})
        return answer
bot = RAGChatbot()
print(bot.chat("How do I set up SSO?"))
print(bot.chat("What identity providers does it support?"))
Evaluating RAG Quality
Building it is one thing. Knowing if it's good is another.
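For the retrieval side, a small harness gives you a number to track: hand-label 20-30 (question, expected source) pairs and measure how often the expected source lands in the top-k results. A sketch, where retrieve_fn is any function shaped like retrieve above:

```python
def retrieval_hit_rate(retrieve_fn, labelled: list[tuple[str, str]],
                       top_k: int = 5) -> float:
    # labelled: (question, path of the document that should answer it) pairs
    hits = sum(
        1 for question, expected in labelled
        if any(r["source"] == expected for r in retrieve_fn(question, top_k))
    )
    return hits / len(labelled)
```

Re-run it whenever you change chunk size, overlap, or embedding model, and you'll see whether the change actually helped.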
Retrieval quality: are the right chunks being retrieved? Manually check 20-30 questions. Are irrelevant chunks polluting the context?

Answer quality: does the bot use only information from the context (faithfulness)? Does it address the question (relevance)? Does it cover all relevant information (completeness)?

Common Pitfalls
Chunk size too large. 2000+ character chunks mean the retriever returns broad, unfocused context. Start with 500-800.

Wrong embedding model. General-purpose models on specialized domain content (legal, medical) produce poor retrieval. Consider domain-specific models.

Not handling "I don't know." Without explicit instructions and low temperature, LLMs confidently make things up. Test by asking questions not in your documents.

No chunk overlap. If an answer spans two chunks with no overlap, neither chunk alone provides the full answer. Always use 10-20% overlap.

Stuffing too much context. More is not always better. Start with top-5 results and increase only if needed.

Putting It All Together
Chain the steps above into a single script: index() loads documents, chunks them, embeds, and stores in ChromaDB. ask() embeds the query, retrieves top-5, and generates a grounded answer. Run indexing once, then query as many times as you want.
RAG is one of the most practical LLM applications. The implementation is simpler than it looks -- the hard parts are choosing the right chunk size, picking good embedding models, and testing to make sure answers are grounded in your documents rather than hallucinated.
For more AI tutorials and Python development guides, check out CodeUp.