March 27, 2026 · 10 min read

Build a RAG App with LangChain and Python: AI That Knows Your Data

Learn how to build a Retrieval-Augmented Generation app with LangChain, embeddings, and vector stores so your AI can answer questions about your own documents.

rag langchain ai python tutorial

Large language models are impressive, but they have a glaring weakness: they only know what they were trained on. Ask ChatGPT about your company's internal documentation, your personal notes, or a PDF you downloaded yesterday, and it'll either hallucinate an answer or politely tell you it has no idea.

Retrieval-Augmented Generation (RAG) fixes this. Instead of fine-tuning a model on your data (expensive, slow, and overkill for most use cases), RAG retrieves relevant chunks of your documents at query time and feeds them into the LLM as context. The model generates answers grounded in your actual data.

LangChain makes building RAG apps surprisingly straightforward. By the end of this tutorial, you'll have a working "chat with your docs" application that can answer questions about any collection of documents you throw at it.

What RAG Actually Is

RAG has three steps:

  1. Indexing -- You take your documents, split them into chunks, convert those chunks into numerical representations (embeddings), and store them in a vector database.
  2. Retrieval -- When a user asks a question, you convert the question into an embedding, search the vector database for the most similar chunks, and pull them out.
  3. Generation -- You pass the retrieved chunks plus the user's question to an LLM, which generates an answer based on that context.

That's it. No model training, no GPU clusters, no weeks of waiting. You're essentially giving the LLM a cheat sheet before asking it a question.
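
The three steps can be sketched in a few lines of plain Python. The `embed`, `search`, and `generate` functions below are toy stand-ins for a real embedding model, vector store, and LLM, just to make the flow concrete:

```python
# Toy sketch of the RAG loop. The embed/search/generate helpers below
# are stand-ins for a real embedding model, vector store, and LLM.

def embed(text: str) -> list[float]:
    # Stand-in: a real system calls an embedding model here.
    # This toy version counts keyword hits so the example is runnable.
    vocab = ["refund", "shipping", "warranty"]
    return [float(text.lower().count(w)) for w in vocab]

def search(query_vec, index, k=2):
    # Retrieval: rank stored chunks by similarity to the query vector
    # (dot product here; real stores use cosine or approximate search).
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), chunk)
              for chunk, vec in index]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

def generate(question, context_chunks):
    # Stand-in for the LLM call: shows the prompt that would be sent.
    return f"Answer '{question}' using: {' | '.join(context_chunks)}"

# 1. Indexing: embed each chunk and store it
chunks = ["Refunds are issued within 14 days.", "Shipping takes 3-5 days."]
index = [(c, embed(c)) for c in chunks]

# 2. Retrieval: embed the question and find similar chunks
question = "What is the refund policy?"
top = search(embed(question), index, k=1)

# 3. Generation: hand question + chunks to the model
print(generate(question, top))
```

Everything later in this tutorial is this same loop with real components swapped in.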

Why Not Just Stuff Everything into the Prompt?

You might be thinking: "Why not just paste my entire document into the prompt?" Two reasons:

  • Context window limits. Even models with 128K token windows can't handle a 500-page PDF. And even if they could, performance degrades with longer contexts.
  • Cost. You pay per token. Sending your entire knowledge base with every query gets expensive fast.

RAG is surgical. It finds the 3-5 most relevant paragraphs and sends only those. Cheaper, faster, more accurate.
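
A quick back-of-envelope calculation makes the cost difference concrete. The per-token price and tokens-per-page figures below are illustrative assumptions, not current OpenAI pricing:

```python
# Back-of-envelope cost comparison. The per-token price and page size
# are illustrative assumptions, not current OpenAI pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.0025   # hypothetical $/1K input tokens
TOKENS_PER_PAGE = 500                # rough average for dense text

pages = 500
full_doc_tokens = pages * TOKENS_PER_PAGE  # send the whole document
rag_tokens = 4 * 250                       # 4 chunks of ~250 tokens each

cost_full = full_doc_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
cost_rag = rag_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(f"Full document per query: ${cost_full:.4f}")    # $0.6250
print(f"RAG (4 chunks) per query: ${cost_rag:.4f}")    # $0.0025
print(f"Savings factor: {cost_full / cost_rag:.0f}x")  # 250x
```

Whatever the exact prices, the ratio is what matters: sending 1,000 tokens instead of 250,000 is a 250x reduction on every single query.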

Setting Up the Project

Create a new project and install dependencies:

mkdir rag-langchain-app
cd rag-langchain-app
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Install the packages:

pip install langchain langchain-openai langchain-community chromadb pypdf python-dotenv

Here's what each does:


  • langchain -- The orchestration framework

  • langchain-openai -- OpenAI integration (embeddings + LLM)

  • langchain-community -- Community document loaders and tools

  • chromadb -- Local vector store (no server needed)

  • pypdf -- PDF parsing

  • python-dotenv -- Environment variable management


Create a .env file for your OpenAI API key:

OPENAI_API_KEY=sk-your-key-here

Loading Documents

LangChain has loaders for practically every document format. Let's start with PDFs, since that's the most common use case:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader

# Load a single PDF
loader = PyPDFLoader("docs/my-document.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages")
print(documents[0].page_content[:200])

Each loaded document has two attributes: page_content (the text) and metadata (source file, page number, etc.).

To load an entire directory of documents:

# Load all PDFs from a directory
loader = DirectoryLoader(
    "docs/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
documents = loader.load()

You can mix formats too. Load PDFs, text files, markdown -- whatever you have:

from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Load markdown files
md_loader = DirectoryLoader(
    "docs/",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader
)

# Combine with the PDF documents loaded earlier
all_docs = documents + md_loader.load()

Splitting Text into Chunks

Raw documents are usually too long to embed as-is. You need to split them into smaller chunks. This is where a lot of RAG apps get it wrong.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")

The RecursiveCharacterTextSplitter is the go-to choice. It tries to split on paragraph breaks first, then sentences, then words. The chunk_overlap parameter ensures that context isn't lost at chunk boundaries -- if a relevant sentence spans two chunks, the overlap catches it.

Chunk size matters. Too small (100 tokens) and you lose context. Too large (5000 tokens) and you retrieve irrelevant noise along with the good stuff. Start with 1000 characters and adjust based on your results.
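
To see what `chunk_size` and `chunk_overlap` actually do, here's a deliberately naive fixed-width splitter. (`RecursiveCharacterTextSplitter` is smarter: it prefers paragraph and sentence boundaries before falling back to hard cuts like this.)

```python
# Toy fixed-width splitter to illustrate chunk_size and chunk_overlap.
# RecursiveCharacterTextSplitter prefers natural boundaries; this toy
# version just cuts at fixed positions.
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
for c in naive_split(text, chunk_size=10, chunk_overlap=4):
    print(c)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Notice how the last 4 characters of each chunk repeat at the start of the next one. That repetition is the overlap that keeps sentences spanning a boundary retrievable.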

Creating Embeddings and the Vector Store

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors. This is what makes retrieval work.
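
"Similar vectors" is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means they point the same way. A toy illustration with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors for illustration only.
refund_policy = [0.9, 0.1, 0.0]
money_back    = [0.8, 0.2, 0.1]
gpu_drivers   = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, money_back))   # high: related topics
print(cosine_similarity(refund_policy, gpu_drivers))  # low: unrelated
```

The vector store does exactly this comparison (or an approximate version of it) between your question's embedding and every stored chunk.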

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store from chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print(f"Created vector store with {vectorstore._collection.count()} documents")

Chroma stores everything locally in ./chroma_db. No external database needed. For production, you'd consider Pinecone, Weaviate, or pgvector, but Chroma is perfect for development and small-to-medium datasets.

To load an existing vector store later:

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Building the Retrieval Chain

Now the fun part. Let's wire up retrieval and generation:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 chunks
)

# Custom prompt template
prompt_template = """Use the following pieces of context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information to answer that."
Don't try to make up an answer.

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Build the chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

The chain_type="stuff" means "stuff all retrieved documents into the prompt." For most cases, this works fine. LangChain also supports map_reduce and refine strategies for handling more retrieved documents than fit in one prompt.
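
Conceptually, "stuff" is just string concatenation plus template filling. A rough sketch (not LangChain's internal code, just the idea):

```python
# Roughly what chain_type="stuff" does: concatenate every retrieved
# chunk into one context string and fill the prompt template once.
def stuff_prompt(template: str, docs: list[str], question: str) -> str:
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)

template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
docs = [
    "Refunds are issued within 14 days.",
    "Contact support to start a refund.",
]
print(stuff_prompt(template, docs, "What is the refund policy?"))
```

By contrast, map_reduce answers against each document separately and then combines those answers, and refine iteratively updates one answer as it walks through the documents.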

Querying Your Documents

# Ask a question
result = qa_chain.invoke({"query": "What is the company's refund policy?"})

print("Answer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f" - {doc.metadata.get('source', 'Unknown')} (page {doc.metadata.get('page', 'N/A')})")

That's a working RAG app. Ask questions, get answers grounded in your documents, with source citations.

Building the Chat Interface

Let's turn this into an interactive command-line app with conversation memory:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True
)

def chat():
    print("Chat with your documents! Type 'quit' to exit.\n")

    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue

        result = chat_chain.invoke({"question": question})
        print(f"\nAssistant: {result['answer']}\n")

        # Show sources
        sources = set()
        for doc in result["source_documents"]:
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page", "")
            sources.add(f"{source}" + (f" (p.{page})" if page else ""))

        if sources:
            print(f"Sources: {', '.join(sources)}\n")

if __name__ == "__main__":
    chat()

The conversation memory means follow-up questions work naturally. "What's the refund policy?" followed by "How long does it take?" -- the chain remembers you're still talking about refunds.
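
Under the hood, ConversationalRetrievalChain first rewrites the follow-up into a standalone question using the chat history, then retrieves with that rewritten question. A sketch of the kind of condensing prompt involved (the exact wording LangChain uses may differ):

```python
# Sketch of the question-condensing step: the follow-up plus chat
# history is rewritten into a standalone question before retrieval.
condense_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}

Follow Up Input: {question}
Standalone question:"""

history = "Human: What's the refund policy?\nAssistant: Refunds are issued within 14 days."
followup = "How long does it take?"
print(condense_template.format(chat_history=history, question=followup))
```

The LLM's rewrite (something like "How long does a refund take?") is what actually hits the vector store, which is why bare pronouns in follow-ups still retrieve the right chunks.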

Putting It All Together

Here's the complete application in one file:

# app.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

load_dotenv()

DOCS_DIR = "docs"
CHROMA_DIR = "./chroma_db"

def load_and_index():
    """Load documents and create vector store."""
    print("Loading documents...")
    loader = DirectoryLoader(DOCS_DIR, glob="**/*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    print(f"Loaded {len(documents)} pages")

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_DIR)
    print(f"Indexed {vectorstore._collection.count()} chunks")
    return vectorstore

def get_chain(vectorstore):
    """Create conversational retrieval chain."""
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    memory = ConversationBufferMemory(
        memory_key="chat_history", return_messages=True, output_key="answer"
    )
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        memory=memory,
        return_source_documents=True,
    )

if __name__ == "__main__":
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    if os.path.exists(CHROMA_DIR):
        print("Loading existing index...")
        vs = Chroma(persist_directory=CHROMA_DIR, embedding_function=embeddings)
    else:
        vs = load_and_index()

    chain = get_chain(vs)
    print("\nChat with your documents. Type 'quit' to exit.\n")

    while True:
        q = input("You: ").strip()
        if q.lower() in ("quit", "exit"):
            break
        result = chain.invoke({"question": q})
        print(f"\nAssistant: {result['answer']}\n")

Evaluating Your RAG Pipeline

Building a RAG app is easy. Building a good one takes iteration. Here's how to evaluate:

Retrieval quality -- Are the right chunks being retrieved? Log the source documents and manually check:

# Test retrieval without generation
docs = retriever.invoke("What is the return policy?")
for i, doc in enumerate(docs):
    print(f"\n--- Chunk {i+1} ---")
    print(doc.page_content[:300])
    print(f"Source: {doc.metadata}")

Answer quality -- Is the model using the context correctly, or ignoring it? Compare the retrieved chunks against the generated answer. Experiment with these knobs:
  • chunk_size -- Try 500, 1000, 1500
  • chunk_overlap -- Try 100, 200, 300
  • k (number of retrieved chunks) -- Try 3, 5, 8
  • Embedding model -- text-embedding-3-small vs text-embedding-3-large
  • Search type -- similarity vs mmr (maximal marginal relevance, which adds diversity)

# MMR retrieval reduces redundancy in results
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 10}
)

Common Pitfalls

Garbage in, garbage out. If your PDFs have messy formatting, tables that don't parse well, or scanned images without OCR, your chunks will be low quality. Consider preprocessing your documents before loading.

Chunk boundaries cut context. A relevant paragraph split across two chunks might not get fully retrieved. Increase chunk_overlap or try semantic chunking (splitting by topic instead of character count).

The "I don't know" problem. LLMs default to being helpful, which means they'll sometimes fabricate answers even when the context doesn't contain the information. Your prompt engineering matters -- explicitly tell the model to admit when it doesn't have enough information.

Embedding model mismatch. Always use the same embedding model for indexing and querying. If you re-index with a different model, you need to rebuild the entire vector store.

Too many or too few chunks. Retrieving 2 chunks might miss critical context. Retrieving 20 floods the prompt with noise. Start with 4 and adjust based on your specific documents.

What's Next

Once you have the basics working, there's a lot you can build on:

  • Hybrid search -- Combine vector similarity with keyword search (BM25) for better retrieval
  • Metadata filtering -- Filter by document type, date, or category before similarity search
  • Streaming responses -- Use LangChain's streaming callbacks for real-time output
  • Web UI -- Add a Streamlit or Gradio frontend
  • Multiple document types -- Add support for Word docs, CSVs, web pages
  • Self-querying retriever -- Let the LLM generate its own metadata filters
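
To give a taste of the first item: hybrid search typically merges a keyword ranking (e.g. BM25) with a vector-similarity ranking. One common merge strategy is reciprocal rank fusion (RRF), sketched here in plain Python; LangChain's EnsembleRetriever applies the same idea to real retrievers:

```python
# Reciprocal rank fusion (RRF): merge a keyword ranking (e.g. BM25)
# with a vector-similarity ranking into one hybrid ordering. Each
# document scores 1/(k + rank) in every list it appears in.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_c", "doc_b"]   # BM25 order
vector_hits  = ["doc_b", "doc_a", "doc_d"]   # embedding-similarity order

print(rrf([keyword_hits, vector_hits]))
# doc_a ranks first: it appears near the top of both lists.
```

Documents that rank well in both lists float to the top, which is exactly why hybrid search catches queries that pure vector similarity misses (exact part numbers, rare names, jargon).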

RAG is one of the most practical AI patterns you can learn. It bridges the gap between powerful language models and your private data without the complexity of fine-tuning. Start with a small collection of documents, get the pipeline working, then iterate on quality.

The code and concepts here will scale from a personal knowledge base to a production document Q&A system. The hard part isn't the code -- it's curating your data and tuning retrieval to get consistently good answers.

For more AI and Python tutorials, check out CodeUp.
