Build a RAG App with LangChain and Python: AI That Knows Your Data
Learn how to build a Retrieval-Augmented Generation app with LangChain, embeddings, and vector stores so your AI can answer questions about your own documents.
Large language models are impressive, but they have a glaring weakness: they only know what they were trained on. Ask ChatGPT about your company's internal documentation, your personal notes, or a PDF you downloaded yesterday, and it'll either hallucinate an answer or politely tell you it has no idea.
Retrieval-Augmented Generation (RAG) fixes this. Instead of fine-tuning a model on your data (expensive, slow, and overkill for most use cases), RAG retrieves relevant chunks of your documents at query time and feeds them into the LLM as context. The model generates answers grounded in your actual data.
LangChain makes building RAG apps surprisingly straightforward. By the end of this tutorial, you'll have a working "chat with your docs" application that can answer questions about any collection of documents you throw at it.
What RAG Actually Is
RAG has three steps:
- Indexing -- You take your documents, split them into chunks, convert those chunks into numerical representations (embeddings), and store them in a vector database.
- Retrieval -- When a user asks a question, you convert the question into an embedding, search the vector database for the most similar chunks, and pull them out.
- Generation -- You pass the retrieved chunks plus the user's question to an LLM, which generates an answer based on that context.
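The three steps above can be made concrete with a toy sketch. This is purely illustrative: it uses word overlap instead of real embeddings, and `index`, `retrieve`, and `build_prompt` are names invented here, not LangChain APIs.

```python
# Toy sketch of the three RAG steps, using word overlap instead of
# real embeddings -- just to make the data flow concrete.

def index(docs):
    # "Indexing": store each chunk alongside its set of words.
    return [(doc, set(doc.lower().split())) for doc in docs]

def retrieve(store, question, k=2):
    # "Retrieval": rank chunks by how many question words they share.
    q_words = set(question.lower().split())
    ranked = sorted(store, key=lambda item: len(item[1] & q_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(context_chunks, question):
    # "Generation": in a real app, this prompt goes to the LLM.
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

store = index([
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3-5 business days.",
])
question = "How do refunds work?"
print(build_prompt(retrieve(store, question, k=1), question))
```

Real RAG swaps the word-overlap scoring for embedding similarity, but the shape of the pipeline is exactly this.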
Why Not Just Stuff Everything into the Prompt?
You might be thinking: "Why not just paste my entire document into the prompt?" Two reasons:
- Context window limits. Even models with 128K token windows can't handle a 500-page PDF. And even if they could, performance degrades with longer contexts.
- Cost. You pay per token. Sending your entire knowledge base with every query gets expensive fast.
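A back-of-the-envelope calculation shows how stark the cost difference is. The ~4 characters/token heuristic is rough, and the per-token price below is a made-up illustrative number -- check your model's actual pricing.

```python
# Rough cost comparison: sending the whole knowledge base with every
# query vs. sending only the top retrieved chunks.

def estimate_tokens(text_chars):
    return text_chars // 4  # very rough ~4 chars/token heuristic

PRICE_PER_1K_INPUT_TOKENS = 0.005  # hypothetical price for illustration

knowledge_base_chars = 2_000_000   # roughly a 500-page document
chunk_chars = 1_000                # one chunk
retrieved_chunks = 4

full_cost = estimate_tokens(knowledge_base_chars) / 1000 * PRICE_PER_1K_INPUT_TOKENS
rag_cost = estimate_tokens(chunk_chars * retrieved_chunks) / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(f"Whole base per query: ~${full_cost:.2f}")
print(f"RAG (4 chunks):       ~${rag_cost:.4f}")
```

Hundreds of times cheaper per query, before you even hit the context-window ceiling.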
Setting Up the Project
Create a new project and install dependencies:
mkdir rag-langchain-app
cd rag-langchain-app
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
Install the packages:
pip install langchain langchain-openai langchain-community chromadb pypdf python-dotenv
Here's what each does:
- langchain -- The orchestration framework
- langchain-openai -- OpenAI integration (embeddings + LLM)
- langchain-community -- Community document loaders and tools
- chromadb -- Local vector store (no server needed)
- pypdf -- PDF parsing
- python-dotenv -- Environment variable management
Create a .env file for your OpenAI API key:
OPENAI_API_KEY=sk-your-key-here
Loading Documents
LangChain has loaders for practically every document format. Let's start with PDFs, since that's the most common use case:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
# Load a single PDF
loader = PyPDFLoader("docs/my-document.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
print(documents[0].page_content[:200])
Each loaded document has two attributes: page_content (the text) and metadata (source file, page number, etc.).
To load an entire directory of documents:
# Load all PDFs from a directory
loader = DirectoryLoader(
"docs/",
glob="**/*.pdf",
loader_cls=PyPDFLoader,
show_progress=True
)
documents = loader.load()
You can mix formats too. Load PDFs, text files, markdown -- whatever you have:
from langchain_community.document_loaders import UnstructuredMarkdownLoader
# Load markdown files
md_loader = DirectoryLoader(
"docs/",
glob="**/*.md",
loader_cls=UnstructuredMarkdownLoader
)
# Combine with the PDF documents loaded earlier
all_docs = documents + md_loader.load()
Splitting Text into Chunks
Raw documents are usually too long to embed as-is. You need to split them into smaller chunks. This is where a lot of RAG apps get it wrong.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")
The RecursiveCharacterTextSplitter is the go-to choice. It tries to split on paragraph breaks first, then sentences, then words. The chunk_overlap parameter ensures that context isn't lost at chunk boundaries -- if a relevant sentence spans two chunks, the overlap catches it.
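To see what overlap buys you, here is a deliberately simplified sliding-window splitter. The real RecursiveCharacterTextSplitter is smarter (it prefers paragraph and sentence boundaries), but the overlap mechanics are the same.

```python
# Simplified illustration of chunk overlap with a plain sliding window.

def sliding_chunks(text, chunk_size, chunk_overlap):
    # Each window starts chunk_size - chunk_overlap characters after
    # the previous one, so consecutive chunks share an overlap region.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 3  # 30 characters
for chunk in sliding_chunks(text, chunk_size=10, chunk_overlap=4):
    print(chunk)
# Each chunk repeats the last 4 characters of the previous one, so a
# sentence straddling a boundary still appears whole in some chunk.
```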
Creating Embeddings and the Vector Store
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors. This is what makes retrieval work.
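A toy example makes "similar texts produce similar vectors" concrete. The 3-dimensional vectors below are invented for illustration; real embedding models output vectors with a thousand or more dimensions, but the comparison math (cosine similarity) is the same.

```python
# Cosine similarity: the standard way to compare embedding vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

refund_policy = [0.9, 0.1, 0.2]  # pretend embedding of "refund policy"
money_back = [0.8, 0.2, 0.1]     # pretend embedding of "money-back guarantee"
shipping = [0.1, 0.9, 0.3]       # pretend embedding of "shipping times"

print(cosine_similarity(refund_policy, money_back))  # high: related meaning
print(cosine_similarity(refund_policy, shipping))    # lower: different topic
```

Retrieval is just this comparison run against every chunk in the store, returning the highest-scoring ones.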
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv
load_dotenv()
# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vector store from chunks
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Created vector store with {vectorstore._collection.count()} documents")
Chroma stores everything locally in ./chroma_db. No external database needed. For production, you'd consider Pinecone, Weaviate, or pgvector, but Chroma is perfect for development and small-to-medium datasets.
To load an existing vector store later:
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings
)
Building the Retrieval Chain
Now the fun part. Let's wire up retrieval and generation:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Create a retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # Return top 4 chunks
)
# Custom prompt template
prompt_template = """Use the following pieces of context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information to answer that."
Don't try to make up an answer.
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
# Build the chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT}
)
The chain_type="stuff" means "stuff all retrieved documents into the prompt." For most cases, this works fine. LangChain also supports map_reduce and refine strategies for handling more retrieved documents than fit in one prompt.
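Conceptually, "stuffing" is nothing more than concatenating the retrieved chunks and filling the prompt template once. This sketch uses invented names, not LangChain internals:

```python
# Rough sketch of what chain_type="stuff" does: join every retrieved
# chunk into one context block and format the prompt a single time.

def stuff_documents(docs, question, template):
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)

template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
docs = ["Refunds take 14 days.", "Contact support by email."]
print(stuff_documents(docs, "How long do refunds take?", template))
```

map_reduce instead answers against each chunk separately and then combines the answers, and refine iteratively updates one answer chunk by chunk -- both useful when the retrieved text won't fit in a single prompt.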
Querying Your Documents
# Ask a question
result = qa_chain.invoke({"query": "What is the company's refund policy?"})
print("Answer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
print(f" - {doc.metadata.get('source', 'Unknown')} (page {doc.metadata.get('page', 'N/A')})")
That's a working RAG app. Ask questions, get answers grounded in your documents, with source citations.
Building the Chat Interface
Let's turn this into an interactive command-line app with conversation memory:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
memory = ConversationBufferMemory(
memory_key="chat_history",
return_messages=True,
output_key="answer"
)
chat_chain = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=retriever,
memory=memory,
return_source_documents=True
)
def chat():
print("Chat with your documents! Type 'quit' to exit.\n")
while True:
question = input("You: ").strip()
if question.lower() in ("quit", "exit", "q"):
break
if not question:
continue
result = chat_chain.invoke({"question": question})
print(f"\nAssistant: {result['answer']}\n")
# Show sources
sources = set()
for doc in result["source_documents"]:
source = doc.metadata.get("source", "Unknown")
page = doc.metadata.get("page", "")
sources.add(f"{source}" + (f" (p.{page})" if page else ""))
if sources:
print(f"Sources: {', '.join(sources)}\n")
if __name__ == "__main__":
chat()
The conversation memory means follow-up questions work naturally. "What's the refund policy?" followed by "How long does it take?" -- the chain remembers you're still talking about refunds.
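Under the hood, the chain first asks the LLM to rewrite the follow-up as a standalone question using the chat history, then retrieves with that rewritten question. The prompt below is a paraphrase of that idea, not LangChain's exact template:

```python
# Sketch of the "condense question" step a conversational retrieval
# chain performs before retrieval. Illustrative wording only.

CONDENSE_PROMPT = """Given the conversation and a follow-up question, \
rephrase the follow-up as a standalone question.

Chat history:
{chat_history}

Follow-up: {question}
Standalone question:"""

history = "Human: What's the refund policy?\nAssistant: Refunds within 14 days."
print(CONDENSE_PROMPT.format(chat_history=history, question="How long does it take?"))
```

The LLM would turn "How long does it take?" into something like "How long do refunds take?", which is what actually gets embedded and searched.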
Putting It All Together
Here's the complete application in one file:
# app.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
load_dotenv()
DOCS_DIR = "docs"
CHROMA_DIR = "./chroma_db"
def load_and_index():
"""Load documents and create vector store."""
print("Loading documents...")
loader = DirectoryLoader(DOCS_DIR, glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_DIR)
print(f"Indexed {vectorstore._collection.count()} chunks")
return vectorstore
def get_chain(vectorstore):
"""Create conversational retrieval chain."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
memory = ConversationBufferMemory(
memory_key="chat_history", return_messages=True, output_key="answer"
)
return ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
memory=memory,
return_source_documents=True,
)
if __name__ == "__main__":
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
if os.path.exists(CHROMA_DIR):
print("Loading existing index...")
vs = Chroma(persist_directory=CHROMA_DIR, embedding_function=embeddings)
else:
vs = load_and_index()
chain = get_chain(vs)
print("\nChat with your documents. Type 'quit' to exit.\n")
while True:
q = input("You: ").strip()
if q.lower() in ("quit", "exit"):
break
result = chain.invoke({"question": q})
print(f"\nAssistant: {result['answer']}\n")
Evaluating Your RAG Pipeline
Building a RAG app is easy. Building a good one takes iteration. Here's how to evaluate:
Retrieval quality -- Are the right chunks being retrieved? Log the source documents and manually check:
# Test retrieval without generation
docs = retriever.invoke("What is the return policy?")
for i, doc in enumerate(docs):
print(f"\n--- Chunk {i+1} ---")
print(doc.page_content[:300])
print(f"Source: {doc.metadata}")
Answer quality -- Is the model using the context correctly, or ignoring it? Compare the retrieved chunks against the generated answer.
Experiment with these knobs:
- chunk_size -- Try 500, 1000, 1500
- chunk_overlap -- Try 100, 200, 300
- k (number of retrieved chunks) -- Try 3, 5, 8
- Embedding model -- text-embedding-3-small vs text-embedding-3-large
- Search type -- similarity vs mmr (maximal marginal relevance, which adds diversity)
# MMR retrieval reduces redundancy in results
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 4, "fetch_k": 10}
)
Common Pitfalls
Garbage in, garbage out. If your PDFs have messy formatting, tables that don't parse well, or scanned images without OCR, your chunks will be low quality. Consider preprocessing your documents before loading.
Chunk boundaries cut context. A relevant paragraph split across two chunks might not get fully retrieved. Increase chunk_overlap or try semantic chunking (splitting by topic instead of character count).
The "I don't know" problem. LLMs default to being helpful, which means they'll sometimes fabricate answers even when the context doesn't contain the information. Your prompt engineering matters -- explicitly tell the model to admit when it doesn't have enough information.
Embedding model mismatch. Always use the same embedding model for indexing and querying. If you re-index with a different model, you need to rebuild the entire vector store.
Too many or too few chunks. Retrieving 2 chunks might miss critical context. Retrieving 20 floods the prompt with noise. Start with 4 and adjust based on your specific documents.
What's Next
Once you have the basics working, there's a lot you can build on:
- Hybrid search -- Combine vector similarity with keyword search (BM25) for better retrieval
- Metadata filtering -- Filter by document type, date, or category before similarity search
- Streaming responses -- Use LangChain's streaming callbacks for real-time output
- Web UI -- Add a Streamlit or Gradio frontend
- Multiple document types -- Add support for Word docs, CSVs, web pages
- Self-querying retriever -- Let the LLM generate its own metadata filters
The code and concepts here will scale from a personal knowledge base to a production document Q&A system. The hard part isn't the code -- it's curating your data and tuning retrieval to get consistently good answers.
For more AI and Python tutorials, check out CodeUp.