How to Build an AI Agent from Scratch in Python
Build a working AI agent from scratch in Python without frameworks. Covers the core agent loop, tool definitions, memory, error handling, and when to reach for LangChain or CrewAI instead.
An AI agent is not some mystical autonomous entity. It is an LLM in a loop with access to tools. That's it. The model receives a task, decides which tool to call, gets the result, and decides what to do next. Repeat until the task is done.
Here's the thing -- most tutorials overcomplicate this. They start with frameworks, abstractions, and architecture diagrams. But the core pattern is around 50 lines of Python. Once you understand that, frameworks make a lot more sense.
Let's build one from scratch.
The Core Loop
Every AI agent follows the same pattern:
1. Observe — receive input (user message + tool results)
2. Think — the LLM decides what to do next
3. Act — call a tool or return a final answer
4. Repeat — feed the tool result back and go to step 1
That's the entire architecture. Everything else -- memory, planning, multi-agent coordination -- is built on top of this loop.
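Stripped of all API details, the loop can be sketched in a few lines. (The `model` and `call_tool` callables below are stand-ins for whatever LLM client and tool dispatcher you wire in later, not real APIs.)

```python
def agent_loop(task, model, call_tool, max_steps=10):
    """Minimal observe-think-act loop: the model either requests a tool or answers."""
    history = [task]                     # observations accumulate here
    for _ in range(max_steps):
        decision = model(history)        # think: model picks the next action
        if decision["type"] == "final":  # act: done, return the answer
            return decision["answer"]
        result = call_tool(decision["tool"], decision["args"])  # act: run the tool
        history.append(result)           # observe: feed the result back, repeat
    return "Step limit reached"
```

Everything that follows is this skeleton with a real model plugged in.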
Setting Up
We'll use the Anthropic Python SDK, but the pattern works identically with OpenAI.
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
Defining Tools
Tools are just Python functions with a specific JSON schema so the model knows what they do and what arguments they accept:
import json
import httpx
from datetime import datetime

# The actual tool functions
def get_weather(city: str) -> str:
    """Fetch current weather for a city."""
    response = httpx.get(
        "https://api.weatherapi.com/v1/current.json",
        params={"key": "YOUR_API_KEY", "q": city},
    )
    data = response.json()
    current = data["current"]
    return f"{current['temp_f']}°F, {current['condition']['text']}"

def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        return "Error: only numeric math expressions allowed"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

def search_files(query: str) -> str:
    """Search local files for a query string."""
    import subprocess
    result = subprocess.run(
        ["grep", "-r", "-l", query, "./docs"],
        capture_output=True, text=True, timeout=5,
    )
    files = result.stdout.strip().split("\n")
    return f"Found in: {', '.join(files)}" if files[0] else "No matches found"

# Tool registry -- maps names to functions
TOOLS = {
    "get_weather": get_weather,
    "calculate": calculate,
    "search_files": search_files,
}
# Tool schemas for the API (this is what the model reads)
TOOL_SCHEMAS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city. Use this when the user asks about weather conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo' or 'New York'"}
            },
            "required": ["city"],
        },
    },
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. Use Python math syntax.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression, e.g. '15.5 * 0.2'"}
            },
            "required": ["expression"],
        },
    },
    {
        "name": "search_files",
        "description": "Search local documentation files for a query string. Returns matching file names.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search term to find in files"}
            },
            "required": ["query"],
        },
    },
]
The descriptions matter more than you'd think. The model uses them to decide which tool to call. Vague descriptions lead to wrong tool selections.
The Agent Loop
Here's the complete agent. This is the core of the whole thing:
import anthropic

client = anthropic.Anthropic()

def run_agent(user_message: str, system_prompt: str = "You are a helpful assistant."):
    messages = [{"role": "user", "content": user_message}]

    while True:
        # Step 1: Send messages to the model with available tools
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=TOOL_SCHEMAS,
            messages=messages,
        )

        # Step 2: Check if the model wants to use a tool
        if response.stop_reason == "tool_use":
            # Collect the assistant's response (may include text + tool calls)
            assistant_content = response.content
            messages.append({"role": "assistant", "content": assistant_content})

            # Execute each tool call
            tool_results = []
            for block in assistant_content:
                if block.type == "tool_use":
                    tool_name = block.name
                    tool_input = block.input
                    print(f"  [Tool] {tool_name}({json.dumps(tool_input)})")

                    # Run the actual function
                    if tool_name in TOOLS:
                        result = TOOLS[tool_name](**tool_input)
                    else:
                        result = f"Error: unknown tool '{tool_name}'"
                    print(f"  [Result] {result}")

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            # Feed results back to the model
            messages.append({"role": "user", "content": tool_results})
        else:
            # Step 3: Model is done -- extract the final text response
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            return final_text

# Try it
answer = run_agent("What's the weather in Tokyo, and what's a 20% tip on $67.50?")
print(answer)
Run this and you'll see the agent call get_weather for Tokyo, then calculate for the tip, then compose a natural language answer from both results. Two tool calls, one coherent response.
Adding Memory
The agent above is stateless -- each call starts fresh. To maintain conversation history across turns, wrap it with a simple memory layer:
class Agent:
    def __init__(self, system_prompt: str, tools: list, tool_map: dict):
        self.client = anthropic.Anthropic()
        self.system_prompt = system_prompt
        self.tools = tools
        self.tool_map = tool_map
        self.conversation_history = []

    def chat(self, user_message: str) -> str:
        self.conversation_history.append(
            {"role": "user", "content": user_message}
        )
        while True:
            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=self.conversation_history,
            )
            if response.stop_reason == "tool_use":
                self.conversation_history.append(
                    {"role": "assistant", "content": response.content}
                )
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        fn = self.tool_map.get(block.name)
                        result = fn(**block.input) if fn else "Unknown tool"
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result,
                        })
                self.conversation_history.append(
                    {"role": "user", "content": tool_results}
                )
            else:
                text = "".join(
                    b.text for b in response.content if hasattr(b, "text")
                )
                self.conversation_history.append(
                    {"role": "assistant", "content": text}
                )
                return text

agent = Agent(
    system_prompt="You are a helpful assistant with access to tools.",
    tools=TOOL_SCHEMAS,
    tool_map=TOOLS,
)

print(agent.chat("What's the weather in London?"))
print(agent.chat("How about New York?"))  # Remembers we're talking about weather
print(agent.chat("Compare the two."))     # Knows which two cities we mean
The conversation history grows with each turn. For production, you'll want to trim older messages once you approach the model's context window limit.
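One simple trimming strategy is to drop the oldest messages (keeping the original task) once the history gets too big. The sketch below estimates size in characters rather than real tokens, and it ignores one important constraint: with tool-using conversations you must drop `tool_use` and `tool_result` messages as a pair, or the API will reject the history.

```python
def trim_history(history: list[dict], max_chars: int = 40000) -> list[dict]:
    """Drop the oldest messages until the (rough) size fits.

    Always keeps history[0], the original task, so the agent doesn't
    forget what it was asked to do.
    """
    def size(msgs):
        return sum(len(str(m["content"])) for m in msgs)

    trimmed = list(history)
    while len(trimmed) > 2 and size(trimmed) > max_chars:
        del trimmed[1]  # keep trimmed[0] and the most recent context
    return trimmed
```

Call it on `self.conversation_history` before each API request. A character budget of roughly 4x your token budget is a common rule of thumb for English text.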
Error Handling
Agents in the wild hit errors constantly -- API timeouts, bad tool inputs, hallucinated tool names. You need to handle all of it:
import time

MAX_ITERATIONS = 15
MAX_EXECUTION_TIME = 120  # seconds

def run_agent_safe(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    iterations = 0
    start_time = time.time()

    while True:
        iterations += 1
        elapsed = time.time() - start_time

        if iterations > MAX_ITERATIONS:
            return "I hit my iteration limit. Here's what I found so far..."
        if elapsed > MAX_EXECUTION_TIME:
            return "This is taking too long. Let me summarize what I have..."

        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                tools=TOOL_SCHEMAS,
                messages=messages,
            )
        except anthropic.RateLimitError:
            time.sleep(2)
            continue
        except anthropic.APIError as e:
            return f"API error: {e}. Please try again."

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        fn = TOOLS.get(block.name)
                        if fn is None:
                            result = f"Tool '{block.name}' does not exist."
                        else:
                            result = fn(**block.input)
                    except Exception as e:
                        result = f"Tool error: {type(e).__name__}: {e}"
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            return "".join(
                b.text for b in response.content if hasattr(b, "text")
            )
The key insight: when a tool fails, feed the error message back to the model. It will usually adapt -- try different parameters, pick a different tool, or explain to the user why the task couldn't be completed.
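The fixed `time.sleep(2)` on rate limits works, but jittered exponential backoff is more robust under sustained load. A generic retry helper might look like this (a sketch: in real code you would catch only transient errors such as `anthropic.RateLimitError`, not bare `Exception`):

```python
import random
import time

def with_backoff(fn, retries=5, base=1.0, cap=30.0):
    """Call fn(), retrying on exception with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            # Full jitter: sleep a random fraction of the doubling delay
            delay = min(cap, base * 2 ** attempt) * random.random()
            time.sleep(delay)
```

You would then wrap the API call as `with_backoff(lambda: client.messages.create(...))` instead of handling the retry inline.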
When to Use a Framework
Now that you've built an agent from scratch, here's when to reach for frameworks:
Roll your own when:
- You need fewer than 5 tools
- You want full control over the loop
- You're embedding it in an existing app
- You don't want the dependency overhead

Reach for LangChain (or similar) when:
- You need RAG (retrieval-augmented generation)
- You want pre-built tool integrations (Google Search, SQL, etc.)
- You need complex chains with branching logic

Reach for CrewAI (or another multi-agent framework) when:
- You need multiple specialized agents collaborating
- You have a pipeline (research then analyze then write)
- Role-based task delegation fits your use case
Common Pitfalls
Infinite loops. The model decides to keep calling tools forever. Always set a max_iterations cap. I've seen agents burn through $20 in API calls chasing a rabbit hole.
Token limits. Each iteration appends to the conversation. After 10+ tool calls, you can hit the context window. Truncate or summarize older messages.
Tool hallucination. The model invents tool names or parameters that don't exist. Validate tool names against your registry. Return clear error messages so the model can self-correct.
Over-engineering. In reality, most tasks need 2-3 tools and 1-3 iterations. Don't build a 15-tool, multi-agent system when a simple loop will do. Start minimal and add complexity only when you have evidence it's needed.
The pattern is simple: LLM + tools + loop. Everything else is details. Start with the 50-line version, get it working for your use case, and evolve from there.
For more tutorials on building practical AI applications, check out CodeUp.