March 27, 2026 · 9 min read

Multi-Agent AI Systems: Build with CrewAI and AutoGen

A practical tutorial on building multi-agent AI systems with CrewAI and AutoGen. Covers when multi-agent makes sense, building agent teams, and an honest comparison of frameworks.

multi-agent crewai autogen ai agents python tutorial

Multi-agent AI is having a moment. The idea is compelling: instead of one LLM trying to do everything, you have specialized agents -- a researcher, an analyst, a writer -- collaborating on complex tasks. Each agent has its own role, tools, and instructions.

Here's the thing though: most tasks don't need multiple agents. A single agent with the right tools handles 80% of real-world use cases. Multi-agent adds coordination overhead, increases cost, and introduces failure modes that single agents don't have.

But for the 20% where you genuinely need specialization and collaboration -- research pipelines, code review workflows, content creation with multiple perspectives -- multi-agent systems are genuinely powerful.

Let's build with the two most popular frameworks, then talk honestly about when to use each one.

When You Actually Need Multi-Agent

Before writing code, let's be clear about when multi-agent makes sense:

Good use cases:
  • Research pipeline: one agent searches, another analyzes, another writes
  • Code review: separate agents check for bugs, security issues, and style
  • Complex workflows where each step requires different expertise
  • Debate/adversarial setups where agents challenge each other's conclusions
Bad use cases (use a single agent instead):
  • Simple Q&A with tools
  • Anything a single prompt can handle
  • Tasks where adding agents just adds latency and cost
  • When you're using multi-agent because it sounds cool, not because you need it

CrewAI: Role-Based Agent Teams

CrewAI is the simplest way to get multi-agent working. You define agents with roles, give them tasks, and let them collaborate.

pip install crewai crewai-tools

Building a Research Crew

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, ScrapeWebsiteTool

# Tools
search_tool = SerperDevTool()  # Requires SERPER_API_KEY
scrape_tool = ScrapeWebsiteTool()

# Agent 1: The Researcher
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find comprehensive, accurate information about the given topic",
    backstory="""You are an experienced research analyst who excels at
    finding reliable sources and extracting key facts. You never make
    claims without evidence and always note when information is uncertain.""",
    tools=[search_tool, scrape_tool],
    llm="claude-sonnet-4-20250514",
    verbose=True,
)

# Agent 2: The Analyst
analyst = Agent(
    role="Data Analyst",
    goal="Analyze research findings and identify key insights, trends, and implications",
    backstory="""You are a sharp analytical thinker who excels at finding
    patterns in data, identifying what matters, and cutting through noise.
    You challenge assumptions and look for what others miss.""",
    llm="claude-sonnet-4-20250514",
    verbose=True,
)

# Agent 3: The Writer
writer = Agent(
    role="Technical Writer",
    goal="Create clear, engaging, well-structured content from research and analysis",
    backstory="""You are a skilled technical writer who turns complex
    information into readable, actionable content. You write for a
    developer audience -- concise, practical, no fluff.""",
    llm="claude-sonnet-4-20250514",
    verbose=True,
)

The backstory field is essentially a system prompt. It shapes how the agent behaves, what it prioritizes, and how it communicates. Spend time on these -- vague backstories produce vague results.
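To see why these fields matter, here's an illustrative sketch of how role, goal, and backstory might compose into a single system prompt. CrewAI's actual internal template differs and the function name is made up for illustration; the point is simply that all three fields end up in front of the model, so vague text in any of them dilutes the prompt.

```python
def build_system_prompt(role: str, goal: str, backstory: str) -> str:
    """Compose an agent's persona fields into one system prompt.

    Hypothetical helper -- CrewAI's real template is internal and
    differs from this. It just shows that role, backstory, and goal
    all become model-facing instructions.
    """
    return (
        f"You are {role}.\n"
        f"{backstory}\n"
        f"Your personal goal is: {goal}"
    )

prompt = build_system_prompt(
    role="Senior Research Analyst",
    goal="Find comprehensive, accurate information about the given topic",
    backstory="You are an experienced research analyst who never makes claims without evidence.",
)
print(prompt)
```

Swap in a vague backstory ("You are a helpful assistant") and you can see exactly how little guidance the model would receive.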

Defining Tasks

# Task 1: Research
research_task = Task(
    description="""Research the current state of {topic}. Find:
    1. The most important recent developments (last 6 months)
    2. Key players and their approaches
    3. Technical details that developers should know
    4. Common misconceptions
    Search at least 5 different sources.""",
    expected_output="A detailed research report with sources cited",
    agent=researcher,
)

# Task 2: Analyze
analysis_task = Task(
    description="""Analyze the research findings and provide:
    1. Top 3 most important takeaways for developers
    2. Comparison of different approaches (pros/cons table)
    3. Prediction of where things are heading in the next 12 months
    4. Specific recommendations for different use cases""",
    expected_output="A structured analysis with clear recommendations",
    agent=analyst,
    context=[research_task],  # Gets output from research task
)

# Task 3: Write
writing_task = Task(
    description="""Write a comprehensive technical blog post based on the
    research and analysis. Requirements:
    - 800-1200 words
    - Include code examples where relevant
    - Use headers and bullet points for scannability
    - Developer-friendly tone (conversational, not academic)
    - End with actionable next steps""",
    expected_output="A polished blog post ready for publication",
    agent=writer,
    context=[research_task, analysis_task],  # Gets both previous outputs
)

Running the Crew

# Create and run the crew
crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[research_task, analysis_task, writing_task],
    process=Process.sequential,  # Tasks run in order
    verbose=True,
)

result = crew.kickoff(inputs={"topic": "WebAssembly for server-side applications"})
print(result)

With verbose=True, you'll see each agent working -- the researcher searching and reading sources, the analyst pulling out insights, the writer crafting the final piece. It takes a few minutes and several API calls, but the output is significantly better than asking a single LLM to do all three tasks at once.

AutoGen: Conversation-Based Agents

AutoGen takes a different approach. Instead of role-task assignment, agents have conversations with each other. They debate, ask follow-up questions, and reach conclusions through dialogue.

pip install autogen-agentchat autogen-ext[openai]

import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Create a model client (AutoGen uses OpenAI-compatible clients)
model_client = OpenAIChatCompletionClient(
    model="claude-sonnet-4-20250514",
    api_key="sk-ant-...",
    base_url="https://api.anthropic.com/v1/",
)

# Agent 1: Code Reviewer
code_reviewer = AssistantAgent(
    name="CodeReviewer",
    model_client=model_client,
    system_message="""You are a senior code reviewer. When shown code, you:
    1. Identify bugs and logic errors
    2. Point out performance issues
    3. Suggest concrete improvements with code examples
    Be specific. Point to exact lines. Don't just say "could be better" -- show how.""",
)

# Agent 2: Security Auditor
security_auditor = AssistantAgent(
    name="SecurityAuditor",
    model_client=model_client,
    system_message="""You are a security specialist. When reviewing code, you:
    1. Identify injection vulnerabilities (SQL, XSS, command injection)
    2. Check authentication and authorization issues
    3. Look for data exposure risks
    4. Suggest security hardening measures
    Focus only on security. Other agents handle general code quality.""",
)

# Agent 3: Summarizer (ends the conversation)
summarizer = AssistantAgent(
    name="Summarizer",
    model_client=model_client,
    system_message="""You summarize the review findings from other agents into
    a clear, prioritized action list. Categorize by severity (critical, warning,
    suggestion). Say APPROVE if there are no critical issues, or REQUEST_CHANGES
    if there are. End your message with TERMINATE.""",
)

# Termination condition
termination = TextMentionTermination("TERMINATE")

# Group chat
team = RoundRobinGroupChat(
    participants=[code_reviewer, security_auditor, summarizer],
    termination_condition=termination,
)

# Run the review
# Run the review
async def run_review():
    code = """
    @app.route('/user/<id>')
    def get_user(id):
        query = f"SELECT * FROM users WHERE id = {id}"
        user = db.execute(query).fetchone()
        return jsonify(dict(user))
    """

    result = await team.run(
        task=f"Review this Python Flask code:\n```python\n{code}\n```"
    )

    for message in result.messages:
        print(f"\n--- {message.source} ---")
        print(message.content)

asyncio.run(run_review())


In this setup, the code reviewer catches general issues, the security auditor catches the SQL injection vulnerability, and the summarizer combines everything into a prioritized list. Each agent sees the conversation history, so they build on each other's findings.
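The round-robin mechanics are easier to see without the framework. Here's a framework-free sketch of the loop AutoGen runs for you: each agent gets the full shared transcript, and the loop stops on a TERMINATE mention or a round cap. The stub lambdas stand in for real LLM-backed agents; everything here is illustrative, not AutoGen's actual implementation.

```python
def run_round_robin(agents, task, max_rounds=10):
    """Minimal round-robin group chat sketch. Every agent reads the
    whole shared history; the loop ends when a reply mentions
    TERMINATE or the round cap is hit (guarding against infinite loops)."""
    history = [("user", task)]
    for i in range(max_rounds):
        name, agent_fn = agents[i % len(agents)]
        reply = agent_fn(history)          # agent sees the full transcript
        history.append((name, reply))
        if "TERMINATE" in reply:
            break
    return history

# Stub agents standing in for real LLM-backed agents
reviewer = lambda h: "Bug: unsanitized input in the query string."
auditor = lambda h: "Critical: SQL injection via string-formatted query."
summarizer = lambda h: f"{len(h) - 1} findings. REQUEST_CHANGES. TERMINATE"

transcript = run_round_robin(
    [("CodeReviewer", reviewer), ("SecurityAuditor", auditor), ("Summarizer", summarizer)],
    task="Review this code.",
)
```

Because every agent receives the accumulated history, the summarizer can react to both earlier reviews -- which is exactly what you lose if agents produce isolated monologues.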

CrewAI vs AutoGen vs LangGraph

Here's the honest comparison:

Feature             CrewAI            AutoGen          LangGraph
Mental model        Role assignment   Conversation     State machine
Learning curve      Low               Medium           High
Flexibility         Medium            High             Very high
Best for            Pipelines         Debate/review    Complex workflows
Code required       Least             Medium           Most
Control over flow   Limited           Medium           Full

Choose CrewAI when: you have a clear pipeline (research, then analyze, then write), roles are well-defined, and you want to get something working fast.

Choose AutoGen when: agents need to discuss and debate, the workflow isn't strictly linear, or you want agents to ask each other follow-up questions.

Choose LangGraph when: you need precise control over state transitions, complex conditional branching, human-in-the-loop checkpoints, or your workflow has cycles.

A Real Example: Competitive Analysis Pipeline

Here's a more realistic CrewAI example -- a competitive analysis that actually produces useful output:

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

search = SerperDevTool()

scout = Agent(
    role="Competitive Intelligence Scout",
    goal="Gather detailed information about competitor products, pricing, and recent changes",
    backstory="You are a competitive intelligence specialist who knows where to find product information, pricing pages, changelogs, and customer reviews.",
    tools=[search],
    llm="claude-sonnet-4-20250514",
)

strategist = Agent(
    role="Product Strategist",
    goal="Identify competitive advantages, gaps, and opportunities based on competitive intelligence",
    backstory="You are a product strategist who excels at SWOT analysis and identifying market positioning opportunities. You think in terms of user needs, not features.",
    llm="claude-sonnet-4-20250514",
)

gather_intel = Task(
    description="Research these competitors for our {product_type} product: {competitors}. For each, find: pricing tiers, key features, recent launches (last 3 months), notable customer complaints, and market positioning.",
    expected_output="Structured competitive intelligence report with data for each competitor",
    agent=scout,
)

analyze = Task(
    description="Based on the competitive intelligence, create: 1) Feature comparison matrix, 2) Pricing comparison, 3) SWOT analysis for our product vs each competitor, 4) Top 3 opportunities we should pursue, 5) Top 3 threats to address.",
    expected_output="Strategic analysis with actionable recommendations",
    agent=strategist,
    context=[gather_intel],
    output_file="competitive_analysis.md",
)

crew = Crew(
    agents=[scout, strategist],
    tasks=[gather_intel, analyze],
    process=Process.sequential,
)

result = crew.kickoff(inputs={
    "product_type": "developer documentation platform",
    "competitors": "ReadMe, GitBook, Mintlify, Docusaurus",
})

Cost and Performance Reality

Let's be honest about the numbers. Multi-agent systems are expensive:

Setup                        Typical API Calls   Approximate Cost
Single agent, 3 tools        2-4 calls           $0.02-0.05
3-agent CrewAI pipeline      8-15 calls          $0.10-0.40
4-agent AutoGen discussion   12-25 calls         $0.20-0.80
Complex research crew        20-40 calls         $0.50-2.00

Each agent interaction involves sending the full conversation context, which means token costs compound. A 3-agent pipeline where each agent does meaningful work can easily use 50K-100K tokens total. To manage costs:
  • Use Haiku for simple agents (routing, classification)
  • Use Sonnet for complex agents (analysis, writing)
  • Set max_iter limits on agents
  • Cache tool results across agents when possible
  • Start with 2 agents and add more only if the output quality demands it
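The compounding is worth seeing concretely. This back-of-envelope sketch estimates cumulative input tokens for a sequential pipeline where each agent re-reads the task plus every previous agent's output; the token counts are illustrative assumptions, not measurements.

```python
def pipeline_input_tokens(step_output_tokens, task_tokens=500):
    """Estimate total input tokens for a sequential pipeline in which
    each agent receives the task plus all prior outputs as context.
    Token counts are illustrative, not measured."""
    total = 0
    context = task_tokens
    for out in step_output_tokens:
        total += context      # this agent's input: everything so far
        context += out        # its output joins the next agent's context
    return total

# Three agents each producing ~2,000 tokens of output
print(pipeline_input_tokens([2000, 2000, 2000]))  # → 7500
```

A 500-token task balloons to 7,500 input tokens across three agents before counting any output tokens -- which is why trimming context and capping agents matters more than per-call optimization.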

Common Mistakes

Too many agents. Three specialized agents beat six vaguely defined ones every time. Each agent should have a clear, distinct responsibility.

Vague backstories. "You are a helpful assistant" produces generic output. "You are a security researcher who specializes in OWASP Top 10 vulnerabilities and has reviewed 500+ codebases" produces focused, expert-level output.

No termination conditions. Agents can get into infinite discussion loops. Always set iteration limits and explicit termination triggers.

Using multi-agent for single-agent tasks. If you can describe the task in one prompt and get a good answer, you don't need multiple agents. The overhead isn't free.

Ignoring the conversation. In AutoGen-style setups, the magic happens in the inter-agent dialogue. If agents are just producing monologues without reacting to each other, you've built a pipeline with extra steps.

Multi-agent AI is a powerful pattern for genuinely complex tasks. But the complexity cost is real. Start with a single agent, add tools, and only graduate to multi-agent when you have clear evidence that specialization improves your results.

For more practical AI development tutorials, check out CodeUp.
