March 26, 2026 · 12 min read

Building with LLM APIs: OpenAI, Claude, and Beyond

A practical guide to integrating LLM APIs — authentication, chat completions, streaming, function calling, cost optimization, and building real applications.

llm api openai claude ai

A year ago, integrating AI into an application meant training your own models, managing GPU servers, and hiring machine learning engineers. Today, it means making an HTTP request to an API and getting back text that sounds like a human wrote it. The barrier to building AI-powered applications has dropped from "PhD required" to "can you call a REST endpoint."

That doesn't mean it's trivial. LLM APIs have quirks that aren't obvious from the documentation. Token limits force you to think about context management. Streaming responses require different architectural patterns than traditional request-response APIs. Function calling (tool use) opens up entirely new application categories but introduces complexity around schema design and error handling. And costs can spiral from $5/month to $5,000/month if you're not careful about prompt engineering and caching.

This guide covers the practical side of building with LLM APIs — the things you learn after the "Hello World" tutorial.

The Basics: Chat Completions

Every major LLM API follows the same pattern: you send a list of messages, you get back a response message. The messages have roles — system, user, and assistant — that tell the model who said what:

import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant. Be concise and provide working code examples."
        },
        {
            "role": "user",
            "content": "Write a Python function to validate email addresses using regex."
        }
    ],
    temperature=0.3,  # Lower = more deterministic
    max_tokens=1000
)

print(response.choices[0].message.content)

With Anthropic's Claude API:

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant. Be concise and provide working code examples.",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function to validate email addresses using regex."
        }
    ]
)

print(response.content[0].text)

The APIs are similar but not identical. OpenAI puts the system message in the messages array. Anthropic has a separate system parameter. OpenAI uses max_tokens as an optional parameter (defaulting to the model's maximum). Anthropic requires max_tokens explicitly. Both return structured responses with metadata about token usage.

Key parameters that matter:

  • temperature (0.0-2.0): Controls randomness. 0 for deterministic outputs (code generation, classification). 0.7-1.0 for creative writing. Above 1.0 for brainstorming.
  • max_tokens: The maximum length of the response. Set it explicitly to control costs and prevent runaway responses.
  • model: Different models have different capabilities and costs. GPT-4o is more capable but more expensive than GPT-4o-mini. Claude Sonnet balances quality and cost; Claude Opus maximizes quality.

Authentication and API Key Management

Never hardcode API keys. This sounds obvious, but it happens constantly:

import os
from dotenv import load_dotenv

load_dotenv() # Loads from .env file

# Good — read the key from environment variables
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# For web applications, use your framework's secret management
# Django: settings.py with django-environ
# FastAPI: pydantic Settings
# Next.js: .env.local (server-side only)
# .env (never commit this file)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# .gitignore
.env
.env.local

For production, use your cloud provider's secret management — AWS Secrets Manager, GCP Secret Manager, or Cloudflare Workers secrets. Environment variables in a .env file are fine for development.
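As one sketch of that pattern — assuming a hypothetical secret named `prod/openai-api-key` stored as JSON in AWS Secrets Manager — a loader might try the secret store and fall back to an environment variable during development:

```python
import json
import os

def load_api_key(secret_id: str = "prod/openai-api-key") -> str:
    """Fetch an API key from AWS Secrets Manager, falling back to an
    environment variable for local development.

    The secret name and JSON layout here are hypothetical — adjust
    them to match your own setup.
    """
    env_key = os.environ.get("OPENAI_API_KEY")
    if env_key:
        return env_key

    # Lazy import so local development doesn't require boto3
    import boto3

    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as {"OPENAI_API_KEY": "sk-..."}
    return json.loads(secret["SecretString"])["OPENAI_API_KEY"]
```

The environment-variable check comes first deliberately: local runs never touch AWS, and production simply leaves the variable unset.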

Streaming Responses

Non-streaming API calls wait for the entire response before returning anything. For short responses, that's fine. For longer responses, the user stares at a loading spinner for 5-15 seconds. Streaming sends tokens as they're generated:

# OpenAI streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain how hash tables work."}
    ],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

# Anthropic streaming
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain how hash tables work."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

For web applications, you typically stream using Server-Sent Events (SSE):

# FastAPI with streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

async def generate_stream(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            yield f"data: {content}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/api/chat")
async def chat(request: ChatRequest):
    return StreamingResponse(
        generate_stream(request.prompt),
        media_type="text/event-stream"
    )

On the frontend:

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: userMessage })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const text = decoder.decode(value);
  const lines = text.split('\n');

  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      const content = line.slice(6);
      fullResponse += content;
      updateUI(fullResponse);  // Show tokens as they arrive
    }
  }
}

Streaming makes applications feel dramatically faster even though the total response time is similar. Users see the first token in 200-500ms instead of waiting 5-15 seconds for the complete response.

Function Calling and Tool Use

This is where LLM APIs go from "generate text" to "take actions." Function calling lets you define tools the model can use, and the model decides when and how to call them:

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., 'San Francisco, CA'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_price": {"type": "number"},
                    "category": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    }
]

# Send message with tools
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo and find me some umbrellas under $30?"}
    ],
    tools=tools
)

# The model might call multiple tools
import json

message = response.choices[0].message
if message.tool_calls:
    # Execute each tool call
    tool_results = []
    for tool_call in message.tool_calls:
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)

        if name == "get_weather":
            result = get_weather(**args)  # Your actual function
        elif name == "search_products":
            result = search_products(**args)

        tool_results.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    # Send the results back to get a natural language response
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo and find me some umbrellas under $30?"},
            message,  # The assistant's tool call message
            *tool_results  # The tool results
        ],
        tools=tools
    )
    print(final_response.choices[0].message.content)

The model doesn't execute your functions — it generates structured JSON that says "call this function with these arguments." Your code executes the actual function and feeds the result back. The model then generates a natural language response incorporating the results.

This pattern enables AI assistants that can query databases, call APIs, send emails, update records — anything you can write a function for. The model handles understanding intent and generating parameters; your code handles execution and security.

Managing Context and Conversation History

LLM APIs are stateless. Each request is independent. To have a conversation, you must send the entire conversation history with every request:

class Conversation:
    def __init__(self, system_prompt: str, model: str = "gpt-4o"):
        self.model = model
        self.system_prompt = system_prompt
        self.messages: list[dict] = []

    def send(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

        response = client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.messages
            ],
            max_tokens=2000
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

    def get_token_estimate(self) -> int:
        """Rough estimate: 1 token ~ 4 characters"""
        total_chars = sum(len(m["content"]) for m in self.messages)
        total_chars += len(self.system_prompt)
        return total_chars // 4

    def trim_history(self, max_tokens: int = 100000):
        """Remove oldest messages to stay within the context window"""
        while self.get_token_estimate() > max_tokens and len(self.messages) > 2:
            # Always keep the most recent exchange
            self.messages.pop(0)  # Remove the oldest message

Context windows have limits — 128K tokens for GPT-4o, 200K for Claude. Long conversations eventually hit these limits. Strategies for managing this:

  1. Trim old messages — remove the oldest conversation turns
  2. Summarize — periodically ask the LLM to summarize the conversation, then replace the history with the summary
  3. RAG (Retrieval-Augmented Generation) — store conversation history in a vector database and retrieve only relevant parts
  4. Sliding window — keep only the last N messages plus the system prompt
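A minimal sketch of the sliding-window strategy (number 4): keep only the most recent messages, and trim any leading assistant turns so the window never opens mid-exchange:

```python
def sliding_window(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep only the most recent messages, trimmed so the window
    starts on a user turn — the model should never see a dangling
    assistant reply with no preceding question."""
    window = messages[-keep_last:]
    # Drop leading assistant messages so the window opens with a user turn
    while window and window[0]["role"] != "user":
        window.pop(0)
    return window
```

You would apply this just before each API call, prepending the system prompt to whatever the window returns.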

Cost Optimization

LLM API costs are based on tokens (roughly 4 characters = 1 token). Input tokens (your prompt + conversation history) and output tokens (the model's response) are priced differently, with output typically costing 3-4x more:

# Track costs per request
def estimate_cost(response, model="gpt-4o"):
    usage = response.usage
    # Prices as of early 2026 (check current pricing)
    prices = {
        "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
        "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
        "claude-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    }
    p = prices.get(model, prices["gpt-4o"])
    input_cost = usage.prompt_tokens * p["input"]
    output_cost = usage.completion_tokens * p["output"]
    return {
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cost": round(input_cost + output_cost, 6)
    }

Cost reduction strategies:

  • Use the cheapest model that works. GPT-4o-mini handles most tasks. Only use GPT-4o or Claude Opus for complex reasoning.
  • Cache responses. If the same question is asked repeatedly, serve a cached answer instead of calling the API.
  • Minimize context. Only include relevant conversation history, not the entire chat.
  • Set max_tokens. Prevent the model from generating 4,000 tokens when 200 would suffice.
  • Batch non-urgent requests. Some providers offer batch APIs at 50% discount for requests that don't need immediate responses.
  • Prompt caching. Anthropic and OpenAI offer prompt caching — if the beginning of your prompt matches a recent request, cached input tokens are much cheaper.
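The response-caching strategy above can be sketched with a hash of the model plus messages as the key — here with an in-memory dict for simplicity, though production systems typically use Redis with a TTL:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(model: str, messages: list[dict]) -> str:
    """Stable key: hash of the model name plus the full message list.
    sort_keys=True makes the serialization deterministic."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list[dict], call_fn) -> str:
    """Serve a cached answer when the exact same request was seen
    before; call_fn is whatever actually hits the LLM API."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_fn(model, messages)
    return _cache[key]
```

Note this only helps for exact repeats (FAQ-style traffic); it does nothing for conversations where every request includes fresh history.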

Error Handling

LLM APIs fail in ways that traditional APIs don't. Rate limits, context length exceeded, content filters, model overload — you need to handle all of them:

import time
from openai import (
    RateLimitError,
    APITimeoutError,
    APIConnectionError,
    BadRequestError
)

def call_llm_with_retry(messages, max_retries=3, model="gpt-4o"):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=2000,
                timeout=30
            )
            return response

        except RateLimitError:
            # Exponential backoff
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)

        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise

        except APIConnectionError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(1)

        except BadRequestError as e:
            # Context length exceeded, invalid model, content filtered
            if "context_length_exceeded" in str(e):
                # Trim messages and retry
                messages = trim_messages(messages)
                continue
            raise  # Don't retry other bad requests

    raise Exception("Max retries exceeded")

Also handle the model returning unexpected output. If you expect JSON and get markdown with a JSON block inside it, you need to extract it:

import json
import re

def parse_llm_json(text: str) -> dict:
    """Extract JSON from an LLM response, handling markdown code blocks."""
    # Try direct parsing first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Try extracting from a markdown code block
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass

    raise ValueError(f"Could not parse JSON from response: {text[:200]}")


Building a Real Chatbot

Putting it all together — here's a simple but production-ready chatbot backend:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import uuid

app = FastAPI()

# In production, use Redis or a database
conversations: dict[str, list[dict]] = {}

SYSTEM_PROMPT = """You are a helpful customer support assistant for an online store.
You can help with order status, returns, and product questions.
Be friendly but concise. If you don't know something, say so."""

class ChatRequest(BaseModel):
    message: str
    conversation_id: Optional[str] = None

class ChatResponse(BaseModel):
    response: str
    conversation_id: str

@app.post("/api/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    # Get or create conversation
    conv_id = request.conversation_id or str(uuid.uuid4())
    history = conversations.get(conv_id, [])

    # Add user message
    history.append({"role": "user", "content": request.message})

    # Call LLM
    try:
        response = call_llm_with_retry(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *history[-20:]  # Last 20 messages to manage context
            ]
        )
    except Exception:
        raise HTTPException(status_code=500, detail="AI service unavailable")

    assistant_message = response.choices[0].message.content

    # Save to history
    history.append({"role": "assistant", "content": assistant_message})
    conversations[conv_id] = history

    return ChatResponse(
        response=assistant_message,
        conversation_id=conv_id
    )

This handles conversation persistence, context management (limiting to last 20 messages), error handling, and clean API design.

Choosing Between Providers

  • OpenAI (GPT-4o, GPT-4o-mini): Largest ecosystem, most third-party integrations, widest model selection. Best for: general-purpose applications, image understanding, multimodal tasks.
  • Anthropic (Claude Sonnet, Claude Opus): Strong at long-context tasks (200K tokens), coding, and following complex instructions. Best for: code generation, document analysis, applications requiring careful instruction following.
  • Google (Gemini): Competitive pricing, multimodal by default, strong at factual tasks. Best for: applications already in the Google ecosystem, long-context tasks (up to 1M tokens with Gemini 1.5 Pro).
  • Open-source (Llama, Mistral, via providers like Together, Groq): Lower cost, no vendor lock-in, can self-host for data privacy. Best for: cost-sensitive applications, privacy-critical use cases, specialized fine-tuning.

In practice, many production applications use multiple models — a cheaper model for simple tasks, a more capable one for complex reasoning, and specific models for specific tasks like embedding or classification.
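A routing layer for that multi-model setup can be as simple as a lookup plus a heuristic. In this sketch the task names, model choices, and length threshold are all illustrative assumptions you would tune for your own workload:

```python
def pick_model(task: str, prompt: str) -> str:
    """Route requests to a cheap model by default, escalating for
    complex work. Task names and the threshold are illustrative."""
    routes = {
        "classification": "gpt-4o-mini",
        "summarization": "gpt-4o-mini",
        "code_generation": "claude-sonnet-4-20250514",
        "complex_reasoning": "gpt-4o",
    }
    model = routes.get(task, "gpt-4o-mini")
    # Very long prompts often signal harder tasks — escalate past the default
    if model == "gpt-4o-mini" and len(prompt) > 8000:
        model = "gpt-4o"
    return model
```

Keeping the routing table in one place also makes it easy to rerun your evaluation suite when a cheaper model becomes good enough for a tier.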

Where to Go From Here

The LLM API landscape is evolving rapidly. New models, new capabilities, new providers appear monthly. The fundamentals stay the same: messages in, text out, handle errors, manage costs, stream for UX.

Start by building something specific — a chatbot for your personal knowledge base, a code review assistant, a tool that summarizes meeting transcripts. Specific projects teach you more than generic tutorials because they force you to solve real problems: context management, prompt engineering, and handling the cases where the model gets things wrong.

For strengthening the programming fundamentals that make API integration work — HTTP requests, JSON parsing, async programming, error handling — practice on CodeUp. The better your programming skills, the more you can focus on the AI-specific challenges instead of fighting with basic code.
