AI Inference Cost Calculator — Cloud vs Self-Hosted
Estimate the true cost of running AI inference at scale. Compare cloud API pricing against self-hosted GPU costs for your specific workload.
At prototype stage, AI inference costs seem trivial. At production scale with thousands of users, they can become the dominant line item in your infrastructure budget. The crossover point between "use the API" and "self-host the model" is different for every team, and it's worth calculating before you're committed to an architecture.
Two Cost Models: API vs Self-Hosted
Cloud API (OpenAI, Anthropic, Google, etc.):
- Pay per token, no infrastructure management
- Instantly scalable, no hardware risk
- Expensive at high volume
- Latency depends on provider load
Self-Hosted (your own GPUs):
- Pay for GPU time regardless of utilization
- Fixed cost per hour, variable cost per query
- Economical at high utilization
- You manage everything
API Cost Reference (March 2026 approximate rates)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
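At these rates, monthly spend is a linear function of token volume, so it's easy to script. A minimal sketch in Python (the function name and the caching parameters are illustrative, not any provider's API; check your provider's actual caching terms):

```python
def monthly_api_cost(requests, in_tokens, out_tokens,
                     in_rate_per_m, out_rate_per_m,
                     cached_input_fraction=0.0, cache_discount=0.5):
    """Estimate monthly API cost in dollars.

    in_rate_per_m / out_rate_per_m: $ per 1M tokens (see table above).
    cached_input_fraction: share of input tokens served from a prompt
    cache; cache_discount is the price cut on those tokens (both are
    assumptions -- providers differ).
    """
    in_m = requests * in_tokens / 1e6    # input tokens, in millions
    out_m = requests * out_tokens / 1e6  # output tokens, in millions
    effective_in_rate = in_rate_per_m * (1 - cached_input_fraction * cache_discount)
    return in_m * effective_in_rate + out_m * out_rate_per_m

# Example: 100k requests/month, 800 in + 300 out tokens, GPT-4o mini rates
print(monthly_api_cost(100_000, 800, 300, 0.15, 0.60))  # -> 30.0
```

With a large, consistent system prompt (say 90% of input cached at a 50% discount), the same workload drops to $24.60.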
Break-Even Analysis: When to Self-Host
Suppose your app sends 500,000 requests/month, each with 500 input + 200 output tokens.
Monthly token volume: 500,000 × 700 = 350 million tokens
| Option | Monthly Cost |
|---|---|
| GPT-4o | ~$1,625 |
| GPT-4o mini | ~$98 |
| Self-hosted 7B model (A10G, 24×7) | ~$520 GPU + ops |
| Self-hosted 7B model (spot A10G) | ~$180 GPU + ops |
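The comparison above reduces to a few lines of arithmetic: API cost scales with tokens, GPU cost scales with hours. A rough sketch (the $0.71/hr reserved A10G rate is an assumption; substitute your actual quote):

```python
HOURS_PER_MONTH = 730

def api_cost(requests, in_tok, out_tok, in_rate, out_rate):
    # $ per month, with rates quoted per 1M tokens
    return (requests * in_tok * in_rate + requests * out_tok * out_rate) / 1e6

def gpu_cost(hourly_rate, hours=HOURS_PER_MONTH):
    # Fixed cost: you pay for the instance whether or not it serves traffic
    return hourly_rate * hours

requests = 500_000
print(f"GPT-4o:      ${api_cost(requests, 500, 200, 2.50, 10.00):,.0f}")
print(f"GPT-4o mini: ${api_cost(requests, 500, 200, 0.15, 0.60):,.0f}")
print(f"A10G 24x7:   ${gpu_cost(0.71):,.0f} + ops")  # assumed reserved rate
```

Note the structural difference: the API lines grow linearly with `requests`, while the GPU line is flat until one instance can no longer keep up.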
Hidden Costs That Get Overlooked
- Latency at scale. High-throughput API calls often hit rate limits, requiring queuing infrastructure. That's engineering time.
- Prompt caching. Most providers now offer prompt caching for repeated system prompts, reducing effective input cost by 50–90% if your system prompt is large and consistent. The calculator includes a caching discount field.
- GPU idle time. Self-hosted inference is expensive even when traffic is zero. If your usage is bursty rather than steady, auto-scaling on cloud GPUs (renting instances only when needed) often beats 24/7 reserved instances.
- Batch inference. Running non-real-time inference in large batches on the same GPU gets dramatically better throughput than live serving. For tasks like document processing or nightly jobs, batch pricing from API providers is typically 50% cheaper.
Tips for Reducing Inference Costs
- Model distillation: A small model trained to mimic a larger one often achieves 90% of the quality at 10% of the cost.
- Caching responses: For identical or near-identical queries, a semantic cache can eliminate 30–60% of API calls.
- Smaller context windows: The longer the conversation history you send, the more you pay per request. Rolling truncation helps.
- Task-specific fine-tuning: A fine-tuned 7B model often outperforms a general 70B model on specific tasks, at much lower serving cost.
At what volume does self-hosting start making financial sense?
Typically when your GPU instance is utilized more than 60–70% of the time. Below that, you're paying for idle hardware. The calculator computes your utilization rate and flags the break-even threshold.
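You can sanity-check that threshold yourself if you know (or have benchmarked) how many requests per second one GPU sustains. A sketch, where the 5 req/s throughput is an assumed figure:

```python
def gpu_utilization(requests_per_month, gpu_throughput_rps,
                    hours_per_month=730):
    """Fraction of the month the GPU spends doing useful work."""
    busy_seconds = requests_per_month / gpu_throughput_rps
    return busy_seconds / (hours_per_month * 3600)

# 500k requests/month on a GPU that sustains 5 req/s (assumed)
u = gpu_utilization(500_000, 5)
print(f"{u:.1%}")
if u < 0.6:
    print("Below break-even: API or serverless likely cheaper")
```

At this volume the GPU sits idle more than 95% of the time, which is exactly the regime where fixed-cost hardware loses to pay-per-token pricing.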
Does latency differ much between API and self-hosted?
Significantly. Self-hosted models on dedicated hardware often achieve 50–100ms time-to-first-token for 7B models. Cloud APIs vary: sub-500ms at low load, 2–5 seconds during peak times. For real-time applications, this can matter as much as cost.
Is there a middle option between API and full self-hosting?
Yes — serverless GPU inference (Replicate, Modal, RunPod Serverless, etc.) gives you pay-per-request pricing on real hardware. It's more expensive per token than reserved GPUs but cheaper than premium APIs at moderate volumes.
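To see where serverless lands for your workload, compare cost per request under all three models. Every rate below is an illustrative assumption, not a quote from the named providers:

```python
def per_request_api(in_tok, out_tok, in_rate, out_rate):
    # Pure pay-per-token, rates per 1M tokens
    return (in_tok * in_rate + out_tok * out_rate) / 1e6

def per_request_serverless(gpu_seconds, rate_per_second):
    # Pay only for compute actually consumed, at a per-second premium
    return gpu_seconds * rate_per_second

def per_request_reserved(monthly_gpu_cost, requests_per_month):
    # Fixed cost amortized over however many requests you actually serve
    return monthly_gpu_cost / requests_per_month

print(per_request_api(500, 200, 2.50, 10.00))   # GPT-4o rates from the table
print(per_request_serverless(0.2, 0.0005))      # assumed: 0.2 GPU-s/req at $0.0005/s
print(per_request_reserved(520, 500_000))       # A10G figure from the table
```

Notice that the reserved line is the only one that depends on volume: at low traffic its amortized cost per request balloons, which is why serverless often wins in the middle of the range.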
Related Calculators
- AI API Cost Calculator — monthly API budget projection
- Token Counter Calculator — measure tokens before projecting cost
- GPU Memory Calculator — size hardware for self-hosted inference