LLM Inference Cost Estimator
Translate model size, precision, traffic, and a provider SKU into a defensible monthly cost. Compare providers side-by-side and see where utilization changes the bill.
How it works
The estimator computes the memory needed to host the model at the chosen precision, derives how many GPUs of the chosen SKU it takes to fit, and multiplies by the number of copies needed to serve the requested throughput. Cost = hourly rate × instance count × hours per month.
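The pipeline above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation; the parameter names, the 1.2 KV-cache multiplier, and the 0.7 utilization default are assumptions taken from the text, and 730 hours per month is an approximation.

```python
import math

def estimate_monthly_cost(
    params_b: float,          # model size in billions of parameters
    bytes_per_param: float,   # precision: 2 for fp16/bf16, 1 for int8, 0.5 for int4
    req_per_s: float,         # target request rate
    tok_per_req: float,       # average tokens generated per request
    gpu_vram_gb: float,       # VRAM per GPU on the chosen SKU
    gpus_per_instance: int,   # GPUs per instance
    hourly_rate: float,       # on-demand $/hour for the instance
    tok_per_s_per_gpu: float, # measured or default throughput per GPU
    kv_multiplier: float = 1.2,
    utilization: float = 0.7,
    hours_per_month: float = 730.0,
) -> dict:
    # Memory to host the model at the chosen precision, plus KV-cache headroom.
    weights_gb = params_b * bytes_per_param        # 1e9 params x bytes ~= GB
    total_gb = weights_gb * kv_multiplier
    gpus_to_fit = math.ceil(total_gb / gpu_vram_gb)

    # Copies needed to serve the requested throughput at the utilization target.
    needed_tok_per_s = req_per_s * tok_per_req
    effective_tok_per_s = tok_per_s_per_gpu * gpus_per_instance * utilization
    copies = math.ceil(needed_tok_per_s / effective_tok_per_s)

    instances = math.ceil(gpus_to_fit / gpus_per_instance) * copies
    return {
        "gpus_to_fit": gpus_to_fit,
        "instances": instances,
        "monthly_cost": hourly_rate * instances * hours_per_month,
    }

# Example: a 70B model at fp16 on a hypothetical 8x80GB instance at $30/hour.
result = estimate_monthly_cost(70, 2, 10, 500, 80, 8, 30.0, 1000)
```

For the example call, 70B parameters at 2 bytes each is 140 GB of weights, 168 GB with the KV-cache multiplier, so three 80 GB GPUs to fit; one instance covers both the memory and the 5,000 tok/s demand.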
Assumptions you can override
- Throughput per GPU. Rough defaults per GPU type are baked in. Override in the advanced section if you have measured numbers.
- Utilization. Real systems rarely run at 100% — the default of 0.7 leaves headroom for spikes.
- KV-cache multiplier. A coarse 1.2× catch-all for activation and KV-cache memory on top of the weights.
- Replicas / redundancy. Multipliers for high-availability or N+M designs.
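To make the utilization and replica overrides concrete, here is a hedged sketch of how derating capacity changes the instance count; the throughput figures are made up, not catalog values.

```python
import math

def instances_needed(needed_tok_per_s: float, tok_per_s_per_instance: float,
                     utilization: float, replicas: int = 1) -> int:
    # Effective capacity is derated by the utilization factor;
    # the replica multiplier covers high-availability / N+M designs.
    effective = tok_per_s_per_instance * utilization
    return math.ceil(needed_tok_per_s / effective) * replicas

# The same 9,000 tok/s workload on hypothetical 4,000 tok/s instances:
instances_needed(9000, 4000, 1.0)  # 3 instances at naive 100% utilization
instances_needed(9000, 4000, 0.7)  # 4 instances at the 0.7 default
```

Lowering utilization from 1.0 to 0.7 adds a fourth instance here, which is exactly the kind of step change the sensitivity view is meant to surface.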
Edge cases and notes
- If a model exceeds a single GPU's VRAM, the calculator assumes tensor parallelism within the chosen instance. If the model also exceeds the instance's total VRAM, the feasibility banner says so.
- Mixture-of-experts models still need to keep all expert weights resident, even though only the active subset participates per token. The default `params` reflects total weights.
- Prices reflect on-demand list rates as of the catalog date. Spot, sustained-use, and committed-use discounts can lower the bill substantially.
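The feasibility logic described in these notes can be sketched as follows, assuming tensor parallelism splits weights evenly across GPUs within one instance; the function and its messages are illustrative, not the tool's actual banner text.

```python
import math

def feasibility(model_gb: float, gpu_vram_gb: float, gpus_per_instance: int) -> str:
    instance_vram = gpu_vram_gb * gpus_per_instance
    if model_gb <= gpu_vram_gb:
        return "fits on a single GPU"
    if model_gb <= instance_vram:
        # Shard the weights across just enough GPUs within the instance.
        shards = math.ceil(model_gb / gpu_vram_gb)
        return f"needs tensor parallelism across {shards} GPUs"
    return "exceeds instance VRAM: infeasible on this SKU"

# A hypothetical 140 GB model on an 8x80GB instance vs. an 800 GB model:
feasibility(140, 80, 8)  # tensor parallelism across 2 GPUs
feasibility(800, 80, 8)  # infeasible on this SKU
```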
FAQ
Why is my real bill different?
Real cost depends on batch size, prompt length distribution, KV-cache reuse, fragmentation, network egress, and your inference engine. This tool assumes a steady throughput pattern with one configurable utilization factor.
How do I model my own GPU?
Pick the closest provider SKU and use the throughput override in the advanced section. The throughput field is the most important lever after model size.
How current are the prices?
The catalog is versioned. The bottom of this page shows the version date.