Engineering estimate · runs locally

LLM Inference Cost Estimator

Translate model size, precision, traffic, and a provider SKU into a defensible monthly cost. Compare providers side-by-side and see where utilization changes the bill.

This is an engineering estimate, not a billing quote. Real cost depends on batch size, request mix, KV-cache reuse, networking, and your inference engine. Use the formulas drawer to see the assumptions and override them.

[Interactive calculator UI: Workload, Advanced, and Examples input panels; Cost estimate, Cost rollups, Provider scenarios, and Sensitivity to utilization result panels; and a "Show formulas and assumptions" drawer.]

How it works

The estimator computes the memory needed to host the model weights at the chosen precision, derives how many GPUs of the chosen SKU it takes for the model to fit, and multiplies that GPU count by the number of copies needed to serve the requested throughput at the configured utilization. Cost = hourly rate × instance count × hours per month.
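For concreteness, here is a minimal TypeScript sketch of that computation. All field names, the 0.9 usable-memory headroom factor, and the 730 hours per month are illustrative assumptions, not the tool's actual internals; it also models weights only and ignores KV-cache memory:

    interface Sku {
      gpuMemGiB: number;          // memory per GPU
      gpusPerInstance: number;    // GPUs bundled into one billable instance
      hourlyUsd: number;          // price per instance-hour
      tokensPerSecPerGpu: number; // peak per-GPU throughput for this model
    }

    function monthlyCost(
      paramsBillions: number,
      bytesPerParam: number,      // 2 for fp16/bf16, 1 for int8, 0.5 for int4
      targetTokensPerSec: number,
      utilization: number,        // 0..1: fraction of peak throughput actually achieved
      sku: Sku,
      hoursPerMonth = 730,
    ): number {
      // 1) Memory needed to host the weights at the chosen precision.
      const weightsGiB = (paramsBillions * 1e9 * bytesPerParam) / 1024 ** 3;

      // 2) GPUs of this SKU it takes to fit, leaving ~10% headroom per GPU (assumed).
      const gpusToFit = Math.ceil(weightsGiB / (sku.gpuMemGiB * 0.9));

      // 3) Copies needed to serve the requested throughput at the given utilization.
      const copyTokensPerSec = gpusToFit * sku.tokensPerSecPerGpu * utilization;
      const copies = Math.ceil(targetTokensPerSec / copyTokensPerSec);

      // 4) Cost = hourly rate × instance count × hours per month.
      const instances = Math.ceil((gpusToFit * copies) / sku.gpusPerInstance);
      return sku.hourlyUsd * instances * hoursPerMonth;
    }

    // Example: a 70B model at fp16 (2 bytes/param) on a hypothetical 8×80 GiB SKU.
    const exampleSku: Sku = {
      gpuMemGiB: 80,
      gpusPerInstance: 8,
      hourlyUsd: 32,
      tokensPerSecPerGpu: 400,
    };
    console.log(monthlyCost(70, 2, 5000, 0.6, exampleSku)); // 70080: 3 instances × $32/h × 730 h

Packing copies into instances with a single ceiling is itself a simplification: here the 11 two-GPU copies are billed as 3 eight-GPU instances, which assumes copies can share an instance.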

Assumptions you can override

Edge cases and notes

FAQ

Why is my real bill different?
Real cost depends on batch size, prompt length distribution, KV-cache reuse, fragmentation, network egress, and your inference engine. This tool assumes a steady throughput pattern with one configurable utilization factor.
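For intuition, a hypothetical worked example: at a peak of 1,000 tokens/s per copy, setting utilization to 0.5 makes the estimator size the fleet as if each copy delivers 500 tokens/s, so a 2,000 tokens/s target needs 4 copies instead of 2 and the estimated bill doubles.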
How do I model my own GPU?
Pick the closest provider SKU and use the throughput override in the advanced section. The throughput field is the most important lever after model size.
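In the sketch under "How it works" (same assumed names), the override amounts to replacing the catalog throughput with a number measured on your hardware:

    // Hypothetical: reuse exampleSku but with your measured per-GPU throughput.
    const mySku: Sku = { ...exampleSku, tokensPerSecPerGpu: 550 };
    console.log(monthlyCost(70, 2, 5000, 0.6, mySku)); // 46720: fewer copies, lower bill
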
How current are the prices?
The price catalog is versioned; its version date is shown at the bottom of this page.