Engineering estimate · runs locally

LLM Inference Cost Estimator

Translate model size, precision, traffic, and a provider SKU into a defensible monthly cost. Compare providers side-by-side and see where utilization changes the bill.

This is an engineering estimate, not a billing quote. Real cost depends on batch size, request mix, KV-cache reuse, networking, and your inference engine. Use the formulas drawer to see the assumptions and override them.

[Interactive calculator UI: Workload, Advanced, and Examples input panels; Cost estimate, Cost rollups, Provider scenarios, and Sensitivity to utilization result panels; and a "Show formulas and assumptions" drawer.]

How it works

The estimator computes the memory needed to host the model weights at the chosen precision, derives how many GPUs of the chosen SKU it takes for the model to fit, and multiplies that GPU count by the number of copies needed to serve the requested throughput at the configured utilization. Cost = hourly rate × instance count × hours per month.
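For concreteness, here is a minimal TypeScript sketch of that computation. All field names, the 0.9 usable-memory headroom factor, and the 730 hours per month are illustrative assumptions, not the tool's actual internals; it also models weights only and ignores KV-cache memory:

    interface Sku {
      gpuMemGiB: number;          // memory per GPU
      gpusPerInstance: number;    // GPUs bundled into one billable instance
      hourlyUsd: number;          // price per instance-hour
      tokensPerSecPerGpu: number; // peak per-GPU throughput for this model
    }

    function monthlyCost(
      paramsBillions: number,
      bytesPerParam: number,      // 2 for fp16/bf16, 1 for int8, 0.5 for int4
      targetTokensPerSec: number,
      utilization: number,        // 0..1: fraction of peak throughput actually achieved
      sku: Sku,
      hoursPerMonth = 730,
    ): number {
      // 1) Memory needed to host the weights at the chosen precision.
      const weightsGiB = (paramsBillions * 1e9 * bytesPerParam) / 1024 ** 3;

      // 2) GPUs of this SKU it takes to fit, leaving ~10% headroom per GPU (assumed).
      const gpusToFit = Math.ceil(weightsGiB / (sku.gpuMemGiB * 0.9));

      // 3) Copies needed to serve the requested throughput at the given utilization.
      const copyTokensPerSec = gpusToFit * sku.tokensPerSecPerGpu * utilization;
      const copies = Math.ceil(targetTokensPerSec / copyTokensPerSec);

      // 4) Cost = hourly rate × instance count × hours per month.
      const instances = Math.ceil((gpusToFit * copies) / sku.gpusPerInstance);
      return sku.hourlyUsd * instances * hoursPerMonth;
    }

    // Example: a 70B model at fp16 (2 bytes/param) on a hypothetical 8×80 GiB SKU.
    const exampleSku: Sku = {
      gpuMemGiB: 80,
      gpusPerInstance: 8,
      hourlyUsd: 32,
      tokensPerSecPerGpu: 400,
    };
    console.log(monthlyCost(70, 2, 5000, 0.6, exampleSku)); // 70080: 3 instances × $32/h × 730 h

Packing copies into instances with a single ceiling is itself a simplification: here the 11 two-GPU copies are billed as 3 eight-GPU instances, which assumes copies can share an instance.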

Assumptions you can override

Edge cases and notes

FAQ

Why is my real bill different?
Real cost depends on batch size, prompt length distribution, KV-cache reuse, fragmentation, network egress, and your inference engine. This tool assumes a steady throughput pattern with one configurable utilization factor.
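For intuition, a hypothetical worked example: at a peak of 1,000 tokens/s per copy, setting utilization to 0.5 makes the estimator size the fleet as if each copy delivers 500 tokens/s, so a 2,000 tokens/s target needs 4 copies instead of 2 and the estimated bill doubles.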
How do I model my own GPU?
Pick the closest provider SKU and use the throughput override in the advanced section. The throughput field is the most important lever after model size.
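In the sketch under "How it works" (same assumed names), the override amounts to replacing the catalog throughput with a number measured on your hardware:

    // Hypothetical: reuse exampleSku but with your measured per-GPU throughput.
    const mySku: Sku = { ...exampleSku, tokensPerSecPerGpu: 550 };
    console.log(monthlyCost(70, 2, 5000, 0.6, mySku)); // 46720: fewer copies, lower bill
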
How current are the prices?
The price catalog is versioned; its version date is shown at the bottom of this page.