Static math · runs locally

CUDA Occupancy Calculator

Estimate theoretical occupancy from block size, registers per thread, and shared memory per block. The calculator surfaces which resource binds first and how nearby block sizes compare.

Kernel and architecture

Architecture

Or pick a GPU

Threads per block (int)

Registers per thread (int)

Static shared mem (bytes)

Dynamic shared mem (bytes)

Examples

Compare against a second architecture

Show comparison

Compare to

Occupancy —

Limited by: —

Per-resource utilization

Derived

Block-size sweep

How it works

For a kernel, occupancy is the ratio of active warps to the maximum number of warps that can be resident on a single SM. Each SM has caps on threads, warps, blocks, registers, and shared memory. The number of resident blocks per SM is the minimum of what each cap allows, and the active warp count is that number times the warps per block.

The math here follows the NVIDIA CUDA occupancy spreadsheet. Registers are allocated per warp, with each warp's register count rounded up to the architecture's allocation unit. Shared memory per block (static plus dynamic) is rounded up to the shared-memory allocation unit. For each constraint, the candidate count of resident blocks is the SM cap divided by the per-block usage, rounded down. The smallest candidate wins, and its constraint is the limiter.
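The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the calculator's actual source: the names `SmLimits` and `occupancy` are hypothetical, the default limits are assumed values for compute capability 8.0, and per-block shared memory reserved by the runtime is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Per-SM caps. Defaults are ASSUMED values for compute capability 8.0."""
    max_threads: int = 2048
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 164 * 1024
    reg_alloc_unit: int = 256        # registers, allocated per warp
    warp_alloc_granularity: int = 4  # warps
    smem_alloc_unit: int = 128       # bytes
    warp_size: int = 32


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SmLimits()):
    """Return (occupancy, resident blocks per SM, limiting resource)."""
    # A partially filled warp still counts as a whole warp.
    warps_per_block = -(-threads_per_block // sm.warp_size)

    # Round per-warp registers and per-block shared memory to allocation units.
    regs_per_warp = round_up(regs_per_thread * sm.warp_size, sm.reg_alloc_unit)
    smem = round_up(smem_per_block, sm.smem_alloc_unit)

    # Candidate resident-block count for each constraint (cap // usage).
    candidates = {
        "threads": sm.max_threads // threads_per_block,
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
    }
    if regs_per_warp > 0:
        warps_by_regs = sm.registers // regs_per_warp
        warps_by_regs -= warps_by_regs % sm.warp_alloc_granularity
        candidates["registers"] = warps_by_regs // warps_per_block
    if smem > 0:
        candidates["shared memory"] = sm.shared_mem_bytes // smem

    # The smallest candidate wins; its constraint is the limiter.
    limiter = min(candidates, key=candidates.get)
    blocks = candidates[limiter]
    return blocks * warps_per_block / sm.max_warps, blocks, limiter
```

Under these assumed limits, `occupancy(256, 64, 16 * 1024)` returns `(0.5, 4, "registers")`: each warp needs 2048 registers, so only 32 warps (4 blocks of 8 warps) fit in the 65536-register file, while threads, warps, and shared memory would each allow 8 or more blocks.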

What "high occupancy" actually means

High theoretical occupancy is a reasonable target, but it is not a goal in itself. Modern GPUs can hide latency at much lower occupancy if the kernel has enough instruction-level parallelism. Use this tool to find which resource is binding, then decide whether reducing it is worth the engineering effort.

Edge cases and notes

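Compute capability 8.6 caps threads/SM at 1536 and warps/SM at 48, three-quarters of the A100's 2048 and 64. Occupancy is measured against each architecture's own cap, so 100% on 8.6 means 48 resident warps rather than 64, and equal percentages do not imply equal absolute throughput.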
Threads per block should be a multiple of 32 to fill every warp. Non-multiples are flagged, because the last, partially filled warp still occupies a full warp slot.
If your kernel uses cooperative groups or persistent kernels, the launch shape may not be representative; this calculator assumes ordinary kernel launches.
The registers-per-thread value comes from the output of nvcc --ptxas-options=-v or cuobjdump, not from reading the source code. An estimate is plenty for exploration.
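To see why block sizes that are not a multiple of 32 get flagged, the following minimal sketch counts the warps a block occupies and the lanes left idle in its last warp. The helper name `warp_usage` is hypothetical and only illustrates the rounding.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def warp_usage(threads_per_block):
    """Return (warps occupied by one block, idle lanes in its last warp)."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    idle_lanes = warps * WARP_SIZE - threads_per_block
    return warps, idle_lanes
```

For example, `warp_usage(100)` returns `(4, 28)`: the block occupies four warp slots, the same as a 128-thread block, but 28 lanes of the last warp do no work.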

FAQ

Why does my actual occupancy look different in Nsight?

Nsight reports achieved occupancy, which factors in launch-bound mismatches, divergence, and runtime conditions. This calculator reports theoretical occupancy, an upper bound that ignores those effects.

Where do these architecture numbers come from?

They come from the CUDA Programming Guide's "Compute Capabilities" table and the NVIDIA Occupancy Calculator, and they are versioned in the source.

Why is my occupancy 0?

You probably exceeded a per-block resource cap. The issue list above the bars explains which one. If shared memory per block is larger than the SM cap, no block can fit at all.
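A per-block check of this kind might look like the sketch below. The function name `block_issues` is hypothetical, and the default caps are assumed values for compute capability 8.0; substitute the limits for your architecture.

```python
def block_issues(threads_per_block, regs_per_thread, smem_per_block,
                 max_threads_per_block=1024,     # ASSUMED cap
                 max_regs_per_thread=255,        # ASSUMED cap
                 max_smem_per_block=163 * 1024): # ASSUMED cap, in bytes
    """List every per-block cap the launch exceeds; an empty list means the block can fit."""
    issues = []
    if threads_per_block > max_threads_per_block:
        issues.append(
            f"threads per block {threads_per_block} > {max_threads_per_block}")
    if regs_per_thread > max_regs_per_thread:
        issues.append(
            f"registers per thread {regs_per_thread} > {max_regs_per_thread}")
    if smem_per_block > max_smem_per_block:
        issues.append(
            f"shared memory per block {smem_per_block} B > {max_smem_per_block} B")
    return issues
```

Any non-empty result means zero blocks can be resident, so occupancy is 0 regardless of the other inputs.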