CUDA Occupancy Calculator
Estimate theoretical occupancy from block size, registers per thread, and shared memory per block. The calculator surfaces which resource binds first and how nearby block sizes compare.
Kernel and architecture
Compare against a second architecture
Occupancy results
Per-resource utilization
Derived
Block-size sweep
How it works
For a kernel, occupancy is the ratio of active warps to the maximum resident warps on a single SM. Each SM has caps on threads, warps, blocks, registers, and shared memory. The number of resident blocks per SM is the minimum of what each cap allows.
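The definition above can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function name `occupancy`, the dictionary `example_caps`, and every number in it are hypothetical values chosen for the example.

```python
def occupancy(blocks_allowed_by, threads_per_block, max_warps_per_sm, warp_size=32):
    """Return (occupancy, resident_blocks) for one SM.

    blocks_allowed_by maps each cap's name to the number of blocks
    that cap alone would permit (hypothetical inputs).
    """
    # A partially filled warp still takes a whole warp slot, so round up.
    warps_per_block = -(-threads_per_block // warp_size)
    # Resident blocks per SM: the tightest cap wins.
    resident_blocks = min(blocks_allowed_by.values())
    active_warps = resident_blocks * warps_per_block
    return active_warps / max_warps_per_sm, resident_blocks


# Hypothetical per-cap block counts for a 256-thread block on an SM
# that can hold 64 resident warps.
example_caps = {"threads": 8, "warps": 8, "blocks": 32,
                "registers": 6, "shared_memory": 13}
ratio, blocks = occupancy(example_caps, threads_per_block=256, max_warps_per_sm=64)
print(f"{blocks} resident blocks, occupancy {ratio:.0%}")
```

In this example the register cap allows the fewest blocks (6), so 48 of 64 warp slots are filled.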
The math here matches what the NVIDIA CUDA occupancy spreadsheet does. Registers are allocated per warp, rounded up to the architecture's allocation unit. Shared memory per block (static plus dynamic) is rounded up to the shared-memory allocation unit. The candidate count of resident blocks for each constraint is the cap divided by the per-block usage, rounded down to a whole number of blocks. The smallest of those candidates wins, and that constraint is the limiter.
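A simplified sketch of that procedure follows. The `SmLimits` class, the `resident_blocks` function, and the limits passed in the example (including the allocation units) are assumptions made for illustration and should be checked against the compute-capability tables for your GPU. Real calculators may apply further per-architecture details that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Per-SM caps; all field names are hypothetical."""
    max_threads: int
    max_warps: int
    max_blocks: int
    registers: int
    shared_bytes: int
    reg_alloc_unit: int    # registers, applied per warp
    smem_alloc_unit: int   # bytes
    warp_size: int = 32


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return (value + unit - 1) // unit * unit


def resident_blocks(sm, threads_per_block, regs_per_thread, smem_per_block):
    """Return (blocks, limiter, candidates) for one SM."""
    warps_per_block = -(-threads_per_block // sm.warp_size)
    # Registers: allocated per warp, rounded up to the allocation unit.
    regs_per_warp = round_up(regs_per_thread * sm.warp_size, sm.reg_alloc_unit)
    regs_per_block = regs_per_warp * warps_per_block
    # Shared memory: static plus dynamic, rounded up to the allocation unit.
    smem = round_up(smem_per_block, sm.smem_alloc_unit)
    # Each candidate is the cap divided by per-block usage, rounded down.
    candidates = {
        "threads": sm.max_threads // threads_per_block,
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": sm.registers // regs_per_block if regs_per_block else sm.max_blocks,
        "shared_memory": sm.shared_bytes // smem if smem else sm.max_blocks,
    }
    # The smallest candidate wins; on a tie the first listed cap is reported.
    limiter = min(candidates, key=candidates.get)
    return candidates[limiter], limiter, candidates


# Assumed limits, loosely modeled on an A100-class SM.
sm = SmLimits(max_threads=2048, max_warps=64, max_blocks=32,
              registers=65536, shared_bytes=164 * 1024,
              reg_alloc_unit=256, smem_alloc_unit=128)
blocks, limiter, candidates = resident_blocks(
    sm, threads_per_block=256, regs_per_thread=40, smem_per_block=12000)
print(blocks, limiter, candidates)
```

With these assumed inputs, registers allow 6 blocks, shared memory 13, and the thread and warp caps 8 each, so registers are the limiter.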
What "high occupancy" actually means
High theoretical occupancy is a useful diagnostic, but it is not always worth pursuing for its own sake. Modern GPUs can hide latency at much lower occupancy if the kernel has enough instruction-level parallelism. Use this tool to understand which resource is binding so you can decide whether reducing it is worth the engineering effort.
Edge cases and notes
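- Compute capability 8.6 caps threads/SM at 1536 and warps/SM at 48, three-quarters of the A100's 2048 threads and 64 warps (compute capability 8.0). 100% occupancy therefore means fewer resident warps on 8.6, so equal occupancy percentages on the two architectures do not imply equal throughput.
- Threads per block should be a multiple of 32. A partial warp still occupies a full warp slot, so its unused lanes are wasted. Non-multiples are flagged.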
- If your kernel uses cooperative groups or a persistent-kernel design, the launch shape may not be representative; this calculator assumes ordinary kernel launches.
- Live registers per thread comes from nvcc --ptxas-options=-v or cuobjdump, not from source code. An estimate is plenty for exploration.
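To make the first two notes concrete, here is a small sketch that considers only the warp and block caps and ignores registers and shared memory. The helper names are hypothetical, and the blocks-per-SM values (32 for compute capability 8.0, 16 for 8.6) are assumptions to verify against the CUDA programming guide.

```python
WARP_SIZE = 32


def warps_per_block(threads_per_block):
    """Warp slots a block occupies; a partial warp counts as a whole one."""
    return -(-threads_per_block // WARP_SIZE)


def idle_lanes(threads_per_block):
    """Thread lanes left unused in the block's last, partial warp."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


def warp_limited_occupancy(threads_per_block, max_warps_per_sm, max_blocks_per_sm):
    """Occupancy when only the warp and block caps are considered."""
    w = warps_per_block(threads_per_block)
    blocks = min(max_warps_per_sm // w, max_blocks_per_sm)
    return blocks * w / max_warps_per_sm


# Assumed (max warps/SM, max blocks/SM) per compute capability.
ARCHS = {"8.0": (64, 32), "8.6": (48, 16)}

for block_size in (64, 100, 512, 1024):
    row = ", ".join(
        f"cc {cc}: {warp_limited_occupancy(block_size, warps, blocks):.0%}"
        for cc, (warps, blocks) in ARCHS.items()
    )
    print(f"block {block_size:4d} (idle lanes {idle_lanes(block_size):2d}): {row}")
```

Under these assumptions a 1024-thread block fills every warp slot on 8.0 but only 32 of 48 on 8.6, because a second 32-warp block does not fit. A 100-thread block leaves 28 lanes of its fourth warp idle.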