Warp Efficiency in CUDA Kernels
CUDA programming requires understanding how warps execute code. Warp divergence can cripple performance, but knowing the mechanics lets us write efficient GPU code.
Warp Basics
A warp consists of 32 CUDA threads executing in lockstep. When threads within a warp take different branches of a conditional, the hardware serializes execution: each path runs in turn while the threads on the other path sit idle:
__global__ void kernelWithDivergence(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx % 2 == 0) {
        // Even threads do the heavy work
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        // Odd threads do light work, but the warp stalls until both branches finish
        data[idx] = simpleCalculation(data[idx]);
    }
}

Avoiding Divergence
Restructure to keep threads in sync:
__global__ void kernelOptimized(float* dataEven, float* dataOdd) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // No divergence - all threads do the same work
    dataEven[idx] = expensiveCalculation(dataEven[idx]);
    dataOdd[idx] = expensiveCalculation(dataOdd[idx]);
}
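When the per-element work genuinely differs, another common option is to branch at warp granularity, so all 32 threads of a warp take the same path and divergence occurs only between warps, never within one. A minimal sketch, reusing the helper functions from the examples above (the kernel name is illustrative, and note that this changes which elements receive which computation, so the data layout must allow it):

__global__ void kernelWarpUniform(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Branch on the warp index instead of the thread index: every thread
    // in a warp evaluates the condition identically, so no warp runs both paths.
    if ((idx / warpSize) % 2 == 0) {
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        data[idx] = simpleCalculation(data[idx]);
    }
}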
Memory Access Patterns

Coalesced memory access is critical: when the threads of a warp touch consecutive addresses, the hardware combines them into a few wide memory transactions; when they do not, each access may need its own transaction:
- Good: data[threadIdx.x] - sequential across the warp, fully coalesced
- Bad: data[threadIdx.x * stride] - strided, largely uncoalesced
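A minimal sketch contrasting the two patterns (kernel names and the stride parameter are illustrative):

__global__ void coalescedRead(const float* in, float* out) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read adjacent elements: a handful of transactions per warp
    out[idx] = in[idx];
}

__global__ void stridedRead(const float* in, float* out, int stride) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read elements 'stride' apart: up to 32 transactions per warp
    out[idx] = in[idx * stride];
}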
Shared Memory Best Practices
// Inside a kernel body, where idx is the global thread index
__shared__ float sharedData[256];
// Avoid bank conflicts: consecutive threads hit consecutive banks
sharedData[threadIdx.x] = global_data[idx]; // Good
__syncthreads();
// 2D indexing with padding
__shared__ float padded[16][33]; // Extra column to avoid conflicts
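The padding trick is easiest to see in a complete kernel. Below is a minimal sketch of a tiled matrix transpose (the kernel name, the TILE size of 32, and a launch with dim3 block(TILE, TILE) are assumptions for illustration): each tile is loaded with coalesced row accesses and written back transposed, and the extra shared-memory column keeps the column-wise reads free of bank conflicts.

#define TILE 32

__global__ void transposePadded(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column so column reads hit distinct banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load of one tile row
    __syncthreads();

    // Swap block coordinates for the transposed write
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free thanks to padding
}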
Measuring Efficiency

Use nvprof or the NVIDIA Visual Profiler (on newer GPU architectures, Nsight Compute replaces both):
nvprof --metrics gst_efficiency,achieved_occupancy ./kernel

Target metrics:
- Global store efficiency (gst_efficiency): >90%
- Achieved occupancy (achieved_occupancy): >80%
- Warp execution efficiency (warp_execution_efficiency): >95%
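To measure warp efficiency itself, the divergence-related counters can be queried the same way. This assumes the legacy nvprof toolchain; exact metric names vary across CUDA versions:

nvprof --metrics warp_execution_efficiency,branch_efficiency ./kernel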
With careful attention to divergence and memory patterns, you can achieve 2-3x speedups over naive implementations.