This presentation covers analysis-driven performance optimization of GPU kernels. It describes how to determine whether a kernel is limited by memory bandwidth, instruction throughput, or latency. The profiler helps identify the limiting factor by comparing the number of memory transactions with the number of instructions issued. Modifying the source to produce memory-only and math-only versions of a kernel shows how memory access and computation are balanced and how well latency is hidden. The goal is to optimize each kernel by addressing its most significant performance limiter first.
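The timing comparison between the full, memory-only, and math-only versions of a kernel can be sketched as a small decision rule. This is a minimal illustration and not the method from the slides verbatim: the function name `classify_limiter`, the 15% tolerance `tol`, and the label strings are all assumptions chosen for the example, and the three timings are taken as already measured in the same units.

```python
def classify_limiter(t_full, t_mem, t_math, tol=0.15):
    """Guess the dominant limiter from three measured kernel times.

    t_full: time of the unmodified kernel
    t_mem:  time of the memory-only variant (arithmetic removed)
    t_math: time of the math-only variant (global memory access removed)
    tol:    hypothetical tolerance for treating two times as "about equal"
    """
    if min(t_full, t_mem, t_math) <= 0:
        raise ValueError("timings must be positive")

    # A variant "explains" the full kernel if it takes nearly as long.
    mem_close = t_mem / t_full >= 1.0 - tol
    math_close = t_math / t_full >= 1.0 - tol

    if mem_close and math_close:
        return "balanced"   # both variants take about as long as the full kernel
    if mem_close:
        return "memory"     # memory traffic alone accounts for the run time
    if math_close:
        return "math"       # arithmetic alone accounts for the run time

    # Neither variant explains the full time. If the full kernel takes about
    # as long as both variants run back to back, memory access and arithmetic
    # are barely overlapping, which suggests latency is not being hidden.
    if t_full >= (1.0 - tol) * (t_mem + t_math):
        return "latency"
    return "mixed"          # partial overlap, no single clear limiter


if __name__ == "__main__":
    # Made-up timings in milliseconds, for illustration only.
    print(classify_limiter(10.0, 9.5, 3.0))
    print(classify_limiter(10.0, 6.0, 5.0))
```

The order of the checks matters: the latency test runs only after the single-limiter cases are ruled out, because a memory-bound kernel with very little arithmetic would otherwise look like a kernel with no overlap.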
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cliff Woolley, NVIDIA)