GPUs Go Brrr
Abstract
The article discusses optimizing the performance of AI models on GPUs, particularly the NVIDIA H100. It covers the hardware features that matter most for utilization, including asynchronous WGMMA matrix-multiply instructions, shared memory management and bank conflicts, the Tensor Memory Accelerator (TMA), and occupancy, and introduces ThunderKittens, an embedded CUDA DSL built around tile abstractions that makes these techniques practical.
Q&A
[01] What's in an H100?
1. What are the key hardware components of the NVIDIA H100 GPU?
- 80 GB of HBM3 memory with 3 TB/s of bandwidth
- 50 MB of L2 cache with 12 TB/s of bandwidth
- 132 streaming multiprocessors (SMs), each with:
- Up to 227 KB of shared memory within a 256 KB L1 cache
- A tensor memory accelerator (TMA) for asynchronous address generation and memory fetching
- 4 quadrants, each with a warp scheduler, 512 vector registers, a tensor core, and parallel math instructions
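These figures are easy to check on your own machine. The sketch below is a minimal, generic CUDA runtime query (nothing article-specific) that prints the corresponding device properties for device 0; compile with nvcc and run on the target GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:               %s\n", prop.name);
    printf("Global memory:        %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("L2 cache:             %.1f MB\n", prop.l2CacheSize / 1e6);
    printf("SM count:             %d\n", prop.multiProcessorCount);
    printf("Shared mem per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Shared mem per block (opt-in): %zu KB\n", prop.sharedMemPerBlockOptin / 1024);
    printf("Registers per SM:     %d (32-bit)\n", prop.regsPerMultiprocessor);
    return 0;
}
```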
[02] How to make the H100 "go brr"?
1. What is the key to maximizing utilization of the H100 GPU? Keeping the tensor cores fed, since they supply roughly 94% of the GPU's peak performance. Doing so means working around several hardware quirks:
- Using the new "warp group matrix multiply accumulate" (WGMMA) instructions, which allow asynchronous matrix multiplies from shared memory
- Carefully managing shared memory layouts to avoid bank conflicts (see the sketch after this list)
- Leveraging the Tensor Memory Accelerator (TMA) to offload address generation
- Maintaining high occupancy to hide latencies and inefficiencies
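Of these quirks, bank conflicts are the simplest to show in isolation. Shared memory is divided into 32 four-byte banks, and when the threads of a warp hit different addresses in the same bank, the accesses serialize. A classic mitigation is to pad each row of a shared tile by one element so that column accesses stride across banks. The sketch below demonstrates this with a standard matrix-transpose kernel; it is generic CUDA, not one of the article's kernels, and the 32x32 tile size is an illustrative choice:

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Transpose one TILE x TILE block. Without the +1 padding, the column
// reads in the second phase would all land in the same shared-memory
// bank (stride of exactly 32 floats) and serialize 32 ways.
// Launch with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid.
__global__ void transpose(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 breaks the bank conflict

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Write the transposed block; block indices swap for the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

With the padding removed, the read phase degrades to a 32-way bank conflict; with it, reads and writes both stream across all 32 banks.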
2. How does the ThunderKittens DSL help address these challenges? ThunderKittens is an embedded CUDA DSL that provides abstractions like register tiles, shared memory tiles, and operations to manipulate them. This simplifies the implementation of complex kernels like Flash Attention, allowing the full capabilities of the H100 to be extracted in a concise codebase.
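For the actual ThunderKittens interface, the library's repository is the reference. As rough intuition for why the tile is the right unit, the sketch below expresses the same idea in plain CUDA: a matmul that stages fixed-size tiles of the inputs through shared memory and accumulates per thread. ThunderKittens replaces this hand-written staging with typed register and shared-memory tile objects backed by tensor-core instructions. The kernel assumes n is a multiple of TILE, and the 16x16 tile size is an illustrative choice:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Tiled matmul C = A * B for n x n matrices: each block stages one
// TILE x TILE tile of A and of B through shared memory per step --
// the "tile as unit of work" idea that ThunderKittens formalizes.
// Launch with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```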
[03] Philosophical Reflections
1. How has the understanding of GPU hardware influenced the authors' views on AI design? The authors believe that AI should be designed around what maps well to the hardware. This includes using register tiles as the fundamental unit of computation, rather than individual words or vectors. They argue that the hardware "wants" this level of granularity, and that AI should be reoriented to match the capabilities of modern GPUs.
2. What are the authors' plans for expanding ThunderKittens to other hardware? The authors mention that they are working on bringing ThunderKittens to AMD hardware as well, indicating that the principles and techniques developed for the H100 can be applied more broadly to optimize AI performance on different GPU architectures.