torch.compile, the missing manual
Abstract
The article discusses torch.compile, a PyTorch feature for optimizing model performance. It covers debugging strategies, performance tuning techniques, and profiling insights for working with torch.compile.
Q&A
[01] Setting expectations: the three regimes of enablement
1. What are the three regimes of enablement mentioned in the article?
The article discusses three regimes of enablement when using torch.compile:
- Compilation succeeds and the model runs faster
- Compilation succeeds but the model does not run faster
- Compilation fails
[02] Common debugging strategies
1. What are some common debugging strategies mentioned in the article for issues with torch.compile?
The article suggests the following debugging strategies:
- Run the model with TORCH_TRACE set and examine the output using tlparse
- Perform an ablation to isolate the issue (see the backend sketch after this list)
- Use the minifier/automatic reproducer generator
- Check for recent feature flag changes
- Bisect to find the root cause
- Create a standalone reproducer of the workflow
- Create a test case from the bug report
[03] The compiler crashes
1. What techniques does the article mention for dealing with compiler crashes?
The article suggests the following techniques when the compiler crashes:
- Print things out at compile time to debug the issue (see the logging sketch after this list)
- Investigate whether compile time is too long, the compiler is recompiling too often, or compilation is generally slow
[04] Compiled results are slow on first run but not subsequently
1. What does the article suggest investigating when the compiled results are slow on the first run but not subsequently?
The article suggests using the profiler to understand what kernel launches, Inductor-generated Triton kernels, non-Inductor-generated Triton kernels, and Inductor-generated CPU kernels look like.
[05] Working with the profiler on compiled code
1. What insights can be gained from the profiler when working with torch.compile?
The article suggests the following insights that can be gained from the profiler:
- What a kernel launch looks like
- What Inductor-generated Triton kernels, non-Inductor-generated Triton kernels, and Inductor-generated CPU kernels look like
- What Torch-Compiled Regions look like
- How autograd and the DDP Optimizer behave
- How to count the number of graph breaks (see the sketch after this list)
[06] How to make it faster
1. What strategies are mentioned in the article for improving performance when using torch.compile?
The article suggests the following strategies for improving performance:
- Reducing the number of graph breaks (see the sketch after this list)
- Tuning Inductor-generated kernels
- Using CUDA graphs to reduce launch overhead
- Reducing memory usage
- Addressing NCCL timeouts
- Debugging issues with stuck ranks
[07] It runs, but my outputs are garbage
1. What does the article suggest to investigate when the outputs are garbage after using torch.compile?
The article does not provide specific suggestions for debugging garbage outputs after using torch.compile; it focuses more on performance-related issues and debugging strategies.