
torch.compile, the missing manual

🌈 Abstract

The article discusses torch.compile, PyTorch's just-in-time compiler for speeding up PyTorch models. It covers what to expect when enabling it, common debugging strategies, performance-tuning techniques, and profiling insights for working with torch.compile.
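
For orientation, a minimal sketch of enabling torch.compile (the toy model and shapes are illustrative, not from the article):

```python
import torch

# A toy model; any nn.Module or plain function can be compiled.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# torch.compile returns an optimized wrapper around the model; the first
# call triggers compilation, later calls reuse the compiled code.
compiled_model = torch.compile(model)

x = torch.randn(32, 128)
out = compiled_model(x)  # slow first call (compilation), fast afterwards
```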

🙋 Q&A

[01] Setting expectations: the three regimes of enablement

1. What are the three regimes of enablement mentioned in the article? The article describes three regimes you can land in when enabling torch.compile (a quick timing check to tell the first two apart is sketched after this list):

  • Compilation succeeds and the model runs faster
  • Compilation succeeds but the model does not run faster
  • Compilation fails
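
A minimal before/after timing sketch to distinguish the first two regimes, assuming a side-effect-free model (the linear layer and iteration counts are illustrative):

```python
import time
import torch

def benchmark(fn, x, iters=50):
    # Warm up so compilation cost is excluded from the measurement.
    for _ in range(3):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)
eager_t = benchmark(model, x)
compiled_t = benchmark(torch.compile(model), x)
print(f"eager {eager_t * 1e3:.2f} ms vs compiled {compiled_t * 1e3:.2f} ms")
```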

[02] Common debugging strategies

1. What are some common debugging strategies mentioned in the article for torch.compile issues? The article suggests the following (the TORCH_TRACE/tlparse workflow from the first bullet is sketched after this list):

  • Run the model with TORCH_TRACE set and examine the output using tlparse
  • Perform an ablation to isolate the issue
  • Use the minifier/automatic reproducer generator
  • Check for recent feature flag changes
  • Bisect to find the root cause
  • Create a standalone reproducer of the workflow
  • Create a test case from the bug report
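
A minimal sketch of the TORCH_TRACE + tlparse workflow (the trace directory path and toy model are illustrative):

```python
# Equivalent shell workflow:
#   TORCH_TRACE=/tmp/trace_dir python my_script.py
#   pip install tlparse
#   tlparse /tmp/trace_dir   # renders an HTML report of the compilation events
import os
os.environ["TORCH_TRACE"] = "/tmp/trace_dir"  # set before importing torch

import torch

model = torch.compile(torch.nn.Linear(8, 8))
model(torch.randn(2, 8))  # compilation events are logged to /tmp/trace_dir
```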

[03] The compiler crashes

1. What are some techniques mentioned for dealing with the compiler crashing? The article suggests the following techniques when the compiler crashes:

  • Print things out at compile time to debug the issue
  • Investigate whether compile time is too long, the compiler is recompiling too often, or compilation is slow across the board (see the sketch after this list)
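
For the second bullet, a couple of built-in diagnostics help distinguish these cases. A minimal sketch, assuming a recent PyTorch 2.x (the varying-shape workload is illustrative):

```python
import torch
import torch._dynamo

# Log every recompilation together with the guard that triggered it.
torch._logging.set_logs(recompiles=True)

@torch.compile
def f(x):
    return x.sin() + 1

# Varying input shapes can trigger recompiles until dynamic shapes kick in.
for n in (4, 8, 16):
    f(torch.randn(n))

# Per-phase compile-time accounting accumulated so far.
print(torch._dynamo.utils.compile_times())
```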

[04] Compiled results are slow on first run but not subsequently

1. What does the article suggest to investigate when the compiled results are slow on the first run but not subsequently? The first run includes compilation (warmup), so the article suggests looking at the profiler to understand what kernel launches, Inductor-generated Triton kernels, non-Inductor Triton kernels, and Inductor-generated CPU kernels look like once the model is actually running.
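
A minimal sketch that makes the warmup effect visible (the model and iteration count are illustrative):

```python
import time
import torch

model = torch.compile(torch.nn.Linear(512, 512))
x = torch.randn(16, 512)

for i in range(3):
    start = time.perf_counter()
    model(x)
    print(f"iter {i}: {time.perf_counter() - start:.4f}s")
# Expect iter 0 to dominate: it includes Dynamo tracing and Inductor codegen.
```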

[05] Working with the profiler on compiled code

1. What insights can be gained from the profiler when working with torch.compile? The article highlights the following (a usage sketch follows the list):

  • What a kernel launch looks like
  • What Inductor-generated Triton kernels, non-Inductor generated Triton kernels, and Inductor-generated CPU kernels look like
  • What Torch-Compiled Regions look like
  • How autograd and DDP Optimizer behave
  • How to count the number of graph breaks
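
A minimal sketch of profiling a compiled function and counting graph breaks with torch._dynamo.explain (the function is illustrative; CPU-only profiling is assumed for simplicity):

```python
import torch
import torch._dynamo
from torch.profiler import profile, ProfilerActivity

def fn(x):
    return torch.relu(x @ x.T)

compiled = torch.compile(fn)
x = torch.randn(256, 256)
compiled(x)  # warm up so the trace shows steady-state behavior

with profile(activities=[ProfilerActivity.CPU]) as prof:
    compiled(x)
# Compiled code shows up under "Torch-Compiled Region" events in the trace.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Count graph breaks without the profiler.
explanation = torch._dynamo.explain(fn)(x)
print(explanation.graph_break_count, explanation.break_reasons)
```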

[06] How to make it faster

1. What strategies are mentioned in the article for improving performance when using torch.compile? The article suggests the following (common torch.compile knobs corresponding to several of these are sketched after the list):

  • Reducing the number of graph breaks
  • Tuning Inductor-generated kernels
  • Optimizing CUDA graphs
  • Reducing memory usage
  • Addressing NCCL timeouts
  • Debugging issues with stuck ranks
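
Several of these map to torch.compile options. A minimal sketch of the common knobs, assuming a CUDA-capable setup for the CUDA-graphs mode (the model is illustrative):

```python
import torch

model = torch.nn.Linear(1024, 1024)

# Surface every graph break as a hard error so they can be fixed one by one.
strict = torch.compile(model, fullgraph=True)

# Spend extra compile time autotuning Inductor-generated kernels.
tuned = torch.compile(model, mode="max-autotune")

# Use CUDA graphs to cut per-kernel launch overhead (applies on CUDA).
low_overhead = torch.compile(model, mode="reduce-overhead")
```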

[07] It runs, but my outputs are garbage

1. What does the article suggest to investigate when the outputs are garbage after using torch.compile? The article does not give specific suggestions for this case; it focuses on performance-related issues and debugging strategies.
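
Although the article does not cover this case, a common first sanity check (not from the article, and assuming a side-effect-free function) is to compare compiled outputs against eager mode:

```python
import torch

def check_accuracy(fn, *inputs):
    # Compare eager and compiled results on the same inputs.
    expected = fn(*inputs)
    actual = torch.compile(fn)(*inputs)
    torch.testing.assert_close(actual, expected)

check_accuracy(lambda x: torch.sin(x) * 2, torch.randn(8))
```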
