Our next generation Meta Training and Inference Accelerator
Abstract
The article discusses the next generation of the Meta Training and Inference Accelerator (MTIA), a custom AI inference accelerator designed to efficiently serve Meta's ranking and recommendation models. The key points covered include:
- The new MTIA chip, which more than doubles the compute and memory bandwidth of the previous version while maintaining close integration with Meta's workloads.
- The chip's architecture focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models.
- Improvements in the chip's processing elements, network-on-chip, and other technologies to scale MTIA to more challenging workloads.
- The co-design of the hardware system and software stack to support the next-generation MTIA silicon.
- The integration of the MTIA software stack with PyTorch 2.0 and the use of the Triton-MTIA compiler backend to generate high-performance code.
- Early results showing 3x improved performance over the first-generation MTIA chip and 6x model serving throughput at the platform level.
Q&A
[01] Next-generation MTIA Chip
1. What are the key improvements in the new MTIA chip compared to the previous version?
- The new MTIA chip more than doubles the compute and memory bandwidth of the previous solution.
- It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users.
- The chip's architecture focuses on providing the right balance of compute, memory bandwidth, and memory capacity for these workloads.
- It features an 8x8 grid of processing elements (PEs) with significantly increased dense and sparse compute performance.
- The chip also has an improved network-on-chip (NoC) architecture, tripled local PE storage, doubled on-chip SRAM capacity with roughly 3.5x its bandwidth, and doubled off-chip LPDDR5 capacity.
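These capacity and bandwidth claims can be sanity-checked in a few lines. A minimal sketch in Python, assuming the first-generation figures from Meta's published comparison table (the next-generation values appear in the spec list under question 2 below):

```python
# Sanity check of the "tripled / doubled" claims above, assuming the
# first-generation figures from Meta's published comparison table.
SPECS = {
    #   metric                    first-gen, next-gen
    "local PE storage (KB)":        (128,    384),   # 3x
    "on-chip SRAM (MB)":            (128,    256),   # 2x
    "on-chip SRAM BW (GB/s)":       (800,    2700),  # ~3.4x
    "off-chip LPDDR5 (GB)":         (64,     128),   # 2x
}

for name, (v1, v2) in SPECS.items():
    print(f"{name}: {v1} -> {v2} ({v2 / v1:.1f}x)")

# Aggregate PE-local storage across the 8x8 grid: 64 * 384 KB = 24 MB.
print(f"total PE-local storage: {8 * 8 * 384 / 1024:.0f} MB")
```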
2. What are the key technical specifications of the new MTIA chip?
- Fabricated using TSMC 5nm process
- Frequency: 1.35 GHz
- Instances: 2.35B gates, 103M flops
- Area: 25.6mm x 16.4mm, 421mm2 die
- Package: 50mm x 40mm
- Voltage: 0.85V
- TDP: 90W
- Host Connection: 8x PCIe Gen5 (32 GB/s)
- GEMM TOPS: 708 TFLOPS/s (INT8) with sparsity, 354 TFLOPS/s (INT8); 354 TFLOPS/s (FP16/BF16) with sparsity, 177 TFLOPS/s (FP16/BF16)
- SIMD TOPS: 11.06 TFLOPS/s (INT8), 5.53 TFLOPS/s (FP16/BF16), 2.76 TFLOPS/s (FP32)
- Memory Capacity: 384 KB per PE, 256 MB on-chip, 128 GB off-chip LPDDR5
- Memory Bandwidth: 1 TB/s per PE, 2.7 TB/s on-chip, 204.8 GB/s off-chip LPDDR5
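For context, the step up from the first generation can be quantified from the two columns of Meta's comparison table. A minimal sketch, assuming the first-generation figures from that published table:

```python
# Next-gen vs. first-gen headline ratios, assuming the first-generation
# column of Meta's published MTIA comparison table for the older values.
RATIOS = {
    #   metric                        first-gen, next-gen
    "dense GEMM INT8 (TFLOPS/s)":       (102.4,  354.0),
    "dense GEMM FP16/BF16 (TFLOPS/s)":  (51.2,   177.0),
    "off-chip LPDDR5 BW (GB/s)":        (176.0,  204.8),
    "TDP (W)":                          (25.0,   90.0),
}

for name, (v1, v2) in RATIOS.items():
    print(f"{name}: {v1:g} -> {v2:g} ({v2 / v1:.2f}x)")

# Dense INT8 compute per watt: 102.4/25 ~= 4.1 vs. 354/90 ~= 3.9 TFLOPS/W,
# i.e. the ~3.5x raw compute gain comes at roughly flat chip-level
# compute-per-watt; the 1.5x perf/W gain cited later is platform-level.
```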
[02] Next-generation MTIA System
1. How has the hardware system been designed to support the new MTIA chip?
- The new system holds up to 72 accelerators in a rack-based design with three chassis, each containing 12 boards that house two accelerators each.
- The system design allows the chip to be clocked at 1.35 GHz (up from 800 MHz in the first generation) and to run at 90 watts (compared to 25 watts for the first-generation design).
- The increased density provides higher compute, memory bandwidth, and memory capacity to accommodate a broader range of model complexities and sizes.
- The fabric between the accelerators and between the host and accelerators has been upgraded to PCIe Gen5 to increase bandwidth and scalability.
- There is also an option to add an RDMA NIC to scale out beyond the rack.
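The rack-level figures compose directly from the per-chip numbers. A back-of-envelope sketch using the values quoted above (the aggregates are naive sums, not published specs, and ignore host CPUs, fabric, and power-delivery overheads):

```python
# Back-of-envelope rack math for the next-gen MTIA system, assuming the
# per-chip figures quoted in this article.
CHASSIS_PER_RACK = 3
BOARDS_PER_CHASSIS = 12
ACCELS_PER_BOARD = 2

accelerators = CHASSIS_PER_RACK * BOARDS_PER_CHASSIS * ACCELS_PER_BOARD
assert accelerators == 72  # matches the "up to 72 accelerators" figure

DENSE_INT8_TFLOPS = 354    # per chip
TDP_WATTS = 90             # per chip
LPDDR5_GB = 128            # per chip

print(f"accelerators per rack: {accelerators}")
print(f"aggregate dense INT8:  {accelerators * DENSE_INT8_TFLOPS / 1000:.1f} PFLOPS")
print(f"aggregate LPDDR5:      {accelerators * LPDDR5_GB / 1024:.1f} TB")
print(f"accelerator power:     {accelerators * TDP_WATTS / 1000:.2f} kW")
```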
2. How has the software stack been developed to support the new MTIA chip?
- The MTIA software stack is designed to fully integrate with PyTorch 2.0 and features like TorchDynamo and TorchInductor.
- The lower-level compiler for MTIA takes the outputs from the frontend and produces highly efficient and device-specific code.
- The MTIA Streaming interface abstraction provides the basic operations for managing device memory, running operators, and executing compiled graphs on the device.
- The Triton-MTIA compiler backend is used to generate high-performance code for the MTIA hardware, leveraging Triton's hardware-agnostic language and optimization capabilities.
- The Triton-MTIA integration has dramatically improved developer efficiency and expanded the support of PyTorch operators.
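Since Triton-MTIA consumes ordinary Triton kernels, the code it compiles looks like any other Triton program. Below is a minimal, standard Triton vector-add kernel for illustration only; nothing in it is MTIA-specific, and the MTIA backend's internals are not public:

```python
# A standard, hardware-agnostic Triton kernel of the kind a backend such
# as Triton-MTIA would compile for its target device. Plain Triton; no
# MTIA-specific APIs are used or implied here.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # 1D launch grid
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Because the kernel is written against Triton's hardware-agnostic block abstractions rather than device intrinsics, widening PyTorch operator coverage becomes the compiler backend's job, which is presumably what drives the developer-efficiency gains described above.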
[03] MTIA Performance and Deployment
1. What are the performance improvements seen with the new MTIA chip and system?
- Early results show the new MTIA chip has already delivered a 3x performance improvement over the first-generation chip across the four key models evaluated.
- At the platform level, with 2x the number of devices and a powerful 2-socket CPU, the team has achieved 6x model serving throughput and a 1.5x performance per watt improvement over the first-generation MTIA system.
- These gains have been achieved through optimizations to kernels, compiler, runtime, and the host serving stack.
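The chip-level and platform-level numbers are mutually consistent; a quick check (the decomposition is an inference, not a figure from the article):

```python
# The 6x platform throughput decomposes, roughly, into the published
# per-chip gain times the device-count increase.
chip_gain = 3.0      # per-chip model performance vs. first gen (published)
device_ratio = 2.0   # 2x the number of devices per platform (published)
platform_gain = 6.0  # platform-level serving throughput gain (published)

# 3x per chip * 2x devices = 6x, suggesting the kernel/compiler/runtime and
# host-stack optimizations mainly preserved scaling efficiency at 2x scale.
assert chip_gain * device_ratio == platform_gain
```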
2. How is MTIA being deployed and integrated with Meta's infrastructure?
- MTIA has been deployed in Meta's data centers and is now serving models in production.
- MTIA is proving to be highly complementary to commercially available GPUs in delivering the optimal mix of performance and efficiency on Meta-specific workloads.
- MTIA will be an important part of Meta's long-term roadmap to build and scale the most powerful and efficient infrastructure for its unique AI workloads.
- MTIA is designed to work in cooperation with Meta's existing infrastructure as well as with new, more advanced hardware (including next-generation GPUs) that may be leveraged in the future.