Our next generation Meta Training and Inference Accelerator

🌈 Abstract

The article discusses Meta's next-generation Meta Training and Inference Accelerator (MTIA), a custom AI inference accelerator designed to efficiently serve Meta's ranking and recommendation models. The key points covered include:

  • The new MTIA chip, which more than doubles the compute and memory bandwidth of the previous version while maintaining close integration with Meta's workloads.
  • The chip's architecture focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models.
  • Improvements in the chip's processing elements, network-on-chip, and other technologies to scale MTIA to more challenging workloads.
  • The co-design of the hardware system and software stack to support the next-generation MTIA silicon.
  • The integration of the MTIA software stack with PyTorch 2.0 and the use of the Triton-MTIA compiler backend to generate high-performance code.
  • Early results showing a 3x performance improvement over the first-generation MTIA chip and 6x model serving throughput at the platform level.

🙋 Q&A

[01] Next-generation MTIA Chip

1. What are the key improvements in the new MTIA chip compared to the previous version?

  • The new MTIA chip more than doubles the compute and memory bandwidth of the previous solution.
  • It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users.
  • The chip's architecture focuses on providing the right balance of compute, memory bandwidth, and memory capacity for these workloads.
  • It features an 8x8 grid of processing elements (PEs) with significantly increased dense and sparse compute performance.
  • The chip also has an improved network-on-chip (NoC) architecture, tripled local PE storage, doubled on-chip SRAM and SRAM bandwidth, and doubled LPDDR5 memory capacity (see the sketch after this list).
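
The memory-hierarchy multipliers above can be made concrete against the first-generation figures in the spec list below. A minimal Python sketch (illustrative arithmetic, not Meta code; the key names are mine):

```python
# Illustrative sketch: derive the next-gen memory hierarchy from the
# first-gen baseline (see the spec list below) and the multipliers above.

first_gen = {
    "pe_local_storage_kb": 128,  # local storage per processing element (PE)
    "on_chip_sram_mb": 128,      # shared on-chip SRAM
    "off_chip_lpddr5_gb": 64,    # off-chip LPDDR5 capacity
}

multipliers = {
    "pe_local_storage_kb": 3,  # "tripled local PE storage"
    "on_chip_sram_mb": 2,      # "doubled on-chip SRAM"
    "off_chip_lpddr5_gb": 2,   # "doubled the capacity of LPDDR5 memory"
}

next_gen = {key: value * multipliers[key] for key, value in first_gen.items()}
print(next_gen)
# {'pe_local_storage_kb': 384, 'on_chip_sram_mb': 256, 'off_chip_lpddr5_gb': 128}
```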

2. What are the key technical specifications of the MTIA chip?

Note: the figures below match the first-generation MTIA baseline (e.g., 800 MHz and 25 W); the next-generation chip runs at 1.35 GHz with a 90 W TDP, as described in the system section below, and doubles or triples the memory figures as noted above.

  • Process: TSMC 7nm
  • Frequency: 800 MHz
  • Instances: 1.12B gates, 65M flip-flops
  • Die area: 19.34 mm x 19.1 mm (373 mm²)
  • Package: 43 mm x 43 mm
  • Voltage: 0.67 V logic, 0.75 V memory
  • TDP: 25 W
  • Host connection: 8x PCIe Gen4 (16 GB/s)
  • GEMM compute: 102.4 TOPS (INT8), 51.2 TFLOPS (FP16/BF16)
  • SIMD compute: 3.2 TOPS (INT8), 1.6 TFLOPS (FP16/BF16), 0.8 TFLOPS (FP32)
  • Memory capacity: 128 KB per PE, 128 MB on-chip SRAM, 64 GB off-chip LPDDR5
  • Memory bandwidth: 400 GB/s per PE, 800 GB/s on-chip SRAM, 176 GB/s off-chip LPDDR5
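
These figures also make the compute-versus-bandwidth balance concrete. A back-of-the-envelope roofline check (my arithmetic, not from the article) shows the arithmetic intensity a GEMM needs before it stops being memory-bound:

```python
# Roofline-style arithmetic from the first-gen figures above (illustrative).

int8_ops_per_s = 102.4e12    # GEMM compute, INT8 operations per second
lpddr5_bytes_per_s = 176e9   # off-chip LPDDR5 bandwidth
sram_bytes_per_s = 800e9     # on-chip SRAM bandwidth

# Minimum arithmetic intensity (ops/byte) to keep the GEMM engines busy:
print(int8_ops_per_s / lpddr5_bytes_per_s)  # ~582 ops/byte from LPDDR5
print(int8_ops_per_s / sram_bytes_per_s)    # ~128 ops/byte from on-chip SRAM
```

Sparse embedding lookups in ranking and recommendation models fall far below these intensities, which helps explain the article's emphasis on balancing compute against memory bandwidth and capacity rather than maximizing raw TOPS.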

[02] Next-generation MTIA System

1. How has the hardware system been designed to support the new MTIA chip?

  • The new system holds up to 72 accelerators in a rack-based design with three chassis, each containing 12 boards that house two accelerators each (see the rack arithmetic sketched after this list).
  • The system design allows the chip to be clocked at 1.35 GHz (up from 800 MHz) and run at 90 watts (compared to 25 watts for the first-generation design).
  • The increased density provides higher compute, memory bandwidth, and memory capacity to accommodate a broader range of model complexities and sizes.
  • The fabric between the accelerators and between the host and accelerators has been upgraded to PCIe Gen5 to increase bandwidth and scalability.
  • There is also an option to add an RDMA NIC to scale out beyond the rack.
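
The rack-level totals follow directly from the figures above; a quick sanity check (illustrative arithmetic only, ignoring host CPUs, fabric, and cooling):

```python
# Rack-level arithmetic from the figures above (illustrative only).

chassis_per_rack = 3
boards_per_chassis = 12
accelerators_per_board = 2

accelerators_per_rack = chassis_per_rack * boards_per_chassis * accelerators_per_board
print(accelerators_per_rack)  # 72, matching the stated rack capacity

watts_per_accelerator = 90    # per-chip power in the new system
print(accelerators_per_rack * watts_per_accelerator / 1000)  # 6.48 kW of accelerator power
```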

2. How has the software stack been developed to support the new MTIA chip?

  • The MTIA software stack is designed to fully integrate with PyTorch 2.0 and features like TorchDynamo and TorchInductor.
  • The lower-level compiler for MTIA takes the outputs from the frontend and produces highly efficient and device-specific code.
  • The MTIA Streaming interface abstraction provides the basic operations for managing device memory, running operators, and executing compiled graphs.
  • The Triton-MTIA compiler backend generates high-performance code for the MTIA hardware, leveraging Triton's hardware-agnostic language and compiler optimizations (a generic Triton kernel example follows this list).
  • The Triton-MTIA integration has substantially improved developer efficiency and expanded PyTorch operator coverage.
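
For flavor, here is what Triton's hardware-agnostic kernel language looks like. This is a generic vector-add, not MTIA-specific code; the article's point is that the Triton-MTIA backend can compile kernels like this for the accelerator:

```python
# A generic Triton kernel (hardware-agnostic source). The Triton-MTIA
# backend described above compiles such kernels for MTIA; that backend
# is internal to Meta and not shown here.
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # one program per block
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this block
    mask = offsets < n_elements                  # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

In the PyTorch 2.0 flow, TorchDynamo captures the model graph and TorchInductor can emit Triton kernels automatically, which is how operator coverage grows without hand-writing device-specific code.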

[03] MTIA Performance and Deployment

1. What are the performance improvements seen with the new MTIA chip and system?

  • Early results show the new MTIA chip has already improved performance by 3x over the first-generation chip across four key models evaluated.
  • At the platform level, with 2x the number of devices and a powerful 2-socket CPU, the team has achieved 6x model serving throughput and a 1.5x performance-per-watt improvement over the first-generation MTIA system (a quick decomposition follows this list).
  • These gains have been achieved through optimizations to kernels, compiler, runtime, and the host serving stack.
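
The platform-level figure is consistent with the per-chip gain; a quick decomposition (my arithmetic; the article credits software optimizations for part of the gain):

```python
# Decomposing the reported platform speedup (illustrative arithmetic).

per_chip_speedup = 3.0       # 3x over the first-gen chip (stated above)
device_count_scaling = 2.0   # 2x the number of devices per platform (stated above)

print(per_chip_speedup * device_count_scaling)  # 6.0 -> the reported 6x throughput
```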

2. How is MTIA being deployed and integrated with Meta's infrastructure?

  • MTIA has been deployed in Meta's data centers and is now serving models in production.
  • MTIA is proving to be highly complementary to commercially available GPUs in delivering the optimal mix of performance and efficiency on Meta-specific workloads.
  • MTIA will be an important part of Meta's long-term roadmap to build and scale the most powerful and efficient infrastructure for its unique AI workloads.
  • MTIA is designed to work in cooperation with Meta's existing infrastructure as well as with new, more advanced hardware (including next-generation GPUs) that may be leveraged in the future.