Our next generation Meta Training and Inference Accelerator

🌈 Abstract

The article discusses Meta's next-generation Meta Training and Inference Accelerator (MTIA), a custom AI inference accelerator designed to efficiently serve Meta's ranking and recommendation models. The key points covered include:

  • The new MTIA chip, which more than doubles the compute and memory bandwidth of the previous version while maintaining close integration with Meta's workloads.
  • The chip's architecture focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models.
  • Improvements in the chip's processing elements, network-on-chip, and other technologies to scale MTIA to more challenging workloads.
  • The co-design of the hardware system and software stack to support the next-generation MTIA silicon.
  • The integration of the MTIA software stack with PyTorch 2.0 and the use of the Triton-MTIA compiler backend to generate high-performance code.
  • Early results showing a 3x performance improvement over the first-generation MTIA chip and 6x model serving throughput at the platform level.

🙋 Q&A

[01] Next-generation MTIA Chip

1. What are the key improvements in the new MTIA chip compared to the previous version?

  • The new MTIA chip more than doubles the compute and memory bandwidth of the previous solution.
  • It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users.
  • The chip's architecture focuses on providing the right balance of compute, memory bandwidth, and memory capacity for these workloads.
  • It features an 8x8 grid of processing elements (PEs) with significantly increased dense and sparse compute performance.
  • The chip also has an improved network-on-chip (NoC) architecture, tripled local PE storage, doubled on-chip SRAM and SRAM bandwidth, and doubled LPDDR5 memory capacity (see the sketch after this list).
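
The memory-hierarchy multipliers above can be made concrete against the first-generation figures in the spec list below. A minimal Python sketch (illustrative arithmetic, not Meta code; the key names are mine):

```python
# Illustrative sketch: derive the next-gen memory hierarchy from the
# first-gen baseline (see the spec list below) and the multipliers above.

first_gen = {
    "pe_local_storage_kb": 128,  # local storage per processing element (PE)
    "on_chip_sram_mb": 128,      # shared on-chip SRAM
    "off_chip_lpddr5_gb": 64,    # off-chip LPDDR5 capacity
}

multipliers = {
    "pe_local_storage_kb": 3,  # "tripled local PE storage"
    "on_chip_sram_mb": 2,      # "doubled on-chip SRAM"
    "off_chip_lpddr5_gb": 2,   # "doubled the capacity of LPDDR5 memory"
}

next_gen = {key: value * multipliers[key] for key, value in first_gen.items()}
print(next_gen)
# {'pe_local_storage_kb': 384, 'on_chip_sram_mb': 256, 'off_chip_lpddr5_gb': 128}
```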

2. What are the key technical specifications of the MTIA chip?

Note: the figures below match the first-generation MTIA baseline (e.g., 800 MHz and 25 W); the next-generation chip runs at 1.35 GHz with a 90 W TDP, as described in the system section below, and doubles or triples the memory figures as noted above.

  • Process: TSMC 7nm
  • Frequency: 800 MHz
  • Instances: 1.12B gates, 65M flip-flops
  • Die area: 19.34 mm x 19.1 mm (373 mm²)
  • Package: 43 mm x 43 mm
  • Voltage: 0.67 V logic, 0.75 V memory
  • TDP: 25 W
  • Host connection: 8x PCIe Gen4 (16 GB/s)
  • GEMM compute: 102.4 TOPS (INT8), 51.2 TFLOPS (FP16/BF16)
  • SIMD compute: 3.2 TOPS (INT8), 1.6 TFLOPS (FP16/BF16), 0.8 TFLOPS (FP32)
  • Memory capacity: 128 KB per PE, 128 MB on-chip SRAM, 64 GB off-chip LPDDR5
  • Memory bandwidth: 400 GB/s per PE, 800 GB/s on-chip SRAM, 176 GB/s off-chip LPDDR5
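
These figures also make the compute-versus-bandwidth balance concrete. A back-of-the-envelope roofline check (my arithmetic, not from the article) shows the arithmetic intensity a GEMM needs before it stops being memory-bound:

```python
# Roofline-style arithmetic from the first-gen figures above (illustrative).

int8_ops_per_s = 102.4e12    # GEMM compute, INT8 operations per second
lpddr5_bytes_per_s = 176e9   # off-chip LPDDR5 bandwidth
sram_bytes_per_s = 800e9     # on-chip SRAM bandwidth

# Minimum arithmetic intensity (ops/byte) to keep the GEMM engines busy:
print(int8_ops_per_s / lpddr5_bytes_per_s)  # ~582 ops/byte from LPDDR5
print(int8_ops_per_s / sram_bytes_per_s)    # ~128 ops/byte from on-chip SRAM
```

Sparse embedding lookups in ranking and recommendation models fall far below these intensities, which helps explain the article's emphasis on balancing compute against memory bandwidth and capacity rather than maximizing raw TOPS.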

[02] Next-generation MTIA System

1. How has the hardware system been designed to support the new MTIA chip?

  • The new system holds up to 72 accelerators in a rack-based design with three chassis, each containing 12 boards that house two accelerators each (see the rack arithmetic sketched after this list).
  • The system design allows the chip to be clocked at 1.35 GHz (up from 800 MHz) and run at 90 watts (compared to 25 watts for the first-generation design).
  • The increased density provides higher compute, memory bandwidth, and memory capacity to accommodate a broader range of model complexities and sizes.
  • The fabric between the accelerators and between the host and accelerators has been upgraded to PCIe Gen5 to increase bandwidth and scalability.
  • There is also an option to add an RDMA NIC to scale out beyond the rack.
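
The rack-level totals follow directly from the figures above; a quick sanity check (illustrative arithmetic only, ignoring host CPUs, fabric, and cooling):

```python
# Rack-level arithmetic from the figures above (illustrative only).

chassis_per_rack = 3
boards_per_chassis = 12
accelerators_per_board = 2

accelerators_per_rack = chassis_per_rack * boards_per_chassis * accelerators_per_board
print(accelerators_per_rack)  # 72, matching the stated rack capacity

watts_per_accelerator = 90    # per-chip power in the new system
print(accelerators_per_rack * watts_per_accelerator / 1000)  # 6.48 kW of accelerator power
```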

2. How has the software stack been developed to support the new MTIA chip?

  • The MTIA software stack is designed to fully integrate with PyTorch 2.0 and features like TorchDynamo and TorchInductor.
  • The lower-level compiler for MTIA takes the outputs from the frontend and produces highly efficient and device-specific code.
  • The MTIA Streaming interface abstraction provides the basic operations for managing device memory, running operators, and executing compiled graphs.
  • The Triton-MTIA compiler backend generates high-performance code for the MTIA hardware, leveraging Triton's hardware-agnostic language and compiler optimizations (a generic Triton kernel example follows this list).
  • The Triton-MTIA integration has substantially improved developer efficiency and expanded PyTorch operator coverage.
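
For flavor, here is what Triton's hardware-agnostic kernel language looks like. This is a generic vector-add, not MTIA-specific code; the article's point is that the Triton-MTIA backend can compile kernels like this for the accelerator:

```python
# A generic Triton kernel (hardware-agnostic source). The Triton-MTIA
# backend described above compiles such kernels for MTIA; that backend
# is internal to Meta and not shown here.
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # one program per block
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this block
    mask = offsets < n_elements                  # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

In the PyTorch 2.0 flow, TorchDynamo captures the model graph and TorchInductor can emit Triton kernels automatically, which is how operator coverage grows without hand-writing device-specific code.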

[03] MTIA Performance and Deployment

1. What are the performance improvements seen with the new MTIA chip and system?

  • Early results show the new MTIA chip has already improved performance by 3x over the first-generation chip across four key models evaluated.
  • At the platform level, with 2x the number of devices and a powerful 2-socket CPU, the team has achieved 6x model serving throughput and a 1.5x performance-per-watt improvement over the first-generation MTIA system (a quick decomposition follows this list).
  • These gains have been achieved through optimizations to kernels, compiler, runtime, and the host serving stack.
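
The platform-level figure is consistent with the per-chip gain; a quick decomposition (my arithmetic; the article credits software optimizations for part of the gain):

```python
# Decomposing the reported platform speedup (illustrative arithmetic).

per_chip_speedup = 3.0       # 3x over the first-gen chip (stated above)
device_count_scaling = 2.0   # 2x the number of devices per platform (stated above)

print(per_chip_speedup * device_count_scaling)  # 6.0 -> the reported 6x throughput
```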

2. How is MTIA being deployed and integrated with Meta's infrastructure?

  • MTIA has been deployed in Meta's data centers and is now serving models in production.
  • MTIA is proving to be highly complementary to commercially available GPUs in delivering the optimal mix of performance and efficiency on Meta-specific workloads.
  • MTIA will be an important part of Meta's long-term roadmap to build and scale the most powerful and efficient infrastructure for its unique AI workloads.
  • MTIA is designed to work in cooperation with Meta's existing infrastructure as well as with new, more advanced hardware (including next-generation GPUs) that may be leveraged in the future.