First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI
Abstract
The article discusses the latest developments in the AI inference chip market, with a focus on the recent MLPerf Inference v4.1 competition results. It highlights the performance and power efficiency of various chips from companies like Nvidia, AMD, Google, Untether AI, Cerebras, and Furiosa, and how they are challenging Nvidia's dominance in the AI inference space.
Q&A
[01] Nvidia's Blackwell Chip
1. What are the key features that contribute to Nvidia's Blackwell chip's success in the MLPerf Inference competition?
- The Blackwell chip can run language models using 4-bit floating-point precision, which significantly speeds up computation.
- The Blackwell chip has almost double the memory bandwidth (8 TB/s) of the previous H200 chip (4.8 TB/s).
- The Blackwell chip is designed to scale and network using Nvidia's NVLink interconnects, which provide up to 1.8 TB/s of total bandwidth.
2. How does Nvidia position the Blackwell chip for future AI inference workloads? Nvidia argues that with the increasing size of large language models, even inference will require multi-GPU platforms to keep up with demand, and the Blackwell chip is built for this eventuality. Nvidia describes the Blackwell as a "platform" for future AI inference needs.
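To make the figures in this section concrete, here is a minimal Python sketch of the arithmetic behind two of the listed features. The bandwidth numbers come from the summary; the 70-billion-parameter model size is a hypothetical chosen only for illustration, and the helper names are invented for this example. It counts weight storage only and ignores activations, KV cache, and any scaling metadata that a real 4-bit format would carry.

```python
# Memory bandwidth figures quoted in the summary, in TB/s.
BLACKWELL_BW_TBPS = 8.0
H200_BW_TBPS = 4.8

# Hypothetical model size, assumed for illustration (not from the article).
HYPOTHETICAL_PARAMS = 70e9


def bandwidth_ratio(new_tbps: float, old_tbps: float) -> float:
    """How many times more bandwidth the newer chip offers."""
    return new_tbps / old_tbps


def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    ratio = bandwidth_ratio(BLACKWELL_BW_TBPS, H200_BW_TBPS)
    print(f"Bandwidth ratio: {ratio:.2f}x")
    for bits in (16, 8, 4):
        size = weight_footprint_gb(HYPOTHETICAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

Under these assumptions the bandwidth ratio comes to about 1.67x, and each halving of the bits per weight halves the bytes that must be read from memory for one pass over the weights.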
[02] Untether AI's Approach
1. What is Untether AI's unique approach to building energy-efficient AI inference chips? Untether AI's chips are built with an "at-memory computing" approach, where small processors are placed directly adjacent to the memory elements. This greatly reduces the energy spent on moving data between memory and compute, which is a major source of power consumption in traditional chip designs.
2. How did Untether AI's speedAI240 Preview chip perform in the MLPerf Inference competition? In the edge-closed category, Untether AI's speedAI240 Preview chip outperformed Nvidia's L40S chip in both latency and throughput for the image recognition task. The Untether AI chip had 2.8x lower latency and 1.6x higher throughput compared to the Nvidia L40S.
3. How does the power efficiency of Untether AI's chip compare to Nvidia's? The nominal power draw of Untether AI's speedAI240 Preview chip is 150 watts, versus 350 watts for Nvidia's L40S, a nominal power reduction of about 2.3x (350/150).
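The power comparison can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the wattages are the nominal figures quoted above, not power measured during the benchmark, and the combined throughput-per-watt estimate is an illustrative derivation that the article itself does not report. The function names are invented for this example.

```python
# Nominal figures quoted in the summary (not measured benchmark power).
UNTETHER_SPEEDAI240_WATTS = 150.0
NVIDIA_L40S_WATTS = 350.0
# Throughput advantage of the speedAI240 Preview over the L40S on the
# edge-closed image recognition task, as quoted in the summary.
THROUGHPUT_ADVANTAGE = 1.6


def power_reduction(baseline_watts: float, candidate_watts: float) -> float:
    """Ratio of the baseline chip's power draw to the candidate's."""
    return baseline_watts / candidate_watts


def nominal_perf_per_watt_gain(throughput_gain: float, reduction: float) -> float:
    """Rough throughput-per-watt gain, assuming both chips run at nominal power."""
    return throughput_gain * reduction


if __name__ == "__main__":
    reduction = power_reduction(NVIDIA_L40S_WATTS, UNTETHER_SPEEDAI240_WATTS)
    gain = nominal_perf_per_watt_gain(THROUGHPUT_ADVANTAGE, reduction)
    print(f"Nominal power reduction: {reduction:.1f}x")
    print(f"Nominal throughput-per-watt gain (estimate): {gain:.1f}x")
```

The first value rounds to the 2.3x quoted above. The second depends entirely on the nominal wattages, so it is a rough indication rather than a benchmark result.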
[03] Other Chip Announcements
1. What are the key features of Furiosa's new RNGD chip? Furiosa's RNGD chip implements the basic mathematical function of AI inference, matrix multiplication, in a more efficient way by using a Tensor Contraction Processor (TCP) architecture. This allows the chip to better handle the varying batch sizes and tensor shapes encountered during inference.
2. How does Furiosa's RNGD chip compare to Nvidia's L40S chip in performance and power efficiency? Furiosa claims that its RNGD chip performed on par with Nvidia's edge-oriented L40S chip on the MLPerf LLM summarization benchmark while drawing only 185 watts, compared with the L40S's 320 watts.
3. What other new AI inference chip announcements were made at the IEEE Hot Chips conference?
- Cerebras unveiled an upgraded software stack to use its latest Cerebras System 3 (CS3) chip for inference, claiming it beats an Nvidia H100 by 7x and a Groq chip by 2x in LLM tokens generated per second.
- IBM announced its new Spyre chip, designed for enterprise generative AI workloads and due to be available in the first quarter of 2025.