First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI
Abstract
The article discusses the latest developments in the AI inference chip market, highlighting the competition between Nvidia and rivals such as AMD, Google, and the startups Untether AI and Furiosa. It focuses on the results of the MLPerf Inference v4.1 competition, which drew submissions from several of these companies, and analyzes the performance and power efficiency of the different chips.
Q&A
[01] MLPerf Inference v4.1 Competition
1. What were the key findings from the latest MLPerf Inference v4.1 competition?
- Nvidia's Blackwell chip outperformed all previous chip iterations by 2.5x on the LLM Q&A task, the only benchmark it was submitted to.
- Untether AI's speedAI240 Preview chip performed nearly on par with Nvidia's H200 on the image recognition task, while using significantly less power.
- In the edge-closed category, Untether AI's speedAI240 Preview chip beat Nvidia's L40S on both latency and throughput, with a 2.3x power reduction.
- Google and AMD also submitted chips (Trillium and Instinct, respectively), with mixed performance compared to Nvidia's offerings.
2. What was the new "Mixture of Experts" benchmark introduced in this round of the competition? The Mixture of Experts benchmark tests a growing trend in LLM deployment, where a language model is broken up into several smaller, independent language models, each fine-tuned for a particular task. Because only the relevant experts run for a given query, each query uses fewer resources, which lowers cost and raises throughput.
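The routing idea behind a Mixture of Experts can be sketched in a few lines. This is a minimal illustration, not the MLPerf benchmark model: the experts are hypothetical scalar functions standing in for sub-networks, the router scores are made-up numbers, and top-k softmax gating is assumed as the selection rule.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs by weight."""
    chosen = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]

# Hypothetical experts: toy functions standing in for fine-tuned sub-models.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]

# Made-up router scores for one query; only 2 of the 4 experts are evaluated.
output, used = moe_forward(3.0, experts, router_scores=[0.1, 2.0, 1.5, -1.0], k=2)
print("experts used:", used, "blended output:", round(output, 3))
```

With four experts and k=2, half of the expert compute is skipped for this query, which is the source of the lower per-query cost described above.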
3. How did Nvidia's Blackwell chip achieve its performance advantage? The key factors contributing to Blackwell's success were:
- Its ability to run LLMs using 4-bit floating-point precision, which required significant software innovation to maintain accuracy.
- Its memory bandwidth, which is almost double that of the previous H200 chip.
- Its support for up to 18 NVLink connections at 100 gigabytes per second each, enabling high-bandwidth networking and scaling.
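To make the 4-bit floating-point point concrete, here is a rough sketch of quantizing a small block of weights to 4 bits. The article does not describe Nvidia's actual recipe; this sketch assumes the common E2M1 layout (one sign bit, two exponent bits, one mantissa bit) and a single shared scale per block, and the weight values are made up for illustration.

```python
# Magnitudes representable in a 4-bit E2M1 float (assumed layout).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(values):
    """Quantize a block of floats to FP4 values plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Scale the block so its largest magnitude lands on the largest FP4 value.
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if v < 0 else nearest)
    return codes, scale

def dequantize_block_fp4(codes, scale):
    """Recover approximate float values from FP4 codes and the block scale."""
    return [c * scale for c in codes]

# Hypothetical weights, chosen only to show the rounding error.
weights = [0.12, -0.48, 0.03, 0.30]
codes, scale = quantize_block_fp4(weights)
restored = dequantize_block_fp4(codes, scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

With only eight magnitudes available, small weights are rounded coarsely, which hints at why maintaining accuracy at this precision takes significant software work.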
[02] Power Efficiency Improvements
1. How did Untether AI achieve impressive power efficiency with its speedAI240 chip? Untether AI used an "at-memory computing" approach, in which the chip is built as a grid of memory elements with small processors placed directly adjacent to them. This greatly reduces the time and energy spent shuttling model data between memory and compute cores.
2. How did Untether AI's speedAI240 chip perform compared to Nvidia's offerings in the edge-closed category? In the edge-closed category, Untether AI's speedAI240 Preview chip beat Nvidia's L40S chip on latency (a 2.8x improvement) and throughput (a 1.6x improvement), while also drawing 2.3x less power.
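The throughput and power figures above can be combined into a rough efficiency estimate. This is back-of-the-envelope arithmetic, assuming the 1.6x throughput and 2.3x power figures were measured under comparable conditions (the summary does not confirm this); the function name is illustrative.

```python
def perf_per_watt_gain(throughput_ratio, power_reduction):
    """Relative throughput-per-watt gain.

    throughput_ratio: challenger throughput / baseline throughput
    power_reduction:  baseline power / challenger power
    """
    return throughput_ratio * power_reduction

# Figures quoted in the summary for speedAI240 Preview vs. L40S (edge-closed).
gain = perf_per_watt_gain(throughput_ratio=1.6, power_reduction=2.3)
print(f"Estimated throughput-per-watt gain: {gain:.2f}x")
```

Under that assumption, the two ratios multiply to roughly 3.7x more throughput per watt.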
[03] Other Chip Announcements
1. What did Cerebras and Furiosa announce at the IEEE Hot Chips conference?
- Cerebras unveiled an upgraded software stack that lets its latest computer, the CS3, be used for inference, claiming it beats an Nvidia H100 by 7x and a Groq chip by 2x in LLM tokens generated per second.
- Furiosa presented its second-generation chip, RNGD, which implements matrix multiplication (the basic operation in AI workloads) in a more efficient way by using a Tensor Contraction Processor (TCP) architecture.
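The relationship between matrix multiplication and tensor contraction can be shown in plain Python. This sketch only illustrates the mathematics (matrix multiplication is a contraction over one shared index); it says nothing about how Furiosa's hardware schedules or executes the operation, and the function names are illustrative.

```python
def matmul(a, b):
    """C[i][j] = sum over k of A[i][k] * B[k][j]: a contraction over index k."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

def batched_contraction(x, w):
    """Y[n][i][j] = sum over k of X[n][i][k] * W[k][j].

    The same contraction over k, with an extra batch index n carried along.
    """
    return [matmul(x_n, w) for x_n in x]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))

# A batch of two 2x2 inputs contracted against the same weight matrix.
batch = [a, [[0, 1], [1, 0]]]
print(batched_contraction(batch, b))
```

Seen this way, a matrix multiply is one case of a more general operation over multi-dimensional arrays, which is the abstraction a tensor contraction processor is named after.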
2. How did Furiosa's RNGD chip perform compared to Nvidia's L40S chip? Furiosa claimed that its RNGD chip performed on par with Nvidia's edge-oriented L40S chip on the MLPerf LLM summarization benchmark, while using only 185 watts of power compared to the L40S's 320 watts.