Introducing Cerebras Inference: AI at Instant Speed - Cerebras
๐ Abstract
The article discusses Cerebras' high-performance inference capabilities for large language models (LLMs) with billions to trillions of parameters. It highlights Cerebras' ability to split and map large models across multiple CS-3 systems, enabling the deployment of models like Llama3-405B and Mistral Large. The article emphasizes Cerebras' focus on preserving model accuracy by using the original 16-bit weights, in contrast to some companies that reduce precision to 8-bit. It also discusses the performance and cost advantages of Cerebras' inference solution, as well as its ability to enable more complex AI workflows and "thinking before speaking" techniques like scaffolding. The article concludes by positioning Cerebras Inference as a new standard for open LLM development and deployment.
๐ Q&A
[01] Cerebras Inference Capabilities
1. What is the key focus of Cerebras inference?
- Cerebras inference is designed to serve models from billions to trillions of parameters.
- When models exceed the memory capacity of a single wafer, Cerebras splits them at layer boundaries and maps them to multiple CS-3 systems.
- Cerebras can fit 20B models on a single CS-3 and 70B models on as few as four systems.
- Cerebras is adding support for larger models like Llama3-405B and Mistral Large.
2. How does Cerebras approach model accuracy compared to other companies?
- Some companies try to overcome memory bandwidth bottlenecks by reducing weight precision from 16-bit to 8-bit, often without informing users.
- Cerebras runs Llama3.1 8B and 70B models using the original 16-bit weights released by Meta, ensuring the most accurate and reliable model output.
- Evaluations and third-party benchmarks show that 16-bit models score up to 5% higher than their 8-bit counterparts, resulting in substantially better performance in multi-turn conversations, math, and reasoning tasks.
[02] Cerebras Inference API
1. What are the key features of the Cerebras inference API?
- Cerebras inference API is available today via chat and API access.
- It is built on the familiar OpenAI Chat Completions format, allowing developers to integrate Cerebras' powerful inference capabilities by simply swapping out the API key.
- Cerebras offers the best combination of performance, speed, accuracy, and cost.
- At 450 tokens per second, it's the only solution that runs Llama3.1-70B at instantaneous speed.
- Cerebras uses Meta's original 16-bit model weights, ensuring the highest accuracy.
- Cerebras is providing developers with 1 million free tokens daily for initial launch.
- For at-scale deployments, Cerebras' pricing is a fraction of popular GPU clouds.
[03] Enabling New AI Capabilities
1. How does Cerebras' high-speed inference enable new AI capabilities?
- By dramatically reducing processing time, Cerebras is enabling more complex AI workflows and enhancing real-time LLM intelligence.
- Traditional LLMs output everything they think immediately, without stopping to consider the best possible answer.
- New techniques like scaffolding function like a thoughtful agent who explores different possible solutions before deciding.
- This "thinking before speaking" approach provides over 10x performance on demanding tasks like code generation, fundamentally boosting the intelligence of AI models without additional training.
- However, these techniques require up to 100x more tokens at runtime, and are only possible in real-time running on Cerebras hardware.
[04] Cerebras Inference as a New Standard
1. How does Cerebras position its inference solution?
- With record-breaking performance, industry-leading pricing, and open API access, Cerebras Inference sets a new standard for open LLM development and deployment.
- As the only solution capable of delivering both high-speed training and inference, Cerebras opens entirely new capabilities for AI.
- The article concludes by expressing excitement to see the new and exciting applications developers will build with Cerebras Inference.