
ParaLLM: 1300+ tok/s on a MacBook - William Brown
Abstract
The article discusses the author's experience fine-tuning large language models (LLMs) on a MacBook using MLX, and the challenges faced in achieving parallel inference for evaluating outputs locally. The author developed a solution called `mlx_parallm` that extends the `generate` method in the `mlx_lm` library to enable batched key-value caching and multiple decoding channels, resulting in significant throughput gains for models like Gemma-2B, Phi-3-mini, and Llama3-8B.
Q&A
[01] LLM Finetuning Experiments on MacBook
1. What were the challenges the author faced in achieving parallel inference for evaluating outputs locally?
- For single-stream applications like chat interfaces, both `llama.cpp` and `MLXServer` run quite fast on Apple devices (see the baseline sketch after this list).
- However, when sampling a large number of outputs at once, whether for evaluating a training run or for "agent-flavored" applications, neither `llama.cpp` nor `MLXServer` offers a speedup in total throughput.
- On a CUDA machine, the author would use `vLLM`, a more "production-grade" solution for achieving high tok/s throughput with parallel requests, but it doesn't work on a Mac.
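For context, the single-stream baseline looks roughly like this with `mlx_lm`'s standard API (a minimal sketch; the model id is illustrative):

```python
# Single-stream generation with mlx_lm: one prompt, one completion.
# Decoding proceeds token by token for a single sequence, so extra
# requests only help if they can share the forward pass.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2b-it")  # illustrative model id
text = generate(model, tokenizer, prompt="Write a haiku about GPUs.", max_tokens=100)
print(text)
```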
2. How did the author address these challenges?
- The author developed a solution called `mlx_parallm` that extends the `generate` method in the `mlx_lm` library to enable batched key-value caching and multiple decoding channels (a usage sketch follows below).
- This allows substantial throughput gains, particularly as the number of parallel requests increases, for models like Gemma-2B, Phi-3-mini, and Llama3-8B.
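In use, the batched path can look something like the sketch below; the `mlx_parallm.utils` module layout and the `batch_generate` signature are assumptions about the repository's interface and may differ:

```python
# Batched generation sketch: N prompts decode in parallel against one
# copy of the weights, each with its own slot in a batched KV cache.
from mlx_parallm.utils import load, batch_generate  # assumed module layout

model, tokenizer = load("mlx-community/gemma-2b-it")  # illustrative model id
prompts = [f"Summarize sample {i} in one sentence." for i in range(32)]
responses = batch_generate(model, tokenizer, prompts=prompts, max_tokens=64)
```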
3. What are the current limitations of the `mlx_parallm` solution?
- Features like repetition penalties and streaming outputs are not yet supported.
- The author plans to submit a `batch_generate` PR for `mlx_lm` if the `mlx_parallm` solution can be made non-breaking.
- To add support for other models, users can copy the architecture file(s) from `mlx_lm/models` into `mlx_parallm/models` and replace any `KVCache` references with `BatchedKVCache`, as in the sketch after this list.
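To make the `KVCache` to `BatchedKVCache` swap concrete, here is a minimal sketch of a batch-aware cache, modeled on the chunked-growth cache pattern in `mlx_lm` but with a leading batch dimension; names and shapes are illustrative, not `mlx_parallm`'s exact code:

```python
import mlx.core as mx

class BatchedKVCache:
    """Sketch: a KV cache with a leading batch dimension, so that B
    decoding streams share one set of weights while keeping separate
    attention histories (illustrative, not mlx_parallm's exact code)."""

    def __init__(self, head_dim, n_kv_heads, batch_size, step=256):
        self.head_dim = head_dim
        self.n_kv_heads = n_kv_heads
        self.batch_size = batch_size
        self.step = step  # grow buffers in chunks rather than per token
        self.keys = None
        self.values = None
        self.offset = 0   # number of positions cached so far

    def update_and_fetch(self, keys, values):
        # keys/values: (batch_size, n_kv_heads, num_new_tokens, head_dim)
        prev = self.offset
        num_new = keys.shape[2]
        if self.keys is None or prev + num_new > self.keys.shape[2]:
            # Allocate (or extend) the buffers in multiples of `step`.
            n_steps = (self.step + num_new - 1) // self.step
            shape = (self.batch_size, self.n_kv_heads, n_steps * self.step, self.head_dim)
            new_k = mx.zeros(shape, keys.dtype)
            new_v = mx.zeros(shape, values.dtype)
            if self.keys is not None:
                self.keys = mx.concatenate([self.keys[..., :prev, :], new_k], axis=2)
                self.values = mx.concatenate([self.values[..., :prev, :], new_v], axis=2)
            else:
                self.keys, self.values = new_k, new_v
        self.offset += num_new
        # Write the new tokens' keys/values into their slots.
        self.keys[..., prev : self.offset, :] = keys
        self.values[..., prev : self.offset, :] = values
        # Return only the valid prefix for attention.
        return self.keys[..., : self.offset, :], self.values[..., : self.offset, :]
```

The structural change from a single-stream cache is just the extra `batch_size` dimension; batched attention broadcasts over it, which is why porting a model can be as simple as swapping the cache type.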
[02] Benchmarking Results
1. What kind of throughput improvements did the author observe with the `mlx_parallm` solution?
- For the "small" Gemma-2B model, `mlx_parallm` achieves 1300+ tokens/sec in total throughput on a 128GB M3 Max.
- The author also tested Phi-3-mini and Llama3-8B, both of which saw substantial throughput gains over single-stream generation, particularly as the number of parallel requests increased (a measurement sketch follows).
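Total throughput here means tokens generated across all streams divided by wall-clock time, which can be checked with a small timing harness; `measure_throughput` is a hypothetical helper built on the assumed `batch_generate` from above:

```python
import time

def measure_throughput(model, tokenizer, prompts, max_tokens=100):
    """Hypothetical helper: total tok/s = tokens generated across ALL
    parallel streams / wall-clock time (assumes batch_generate above)."""
    start = time.perf_counter()
    responses = batch_generate(model, tokenizer, prompts=prompts, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    n_tokens = sum(len(tokenizer.encode(r)) for r in responses)
    return n_tokens / elapsed
```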