
ParaLLM: 1300+ tok/s on a MacBook - William Brown

🌈 Abstract

The article describes the author's experience fine-tuning large language models (LLMs) on a MacBook with MLX, and the difficulty of running parallel inference locally to evaluate model outputs. To address this, the author built mlx_parallm, which extends the generate method in the mlx_lm library with batched key-value caching and multiple decoding channels, yielding significant throughput gains for models like Gemma-2B, Phi-3-mini, and Llama3-8B.

🙋 Q&A

[01] LLM Finetuning Experiments on MacBook

1. What challenges did the author face in achieving parallel inference for evaluating outputs locally?

  • For single-stream applications like chat interfaces, both llama.cpp and MLXServer run quite fast on Apple devices.
  • However, when sampling a large number of outputs at once, whether to evaluate a training run or for "agent-flavored" applications, neither llama.cpp nor MLXServer offers a speedup in total throughput.
  • On a CUDA machine, the author would use vLLM, which is a more "production-grade" solution for achieving high tok/s throughput with parallel requests, but it doesn't work on a Mac.

2. How did the author address these challenges?

  • The author developed mlx_parallm, which extends the generate method in the mlx_lm library to support batched key-value caching and multiple decoding channels.
  • This yields substantial throughput gains for models like Gemma-2B, Phi-3-mini, and Llama3-8B, particularly as the number of parallel requests increases (a minimal usage sketch follows below).
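
As a rough illustration, here is a minimal usage sketch of the batched-generation idea described above. It assumes an mlx_parallm API shaped like mlx_lm's, with a `load` function and a `batch_generate` helper; the exact function names, keyword arguments, and model identifier are assumptions, not details confirmed by the article.

```python
# Minimal sketch, assuming an mlx_parallm API mirroring mlx_lm
# (a load function plus a batch_generate helper); names and kwargs
# are illustrative assumptions, not confirmed by the article.
from mlx_parallm.utils import load, batch_generate

# Load one of the models mentioned in the article (identifier assumed).
model, tokenizer = load("google/gemma-1.1-2b-it")

# Many prompts are decoded concurrently against a batched KV cache,
# rather than looping over them one stream at a time.
prompts = [f"Summarize document {i} in one sentence." for i in range(32)]

responses = batch_generate(
    model,
    tokenizer,
    prompts=prompts,
    max_tokens=100,
    temp=0.0,
)
print(responses[0])
```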

3. What are the current limitations of the mlx_parallm solution?

  • Features like repetition penalties and streaming outputs are not yet supported.
  • The author plans to submit a batch_generate PR for mlx_lm if the mlx_parallm solution can be made non-breaking.
  • To add support for other models, users can copy the architecture file(s) from mlx_lm/models into mlx_parallm/models and replace any KVCache references with BatchedKVCache (a porting sketch follows below).
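
Since that porting step is mostly mechanical, a small script can perform the copy and rename. This is a hedged sketch only: the directory layout, file names, and the assumption that a plain textual swap of `KVCache` for `BatchedKVCache` is sufficient are all illustrative, and a real port may still need manual fixes to imports.

```python
# Hypothetical helper for porting a model architecture file from mlx_lm
# into mlx_parallm, following the steps described above.
# Paths, file layout, and the one-for-one class swap are assumptions.
import re
import shutil
from pathlib import Path


def port_model(arch: str, mlx_lm_models: Path, mlx_parallm_models: Path) -> Path:
    """Copy mlx_lm/models/<arch>.py into mlx_parallm/models and swap cache classes."""
    src = mlx_lm_models / f"{arch}.py"
    dst = mlx_parallm_models / f"{arch}.py"
    shutil.copy(src, dst)

    text = dst.read_text()
    # Replace standalone KVCache references with BatchedKVCache;
    # import paths inside the copied file may still need manual adjustment.
    text = re.sub(r"\bKVCache\b", "BatchedKVCache", text)
    dst.write_text(text)
    return dst


if __name__ == "__main__":
    port_model("phi3", Path("mlx_lm/models"), Path("mlx_parallm/models"))
```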

[02] Benchmarking Results

1. What kind of throughput improvements did the author observe with the mlx_parallm solution?

  • For the "small" Gemma-2B model, the mlx_parallm solution achieves 1300+ tokens/sec in total throughput on a 128GB M3 Max.
  • The author also tested the solution with Phi-3-mini and Llama3-8B, both of which saw substantial throughput gains over single-stream generation, particularly as the number of parallel requests increased (a rough measurement sketch follows below).
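
For context, total throughput here means generated tokens summed across all parallel streams, divided by wall-clock time. A rough way to measure it, reusing the assumed API from the earlier sketch, might look like the following; the model identifier and batch size are illustrative.

```python
# Rough throughput check, reusing the assumed mlx_parallm API from the
# earlier sketch; model identifier and batch size are illustrative.
import time
from mlx_parallm.utils import load, batch_generate

model, tokenizer = load("google/gemma-1.1-2b-it")
prompts = ["Write a haiku about the ocean."] * 64  # 64 parallel requests

start = time.perf_counter()
responses = batch_generate(model, tokenizer, prompts=prompts, max_tokens=100, temp=0.0)
elapsed = time.perf_counter() - start

# Total throughput: generated tokens across all streams per second of wall-clock time.
total_tokens = sum(len(tokenizer.encode(r)) for r in responses)
print(f"{total_tokens / elapsed:.0f} tok/s across {len(prompts)} parallel requests")
```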