
lytix.ai Blog

🌈 Abstract

The article discusses the cost of self-hosting the Llama-3 8B-Instruct language model, comparing it to the pricing of ChatGPT. It explores different hardware configurations and approaches to determine the most cost-effective way to run the model.

🙋 Q&A

[01] Cost of Self-Hosting Llama-3 8B-Instruct

1. What were the initial hardware configurations tested for running Llama-3 8B-Instruct?

  • The author first tried running the model on a single Nvidia Tesla T4 GPU (g4dn.2xlarge instance), but found that it was not powerful enough, as the 8B parameter version of Llama-3 took around 10 minutes to generate a response.
  • The author then switched to a more powerful instance, the g4dn.12xlarge, which has 4 Nvidia Tesla T4 GPUs, 192 GB of memory, and 48 vCPUs. This configuration generated responses in a more reasonable 5-7 seconds (a sketch of this setup appears after this list).
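For reference, this is roughly what the initial Hugging Face setup looks like. A minimal sketch, assuming the standard `transformers` loading path; the model ID is the real Hugging Face repo name, but the prompt and generation settings are illustrative, not the article's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # T4s are Turing cards: fp16 works, bf16 does not
    device_map="auto",          # shard the 8B weights across all visible GPUs
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # inputs go to the first GPU
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```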

2. How did the author initially calculate the cost of running Llama-3 8B-Instruct?

  • The author initially used the Hugging Face code together with llama-tokenizer-js to estimate token usage, which worked out to a cost of $167.17 per 1 million tokens (the arithmetic is sketched after this list).
  • This was more than 100 times the roughly $1 per 1 million tokens that ChatGPT charges for the same workload.
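The underlying arithmetic is simple: divide the instance's hourly price by the tokens it can generate per hour. In the sketch below both inputs are assumptions, the hourly rate approximating g4dn.12xlarge on-demand pricing and the throughput a hypothetical figure chosen to land near the article's number:

```python
hourly_rate_usd = 3.91     # assumed on-demand rate, roughly a g4dn.12xlarge
tokens_per_second = 6.5    # hypothetical throughput, not a measured figure

tokens_per_hour = tokens_per_second * 3600
cost_per_million = hourly_rate_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")  # ~$167 at these inputs
```

At hourly billing, throughput is the real cost driver: every doubling of tokens per second halves the cost per token.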

3. What was the author's realization about the initial cost calculation?

  • The author realized that the initial method of counting tokens was unreliable, and switched to the vLLM library to host an API server rather than hand-rolling inference with the Hugging Face libraries (see the sketch after this list).
  • The vLLM approach reports exact token usage, and the corrected figure came out to $17 per 1 million tokens, still well above ChatGPT's pricing.
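A minimal sketch of the vLLM approach, using its offline Python API; the prompt and sampling settings are illustrative, while `tensor_parallel_size=4` matches the four T4s:

```python
from vllm import LLM, SamplingParams

# Split the model across the four T4 GPUs
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is the capital of France?"], params)
for out in outputs:
    prompt_tokens = len(out.prompt_token_ids)
    completion_tokens = len(out.outputs[0].token_ids)
    # Exact per-request counts make the $/token math reliable,
    # unlike the earlier tokenizer-based estimate
    print(prompt_tokens, completion_tokens, out.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server, which matches the article's "API server" framing; exact token counts come back with each response either way.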

[02] Self-Hosting Hardware Approach

1. What was the author's unconventional approach to reduce the cost of self-hosting Llama-3 8B-Instruct?

  • Instead of using AWS instances, the author explored the option of self-hosting the hardware.
  • The author estimated the cost of buying 4 Nvidia Tesla T4 GPUs (around $700 each) and setting up the rest of the rig for around $1,000, resulting in a total fixed cost of approximately $3,800.
  • Factoring in an estimated monthly energy cost of $50, the author calculated the cost per 1 million tokens at less than $0.01, far below ChatGPT's pricing (the fixed-cost arithmetic is reconstructed after this list).
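The fixed-cost arithmetic, plus one derived step: the quoted monthly capacity implies a sustained throughput of about 60.6 tokens per second at 100% utilization. That inference is mine, not a figure the article states:

```python
# Fixed hardware cost from the article's estimates
gpu_cost = 4 * 700    # four Tesla T4s at ~$700 each
rig_cost = 1000       # chassis, PSU, CPU, memory, etc.
print(gpu_cost + rig_cost)  # $3,800 up front

# The quoted capacity of 157,075,200 tokens/month implies this throughput
seconds_per_month = 3600 * 24 * 30
print(157_075_200 / seconds_per_month)  # ~60.6 tokens/second, nonstop
```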

2. How long would it take to break even with the self-hosting approach?

  • Assuming the rig produces 157,075,200 tokens per month (the same volume used in the ChatGPT comparison), the break-even point works out to around 66 months, or about 5.5 years (a sketch of the calculation follows this list).
  • After the break-even point, the self-hosting approach would offer significant cost savings compared to using ChatGPT.
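The general shape of that break-even calculation, as a parameterized sketch; the sample inputs below are hypothetical placeholders chosen to land near the article's ~66-month figure, not its actual assumptions:

```python
def break_even_months(fixed_cost: float,
                      api_cost_per_month: float,
                      energy_cost_per_month: float) -> float:
    """Months until cumulative API spend catches up with hardware plus energy."""
    monthly_savings = api_cost_per_month - energy_cost_per_month
    return fixed_cost / monthly_savings

# Hypothetical inputs for illustration only
print(break_even_months(fixed_cost=3800,
                        api_cost_per_month=107.5,
                        energy_cost_per_month=50))  # ~66 months
```

Note that self-hosting only breaks even at all when the equivalent API bill exceeds the monthly energy cost; below that threshold, the upfront hardware spend is never recovered.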

3. What are the potential drawbacks of the self-hosting approach?

  • The self-hosting approach comes with the responsibility of managing and scaling the hardware, which may not be suitable for all use cases.
  • The article notes that the hypothetical calculations assume 100% utilization of the model, which may not be realistic in practice, and the cost-effectiveness would need to be evaluated based on the specific use case.