magic starSummarize by Aili

On the architecture of ollama

๐ŸŒˆ Abstract

The article provides an in-depth overview of the architecture and implementation details of the ollama project, which is a thin wrapper around the llama.cpp library for running large language models (LLMs) on various hardware platforms.

๐Ÿ™‹ Q&A

[01] Project Structure

1. What are the main directories in the ollama project and their purposes?

  • api: Client API library in Go
  • app: Desktop application (mainly a tray)
  • auth: Authentication
  • cmd: Commands and handlers
  • docs: Documentation
  • examples: Examples to use ollama
  • format: Utility to format units and time
  • gpu: GPU and acceleration detection
  • llm: Implementations to run llama.cpp
  • macapp: Desktop application for Mac
  • openai: OpenAI API wrapper for ollama
  • parser: Model information and message parser
  • progress: Utility to show loading progress
  • readline: Utility to read inputs from terminal
  • scripts: Scripts for build and publish
  • server: Server implementation in Go
  • version: Version information

[02] The hero behind: llama.cpp

1. What are the key features and supported backends of llama.cpp?

  • llama.cpp is an open-source library for inference of Meta's LLaMA model in pure C/C++
  • It supports various backends such as AVX, AVX2, AVX512 on x86, NEON on ARM, MPI, Apple Metal, OpenCL, NVIDIA cuBLAS, AMD hipBLAS, and Vulkan
  • This allows llama.cpp to run LLMs across multiple platforms, from desktop computers to smartphones

2. How does the build system of ollama work with llama.cpp?

  • llama.cpp uses CMake to handle compilation and linking, with various compile definitions to enable different backends
  • ollama uses the Go build system, which calls CMake to build the llama.cpp components and embeds the compiled libraries as payloads

[03] Piloting llama.cpp

1. How does ollama load and manage the llama.cpp instances?

  • The ext_server directory provides a wrapper implementation that exposes functions for ollama to call, such as llama_server_init, llama_server_completion, and llama_server_embedding
  • The dynamic libraries built from ext_server are embedded into the Go program using Go's embed package, and extracted during runtime
  • ollama provides some patches to the original llama.cpp to dynamically manage the llama.cpp instances

2. How are the requests and responses formatted between ollama and llama.cpp?

  • The requests and responses are passed in JSON format, with more structural information defined in ggml.go and llama.go
  • The C functions in ext_server.cpp act as a bridge between the Go and C/C++ code, handling the formatting of the requests and responses

[04] Decide where to run

1. How does ollama choose the hardware and dynamic libraries to use?

  • ollama extracts the embedded dynamic libraries to a temporary directory and tries to load them in a specific order
  • The order is determined by the "GPU information" obtained from the gpu.GetGPUInfo() function, which detects the available hardware and chooses the appropriate dynamic library variants (e.g., CPU, CUDA, ROCm)
  • The dynamic library paths are stored in the availableDynLibs map, and the getDynLibs function prioritizes the libraries based on the hardware information

[05] Web service and client

1. How does the web service in ollama work?

  • ollama provides a set of web API endpoints, implemented in the server package
  • The ChatHandler function is an example, which creates and parses the request, and then calls the Predict function of the LLM interface implementation (the dynExtServer)
  • The LLM interface implementation calls the dyn_llama_server_completion function to request the started llama server from one of the dynamic libraries

2. What other utilities and features does ollama provide?

  • ollama provides a Go API wrapper, as well as Python and JavaScript/TypeScript bindings
  • It also includes an OpenAI API-compatible endpoint, which converts between OpenAI and ollama's native requests and responses
  • Other utilities include readline for terminal input, progress for progress reporting, auth for API authentication, and various other supporting modules


Shared by Daniel Chen ยท
ยฉ 2024 NewMotor Inc.