On the architecture of ollama
Abstract
The article provides an in-depth overview of the architecture and implementation details of the ollama project, which is a thin wrapper around the llama.cpp library for running large language models (LLMs) on various hardware platforms.
Q&A
[01] Project Structure
1. What are the main directories in the ollama project and their purposes?
- api: Client API library in Go
- app: Desktop application (mainly a tray)
- auth: Authentication
- cmd: Commands and handlers
- docs: Documentation
- examples: Examples of using ollama
- format: Utility to format units and time
- gpu: GPU and acceleration detection
- llm: Implementations to run llama.cpp
- macapp: Desktop application for Mac
- openai: OpenAI API wrapper for ollama
- parser: Model information and message parser
- progress: Utility to show loading progress
- readline: Utility to read input from the terminal
- scripts: Scripts for build and publish
- server: Server implementation in Go
- version: Version information
[02] The hero behind: llama.cpp
1. What are the key features and supported backends of llama.cpp?
- llama.cpp is an open-source library for inference of Meta's LLaMA model in pure C/C++
- It supports various backends such as AVX, AVX2, AVX512 on x86, NEON on ARM, MPI, Apple Metal, OpenCL, NVIDIA cuBLAS, AMD hipBLAS, and Vulkan
- This allows llama.cpp to run LLMs across multiple platforms, from desktop computers to smartphones
2. How does the build system of ollama work with llama.cpp?
- llama.cpp uses CMake to handle compilation and linking, with various compile definitions to enable different backends
- ollama uses the Go build system, which calls CMake to build the llama.cpp components and embeds the compiled libraries into the binary as payloads (see the sketch below)
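A rough sketch of that arrangement is shown below; the package name, directory layout, and CMake flags are illustrative assumptions rather than ollama's exact sources, which drive CMake through per-platform scripts. The idea is that `go generate ./...` compiles the backend variants before `go build` embeds the results.

```go
// Package llm contains the build-time glue for compiling llama.cpp.
// Illustrative sketch only: running `go generate ./...` invokes CMake to
// compile llama.cpp with the desired backend definitions before the normal
// `go build` step embeds the resulting shared libraries.
package llm

// CPU variant with AVX2 enabled.
//go:generate cmake -S llama.cpp -B build/cpu -DLLAMA_AVX2=on
//go:generate cmake --build build/cpu --config Release

// CUDA variant using NVIDIA cuBLAS.
//go:generate cmake -S llama.cpp -B build/cuda -DLLAMA_CUBLAS=on
//go:generate cmake --build build/cuda --config Release
```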
[03] Piloting llama.cpp
1. How does ollama load and manage the llama.cpp instances?
- The ext_server directory provides a wrapper implementation that exposes functions for ollama to call, such as llama_server_init, llama_server_completion, and llama_server_embedding
- The dynamic libraries built from ext_server are embedded into the Go program using Go's embed package and extracted to disk at runtime (see the sketch below)
- ollama carries a few patches against the original llama.cpp so that it can dynamically manage the llama.cpp instances
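A minimal sketch of the embed-and-extract step is given below; the payload path and helper name are hypothetical, and the real code handles many more details (variants, permissions, cleanup).

```go
package llm

import (
	"embed"
	"os"
	"path"
	"path/filepath"
)

// The compiled llama.cpp payloads are baked into the binary at build time.
// "build/lib" is a hypothetical payload directory used for illustration.
//go:embed build/lib/*
var libEmbed embed.FS

// extractDynamicLibs copies the embedded shared libraries into a temporary
// directory so that they can be loaded (dlopen'ed) at runtime.
func extractDynamicLibs() (string, error) {
	tmpDir, err := os.MkdirTemp("", "ollama-libs")
	if err != nil {
		return "", err
	}
	entries, err := libEmbed.ReadDir("build/lib")
	if err != nil {
		return "", err
	}
	for _, entry := range entries {
		data, err := libEmbed.ReadFile(path.Join("build/lib", entry.Name()))
		if err != nil {
			return "", err
		}
		dst := filepath.Join(tmpDir, entry.Name())
		if err := os.WriteFile(dst, data, 0o755); err != nil {
			return "", err
		}
	}
	return tmpDir, nil
}
```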
2. How are the requests and responses formatted between ollama and llama.cpp?
- Requests and responses are passed in JSON format, with the structures involved defined in ggml.go and llama.go (a simplified sketch follows this list)
- The C functions in ext_server.cpp act as a bridge between the Go and C/C++ code, handling the formatting of requests and responses
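To make the JSON boundary concrete, here is a small self-contained sketch. The field names are rough approximations of what crosses the Go/C bridge, not the exact structures in ggml.go or llama.go, which are considerably larger.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// completionRequest approximates the JSON body handed to the llama.cpp
// server-style code; field names are illustrative.
type completionRequest struct {
	Prompt      string   `json:"prompt"`
	Temperature float32  `json:"temperature"`
	NPredict    int      `json:"n_predict"`
	Stop        []string `json:"stop,omitempty"`
}

// completionResponse sketches the per-token JSON payload streamed back
// across the C boundary.
type completionResponse struct {
	Content string `json:"content"`
	Stop    bool   `json:"stop"`
}

func main() {
	req, _ := json.Marshal(completionRequest{Prompt: "Why is the sky blue?", Temperature: 0.8, NPredict: 128})
	fmt.Println(string(req)) // this JSON string is what crosses the Go <-> C bridge

	var resp completionResponse
	_ = json.Unmarshal([]byte(`{"content":"Because of Rayleigh scattering.","stop":true}`), &resp)
	fmt.Println(resp.Content, resp.Stop)
}
```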
[04] Decide where to run
1. How does ollama choose the hardware and dynamic libraries to use?
- ollama extracts the embedded dynamic libraries to a temporary directory and tries to load them in a specific order
- The order is determined by the GPU information obtained from the gpu.GetGPUInfo() function, which detects the available hardware and chooses the appropriate dynamic library variants (e.g., CPU, CUDA, ROCm)
- The dynamic library paths are stored in the availableDynLibs map, and the getDynLibs function prioritizes the libraries based on the detected hardware (a simplified sketch follows this list)
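The prioritization can be pictured with the simplified sketch below; the struct fields, map keys, and function body are assumptions for illustration rather than ollama's actual gpu and llm code.

```go
package main

import "fmt"

// gpuInfo is a simplified stand-in for the value returned by
// gpu.GetGPUInfo(); only the Library field matters for this sketch.
type gpuInfo struct {
	Library string // e.g. "cpu", "cuda", or "rocm"
}

// getDynLibs orders the extracted library paths: the variant matching the
// detected hardware first, the remaining variants as fallbacks.
func getDynLibs(info gpuInfo, availableDynLibs map[string]string) []string {
	var libs []string
	if path, ok := availableDynLibs[info.Library]; ok {
		libs = append(libs, path) // preferred accelerated variant
	}
	for variant, path := range availableDynLibs {
		if variant != info.Library {
			libs = append(libs, path) // fallbacks, e.g. the CPU build
		}
	}
	return libs
}

func main() {
	available := map[string]string{
		"cpu":  "/tmp/ollama/cpu/libext_server.so",
		"cuda": "/tmp/ollama/cuda/libext_server.so",
	}
	fmt.Println(getDynLibs(gpuInfo{Library: "cuda"}, available))
}
```

The real code distinguishes more variants (for example different CPU instruction-set levels), but the idea is the same: prefer the accelerated build that matches the hardware and fall back to the CPU build.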
[05] Web service and client
1. How does the web service in ollama work?
- ollama provides a set of web API endpoints, implemented in the server package
- The ChatHandler function is one example: it parses the incoming request, creates the internal request, and then calls the Predict function of the LLM interface implementation (the dynExtServer)
- That LLM implementation in turn calls the dyn_llama_server_completion function to send the request to the llama server started inside one of the dynamic libraries (a condensed sketch of this flow follows this list)
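The condensed sketch below is written against the Gin framework, which the server package is built on, but the types, field names, prompt handling, and fakeLLM helper are illustrative assumptions rather than ollama's actual wiring.

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// PredictRequest and LLM are heavily simplified stand-ins for the types in
// ollama's llm package; only what this sketch needs is kept.
type PredictRequest struct {
	Prompt string
}

type LLM interface {
	// Predict ultimately reaches dyn_llama_server_completion in the selected
	// dynamic library and streams generated tokens back through fn.
	Predict(req PredictRequest, fn func(token string)) error
}

// ChatHandler sketches the request flow: parse the chat request, build a
// prompt, and hand it to the LLM implementation for completion.
func ChatHandler(llm LLM) gin.HandlerFunc {
	return func(c *gin.Context) {
		var body struct {
			Messages []struct {
				Role    string `json:"role"`
				Content string `json:"content"`
			} `json:"messages"`
		}
		if err := c.ShouldBindJSON(&body); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		prompt := ""
		for _, m := range body.Messages {
			prompt += m.Role + ": " + m.Content + "\n"
		}
		_ = llm.Predict(PredictRequest{Prompt: prompt}, func(token string) {
			c.Writer.WriteString(token) // stream tokens back to the client
			c.Writer.Flush()
		})
	}
}

// fakeLLM lets the sketch run without any model; it just emits one line.
type fakeLLM struct{}

func (fakeLLM) Predict(req PredictRequest, fn func(string)) error {
	fn("(model output would stream here)\n")
	return nil
}

func main() {
	r := gin.Default()
	r.POST("/api/chat", ChatHandler(fakeLLM{}))
	r.Run(":8080")
}
```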
2. What other utilities and features does ollama provide?
- ollama provides a Go API wrapper, as well as Python and JavaScript/TypeScript bindings
- It also includes an OpenAI API-compatible endpoint, which converts between OpenAI-style requests and responses and ollama's native ones (an example request is shown below)
- Other utilities include readline for terminal input, progress for progress reporting, auth for API authentication, and various other supporting modules
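For example, the OpenAI-compatible endpoint on a locally running server can be exercised with nothing but the standard library; the model name here is just an example and must already have been pulled.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Sends a chat request to ollama's OpenAI-compatible endpoint on the
// default local port and prints the OpenAI-style JSON response.
func main() {
	body := []byte(`{
		"model": "llama2",
		"messages": [{"role": "user", "content": "Why is the sky blue?"}]
	}`)

	resp, err := http.Post("http://localhost:11434/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // OpenAI-style chat.completion JSON
}
```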