On the architecture of ollama
Abstract
The article provides an in-depth overview of the architecture and implementation details of the ollama project, which is a thin wrapper around the llama.cpp library for running large language models (LLMs) on various hardware platforms.
Q&A
[01] Project Structure
1. What are the main directories in the ollama project and their purposes?
- api: Client API library in Go
- app: Desktop application (mainly a tray)
- auth: Authentication
- cmd: Commands and handlers
- docs: Documentation
- examples: Examples of using ollama
- format: Utility to format units and time
- gpu: GPU and acceleration detection
- llm: Implementations to run llama.cpp
- macapp: Desktop application for Mac
- openai: OpenAI API wrapper for ollama
- parser: Model information and message parser
- progress: Utility to show loading progress
- readline: Utility to read input from the terminal
- scripts: Scripts for build and publish
- server: Server implementation in Go
- version: Version information
[02] The hero behind: llama.cpp
1. What are the key features and supported backends of llama.cpp?
- llama.cpp is an open-source library for inference of Meta's LLaMA model in pure C/C++
- It supports various backends such as AVX, AVX2, AVX512 on x86, NEON on ARM, MPI, Apple Metal, OpenCL, NVIDIA cuBLAS, AMD hipBLAS, and Vulkan
- This allows llama.cpp to run LLMs across multiple platforms, from desktop computers to smartphones
2. How does the build system of ollama work with llama.cpp?
- llama.cpp uses CMake to handle compilation and linking, with various compile definitions to enable different backends
- ollama uses the Go build system, which calls CMake to build the llama.cpp components and embeds the compiled libraries into the binary as payloads (see the sketch below)
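A rough sketch of that arrangement is shown below; the package name, directory layout, and CMake flags are illustrative assumptions rather than ollama's exact sources, which drive CMake through per-platform scripts. The idea is that `go generate ./...` compiles the backend variants before `go build` embeds the results.

```go
// Package llm contains the build-time glue for compiling llama.cpp.
// Illustrative sketch only: running `go generate ./...` invokes CMake to
// compile llama.cpp with the desired backend definitions before the normal
// `go build` step embeds the resulting shared libraries.
package llm

// CPU variant with AVX2 enabled.
//go:generate cmake -S llama.cpp -B build/cpu -DLLAMA_AVX2=on
//go:generate cmake --build build/cpu --config Release

// CUDA variant using NVIDIA cuBLAS.
//go:generate cmake -S llama.cpp -B build/cuda -DLLAMA_CUBLAS=on
//go:generate cmake --build build/cuda --config Release
```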
[03] Piloting llama.cpp
1. How does ollama load and manage the llama.cpp instances?
- The ext_server directory provides a wrapper implementation that exposes functions for ollama to call, such as llama_server_init, llama_server_completion, and llama_server_embedding
- The dynamic libraries built from ext_server are embedded into the Go program using Go's embed package and extracted to disk at runtime (see the sketch below)
- ollama carries a few patches against the original llama.cpp so that it can dynamically manage the llama.cpp instances
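A minimal sketch of the embed-and-extract step is given below; the payload path and helper name are hypothetical, and the real code handles many more details (variants, permissions, cleanup).

```go
package llm

import (
	"embed"
	"os"
	"path"
	"path/filepath"
)

// The compiled llama.cpp payloads are baked into the binary at build time.
// "build/lib" is a hypothetical payload directory used for illustration.
//go:embed build/lib/*
var libEmbed embed.FS

// extractDynamicLibs copies the embedded shared libraries into a temporary
// directory so that they can be loaded (dlopen'ed) at runtime.
func extractDynamicLibs() (string, error) {
	tmpDir, err := os.MkdirTemp("", "ollama-libs")
	if err != nil {
		return "", err
	}
	entries, err := libEmbed.ReadDir("build/lib")
	if err != nil {
		return "", err
	}
	for _, entry := range entries {
		data, err := libEmbed.ReadFile(path.Join("build/lib", entry.Name()))
		if err != nil {
			return "", err
		}
		dst := filepath.Join(tmpDir, entry.Name())
		if err := os.WriteFile(dst, data, 0o755); err != nil {
			return "", err
		}
	}
	return tmpDir, nil
}
```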
2. How are the requests and responses formatted between ollama and llama.cpp?
- Requests and responses are passed in JSON format, with the structures involved defined in ggml.go and llama.go (a simplified sketch follows this list)
- The C functions in ext_server.cpp act as a bridge between the Go and C/C++ code, handling the formatting of requests and responses
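To make the JSON boundary concrete, here is a small self-contained sketch. The field names are rough approximations of what crosses the Go/C bridge, not the exact structures in ggml.go or llama.go, which are considerably larger.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// completionRequest approximates the JSON body handed to the llama.cpp
// server-style code; field names are illustrative.
type completionRequest struct {
	Prompt      string   `json:"prompt"`
	Temperature float32  `json:"temperature"`
	NPredict    int      `json:"n_predict"`
	Stop        []string `json:"stop,omitempty"`
}

// completionResponse sketches the per-token JSON payload streamed back
// across the C boundary.
type completionResponse struct {
	Content string `json:"content"`
	Stop    bool   `json:"stop"`
}

func main() {
	req, _ := json.Marshal(completionRequest{Prompt: "Why is the sky blue?", Temperature: 0.8, NPredict: 128})
	fmt.Println(string(req)) // this JSON string is what crosses the Go <-> C bridge

	var resp completionResponse
	_ = json.Unmarshal([]byte(`{"content":"Because of Rayleigh scattering.","stop":true}`), &resp)
	fmt.Println(resp.Content, resp.Stop)
}
```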
[04] Decide where to run
1. How does ollama choose the hardware and dynamic libraries to use?
- ollama extracts the embedded dynamic libraries to a temporary directory and tries to load them in a specific order
- The order is determined by the GPU information obtained from the gpu.GetGPUInfo() function, which detects the available hardware and chooses the appropriate dynamic library variants (e.g., CPU, CUDA, ROCm)
- The dynamic library paths are stored in the availableDynLibs map, and the getDynLibs function prioritizes the libraries based on the detected hardware (a simplified sketch follows this list)
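The prioritization can be pictured with the simplified sketch below; the struct fields, map keys, and function body are assumptions for illustration rather than ollama's actual gpu and llm code.

```go
package main

import "fmt"

// gpuInfo is a simplified stand-in for the value returned by
// gpu.GetGPUInfo(); only the Library field matters for this sketch.
type gpuInfo struct {
	Library string // e.g. "cpu", "cuda", or "rocm"
}

// getDynLibs orders the extracted library paths: the variant matching the
// detected hardware first, the remaining variants as fallbacks.
func getDynLibs(info gpuInfo, availableDynLibs map[string]string) []string {
	var libs []string
	if path, ok := availableDynLibs[info.Library]; ok {
		libs = append(libs, path) // preferred accelerated variant
	}
	for variant, path := range availableDynLibs {
		if variant != info.Library {
			libs = append(libs, path) // fallbacks, e.g. the CPU build
		}
	}
	return libs
}

func main() {
	available := map[string]string{
		"cpu":  "/tmp/ollama/cpu/libext_server.so",
		"cuda": "/tmp/ollama/cuda/libext_server.so",
	}
	fmt.Println(getDynLibs(gpuInfo{Library: "cuda"}, available))
}
```

The real code distinguishes more variants (for example different CPU instruction-set levels), but the idea is the same: prefer the accelerated build that matches the hardware and fall back to the CPU build.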
[05] Web service and client
1. How does the web service in ollama work?
- ollama provides a set of web API endpoints, implemented in the server package
- The ChatHandler function is one example: it parses the incoming request, creates the internal request, and then calls the Predict function of the LLM interface implementation (the dynExtServer)
- That LLM implementation in turn calls the dyn_llama_server_completion function to send the request to the llama server started inside one of the dynamic libraries (a condensed sketch of this flow follows this list)
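The condensed sketch below is written against the Gin framework, which the server package is built on, but the types, field names, prompt handling, and fakeLLM helper are illustrative assumptions rather than ollama's actual wiring.

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// PredictRequest and LLM are heavily simplified stand-ins for the types in
// ollama's llm package; only what this sketch needs is kept.
type PredictRequest struct {
	Prompt string
}

type LLM interface {
	// Predict ultimately reaches dyn_llama_server_completion in the selected
	// dynamic library and streams generated tokens back through fn.
	Predict(req PredictRequest, fn func(token string)) error
}

// ChatHandler sketches the request flow: parse the chat request, build a
// prompt, and hand it to the LLM implementation for completion.
func ChatHandler(llm LLM) gin.HandlerFunc {
	return func(c *gin.Context) {
		var body struct {
			Messages []struct {
				Role    string `json:"role"`
				Content string `json:"content"`
			} `json:"messages"`
		}
		if err := c.ShouldBindJSON(&body); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		prompt := ""
		for _, m := range body.Messages {
			prompt += m.Role + ": " + m.Content + "\n"
		}
		_ = llm.Predict(PredictRequest{Prompt: prompt}, func(token string) {
			c.Writer.WriteString(token) // stream tokens back to the client
			c.Writer.Flush()
		})
	}
}

// fakeLLM lets the sketch run without any model; it just emits one line.
type fakeLLM struct{}

func (fakeLLM) Predict(req PredictRequest, fn func(string)) error {
	fn("(model output would stream here)\n")
	return nil
}

func main() {
	r := gin.Default()
	r.POST("/api/chat", ChatHandler(fakeLLM{}))
	r.Run(":8080")
}
```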
2. What other utilities and features does ollama provide?
- ollama provides a Go API wrapper, as well as Python and JavaScript/TypeScript bindings
- It also includes an OpenAI API-compatible endpoint, which converts between OpenAI-style requests and responses and ollama's native ones (an example request is shown below)
- Other utilities include readline for terminal input, progress for progress reporting, auth for API authentication, and various other supporting modules
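For example, the OpenAI-compatible endpoint on a locally running server can be exercised with nothing but the standard library; the model name here is just an example and must already have been pulled.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Sends a chat request to ollama's OpenAI-compatible endpoint on the
// default local port and prints the OpenAI-style JSON response.
func main() {
	body := []byte(`{
		"model": "llama2",
		"messages": [{"role": "user", "content": "Why is the sky blue?"}]
	}`)

	resp, err := http.Post("http://localhost:11434/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // OpenAI-style chat.completion JSON
}
```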